OEDI Data Lakes

What is a data lake?

A data lake is a collection of curated and diverse datasets built to accelerate accessibility and collaboration. Data Lakes typically hold raw data files, scientific reports, supporting media, and links to online documents chosen by the contributing researchers. The lake enables sustained access to large data files, often through partnerships with a variety of cloud vendors.

Information flows into the Data Lake from a variety of sources: private industry, laboratories, analytic tools, use cases, research reports, and more. Next, the Data Lake's content is curated so that it is consistent and standardized. Once curated, the content flows outward to the public and becomes universally accessible. Public access opens the door to potential collaboration between publication authors, researchers, universities, high schools, startup companies, and other innovators.

The availability and visibility of the Data Lakes remove barriers to innovation. By providing a centralized location, the Data Lakes reduce duplication of effort and lower the cost of storing and analyzing large datasets. New insights and innovations flow outward from the lake, creating opportunities for even more rounds of research and development.

Universal Accessibility

Our open architecture is designed for universal access and dissemination of big data. Data Lakes can be accessed via our cloud partners. Jupyter notebooks are a common way to work with Data Lakes datasets, but many other options are available.

Multiple Ways to Access

- Jupyter notebooks (example)
- Google Earth Engine
- Direct access to Data Lakes (Requires a cloud account.)
- Data Lake Viewer (Currently only AWS)
- Native cloud command line tools (AWS, Google, Azure)
- Mounting the data as a local read-only drive in a cloud-built computer cluster (Requires the same availability zone.)

Information flows from cloud storage into a data lake, making it more accessible to researchers and analysts.

Typical Costs of Data Access

Data Lakes are accessible in a variety of ways based on users' needs and budgets. Jupyter notebooks can be accessed free of charge, and many operations can be run right in the cloud. Depending on the workload and mechanisms, costs can be under $1 USD with AWS Athena. For more resource-intensive operations, Google's BigQuery (On-demand or Flat-rate pricing) can process SQL queries on terabytes of data in a matter of seconds. SageMaker Studio Lab provides a fully managed machine learning service free of charge, with no credit card or AWS account needed. See example use cases.

Manual Download and AWS CLI

- Download datasets via the data-catalog viewer and process the data as you see fit. Alternatively, batch download via the AWS Command Line Interface (AWS CLI).
- Free for end users.
- Learn More.
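
As an illustration of the batch download option, here is a minimal Python sketch that assembles an `aws s3 cp` command for a recursive download. The bucket name `example-open-data-bucket`, the prefix `some-dataset/2020/`, and the helper `build_batch_download_command` are hypothetical placeholders, not names taken from this page. The sketch also assumes the bucket allows anonymous reads, which is what `--no-sign-request` relies on; if it does not, drop that flag and use configured AWS credentials.

```python
import shlex


def build_batch_download_command(bucket, prefix, dest_dir):
    """Return the argument list for a recursive, unsigned `aws s3 cp` download.

    bucket and prefix are placeholders; look up the real values for the
    dataset you want in the data-catalog viewer.
    """
    source = f"s3://{bucket}/{prefix.lstrip('/')}"
    return [
        "aws", "s3", "cp", source, dest_dir,
        "--recursive",        # copy every object under the prefix
        "--no-sign-request",  # assumption: the bucket permits anonymous reads
    ]


# Hypothetical bucket and prefix, used only for illustration.
cmd = build_batch_download_command(
    "example-open-data-bucket", "some-dataset/2020/", "./data"
)

# Print a shell-safe version of the command for copy and paste.
print(shlex.join(cmd))

# To actually run it (requires the AWS CLI to be installed):
# import subprocess
# subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids shell-quoting problems if a prefix contains spaces.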

HSDS and Jupyter Notebook

- Highly Scalable Data Service (HSDS) is a REST-based service for reading and writing complex binary data formats in object-based storage environments, such as the cloud.
- Free for end users.
- Learn More.

S3 Tools and APIs

- Access data with S3 Tools and APIs to analyze datasets.
- Free for end users.
- Learn More.
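
For readers who want to see what API-level access can look like, the sketch below uses only the Python standard library to build an S3 `ListObjectsV2` request URL and parse the XML it returns. The bucket name, prefix, and the helpers `build_list_url` and `parse_listing` are hypothetical, and the sample XML is a hand-written stand-in for a real response. It assumes the bucket is publicly listable; in practice an SDK such as boto3 may be more convenient.

```python
import urllib.parse
import xml.etree.ElementTree as ET

# XML namespace used by S3 REST API responses.
S3_NS = {"s3": "http://s3.amazonaws.com/doc/2006-03-01/"}


def build_list_url(bucket, prefix=""):
    """Build a ListObjectsV2 URL that lists one 'folder' level under prefix."""
    query = urllib.parse.urlencode(
        {"list-type": "2", "prefix": prefix, "delimiter": "/"}
    )
    return f"https://{bucket}.s3.amazonaws.com/?{query}"


def parse_listing(xml_text):
    """Return (folders, files) where files is a list of (key, size_in_bytes)."""
    root = ET.fromstring(xml_text)
    folders = [
        e.text for e in root.findall("s3:CommonPrefixes/s3:Prefix", S3_NS)
    ]
    files = [
        (
            c.findtext("s3:Key", namespaces=S3_NS),
            int(c.findtext("s3:Size", namespaces=S3_NS)),
        )
        for c in root.findall("s3:Contents", S3_NS)
    ]
    return folders, files


# Hypothetical bucket and prefix, used only for illustration.
url = build_list_url("example-open-data-bucket", "some-dataset/")

# Hand-written sample in the shape of a ListObjectsV2 response.
sample_xml = """
<ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
  <Name>example-open-data-bucket</Name>
  <Prefix>some-dataset/</Prefix>
  <CommonPrefixes><Prefix>some-dataset/2020/</Prefix></CommonPrefixes>
  <Contents><Key>some-dataset/README.md</Key><Size>1234</Size></Contents>
</ListBucketResult>
"""
folders, files = parse_listing(sample_xml)
print(url)
print(folders, files)

# Against a real, publicly listable bucket the response could be fetched with:
# import urllib.request
# folders, files = parse_listing(urllib.request.urlopen(url).read())
```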

AWS Athena

- Athena is an interactive query service offered by AWS that makes it easy to access data in Amazon S3 using SQL.
- AWS Athena can be used for a small fee, often less than $1.
- Learn More.
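
To make the "often less than $1" figure concrete, here is a rough cost estimate for a single query under per-byte-scanned pricing. The rate of $5 per terabyte scanned and the 10 MB per-query minimum are assumptions based on commonly published Athena pricing, not figures from this page; check the current AWS pricing page before relying on them. The function name `estimate_scan_cost` is hypothetical.

```python
def estimate_scan_cost(bytes_scanned, usd_per_tb=5.0, min_bytes=10 * 1024**2):
    """Estimate the cost in USD of one query billed by bytes scanned.

    usd_per_tb and min_bytes are assumed defaults, not values from this page.
    """
    billable_bytes = max(bytes_scanned, min_bytes)  # apply per-query minimum
    return billable_bytes / 1024**4 * usd_per_tb    # convert bytes to TB


# Example: a query that scans 50 GB of data.
cost = estimate_scan_cost(50 * 1024**3)
print(f"${cost:.2f}")
```

Under these assumed rates, a query would need to scan more than 200 GB before it cost over $1, which is why partitioned or columnar datasets tend to keep individual queries inexpensive.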

Google BigQuery

- Google BigQuery is a serverless, highly scalable data warehouse that comes with Google's built-in query engine.
- BigQuery offers two pricing models for running queries (On-demand and Flat-rate); costs are often less than $0.10.
- Learn More.

Cloud-based HPC Cluster

- HPC (High Performance Computing) instances are ideal for applications that benefit from high-performance processors, such as large simulations and machine learning workloads.
- Multiple pricing models, often less than $10.
- Learn More.

AWS SageMaker Studio

- SageMaker Studio provides a web-based integrated development environment (IDE) where users can see and interact with all ML workflows on AWS.
- End-user costs are often less than $20. An Amazon SageMaker Savings Plan can reduce costs by up to 64% compared to On-Demand pricing.
- Learn More.

AWS SageMaker Studio Lab

- Amazon SageMaker Studio Lab is a free machine learning (ML) development environment that provides the compute, storage (up to 15 GB), and security to learn and experiment with ML.
- Free for end users.
- Learn More.
