OEDI Data Lakes
What is a data lake?
A data lake is a collection of curated and diverse datasets built to accelerate accessibility and collaboration. Data Lakes typically hold raw data files, scientific reports, supporting media, and links to online documents chosen by the contributing researchers. The lake enables sustained access to large data files, often through partnerships with a variety of cloud vendors.
Information flows into the Data Lake from a variety of sources: private industry, laboratories, analytic tools, use cases, research reports, and more. Next, the Data Lake's content is curated so that it is consistent and standardized. Once curated, the content flows outward to the public and becomes universally accessible. Public access opens the door to potential collaboration between publication authors, researchers, universities, high schools, startup companies, and other innovators.
The availability and visibility of the Data Lakes remove barriers to innovation. By providing a centralized location, the Data Lakes reduce duplication of effort and lower the cost of storing and analyzing large datasets. New insights and innovations flow outward from the lake, creating opportunities for further rounds of research and development.
Our open architecture is designed for universal access and dissemination of big data. Data Lakes can be accessed via our cloud partners. Jupyter notebooks are a common way to work with Data Lake datasets, but there are several other options for accessing and processing the data:
- Jupyter notebooks (example)
- Google Earth Engine
- Direct access to Data Lakes (Requires a cloud account.)
- Data Lake Viewer (Currently only AWS)
- Native cloud command line tools (AWS, Google, Azure)
- Mounting the data as a local read-only drive in a cloud-built computer cluster (Requires the same availability zone.)
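As an illustration of the direct-access option, the sketch below lists the contents of a data lake bucket over plain HTTPS using only the Python standard library. It assumes the data is hosted in an AWS S3 bucket that permits anonymous public listing; if it does not, authenticated access through a cloud account is needed instead. The function names (`build_list_url`, `parse_listing`, `list_prefix`) are hypothetical helpers written for this example, not part of any OEDI tooling.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# XML namespace used by S3 bucket-listing responses.
S3_NS = {"s3": "http://s3.amazonaws.com/doc/2006-03-01/"}


def build_list_url(bucket, prefix="", max_keys=50):
    """Build an S3 ListObjectsV2 URL for one 'folder' level of a bucket."""
    query = urllib.parse.urlencode({
        "list-type": "2",      # selects the ListObjectsV2 API
        "prefix": prefix,      # only keys starting with this prefix
        "delimiter": "/",      # group deeper keys into folder-like prefixes
        "max-keys": str(max_keys),
    })
    return f"https://{bucket}.s3.amazonaws.com/?{query}"


def parse_listing(xml_text):
    """Return (folders, files) from a ListObjectsV2 XML response.

    folders is a list of prefix strings; files is a list of (key, size) tuples.
    """
    root = ET.fromstring(xml_text)
    folders = [e.text for e in root.findall("s3:CommonPrefixes/s3:Prefix", S3_NS)]
    files = [
        (c.findtext("s3:Key", namespaces=S3_NS),
         int(c.findtext("s3:Size", namespaces=S3_NS)))
        for c in root.findall("s3:Contents", S3_NS)
    ]
    return folders, files


def list_prefix(bucket, prefix=""):
    """Fetch and parse one listing page (assumes anonymous reads are allowed)."""
    with urllib.request.urlopen(build_list_url(bucket, prefix), timeout=30) as resp:
        return parse_listing(resp.read().decode("utf-8"))
```

Calling `list_prefix("oedi-data-lake")` would return the top-level folders and files, where the bucket name `oedi-data-lake` is an assumption that should be checked against the dataset's catalog entry. Only the first page of results is returned; larger listings would need to follow the continuation token in the response.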
Typical Costs of Data Access
Data Lakes can be accessed in a variety of ways depending on users' needs and budgets. Jupyter notebooks can be accessed free of charge, and many operations can be run directly in the cloud. Depending on the workload and access mechanism, a query with AWS Athena can cost under $1 USD. For more resource-intensive operations, Google's BigQuery, with on-demand or flat-rate pricing, can process SQL queries on terabytes of data in a matter of seconds. SageMaker Studio Lab provides a fully managed machine learning service, free of charge, with no credit card or AWS account needed. See example use cases.
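To make the per-query cost concrete, here is a rough back-of-the-envelope sketch for services that bill by the volume of data scanned. The default rate of $5 per terabyte and the use of decimal terabytes (10^12 bytes) are assumptions for illustration only; check the provider's current pricing page before relying on any figure. The function name `estimate_scan_cost` is hypothetical.

```python
def estimate_scan_cost(bytes_scanned, usd_per_tb=5.0):
    """Estimate the cost of a query billed per terabyte of data scanned.

    usd_per_tb is an assumed rate, not a quoted price.
    """
    if bytes_scanned < 0:
        raise ValueError("bytes_scanned must be non-negative")
    tb_scanned = bytes_scanned / 1e12  # decimal terabytes (assumption)
    return tb_scanned * usd_per_tb


# Example workloads: a filtered query over 150 GB and a full scan of 2 TB.
for label, size in [("150 GB scan", 150e9), ("2 TB scan", 2e12)]:
    print(f"{label}: about ${estimate_scan_cost(size):.2f}")
```

Under these assumptions, a query that scans a few hundred gigabytes or less stays under $1, which is why partitioned and columnar datasets that limit the bytes scanned tend to keep costs low.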