Experience is the best teacher. Data providers need a space to share their experiences generating analysis-ready, cloud-optimized (ARCO) datasets, and this ESIP Winter session provides that space for collecting best practices and examples.
The Cloud Computing Cluster is excited to organize this session, which will produce “real outputs” in the form of best practices, guidance, and use-case-driven workflows with examples, formalized in GitHub repositories and shared through channels such as LinkedIn, Twitter, and Slack.
The Cloud Computing Cluster is composed of members from data and service providers such as USGS, NASA, NOAA, cloud providers such as AWS and Microsoft, and academic institutions such as UCAR/NCAR and the University of Washington. The January meeting provides an opportunity for these organizations to share and coalesce on best practices for analysis-ready cloud-optimized (ARCO) data.
Agenda (see more details in the slide deck):
- 1:30-1:45: Gather, share the agenda, and share some work-in-progress “guidance” documents where we are consolidating what we know and what we don’t know (“Lessons from the field”)
- 1:45-2:05: Lightning talks: hear about the experiences of a few members of our community.
- Action for attendees: Listen and use the chat box to ask questions.
- 2:05-2:15: Individual (silent) brainstorming
- Action for attendees: Make a copy of slide 9 and answer questions
- 2:15-2:35: Small group brainstorming + voting
- Action for attendees: Share answers to slide 9 questions
- Action for attendees: Vote on the questions in the slide
- 2:35-3:05: Fishbowl
- Action for attendees: If your question received the most votes, pose it to the group
- Action for attendees: Listen to questions and deliver answers or insights.
- 3:05-3:15: Break
- Action for attendees: Recover for tutorials!
- 3:15-3:45: Tutorials on kerchunk and pangeo-forge
- Action for attendees: Listen to tutorials
- 3:45-4:00: Wrap up and next steps
- Action for attendees: Sign up for the email list and Slack if you are not already on those channels
Lightning Talks
Four 5-minute lightning talks:
- Dieu My Thanh Nguyen on Zarr chunking strategies research
- Anderson Banihirwe on producing a Zarr data store for complex climate model data (the Community Earth System Model Large Ensemble (CESM LENS) datasets on AWS); a minimal chunked-Zarr sketch follows this list
- Lucas Sterzinger on “Fake it until you make it”: reading GOES NetCDF4 data on AWS S3 as Zarr for rapid data access
- Landung (Don) Setiawan on building data portals for OOI and other NASA-funded projects that require large-scale data conversion and use of both COG and Zarr
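To make the Zarr-production theme of these talks concrete, the sketch below writes a small xarray dataset to a chunked, consolidated Zarr store. The synthetic variable, grid size, chunk sizes, and output path are illustrative assumptions, not the actual CESM LENS or GOES workflows; it only shows the basic to_zarr pattern that cloud-optimized pipelines build on (dask and zarr must be installed).

```python
import numpy as np
import pandas as pd
import xarray as xr

# Small synthetic dataset standing in for model output
# (variable name, grid size, and time span are illustrative assumptions).
ds = xr.Dataset(
    {"tas": (("time", "lat", "lon"), np.random.rand(365, 36, 72))},
    coords={
        "time": pd.date_range("2000-01-01", periods=365),
        "lat": np.linspace(-87.5, 87.5, 36),
        "lon": np.linspace(2.5, 357.5, 72),
    },
)

# Chunking drives cloud performance: each chunk becomes one object in
# object storage, so chunks should be large enough to amortize request
# overhead but small enough to read selectively (requires dask).
ds = ds.chunk({"time": 100})

# Write a consolidated Zarr store; the local path stands in for an
# s3:// or gs:// URL reached through fsspec in a real pipeline.
ds.to_zarr("example_tas.zarr", mode="w", consolidated=True)
```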
Tutorials
pangeo-forge with Charles Stern: Pangeo Forge is an open-source tool for data Extraction, Transformation, and Loading (ETL). The goal of Pangeo Forge is to make it easy to extract data from traditional data repositories and deposit it in cloud object storage in analysis-ready, cloud-optimized (ARCO) format.
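A minimal sketch of the recipe concept, assuming the FilePattern/XarrayZarrRecipe API that pangeo-forge-recipes exposed around the time of this session (the library’s API has since evolved, so check current docs). The URL template, date keys, and chunk sizes are made-up placeholders, and storage/execution configuration is omitted because it is version-dependent.

```python
from pangeo_forge_recipes.patterns import ConcatDim, FilePattern
from pangeo_forge_recipes.recipes import XarrayZarrRecipe

# Hypothetical archive of daily NetCDF files, one file per date.
dates = ["2021-01-01", "2021-01-02", "2021-01-03"]

def make_url(time):
    # The URL template is a made-up placeholder, not a real archive.
    return f"https://data.example.org/daily/sst_{time}.nc"

# A FilePattern maps dimension keys (dates along "time") to source URLs.
pattern = FilePattern(make_url, ConcatDim("time", keys=dates, nitems_per_file=1))

# The recipe describes how those files should be combined and rechunked
# into a single Zarr store; target storage and execution are configured
# separately and differ across pangeo-forge-recipes versions.
recipe = XarrayZarrRecipe(pattern, target_chunks={"time": 10})
```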
Kerchunk with Lucas Sterzinger: cloud-friendly access to archival data. Kerchunk is a library that provides a unified way to represent a variety of chunked, compressed data formats (e.g. NetCDF, HDF5, GRIB), allowing efficient access to the data from traditional file systems or cloud object storage.
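A minimal sketch of the kerchunk workflow: scan one NetCDF4/HDF5 granule on S3 into a reference set, then open it through fsspec’s “reference” filesystem with xarray’s Zarr engine. The bucket/key is a made-up placeholder and anonymous S3 access is an assumption.

```python
import fsspec
import xarray as xr
from kerchunk.hdf import SingleHdf5ToZarr

# Placeholder S3 URL for a NetCDF4/HDF5 granule; swap in a real object.
url = "s3://example-bucket/path/to/granule.nc"
so = {"anon": True}  # assumes the bucket allows anonymous reads

# Scan the HDF5 file once and record the byte ranges of every chunk
# as a Zarr-style reference set (a plain, JSON-serializable dict).
with fsspec.open(url, "rb", **so) as f:
    refs = SingleHdf5ToZarr(f, url).translate()

# Open the original file as if it were a Zarr store, without copying it:
# the reference filesystem translates Zarr chunk keys into ranged S3 reads.
fs = fsspec.filesystem(
    "reference", fo=refs, remote_protocol="s3", remote_options=so
)
ds = xr.open_dataset(
    fs.get_mapper(""), engine="zarr", backend_kwargs={"consolidated": False}
)
print(ds)
```

In practice many granules are scanned and their references combined (for example with kerchunk.combine.MultiZarrToZarr) so an entire collection can be read as one virtual Zarr dataset.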