For over 20 years, ESIP meetings have brought together the most innovative thinkers and leaders around Earth science data, forming a community dedicated to making Earth science data more discoverable, accessible and useful to researchers, practitioners, policymakers, and the public. The theme of the July meeting is "Data for All People: From Generation to Use and Understanding."

Registered attendees can join us virtually at https://2022julyesipmeeting.qiqochat.com/.
Thursday, July 21 • 8:30am - 10:00am
Advances and Challenges of Cloud-Native Data (including Analysis-Ready Cloud-Optimized, or ARCO Formats) and Access, Part 1: Presentations

Zoom Recording
Notes Doc

(Check out Part 2)

Part 1: Presentations on Geospatial Cloud Data Formats and Access

The ESIP Cloud Computing Cluster is the space for new and existing Earth science users of cloud-hosted data to discuss new technologies, challenges and opportunities. Part 1 of this two-part session will include presentations on technologies used to create, store and access geospatial data in the cloud. Presentations will cover new and existing tools such as pangeo-forge, Zarr, Cloud-Optimized GeoTIFFs (COGs), Kerchunk, xpublish, Cloud-Optimized Point Clouds and GeoParquet. “It’s become increasingly clear how these formats are more convenient and performant than archival formats” (from Dave Meyer, GES DISC). We will emphasize the importance of real-world use cases in these presentations. Matt Hanson and another speaker will give “state of cloud native” presentations at the start and at the finish.

During these talks, attendees will be encouraged to add questions to a virtual list where questions may be “upvoted”. These questions will be clustered to form discussion groups for the afternoon session: “Cloud Out Loud”.

Presentations:

Aimee Barciauskas: Motivations
  • Why this session? We need to learn from each other, understand advances and current methods
  • What to expect from this session
  • Agenda for Part 2: Discussion groups
  • How to get involved with the cloud computing cluster

Matt Hanson: STAC and how it’s Powering Cloud-Native Workflows

Brianna Pagan: Current State of Cloud-Native Geospatial Formats
Geospatial information exists in a wide diversity of data types, including vector, raster, point and multi-dimensional data cubes. The movement towards cloud-native geospatial data formatting has resulted in the creation of Cloud-Optimized GeoTIFFs (COGs), Zarr, GeoParquet and Cloud-Optimized Point Clouds (COPCs). This talk aims to provide an introductory overview of current cloud-native formats in terms of performance, popularity and suitability for various data archive types.

Christine Smit: Metadata for geospatial, multi-dimensional Zarr arrays in the Cloud
The science community has begun to coalesce around Zarr for multi-dimensional data in the cloud. By itself, Zarr's metadata specification is focused on the mechanics of storing arrays rather than on the relationships between arrays or on the semantics of the data being stored. Fortunately, Zarr's metadata is highly flexible, and geospatial enthusiasts have started adding metadata and moving towards standardization. Python's popular xarray library has leaned heavily on CF-1 standards. NetCDF has brought its variable and dimension relationships into its own Zarr standard, which was partly inspired by xarray's work. This talk aims to provide an overview of where standards are right now.
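As a rough illustration of the split between storage mechanics and semantics described in this abstract, the sketch below uses only the Python standard library to mimic the two JSON documents that accompany a Zarr v2 array: `.zarray` (storage layout) and `.zattrs` (free-form attributes). The `_ARRAY_DIMENSIONS` attribute is the convention xarray uses to record dimension names in Zarr v2 stores. The shape, chunking, attribute values and the `dimension_sizes` helper are assumptions made up for this example and are not taken from the talk.

```python
import json

# Storage mechanics: the fields Zarr v2 itself defines in `.zarray`.
# Shape, chunks and dtype are made-up example values.
zarray = {
    "zarr_format": 2,
    "shape": [365, 180, 360],
    "chunks": [30, 90, 90],
    "dtype": "<f4",
    "compressor": None,
    "fill_value": "NaN",
    "order": "C",
    "filters": None,
}

# Semantics: free-form user attributes stored in `.zattrs`.
zattrs = {
    "_ARRAY_DIMENSIONS": ["time", "lat", "lon"],  # xarray's Zarr v2 convention
    "standard_name": "air_temperature",           # CF-style attribute (example)
    "units": "K",
}


def dimension_sizes(zarray, zattrs):
    """Pair each named dimension with its length (hypothetical helper)."""
    names = zattrs.get("_ARRAY_DIMENSIONS")
    if names is None or len(names) != len(zarray["shape"]):
        raise ValueError("array has no usable dimension names")
    return dict(zip(names, zarray["shape"]))


print(json.dumps(dimension_sizes(zarray, zattrs)))
```

Without the attribute, a reader of the store sees only an anonymous 3-D array, which is the gap the additional metadata conventions try to close.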

Hailiang Zhang: Zarr-based chunk-level cumulative sums in reduced dimensions for fast high-resolution data analysis
At NASA GES DISC, we receive a large number of user requests each day for a variety of analysis and visualization services, some of which are very expensive because they average large amounts of data along one or more dimensions. These expensive services can be greatly sped up by adapting our data into a cloud-friendly chunked format, such as Zarr, to facilitate parallel data access and computation; however, it is challenging to implement an efficient multidimensional averaging service with an optimal chunk layout for high-resolution datasets. We propose a generic and dimension-agnostic method based on chunk-level cumulative sums on the regular grid, which provides fast and cost-efficient cloud analysis for multidimensional averaging services such as area and time averaging. This method involves chunk-level weighted integration in stepwise-reduced dimensions, which introduces a small adjustable set of auxiliary data and leaves the raw data untouched. Compared to the standard method, this approach reduces computation time by orders of magnitude at minimal AWS cost.
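The general idea of chunk-level cumulative sums can be sketched in one dimension with plain Python. This is a simplified, unweighted illustration and not the authors' implementation: the function names, the chunk size and the sample data are assumptions. The auxiliary array holds running totals of per-chunk sums, so a range average needs raw values only from the partially covered chunks at the two edges.

```python
from itertools import accumulate


def build_chunk_cumsums(data, chunk):
    """Auxiliary data: running totals of per-chunk sums (length n_chunks + 1)."""
    chunk_sums = [sum(data[i:i + chunk]) for i in range(0, len(data), chunk)]
    return [0] + list(accumulate(chunk_sums))


def range_mean(data, cumsums, chunk, start, stop):
    """Mean of data[start:stop], touching raw values only in partial edge chunks."""
    if not 0 <= start < stop <= len(data):
        raise ValueError("empty or out-of-bounds range")
    first_full = -(-start // chunk)  # index of the first fully covered chunk (ceil)
    last_full = stop // chunk        # one past the last fully covered chunk (floor)
    if first_full >= last_full:
        total = sum(data[start:stop])  # no whole chunk lies inside the range
    else:
        total = cumsums[last_full] - cumsums[first_full]  # all whole chunks at once
        total += sum(data[start:first_full * chunk])      # leading partial chunk
        total += sum(data[last_full * chunk:stop])        # trailing partial chunk
    return total / (stop - start)


data = list(range(10))                      # made-up sample values
cumsums = build_chunk_cumsums(data, chunk=4)
print(range_mean(data, cumsums, 4, 2, 9))
```

However long the requested range is, the work on raw data is bounded by two chunks; the rest comes from a single subtraction on the auxiliary array, and the raw data are never modified.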

Maha Hegde: Operating mirror data stores: Challenges and Potential Solutions
With the emergence of object-store-friendly data formats that differ from the formats used in the official data archive, the user community is demanding the creation of parallel archives to take advantage of cloud computing's strengths. In many cases, data stores are being created for consumption by specific communities. This talk explores the challenges and potential approaches, including recording provenance, to ensure that the data in a mirror data store is complete, verified and trustworthy.

Ramon Ramirez-Linan: NASA’s Science Managed Cloud Environment (SMCE)
NASA’s Science Managed Cloud Environment (SMCE) is a managed Amazon Web Services (AWS)-based infrastructure for NASA-funded projects. SMCE engineers at NASA Goddard Space Flight Center were challenged to integrate numerous existing open-source projects that can easily be replicated both in the cloud and on premises in a complementary fashion. The SMCE team developed the NASA Earth Information System (EIS): a flexible, rapid response computing capability that leverages the versatility of the AWS cloud, including high-performance computing (HPC) services. With an Open Science objective, the SMCE team designed a platform that creates Infrastructure as Code (IaC) artifacts that are useful to NASA scientists and allows organizations outside of NASA to replicate this deployment.

Terence Tuhinanshu: Benchmarking the performance of cloud-friendly encodings of NWM
The NOAA National Water Model (NWM) Retrospective dataset contains retrospective simulations of streamflow, soil moisture, and snowpack conditions at hourly and 3-hourly frequencies over the continental US from 1979 to 2020. This dataset has great value for environmental scientists, but is stored in a way that is not optimized for common patterns of usage. In particular, the dataset is stored as one NetCDF file for each time step, where each file covers the whole country. Implementing a query that involves a large number of time steps and a small subset of the country requires downloading a large number of files and then discarding all but a small subset of the data, which is inefficient and not optimized for cloud computing. Recently, NWM has been re-encoded in Zarr format and released on AWS S3. The Zarr format supports reading subarrays from cloud storage in parallel, which has the potential to speed up queries to NWM, although the performance benefits for specific queries may vary with the chunking strategy used in the Zarr encoding.
This talk will discuss a set of experiments exploring how different approaches to encoding NWM data affect query performance. In addition to trying different chunking strategies with Zarr, we will present results on encoding data using Parquet, which is another cloud-friendly format that is better suited for tabular datasets. We hypothesize that Parquet may be more performant for the streamflow output, which is tabular. These experiments will use a parametrically varied set of prototypical queries, and will also vary the number of cores of computation to test scalability. All code to run these experiments will be written in Python using xarray and Dask, and will be made open source.
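To make the dependence on chunking strategy concrete, the following sketch counts how many chunks a rectangular query intersects under two layouts. It is a back-of-the-envelope model only: the grid size, the chunk shapes and the query are hypothetical and do not describe the actual NWM encodings or the experiments in this talk.

```python
from math import prod


def chunks_touched(query, chunk_shape):
    """Count chunks intersecting a query given as one (start, stop) per dimension."""
    counts = []
    for (start, stop), size in zip(query, chunk_shape):
        if stop <= start:
            raise ValueError("query ranges must be non-empty")
        first = start // size        # chunk index holding the first element
        last = (stop - 1) // size    # chunk index holding the last element
        counts.append(last - first + 1)
    return prod(counts)


# Hypothetical (time, y, x) query: one year of hourly steps over a 10 x 10 cell area.
query = ((0, 8760), (100, 110), (200, 210))

per_timestep = (1, 4000, 4000)   # mimics one whole-domain object per time step
time_oriented = (720, 50, 50)    # long in time, small in space

print(chunks_touched(query, per_timestep))
print(chunks_touched(query, time_oriented))
```

Under these assumed numbers the per-time-step layout touches 8760 objects while the time-oriented layout touches 13, which is the kind of gap that makes query performance depend on how the data are chunked.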

Ryan Abernathey: Pangeo-forge: Crowdsourcing Open Data in the Cloud



Speakers

Aimee Barciauskas

Data Engineer, Development Seed

Ramon Ramirez-Linan

co-founder, Navteca

Ryan Abernathey

Associate Professor, Columbia University
Ryan P. Abernathey, an Associate Professor of Earth and Environmental Science at Columbia University and Lamont Doherty Earth Observatory, is a physical oceanographer who studies large-scale ocean circulation and its relationship with Earth's climate. He received his Ph.D. from MIT...

Robert Casey

Deputy Director of Cyberinfrastructure, IRIS Data Services
Rob currently serves as Deputy Director of Cyberinfrastructure at the Incorporated Research Institutions for Seismology (IRIS) Data Management Center (DMC) in Seattle, WA. His responsibilities include management of software development and data services activities as well as leading...

Dave Meyer

GES DISC, NASA

Matt Hanson

Sr Software Engineer, Element 84
Geospatial data interoperability and discovery

James Coll

Physical Scientist, Lynker

Brianna Pagan

Deputy Manager, NASA GES DISC


Thursday July 21, 2022 8:30am - 10:00am EDT
Ballroom 2 600 Commonwealth Pl, Pittsburgh, PA 15222