Part 1: Presentations on Geospatial Cloud Data Formats and Access
The ESIP Cloud Computing Cluster is a space for new and existing users of Earth science data in the cloud to discuss new technologies, challenges and opportunities. Part 1 of this two-part session will include presentations on technologies used to create, store and access geospatial data in the cloud. Presentations will cover new and existing tools such as pangeo-forge, Zarr, Cloud-Optimized GeoTIFFs (COGs), Kerchunk, xpublish, Cloud-Optimized Point Clouds (COPCs) and GeoParquet. “It’s become increasingly clear how these formats are more convenient and performant than archival formats” (Dave Meyer, GES DISC). These presentations will emphasize real-world use cases. Matt Hanson and another speaker will give “state of cloud native” presentations to open and close the session.
During these talks, attendees will be encouraged to add questions to a virtual list, where questions can be “upvoted”. These questions will then be clustered to form discussion groups for the afternoon session, “Cloud Out Loud”.
Presentations:
Aimee Barciauskas: Motivations
- Why this session? We need to learn from each other and understand advances and current methods
- What to expect from this session
- Agenda for Part 2: Discussion groups
- How to get involved with the Cloud Computing Cluster
Matt Hanson: STAC and How It’s Powering Cloud-Native Workflows
Briana Pagan: Current State of Cloud-Native Geospatial Formats Geospatial information exists in a wide diversity of data types, including vector, raster, point and multi-dimensional data cubes. The movement toward cloud-native geospatial data formatting has resulted in the creation of Cloud-Optimized GeoTIFFs (COGs), Zarr, GeoParquet and Cloud-Optimized Point Clouds (COPCs). This talk aims to provide an introductory overview of current cloud-native formats in terms of performance, popularity and suitability for various data archive types.
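As a rough illustration of what “cloud-optimized” means in practice, the sketch below reads a small window from a COG over HTTP with rasterio; only the bytes for the overlapping internal tiles are fetched via range requests. The URL is a placeholder, not a dataset from the talk.

```python
# Minimal sketch of the partial-read access pattern that makes these
# formats "cloud-optimized": only the bytes for the requested window
# are fetched via HTTP range requests.
import rasterio
from rasterio.windows import Window

cog_url = "https://example.com/data/scene.tif"  # hypothetical COG location

with rasterio.open(cog_url) as src:
    # Read a 512 x 512 pixel window; rasterio/GDAL requests only the
    # internal tiles that overlap it, not the whole file.
    window = Window(col_off=0, row_off=0, width=512, height=512)
    subset = src.read(1, window=window)

print(subset.shape)  # (512, 512)
```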
Christine Smit: Metadata for geospatial, multi-dimensional Zarr arrays in the Cloud The science community has begun to coalesce around Zarr for multi-dimensional data in the cloud. By itself, Zarr's metadata specification is focused on the mechanics of storing arrays rather than on the relationships between arrays or on the semantics of the data being stored. Fortunately, Zarr's metadata is highly flexible, and geospatial enthusiasts have started the process of adding additional metadata and moving towards standardization. Python's popular xarray library has leaned heavily on CF-1 standards. NetCDF has brought its variable and dimension relationships into its own Zarr standard, which was partly inspired by xarray's work. This talk aims to provide an overview of where these standards stand right now.
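For a concrete picture of the layering the talk describes, here is a minimal sketch (invented data; zarr-python v2 on-disk layout) showing how xarray writes dimension names and CF-style attributes into Zarr metadata.

```python
# Sketch: xarray layers dimension names and CF-style attributes on top
# of Zarr's own array metadata. Dataset contents are invented.
import numpy as np
import xarray as xr
import zarr

ds = xr.Dataset(
    {
        "precipitation": (
            ("time", "lat", "lon"),
            np.zeros((2, 3, 4), dtype="float32"),
            # CF-style attributes travel with the array into Zarr metadata
            {"units": "mm/hr", "long_name": "precipitation rate"},
        )
    },
    coords={"lat": [0.0, 1.0, 2.0], "lon": [0.0, 1.0, 2.0, 3.0]},
)
ds.to_zarr("example.zarr", mode="w")

# xarray records each array's dimension names in a conventional attribute;
# this is how relationships between arrays are reconstructed on read.
group = zarr.open("example.zarr", mode="r")
print(group["precipitation"].attrs["_ARRAY_DIMENSIONS"])  # ['time', 'lat', 'lon']
print(group["precipitation"].attrs["units"])              # mm/hr
```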
Hailiang Zhang: Zarr-based chunk-level cumulative sums in reduced dimensions for fast high-resolution data analysis At NASA GES DISC, we receive a large number of user requests each day for a variety of analysis and visualization services, some of which are very expensive due to the large amounts of data averaged along one or more dimensions. These expensive services can be greatly sped up by adapting our data to a cloud-friendly chunked format, such as Zarr, to facilitate parallel data access and computation; however, it is challenging to implement an efficient multidimensional averaging service with an optimal chunk layout for high-resolution datasets. We propose a generic, dimension-agnostic method based on chunk-level cumulative sums on a regular grid, which provides fast and cost-efficient cloud analysis for multidimensional averaging services such as area and time averaging. The method performs chunk-level weighted integration in stepwise-reduced dimensions, introducing a small, adjustable set of auxiliary data while leaving the raw data untouched. Compared to the standard approach, it reduces computational time by orders of magnitude with minimal added AWS cost.
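The abstract does not include code, but the core idea can be sketched in one dimension: precompute cumulative sums of whole-chunk totals as a small auxiliary array, so that an average over an arbitrary range touches at most two raw edge chunks. This is my own schematic, not the GES DISC implementation.

```python
# Schematic 1-D illustration of chunk-level cumulative sums: interior
# chunks of the query range are resolved from a small auxiliary array,
# so only the two partial edge chunks of raw data are ever read.
import numpy as np

CHUNK = 1000
data = np.random.rand(1_000_000)  # stands in for one dimension of a Zarr array

# Auxiliary data: cumulative sum of whole-chunk totals (one value per chunk).
chunk_totals = data.reshape(-1, CHUNK).sum(axis=1)
chunk_cumsum = np.concatenate([[0.0], np.cumsum(chunk_totals)])

def range_mean(a: int, b: int) -> float:
    """Mean of data[a:b], reading at most two raw chunks."""
    ka = -(-a // CHUNK)   # first chunk fully inside [a, b)
    kb = b // CHUNK       # one past the last chunk fully inside [a, b)
    total = chunk_cumsum[kb] - chunk_cumsum[ka]  # interior chunks, from aux data
    total += data[a : ka * CHUNK].sum()          # left edge chunk
    total += data[kb * CHUNK : b].sum()          # right edge chunk
    return total / (b - a)

assert np.isclose(range_mean(1234, 987_654), data[1234:987_654].mean())
```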
Maha Hegde: Operating mirror data stores: Challenges and Potential Solutions With the emergence of object-store-friendly data formats that differ from the formats used in the official data archive, the user community is demanding the creation of parallel archives that take advantage of cloud computing's strengths. In many cases, data stores are being created for consumption by specific communities. This talk explores the challenges and potential approaches, including recording provenance, to ensure that the data in a mirror data store is complete, verified and trustworthy.
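One hedged sketch of the verification side (my illustration, not the talk's method): hash the decoded array values rather than the files, so an archive NetCDF granule and its format-converted Zarr mirror can be compared even though their bytes differ, recording the outcome as a simple provenance entry. Paths and the variable name are invented.

```python
# Sketch: verify a format-converted mirror against the archive copy by
# hashing decoded values (format-independent), then record the result.
import hashlib
import json
from datetime import datetime, timezone

import numpy as np
import xarray as xr

def values_digest(ds: xr.Dataset, var: str) -> str:
    """SHA-256 of the decoded values, independent of on-disk format."""
    arr = np.ascontiguousarray(ds[var].values)
    return hashlib.sha256(arr.tobytes()).hexdigest()

source = xr.open_dataset("archive/granule.nc")  # hypothetical archive copy
mirror = xr.open_zarr("mirror/granule.zarr")    # hypothetical cloud mirror

digest = values_digest(source, "precipitation")
record = {
    "variable": "precipitation",
    "source": "archive/granule.nc",
    "mirror": "mirror/granule.zarr",
    "digest": digest,
    "verified": digest == values_digest(mirror, "precipitation"),
    "checked_at": datetime.now(timezone.utc).isoformat(),
}
print(json.dumps(record, indent=2))  # one provenance entry
```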
Ramon Ramirez-Linan: NASA’s Science Managed Cloud Environment (SMCE) NASA’s Science Managed Cloud Environment (SMCE) is a managed Amazon Web Services (AWS)-based infrastructure for NASA-funded projects. SMCE engineers at NASA Goddard Space Flight Center were challenged to integrate numerous existing open-source projects in a complementary fashion, so that deployments can easily be replicated both in the cloud and on premises. The SMCE team developed the NASA Earth Information System (EIS): a flexible, rapid-response computing capability that leverages the versatility of the AWS cloud, including high-performance computing (HPC) services. With an Open Science objective, the SMCE team designed a platform that creates Infrastructure as Code (IaC) artifacts that are useful to NASA scientists and allow organizations outside NASA to replicate the deployment.
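For readers unfamiliar with IaC artifacts, a generic sketch in AWS CDK for Python is shown below. This is not SMCE's actual code; it only illustrates the kind of reproducible deployment artifact the abstract describes, one that another organization could synthesize and deploy in its own account.

```python
# Generic Infrastructure-as-Code sketch (AWS CDK v2, Python). Not SMCE's
# code; it illustrates a replicable deployment artifact.
from aws_cdk import App, Stack, aws_s3 as s3
from constructs import Construct

class ScienceEnvironmentStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)
        # A versioned bucket for analysis outputs; a real stack would add
        # compute (e.g. HPC or JupyterHub resources) alongside storage.
        s3.Bucket(self, "AnalysisOutputs", versioned=True)

app = App()
ScienceEnvironmentStack(app, "science-environment")
app.synth()  # emits a CloudFormation template others can deploy
```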
Terence Tuhinanshu: Benchmarking the performance of cloud-friendly encodings of NWM The NOAA National Water Model (NWM) Retrospective dataset contains retrospective simulations of streamflow, soil moisture, and snowpack conditions at hourly and 3-hourly frequencies over the continental US from 1979 to 2020. This dataset has great value for environmental scientists but is stored in a way that is not optimized for common usage patterns. In particular, the dataset is stored as one NetCDF file per time step, where each file covers the whole country. A query that involves a large number of time steps and a small subset of the country therefore requires downloading a large number of files and discarding all but a small fraction of the data, which is inefficient and poorly suited to cloud computing. Recently, NWM has been re-encoded in Zarr format and released on AWS S3. The Zarr format supports reading subarrays from cloud storage in parallel, which has the potential to speed up queries to NWM, although the performance benefit for specific queries may vary with the chunking strategy used in the Zarr encoding.
This talk will discuss a set of experiments exploring how different approaches to encoding NWM data affect query performance. In addition to trying different chunking strategies with Zarr, we will present results on encoding data using Parquet, another cloud-friendly format that is better suited to tabular datasets. We hypothesize that Parquet may be more performant for the streamflow output, which is tabular. These experiments will use a parametrically varied set of prototypical queries and will also vary the number of computation cores to test scalability. All code to run these experiments will be written in Python using xarray and Dask, and will be made open source.
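The access pattern under test can be sketched with xarray and Dask as follows; the S3 URL, variable name, and feature IDs here are placeholders rather than the actual store layout.

```python
# Sketch of the query pattern the experiments target: a long time range
# over a few stream reaches, read from a Zarr store with xarray + Dask.
# Requires s3fs for s3:// URLs; names below are hypothetical.
import xarray as xr

ds = xr.open_zarr(
    "s3://example-bucket/nwm-retrospective.zarr",  # placeholder location
    storage_options={"anon": True},
)

# Many time steps, few reaches: under the per-time-step NetCDF layout this
# would mean downloading thousands of CONUS-wide files; with Zarr, only the
# chunks overlapping the selection are fetched.
subset = ds["streamflow"].sel(
    time=slice("1990-01-01", "2000-01-01"),
    feature_id=[101, 102, 103],  # invented reach IDs
)
result = subset.mean("time").compute()  # Dask reads chunks in parallel
```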
Ryan Abernathey: Pangeo-forge: Crowdsourcing Open Data in the Cloud