Catalog

Access to geospatial data has changed significantly over the past decade. Traditionally, data was accessed by downloading files to a local computer and analyzing them with desktop software or programming languages. Assembling analysis-ready datasets has always been difficult because of the diversity of data formats (NetCDF, GRIB2, GeoTIFF, Shapefile, etc.) and the variety of access protocols offered by different providers (OPeNDAP, HTTPS, SFTP, WPS, REST APIs, data marts, etc.). Moreover, with the ever-increasing size of geospatial datasets, most modern datasets no longer fit on a local computer at all, which limits scientific progress.

The catalog presented here consists of large-scale analysis-ready, cloud-optimized (ARCO) datasets. To provide a single entry point to these datasets, we follow the methodology developed by the Pangeo community, which combines several technologies:

  • Data lake (S3, Azure Data Lake Storage, GCS, etc.): distributed file/object storage

  • Zarr (or alternatively TileDB, COGs): chunked N-dimensional array formats

  • Dask (or alternatively Spark, Ray, Distributed): distributed computing and lazy loading

  • Intake catalogs (or alternatively STAC): a general interface for loading different data formats, mostly but not limited to spatiotemporal assets
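
As a minimal illustration of how these pieces fit together, the sketch below opens a Zarr store on object storage as a lazily-loaded, Dask-backed xarray dataset. The bucket URL and variable name are placeholders rather than actual catalog entries:

[ ]:
import xarray as xr

# Hypothetical Zarr store on S3-compatible object storage (placeholder URL).
# open_zarr reads only the metadata up front; the chunked arrays are exposed
# as lazy Dask arrays, so no data is transferred until a computation is requested.
ds = xr.open_zarr('s3://example-bucket/example-dataset.zarr',
                  storage_options={'anon': True})

# Selections and reductions stay lazy; .compute() triggers the parallel read.
# monthly_mean = ds['temperature'].sel(time='2020-01').mean('time').compute()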

For more information, please refer to the Pangeo website.

It is important to keep in mind that most datasets in the catalog are stored in language-agnostic formats (such as Zarr, NetCDF, GeoJSON, etc.), making them accessible from a variety of programming languages (including Python, Julia, JavaScript, C, etc.) that implement the specifications for these formats.
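
Because access happens at the level of an open format specification, no particular client library is required. For instance, a sketch of inspecting the same hypothetical store from the example above, this time with the reference Python implementation of Zarr rather than xarray:

[ ]:
import zarr

# 's3://example-bucket/example-dataset.zarr' is a placeholder URL; any
# implementation of the Zarr spec (Python, Julia, JavaScript, C, ...) can read it.
store = zarr.open('s3://example-bucket/example-dataset.zarr', mode='r',
                  storage_options={'anon': True})
print(store.tree())  # inspect the group/array hierarchy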

The current catalog is presented below as a table. Consult the status field before using a dataset. A “dev” flag means that we are actively working on the dataset and do not recommend using it; a “prod” flag means that it is production-ready. The “prod” label indicates that the dataset has undergone quality review and testing; however, users should always double-check on their own, because errors are still possible.
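
For example, a short sketch of checking the status flag programmatically before loading an entry (the field and dataset names below are hypothetical; pick real ones from the table):

[ ]:
import intake

cat = intake.open_catalog('https://raw.githubusercontent.com/hydrocloudservices/catalogs/main/catalogs/main.yaml')

# Hypothetical entry: replace 'atmosphere' / 'era5' with a real field / dataset.
entry = cat['atmosphere']['era5']
status = entry.describe()['metadata']['status'][0]

if status == 'prod':
    ds = entry.to_dask()  # open lazily as an xarray/Dask-backed dataset
else:
    print(f"Dataset is flagged '{status}'; not recommended for use yet.")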

[1]:
from IPython.display import HTML

import pandas as pd
import intake
from itables import init_notebook_mode, show

init_notebook_mode(all_interactive=False)

# Open the master Intake catalog hosted on GitHub
catalog_url = 'https://raw.githubusercontent.com/hydrocloudservices/catalogs/main/catalogs/main.yaml'
cat = intake.open_catalog(catalog_url)

pd.set_option('display.max_colwidth', None)

# Flatten the two-level catalog (field -> dataset) into one row per dataset,
# pulling each entry's description and first status flag from its metadata
df = pd.DataFrame([[field,
                    dataset,
                    cat[field][dataset].describe()['description'],
                    cat[field][dataset].describe()['metadata']['status'][0]]
                   for field in list(cat._entries.keys())
                   for dataset in cat[field]._entries.keys()],
                  columns=['field', 'dataset_name', 'description', 'status']) \
    .sort_values('field') \
    .reset_index(drop=True)

# Render the dataframe as an interactive, filterable table via itables
show(df,
     tags="<caption>Catalog</caption>",
     column_filters="footer",
     dom="lrtip")

# Add a button that toggles visibility of the input code cells in the
# rendered HTML page (assumes classic-notebook 'div.input_area' styling)
HTML('''<script>
code_show=true;
function code_toggle() {
 if (code_show){
 $('div.input_area').hide();
 } else {
 $('div.input_area').show();
 }
 code_show = !code_show
}
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit" value=""></form>''')

[Output: interactive table “Catalog” with columns field, dataset_name, description, status]