How to read data from STAC#

This notebook shows how to read the data in from a STAC asset using xarray and a little hidden helper library called xpystac.

tl;dr#

For any PySTAC asset that can be represented as an xarray dataset you can read the data using the following command:

xr.open_dataset(asset)

If you want to load multiple assets from the same item and/or many items at once use:

odc.stac.load([items])

Dependencies#

There are lots of optional dependencies depending on where and how the data you are interested in are stored. Here are some of the libraries that you will probably need:

  • dask - to delay data loading until access

  • pystac - STAC object structures

  • xarray, rioxarray - data structures

  • xpystac, odc-stac - helper for loading pystac into xarray objects

Despite all these install instructions, the import block is very straightforward

[1]:
import odc.stac
import planetary_computer
import rioxarray
import xarray as xr

import pystac

Examples#

Here are a few examples of the different types of objects that you can open in xarray.

COGs#

Read all the data from the COGs referenced by the assets on an item.

[2]:
landsat_item = pystac.Item.from_file(
    "https://planetarycomputer.microsoft.com/api/stac/v1/collections/landsat-c2-l2/items/LC09_L2SP_088084_20230408_02_T2",
)

ds = odc.stac.load(
    [landsat_item], chunks={"x": 1024, "y": 1024}, patch_url=planetary_computer.sign
)
ds
[2]:
<xarray.Dataset> Size: 2GB
Dimensions:      (y: 7801, x: 7761, time: 1)
Coordinates:
  * y            (y) float64 62kB -3.713e+06 -3.713e+06 ... -3.947e+06
  * x            (x) float64 62kB 3.774e+05 3.774e+05 ... 6.102e+05 6.102e+05
    spatial_ref  int32 4B 32656
  * time         (time) datetime64[ns] 8B 2023-04-08T23:37:51.630731
Data variables: (12/19)
    qa           (time, y, x) int16 121MB dask.array<chunksize=(1, 1024, 1024), meta=np.ndarray>
    red          (time, y, x) uint16 121MB dask.array<chunksize=(1, 1024, 1024), meta=np.ndarray>
    blue         (time, y, x) uint16 121MB dask.array<chunksize=(1, 1024, 1024), meta=np.ndarray>
    drad         (time, y, x) int16 121MB dask.array<chunksize=(1, 1024, 1024), meta=np.ndarray>
    emis         (time, y, x) int16 121MB dask.array<chunksize=(1, 1024, 1024), meta=np.ndarray>
    emsd         (time, y, x) int16 121MB dask.array<chunksize=(1, 1024, 1024), meta=np.ndarray>
    ...           ...
    swir16       (time, y, x) uint16 121MB dask.array<chunksize=(1, 1024, 1024), meta=np.ndarray>
    swir22       (time, y, x) uint16 121MB dask.array<chunksize=(1, 1024, 1024), meta=np.ndarray>
    coastal      (time, y, x) uint16 121MB dask.array<chunksize=(1, 1024, 1024), meta=np.ndarray>
    qa_pixel     (time, y, x) uint16 121MB dask.array<chunksize=(1, 1024, 1024), meta=np.ndarray>
    qa_radsat    (time, y, x) uint16 121MB dask.array<chunksize=(1, 1024, 1024), meta=np.ndarray>
    qa_aerosol   (time, y, x) uint8 61MB dask.array<chunksize=(1, 1024, 1024), meta=np.ndarray>

Let’s prove that we really can access the data within the COGs.

[3]:
%%time
ds.blue.mean().compute()
CPU times: user 2.68 s, sys: 459 ms, total: 3.14 s
Wall time: 5.31 s
[3]:
<xarray.DataArray 'blue' ()> Size: 8B
array(13940.01841915)
Coordinates:
    spatial_ref  int32 4B 32656

For more control over the individual COGs referenced by the assets, you can grab the href off the asset, sign it and then use it directly.

[4]:
landsat_asset = landsat_item.assets["blue"]
landsat_asset_href = planetary_computer.sign(landsat_asset).href

da = rioxarray.open_rasterio(landsat_asset_href, overview_level=5)
da
[4]:
<xarray.DataArray (band: 1, y: 122, x: 122)> Size: 30kB
[14884 values with dtype=uint16]
Coordinates:
  * band         (band) int64 8B 1
  * x            (x) float64 976B 3.783e+05 3.802e+05 ... 6.074e+05 6.093e+05
  * y            (y) float64 976B -3.714e+06 -3.716e+06 ... -3.946e+06
    spatial_ref  int64 8B 0
Attributes:
    AREA_OR_POINT:  Point
    _FillValue:     0
    scale_factor:   1.0
    add_offset:     0.0
[5]:
%%time
da.mean()
CPU times: user 3.44 ms, sys: 0 ns, total: 3.44 ms
Wall time: 186 ms
[5]:
<xarray.DataArray ()> Size: 8B
array(14472.21996775)
Coordinates:
    spatial_ref  int64 8B 0

Zarr#

Read from an asset that references data stored in zarr

[6]:
daymet_collection = pystac.Collection.from_file(
    "https://planetarycomputer.microsoft.com/api/stac/v1/collections/daymet-daily-hi"
)
daymet_asset = daymet_collection.assets["zarr-abfs"]
daymet_asset_signed = planetary_computer.sign(daymet_asset)

ds = xr.open_dataset(daymet_asset_signed, chunks={})
ds
[6]:
<xarray.Dataset> Size: 69GB
Dimensions:                  (time: 14965, y: 584, x: 284, nv: 2)
Coordinates:
    lat                      (y, x) float32 663kB dask.array<chunksize=(584, 284), meta=np.ndarray>
    lon                      (y, x) float32 663kB dask.array<chunksize=(584, 284), meta=np.ndarray>
  * time                     (time) datetime64[ns] 120kB 1980-01-01T12:00:00 ...
  * x                        (x) float32 1kB -5.802e+06 ... -5.519e+06
  * y                        (y) float32 2kB -3.9e+04 -4e+04 ... -6.22e+05
Dimensions without coordinates: nv
Data variables:
    dayl                     (time, y, x) float32 10GB dask.array<chunksize=(365, 584, 284), meta=np.ndarray>
    lambert_conformal_conic  int16 2B ...
    prcp                     (time, y, x) float32 10GB dask.array<chunksize=(365, 584, 284), meta=np.ndarray>
    srad                     (time, y, x) float32 10GB dask.array<chunksize=(365, 584, 284), meta=np.ndarray>
    swe                      (time, y, x) float32 10GB dask.array<chunksize=(365, 584, 284), meta=np.ndarray>
    time_bnds                (time, nv) datetime64[ns] 239kB dask.array<chunksize=(365, 2), meta=np.ndarray>
    tmax                     (time, y, x) float32 10GB dask.array<chunksize=(365, 584, 284), meta=np.ndarray>
    tmin                     (time, y, x) float32 10GB dask.array<chunksize=(365, 584, 284), meta=np.ndarray>
    vp                       (time, y, x) float32 10GB dask.array<chunksize=(365, 584, 284), meta=np.ndarray>
    yearday                  (time) int16 30kB dask.array<chunksize=(365,), meta=np.ndarray>
Attributes:
    Conventions:       CF-1.6
    Version_data:      Daymet Data Version 4.0
    Version_software:  Daymet Software Version 4.0
    citation:          Please see http://daymet.ornl.gov/ for current Daymet ...
    references:        Please see http://daymet.ornl.gov/ for current informa...
    source:            Daymet Software Version 4.0
    start_year:        1980

Reference file#

If the collection has a reference file we can use that.

This will not work for kerchunk>=0.2.8 and xpystac<=0.2.0. Track the xpystac ticket here: xpystac #48

[7]:
cmip6_collection = pystac.Collection.from_file(
    "https://planetarycomputer.microsoft.com/api/stac/v1/collections/nasa-nex-gddp-cmip6"
)
cmip6_asset = cmip6_collection.assets["ACCESS-CM2.historical"]
cmip6_asset_signed = planetary_computer.sign(cmip6_asset)

ds = xr.open_dataset(cmip6_asset_signed, chunks={}, patch_url=planetary_computer.sign)
ds
[7]:
<xarray.Dataset> Size: 738GB
Dimensions:  (time: 23741, lat: 600, lon: 1440)
Coordinates:
  * lat      (lat) float64 5kB -59.88 -59.62 -59.38 -59.12 ... 89.38 89.62 89.88
  * lon      (lon) float64 12kB 0.125 0.375 0.625 0.875 ... 359.4 359.6 359.9
  * time     (time) datetime64[us] 190kB 1950-01-01T12:00:00 ... 2014-12-31T1...
Data variables:
    hurs     (time, lat, lon) float32 82GB dask.array<chunksize=(1, 600, 1440), meta=np.ndarray>
    huss     (time, lat, lon) float32 82GB dask.array<chunksize=(1, 600, 1440), meta=np.ndarray>
    pr       (time, lat, lon) float32 82GB dask.array<chunksize=(1, 600, 1440), meta=np.ndarray>
    rlds     (time, lat, lon) float32 82GB dask.array<chunksize=(1, 600, 1440), meta=np.ndarray>
    rsds     (time, lat, lon) float32 82GB dask.array<chunksize=(1, 600, 1440), meta=np.ndarray>
    sfcWind  (time, lat, lon) float32 82GB dask.array<chunksize=(1, 600, 1440), meta=np.ndarray>
    tas      (time, lat, lon) float32 82GB dask.array<chunksize=(1, 600, 1440), meta=np.ndarray>
    tasmax   (time, lat, lon) float32 82GB dask.array<chunksize=(1, 600, 1440), meta=np.ndarray>
    tasmin   (time, lat, lon) float32 82GB dask.array<chunksize=(1, 600, 1440), meta=np.ndarray>
Attributes: (12/22)
    Conventions:           CF-1.7
    activity:              NEX-GDDP-CMIP6
    cmip6_institution_id:  CSIRO-ARCCSS
    cmip6_license:         CC-BY-SA 4.0
    cmip6_source_id:       ACCESS-CM2
    contact:               Dr. Rama Nemani: rama.nemani@nasa.gov, Dr. Bridget...
    ...                    ...
    scenario:              historical
    source:                BCSD
    title:                 ACCESS-CM2, r1i1p1f1, historical, global downscale...
    tracking_id:           16d27564-470f-41ea-8077-f4cc3efa5bfe
    variant_label:         r1i1p1f1
    version:               1.0