Data sources

This page covers where to obtain the raw meteorological input the preprocessor needs, how to authenticate against each source, and the recommended local layout under ~/data/AtmosTransport/met/.

If you only want to run the model and don't care about the preprocessing pipeline, skip to the quickstart bundle at the end: it contains already-preprocessed binaries, downloadable from a GitHub Release.

ERA5 (ECMWF Reanalysis 5)

ERA5 is the primary spectral source for the lat-lon (LL) and cubed-sphere (CS) spectral preprocessing path. Authoritative source: the Copernicus Climate Data Store (CDS).

What you need per day

| File | Variable | Format |
| --- | --- | --- |
| era5_spectral_YYYYMMDD_lnsp.gb | log surface pressure (LNSP) spectral coefficients | GRIB |
| era5_spectral_YYYYMMDD_vo_d.gb | vorticity + divergence (VO + D) spectral coefficients | GRIB |
| era5_thermo_ml_YYYYMMDD.nc | model-level specific humidity (q), temperature, etc. | NetCDF |

The thermo file is mandatory for mass_basis = "dry" (the runtime default). All three files are model-level (137 levels, hybrid σ-p).

Credentials

Get a free CDS account, then drop your Personal Access Token (PAT) in ~/.cdsapirc:

```text
url: https://cds.climate.copernicus.eu/api
key: <YOUR-PAT>
```

CDS migrated to single PAT-style keys in September 2024; the legacy <UID>:<API-KEY> format is no longer accepted. Get your PAT from your CDS account profile page once you're logged in. The base_url in config/met_sources/era5.toml matches this. Tools that read the CDS API will pick up the ~/.cdsapirc file automatically.
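A quick sanity check of the credentials file can catch a leftover legacy key before a long download is queued. The sketch below is not part of AtmosTransport or `cdsapi`; the helper names `read_cdsapirc` and `check_cdsapirc` are hypothetical, and the rule "a colon inside the key means legacy format" is an assumption based on the `<UID>:<API-KEY>` shape described above.

```python
from pathlib import Path


def read_cdsapirc(path):
    """Parse 'name: value' lines from a .cdsapirc-style file into a dict."""
    settings = {}
    for line in Path(path).expanduser().read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first colon only, so URLs keep their "https://" part.
        name, _, value = line.partition(":")
        settings[name.strip()] = value.strip()
    return settings


def check_cdsapirc(settings):
    """Return a list of problems; an empty list means the file looks usable."""
    problems = []
    if not settings.get("url"):
        problems.append("missing 'url'")
    key = settings.get("key", "")
    if not key:
        problems.append("missing 'key'")
    elif ":" in key:
        # Assumption: only the legacy <UID>:<API-KEY> form contains a colon.
        problems.append("key looks like the legacy <UID>:<API-KEY> format")
    return problems
```

Calling `check_cdsapirc(read_cdsapirc("~/.cdsapirc"))` returns an empty list when both entries are present and the key has no colon.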

Datasets

| CDS dataset name | Use |
| --- | --- |
| reanalysis-era5-complete | Model-level spectral fields (VO, D, LNSP); the AtmosTransport spectral preprocessor input |
| reanalysis-era5-single-levels | Surface fields (PS, 2T, 10U, 10V, …) |
| reanalysis-era5-pressure-levels | Pressure-level diagnostics (not used by the preprocessor) |

A reference download script lives outside the repository (the historical practice is per-user cdsapi calls); see config/met_sources/era5.toml for the canonical descriptor.
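For orientation, a per-user `cdsapi` call for the spectral files might look like the sketch below. It is not the project's download script. The request keys follow the usual MARS conventions for `reanalysis-era5-complete`, and the parameter IDs (152 = LNSP, 138 = VO, 155 = D), the hourly time range, and the omission of a `grid` key (to keep the fields in spectral representation) are assumptions to check against config/met_sources/era5.toml and the CDS documentation. The function names are hypothetical.

```python
def spectral_requests(day):
    """Build MARS-style requests for one day (YYYY-MM-DD) of ERA5 spectral fields.

    Returns a dict mapping the target file name to its request.
    """
    common = {
        "class": "ea",
        "expver": "1",
        "stream": "oper",
        "type": "an",
        "levtype": "ml",
        "date": day,
        "time": "00/to/23/by/1",  # assumed hourly cadence
    }
    stamp = day.replace("-", "")
    return {
        # LNSP is archived on model level 1 only (assumed param ID 152).
        f"era5_spectral_{stamp}_lnsp.gb": {**common, "param": "152", "levelist": "1"},
        # Vorticity (138) and divergence (155) on all 137 model levels.
        f"era5_spectral_{stamp}_vo_d.gb": {**common, "param": "138/155", "levelist": "1/to/137"},
    }


def download_day(day, out_dir):
    """Submit the requests; needs the third-party cdsapi package and ~/.cdsapirc."""
    import cdsapi  # imported here so the request builder works without it

    client = cdsapi.Client()
    for filename, request in spectral_requests(day).items():
        client.retrieve("reanalysis-era5-complete", request, f"{out_dir}/{filename}")
```

Keeping the request builder separate from the network call makes it possible to inspect or log the requests before anything is submitted to the CDS queue.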

Recommended local layout:

~/data/AtmosTransport/met/era5/
└── 0.5x0.5/
    ├── spectral_hourly/                    # CDS reanalysis-era5-complete output
    │   ├── era5_spectral_20211201_lnsp.gb
    │   ├── era5_spectral_20211201_vo_d.gb
    │   └── …
    └── physics/
        └── era5_thermo_ml_20211201.nc      # CDS reanalysis-era5-complete with q

The spectral_dir and thermo_dir keys in the TOML schema point at these two directories (spectral_hourly/ and physics/, respectively).
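Before starting a preprocessing run it can help to confirm that all three ERA5 files exist for each day. The following is a minimal sketch that assumes the layout shown above; `era5_day_files` and `missing_files` are hypothetical helpers, not functions provided by the preprocessor.

```python
from datetime import date
from pathlib import Path


def era5_day_files(root, day):
    """Expected ERA5 inputs for one day under the layout shown above."""
    stamp = day.strftime("%Y%m%d")
    spectral_dir = Path(root).expanduser() / "0.5x0.5" / "spectral_hourly"
    thermo_dir = Path(root).expanduser() / "0.5x0.5" / "physics"
    return [
        spectral_dir / f"era5_spectral_{stamp}_lnsp.gb",
        spectral_dir / f"era5_spectral_{stamp}_vo_d.gb",
        thermo_dir / f"era5_thermo_ml_{stamp}.nc",
    ]


def missing_files(root, day):
    """Return the expected files that are not present on disk."""
    return [path for path in era5_day_files(root, day) if not path.is_file()]
```

For example, `missing_files("~/data/AtmosTransport/met/era5", date(2021, 12, 1))` returns an empty list once the day is complete.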

GEOS-IT (NASA GMAO GEOS for Instrument Teams)

GEOS-IT is the primary native cubed-sphere source. C180 (~50 km) is the production/debug resolution. GEOS-FP native C720 hourly CTM files are wired through the same source contract, with optional 0.25° surface/convection fallback files attached into the preprocessed binary.

Per-day file set

For each YYYYMMDD:

| File | Cadence | Variables |
| --- | --- | --- |
| GEOSIT.YYYYMMDD.CTM_A1.C180.nc | hourly (window-averaged) | MFXC, MFYC, DELP |
| GEOSIT.YYYYMMDD.CTM_I1.C180.nc | hourly (instantaneous) | PS, QV |
| GEOSIT.YYYYMMDD.A1.C180.nc | hourly | PBLH, USTAR, HFLUX, T2M (only with include_surface) |
| GEOSIT.YYYYMMDD.A3mstE.C180.nc | 3-hourly | CMFMC (only with include_convection) |
| GEOSIT.YYYYMMDD.A3dyn.C180.nc | 3-hourly | DTRAIN (only with include_convection); U, V (with include_vdiff_fields) |
| GEOSIT.YYYYMMDD.I3.C180.nc | 3-hourly | T (only with include_vdiff_fields) |

The GCHP Holtslag-Boville VDIFF data contract additionally requires include_vdiff_fields = true in the preprocessing TOML and the config/downloads/geosit_c180_gchp_vdiff.toml download recipe.

The preprocessor needs next-day hour 0 for the last window's forward-flux endpoint; download [start, end+1] for production runs.
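The per-day file set and the [start, end+1] rule can be combined into a download plan. The sketch below only builds file names from the table above; the function names are hypothetical, and it simplifies by listing the full file set for the extra day, even though only its hour 0 is needed.

```python
from datetime import date, timedelta


def geosit_collections(include_surface=False, include_convection=False,
                       include_vdiff_fields=False):
    """Collections needed per day for the given preprocessing flags."""
    collections = ["CTM_A1", "CTM_I1"]  # always required
    if include_surface:
        collections.append("A1")
    if include_convection:
        collections.append("A3mstE")
    if include_convection or include_vdiff_fields:
        collections.append("A3dyn")  # DTRAIN for convection, U/V for VDIFF
    if include_vdiff_fields:
        collections.append("I3")
    return collections


def download_days(start, end):
    """All days in [start, end+1]; the extra day supplies the hour-0 endpoint."""
    n_days = (end - start).days + 1
    return [start + timedelta(days=i) for i in range(n_days + 1)]


def geosit_filenames(start, end, **flags):
    """File names to fetch for a run covering start..end inclusive."""
    return [
        f"GEOSIT.{day:%Y%m%d}.{collection}.C180.nc"
        for day in download_days(start, end)
        for collection in geosit_collections(**flags)
    ]
```

A three-day run from 1 to 3 December 2021 with default flags therefore plans four days of CTM_A1 and CTM_I1 files, eight files in total.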

Access

| Source | URL pattern | Auth |
| --- | --- | --- |
| AWS S3 (primary) | s3://geos-chem/GEOS_C180/GEOS_IT/... | none; public bucket, requester-pays NOT required; use aws s3 cp --no-sign-request |
| WashU HTTP archive (fallback) | http://geoschemdata.wustl.edu/ExtData/GEOS_C180/GEOS_IT/... | none |

The canonical descriptor is config/met_sources/geosit.toml (lines 50-60), including the bucket name and the WashU base URL.
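The primary/fallback arrangement can be sketched as a small "try each source in order" helper. This is an illustration, not the project's downloader: the function names are hypothetical, and because the path below the bucket is elided on this page, the full S3 URI and HTTP URL are taken as caller-supplied arguments rather than constructed here.

```python
import subprocess
import urllib.request


def s3_copy_command(source_uri, destination):
    """argv for an anonymous (unsigned) copy from the public bucket."""
    return ["aws", "s3", "cp", "--no-sign-request", source_uri, destination]


def fetch_with_fallback(fetchers):
    """Try each (label, callable) in order; return the label that succeeded."""
    errors = []
    for label, fetch in fetchers:
        try:
            fetch()
            return label
        except Exception as exc:  # sketch: any failure means "try the next source"
            errors.append(f"{label}: {exc}")
    raise RuntimeError("all sources failed: " + "; ".join(errors))


def fetch_file(s3_uri, http_url, destination):
    """Fetch one file from S3 first, then from the HTTP archive."""
    return fetch_with_fallback([
        ("s3", lambda: subprocess.run(s3_copy_command(s3_uri, destination), check=True)),
        ("washu", lambda: urllib.request.urlretrieve(http_url, destination)),
    ])
```

Passing the fetchers as callables keeps the ordering logic independent of the transport, so a third mirror could be added without touching `fetch_with_fallback`.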

The downloader's canonical layout puts each day's collections under a per-day subdirectory:

~/data/AtmosTransport/met/geosit/
└── C180/
    └── daily/
        └── raw/
            └── 20211201/
                ├── GEOSIT.20211201.CTM_A1.C180.nc
                ├── GEOSIT.20211201.CTM_I1.C180.nc
                ├── GEOSIT.20211201.A3mstE.C180.nc       # if convection
                ├── GEOSIT.20211201.A3dyn.C180.nc        # if convection or VDIFF
                └── GEOSIT.20211201.I3.C180.nc           # if GCHP VDIFF

The [source].root_dir key in the GEOS preprocessing TOML points at the parent directory containing the YYYYMMDD/ per-day folders (~/data/AtmosTransport/met/geosit/C180/daily/raw in this layout). The preprocessor's file resolver also accepts a flat directory of all NetCDFs (no per-day subdirectory); that is the raw_catrine layout the project's own configs have used historically.
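The two accepted layouts amount to checking two candidate paths for each file. The sketch below illustrates the idea only; `resolve_geosit_file` is a hypothetical name, and the search order (per-day folder first, then flat) is an assumption, not the documented behaviour of the preprocessor's resolver.

```python
from pathlib import Path


def resolve_geosit_file(root_dir, stamp, collection):
    """Locate one GEOS-IT file under root_dir in either accepted layout.

    stamp is a YYYYMMDD string; collection is e.g. "CTM_A1".
    """
    name = f"GEOSIT.{stamp}.{collection}.C180.nc"
    root = Path(root_dir).expanduser()
    candidates = (
        root / stamp / name,  # per-day layout: root/YYYYMMDD/<file>
        root / name,          # flat layout: root/<file>
    )
    for candidate in candidates:
        if candidate.is_file():
            return candidate
    raise FileNotFoundError(f"{name} not found under {root} (per-day or flat)")
```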

GEOS-FP, MERRA-2 (status)

GEOS-FP (C720). The active descriptor is native cubed-sphere:

  • config/met_sources/geosfp.toml — the native C720 hourly CTM product (GEOS.fp.asm.tavg_1hr_ctm_c0720_v72.*.nc4).

  • config/downloads/geosfp_c720.toml — the native C720 cubed-sphere download descriptor; the src/Downloads/sources/geosfp.jl downloader pulls from the WashU HTTP archive (NOT the GEOS-IT-style AWS S3 path) into a local directory.

GEOSSettings{:geosfp} opens 24 hourly native CTM files plus the next-day 00Z endpoint. The WashU archive names the hourly averaged files with HH30 timestamps; the resolver also accepts legacy HH00 fixtures. When [source] include_surface = true or include_convection = true, set [source] physics_dir to a directory containing GEOSFP.YYYYMMDD.{A1,A3mstE,A3dyn}.025x03125.nc files (or pre-regridded CS equivalents) and the preprocessor embeds PBLH, USTAR, HFLUX, T2M, CMFMC, and DTRAIN in the transport binary.

MERRA-2. Descriptor stub at config/met_sources/merra2.toml, but MERRA-2 is not an AbstractGEOSSettings preprocessing path — it has no native MFXC/MFYC, so the GEOS shortcut doesn't apply. A MERRA-2 preprocessing path would need its own settings type that derives mass fluxes from U/V/DELP at LL cell centers (closer in shape to the spectral path); not implemented today. NASA Earthdata Login auth would be required for actual data access.

The quickstart bundle

If you don't want to manage raw met data — for a smoke test, a tutorial, or a short benchmark — there's a preprocessed bundle on the data-quickstart-v2 GitHub Release:

| Tarball | Contents |
| --- | --- |
| quickstart_ll_dec2021_v2.tar.gz (~1.0 GB) | LL 72x37 + LL 144x73 transport binaries, F32, Dec 1-3 2021 |
| quickstart_cs_dec2021_v2.tar.gz (~1.9 GB) | CS C24 + CS C90 transport binaries, F32, Dec 1-3 2021 |

Both were built from raw ERA5 spectral data via the preprocessor described in ERA5 spectral path. To download:

```bash
bash scripts/download_quickstart_data.sh ll      # newcomer path
bash scripts/download_quickstart_data.sh cs
bash scripts/download_quickstart_data.sh         # both
```

The quickstart downloader and run configs use $ATMOSTRANSPORT_DATA_ROOT_quickstart, defaulting to ~/data/AtmosTransport_quickstart. Set it before downloading if the bundle should live somewhere else:

```bash
export ATMOSTRANSPORT_DATA_ROOT_quickstart=/scratch/$USER/AtmosTransport_quickstart
bash scripts/download_quickstart_data.sh ll
```
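Helper scripts that need to find the bundle can apply the same lookup. This is a minimal sketch with a hypothetical function name; treating an empty variable the same as an unset one is an assumption about how the downloader behaves.

```python
import os
from pathlib import Path

DEFAULT_QUICKSTART_ROOT = "~/data/AtmosTransport_quickstart"


def quickstart_root(environ=None):
    """Resolve the quickstart data root from the environment, else the default."""
    env = os.environ if environ is None else environ
    # Assumption: an empty value falls back to the default, like ${VAR:-default}.
    raw = env.get("ATMOSTRANSPORT_DATA_ROOT_quickstart") or DEFAULT_QUICKSTART_ROOT
    return Path(raw).expanduser()
```

The optional `environ` argument exists only so the lookup can be exercised with a plain dict instead of the real process environment.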

See Quickstart with example data for the runnable walkthrough.

A note on disk space

For reference, a production preprocessing job needs roughly the following per simulated day (multiply by 30 for a 30-day job):

| Source × target | Raw input | Binary output (F32) |
| --- | --- | --- |
| ERA5 spectral → LL 720×361 (full ERA5 res) | ~6 GB / day raw GRIB+NetCDF | ~5 GB / day binary |
| ERA5 spectral → CS C90 | same raw | ~700 MB / day binary |
| GEOS-IT C180 → CS C180 (passthrough) | ~3 GB / day raw NetCDF | ~3 GB / day binary |

Plan storage accordingly; production run datasets are tens to hundreds of GB.
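The table's approximate per-day figures scale linearly with run length, which the following sketch turns into a rough estimate. The dictionary keys and the function name are hypothetical labels, and the optional extra raw day reflects the next-day endpoint that the GEOS path downloads.

```python
# Approximate (raw GB/day, binary GB/day) figures copied from the table above.
PER_DAY_GB = {
    "era5_to_ll_720x361": (6.0, 5.0),
    "era5_to_cs_c90": (6.0, 0.7),
    "geosit_c180_to_cs_c180": (3.0, 3.0),
}


def storage_gb(path, n_days, extra_endpoint_day=False):
    """Rough (raw_gb, binary_gb) totals for an n_days preprocessing job."""
    raw_per_day, binary_per_day = PER_DAY_GB[path]
    # An extra raw day may be needed for the last window's endpoint.
    raw_days = n_days + (1 if extra_endpoint_day else 0)
    return raw_per_day * raw_days, binary_per_day * n_days
```

With these figures, a 30-day ERA5 to LL 720×361 job comes to about 180 GB of raw input plus 150 GB of binaries.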