Meteorological Data: Preprocessing and Data Sources

New users: For step-by-step instructions on downloading data, preprocessing, and running the model, see the Quick Start Guide (docs/QUICKSTART.md in the repository root). This document is a deep-dive reference on met data formats, vertical coordinates, and the TM5 comparison.

This document explains how meteorological data flows into both TM5 and our Julia model, what preprocessing each requires, and why TM5's approach differs fundamentally from a simple gridpoint wind-based method.

CRITICAL: Model Levels Only — Never Pressure Levels

ERA5 pressure-level data (e.g. era5_pl_*.nc on fixed hPa levels) CANNOT be used for mass-flux transport.

The transport model requires model-level data (levtype: ml) on the native hybrid sigma-pressure grid, with lnsp (log surface pressure, param 152) for per-column pressure reconstruction:

Δp[i,j,k] = (A[k+1] - A[k]) + (B[k+1] - B[k]) × exp(lnsp[i,j])

Pressure-level data interpolates to fixed isobaric surfaces, which destroys:

  1. Terrain-following coordinate — mountains are cut off or extrapolated below the surface, creating fictitious mass in sub-surface cells
  2. Column mass consistencysum(Δp) no longer equals ps - p_top, so the continuity equation cannot be satisfied
  3. Vertical mass flux closurecm is derived from horizontal divergence and the B-ratio; this requires Δp from hybrid A/B coefficients, not fixed pressure differences

Use scripts/download_era5_model_levels.jl which requests levtype: ml with params 131 (U), 132 (V), 135 (omega), 152 (LNSP). The resulting files contain variables u, v, w, lnsp on model_level dimensions.

Overview

AspectTM5Julia AtmosTransport
Advection inputMass fluxes (kg/s) through cell facesWind velocities (m/s) at cell centers
Mass conservationGuaranteed by spectral integration (Bregman et al. 2003)Depends on advection scheme discretization
Vertical coordinateHybrid sigma-pressure (A/B coefficients)Hybrid sigma-pressure (A/B coefficients)
Convective fluxesRead from ECMWF archive (eu, ed, du, dd)Read from met data if available
Met formatPreprocessed NetCDF/HDF with specific namingStandard NetCDF (ERA5, GEOS-FP, MERRA-2)

TM5 Meteo Preprocessing

Why TM5 needs its own preprocessor

TM5 uses mass fluxes (kg/s through cell faces), not wind velocities, for tracer advection. Computing mass fluxes from ECMWF wind fields requires working in spherical harmonic (spectral) space to guarantee that the resulting mass fluxes satisfy the continuity equation exactly. A naive gridpoint computation (multiply u by dp/g and cell width) introduces mass imbalances that grow over time.

The foundational paper is:

Bregman, B., Segers, A., Krol, M., Meijer, E., and van Velthoven, P.: "On the use of mass-conserving wind fields in chemistry-transport models", Atmos. Chem. Phys., 3, 447-457, 2003. https://acp.copernicus.org/articles/3/447/2003/

TM5 is its own preprocessor

The preprocessing code lives inside TM5 itself, in the TMM (Transport Model Meteo) module:

  • base/src/tmm.F90: Core mass flux computation
    • tmm_Read_MFUV (~line 1944): horizontal mass fluxes from spectral U/V
    • tmm_Read_MFW (~line 2234): vertical mass flux from spectral divergence
  • base/src/tmm_mf_ecmwf_tmpp.F90: Reads raw ECMWF GRIB files
  • base/src/file_grib.F90: GRIB I/O using GRIBEX library
  • base/src/grid_interpol.F90: Spectral-to-gridpoint transforms

The mass flux formulas (computed in spectral space):

mfu = (R/g) * integral[ U * (dA + dB * exp(LNSP)) / cos(lat) * dlat ]
mfv = (R/g) * integral[ V * (dA + dB * exp(LNSP)) * dlon ]
mfw = derived from spectral divergence + surface pressure tendency

where A, B are hybrid level coefficients, LNSP is log surface pressure, R is Earth's radius, and g is gravitational acceleration. The integrals are evaluated analytically in spectral space.

Preprocessing mode

When TM5 is configured with tmm.output : T, it:

  1. Reads raw ECMWF GRIB data (spectral vorticity, divergence, LNSP)
  2. Computes mass-conserving mass fluxes via spectral integration
  3. Writes preprocessed files in TM5's NetCDF format
  4. Subsequent runs read from this archive with tmm.output : F

Required ECMWF fields (spectral GRIB)

TM5's mass flux computation requires these ERA5 fields in spectral form (spherical harmonic coefficients), not gridpoint:

FieldGRIB paramDescriptionForm
VO138VorticitySpectral
D155DivergenceSpectral
LNSP152Log surface pressureSpectral
T130TemperatureSpectral or gridpoint
Q133Specific humidityGridpoint
CLWC246Cloud liquid waterGridpoint
CIWC247Cloud ice waterGridpoint
CC248Cloud coverGridpoint

Convective fields (gridpoint, from ECMWF forecast):

FieldGRIB paramDescription
MUMF20 (table 162)Updraft mass flux
MDMF21 (table 162)Downdraft mass flux
EU44Updraft entrainment
ED45Updraft detrainment
DU46Downdraft entrainment
DD47Downdraft detrainment

Surface fields (gridpoint):

FieldGRIB paramDescription
SP134Surface pressure
Z129Orography (geopotential)
LSM172Land-sea mask
BLH159Boundary layer height
10U16510m U-wind
10V16610m V-wind
2T1672m temperature
2D1682m dewpoint
SSHF146Surface sensible heat flux
SLHF147Surface latent heat flux

TM5 preprocessed file format

After preprocessing, TM5 stores files as NetCDF with:

  • Directory structure: ec-ea-fc012up2tr3-ml137-glb100x100/YYYY/MM/
  • File naming: <paramkey>_YYYYMMDD_<tres>.nc (e.g. mfuv_20250201_00p03.nc)
  • Variable groups: sp, mfuv, mfw, t, q, cld, convec
  • Dimensions: lon, lat, lev, time
  • Hybrid coefficients: ap, b (and bounds) for vertical coordinate
  • Surface pressure: ps included in every 3D file
  • Time: 3-hourly (_00p03 = 00:00, 03:00, 06:00, ...)
  • CF conventions: CF-1.4 with standard names

The I/O module is base/src/tmm_mf_tm5_nc.F90 (and the ERA5-specific variant in proj/era5_met/src/tmm_mf_tm5_nc.F90).

Prerequisites for running TM5 preprocessing

  1. ecCodes (or legacy GRIBEX) with Fortran bindings, compiled with ifx
  2. TM5 built with with_tmm_ecmwf and with_grib macros enabled
  3. ERA5 data downloaded as GRIB with spectral coefficients (not gridpoint NetCDF)
  4. CDS API access configured (~/.cdsapirc)

See TM5LOCALSETUP.md for build instructions.

Julia Model Meteorological Input

Design Principle: Universal Hybrid Grid

All met data sources run on the same hybrid sigma-pressure vertical coordinate. The transport model never operates on raw pressure levels — every source provides A/B coefficients that define terrain-following levels:

p(k) = A[k] + B[k] × p_surface

This is a fundamental design decision:

  • ERA5 (137 levels): spectral-harmonic native grid → distributed on regular lat-lon. A/B coefficients from config/era5_L137_coefficients.toml.
  • GEOS-FP (72 levels): cubed-sphere internally → output on regular lat-lon. A/B from config/geos_L72_coefficients.toml.
  • MERRA-2 (72 levels): same vertical coordinate as GEOS-FP, different horizontal resolution. Same geos_L72_coefficients.toml.
  • Future sources: any met data that provides horizontal winds, surface pressure, and hybrid A/B coefficients can be added by writing a new TOML config and (if needed) a coefficient file. No Julia code changes required.

The HybridSigmaPressure type provides the universal abstraction. A vertical coordinate is constructed from any met source config with:

config = default_met_config("geosfp")
vc = build_vertical_coordinate(config)  # 72-level HybridSigmaPressure

This means the transport model sees the same type regardless of whether the underlying data came from ECMWF spectral harmonics, NASA's cubed-sphere model, or any future gridded dataset. The physics code (advection, convection, diffusion) only depends on the HybridSigmaPressure interface.

Supported data sources

SourceConfig fileResolutionLevelsCoefficientsAccess
ERA5config/met_sources/era5.toml0.25-2 deg137 modelera5_L137_coefficients.tomlCDS API (~/.cdsapirc)
GEOS-FP (lat-lon)config/met_sources/geosfp.toml0.25 deg72 hybridgeos_L72_coefficients.tomlOPeNDAP (no auth)
GEOS-FP (native CS)N/A (direct reader)C720 (~12.5 km)72 hybridgeos_L72_coefficients.tomlWashU archive (no auth)
MERRA-2config/met_sources/merra2.toml0.5x0.625 deg72 hybridgeos_L72_coefficients.tomlNASA Earthdata (~/.netrc)

Processing pipeline

  1. Read (src/IO/met_reader.jl): Load NetCDF variables, apply unit conversions
  2. Build vertical coordinate: build_vertical_coordinate(config) loads A/B coefficients and constructs HybridSigmaPressure
  3. Stagger (src/IO/met_fields_bridge.jl): Interpolate cell-center winds to faces
    • u -> x-faces (Nx+1, Ny, Nz)
    • v -> y-faces (Nx, Ny+1, Nz)
    • omega -> w at z-interfaces (Nx, Ny, Nz+1), sign flip (omega>0 downward, w>0 upward)
  4. Convection: Read conv_mass_flux_up/down if available, else convection is a no-op
  5. Diffusivity: Read kh (eddy diffusivity) if available

Download scripts

ScriptSourceOutput location
scripts/download_era5_week.jlERA5 (CDS API)~/data/metDrivers/era5/
scripts/download_era5_model_levels.jlERA5 model levels (CDS API)~/data/metDrivers/era5/
scripts/download_geosfp_week.jlGEOS-FP lat-lon (OPeNDAP)~/data/metDrivers/geosfp/
scripts/download_geosfp_cs_massflux.jlGEOS-FP native C720 cubed-sphere (WashU)~/data/geosfp_cs/
scripts/download_test_data.jlAll three~/data/metDrivers/*/test/

Current status

  • ERA5 model-level data (L137) supported via scripts/download_era5_model_levels.jl
  • A/B coefficients available for both ERA5 (137 levels) and GEOS/MERRA (72 levels)
  • Universal build_vertical_coordinate() works for all met sources
  • Native GEOS-FP C720 cubed-sphere mass flux reader implemented (src/IO/geosfp_cubed_sphere_reader.jl)
  • No convective mass flux variables in the ERA5 config yet
  • No spectral processing — uses gridpoint winds directly (TM5 comparison ongoing)

GEOS-FP Native Cubed-Sphere Mass Fluxes

Data source

Native cubed-sphere mass fluxes from GEOS-FP are archived by the GEOS-Chem Support Team at Washington University in St. Louis:

  • URL: http://geoschemdata.wustl.edu/ExtData/GEOS_C720/
  • Reference: Martin et al. (2022), GMD 15, 8731-8748
  • Access: HTTP, no authentication required
  • Also available via: Globus

This data is NOT available on the NASA NCCS portal — the NCCS portal only serves lat-lon interpolated GEOS-FP products. The native cubed-sphere data is extracted from GMAO's internal FP system and distributed via the WashU archive.

Collections

DirectoryFile patternSize/fileContents
GEOS_FP_Native/tavg_1hr_ctm_c0720_v72~2.7 GBMFXC, MFYC, DELP, PS, CX, CY
GEOS_FP_DerivedWind/tavg_1hr_ctmwind_c0720_v72~1.7 GBDerived U/V winds from mass fluxes
GEOS_FP_Raw/Same as NativeSameMirror of Native

For mass-flux transport, use GEOS_FP_Native (tavg_1hr_ctm collection).

File format

GEOS.fp.asm.tavg_1hr_ctm_c0720_v72.YYYYMMDD_HHMM.V01.nc4

Timestamps at HH30 (center of hour-long averaging window), 24 files per day.

Dimensions: (Xdim=720, Ydim=720, nf=6, lev=72, time=1)

Note: nf (panel/face index) comes before lev (vertical) in the dimension ordering — not after as might be expected.

Variables:

VariableDimensionsUnitsDescription
MFXC(720,720,6,72)Pa⋅m²Pressure-weighted accumulated eastward mass flux
MFYC(720,720,6,72)Pa⋅m²Pressure-weighted accumulated northward mass flux
DELP(720,720,6,72)PaPressure thickness
PS(720,720,6)PaSurface pressure
CX(720,720,6,72)Eastward accumulated Courant number
CY(720,720,6,72)Northward accumulated Courant number
lons(720,720,6)degreesCell-center longitudes
lats(720,720,6)degreesCell-center latitudes
corner_lons(721,721,6)degreesCell-corner longitudes
corner_lats(721,721,6)degreesCell-corner latitudes

C-grid convention: MFXC and MFYC use the same (Nc, Nc) indexing as cell-center variables. MFXC(i,j) is the flux through the east face of cell (i,j); MFYC(i,j) is the flux through the north face.

Unit conversion to kg/s: divide by g × Δt_met where g = 9.80665 m/s² and Δt_met = 3600 s (1-hour accumulation interval).

Data volume

Period# FilesTotal Size
1 day24~65 GB
1 month720~1.9 TB
1 year8,760~23 TB

Known gaps: June 16-18, 2024 are missing from the archive.

Also available: GEOS-IT (C180)

For longer reanalysis periods, GEOS-IT (the successor to MERRA-2) provides C180 (~50 km) cubed-sphere data from 1998 to present:

http://geoschemdata.wustl.edu/ExtData/GEOS_C180/GEOS_IT_Native/

GEOS-IT uses the same file format and variable names as GEOS-FP.

Validation Strategy: TM5 vs Julia Model

Using the same ERA5 data

For a fair comparison, both models must process the same underlying ERA5 meteorology. The pipeline is:

ERA5 (ECMWF archive)
    |
    +--> Spectral GRIB -----> TM5 preprocessing (tmm.F90) -----> TM5 forward run
    |                              mass-conserving mass fluxes
    |
    +--> Gridpoint NetCDF ---> Julia model (met_reader.jl) -----> Julia forward run
                                   direct wind-based advection

Both runs use the same ERA5 fields for the same time period. The comparison then tests whether the Julia model's gridpoint approach gives comparable tracer transport to TM5's spectral mass-flux approach.

What the comparison tests

  • Advection accuracy: Gridpoint winds vs spectral mass fluxes
  • Mass conservation: TM5 guarantees it; Julia model should be close
  • Convection: Both use ECMWF-derived convective mass fluxes (once enabled in Julia)
  • Overall transport: Column-integrated tracers, zonal means, time series

GEOS-FP validation (separate)

TM5 has no GEOS-FP reader. For GEOS-FP validation, compare the Julia model against:

  • GEOS-Chem (which natively uses GEOS-FP)
  • Observations (surface stations, satellite retrievals)
  • The Julia model already has a working GEOS-FP reader

Mass fluxes vs. winds — lessons from GCHP v13

Martin et al. (2022, GMD 15, 8731-8748) demonstrate that directly using air mass fluxes instead of inferring them from archived winds significantly reduces transport errors in offline models. Key findings relevant to our model:

  1. Surface pressure tendency error: Using winds gives 15× larger error than using mass fluxes directly (MAE 15 Pa vs 1 Pa for a 5-min timestep).

  2. Restaggering error: Even on the native cubed-sphere grid, converting between A-grid (cell-center winds) and C-grid (face mass fluxes) weakens vertical advection by ~8% (regression slope = 0.92). Our stagger_velocities function performs exactly this operation.

  3. Regridding + restaggering: The traditional GEOS-FP pipeline (C720 cubed-sphere → 0.25° lat-lon winds → regrid to model grid → restagger) introduces compounding errors that dominate vertical transport, especially in the stratosphere where vertical motion is weak.

  4. Native cubed-sphere mass fluxes: Since March 2021, GEOS-FP operationally archives hourly C720 cubed-sphere mass fluxes. GEOS-IT provides C180 archives for 1998-present. These avoid all regridding/restaggering errors.

Implications for our model:

  • Our current wind-based approach (cell-center → face staggering) introduces a known ~1% mass conservation error, consistent with our observed 0.9% drift
  • Supporting direct mass flux ingestion from native cubed-sphere GEOS-FP/GEOS-IT archives would eliminate this error source entirely
  • The CubedSphereGrid implementation uses the same gnomonic projection as FV3, enabling direct use of native GEOS-FP mass flux archives in the future

CO2 Surface Flux Components

The model supports multiple CO2 flux components, following the ECMWF IFS/CHE nature run framework. Each component can be loaded and applied independently or combined via CompositeEmission.

Comparison with ECMWF IFS and CarbonTracker

ComponentECMWF IFS (CY49R1)CarbonTracker CT-NRTOur Model
AnthropogenicCAMS-GLOB-ANT v6.2Miller fossil fuelEDGAR v8.0
BiosphereECLand (online)CASA + optimizationCT-NRT biofluxopt
OceanJena CarboScope v2020OIF ocean prior + optJena CarboScope oc_v2024
FireGFAS v1.4GFED4.1sCT-NRT firefluximp or GFAS

Available flux datasets

SourceReaderResolutionTemporalUnits (native)Download Script
EDGAR v8.0load_edgar_co20.1°AnnualTonnes/yr(manual)
CT-NRT v2025-1load_carbontracker_fluxes1×1°3-hourlymol/m²/sdownload_carbontracker_fluxes.jl
Jena CarboScopeload_jena_ocean_flux1×1°DailyPgC/yr/celldownload_ocean_flux.jl
CAMS GFAS v1.4load_gfas_fire_flux0.1°Dailykg/m²/sdownload_gfas_fire.jl

Tagged tracers (future)

The IFS CHE nature run uses tagged CO2 tracers (GRIB table 210) to track atmospheric enhancement by source:

ParameterDescription
210061Total CO2 mass mixing ratio
68.210Natural biosphere CO2 flux
69.210Anthropogenic CO2 emissions
64.210CO2 column-mean molar fraction

Our model can replicate this by adding separate tracers (co2_anthro, co2_bio, co2_ocean, co2_fire) transported independently, with total CO2 = sum of all components. This is implemented as a stretch goal.

Monthly-Sharded Mass Flux Preprocessing

For year-long simulations, mass flux files are sharded by month:

~/data/metDrivers/era5/era5_ml_1deg_20240101_20241231/
    massflux_era5_202401_float32.nc   (~5-12 GB)
    massflux_era5_202401_float32.bin  (mmap-ready binary)
    massflux_era5_202402_float32.nc
    ...
    massflux_era5_202412_float32.nc

The forward model (run.jl with a preprocessed config) accepts the preprocessed directory via the TOML configuration and chains through monthly shards automatically.

Data size estimates (1-degree, L88, full year)

ComponentPer MonthFull Year
ERA5 model-level downloads~3-5 GB~40-60 GB
Mass flux NetCDF (deflate=1)~5-12 GB~60-100 GB
Mass flux binary (mmap)~11 GB~135 GB
CT-NRT fluxes (1x1, 3-hourly)~60 MB~730 MB
GFAS fire (0.1, daily)~150 MB~2 GB
Jena CarboScope ocean~100 MB (single file)
EDGAR v8.0~26 MB

Total storage for a full-year 1-degree simulation: ~200-300 GB.

Data on disk

Current meteorological data available:

DatasetLocationSizeDetails
ERA5 model levels (June 1-7)~/data/metDrivers/era5/era5_ml_10deg_20240601_20240607/~1.1 GB1-deg, L88 (levels 50-137), 4x daily, u/v/w/lnsp
ERA5 model levels (June 8-30)~/data/metDrivers/era5/era5_ml_10deg_20240608_20240630/~3.6 GB1-deg, L88 (levels 50-137), 4x daily, u/v/w/lnsp
ERA5 full year 2024~/data/metDrivers/era5/era5_ml_10deg_20240101_20241231/~31 GB (partial)1-deg, downloading via CDS API
GEOS-FP lat-lon week~/data/metDrivers/geosfp/654 MB2025-02-01 to 07, ~4x5 deg
GEOS-FP native C720 June~/data/geosfp_cs/YYYYMMDD/~1.9 TB (downloading)C720 cubed-sphere, hourly, from WashU
EDGAR v8.0 CO2~/data/emissions/edgar_v8/~26 MB2022 annual totals, 0.1 deg
TM5 meteo archive(not yet created)Requires preprocessing step above

Deleted (unusable):

  • era5/era5_025deg_20240601_20240831/ — 0.25-deg ERA5 on fixed pressure levels (not model levels). Cannot be used for mass-flux transport. See "Critical: Model Levels Only" above.

Data server: NFS on curry.gps.caltech.edu, 7 TB volume, ~4.8 TB available.

References

  • Bregman et al. (2003): Mass-conserving wind fields. ACP 3, 447-457.
  • Krol et al. (2005): TM5 algorithm. ACP 5, 417-432.
  • Williams et al. (2017): TM5-MP description. GMD 10, 721.
  • Huijnen et al. (2010): TM5 tropospheric chemistry. GMD 3, 445.
  • Martin et al. (2022): GCHP v13 — improved advection, native CS archives. GMD 15, 8731-8748. https://doi.org/10.5194/gmd-15-8731-2022
  • Putman & Lin (2007): FV3 cubed-sphere advection. JCP 227, 55-78.
  • Agusti-Panareda et al. (2023): CAMS GHG production system. GMD.
  • Rödenbeck et al. (2013): Jena CarboScope ocean CO2 fluxes. Ocean Sci.
  • Agustí-Panareda et al. (2019): CHE global nature run. Sci. Data 9, 1-16. https://doi.org/10.1038/s41597-022-01228-2
  • TM5 Wiki: https://sourceforge.net/p/tm5/wiki/Meteo/