
Adjoints on top of the binary

If you have used TM5-4DVAR for inverse modeling of CO₂ / CH₄ / SF₆, or the GIGC adjoint branch of GEOS-Chem for similar work, the goals are familiar: take a scalar objective (mismatch with observations), seed it at observation time, and walk backwards through the model to obtain a gradient against the control vector (typically surface emissions).

AtmosTransport's adjoint layer is built on the same forward mass-flux binaries described in The binary pipeline. Nothing about the adjoint is a parallel-maintained Fortran fork; the reverse pass walks operator-specific kernels that mirror their forward counterparts and runs on the same GPU backend.

This page lays out the architecture, the present capability surface, and the gap to a full 4D-Var driver.

What works today

The production entry point is cs_surface_emission_footprint on the cubed-sphere. Given a sequence of forward windows (a CubedSphereTransportDriver), an observation specification, and a scheme (today: LinRoodPPMScheme() is the fully-tested path), it returns a CSFootprintResult whose footprints[t] is dJ/dE_t, the sensitivity of the scalar objective to surface emission rate E_t (kg/s) at step t.

Three things this gives you out of the box:

  1. Surface-emission attribution. Given a tower or aircraft observation, find the spatial pattern of upstream surface emissions that contributed.

  2. Bayesian-inverse precondition. Use the footprint matrix as the Jacobian column for a small set of basis functions and run an offline analytic inversion (see src/Inversion/Jacobian.jl).

  3. Multi-observation gradient. Loop over observations and accumulate footprints into a single control-space gradient.
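As a minimal sketch of items 2 and 3, assume each footprint has already been flattened and projected onto a small set of emission basis functions, so that one observation contributes one row of a Jacobian K (a transposed convention works equally well). The matrices, sizes, and the helper `accumulate_gradient` below are hypothetical and are not the API of src/Inversion/Jacobian.jl.

```julia
using LinearAlgebra

# Hypothetical toy sizes: 3 observations, 2 emission basis functions.
# Row i of K is the footprint of observation i projected onto the basis,
# i.e. d(obs_i)/d(scaling factor of basis function j).
K = [1.0 0.5;
     0.2 1.0;
     0.7 0.7]

x_prior = [1.0, 1.0]            # prior scaling factors
x_true  = [1.2, 0.8]            # "truth", used only to synthesize observations
y       = K * x_true            # noise-free synthetic observations

B = Diagonal([0.25, 0.25])      # prior error covariance
R = Diagonal(fill(1e-4, 3))     # observation error covariance

# Standard linear-Gaussian posterior mean:
#   x_post = x_prior + B K' (K B K' + R)^-1 (y - K x_prior)
S      = K * B * K' + R
x_post = x_prior + B * K' * (S \ (y - K * x_prior))

# Multi-observation gradient of J = 1/2 (Kx - y)' R^-1 (Kx - y):
# each observation adds its footprint row times its weighted residual.
function accumulate_gradient(K, R, x, y)
    g = zeros(size(K, 2))
    for i in axes(K, 1)
        residual = dot(K[i, :], x) - y[i]
        g .+= K[i, :] .* (residual / R[i, i])
    end
    return g
end

g = accumulate_gradient(K, R, x_prior, y)
```

With nearly noise-free observations the posterior lands close to `x_true`, and the accumulated gradient equals the matrix expression `K' * (R \ (K * x_prior - y))`.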

A 6-hour C24 backward LinRood footprint at LA from synthetic meteorology ran end-to-end on 2026-05-11; the map lives at artifacts/linrood_la_footprint_c24_6h.png.

Architecture

Forward recording

During the forward pass, every operator that the adjoint will need to step through writes an op-record onto the tape. There are six record families:

| Record | Captured by | What it stores |
| --- | --- | --- |
| _CSSweepRecord | Each X/Y/Z half-sweep | direction, scheme, flux slice references, scale |
| _CSHaloRecord | Halo exchanges between sweeps | source/target panels, edge orientations |
| _CSMidpointRecord | The palindrome midpoint (V/2 → S → V/2) | links to diffusion + emission records |
| _CSDiffusionRecord | Each Thomas solve | Kz, dt, boundary trait |
| _CSConvectionRecord | Each convection apply | forcing fields |
| _CSLinRoodHorizRecord | LinRood horizontal substeps | panel-resolved state snapshots |

Records are appended to a Vector{_CSAllTapeOp}-typed tape (with storage backends :device, :pinned_host, :mmap). Appending adds little overhead to the forward loop: the append is type-stable, and the records hold references to the existing workspace fields rather than copies.
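The idea of a typed tape whose records alias workspace arrays can be sketched in a few lines. The record types and field names below are hypothetical stand-ins, not the real _CS*Record structs.

```julia
# Hypothetical record types; the real record structs are not reproduced here.
struct SweepRecord
    direction::Symbol            # :x, :y or :z
    flux::Vector{Float64}        # reference to a workspace array, not a copy
    scale::Float64
end

struct DiffusionRecord
    kz::Vector{Float64}          # reference to the Kz workspace
    dt::Float64
end

# A small Union element type lets the compiler use union splitting on dispatch.
const TapeOp = Union{SweepRecord, DiffusionRecord}

flux_ws = zeros(4)               # workspace owned by the forward model
kz_ws   = ones(4)

tape = TapeOp[]
push!(tape, SweepRecord(:x, flux_ws, 0.5))
push!(tape, DiffusionRecord(kz_ws, 600.0))
push!(tape, SweepRecord(:x, flux_ws, 0.5))

# The record stores a reference, so no flux data was copied ...
shares_memory = tape[1].flux === flux_ws

# ... which also means a later in-place update is visible through the tape.
flux_ws[1] = 42.0
seen_through_tape = tape[1].flux[1]
```

The last two lines show the flip side of reference-holding records: whatever a record points at must still hold the right values when the reverse pass reads it.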

Checkpointing

Recording the full tape for a 14-day C180 simulation would exhaust GPU memory, so the forward state is stored only at revolve checkpoints. src/Tape/CheckpointSchedule.jl implements the Griewank-Walther revolve schedule with a configurable RAM budget: you pick the number of checkpoints per day, and the schedule decides where to store full state snapshots and where to re-run the forward model from the nearest snapshot.

For TM5-4DVAR users: this is the same trade-off as TM5's nsplit parameter, with the same logarithmic-factor cost in extra forward steps.
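The memory/recompute trade-off can be made concrete with the textbook Griewank-Walther bound: with s snapshots and at most r forward re-runs of any step, up to binomial(s + r, s) steps can be reversed. The sketch below only evaluates that bound; it is not the schedule implemented in src/Tape/CheckpointSchedule.jl, and the function names and step counts are illustrative.

```julia
# Maximum number of time steps reversible with `s` snapshots when each step
# may be re-run forward at most `r` times (Griewank-Walther binomial bound).
max_steps(s::Integer, r::Integer) = binomial(big(s + r), s)

# Smallest repetition count r that covers `nsteps` steps with `s` snapshots.
function min_repetitions(nsteps::Integer, s::Integer)
    s >= 1 || throw(ArgumentError("need at least one snapshot"))
    r = 0
    while max_steps(s, r) < nsteps
        r += 1
    end
    return r
end

# Hypothetical example: a 1000-step window.
r_few  = min_repetitions(1000, 4)    # 4 snapshots  -> r = 10
r_many = min_repetitions(1000, 16)   # 16 snapshots -> r = 4
```

Quadrupling the snapshot count from 4 to 16 cuts the worst-case number of re-runs per step from 10 to 4, which is the kind of trade the RAM budget controls.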

Reverse walk

The reverse pass walks Iterators.reverse(ops) and dispatches on record type. Each adjoint kernel is the transpose of the forward operator's update step:

```julia
# Forward (advection x-sweep, conceptually)
rm_new[i,j,k] = rm_old[i,j,k] - (flux_right - flux_left) * dt/m

# Adjoint (transpose of the update above, with dt/m held fixed)
lambda_old[i,j,k] += lambda_new[i,j,k]
flux_left_adj     += lambda_new[i,j,k] * dt/m
flux_right_adj    -= lambda_new[i,j,k] * dt/m
```

The adjoint kernels live in:

Each adjoint kernel is a KernelAbstractions.@kernel and runs on the same backend (CPU/CUDA/Metal) as the forward pass. The reverse walk is type-stable end-to-end (after a recent typed-tape cleanup); that type stability is the single highest-leverage performance investment in the adjoint layer.
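The forward/adjoint pair above can be checked with an adjoint-identity (dot-product) test. The sketch below does this on the CPU for a 1D periodic first-order upwind sweep, walking a list of recorded ops in reverse. The `XSweep` record, the upwind flux, and the helper names are assumptions made for illustration; they are not the package's kernels.

```julia
using LinearAlgebra, Random

# Record for one x-sweep; `a` holds per-face upwind coefficients (assumed >= 0)
# and `scale` plays the role of dt/m in the update above.
struct XSweep
    a::Vector{Float64}
    scale::Float64
end

wrap(i, n) = mod1(i, n)   # periodic index

# Forward: rm_new[i] = rm_old[i] - (flux_right - flux_left) * scale
function forward(op::XSweep, rm::Vector{Float64})
    n = length(rm)
    flux = [op.a[i] * rm[wrap(i - 1, n)] for i in 1:n]   # left face of cell i
    return [rm[i] - (flux[wrap(i + 1, n)] - flux[i]) * op.scale for i in 1:n]
end

# Adjoint: transpose of `forward`, statement by statement in reverse order.
function adjoint_step(op::XSweep, lambda_new::Vector{Float64})
    n = length(lambda_new)
    lambda_old = copy(lambda_new)          # lambda_old[i] += lambda_new[i]
    flux_adj = zeros(n)
    for i in 1:n
        flux_adj[i]              += lambda_new[i] * op.scale   # flux_left
        flux_adj[wrap(i + 1, n)] -= lambda_new[i] * op.scale   # flux_right
    end
    for i in 1:n
        lambda_old[wrap(i - 1, n)] += op.a[i] * flux_adj[i]    # flux = a * rm
    end
    return lambda_old
end

# The forward pass applies the ops in order; the reverse pass walks them backwards.
run_forward(ops, x) = foldl((state, op) -> forward(op, state), ops; init = x)
run_adjoint(ops, y) = foldl((lam, op) -> adjoint_step(op, lam),
                            Iterators.reverse(ops); init = y)

Random.seed!(1)
n   = 16
ops = [XSweep(0.4 .* rand(n), 0.9) for _ in 1:5]
x, y = randn(n), randn(n)

# Adjoint identity: <M x, y> == <x, M' y> up to round-off.
lhs = dot(run_forward(ops, x), y)
rhs = dot(x, run_adjoint(ops, y))
```

If `lhs` and `rhs` disagree beyond round-off, some adjoint statement is not the exact transpose of its forward counterpart.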

Tape storage backends

The default for production GPU runs is :pinned_host. CUDA-pinned allocation enables overlap between the forward kernels and the device-to-host copy of the tape records; for a 14-day C180 run, the forward overhead of recording is single-digit percent.

For very long runs, :mmap stores the tape on disk; this is the only storage that survives a crash and supports post-hoc inspection with scripts/diagnostics/inspect_tape.jl.
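What disk-backed storage buys can be illustrated with Julia's Mmap standard library: data written through a mapping lands in a file that can be reopened later. The file name and payload below are hypothetical, and the real :mmap backend's on-disk layout is not shown here.

```julia
using Mmap

# Hypothetical tape payload: 1000 Float64 values written by the forward pass.
path = joinpath(mktempdir(), "tape.bin")
n = 1000

# Forward side: create the file, map it, and write through the mapping.
open(path, "w+") do io
    tape = Mmap.mmap(io, Vector{Float64}, (n,))
    tape .= 1:n
    Mmap.sync!(tape)          # flush dirty pages to disk
end

# Later (or in another process): reopen read-only for inspection.
recovered = open(path, "r") do io
    copy(Mmap.mmap(io, Vector{Float64}, (n,)))
end
```

Because the data lives in the file rather than in process memory, a separate inspection script can read it after the original run has exited.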

Mapping to TM5-4DVAR / GIGC adjoint workflows

| Task | TM5-4DVAR | GIGC adjoint | AtmosTransport |
| --- | --- | --- | --- |
| Forward integration | TM5 CTM | GEOS-Chem | run_driven_simulation |
| Tangent linear | Hand-coded | Hand-coded | Not currently exposed |
| Adjoint forcing | TM5 obs operator | gchem_adj | bind_to_mesh + observation IO |
| Observation IO | Instrument-specific files | NetCDF | src/Inversion/ObservationsIO.jl |
| Covariance B^{1/2} | Diagonal + horizontal correlation | Shipped | src/Inversion/Covariance.jl |
| Preconditioning | Hand-coded | Hand-coded | src/Inversion/Preconditioning.jl |
| Optimizer | M1QN3 (limited-memory BFGS) | L-BFGS-B | Pluggable AbstractCSOptimizer |
| Driver | tm5-4dvar.x | gchem_4dvar | Prototype driver under scripts/inversions/ |

What's shipped today:

  • Tape + checkpoint + revolve schedule.

  • Public footprint API (cs_surface_emission_footprint).

  • Observation IO and departures IO.

  • bind_to_mesh for observation co-location.

  • Covariance B^{1/2} construction.

  • Preconditioning and 4D-Var cost/gradient assembly.

  • Pluggable optimizer dispatch.
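How B^{1/2}, preconditioning, and cost/gradient assembly fit together can be sketched on a toy linear problem. Here a small matrix H stands in for "run the forward model and sample at the observations", and H' for the adjoint run. Every name and number is illustrative, and a Cholesky factor is used as one valid square root of B, which may differ from what src/Inversion/Covariance.jl builds.

```julia
using LinearAlgebra

# Toy linear setting; all values are made up for illustration.
H    = [1.0 0.5 0.0;
        0.0 1.0 0.5]
x_b  = [1.0, 1.0, 1.0]                       # prior (background) emissions
y    = [1.8, 1.2]                            # observations
B    = [1.0 0.5 0.0; 0.5 1.0 0.5; 0.0 0.5 1.0] .* 0.04
L    = cholesky(Symmetric(B)).L              # one valid square root: B = L L'
Rinv = Diagonal([100.0, 100.0])              # inverse observation covariance

# Control variable χ with x = x_b + L χ, so the background term becomes 1/2 χ'χ.
function cost_and_gradient(χ)
    x = x_b + L * χ
    d = H * x - y                            # departures
    J = 0.5 * dot(χ, χ) + 0.5 * dot(d, Rinv * d)
    g = χ + L' * (H' * (Rinv * d))           # adjoint forcing, then apply L'
    return J, g
end

# Finite-difference check of the gradient (exact for a quadratic cost,
# up to round-off, when central differences are used).
χ0 = [0.3, -0.2, 0.1]
J0, g0 = cost_and_gradient(χ0)
ε = 1e-6
g_fd = [(cost_and_gradient(χ0 + ε * e)[1] - cost_and_gradient(χ0 - ε * e)[1]) / (2ε)
        for e in eachcol(Matrix(1.0I, 3, 3))]
```

An optimizer only needs `cost_and_gradient`; the change of variable keeps the background term perfectly conditioned regardless of the structure of B.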

What's next:

  • Hardening the TOML-facing inversion driver.

  • Publishing a real-data TM5-4DVAR cross-validation.

  • Completing the remaining physics-adjoint gaps listed below.

Practical knobs

```toml
[adjoint]
enabled = true                 # set false for a forward-only run
tape_storage = "pinned_host"   # or "device" / "mmap"
checkpoint_count_per_day = 4   # revolve checkpoints per day
scheme_override = "linrood7"   # CS scheme for the adjoint run; ORD=7 caveats are listed under "What's not yet adjointed"

[adjoint.objective]
kind = "site_concentration"
site = "MLO"
time_utc = "2021-12-05T12:00:00"
```

The TOML schema for the adjoint side is still experimental; see Validation status for the specific scheme/topology combinations that have a passing adjoint-identity test.
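Since the schema is experimental, it can help to sanity-check a config before launching a long run. The sketch below parses the [adjoint] table shown above with Julia's TOML standard library; the validation rules are assumptions for illustration, not the driver's actual checks.

```julia
using TOML, Dates

config_text = """
[adjoint]
enabled = true
tape_storage = "pinned_host"
checkpoint_count_per_day = 4
scheme_override = "linrood7"

[adjoint.objective]
kind = "site_concentration"
site = "MLO"
time_utc = "2021-12-05T12:00:00"
"""

cfg = TOML.parse(config_text)
adj = cfg["adjoint"]

# Hypothetical validation; the real driver may check different things.
allowed_storage = ("device", "pinned_host", "mmap")
adj["tape_storage"] in allowed_storage ||
    error("unknown tape_storage: $(adj["tape_storage"])")
adj["checkpoint_count_per_day"] > 0 ||
    error("checkpoint_count_per_day must be positive")

storage  = Symbol(adj["tape_storage"])                 # storage backend as a Symbol
obs_time = DateTime(adj["objective"]["time_utc"])      # quoted string, parsed as ISO 8601
```

Note that `time_utc` is a quoted string in the config, so it arrives as a `String` and has to be converted to a `DateTime` explicitly.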

What's not yet adjointed

The forward operators that lack a fully-wired adjoint kernel today:

  • Lin-Rood ORD=7 panel-boundary correction has a partial adjoint (kernel exists, but the panel-edge stencil is not yet a separate record family); the production CS adjoint path uses LinRoodPPMScheme(ORD=5).

  • copy_corners reverse is the named gap for the cubed-sphere halo-corner exchange.

  • TM5 convection adjoint — the four-field column solve has a partial transpose; full adjoint regression is open.

  • CMFMC convection adjoint — the upwind convection scheme has no adjoint kernel yet; CS runs with CMFMCConvection cannot be differentiated.

If you need a feature on this list for a campaign, open an issue; priority follows campaign demand.

Performance properties

For a 14-day C180 run with :pinned_host storage and 4 checkpoints per day:

  • Forward overhead from recording: 3–8 % on top of the no-record forward run, dominated by the CUDA pinned-host async copy.

  • Reverse pass wall clock: 2.5–3 × the forward wall clock (consistent with the Griewank-Walther asymptotic).

  • Memory: 4 checkpoints × C180 full state ≈ 6 GB on the host. The tape itself is ~1–2 GB depending on scheme.

These numbers are competitive with TM5-4DVAR on equivalent resolution and substantially faster than GIGC-adjoint (GIGC-adjoint runs are CPU-only).

Reading next