SGP4/SDP4 on WebGPU

Full orbital propagation in WGSL compute shaders with zero-copy rendering

Red RiddellApril 2026

Overview

SGP4 and SDP4 run entirely in WGSL compute shaders — the full specification, including deep-space secular and periodic corrections, half-day and near-synchronous resonance integration, and Lyddane low-inclination modification. On an Apple M1 Max, the propagator achieves 899M propagations/sec peak throughput, propagating 650,000 satellites in 0.72ms — 422x faster than single-threaded satellite.js v6 [3]. Output buffers bind directly as vertex attributes. There is no CPU readback.

Fig. 1 — Speedup over satellite.js (CPU) by catalog size. Increases as GPU parallelism saturates.

Prior GPU implementations of SGP4 target server-side CUDA for batch conjunction screening. This work brings the complete propagator to the browser via WebGPU, with direct integration into a real-time rendering pipeline. The implementation uses WGS-72 constants (Earth radius 6378.135 km, mu 398600.8 km³/s²) consistent with the NORAD TLE catalog generation process [1] [2].

Performance

All GPU measurements use WebGPU timestamp queries (GPU-clock, not wall-clock). Each measurement is a single compute dispatch propagating N satellites at one epoch — one propagation produces one position-velocity pair. Hardware: Apple M1 Max, 64GB unified memory, macOS, Chrome, Metal backend. Synthetic catalog with mixed orbital distribution: 70% LEO, 20% MEO, 10% GEO (deep-space path). 2000 iterations per data point, 100 warmup.

Count	Avg (ms)	M props/s	Buffer
30k	0.042	718	11.9 MB
100k	0.118	847	39.7 MB
250k	0.286	875	99.2 MB
500k	0.561	891	198.4 MB
650k	0.723	899	257.9 MB

Fig. 2 — GPU throughput (M propagations/sec) vs. catalog size. Saturation begins around 100k; throughput plateaus at ~899M/s.

At the operational catalog size (~30k tracked objects), propagation takes 0.042ms — well under a single frame at 60 fps (16.6ms budget). At 650k the propagator is still under 1ms, leaving headroom for projected catalog growth.

Comparison: satellite.js (CPU)

Baseline: single-threaded satellite.js v6 [3], the standard JavaScript SGP4 implementation. Same catalog, distribution, and hardware. CPU throughput is flat at ~2.2M propagations/sec regardless of catalog size — linear scaling with no parallelism.

Count	CPU (ms)	GPU (ms)	Speedup
30k	12.91	0.042	307x
100k	45.11	0.118	382x
250k	114.89	0.286	402x
500k	230.50	0.561	411x
650k	305.29	0.723	422x

Fig. 3 — Propagation time: GPU vs CPU. The GPU line is near zero across the full range.

Zero-Copy Buffer Architecture

The propagator's position output buffer is created with STORAGE | COPY_SRC | VERTEX usage flags. The compute shader writes propagated positions, and the rendering pipeline reads them as instanced vertex attributes — no staging buffer, no CPU readback, no copy.

A SharedBufferBridge mediates ownership. It allocates Three.js StorageInstancedBufferAttribute objects, extracts the underlyingGPUBuffer handles from the renderer backend, and passes them to the compute pipeline as external buffers. The compute class marks these as non-owned — it writes to them but does not manage their lifecycle.

Three.jsAllocates StorageInstancedBufferAttribute

SharedBufferBridgeExtracts GPUBuffer handle

Sgp4ComputeBinds buffer (non-owned)

Compute shaderWrites positions

Vertex shaderReads positions

No copy. No readback. No staging.

Fig. 4 — Single GPUBuffer flows from allocation through compute to vertex shader with no intermediate copies.

CPU readback is available as an opt-in path. A staging buffer is allocated lazily on first read request.

Dual Dispatch Model

The shader exposes two compute entry points:

WGSL entry points

@compute @workgroup_size(64)
fn main(@builtin(global_invocation_id) gid: vec3<u32>)
  // Batch: N satellites × 1 time
  // Thread i → satellite i at baseTsinceMin

@compute @workgroup_size(64)
fn propagateOrbit(@builtin(global_invocation_id) gid: vec3<u32>)
  // Orbit: 1 satellite × M times
  // Thread i → time sample i for satellite S

Both call the same sgp4() function and read from the same constants buffer. The batch path writes position and velocity output buffers. The orbit path writes to a separate results buffer containing both scene-space and ECI positions per time sample.

The orbit entry point packs the satellite index into the uniform via bitcast<u32>, reusing the existing struct layout. An arbitrary number of time samples are uploaded to a storage buffer. Selecting a satellite triggers a single dispatch returning the full orbital trace in one GPU round-trip.

f32 Precision Strategy

The entire GPU pipeline operates in single precision. The primary source of error is the time-since-epoch calculation: SGP4 expects tsince in minutes from the TLE epoch. A naive approach stores the absolute epoch as milliseconds — but f32 has only 24 bits of mantissa, giving ~1-minute resolution at typical epoch magnitudes. That is unacceptable.

The solution: compute the average epoch across the entire catalog and use it as a shared reference. Each satellite stores a signed offset from the reference to its TLE epoch, in minutes. At dispatch time, the CPU writes a single uniform containing the minutes elapsed since the reference epoch. The GPU computes each satellite's tsince by adding its per-satellite offset.

This keeps both terms small. The base term is minutes from the catalog's temporal center (typically near zero for a fresh catalog). The delta term is the spread between individual epochs (days to weeks, not decades). Measured error against f64 CPU reference: mean 0.054 km (54 meters), validated against Vallado verification vectors [2]. This is two orders of magnitude below SGP4's inherent model accuracy (~1 km at epoch), making it operationally negligible.

Coefficient Packing

SGP4/SDP4 initialization is computationally cheap but branchy — resonance detection, deep-space classification (orbital period > 225 minutes), Lyddane path selection. This work runs once on the CPU. The GPU receives a pre-computed coefficient buffer: 96 floats per satellite (384 bytes), tightly packed into a single storage buffer. The stride aligns to exactly 3 GPU cache lines (128 bytes each).

Flags for near-earth vs. deep-space path selection and higher-order drag perturbation skipping are set during CPU initialization — the GPU branches on these without recomputation. The full catalog constants buffer for 30k objects is ~11.5 MB; at 650k it is ~250 MB.

Resonance Integration

The SDP4 deep-space resonance integrator (dspace) steps from a known state to the target time. The standard CPU implementation uses an unbounded while loop. GPU compute shaders require bounded execution — the implementation uses a fixed-iteration loop with 720-minute steps, configurable up to 1000 iterations (~1.4 years from epoch).

Vallado's original C++ persists integration state (atime, xli, xni) across calls via mutable references, allowing the integrator to resume rather than restart. The standard JavaScript port (satellite.js) does not persist this state — it returns updated values from dspace but does not write them back to satrec, re-integrating from initial values on every call. On GPU, both approaches are viable — persisting state requires an additional read-write buffer, while stateless re-integration trades memory bandwidth for compute. We benchmarked both:

Count	Stateful (M/s)	Stateless (M/s)
30k	718	634
100k	847	838
250k	841	875
500k	852	891
650k	856	899

At smaller catalog sizes (≤100k), the GPU is compute-bound — persisting state reduces the worst-case iteration count per resonant satellite, allowing workgroups to complete faster. At larger sizes (≥250k), the GPU saturates and becomes memory-bound — the additional buffer read/write per resonant satellite adds traffic that exceeds the saved compute. Vantafort uses the stateful path, as the operational catalog is ~30k objects.

Resonance type is pre-classified on the CPU: no resonance, near-synchronous (period ≈ 1 sidereal day), or half-day (period ≈ 12 hours). Non-resonant deep-space objects skip the integration loop entirely.

Validation

The implementation is validated against the Vallado SGP4 verification dataset [2]. The GPU propagator matches reference positions within 0.054 km mean error, attributable to f32 precision loss. Additional verification covers determinism, batch indexing correctness, and validity flag propagation (w=1.0 for valid orbits, 0.0 for decayed or error conditions).

Error conditions are handled in the shader: negative mean motion, escape eccentricity (e ≥ 1), negative semi-latus rectum, and sub-surface radius all return a zero vector with w=0.0, propagating cleanly through the rendering pipeline as invisible points.

Limitations

The pipeline operates entirely in f32. The average-epoch strategy keeps precision loss operationally negligible for fresh catalogs and near-epoch propagation. For stale TLEs (months-old epochs) or long propagation spans (weeks), the f32 error budget grows — though it remains below SGP4's own model degradation over the same timeframes.

WebGPU does not yet support f64 in compute shaders. When it does, the precision constraint disappears without architectural changes — the buffer layout and dispatch model are precision-agnostic.

References

Hoots, F.R., Roehrich, R.L. "Spacetrack Report No. 3: Models for Propagation of NORAD Element Sets." U.S. Air Force Aerospace Defense Command, 1980.
Vallado, D.A., Crawford, P., Hujsak, R., Kelso, T.S. "Revisiting Spacetrack Report #3." AIAA/AAS Astrodynamics Specialist Conference, 2006.
satellite.js — JavaScript port of SGP4/SDP4 based on Vallado's reference implementation.