LEO Maneuver Detection on Public TLE History

Robust-statistics change detection on raw two-line element sets, validated against laser-ranging ground truth

Red Riddell, April 2026

Summary

Vantafort's LEO maneuver detector identifies orbital maneuvers from public two-line element sets alone: no operator telemetry, no access to ranging data, no closed tracking feeds. On nine drag- and phase-maintained LEO Earth-observation satellites evaluated against International Laser Ranging Service (ILRS) ground truth [1] during their operational phase, it achieves aggregate precision 0.922, recall 0.725, and F1 0.812 at a ±12 h event-match tolerance. The high-confidence tier reaches 96.3% pooled precision, with a cross-satellite standard deviation of 4.7 percentage points across all nine assets.

Per-satellite configuration is derived entirely from unsupervised TLE statistics; no labels touch the per-satellite tuning surface. A sensitivity sweep confirms that the result does not hinge on fine tuning: ±50% perturbations to the four adaptive thresholds move F1 by ≤ 0.003. Two global knobs, the Student-t reference degrees of freedom and the false-discovery-rate target, are tuned to the benchmark F1 peak and disclosed as operating-point choices.

The Problem

Maneuvers on operational LEO Earth-observation missions are small — typically 0.01–0.15 m/s, orders of magnitude below the noise floor of SGP4's mean-element representation. The public TLE catalog, updated on an irregular 8–24 h cadence, is the only observation layer available without operator cooperation; its elements mix physical signal with orbit-determination artefacts (B* re-estimation coupling, fit-span edges, epoch-placement jitter).

The noise challenge is not subtle. Inter-TLE element changes on clean pairs show kurtosis between 190 and 3,879 across the benchmark, two to three orders of magnitude above the Gaussian value of 3. Standard z-score detection collapses on distributions this heavy-tailed. Ground truth is also scarce: ILRS publishes maneuver ledgers only for the ~20 satellites it ranges, and their timestamps carry uncertainty on the order of hours, which matters for strict event matching (§6).
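To make the kurtosis comparison concrete, the sketch below computes the (non-excess) Pearson kurtosis, for which a Gaussian gives 3, on two synthetic series. The contaminated series is invented for illustration (a Gaussian core with rare large jumps) and is not benchmark TLE data; the function name `pearson_kurtosis` is likewise an assumption of this sketch.

```python
import random

def pearson_kurtosis(xs):
    """Fourth standardised moment m4 / m2^2 (Gaussian value is 3)."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    return m4 / (m2 * m2)

rng = random.Random(0)

# Pure Gaussian reference series.
gaussian = [rng.gauss(0.0, 1.0) for _ in range(20000)]

# Synthetic stand-in for inter-TLE changes: mostly small noise,
# with roughly 0.2% of samples drawn from a far wider distribution.
contaminated = [
    rng.gauss(0.0, 1.0) if rng.random() > 0.002 else rng.gauss(0.0, 100.0)
    for _ in range(20000)
]

k_gauss = pearson_kurtosis(gaussian)
k_heavy = pearson_kurtosis(contaminated)
print(f"Gaussian kurtosis:     {k_gauss:.2f}")
print(f"Contaminated kurtosis: {k_heavy:.0f}")
```

A handful of large jumps is enough to push the kurtosis into the hundreds or thousands, which is why a mean-and-standard-deviation z-score is a poor reference for this kind of series.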

Fig. 1 — Per-satellite raw kurtosis, log scale. Gaussian reference κ = 3 dashed.

Approach

The detector maintains a strictly causal rolling baseline of clean (non-anomalous-scoring) inter-TLE changes in a single mean element per satellite. A robust location and scale (median, MAD) [6] are computed over the baseline; each new change is standardised and mapped to a tail probability via a Student-t reference that is deliberately heavier-tailed than the fitted empirical distribution, as a skepticism prior. A Benjamini-Hochberg [7] step-up threshold selects the candidate set per satellite. Adjacent selected pairs within 24 h merge into events; events are matched to the nearest ILRS truth within ±12 h. No SGP4 propagation, no ephemeris, no runtime dependencies beyond a TLE stream.
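The pipeline described above can be sketched in a few functions. This is a minimal illustration under stated assumptions, not the shipped detector: the degrees of freedom (3, chosen because its CDF has a closed form), the window length, the admission threshold, the warm-up length, and the q-value are all placeholder values, since the real ones are withheld, and the empirical-Bayes prior weight is omitted entirely. All function names are hypothetical.

```python
import math
import statistics

def robust_z(baseline, x):
    """Standardise x against the baseline median and MAD."""
    med = statistics.median(baseline)
    mad = statistics.median([abs(b - med) for b in baseline])
    scale = 1.4826 * mad  # MAD scaled to a sigma-equivalent for a Gaussian core
    return (x - med) / scale if scale > 0 else 0.0

def t3_two_sided_p(z):
    """Two-sided tail probability of Student-t with 3 d.o.f. (closed form)."""
    u = abs(z) / math.sqrt(3.0)
    return max(0.0, 1.0 - (2.0 / math.pi) * (math.atan(u) + u / (1.0 + u * u)))

def score_changes(changes, window=50, admit_z=3.0, warmup=10):
    """Causal pass: each change is scored only against earlier clean changes."""
    baseline, pvals = [], []
    for x in changes:
        if len(baseline) < warmup:
            baseline.append(x)   # warm-up samples are admitted unscored
            pvals.append(1.0)
            continue
        z = robust_z(baseline[-window:], x)
        pvals.append(t3_two_sided_p(z))
        if abs(z) < admit_z:     # only clean-scoring changes enter the baseline
            baseline.append(x)
    return pvals

def bh_select(pvals, q):
    """Benjamini-Hochberg step-up: indices rejected at FDR level q."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= q * rank / m:
            k = rank             # keep the largest rank that passes
    return sorted(order[:k])

def merge_events(times_h, gap_h=24.0):
    """Group detection times (hours) into events when gaps are <= gap_h."""
    events = []
    for t in sorted(times_h):
        if events and t - events[-1][-1] <= gap_h:
            events[-1].append(t)
        else:
            events.append([t])
    return events

# Synthetic change series: small periodic noise plus one large jump at index 40.
changes = [0.1 * ((i % 7) - 3) for i in range(60)]
changes[40] = 50.0
selected = bh_select(score_changes(changes), q=0.05)
print(selected)  # -> [40]
```

Because the jump is never admitted to the baseline, it does not inflate the scale estimate used for later samples, which is the point of the clean-pair admission step.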

Per-satellite configuration is derived from three unsupervised TLE statistics (cadence median, change-series kurtosis, TLE count) via a staircase mapping to baseline window, clean-pair admission threshold, and empirical-Bayes prior weight. No labels touch the per-satellite configuration. Two global knobs (the Student-t reference degrees of freedom and the BH q-value) are selected at the benchmark F1 peak and disclosed as operating-point controls. Specific staircase cutoffs and knob values are withheld; the conceptual mapping above is intended to be sufficient to assess the class of method.
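As a purely structural illustration of what a staircase mapping looks like, the sketch below maps each statistic to a piecewise-constant setting. Every cutoff and level here is an invented placeholder, and the pairing of statistic to parameter is also an assumption of this sketch; none of it reflects the withheld shipped values.

```python
from bisect import bisect_right

def staircase(value, cutoffs, levels):
    """Piecewise-constant map: levels[i] applies on cutoffs[i-1] <= value < cutoffs[i]."""
    if len(levels) != len(cutoffs) + 1:
        raise ValueError("need exactly one more level than cutoffs")
    return levels[bisect_right(cutoffs, value)]

def derive_config(cadence_median_h, kurtosis, tle_count):
    """Hypothetical per-satellite configuration from three unsupervised statistics."""
    # PLACEHOLDER cutoffs and levels, chosen only to make the example run.
    return {
        "baseline_window": staircase(cadence_median_h, [10.0, 18.0], [90, 60, 40]),
        "admit_z": staircase(kurtosis, [500.0, 2000.0], [3.0, 3.5, 4.0]),
        "prior_weight": staircase(tle_count, [2000, 6000], [0.5, 0.3, 0.1]),
    }

config = derive_config(cadence_median_h=12.0, kurtosis=3879.0, tle_count=4500)
print(config)
```

The property that matters is that the inputs are label-free: nothing in `derive_config` reads the ILRS truth ledger.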

Scope Policy

Each satellite is restricted to its operational phase: from 90 days after launch (excluding the Launch and Early Orbit Phase, LEOP: insertion burns, reference-orbit phasing, commissioning-related orbit tuning) and, for retired satellites, until 90 days before decommissioning (excluding end-of-life, EOL, de-orbit maneuvers). Both windows are round-number approximations of typical commissioning and end-of-mission durations for Earth-observation missions of this class, buffered past the astrodynamic transitions rather than calibrated per satellite. Satellites must carry ≥ 40 ILRS truths, the floor for informative per-satellite Wilson CIs, which excludes Sentinel-6A (n = 31).
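The scope rule reduces to a date filter plus a truth-count floor. The sketch below is one way to express it; the function names and the example dates are hypothetical, and only the 90-day buffers and the 40-truth floor come from the text.

```python
from datetime import datetime, timedelta

BUFFER = timedelta(days=90)   # applied after launch and before decommissioning
MIN_TRUTHS = 40               # floor for informative per-satellite Wilson CIs

def operational_window(launch, decommission=None):
    """Return (start, end) of the operational phase; end is None if still active."""
    start = launch + BUFFER
    end = decommission - BUFFER if decommission is not None else None
    return start, end

def in_scope(epoch, launch, decommission=None):
    """True if a TLE epoch falls inside the operational-phase window."""
    start, end = operational_window(launch, decommission)
    return epoch >= start and (end is None or epoch <= end)

def eligible(n_truths):
    """True if a satellite has enough labeled maneuvers to be benchmarked."""
    return n_truths >= MIN_TRUTHS

# Hypothetical dates, for illustration only.
launch = datetime(2020, 1, 1)
decommission = datetime(2021, 1, 1)
print(operational_window(launch, decommission))
print(eligible(31))  # a satellite with 31 truths falls below the floor
```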

A build script applies these windows to the raw Space-Track archive; every evaluation in this paper consumes the same operational-scope dataset. The detector is byte-identical across scope choices — scope changes what the paper covers, not the detector.

Results

Before the aggregate numbers, a concrete view: 18 months of Sentinel-3A operational-phase TLE history (Jan 2021 – Jun 2022). The grey line is the running |Δ| z-score between adjacent TLE pairs; dashed verticals mark the 17 ILRS-labeled maneuvers in the window; coloured dots are the detector's 18 emitted events, coloured by tier. 17 of 17 labeled maneuvers are recovered; one additional tier-LOW event at day 6 is a strict false positive under ±12 h but sits within 96 h of a labeled burn (§6).

Fig. 2 — 18 months of Sentinel-3A. Dashed verticals mark the 17 ILRS-labeled maneuvers — all recovered. One near-match within 96 h.

Per-satellite performance at ±12 h event match:

Satellite	Truths	TP	FP	FN	P	R	F1
Jason-2	76	27	0	49	1.000	0.355	0.524
CryoSat-2	225	179	6	46	0.968	0.796	0.873
SARAL	57	22	4	35	0.846	0.386	0.530
Jason-3	70	50	5	20	0.909	0.714	0.800
Sentinel-3A	134	129	9	5	0.935	0.963	0.949
Sentinel-3B	132	104	16	28	0.867	0.788	0.825
HY-2C	44	27	5	17	0.844	0.614	0.711
HY-2D	42	27	7	15	0.794	0.643	0.711
SWOT	64	47	0	17	1.000	0.734	0.847
Aggregate	844	612	52	232	0.922	0.725	0.812
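The table's derived columns follow directly from the TP, FP, and FN counts. The short sketch below recomputes them from the counts printed above; the helper name `prf` is an assumption of this example.

```python
def prf(tp, fp, fn):
    """Precision, recall, and F1 from event-level counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return precision, recall, f1

# (TP, FP, FN) per satellite, copied from the table above.
counts = {
    "Jason-2": (27, 0, 49),
    "CryoSat-2": (179, 6, 46),
    "SARAL": (22, 4, 35),
    "Jason-3": (50, 5, 20),
    "Sentinel-3A": (129, 9, 5),
    "Sentinel-3B": (104, 16, 28),
    "HY-2C": (27, 5, 17),
    "HY-2D": (27, 7, 15),
    "SWOT": (47, 0, 17),
}

# Pool the counts, then compute the aggregate (micro-averaged) scores.
tp = sum(c[0] for c in counts.values())
fp = sum(c[1] for c in counts.values())
fn = sum(c[2] for c in counts.values())
aggregate = tuple(round(v, 3) for v in prf(tp, fp, fn))
print(tp, fp, fn, aggregate)  # -> 612 52 232 (0.922, 0.725, 0.812)
```

The aggregate row is a pooled (micro) average over events, so satellites with more labeled maneuvers, such as CryoSat-2, carry more weight than a per-satellite mean would give them.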

Aggregate P = 0.922 [0.899, 0.940], R = 0.725 [0.694, 0.754] (95% Wilson CIs), and F1 = 0.812, from 664 emitted events (612 TP, 52 FP) and 232 missed truths (FN). Each event's peak z-score sorts it into one of three empirical tiers; the highest-confidence tier reaches 96.3% pooled precision across 402 events, with a per-satellite mean of 95.7% and σ = 4.7 pp across all nine assets. Under a Gaussian prediction approximation (roughly ±2σ), a new satellite in the same regime would land within ~9 pp of the pool mean.
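The bracketed intervals can be reproduced with the standard Wilson score interval for a binomial proportion. The sketch below assumes a 95% level and uses only the pooled counts reported above; the function name is hypothetical.

```python
import math
from statistics import NormalDist

def wilson_interval(successes, n, level=0.95):
    """Wilson score interval for a binomial proportion."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)   # about 1.96 for 95%
    p = successes / n
    denom = 1.0 + z * z / n
    centre = (p + z * z / (2.0 * n)) / denom
    half = z * math.sqrt(p * (1.0 - p) / n + z * z / (4.0 * n * n)) / denom
    return centre - half, centre + half

# Precision: 612 true positives out of 664 emitted events.
p_lo, p_hi = wilson_interval(612, 664)
# Recall: 612 recovered out of 844 labeled truths.
r_lo, r_hi = wilson_interval(612, 844)

print(f"P CI: [{p_lo:.3f}, {p_hi:.3f}]")  # -> P CI: [0.899, 0.940]
print(f"R CI: [{r_lo:.3f}, {r_hi:.3f}]")  # -> R CI: [0.694, 0.754]
```

Unlike the simple normal-approximation interval, the Wilson interval is pulled slightly toward 0.5, which matters most for the per-satellite rows with small counts.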

Fig. 3 — Pooled precision per tier, 95% Wilson CIs.

Two audits back the numbers up. Temporal holdout: the detector, configured from each satellite's full-timeline statistics, was scored on the last 30% of its TLE history alone. Holdout F1 = 0.881, which is 0.069 above the full-sample figure; overfitting would show up as a drop on the held-out span, and none appears. Sensitivity sweep: aggregate F1 varies by at most 0.003 across ±50% neighbourhoods of every adaptive threshold; the two global knobs peak sharply at their shipped values, as expected of explicit operating-point choices.

What the False Positives Actually Are

The detector emits 664 events on the operational-scope benchmark; 52 (7.8%) are strict false positives — events without an ILRS truth within ±12 h. Of those 52, 41 (79%) sit within 96 h of a labeled truth. Under a uniform-null simulation, the expected within-96 h count is 13.1. The corresponding χ² test: χ²(5) = 98.40, p = 1.2 × 10⁻¹⁹.
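The strict and near-miss categories used here depend only on each event's distance to the nearest labeled truth. The sketch below shows that classification under a simplifying assumption: it does not enforce one-to-one assignment between events and truths, which a full matcher would need. Function names and the example times are hypothetical.

```python
from bisect import bisect_left

def nearest_gap_h(event_h, truths_sorted):
    """Absolute distance (hours) from an event to the nearest truth.

    truths_sorted must be a non-empty, ascending list of truth times in hours.
    """
    i = bisect_left(truths_sorted, event_h)
    neighbours = truths_sorted[max(i - 1, 0): i + 1]  # truth before and after
    return min(abs(event_h - t) for t in neighbours)

def classify(event_h, truths_sorted, match_h=12.0, near_h=96.0):
    """Label an event as a match, a near-miss false positive, or a far one."""
    gap = nearest_gap_h(event_h, truths_sorted)
    if gap <= match_h:
        return "match"
    if gap <= near_h:
        return "near-miss FP"
    return "far FP"

# Illustrative times in hours since an arbitrary origin.
truths = [100.0, 300.0, 500.0]
for event in (105.0, 360.0, 700.0):
    print(event, classify(event, truths))
```

Counting how many strict false positives land in the "near-miss FP" bucket, and comparing that with the count expected if false positives were spread uniformly in time, is the comparison the χ² test formalises.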

Fig. 4 — Strict false positives by distance to nearest labeled truth, observed vs. uniform-null.

The direct reading: most of the detector's strict false positives are not noise-driven errors. Their clustering near labeled truths is consistent with real maneuvers timed outside the ±12 h match window because of finite ILRS timestamp precision or late-arriving post-burn TLEs. Under a label-noise-tolerant scoring rule that credits within-96 h unmatched events at half weight, aggregate F1 rises to 0.823. The half-credit minus strict delta has a 95% CI of [+0.006, +0.018] with Pr(Δ > 0) = 100% under a paired cluster bootstrap, so it is formally significant at α = 0.05. A generous rule that treats within-96 h unmatched events as pure label noise yields F1 = 0.834.
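The arithmetic behind the three F1 figures can be checked with the pooled counts. The sketch below assumes one particular reading of the scoring rules, namely that credit is applied by discounting near-miss events from the false-positive count; the source does not spell out the formula, but this reading reproduces the reported values.

```python
def f1_with_credit(tp, fp, fn, near_miss, credit):
    """F1 after discounting a fraction `credit` of near-miss false positives."""
    fp_effective = fp - credit * near_miss
    return 2 * tp / (2 * tp + fp_effective + fn)

TP, FP, FN = 612, 52, 232   # pooled counts from the results table
NEAR_MISS = 41              # strict FPs within 96 h of a labeled truth

strict = f1_with_credit(TP, FP, FN, NEAR_MISS, credit=0.0)
half = f1_with_credit(TP, FP, FN, NEAR_MISS, credit=0.5)
generous = f1_with_credit(TP, FP, FN, NEAR_MISS, credit=1.0)

print(round(strict, 3), round(half, 3), round(generous, 3))  # -> 0.812 0.823 0.834
```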

Strict precision of 0.922 is a floor, not a ceiling. A customer running the detector on their own fleet with tighter timing labels, or willing to cross-check unmatched events against their own records, should expect effective precision of roughly 0.96 or higher.

Methodology and the Literature

TLE-based maneuver detection has a thirty-year literature — classical threshold-based change detection on orbital elements [2] [3] and more recent deep-learning work including supervised LSTM classifiers [4], LSTM autoencoders [8] [9], and other supervised ML [5]. Papers in this family report F1 up to 0.995 on well-studied satellites, but those figures are measured in-sample: a single satellite, a temporal split of its own TLE history, per-timestep matching. In-sample evaluation on one satellite cannot measure whether a method works on a satellite the model has never seen — the property that matters for deployment.

Under a protocol that actually tests generalization — leave-one-satellite-out across nine benchmark assets, event-level matching at ±12 h — a from-scratch BiLSTM trained on the same per-pair feature set as the classical detector reached F1 = 0.56. The classical detector reached F1 = 0.81 on the same protocol. Classical robust statistics outperform deep learning at this data scale. A deep baseline would plausibly need a labeled benchmark roughly three times the size of the current ILRS fleet before that changes.
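The leave-one-satellite-out protocol is simple to state in code. The sketch below only generates the folds; training and scoring are left out, and the generator name is an assumption of this example.

```python
def leave_one_satellite_out(satellites):
    """Yield (training_satellites, held_out_satellite) for every fold."""
    for held_out in satellites:
        train = [s for s in satellites if s != held_out]
        yield train, held_out

BENCHMARK = [
    "Jason-2", "CryoSat-2", "SARAL", "Jason-3", "Sentinel-3A",
    "Sentinel-3B", "HY-2C", "HY-2D", "SWOT",
]

folds = list(leave_one_satellite_out(BENCHMARK))
for train, held_out in folds:
    # A learned model would be fitted on `train` only, then scored on
    # `held_out` with event-level matching at the ±12 h tolerance.
    print(f"held out: {held_out:12s} trained on {len(train)} satellites")
```

The held-out satellite contributes no labels to the model scored on it, which is the property a single-satellite temporal split cannot test.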

Limitations

Benchmark breadth. Nine satellites in one regime (drag- and phase-maintained LEO Earth observation, 700–1350 km). Constellations with frequent collision-avoidance burns, highly elliptical LEO, smallsats without ranging data, and non-cooperative payloads are explicitly out of scope.

Operational-phase only. LEOP and EOL maneuvers are excluded by scope policy. Detection performance during those phases is not characterised.

Element coverage. The shipped detector scores on a single scalar channel derived from the TLE element set. Multi-channel variants combining additional directional components were tested against this benchmark and did not produce a statistically significant aggregate F1 lift. Sub-noise-floor burns and some out-of-plane components remain structurally outside reach of any single-TLE method at current public-catalog precision.

Timing claim. Events are matched to labels inside ±12 h. The paper does not claim burn-time recovery — SGP4 fit spans of 1–3 days can delay the earliest post-burn TLE epoch by up to the fit half-span.

Generalisation. Per-satellite HIGH-tier precision σ = 4.7 pp across N = 9 is tight, but N is small. Portability to satellites outside the benchmark is plausible but not demonstrated.

Label floor. ILRS records are the only ground-truth source for LEO used here. Evaluation beyond the ~20 ILRS-ranged satellites requires independent tracking — commercial SSA feeds, operator-disclosed records — that this paper does not have.

References

[1] International Laser Ranging Service. Satellite maneuver history ledger. ilrs.gsfc.nasa.gov.
[2] Kelecy, T., Hall, D., Hamada, K., Stocker, D. "Satellite maneuver detection using Two-line Element (TLE) data." Advanced Maui Optical and Space Surveillance Technologies Conference, 2007.
[3] Lemmens, S., Krag, H. "Two-Line-Elements-Based maneuver detection methods for satellites in Low Earth Orbit." Journal of Guidance, Control, and Dynamics, 37(3), 2014.
[4] Cipollone, R., Setya Ardi, N., Di Lizia, P. "An LSTM-based maneuver detection algorithm from satellites pattern of life." Neural Computing and Applications, 2025.
[5] Bai, X., Liao, C., Pan, X., Xu, M. "Mining Two-Line Element data to detect orbital maneuver for satellite." IEEE Access, 7, 2019.
[6] Huber, P.J. "Robust estimation of a location parameter." Annals of Mathematical Statistics, 35(1), 1964.
[7] Benjamini, Y., Hochberg, Y. "Controlling the false discovery rate: a practical and powerful approach to multiple testing." Journal of the Royal Statistical Society B, 57(1), 1995.
[8] Cipollone, R., Raviola, E., Di Lizia, P. "An unsupervised learning-based manoeuvre detection method for resident space object pattern of life characterisation." 9th European Conference on Space Debris, 2024.
[9] Kato, R., Shimada, Y., Aida, S., Kawamoto, S., Akahoshi, Y. "Validity evaluation of anomaly detection using LSTM autoencoder for maneuver detection." Advanced Maui Optical and Space Surveillance Technologies Conference, 2023.