LEO Maneuver Detection on Public TLE History
Robust-statistics change detection on raw two-line element sets, validated against laser-ranging ground truth
Summary
Vantafort's LEO maneuver detector identifies orbital maneuvers from public two-line element sets alone — no operator telemetry, no access to ranging data, no closed tracking feeds. On nine drag- and phase-maintained LEO Earth-observation satellites evaluated against International Laser Ranging Service (ILRS) ground truth [1] during their operational phase, it achieves aggregate precision 0.922, recall 0.725, and F1 0.812 at a ±12 h event-match tolerance. The high-confidence tier reaches 96.3% pooled precision with 4.7 percentage points cross-satellite standard deviation across all nine assets.
Per-satellite configuration is derived entirely from unsupervised TLE statistics; no labels touch the tuning surface. A sensitivity sweep confirms: ±50% perturbations to the four adaptive thresholds move F1 by ≤ 0.003. Two global knobs — the Student-t reference degrees of freedom and the false-discovery-rate target — are tuned at benchmark F1 peak and disclosed as operating-point choices.
The Problem
Maneuvers on operational LEO Earth-observation missions are small — typically 0.01–0.15 m/s, orders of magnitude below the noise floor of SGP4's mean-element representation. The public TLE catalog, updated on an irregular 8–24 h cadence, is the only observation layer available without operator cooperation; its elements mix physical signal with orbit-determination artefacts (B* re-estimation coupling, fit-span edges, epoch-placement jitter).
The noise challenge is not subtle. Inter-TLE element changes on clean pairs show kurtosis between 190 and 3,879 across the benchmark, roughly two to three orders of magnitude above the Gaussian value of 3. Standard z-score detection collapses on distributions this heavy-tailed, because a handful of extreme values inflates the sample standard deviation and masks genuinely small burns. Ground truth is also scarce: ILRS publishes maneuver ledgers only for the ~20 satellites it ranges, and their timestamps carry uncertainty on the order of hours, which matters for strict event matching (§6).
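To make the heavy-tail problem concrete, here is a minimal sketch on synthetic numbers (not benchmark data). It computes Pearson kurtosis and compares a standard-deviation z-score with a MAD-based one for a hypothetical 0.05-unit change. The function names and the synthetic series are illustrative assumptions.

```python
import statistics

def pearson_kurtosis(xs):
    """Fourth standardised moment; equals 3 for Gaussian data."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    return m4 / (m2 ** 2)

def mad_scale(xs):
    """Median absolute deviation, scaled by 1.4826 to match sigma on Gaussian data."""
    med = statistics.median(xs)
    return 1.4826 * statistics.median([abs(x - med) for x in xs])

# Synthetic change series: 200 small fluctuations plus two extreme artefacts.
changes = [(-1) ** i * 0.001 * (1 + i % 5) for i in range(200)] + [5.0, -4.0]

candidate = 0.05  # hypothetical small-burn signature, same units as the series
z_std = (candidate - statistics.mean(changes)) / statistics.pstdev(changes)
z_mad = (candidate - statistics.median(changes)) / mad_scale(changes)

print(f"kurtosis = {pearson_kurtosis(changes):.1f}")
print(f"z (mean/std) = {z_std:.2f}, z (median/MAD) = {z_mad:.2f}")
```

On this series the two artefacts dominate the standard deviation, so the candidate scores well below 1 under the plain z-score but above 10 under the median/MAD scale.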
Approach
The detector maintains a strictly-causal rolling baseline of clean (non-anomalous-scoring) inter-TLE changes in a single mean element per satellite. A robust location and scale (median, MAD) are computed over the baseline; each new change is standardised and mapped to a tail probability via a Student-t reference [6] — deliberately heavier-tailed than the fitted empirical distribution, as a skepticism prior. A Benjamini-Hochberg [7] ordered threshold selects the candidate set per satellite. Adjacent selected pairs within 24 h merge into events; events match to the nearest ILRS truth within ±12 h. No SGP4 propagation, no ephemeris, no runtime dependencies beyond a TLE stream.
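The scoring chain described above (robust standardisation, Student-t tail probability, Benjamini-Hochberg selection) can be sketched as follows. This is an illustration under stated assumptions, not the shipped implementation: the baseline here is a fixed list rather than a causal rolling window with clean-pair admission, and `DF` and `Q` are placeholder values because the real knob settings are withheld.

```python
import math
import statistics

DF = 3     # hypothetical Student-t degrees of freedom (shipped value is withheld)
Q = 0.05   # hypothetical BH false-discovery-rate target (shipped value is withheld)

def robust_z(baseline, value):
    """Standardise a change against the baseline median and MAD scale (assumes MAD > 0)."""
    med = statistics.median(baseline)
    mad = statistics.median([abs(b - med) for b in baseline])
    return (value - med) / (1.4826 * mad)

def student_t_two_sided_p(z, df, steps=4000):
    """P(|T| >= |z|) for a Student-t with df degrees of freedom.

    Uses Simpson integration of the density on [0, |z|]; adequate for a sketch,
    but loses relative precision for extremely small tail probabilities.
    """
    x = abs(z)
    if x == 0.0:
        return 1.0
    log_c = math.lgamma((df + 1) / 2) - math.lgamma(df / 2) - 0.5 * math.log(df * math.pi)

    def pdf(t):
        return math.exp(log_c - (df + 1) / 2 * math.log1p(t * t / df))

    h = x / steps
    total = pdf(0.0) + pdf(x)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    central = total * h / 3  # P(0 <= T <= x)
    return max(0.0, 1.0 - 2.0 * central)

def benjamini_hochberg(pvals, q):
    """Indices selected by the BH step-up rule at FDR level q."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0
    for rank, idx in enumerate(order, start=1):
        if pvals[idx] <= q * rank / m:
            k = rank  # step-up: keep the largest rank that passes
    return sorted(order[:k])

# Synthetic baseline of clean changes and four new changes to score.
baseline = [(-1) ** i * 0.001 * (1 + i % 5) for i in range(60)]
new_changes = [0.001, 0.08, -0.002, 0.12]
pvals = [student_t_two_sided_p(robust_z(baseline, v), DF) for v in new_changes]
selected = benjamini_hochberg(pvals, Q)
print(selected)
```

With these synthetic inputs the two large changes (indices 1 and 3) are selected and the two small ones are not.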
Per-satellite configuration is derived from three unsupervised TLE statistics — cadence median, change-series kurtosis, TLE count — via a staircase mapping to baseline window, clean-pair admission threshold, and empirical-Bayes prior weight. No labels touch the per-satellite configuration. Two global knobs (the Student-t reference degrees of freedom and the BH q-value) are selected at benchmark F1 peak and disclosed as operating-point controls. Specific staircase cutoffs and knob values are withheld; the conceptual mapping above is sufficient to validate the class of method.
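For readers unfamiliar with the term, a staircase mapping is a piecewise-constant lookup. The sketch below shows the shape of such a mapping only. Every cutoff and output value is invented for illustration, and the pairing of each statistic with one output (cadence to window, kurtosis to admission threshold, TLE count to prior weight) is also an assumption; the actual cutoffs are withheld.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SatConfig:
    baseline_window: int     # number of clean pairs kept in the rolling baseline
    admit_threshold: float   # |z| below which a pair is admitted as clean
    prior_weight: float      # empirical-Bayes prior weight

def staircase(value, steps, default):
    """Return the output of the first step whose upper bound exceeds value."""
    for upper, out in steps:
        if value < upper:
            return out
    return default

def derive_config(cadence_median_h, kurtosis, tle_count):
    """Map three unsupervised TLE statistics to a config (all numbers hypothetical)."""
    window = staircase(cadence_median_h, [(10.0, 120), (18.0, 90)], 60)
    admit = staircase(kurtosis, [(500.0, 3.0), (2000.0, 2.5)], 2.0)
    weight = staircase(tle_count, [(2000, 0.5), (6000, 0.25)], 0.1)
    return SatConfig(window, admit, weight)

print(derive_config(cadence_median_h=12.0, kurtosis=190.0, tle_count=5000))
```

Because no label enters `derive_config`, a mapping of this form cannot be tuned against the ILRS truths, which is the property the paragraph above relies on.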
Scope Policy
Each satellite is restricted to its operational phase. The window opens 90 days after launch, which excludes the Launch and Early Orbit Phase (insertion burns, reference-orbit phasing, commissioning-related orbit tuning), and, for retired satellites, closes 90 days before decommissioning, which excludes end-of-life graveyard and de-orbit burns. Both windows are round-number approximations of typical commissioning and end-of-mission durations for Earth-observation missions of this class, buffered past the astrodynamic transitions rather than per-satellite-calibrated. Satellites must carry ≥ 40 ILRS truths, the floor for informative per-satellite Wilson CIs, which excludes Sentinel-6A (n = 31).
A build script applies these windows to the raw Space-Track archive; every evaluation in this paper consumes the same operational-scope dataset. The detector is byte-identical across scope choices — scope changes what the paper covers, not the detector.
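The scope windows amount to a simple epoch filter. The sketch below is not the build script itself; the function name is hypothetical and the dates in the example are placeholders rather than real mission dates.

```python
from datetime import datetime, timedelta, timezone

BUFFER = timedelta(days=90)  # same buffer at both ends of the mission

def in_operational_scope(epoch, launch, decommissioned=None):
    """True if a TLE epoch falls inside the operational-phase window."""
    if epoch < launch + BUFFER:
        return False  # LEOP and commissioning
    if decommissioned is not None and epoch > decommissioned - BUFFER:
        return False  # end-of-life operations
    return True

# Placeholder dates for illustration only.
launch = datetime(2016, 1, 1, tzinfo=timezone.utc)
retired = datetime(2020, 1, 1, tzinfo=timezone.utc)
epochs = [
    datetime(2016, 2, 1, tzinfo=timezone.utc),   # inside the launch buffer
    datetime(2016, 6, 1, tzinfo=timezone.utc),   # operational
    datetime(2019, 12, 1, tzinfo=timezone.utc),  # inside the retirement buffer
]
print([in_operational_scope(e, launch, retired) for e in epochs])
```

Passing `decommissioned=None` leaves the window open-ended for satellites that are still active.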
Results
Before the aggregate numbers, a concrete view: 18 months of Sentinel-3A operational-phase TLE history (Jan 2021 – Jun 2022). The grey line is the running |Δ| z-score between adjacent TLE pairs; dashed verticals mark the 17 ILRS-labeled maneuvers in the window; coloured dots are the detector's 18 emitted events, coloured by tier. 17 of 17 labeled maneuvers are recovered; one additional tier-LOW event at day 6 is a strict false positive under ±12 h but sits within 96 h of a labeled burn (§6).
Per-satellite performance at ±12 h event match:
| Satellite | Truths | TP | FP | FN | P | R | F1 |
|---|---|---|---|---|---|---|---|
| Jason-2 | 76 | 27 | 0 | 49 | 1.000 | 0.355 | 0.524 |
| CryoSat-2 | 225 | 179 | 6 | 46 | 0.968 | 0.796 | 0.873 |
| SARAL | 57 | 22 | 4 | 35 | 0.846 | 0.386 | 0.530 |
| Jason-3 | 70 | 50 | 5 | 20 | 0.909 | 0.714 | 0.800 |
| Sentinel-3A | 134 | 129 | 9 | 5 | 0.935 | 0.963 | 0.949 |
| Sentinel-3B | 132 | 104 | 16 | 28 | 0.867 | 0.788 | 0.825 |
| HY-2C | 44 | 27 | 5 | 17 | 0.844 | 0.614 | 0.711 |
| HY-2D | 42 | 27 | 7 | 15 | 0.794 | 0.643 | 0.711 |
| SWOT | 64 | 47 | 0 | 17 | 1.000 | 0.734 | 0.847 |
| Aggregate | 844 | 612 | 52 | 232 | 0.922 | 0.725 | 0.812 |
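The event-level scoring behind the table can be sketched as below: detections within 24 h are merged, each event is matched to the nearest unmatched truth within ±12 h, and precision, recall, and F1 follow from the counts. Two details are assumptions rather than documented behaviour: each merged event is represented by its first detection time, and matching is greedy and one-to-one. Times are hours from an arbitrary origin.

```python
def merge_events(times_h, gap_h=24.0):
    """Chain-merge detection times closer than gap_h; return each event's first time."""
    events = []  # each entry is [first_time, last_time]
    for t in sorted(times_h):
        if events and t - events[-1][1] <= gap_h:
            events[-1][1] = t
        else:
            events.append([t, t])
    return [first for first, _ in events]

def match_events(event_times, truth_times, tol_h=12.0):
    """Greedy one-to-one match of events to the nearest truth; return (TP, FP, FN)."""
    unmatched = sorted(truth_times)
    tp = 0
    for e in sorted(event_times):
        if not unmatched:
            break
        nearest = min(unmatched, key=lambda t: abs(t - e))
        if abs(nearest - e) <= tol_h:
            unmatched.remove(nearest)
            tp += 1
    return tp, len(event_times) - tp, len(unmatched)

def prf(tp, fp, fn):
    """Precision, recall, and F1 from event counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return p, r, f1

# Toy example: five detections collapse to three events, two of which match a truth.
events = merge_events([10, 20, 30, 100, 300])
print(events, match_events(events, [5, 110, 200]))

# The aggregate row of the table follows from its own counts.
print([round(v, 3) for v in prf(612, 52, 232)])
```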
Aggregate P = 0.922 [0.899, 0.940], R = 0.725 [0.694, 0.754], F1 = 0.812 across 664 emitted events (612 TP, 52 FP) and 844 truths (232 FN). Each event's peak z-score sorts it into one of three empirical tiers; the highest-confidence tier reaches 96.3% pooled precision across 402 events, with a per-satellite mean of 95.7% and σ = 4.7 pp across all nine assets. Under a Gaussian prediction approximation, a new satellite in the same regime would land within ~9 pp of the pool mean.
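The bracketed intervals on aggregate precision and recall are consistent with 95% Wilson score intervals computed from the event counts. The text names Wilson intervals only for the per-satellite case, so applying the same formula to the aggregate is an assumption; the sketch below shows that it reproduces the reported bounds.

```python
import math

def wilson_interval(successes, n, z=1.959964):
    """Wilson score interval for a binomial proportion (z = 1.959964 gives 95%)."""
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half

precision_ci = wilson_interval(612, 664)  # TP out of emitted events
recall_ci = wilson_interval(612, 844)     # TP out of ILRS truths
print([round(v, 3) for v in precision_ci], [round(v, 3) for v in recall_ci])
```

Unlike the normal-approximation interval, the Wilson interval stays inside [0, 1] for proportions near the boundary, which matters for the satellites with zero false positives.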
Two audits back the numbers up. Temporal holdout: the detector, configured from each satellite's full-timeline statistics, was scored on the last 30% of its TLE history alone. Holdout F1 = 0.881, which is 0.069 above the full-sample figure; an overfit detector would be expected to score lower on held-out data, not higher. Sensitivity sweep: aggregate F1 varies by ≤ 0.003 across ±50% neighbourhoods of every adaptive threshold; the two global knobs peak sharply at their shipped values, as expected of explicit operating-point choices.
What the False Positives Actually Are
The detector emits 664 events on the operational-scope benchmark; 52 (7.8%) are strict false positives — events without an ILRS truth within ±12 h. Of those 52, 41 (79%) sit within 96 h of a labeled truth. Under a uniform-null simulation, the expected within-96 h count is 13.1. The corresponding χ² test: χ²(5) = 98.40, p = 1.2 × 10⁻¹⁹.
The direct reading: most of the detector's strict false positives are not noise-driven errors. They correspond to labeled maneuvers timed outside the ±12 h match window due to finite ILRS timestamp precision or late-arriving post-burn TLEs. Under a label-noise-tolerant scoring rule that credits within-96 h unmatched events at half weight, aggregate F1 rises to 0.823 (95% CI [+0.006, +0.018] on the half-credit − strict delta, Pr(Δ > 0) = 100%, formally significant at α = 0.05 via paired cluster-bootstrap). A generous rule that treats within-96 h unmatched events as pure label noise yields F1 = 0.834.
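The label-noise-tolerant F1 figures can be reproduced from the aggregate counts if each within-96 h unmatched event has its false-positive weight reduced by the stated credit. That reading of the scoring rule is an assumption, and the function name is hypothetical, but it matches the strict, half-credit, and generous numbers quoted above.

```python
def f1_with_credit(tp, fp, fn, near_miss, credit):
    """F1 when each of `near_miss` false positives is discounted by `credit` in [0, 1]."""
    effective_fp = fp - credit * near_miss
    return 2 * tp / (2 * tp + effective_fp + fn)

TP, FP, FN = 612, 52, 232   # aggregate counts from the results table
NEAR = 41                   # strict false positives within 96 h of a truth

for label, credit in [("strict", 0.0), ("half-credit", 0.5), ("generous", 1.0)]:
    print(label, round(f1_with_credit(TP, FP, FN, NEAR, credit), 3))
```

Only the false-positive term moves under this rule, so recall stays at 0.725 in all three cases and the F1 gain comes entirely from precision.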
Strict precision of 0.922 is a floor, not a ceiling. A customer running the detector on their own fleet with tighter timing labels, or willing to cross-check unmatched events against their own records, should expect effective precision closer to 0.96 or higher.
Methodology and the Literature
TLE-based maneuver detection has a thirty-year literature — classical threshold-based change detection on orbital elements [2] [3] and more recent deep-learning work including supervised LSTM classifiers [4], LSTM autoencoders [8] [9], and other supervised ML [5]. Papers in this family report F1 up to 0.995 on well-studied satellites, but those figures are measured in-sample: a single satellite, a temporal split of its own TLE history, per-timestep matching. In-sample evaluation on one satellite cannot measure whether a method works on a satellite the model has never seen — the property that matters for deployment.
Under a protocol that actually tests generalization (leave-one-satellite-out across nine benchmark assets, event-level matching at ±12 h), a from-scratch BiLSTM trained on the same per-pair feature set as the classical detector reached F1 = 0.56. The classical detector reached F1 = 0.81 on the same protocol. At this data scale, the classical robust-statistics detector outperformed the deep baseline. A deep baseline would plausibly need a labeled benchmark roughly three times the size of the current ILRS fleet before that changes.
Limitations
Benchmark breadth. Nine satellites in one regime (drag- and phase-maintained LEO Earth observation, 700–1350 km). Constellations with frequent collision-avoidance burns, highly elliptical LEO, smallsats without ranging data, and non-cooperative payloads are explicitly out of scope.
Operational-phase only. LEOP and EOL maneuvers are excluded by scope policy. Detection performance during those phases is not characterised.
Element coverage. The shipped detector scores on a single scalar channel derived from the TLE element set. Multi-channel variants combining additional directional components were tested against this benchmark and did not produce a statistically significant aggregate F1 lift. Sub-noise-floor burns and some out-of-plane components remain structurally outside reach of any single-TLE method at current public-catalog precision.
Timing claim. Events are matched to labels inside ±12 h. The paper does not claim burn-time recovery — SGP4 fit spans of 1–3 days can delay the earliest post-burn TLE epoch by up to the fit half-span.
Generalisation. Per-satellite HIGH-tier precision σ = 4.7 pp across N = 9 is tight, but N is small. Portability to satellites outside the benchmark is plausible but not demonstrated.
Label floor. ILRS records are the only ground-truth source for LEO used here. Evaluation beyond the ~20 ILRS-ranged satellites requires independent tracking — commercial SSA feeds, operator-disclosed records — that this paper does not have.
References
[1] International Laser Ranging Service. Satellite maneuver history ledger. ilrs.gsfc.nasa.gov.
[2] Kelecy, T., Hall, D., Hamada, K., Stocker, D. "Satellite maneuver detection using Two-line Element (TLE) data." Advanced Maui Optical and Space Surveillance Technologies Conference, 2007.
[3] Lemmens, S., Krag, H. "Two-Line-Elements-Based maneuver detection methods for satellites in Low Earth Orbit." Journal of Guidance, Control, and Dynamics, 37(3), 2014.
[4] Cipollone, R., Setya Ardi, N., Di Lizia, P. "An LSTM-based maneuver detection algorithm from satellites pattern of life." Neural Computing and Applications, 2025.
[5] Bai, X., Liao, C., Pan, X., Xu, M. "Mining Two-Line Element data to detect orbital maneuver for satellite." IEEE Access, 7, 2019.
[6] Huber, P.J. "Robust estimation of a location parameter." Annals of Mathematical Statistics, 35(1), 1964.
[7] Benjamini, Y., Hochberg, Y. "Controlling the false discovery rate: a practical and powerful approach to multiple testing." Journal of the Royal Statistical Society B, 57(1), 1995.
[8] Cipollone, R., Raviola, E., Di Lizia, P. "An unsupervised learning-based manoeuvre detection method for resident space object pattern of life characterisation." 9th European Conference on Space Debris, 2024.
[9] Kato, R., Shimada, Y., Aida, S., Kawamoto, S., Akahoshi, Y. "Validity evaluation of anomaly detection using LSTM autoencoder for maneuver detection." Advanced Maui Optical and Space Surveillance Technologies Conference, 2023.