Observatory live · 1594 signals · 55 consilience · 139 confirmed · 289 killed
Long-cycle structural research
The Consiliences Institute

Unity of knowledge in an age of fragmentation

Methodology

How the Observatory validates a signal: the eight statistical dimensions, the four-battery framework, the Whewell consilience rubric, and the devil's-advocate review that kills weak findings before publication.

What consilience means

In 1840, William Whewell coined the phrase consilience of inductions to describe the strongest form of scientific confirmation: a hypothesis earns its most compelling support not from the evidence that generated it, but from independent evidence in domains it was never designed to explain.

Newton’s theory of gravity was not just confirmed by falling objects. It was confirmed by its ability to explain planetary orbits, tidal patterns, and the precession of the equinoxes - none of which were used to build it. The convergence of independent lines of evidence on a single underlying mechanism is what Whewell called consilience. It remains the most reliable signal we have that a hypothesis is tracking something real rather than an artifact of the data used to discover it.

We apply this as a formal quality criterion: a pattern must survive confirmation from genuinely independent sources before it is worth reporting. Agreement across series derived from the same upstream model or methodology does not count. Independence must be traced to the generative process, not just the label on the dataset.
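The independence rule above can be illustrated with a minimal sketch. It assumes each data stream has been annotated with the upstream generative source it derives from; the function name, the pair layout, and the example labels are all hypothetical.

```python
def independent_stream_count(streams):
    """Count streams that are independent at the level of the generative process.

    `streams` is an iterable of (dataset_label, upstream_source) pairs.
    Two datasets with different labels but the same upstream source
    count only once. The annotation scheme is an assumption of this sketch.
    """
    return len({upstream for _label, upstream in streams})


if __name__ == "__main__":
    example = [
        ("reanalysis_product_a", "model_x"),   # hypothetical labels
        ("reanalysis_product_b", "model_x"),   # same upstream model as above
        ("station_records", "direct_observation"),
    ]
    print(independent_stream_count(example))
```

Three dataset labels appear in the example, but only two distinct upstream sources, so only two streams would count toward confirmation under this reading of the rule.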

Three layers of validation

Every signal in the Observatory passes through three distinct layers of scrutiny before receiving a verdict. Each layer targets a different failure mode and uses a different instrument. No layer is a substitute for either of the others.

Layer 1 — Statistical battery. A deterministic battery of eight statistical tests applied by validator code against the candidate signal’s data: Bonferroni-corrected significance, effect size, out-of-sample hold-out replication, mechanism plausibility review against the domain literature, confound isolation through partial regression, directional falsification across regimes, era and epoch stability, and phase-randomised surrogate specificity. The battery targets statistical artifacts, temporal overfitting, and indistinguishability from structured noise. Its outputs are reproducible from the dossier.

The eight tests are organised into four batteries by the failure mode they address — Battery I (Associative), Battery II (Confound), Battery III (Stability), and Battery IV (Mechanism). Each battery returns a pass, fail, weak, or skip verdict, and a signal’s dossier records all four.
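One way to picture the per-battery record is the sketch below. The battery names and the four possible verdicts come from the text above; the class names, field names, and the validation rule are hypothetical and do not describe the Observatory's actual dossier schema.

```python
from dataclasses import dataclass
from enum import Enum


class BatteryVerdict(Enum):
    PASS = "pass"
    FAIL = "fail"
    WEAK = "weak"
    SKIP = "skip"


# Battery names are taken from the text; the ordering is an assumption.
BATTERIES = ("Associative", "Confound", "Stability", "Mechanism")


@dataclass(frozen=True)
class BatteryRecord:
    """The four battery verdicts recorded for one signal (hypothetical layout)."""
    signal_id: str
    verdicts: dict  # battery name -> BatteryVerdict

    def __post_init__(self):
        # A dossier records all four batteries, so reject partial records.
        missing = [b for b in BATTERIES if b not in self.verdicts]
        if missing:
            raise ValueError(f"missing battery verdicts: {missing}")

    def failed(self):
        """Names of the batteries that returned a fail verdict."""
        return [b for b in BATTERIES if self.verdicts[b] is BatteryVerdict.FAIL]
```

Requiring all four keys at construction time mirrors the statement that a dossier records every battery, including those that were skipped.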

Layer 2 — Devil’s Advocate Protocol. A structured counter-argument pass required for every signal that reaches a tentative CONFIRMED verdict. The researcher must explicitly state the strongest case for why the signal might be a false positive: where the data is thin, where the mechanism requires an untested assumption, what specific observation would refute the finding. The protocol is documented for every signal, whether ultimately published or not, and interrogates both the statistical verdict of Layer 1 and the epistemic judgment of Layer 3.

Layer 3 — Whewell epistemic rubric. Nine criteria drawn from the historical philosophy of science and associated with William Whewell’s doctrine of consilience: prediction, consilience across independent data streams, mechanism plausibility, mechanism citation in peer-reviewed literature, falsifiability, specificity against negative controls, reproducibility by independent code paths or personnel, accuracy of quantitative estimates, and generalisability beyond the discovery sample. The rubric is scored by structured review of the full signal dossier, including Layer 1’s outputs. Three of its nine criteria — consilience, mechanism citation, and reproducibility — have no analogue in the statistical battery, and it is chiefly through those three that the rubric catches classes of error the battery cannot see.

Convergent verdicts across all three layers constitute the strongest confirmation we issue. Divergence between layers triggers a second-pass investigation rather than a publication decision. A signal is considered fully audited only when all three layers have returned a verdict and all three are visible in its dossier.
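The convergence rule can be sketched as a small decision function. It assumes each layer's outcome has already been reduced to a comparable label, which is a simplification; the function name and the returned status strings are hypothetical.

```python
from typing import Optional


def audit_status(statistical: Optional[str],
                 devils_advocate: Optional[str],
                 rubric: Optional[str]) -> str:
    """Combine three layer verdicts into an audit status.

    None means the layer has not yet returned a verdict.
    The status labels are illustrative, not the Observatory's vocabulary.
    """
    layers = (statistical, devils_advocate, rubric)
    if any(v is None for v in layers):
        return "incomplete"      # not fully audited until all three report
    if len(set(layers)) == 1:
        return "convergent"      # all three layers agree
    return "second-pass"         # divergence triggers further investigation


if __name__ == "__main__":
    print(audit_status("pass", "pass", "pass"))
    print(audit_status("pass", "fail", "pass"))
    print(audit_status("pass", None, "pass"))
```

Note that divergence never maps to a publication decision in this sketch; it only routes the signal back for a second pass, as the text describes.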

The integrated framework — the statistical battery, the Devil’s Advocate protocol, the nine-criterion Whewell rubric, and the audit gate that checks a draft against its dossier before publication — is collectively the Whewell Gate. The remainder of this site uses “Whewell Gate” as shorthand for the whole; the individual component names (statistical battery, Whewell rubric, audit gate) continue to refer to the parts.

Verdicts

Signals are assigned a verdict after evaluation. Verdicts are grouped into tiers:

Strong confirmation

Supported with qualification

Preliminary

Inactive

We publish the killed signals list. A research programme that never kills anything is not doing science.

Trading signals with sufficient forward-return data are additionally subjected to a forward-return sub-battery targeting overfitting, negative-control specificity, and tolerance robustness. Signals that fail its core pass-fail criteria are retired regardless of prior verdict. Several signals that carried CONFIRMED or SUGGESTIVE verdicts from the primary validation process have been killed by this sub-battery — the kill list reflects this.

One calibration note on what verdicts mean: even a CONSILIENCE verdict is best read as “a high-confidence hypothesis that has survived rigorous internal stress-testing and converges across independent domains” - not as settled law. These findings have not been subjected to adversarial external peer review or independent replication by unaffiliated teams. That is the remaining gap between what we do and what would be required for a finding to enter the scientific consensus. We are explicit about this because we think conflating the two is the most common failure mode in private research.

The multiple comparisons problem

We have formally evaluated more than fifteen hundred independent hypotheses through the validation framework, with the full count maintained live in the Observatory. A substantial fraction of the evaluated set has been killed — a retirement rate high enough to indicate the process is actually discriminating, not just confirming. Current counts are maintained on the killed signals page, which is the authoritative record.

At conventional significance thresholds, some confirmed findings are false positives by chance alone — this is a statistical inevitability of large-scale pattern search. We do not pretend otherwise.
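The arithmetic behind this inevitability is short enough to sketch. The example uses a round 1,500 tests and a 0.05 threshold, both illustrative; the function names are hypothetical, and the family over which the Observatory applies its Bonferroni correction is not specified here.

```python
def expected_false_positives(n_true_nulls: int, alpha: float) -> float:
    """Expected number of true-null hypotheses that pass at level alpha."""
    return n_true_nulls * alpha


def bonferroni_threshold(n_tests: int, alpha: float) -> float:
    """Per-test threshold that holds the family-wise error rate at or below alpha."""
    return alpha / n_tests


if __name__ == "__main__":
    n = 1500  # illustrative round figure, not the live Observatory count
    print(expected_false_positives(n, 0.05))
    print(bonferroni_threshold(n, 0.05))
```

Under these illustrative inputs, about 75 pure-noise hypotheses would clear an uncorrected 0.05 threshold if every tested hypothesis were null, while the Bonferroni per-test threshold drops to roughly 0.000033.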

Our partial mitigations: we require effect sizes, not just p-values; we apply surrogate controls that correct for autocorrelation inflation; and the CONSILIENCE upgrade requires genuinely independent replication, which random noise cannot produce. But the honest position is that findings should be weighted by effect size, the number of independent validation streams, and mechanism coherence — not treated as equivalent because they share a verdict label.
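A phase-randomised surrogate test, the kind of control named above, can be sketched with the standard library alone. This is a minimal illustration under stated assumptions: the test statistic (absolute Pearson correlation), the surrogate count, and every function name are hypothetical choices, and the naive transform is only suitable for short series.

```python
import cmath
import math
import random


def dft(x):
    """Naive discrete Fourier transform, O(n^2); fine for short series."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft_real(X):
    """Inverse transform, keeping the real part of each sample."""
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)).real / n
            for t in range(n)]


def phase_randomised(x, rng):
    """Surrogate with the same power spectrum as x but random Fourier phases.

    Keeping the spectrum keeps the (circular) autocorrelation, so the
    surrogate is as smooth as the original while sharing none of its timing.
    """
    n = len(x)
    X = dft(x)
    Y = list(X)  # the mean term and, for even n, the Nyquist term stay unchanged
    for k in range(1, (n - 1) // 2 + 1):
        phase = rng.uniform(0.0, 2.0 * math.pi)
        Y[k] = abs(X[k]) * cmath.exp(1j * phase)
        Y[n - k] = Y[k].conjugate()  # Hermitian symmetry keeps the output real
    return idft_real(Y)


def pearson(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    va = sum((p - ma) ** 2 for p in a)
    vb = sum((q - mb) ** 2 for q in b)
    return cov / math.sqrt(va * vb)


def surrogate_p_value(x, y, n_surrogates=200, seed=0):
    """Share of surrogates of x that correlate with y at least as strongly as x does."""
    rng = random.Random(seed)
    observed = abs(pearson(x, y))
    hits = sum(abs(pearson(phase_randomised(x, rng), y)) >= observed
               for _ in range(n_surrogates))
    return (hits + 1) / (n_surrogates + 1)  # add-one keeps the estimate above zero
```

Because each surrogate retains the autocorrelation of the original series, a correlation that is merely a by-product of two slow-moving series will be matched by many surrogates and receive a large p-value, which is the inflation this control is meant to remove.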

What we publish — and what we don’t

We publish empirically validated signals and their historical performance metrics, including the statistical tests used, effect sizes, and where applicable, forward-return backtests. This includes the trading signals section, which lists battery-verified signals with their historical hit rates and return characteristics.

What we do not publish: active position guidance, near-term entry or exit timing, or portfolio construction recommendations. Structural signals tell you that conditions are present. They do not tell you whether the market has already priced those conditions in, or when the structural force will express itself in price. That distinction is important and we will not blur it.

Origins

The Observatory was founded in early 2026. Its structural ancestor was an internal convergence detector built for a newswire pipeline — a mechanism for flagging when two independent agent streams arrived at the same variable. The step that mattered was redirecting that mechanic outward at published external data.

The signal catalogue was seeded with five heterogeneous starting points — central bank linguistics, solar geomagnetic activity, volcanic aerosols, ENSO, Ramadan consumption cycles — chosen specifically to avoid a unifying thesis. The kill rate was structural from the first validator run.

An early illustration of the framework’s discipline: the Tzolkin 260-day corn cycle was confirmed in the first week and retired three weeks later, when the full battery found that the 365-day annual harvest harmonic dominates the 260-day peak, leaving a specificity margin too small to clear threshold. The early confirmation and its subsequent retraction sit on the same page in the record. That is the design.

The Whewell rubric was drafted in parallel with the initial signal list, not retrofitted afterward. Its current nine-criterion form is the third revision, with the mechanism criterion split into plausibility and citation after early validation runs revealed they fail independently.

The operating principle has not changed: a system that only confirms its own hypotheses has no epistemic value.

On the use of AI

Every paper and bulletin here is produced by AI. We do not hide this.

Consilience research requires holding independent data streams from unrelated disciplines in parallel - tracking climate physics, agricultural history, epidemiology, and long-wave economic cycles simultaneously. No single human specialist commands all of these. AI executes the parallel analysis and generates the research agenda — Observatory agents surface candidate signals autonomously. A human operator applies the publication standards, approves the quality gate, and decides what is released.

The risk this introduces - and we name it directly - is the most serious operational risk in our process. AI systems are excellent at pattern-matching across domains and unreliable at original statistical rigour when not tightly constrained. They will generate plausible-sounding statistics that do not appear in the underlying source data.

Our guard against this is structural, not aspirational. The pipeline separates three operations that a single generation pass conflates: claim extraction (each source assertion is typed and recorded before any prose is written); deterministic resolution of source conflicts by explicit rules rather than the model’s in-prose judgment; and claim-referenced synthesis, in which the generator writes under a hard constraint that every checkable token must trace back to a claim that survived resolution. A verification pass then maps every number, date, and named entity in the finished prose against its backing claim, surfacing unbacked tokens as reviewable items rather than letting them pass as fluent text. The validation files themselves are produced by deterministic statistical code, not by the AI that writes the papers; this separation is the critical safeguard. If that loop were ever closed (AI generating both the validation results and the papers summarising them), the entire output would become sophisticated hallucination. We have not closed that loop, and we do not intend to.
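The verification pass can be pictured with a deliberately narrow sketch that checks numbers only; dates and named entities would need their own extractors. The regular expression, the function name, and the representation of claims as plain strings are assumptions of this example, not the pipeline's actual design.

```python
import re

# Matches integers and simple decimals; a real pass would need a richer grammar.
NUMBER = re.compile(r"\d+(?:\.\d+)?")


def unbacked_numbers(prose, claims):
    """Return numeric tokens in `prose` that appear in no backing claim.

    `claims` is a list of claim strings that survived conflict resolution
    (a hypothetical representation chosen for brevity).
    """
    backed = set()
    for claim in claims:
        backed.update(NUMBER.findall(claim))
    return [tok for tok in NUMBER.findall(prose) if tok not in backed]


if __name__ == "__main__":
    claims = ["effect size 0.42"]
    prose = "The effect was 0.42 over 12 years."
    print(unbacked_numbers(prose, claims))  # the 12 has no backing claim
```

The function returns the unbacked tokens rather than raising an error, in keeping with the description of surfacing them as reviewable items for a human operator.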

Human oversight is not a formality here. It is the editorial process. The architecture underlying this approach is described in Verifiable, Not Eliminated.

Definitional Anchors

Signals whose verdict is DEFINITIONAL — foundational facts validated by established convention, not by the statistical battery.

Methodology Taxonomy

Canonical dimensions, batteries, and verdict definitions used across all Consiliences research.

Verdict Tiers

The full dossier verdict set — the canonical vocabulary recorded in each signal's dossier, and its mapping to the user-facing six-tier presentation set.