What consilience means
In 1840, William Whewell coined the phrase “consilience of inductions” to describe the strongest form of scientific confirmation: a hypothesis earns its most compelling support not from the evidence that generated it, but from independent evidence in domains it was never designed to explain.
Newton’s theory of gravity was not just confirmed by falling objects. It was confirmed by its ability to explain planetary orbits, tidal patterns, and the precession of the equinoxes - none of which were used to build it. The convergence of independent lines of evidence on a single underlying mechanism is what Whewell called consilience. It remains the most reliable signal we have that a hypothesis is tracking something real rather than an artifact of the data used to discover it.
We apply this as a formal quality criterion: a pattern must survive confirmation from genuinely independent sources before it is worth reporting. Agreement across series derived from the same upstream model or methodology does not count. Independence must be traced to the generative process, not just the label on the dataset.
The eight validation dimensions
Every signal in the Observatory is evaluated against eight dimensions before receiving a verdict. A finding must pass the majority of these - with no fatal failures - to qualify for publication.
1. Literature synthesis and prior effect sizes. We collect published correlations, effect sizes, and mechanism descriptions from peer-reviewed sources. This establishes what prior research predicts and at what magnitude.
2. Primary associative test. The headline claim is tested directly: correlation, regression, or lag analysis on the relevant time series. Effect size and confidence interval are reported alongside statistical significance. We use the word “causal” only where a verified physical mechanism connects driver to outcome. Where the test is purely observational, we say so. Economists would rightly want instrumental variables or natural experiments in some cases; we note where our evidence falls short of that standard.
3. Confound and artifact control. We test for the most plausible alternative explanations: secular trend, ENSO influence, volcanic aerosol injection, autocorrelation inflation. The signal must survive these controls, not just the headline test.
4. Permutation and surrogate significance testing. We generate randomised surrogate series that preserve the autocorrelation structure of the original data and rerun the primary test against them. A genuine signal should be distinguishable from its own noise floor; a minimal sketch of this procedure appears after this list.
5. Era-splitting and regime stability. The relationship is tested in at least two non-overlapping time periods. Patterns that hold only in the discovery sample and dissolve in the out-of-sample period are downgraded or killed.
6. Devil’s advocate assessment. A structured adversarial review identifies the weakest points in the evidential chain: where is the data thin? Where does the mechanism require an untested assumption? What would kill this finding? This is documented in the validation record for every signal, whether published or not.
7. Cross-signal consilience check. Where two or more independent research traditions - using different instruments, different time periods, different methodologies - converge on the same mechanism, the signal is upgraded to a CONSILIENCE verdict. This is the Whewell test applied formally: does the hypothesis explain phenomena it was not designed to explain?
8. Mechanism plausibility review. We verify that the causal chain between the proposed driver and the observed outcome is physically or biologically coherent. Statistical association without a plausible mechanism receives a lower confidence designation.
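To make dimensions 2 and 4 concrete, here is a minimal sketch of the surrogate significance test, using a lagged Pearson correlation as the primary statistic and AR(1) surrogates that preserve the lag-1 autocorrelation of the driver series. The variable names, the AR(1) choice (phase randomisation is a common alternative), and the surrogate count are illustrative assumptions, not the Observatory's exact implementation.

```python
import numpy as np

def lagged_corr(driver, outcome, lag=1):
    """Pearson correlation between the driver at time t and the outcome at t + lag."""
    if lag > 0:
        driver, outcome = driver[:-lag], outcome[lag:]
    return np.corrcoef(driver, outcome)[0, 1]

def ar1_surrogate(series, rng):
    """Random series matching the mean, variance and lag-1 autocorrelation of the original."""
    x = series - series.mean()
    phi = np.corrcoef(x[:-1], x[1:])[0, 1]                 # lag-1 autocorrelation
    noise_sd = x.std() * np.sqrt(max(1.0 - phi ** 2, 1e-12))
    out = np.empty_like(x, dtype=float)
    out[0] = rng.normal(0.0, x.std())
    for t in range(1, len(x)):
        out[t] = phi * out[t - 1] + rng.normal(0.0, noise_sd)
    return out + series.mean()

def surrogate_p_value(driver, outcome, lag=1, n_surrogates=1000, seed=0):
    """Two-sided empirical p-value of the observed lagged correlation against an AR(1) null."""
    rng = np.random.default_rng(seed)
    observed = lagged_corr(driver, outcome, lag)
    null = np.array([lagged_corr(ar1_surrogate(driver, rng), outcome, lag)
                     for _ in range(n_surrogates)])
    p_value = (np.abs(null) >= abs(observed)).mean()
    return observed, p_value
```

The point of the surrogate null is exactly the autocorrelation-inflation problem named in dimension 3: two slowly varying series can show large nominal correlations by construction, so the observed statistic has to beat surrogates that share that persistence.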
Verdicts
Signals are assigned one of five verdicts after evaluation:
- CONSILIENCE - All eight dimensions passed; two or more independent research traditions converge on the same mechanism
- CONFIRMED - Six or more dimensions passed, no fatal flaws
- CONFIRMED WITH CAVEATS - Core finding holds but one or more specific limitations (era-dependence, mechanism uncertainty, small N) are documented and binding
- SUSPENDED - Insufficient data for resolution; under continued monitoring
- NOISE / KILLED - Failed on primary test or fatal confound identified; permanently retired
We publish the killed signals list. A research programme that never kills anything is not doing science.
Trading signals with sufficient forward-return data are additionally subjected to a five-test finding-validator battery: Monte Carlo null, blind era-split replication, specificity against negative controls, mechanism sensitivity, and tolerance testing. Signals that fail all four of the battery's pass-fail criteria are retired regardless of prior verdict. Several signals that carried CONFIRMED or SUGGESTIVE verdicts from the primary validation process have been killed by this battery; the kill list reflects this.
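As one concrete illustration from the battery, a blind era-split replication check might look like the sketch below. The pass rule shown (same sign and nominal significance in both non-overlapping halves) and the plain Pearson p-value are assumptions for exposition, not the published criteria; in practice the autocorrelation-aware surrogate test sketched earlier would replace the naive p-value.

```python
import numpy as np
from scipy.stats import pearsonr

def era_split_replicates(driver, outcome, alpha=0.05):
    """True if the association keeps its sign and stays nominally significant in both halves."""
    mid = len(driver) // 2
    eras = [(driver[:mid], outcome[:mid]), (driver[mid:], outcome[mid:])]
    stats = [pearsonr(d, o) for d, o in eras]              # (r, p) for each era
    same_sign = np.sign(stats[0][0]) == np.sign(stats[1][0])
    both_significant = all(p < alpha for _, p in stats)
    return bool(same_sign and both_significant), stats
```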
One calibration note on what verdicts mean: even a CONSILIENCE verdict is best read as “a high-confidence hypothesis that has survived rigorous internal stress-testing and converges across independent domains” - not as settled law. These findings have not been subjected to adversarial external peer review or independent replication by unaffiliated teams. That is the remaining gap between what we do and what would be required for a finding to enter the scientific consensus. We are explicit about this because we think conflating the two is the most common failure mode in private research.
The multiple comparisons problem
We have formally evaluated over 400 independent hypotheses through the eight-dimension validation framework, with a further several hundred signals catalogued from the research literature and awaiting full validation. Of the formally evaluated set, over 120 have been killed — a retirement rate high enough to indicate the process is actually discriminating, not just confirming. Current counts are maintained on the killed signals page, which is the authoritative record.
At conventional significance thresholds, some confirmed findings are false positives by chance alone — this is a statistical inevitability of large-scale pattern search. We do not pretend otherwise.
Our partial mitigations: we require effect sizes, not just p-values; we apply surrogate controls that correct for autocorrelation inflation; and the CONSILIENCE upgrade requires genuinely independent replication, which random noise is very unlikely to produce. But the honest position is that findings should be weighted by effect size, the number of independent validation streams, and mechanism coherence, not treated as equivalent because they share a verdict label.
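The scale of the problem, and the value the consilience requirement adds, can be shown with back-of-envelope arithmetic. The counts below are round numbers for exposition, assuming a global null and fully independent tests; they are not the Observatory's actual tallies.

```python
n_hypotheses = 400      # order of magnitude of the formally evaluated set
alpha = 0.05            # conventional significance threshold

# Expected chance passes if every hypothesis were truly null:
expected_single_test = n_hypotheses * alpha              # 20.0

# Requiring an independent replication at the same threshold, the logic behind
# the CONSILIENCE upgrade, squares the per-hypothesis false-positive
# probability and shrinks the expected chance passes accordingly:
expected_with_replication = n_hypotheses * alpha ** 2    # 1.0

print(expected_single_test, expected_with_replication)
```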
What we publish — and what we don’t
We publish empirically validated signals and their historical performance metrics, including the statistical tests used, effect sizes, and where applicable, forward-return backtests. This includes the trading signals section, which lists battery-verified signals with their historical hit rates and return characteristics.
What we do not publish: active position guidance, near-term entry or exit timing, or portfolio construction recommendations. Structural signals tell you that conditions are present. They do not tell you whether the market has already priced those conditions in, or when the structural force will express itself in price. That distinction is important and we will not blur it.
On the use of AI
Every paper and bulletin here is produced by AI. We do not hide this.
Consilience research requires holding independent data streams from unrelated disciplines in parallel - tracking climate physics, agricultural history, epidemiology, and long-wave economic cycles simultaneously. No single human specialist commands all of these. AI executes the parallel analysis and generates the research agenda — Observatory agents surface candidate signals autonomously. A human operator applies the publication standards, approves the quality gate, and decides what is released.
The risk this introduces is the most serious operational one in our process, and we name it directly: AI systems are excellent at pattern-matching across domains and unreliable at original statistical rigour when not tightly constrained. They will generate plausible-sounding statistics that do not appear in the underlying source data. This has happened once: a generated paper cited a correlation coefficient that did not exist in any validation file. It was caught at editorial review.
Our guard against this is structural, not aspirational. Every numerical claim in a published piece must trace to a specific value in the Observatory’s validation files. A second-pass automated validator checks the draft against those files before it reaches human review. Claims that cannot be sourced are removed. The validation files themselves are produced by deterministic statistical code, not by the AI that writes the papers - this separation is the critical safeguard. If that loop were ever closed (AI generating both the validation results and the papers summarising them), the entire output would become sophisticated hallucination. We have not closed that loop, and we do not intend to.
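A minimal sketch of what such a second-pass check can look like is below, assuming the validation files are JSON and the draft is plain text. The directory layout, the number-matching regex, and the tolerance are illustrative assumptions; the real validator's matching rules are not published here.

```python
import json
import math
import re
from pathlib import Path

NUMBER = re.compile(r"-?\d+(?:\.\d+)?")

def load_validated_values(validation_dir):
    """Collect every numeric leaf value from the validation JSON files."""
    values = set()

    def walk(node):
        if isinstance(node, dict):
            for v in node.values():
                walk(v)
        elif isinstance(node, list):
            for v in node:
                walk(v)
        elif isinstance(node, (int, float)) and not isinstance(node, bool):
            values.add(float(node))

    for path in Path(validation_dir).glob("*.json"):
        walk(json.loads(path.read_text()))
    return values

def unsourced_claims(draft_text, validated_values, rel_tol=1e-3):
    """Return numbers in the draft that match no value in the validation files."""
    claims = [float(m) for m in NUMBER.findall(draft_text)]
    return [c for c in claims
            if not any(math.isclose(c, v, rel_tol=rel_tol) for v in validated_values)]
```

Claims flagged by a check like this are either re-sourced or removed before the draft reaches human review.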
Human oversight is not a formality here. It is the editorial process.
