A falsification harness for trading signals and the agents that act on them — strategy evaluation, benchmarking, and decision audit.
Touchstone takes any scalar signal aligned to a price series and tries to kill the
claim that it has an edge. It emits a machine-readable verdict — SURVIVES or
NO_EDGE — after a five-stage pipeline that most backtests skip. It does the same to a
decision layer — an LLM council, a rule-based filter, or a hybrid gate — and asks
whether its vetoes actually remove worse-than-random trades or just add latency and burn
tokens (the Gate Selectivity Test): SELECTIVE or NOT_SELECTIVE.
It is honest by construction: both axes are orthogonal to PnL, and the tool says so in every output. A passing verdict is necessary, not sufficient for a profitable system.
Subject #1 is the author's own funding-rate strategy (
examples/perceptrade-frz/), driven by an LLM council; touchstone reports itNO_EDGEwith aNOT_SELECTIVEgate. The first thing it kills is its maker's.
Demo walkthrough (≤3 min): https://youtu.be/ukLcTlRxl7M
No dependencies. Node 18+. Nothing to configure; the bundled examples run out of the box.
git clone https://github.com/cryptohakka/touchstone
cd touchstone
# the falsification case — the author's own funding-rate signal → NO_EDGE
node src/index.mjs examples/perceptrade-frz/input.json --map price=btcPrice signal=frZ
# the positive control — a synthetic signal with a planted edge → SURVIVES
node src/index.mjs examples/synthetic-leak/input.jsonThe first command runs touchstone on its own subject #1 and reports NO_EDGE — the
falsification case the whole tool is built around. The second is a synthetic signal with a
planted edge; it returns SURVIVES, proving the harness is not a timid always-veto but
detects real edge when it's there. perceptrade-frz = the falsification case; synthetic =
proof of operation. See "Self-validation" and "Subject #1" below.
You'll get a human summary on stderr and a verdict JSON on stdout. node --test runs a
deterministic, seeded test suite. node src/index.mjs --help is a complete reference.
| axis | question | verdict | needs |
|---|---|---|---|
| signal alpha | does this signal have edge that survives correction? | NO_EDGE / SURVIVES |
price+signal series |
| gate selectivity | does this agent's veto layer block worse-than-random trades, or just randomly? | NOT_SELECTIVE / SELECTIVE |
+ a block/veto decision log |
Both are orthogonal to realized PnL.
A JSON array. Each record needs a timestamp, a price, and one scalar signal:
[{ "timestamp": "2026-06-06T00:00:00.000Z", "btcPrice": 65960.71, "frZ": -1.591 },
{ "timestamp": "2026-06-06T00:05:00.000Z", "btcPrice": 65913.48, "frZ": -1.462 },
{ "timestamp": "2026-06-06T00:10:00.000Z", "btcPrice": 65871.80, "frZ": -1.374 }]- timestamp — anything
Date.parseaccepts (ISO 8601 recommended). - price — the underlying instrument's price (number).
- signal — one scalar per snapshot (number). A z-score, a deviation, a confidence, anything 1-dimensional.
If your field names are exactly timestamp / price / signal, no --map is needed.
Otherwise map them: the example above uses btcPrice / frZ, so it needs
--map price=btcPrice signal=frZ. Dotted paths work: signal=directionSignal.frZ.
If your bot or backtest already logs a price and a decision score each step, you're done in one step: dump those to a JSON array and map the field names.
# centered z-like signal (funding-z, MACD histogram, VWAP deviation):
node src/index.mjs mydata.json --map price=close signal=zscore
# non-centered signal (RSI 0..100, a 0..1 confidence score): standardize it first
node src/index.mjs mydata.json --map price=close signal=rsi14 --standardize--standardize z-scores the signal to mean 0 / sd 1 so the default --threshold 1.5
means "1.5 SD" regardless of the signal's native scale.
Have a CSV? Convert it to the JSON array touchstone reads with a one-liner (no deps):
node -e 'const fs=require("fs"),[h,...r]=fs.readFileSync("mydata.csv","utf8").trim().split(/\r?\n/);const k=h.split(",");console.log(JSON.stringify(r.map(l=>Object.fromEntries(l.split(",").map((v,i)=>[k[i],isNaN(+v)?v:+v])))))' > mydata.jsonThen map your column names as usual: --map time=time price=close signal=rsi.
Read it as: NO_EDGE = this signal's apparent edge does not survive correction;
SURVIVES = at least one horizon×side cell beats Benjamini-Hochberg. min_p above the
BH threshold is the usual cause of death — a result that looks significant alone
(min_p=0.0897) but fails once you account for testing 6 cells. power bounds the
honesty of a NO_EDGE: at this episode count the smallest reliably-detectable effect is
d≈0.97, so no edge detected ≠ no edge exists.
- Forward return at each firing, over each horizon.
- Beta-adjust — subtract the horizon's market drift, so gains that were just the underlying's beta don't count as signal.
- Episode-collapse — merge consecutive same-sign firings into independent episodes. This is where t-stats inflated by autocorrelation die: on the subject signal the strongest cell's naive |t|≈4.9 collapses to episode |t|<2, and three of the six cells flip sign once each event is counted once instead of once per 5-minute firing.
- Multiple comparison (Benjamini-Hochberg + Bonferroni) across all horizon×side cells
that survive a minimum-episode guard: cells left with fewer than
--min-episodes(default 5) independent episodes are excluded from testing — a truncated horizon reduced to a couple of episodes can fire an explosive t-stat off near-zero variance, and the harness refuses to vote on its own degenerate cells. Excluded cells are still reported, underunderpowered_excluded. - Power — report the minimum detectable effect at the episode-level n, so
NO_EDGEis bounded rather than overclaimed.
Naive vs episode t on the frZ subject. 785 autocorrelated firings collapse to 35 independent episodes; min_p = 0.090 > the BH threshold 0.0083 → NO_EDGE.
Touchstone validates itself on your data. --selfcheck runs three checks:
node src/index.mjs examples/perceptrade-frz/input.json --map price=btcPrice signal=frZ \
--selfcheck --json selfcheck.json- detection power — graded future-leak injection: as genuine signal is mixed in, the
verdict must flip
NO_EDGE → SURVIVESwithmin_pfalling monotonically. On the frZ subject this happens at α=0.35 (Spearman −0.81). The harness is not a timid always-veto; it detects real edge when it's there. - no false positive — a time-shuffled copy of the signal must stay
NO_EDGE. It does. - direction invariance — sign-flipping the signal (z → −z) must leave
min_punchanged, since the verdict is two-sided. It does. This catches accidental directional bias in the harness.
The subject signal's own baseline is NO_EDGE with a raw min_p that looks tempting in
isolation but dies under correction — while the harness clearly detects a planted edge
at α=0.35. Read together: the tool finds edge when it exists; the subject just doesn't
have it.
Detection-power curve. Injecting a graded future-return leak into the dead frZ signal drives min_p below the BH threshold at α=0.35 (Spearman −0.81) — the harness detects real edge rather than always vetoing.
Your LLM council vetoes trades. Your rule-based filter blocks signals. Are those vetoes actually removing worse-than-random trades, or just adding latency and burning tokens? The Gate Selectivity Test answers this for any decision producer — LLM, rule-based, or hybrid — whose accept/reject calls can be logged:
- LLM-driven councils (Architect/Auditor/Arbiter, a single-LLM gate, ensemble votes)
- rule-based filters (volatility gates, drawdown caps, momentum checks)
- hybrid gates (LLM + heuristic)
- any decision producer that emits accept/reject per signal firing
Given a log of trades a gate blocked, touchstone bootstraps random blocks from the same firing population and asks whether the gate's blocked set is significantly worse-than-random. Two guards are baked in:
- the gate's actual mean is recomputed with the same PnL formula as the baseline (never compared across different bases);
- the baseline side is derived from each drawn point's own signal, not the recorded block list.
SELECTIVE means the gate discriminates within the firing population. It does not mean
the system is profitable. A SELECTIVE gate on a NO_EDGE signal correctly does nothing —
and on subject #1 touchstone reports both axes negative (NO_EDGE + NOT_SELECTIVE),
which is internally consistent: selectivity and edge are orthogonal, and both negative means
the system has neither a signal to act on nor a gate that beats coin-flipping. No tool in the
overfitting-detection family below tests the gate layer — this axis is specific to touchstone.
The gate's PnL uses an execution model: default leverage 2 and round-trip fee 0.24%
(0.06% maker+taker, two legs, at 2x). Override with --leverage and --fee; both the gate's
actual mean and the random baseline use the same values, so the verdict is fee-symmetric. The bootstrap is seeded (--seed, default 12345), so the firing-pool percentile is reproducible across runs.
Decision-log format: any array of resolved records carrying a pnl field, an optional
side, and an entry timestamp (entry_time / entryTime / opened_at / … auto-detected):
[{ "resolved": true, "side": "short", "entry_time": "...", "virtual_pnl_pct": -0.31 }]If your log uses different keys, map them — no need to rewrite the file. Side values are
normalized (long|l|buy|1 → long, short|s|sell|-1 → short):
# log records look like { blocked:true, dir:'S', ts:'...', pnl_pct:-0.31 }
node src/index.mjs data.json --map price=close signal=rsi --standardize \
--gate skips.json --gate-map resolved=blocked side=dir pnl=pnl_pctSee node src/index.mjs --help for the canonical list with examples. Summary:
--map, --standardize, --threshold, --horizons, --gap, --min-episodes, --sph, --gate,
--selfcheck, --leak-horizon, --seed, --diag, --json.
For agents and LLMs integrating touchstone programmatically.
Input series — Array<{ [timeField]: string|number, [priceField]: number, [signalField]: number }>.
Default field names timestamp/price/signal; override with --map k=field. Records
with non-finite mapped values are dropped. Minimum 30 usable rows.
Decision log (--gate) — Array<{ resolved: true, [pnlField]: number, side?: "long"|"short", <entryTimeField>: string|number }>.
pnlField defaults to virtual_pnl_pct. Entry-time auto-detected from
entry_time|entryTime|opened_at|openedAt|entry_ts|timestamp|time|ts.
Verdict output (default mode) — object with:
headline: string, signal_alpha: { verdict: "NO_EDGE"|"SURVIVES", n_snapshots, n_episodes, threshold, cadence_snapshots_per_hour, horizons_hours, multiple_comparison: { method, M, min_episodes, bh_rejected: string[], bonferroni_survivors: string[], min_p, min_p_cell, underpowered_excluded: [{ label, nEpisodes }] }, power: [{ n_episodes, min_detectable_d }], note },
gate_selectivity: null | { verdict: "SELECTIVE"|"NOT_SELECTIVE", horizon_min, execution, n_blocks, basis_used: "recomputed"|"recorded", mapped_coverage, coverage_warning, firing_pool: { size, gate_mean, random_mean, percentile }, all_pool, note },
disclaimer: string.
Self-check output (--selfcheck) — object with:
baseline_signal: { min_p, bh_survivors, verdict },
detection_power: { leak_curve: [{ alpha, min_p, bh_survivors, verdict }], first_survive_alpha, spearman_alpha_minp, passes },
null_shuffle: { verdict, passes }, null_sign_flip: { verdict, passes },
summary: { detection_power, no_false_positive, direction_invariant, overall }.
Exit codes — 0 success; 1 bad usage or unusable input. Human summary on stderr,
JSON on stdout (or to --json <file>). Reproducible: node --test runs a deterministic,
seeded suite (data seed 7, shuffle-null seed 12345) asserting noise → NO_EDGE, a planted
leak → SURVIVES, and that the synthetic self-check first flips at α=0.2. (The α=0.35
figure in Self-validation is the frZ subject, which is harder to revive than synthetic
noise — consistent with it carrying no edge of its own.)
Five bundled examples double as verifiable usage records (examples/). Every real
signal tested — the author's own and a classic indicator — fails; only the synthetic
positive controls pass, which is how you know the harness discriminates.
| example | verdict | what it shows |
|---|---|---|
random-walk-null |
NO_EDGE |
pure noise — the harness does not invent edge |
perceptrade-frz |
NO_EDGE + gate NOT_SELECTIVE |
the author's funding-rate strategy and its LLM council |
rsi-mean-reversion |
NO_EDGE |
RSI(14) — a classic indicator is not rubber-stamped |
synthetic-leak |
SURVIVES |
planted future-return leak — alpha-axis detection power |
llm-council (oracle gate) |
SELECTIVE |
a gate that really blocks losers — gate-axis detection power |
The overfitting-detection family — Probabilistic Sharpe Ratio, Deflated Sharpe Ratio,
Probability of Backtest Overfitting (Bailey & López de Prado; pypbo, AuditZK's PBO
calculator) — corrects selection bias across many strategy trials / parameter
configurations, operating on returns or Sharpe ratios.
Touchstone is complementary, not competing. It works one level upstream, on a single signal's raw series, and corrects a different set of failure modes: intra-signal firing autocorrelation (via episode-collapse), multiple horizons/sides within one signal, market beta leakage, and — uniquely — gate-layer selectivity. It does not trust a reported Sharpe; it recomputes forward returns from price. Use DSR/PBO to judge a search over many strategies; use touchstone to judge one signal and its veto layer.
- native asymmetric / bounded signals — currently handled via the
--standardizeworkaround; native support would skip standardization and allow a custom one-sidedsideRule - continuous position-sizing signals (weighted episodes)
- optional DSR-style correction as an additional cell-level adjustment
- multi-regime conditioning (the current subject window is a single risk-on regime)
MIT
{ "headline": "signal: NO_EDGE | gate: NOT_SELECTIVE", "signal_alpha": { "verdict": "NO_EDGE", // no cell survived multiple-comparison correction "n_episodes": 35, // autocorrelated firings collapsed to 35 independent episodes "multiple_comparison": { "method": "BH", "M": 6, // 6 cells tested (3 horizons x 2 sides), all >= min_episodes "min_episodes": 5, // cells with fewer independent episodes are excluded "bh_rejected": [], // survivors after Benjamini-Hochberg "min_p": 0.0897, // best raw p-value; > BH threshold 0.00833 -> dies "underpowered_excluded": [] // [{ label, nEpisodes }] for any cell dropped by the guard }, "power": [{ "n_episodes": 13, "min_detectable_d": 0.97 }], // NO_EDGE is bounded by this "note": "No detectable edge != no edge exists." }, "gate_selectivity": { "verdict": "NOT_SELECTIVE", "firing_pool": { "gate_mean": -0.3074, "random_mean": -0.2493, "percentile": 14.8 }, "note": "Gate selectivity is orthogonal to PnL." }, "disclaimer": "Both axes are orthogonal to realized PnL." }