Methodology
Every control we run, in the open
Validraft maps to an institutional validation framework of 159 controls spanning the full research lifecycle — from economic rationale to governance. On every engagement we run the subset that fits your hypothesis, with no black-box score. This is what “institutional rigor” actually means, control by control.
Not a checklist. A scoped validation profile.
A single un-optimized rule and a parameter-swept factor model need different scrutiny, so each control is required, conditional, diagnostic, or not applicable depending on the strategy family. Multiple-testing corrections, for example, matter for a search, not for one fixed rule.
We pick the validation profile that matches your idea, then the report states exactly which controls ran, which passed, and which were not applicable — and why. Every result is descriptive, never prescriptive: research and simulation only, not investment advice.
Phase 0
Economic rationale & pre-registration
An edge needs a reason before it needs a backtest.
5 controls
Formalized economic hypothesis
The idea is written down as a testable hypothesis — what the edge is, why it should exist, and what would falsify it — before any compute runs.
Edge classification
The proposed edge is tagged by type (risk premium, behavioral, structural, information) so the validation profile matches the claim.
Ex-ante expected value
A first-principles estimate of the edge net of expected costs, so a result can be compared against what was predicted upfront.
Publication bias & factor decay review
For published or well-known anomalies, we assess decay since publication and crowding before trusting historical strength.
Causal robustness checklist
A structured five-question check that the relationship is plausibly causal, not a coincidence mined from the data.
Phase 1
Data integrity & point-in-time
Most fake edges are really data bugs.
7 controls
Survivorship-bias correction
Equity universes are reconstructed point-in-time, including delisted names, so the test is not run only on today's survivors.
Point-in-time validation
Every input carries an as-of timestamp; data is joined as it was known at the time, with an explicit event-leakage gate.
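A minimal sketch of an as-of lookup, assuming each input is stored as `(as_of_date, value)` pairs sorted by date. The function name and the sample earnings series are hypothetical, not Validraft's implementation.

```python
from bisect import bisect_right
from datetime import date

def asof_value(records, decision_date):
    """Return the latest value whose as-of date is on or before decision_date.

    records: list of (as_of_date, value) tuples sorted by as_of_date ascending.
    Returns None when nothing was known yet on decision_date.
    """
    as_of_dates = [as_of for as_of, _ in records]
    idx = bisect_right(as_of_dates, decision_date)
    return records[idx - 1][1] if idx > 0 else None

# Hypothetical series: the second figure only became known on 2024-05-02.
eps = [(date(2024, 2, 1), 1.10), (date(2024, 5, 2), 1.25)]
known_before_release = asof_value(eps, date(2024, 5, 1))  # still the older figure
```

A decision dated one day before the release sees the older value, which is the behavior a leakage gate would need to confirm.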
Corporate-actions adjustment
Splits, dividends, and IPOs are materialized and applied so prices and returns are continuous and correct across events.
Data-quality audit
Inputs are screened for NaNs, duplicates, gaps, and outliers before anything runs — a hard gate, not a best-effort check.
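As an illustration only, a hard gate of this kind could count issues and refuse to continue if any are found. The function names and the `max_gap` parameter are assumptions, and outlier screening is left out for brevity.

```python
import math

def audit_series(timestamps, values, max_gap):
    """Count missing values, duplicate timestamps, and gaps wider than max_gap."""
    ordered = sorted(set(timestamps))
    return {
        "nan": sum(1 for v in values if v is None or math.isnan(v)),
        "duplicates": len(timestamps) - len(set(timestamps)),
        "gaps": sum(1 for a, b in zip(ordered, ordered[1:]) if b - a > max_gap),
    }

def hard_gate(issues):
    """Raise instead of warning, so nothing downstream runs on bad inputs."""
    failed = {name: count for name, count in issues.items() if count > 0}
    if failed:
        raise ValueError(f"data-quality gate failed: {failed}")

# Hypothetical series with one NaN, one duplicate timestamp, and one gap.
issues = audit_series([1, 2, 2, 5], [1.0, float("nan"), 2.0, 3.0], max_gap=1)
```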
Timezone & holiday alignment
Per-market calendars align sessions and cross-asset series, so signals never read across mismatched clocks.
Pipeline isolation
Ingestion, backtest, and validation are separate stages with a snapshot manifest — no validation step can leak back into the data.
Investable-universe construction
The tradable universe is built point-in-time from a universe master and symbol resolver, not hand-picked after the fact.
Phase 2 & 3
Signal & feature integrity
Where look-ahead and overfitting hide.
7 controls
Look-ahead bias detection
Features are audited for any use of information not available at decision time, including macro/treasury series joined on their as-of date.
Feature leakage detection
Inputs that smell like the answer (forward, target, label, outcome, pnl) are flagged before they can inflate a result.
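A name-based screen along these lines might look like the sketch below. The token list comes from the description above; the substring-matching rule and the column names are assumptions.

```python
import re

# Tokens from the control description; matching by substring is an assumption.
SUSPECT_TOKENS = ("forward", "target", "label", "outcome", "pnl")

def flag_suspect_features(columns):
    """Return the column names that contain any suspect token (case-insensitive)."""
    pattern = re.compile("|".join(SUSPECT_TOKENS), re.IGNORECASE)
    return [name for name in columns if pattern.search(name)]

flagged = flag_suspect_features(
    ["rsi_14", "fwd_ret_5d", "forward_return_5d", "PnL_next", "volume"]
)
```

Note that `fwd_ret_5d` passes this screen, which is why a name check can only complement an audit of how each feature is actually computed.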
Overfitting diagnosis
Walk-forward, Deflated Sharpe, PBO, haircut Sharpe, and parameter sensitivity combine to expose curve-fitting rather than edge.
Stationarity tests (ADF / KPSS)
Signal features are tested for stationarity, with cointegration checks (Engle–Granger + half-life) where a spread thesis applies.
Information Coefficient (IC / ICIR)
For continuous signals, predictive power is measured with Pearson/Spearman IC, rolling ICIR, and decay — robust p-values included.
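For readers who want the arithmetic, here is a small standard-library sketch of a Spearman IC and an ICIR. It assumes one cross-section of signal values and forward returns per period; the helper names are illustrative.

```python
from statistics import mean, stdev

def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman_ic(signal, forward_returns):
    """Spearman IC: Pearson correlation of the ranks."""
    return pearson(average_ranks(signal), average_ranks(forward_returns))

def icir(ic_series):
    """Mean IC divided by its sample standard deviation across periods."""
    return mean(ic_series) / stdev(ic_series)
```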
Feature stability & redundancy
Rolling IC stability plus opt-in PCA diagnostics (explained variance, condition number, loading stability) catch unstable or redundant predictors.
Parameter sensitivity analysis
Parameters are perturbed one at a time and re-tested, so a result that only works on one knife-edge setting is caught.
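One-at-a-time perturbation can be sketched as follows. The `evaluate` callback, the relative step sizes, and the toy objective are all hypothetical stand-ins for a real backtest.

```python
def one_at_a_time(evaluate, base_params, rel_steps=(-0.2, -0.1, 0.1, 0.2)):
    """Perturb each parameter alone and record the change in score versus the base."""
    base_score = evaluate(base_params)
    rows = []
    for name, value in base_params.items():
        for step in rel_steps:
            trial = dict(base_params, **{name: value * (1 + step)})
            rows.append((name, step, evaluate(trial) - base_score))
    return base_score, rows

# Toy objective: peaks at lookback = 20 and ignores the threshold entirely.
def toy_score(params):
    return 1.0 - 0.001 * (params["lookback"] - 20) ** 2

base_score, rows = one_at_a_time(toy_score, {"lookback": 20, "threshold": 1.5})
```

A sharp drop in score for small steps on one parameter is the knife-edge pattern this control is meant to catch.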
Phase 3.5
Execution & cost realism
A backtest with no costs is a fantasy.
6 controls
Bid-ask spread simulation
A configurable spread is subtracted from PnL; tick-level engines use declared slippage so fills are not assumed at mid.
Commission & financing costs
Per-share, per-contract, and basis-point fees are wired in; carry/financing is applied where overnight, leveraged, or short exposure requires it.
Market-impact modeling
A square-root / coefficient impact proxy is applied where relevant, with an execution-realism gate that flags when a proxy isn't strong enough for the claim.
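A common form of the square-root proxy scales daily volatility by the square root of participation. The sketch below assumes that form; the `coefficient` default of 1.0 is a placeholder that would need calibration, not a recommended value.

```python
import math

def sqrt_impact_bps(order_shares, adv_shares, daily_vol, coefficient=1.0):
    """Impact estimate in basis points: coefficient * daily_vol * sqrt(order / ADV)."""
    if adv_shares <= 0:
        raise ValueError("adv_shares must be positive")
    participation = order_shares / adv_shares
    return coefficient * daily_vol * math.sqrt(participation) * 1e4

# Hypothetical order: 1% of average daily volume in a name with 2% daily volatility.
impact = sqrt_impact_bps(order_shares=10_000, adv_shares=1_000_000, daily_vol=0.02)
```

With these inputs the estimate is about 20 basis points, and it grows with the square root of order size rather than linearly.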
Turnover analysis
Annualized turnover is reported so cost sensitivity and capacity can be judged against the strategy's trading intensity.
Capacity analysis
A static capacity gate estimates how much size the strategy could absorb given volume and turnover before impact erodes the edge.
Short-borrow cost
Borrow cost is charged on overnight/multi-day shorts; locate and hard-to-borrow constraints are surfaced where they affect tradability.
Phase 4
Statistical validation
The moat: is the edge real, or luck?
12 controls
Out-of-sample testing
A frozen candidate is tested on data never used in selection, with explicit train/test windows.
Walk-forward analysis
Rolling re-optimization and evaluation over time, so performance is judged out-of-sample at every step, not on one lucky split.
Combinatorial Purged CV (CPCV)
Combinatorial folds with purging and embargo give a distribution of out-of-sample outcomes instead of a single path.
Deflated Sharpe Ratio (DSR)
Sharpe is corrected for skew, kurtosis, and the number of trials tested — the headline number you can actually trust.
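The sketch below follows the published Deflated Sharpe Ratio formula of Bailey and López de Prado, under the assumption that Sharpe values are per-period (not annualized) and kurtosis is raw (3 for a normal distribution). Function and argument names are illustrative.

```python
import math
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329
_NORMAL = NormalDist()

def expected_max_sharpe(sr_std, n_trials):
    """Expected maximum Sharpe across n_trials when the true Sharpe is zero."""
    if n_trials < 2:
        return 0.0  # a single trial needs no multiple-testing benchmark
    return sr_std * (
        (1 - EULER_GAMMA) * _NORMAL.inv_cdf(1 - 1 / n_trials)
        + EULER_GAMMA * _NORMAL.inv_cdf(1 - 1 / (n_trials * math.e))
    )

def deflated_sharpe(sr, n_obs, skew, kurt, sr_std, n_trials):
    """Probability that the true Sharpe exceeds the benchmark implied by n_trials.

    sr and sr_std (the dispersion of Sharpe across trials) are per-period values.
    """
    sr0 = expected_max_sharpe(sr_std, n_trials)
    denom = math.sqrt(1 - skew * sr + (kurt - 1) / 4 * sr ** 2)
    return _NORMAL.cdf((sr - sr0) * math.sqrt(n_obs - 1) / denom)

# Same track record, judged as a single test versus the best of 100 trials.
single = deflated_sharpe(0.1, 1000, -0.5, 5.0, 0.05, n_trials=1)
searched = deflated_sharpe(0.1, 1000, -0.5, 5.0, 0.05, n_trials=100)
```

The same Sharpe looks far less convincing once it is known to be the winner of a search, which is the point of the correction.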
Probability of Backtest Overfitting (PBO)
CSCV estimates the probability that the selected configuration is best in-sample but not out-of-sample.
Haircut Sharpe (Harvey–Liu)
Sharpe is discounted for multiple testing across everything that was tried, so a single survivor isn't mistaken for skill.
Permutation / null-edge tests
Returns are permuted to break the signal–outcome link; a real edge must beat the block-bootstrap and signal-target null distributions.
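A plain shuffle version of the idea can be sketched as below. It assumes independent observations; the block-bootstrap nulls mentioned above would resample contiguous blocks instead, and the statistic used here (mean of signal times return) is just one illustrative choice.

```python
import random

def permutation_pvalue(signal, returns, n_perm=999, seed=0):
    """One-sided p-value for mean(signal * return) under shuffled returns."""
    rng = random.Random(seed)

    def stat(rets):
        return sum(s * r for s, r in zip(signal, rets)) / len(signal)

    observed = stat(returns)
    shuffled = list(returns)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # breaks the pairing between signal and outcome
        if stat(shuffled) >= observed:
            exceed += 1
    # The +1 terms count the observed sample as one draw from the null.
    return (1 + exceed) / (1 + n_perm)

signal = [1, -1] * 50
p_aligned = permutation_pvalue(signal, [0.01 * s for s in signal])
p_unrelated = permutation_pvalue(signal, [0.01] * 100)
```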
Walk-forward permutation test
The walk-forward Sharpe is compared against a block-permuted null distribution, with the full null saved as an artifact.
Multiple-testing correction
Bonferroni / Holm / Benjamini–Hochberg, plus Hansen SPA and Romano–Wolf step-down for post-selection inference on families of candidates.
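The three p-value corrections named first can be sketched in a few lines each. This is a standard-library illustration with made-up p-values; Hansen SPA and Romano-Wolf need resampling and are not shown.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Reject when p <= alpha / m; controls the family-wise error rate."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def holm_reject(p_values, alpha=0.05):
    """Holm step-down: stop at the first sorted p-value that fails its threshold."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break
    return reject

def benjamini_hochberg_reject(p_values, alpha=0.05):
    """BH step-up: reject everything up to the largest rank passing alpha * k / m."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = -1
    for rank, idx in enumerate(order):
        if p_values[idx] <= alpha * (rank + 1) / m:
            cutoff = rank
    reject = [False] * m
    for rank, idx in enumerate(order):
        reject[idx] = rank <= cutoff
    return reject

p_values = [0.001, 0.012, 0.025, 0.039, 0.20]  # hypothetical candidate p-values
```

On these inputs Bonferroni keeps one candidate, Holm keeps two, and Benjamini-Hochberg keeps four, reflecting the looser false-discovery-rate target.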
Autocorrelation (Ljung–Box)
Return autocorrelation is tested so a Sharpe inflated by serial correlation is identified and interpreted correctly.
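For reference, the Ljung-Box statistic is shown below, where $n$ is the number of return observations, $\hat{\rho}_k$ is the lag-$k$ sample autocorrelation, and $h$ is the number of lags tested (the choice of $h$ is an assumption of whoever runs the test).

```latex
Q = n(n+2) \sum_{k=1}^{h} \frac{\hat{\rho}_k^{2}}{n-k},
\qquad
Q \sim \chi^{2}_{h} \ \text{under the null of no autocorrelation up to lag } h
```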
Purging & embargo
Overlapping labels and look-ahead around events are removed with purge and embargo windows, enforced down to the day slice.
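A simplified index-based sketch of purging and embargo, assuming sample `i` carries a label that spans `[i, i + label_horizon]` and the test block is the half-open range `[test_start, test_end)`. Names and the integer indexing are illustrative.

```python
def purged_train_indices(n_samples, test_start, test_end, label_horizon, embargo):
    """Training indices left after purging overlaps and applying an embargo.

    Before the test block, a sample is kept only if its label ends before
    test_start. After the block, a sample is kept only once the last test
    label has ended and a further `embargo` samples have passed.
    """
    train = []
    for i in range(n_samples):
        ends_before_test = i + label_horizon < test_start
        starts_after_buffer = i >= test_end + label_horizon + embargo
        if ends_before_test or starts_after_buffer:
            train.append(i)
    return train

# 20 samples, test block 8..11, labels look 2 steps ahead, 1-step embargo.
train = purged_train_indices(20, test_start=8, test_end=12, label_horizon=2, embargo=1)
```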
Minimum backtest length (MinBTL)
The observed window is checked against the minimum length needed to support the Sharpe claim, with a warning when it falls short.
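One published approximation of MinBTL (Bailey, Borwein, López de Prado, and Zhu) gives the number of years needed so that the best of `n_trials` independent zero-skill configurations is not expected to reach the target annualized Sharpe. The sketch assumes that formula; whether the framework uses exactly this variant is not stated here.

```python
import math
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329

def min_backtest_length_years(n_trials, target_sharpe_annual):
    """Approximate MinBTL in years for n_trials independent configurations."""
    if n_trials < 2:
        raise ValueError("the approximation needs at least two trials")
    z = NormalDist().inv_cdf
    expected_max = (
        (1 - EULER_GAMMA) * z(1 - 1 / n_trials)
        + EULER_GAMMA * z(1 - 1 / (n_trials * math.e))
    )
    return (expected_max / target_sharpe_annual) ** 2

def window_too_short(observed_years, n_trials, target_sharpe_annual):
    """True when the observed window falls short of the approximate minimum."""
    return observed_years < min_backtest_length_years(n_trials, target_sharpe_annual)

years_needed = min_backtest_length_years(n_trials=45, target_sharpe_annual=1.0)
```

For 45 trials and a target Sharpe of 1.0 the approximation lands near five years, so a three-year window would trigger the warning.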
Phase 4.5
Stress testing & regime analysis
How it behaves when markets break.
7 controls
Historical crisis scenarios
Survival is measured through built-in crisis windows (2008, 2020, 2022, and more), with worst-case drawdown and return per scenario.
Regime-conditional analysis
Performance is split across volatility, trend, liquidity, and extreme-event regimes, with an empirical transition matrix.
Regime contract (ex-ante vs ex-post)
If a regime drives a filter, sizing, or selection, it must be defined ahead of time — circular, in-sample regime definitions are rejected.
Hidden Markov & change-point
Opt-in HMM regime calibration and BOCPD change-point detection surface structural breaks in behavior.
Monte Carlo path risk
Trade-sequence permutation, stationary bootstrap, and regime-switching resampling produce drawdown, terminal-return, and ruin distributions.
Perturbation & synthetic scenarios
Realized returns are perturbed and stressed against synthetic macro scenarios to probe fragility beyond the realized path.
Liquidity stress
A capacity/cost replay under widened spreads and volume caps tests whether the edge survives a thinner market.
Phase 5
Performance metrics
Beyond a single headline Sharpe.
5 controls
Risk-adjusted returns
Sharpe, Sortino, and Calmar on net returns — not gross, not pre-cost.
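As a rough sketch, the three ratios can be computed from per-period net returns as below. It assumes a zero risk-free rate, 252 periods per year by default, and a series that contains at least one loss; all names are illustrative.

```python
import math
from statistics import mean, stdev

def net_of_costs(gross_returns, costs):
    """Subtract per-period costs so every ratio is computed on net returns."""
    return [g - c for g, c in zip(gross_returns, costs)]

def sharpe(returns, periods_per_year=252):
    return mean(returns) / stdev(returns) * math.sqrt(periods_per_year)

def sortino(returns, periods_per_year=252, target=0.0):
    """Like Sharpe, but the denominator only counts returns below the target."""
    downside = [min(r - target, 0.0) ** 2 for r in returns]
    downside_dev = math.sqrt(sum(downside) / len(returns))
    return (mean(returns) - target) / downside_dev * math.sqrt(periods_per_year)

def max_drawdown(returns):
    """Largest peak-to-trough loss of the compounded equity curve, as a fraction."""
    equity, peak, worst = 1.0, 1.0, 0.0
    for r in returns:
        equity *= 1 + r
        peak = max(peak, equity)
        worst = max(worst, 1 - equity / peak)
    return worst

def calmar(returns, periods_per_year=252):
    """Annualized compound return divided by maximum drawdown."""
    growth = math.prod(1 + r for r in returns)
    annualized = growth ** (periods_per_year / len(returns)) - 1
    return annualized / max_drawdown(returns)

gross = [0.01, -0.02, 0.015, 0.005, -0.01, 0.02]  # hypothetical daily returns
net = net_of_costs(gross, [0.001] * len(gross))
```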
Drawdown family
Max drawdown with duration and time-to-recovery, plus Ulcer, Martin/MAR, Burke, and Pain index for the full pain profile.
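Two members of this family can be sketched from an equity curve as follows. Durations are counted in periods, and the function names and sample curve are hypothetical.

```python
import math

def drawdown_profile(equity):
    """Max drawdown with its peak-to-trough duration and time to recovery."""
    peak_idx = 0
    best = {"max_drawdown": 0.0, "peak": 0, "trough": 0}
    for i, value in enumerate(equity):
        if value > equity[peak_idx]:
            peak_idx = i
        dd = 1 - value / equity[peak_idx]
        if dd > best["max_drawdown"]:
            best = {"max_drawdown": dd, "peak": peak_idx, "trough": i}
    recovery = None  # stays None if the curve never regains the prior peak
    for i in range(best["trough"] + 1, len(equity)):
        if equity[i] >= equity[best["peak"]]:
            recovery = i - best["trough"]
            break
    return {
        "max_drawdown": best["max_drawdown"],
        "duration": best["trough"] - best["peak"],
        "time_to_recovery": recovery,
    }

def ulcer_index(equity):
    """Root-mean-square of percentage drawdowns from the running peak."""
    peak, squares = equity[0], []
    for value in equity:
        peak = max(peak, value)
        squares.append((100 * (value / peak - 1)) ** 2)
    return math.sqrt(sum(squares) / len(squares))

profile = drawdown_profile([100, 120, 90, 96, 125])
```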
Tail risk (CVaR / CDaR)
Conditional value-at-risk and conditional drawdown-at-risk quantify the tail, not just the average.
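A historical (non-parametric) CVaR can be sketched as the average of the worst periods. The `tail` argument and the sample returns are assumptions; the same averaging applied to a drawdown series would give a CDaR-style figure.

```python
def historical_cvar(returns, tail=0.05):
    """Average loss over the worst `tail` share of periods, as a positive number."""
    losses = sorted((-r for r in returns), reverse=True)  # largest loss first
    # The small epsilon guards against float error in len * tail; at least one
    # observation is always kept in the tail.
    n_tail = max(1, int(len(losses) * tail + 1e-9))
    return sum(losses[:n_tail]) / n_tail

sample = [-0.05, -0.03, 0.01, 0.02, -0.01, 0.03, 0.00, 0.01, -0.02, 0.04]
cvar_80 = historical_cvar(sample, tail=0.2)  # averages the two worst periods
```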
Distribution shape
Omega, gain-to-pain, profit factor, hit rate, and bias-corrected skew/kurtosis describe the actual return distribution.
Equity-curve stability
Equity-curve R² measures how steadily the curve compounds versus how much it relies on a few outlier periods.
Phase 5.5
Benchmarking & attribution
Is it alpha, or just beta you could buy cheaply?
6 controls
Buy-and-hold benchmark
Results are compared to a relevant buy-and-hold benchmark on total return — required, not optional, so excess return is explicit.
Alpha / beta decomposition
OLS alpha, beta, and R², reported with the overlap window used, separate genuine excess return from market exposure.
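For a single benchmark, the decomposition reduces to a one-variable regression. The sketch below assumes both series are already aligned on the same overlap window; the names and sample data are illustrative.

```python
from statistics import mean

def alpha_beta(strategy, benchmark):
    """OLS of strategy returns on benchmark returns: (alpha, beta, r_squared)."""
    if len(strategy) != len(benchmark):
        raise ValueError("series must cover the same overlap window")
    ms, mb = mean(strategy), mean(benchmark)
    cov = sum((s - ms) * (b - mb) for s, b in zip(strategy, benchmark))
    var_b = sum((b - mb) ** 2 for b in benchmark)
    var_s = sum((s - ms) ** 2 for s in strategy)
    beta = cov / var_b
    alpha = ms - beta * mb  # per-period intercept, not annualized
    r_squared = cov ** 2 / (var_b * var_s)
    return alpha, beta, r_squared

# Hypothetical strategy that is half the benchmark plus a small constant.
benchmark = [0.01, -0.02, 0.03, 0.00]
strategy = [0.001 + 0.5 * b for b in benchmark]
alpha, beta, r_squared = alpha_beta(strategy, benchmark)
```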
Multi-factor attribution (FF5 + UMD)
Returns are regressed on Fama–French five factors plus momentum, against a versioned factor snapshot, to see what really drives them.
Information ratio
Active return over tracking error quantifies consistency of outperformance versus the benchmark.
Tail-risk correlation
Correlation to the benchmark during its worst periods is checked, so a strategy isn't quietly long the same crash.
Market-neutral residual test
For market-neutral and stat-arb claims, residual beta and R² are bounded to confirm the neutrality the strategy advertises.
Phase 6
Risk, sizing & portfolio
Surviving long enough for the edge to pay.
5 controls
Ruin & drawdown analysis
Monte Carlo ruin probability by leverage and expected/percentile future max-drawdown frame the real risk of capital impairment.
Position sizing (Kelly / vol target)
Fractional-Kelly budgeting and feature-based volatility targeting size positions with explicit leverage caps.
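Both sizing rules can be sketched as leverage multipliers with a hard cap. The sketch assumes the continuous-time Kelly approximation (mean excess return over variance, both measured over the same period); the default fraction and cap are placeholders.

```python
def fractional_kelly_leverage(mean_excess, variance, fraction=0.5, max_leverage=2.0):
    """Fraction of the Kelly leverage mean / variance, clipped to +/- max_leverage."""
    if variance <= 0:
        raise ValueError("variance must be positive")
    full_kelly = mean_excess / variance
    return max(-max_leverage, min(max_leverage, fraction * full_kelly))

def vol_target_leverage(target_vol, forecast_vol, max_leverage=2.0):
    """Scale exposure so forecast volatility matches the target, up to the cap."""
    if forecast_vol <= 0:
        raise ValueError("forecast_vol must be positive")
    return min(max_leverage, target_vol / forecast_vol)

# Hypothetical inputs: 8% excess return, 20% volatility (variance 0.04).
half_kelly = fractional_kelly_leverage(0.08, 0.04)
```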
Risk-parity & constrained allocation
Inverse-volatility / risk-parity weighting with min/max-weight constraints and a bounded simplex projection for multi-leg books.
Correlation & crowding (internal)
Candidate correlation and marginal-Sharpe contribution are checked against the existing book to avoid redundant exposure.
Drawdown limits & kill-switch
Max-drawdown and daily-loss limits are modeled so risk controls are part of the test, not an afterthought.
Phase 8–10
Governance, lineage & reproducibility
A result you can audit is a result you can trust.
6 controls
Pre-registration & sign-off
The hypothesis is hashed and pre-registered before the run; a formal phase-gate sign-off records the GO / NO-GO decision.
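One way to hash a hypothesis reproducibly is to serialize it canonically first, as in this sketch. The field names are hypothetical, and SHA-256 over sorted-key JSON is an assumption rather than the documented scheme.

```python
import hashlib
import json

def hypothesis_hash(hypothesis):
    """SHA-256 of a canonical JSON encoding, so key order does not affect the hash."""
    canonical = json.dumps(
        hypothesis, sort_keys=True, separators=(",", ":"), ensure_ascii=True
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Hypothetical pre-registration record.
record = {"edge": "momentum", "universe": "US large cap", "falsifier": "IC <= 0"}
digest = hypothesis_hash(record)
```

Any later change to the hypothesis text produces a different digest, which is what makes the pre-registration checkable after the run.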
Versioned data & code
Every run records the git SHA, data hash, and engine version, so a result is tied to the exact inputs that produced it.
End-to-end data lineage
A manifest plus feature-level lineage traces every number in the report back to its raw source.
Audit trail of decisions
Scoping and validation decisions are captured in the run manifest — the reasoning is part of the deliverable, not lost in chat.
Model card
An auto-generated model card documents the strategy, its assumptions, and its limitations alongside the results.
Reproducibility
Immutable raw data, an immutable feature store, and a determinism check support independent re-running of the result.
Where we draw the line
What a validation is not
Rigor includes being honest about scope. An engagement is an offline validation of your hypothesis — these things are deliberately outside it.
- Live order execution or order routing — research and simulation only.
- Managed paper-trading or live-vs-backtest parity monitoring on your behalf.
- Capital allocation, ramp-up, or any management of real money.
- Investment advice — every report is descriptive, never prescriptive.
Put your idea through all of it
Describe your hypothesis in a brief. We scope the right validation profile and return the full report — the same controls, run by humans.