Why most backtests lie, and how to tell if yours is one of them
Learning Objectives
After this session you will be able to:
Explain in plain language what a backtest is and what makes one trustworthy
Describe selection bias and backtest overfitting using a coin-flip intuition
Contrast a naive train/test split with walk-forward validation
Interpret the Probability of Backtest Overfitting (PBO) from CSCV
Report a Sharpe ratio honestly using PSR and DSR
Apply the honest-reporting template to the Coursework 2 momentum scaffold
Where We Are
| Week | Question | Answer |
|---|---|---|
| 3 | Can we predict returns? | Barely: ARIMA R² ≈ 1% |
| 4 | Can we predict risk? | Yes: GARCH R² ≈ 15–40% |
| 5 | Can we build portfolios? | Mean-variance, with estimation error |
| 9 | Which stocks outperform? | Factor premia with HAC-robust t-tests |
| 10 | Would any of this have worked? | Today: honest backtesting |
Every week so far has taught you a model. This week teaches validation instead: the research discipline that decides which of your models you should actually believe.
Part 0: What Even Is a Backtest?
What Is a Backtest?
A backtest is a simulation: you run today’s trading rule on yesterday’s data to see what would have happened.
Three ingredients:
A rule — e.g. “each month, buy the 30% cheapest stocks by book-to-market, short the 30% most expensive”
A history — prices, returns, fundamentals for a window (say 1990–2020)
A performance metric — cumulative return, Sharpe ratio, drawdown
The output is a single equity curve and a handful of numbers. It looks authoritative. That is the problem.
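The three ingredients can be sketched in a few lines. This is a toy illustration, not the coursework scaffold: the rule (a single-asset trailing-return sign rule), the synthetic history, and the helper names `run_backtest`, `annualised_sharpe` and `max_drawdown` are all assumptions made for the example.

```python
import math
import random
import statistics

def run_backtest(returns, lookback=12):
    """Rule: hold +1 if the trailing `lookback`-period return sum is positive, else -1."""
    strategy_returns = []
    for t in range(lookback, len(returns)):
        signal = sum(returns[t - lookback:t])      # uses data up to t-1 only
        position = 1.0 if signal > 0 else -1.0
        strategy_returns.append(position * returns[t])
    return strategy_returns

def annualised_sharpe(rets, periods_per_year=12):
    """Metric 1: mean over standard deviation, scaled to annual terms."""
    return statistics.mean(rets) / statistics.stdev(rets) * math.sqrt(periods_per_year)

def max_drawdown(rets):
    """Metric 2: worst peak-to-trough fall of the compounded equity curve."""
    equity, peak, worst = 1.0, 1.0, 0.0
    for r in rets:
        equity *= 1.0 + r
        peak = max(peak, equity)
        worst = min(worst, equity / peak - 1.0)
    return worst

# History: 30 years of synthetic monthly returns (hypothetical drift and volatility).
rng = random.Random(0)
history = [rng.gauss(0.005, 0.04) for _ in range(360)]

strat = run_backtest(history, lookback=12)
print(f"months traded : {len(strat)}")
print(f"Sharpe (ann.) : {annualised_sharpe(strat):.2f}")
print(f"max drawdown  : {max_drawdown(strat):.1%}")
```

Note that the position at month t is built from returns strictly before t. Dropping that lag is the most common way a backtest quietly looks ahead.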
What You’re Actually Choosing
“Rule, history, metric” hides a pile of choices. Every lever below is a researcher decision, and every decision is another dimension of your effective search.
Sample
Universe: which assets, listed when, survivorship-cleaned?
Window: start/end dates, one regime or many?
Frequency: daily, weekly, monthly?
Strategy
Signal formula with its lookbacks, thresholds, z-score windows
Rebalance schedule (daily through annual)
Position sizing: equal-weight, inverse-vol, risk parity
A “200-signal search” crossed with 5 cost assumptions and 3 rebalance schedules is a 3,000-trial search. That is the \(N\) the \(\sqrt{2 \log N}\) bound in Part I will be counting.
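The trial count is just the size of the grid. A quick sketch of the arithmetic, with placeholder labels for the rebalance schedules (the labels are hypothetical; only the counts 200, 5 and 3 come from the text):

```python
import itertools
import math

signals = range(200)                                   # 200 candidate signal definitions
cost_assumptions = range(5)                            # 5 transaction-cost settings
rebalance_schedules = ["daily", "weekly", "monthly"]   # hypothetical labels

# Every combination is one backtest you could have run and reported.
trials = list(itertools.product(signals, cost_assumptions, rebalance_schedules))
n_trials = len(trials)

print(n_trials)                                        # 200 * 5 * 3
print(round(math.sqrt(2 * math.log(n_trials)), 2))     # the sqrt(2 log N) factor
```

Going from 200 to 3,000 trials raises the \(\sqrt{2 \log N}\) factor from about 3.3 to about 4.0: the bar rises slowly, but it always rises.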
What “Authoritative” Looks Like
This is the pitch deck. Sharpe \(\approx\) 1.2, steady compounding, shallow drawdowns.
The catch: the returns above are i.i.d. Gaussian with a positive drift. No rule, no signal, no skill — just np.random.normal(...). The visual authority is entirely decorative.
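A curve of this kind takes a few lines to produce. The sketch below uses the standard library rather than `np.random.normal`, and the 3% monthly volatility and the seed are assumptions; only the target Sharpe of 1.2 comes from the slide.

```python
import math
import random
import statistics

rng = random.Random(7)
target_sharpe = 1.2                 # annualised, as quoted above
monthly_vol = 0.03                  # assumed; not stated in the text
monthly_drift = target_sharpe / math.sqrt(12) * monthly_vol

# i.i.d. Gaussian returns with positive drift: no rule, no signal, no skill.
returns = [rng.gauss(monthly_drift, monthly_vol) for _ in range(240)]

# Compound into the equity curve a pitch deck would show.
equity = [1.0]
for r in returns:
    equity.append(equity[-1] * (1.0 + r))

realised = statistics.mean(returns) / statistics.stdev(returns) * math.sqrt(12)
print(f"final equity: {equity[-1]:.2f}, realised Sharpe: {realised:.2f}")
```

Change the seed and the realised Sharpe moves by a few tenths either way, which is itself a reminder of how noisy a 20-year Sharpe estimate is.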
Why We Backtest At All
Backtests do have legitimate uses. They:
Rule out obviously broken strategies (they lose money in-sample too)
Quantify turnover, drawdown, capacity for a rule that survives
Give a risk budget to a real portfolio manager: “at this size, this rule has historically drawn down 20%”
What they cannot do is prove a strategy will work in the future. A backtest is a filter, not an oracle.
Why Backtests Lie — A Small Story
You try one momentum rule (12-month lookback, 1-month skip). Sharpe = 0.8. Plausible.
You try ten variations (different lookbacks, skips, universes). You keep the best. Sharpe = 1.4. Impressive.
You try two hundred variations. Best Sharpe = 2.3. “Publishable.”
The last number is almost certainly a mirage. It is the Sharpe of the luckiest noise in your search, not of any real edge.
Part I: Selection Bias & Overfitting
The Coin-Flip Analogy
1,024 analysts, ten months, one monthly trade each. Heads = the call paid off (the trade made money); tails = it lost. The coins are fair: no skill, all luck.
Month 1: about half the desk is already down a trade.
Month 10: one or two analysts are still 10-for-10: every monthly trade profitable, zero losing months. They get promoted, featured in pitch decks, and written up as star stock-pickers.
Month 11: they lose money. The coin doesn’t remember it was lucky.
“Perfect” here means track-record-perfect, not skilful. Selection survives, skill does not.
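The arithmetic behind the story is short enough to check directly. A minimal sketch (the seed is arbitrary, and the simulated count will vary from run to run):

```python
import random

n_analysts, n_months = 1024, 10

# Analytically: each analyst is 10-for-10 with probability (1/2)^10.
expected_perfect = n_analysts * 0.5 ** n_months
p_at_least_one = 1.0 - (1.0 - 0.5 ** n_months) ** n_analysts

# Simulation: flip a fair coin for every analyst-month and count perfect records.
rng = random.Random(2024)
n_perfect = sum(
    all(rng.random() < 0.5 for _ in range(n_months))
    for _ in range(n_analysts)
)

print(f"expected perfect records : {expected_perfect:.1f}")
print(f"P(at least one perfect)  : {p_at_least_one:.3f}")
print(f"perfect records this run : {n_perfect}")
```

On average exactly one analyst finishes 10-for-10, and there is roughly a 63% chance that at least one does, with no skill anywhere on the desk.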
The Factor Zoo
Harvey, Liu & Zhu (2016) catalogue 316 factors published in top finance journals between 1967 and 2014.
Most were reported with a t-statistic of 2.0 or higher — the standard “statistically significant” bar for one hypothesis.
But if hundreds of researchers each test a new factor, finding \(t > 2\) by chance alone is almost guaranteed. The 5% false-positive rate compounds.
HLZ recommend a much sterner threshold: \(t > 3.0\) to declare a factor discovered.
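How fast the false-positive rate compounds can be seen with two small helpers. This sketch assumes independent tests and a standard-normal approximation to the t-statistic; both are simplifications, and neither function is taken from HLZ.

```python
import math

def two_sided_p(t):
    """Two-sided p-value for a t-statistic under a standard-normal approximation."""
    return math.erfc(abs(t) / math.sqrt(2.0))

def prob_any_false_positive(n_tests, alpha=0.05):
    """Chance of at least one false positive across n independent true-null tests."""
    return 1.0 - (1.0 - alpha) ** n_tests

for n in (1, 10, 50, 316):
    print(f"{n:>4} tests -> P(at least one lucky 'discovery') = "
          f"{prob_any_false_positive(n):.3f}")

print(f"p-value at t = 2.0: {two_sided_p(2.0):.4f}")
print(f"p-value at t = 3.0: {two_sided_p(3.0):.4f}")
```

Moving the bar from \(t > 2\) to \(t > 3\) cuts the single-test p-value from about 0.046 to about 0.003, which is the kind of margin a crowded search needs.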
Selection Bias: The Best of N
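Search \(N\) strategies, keep the best Sharpe. With no true edge at all, the expected best annualised Sharpe over \(T\) monthly observations is bounded by roughly:
\[\mathbb{E}\left[\max_{n \le N} \widehat{SR}_n\right] \lesssim \sqrt{2 \log N} \cdot \sqrt{\frac{12}{T}} = \sqrt{2 \log 200} \cdot \sqrt{\frac{12}{240}} \approx 0.73\]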
That Sharpe of 0.73 is on pure noise, after 200 tries over 20 years of monthly data.
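The 0.73 figure is consistent with the \(\sqrt{2 \log N}\) bound scaled by the approximate sampling error of an annualised Sharpe, \(\sqrt{12/T}\). The sketch below checks the arithmetic and compares it with a small simulation; it assumes independent strategies (\(\rho = 0\)) and unit-variance Gaussian noise, and the function names are illustrative.

```python
import math
import random

def sharpe(rets):
    """Per-period Sharpe ratio (mean over sample standard deviation)."""
    n = len(rets)
    mean = sum(rets) / n
    var = sum((r - mean) ** 2 for r in rets) / (n - 1)
    return mean / math.sqrt(var)

def best_of_n_bound(n_trials, n_obs, periods_per_year=12):
    """sqrt(2 ln N) times the approximate s.d. of an annualised Sharpe estimate."""
    return math.sqrt(2.0 * math.log(n_trials)) * math.sqrt(periods_per_year / n_obs)

def simulated_best_sharpe(n_trials, n_obs, rng, periods_per_year=12):
    """Best annualised Sharpe among n_trials independent mean-zero noise series."""
    best = max(
        sharpe([rng.gauss(0.0, 1.0) for _ in range(n_obs)])
        for _ in range(n_trials)
    )
    return best * math.sqrt(periods_per_year)

rng = random.Random(0)
bound = best_of_n_bound(200, 240)
sims = [simulated_best_sharpe(200, 240, rng) for _ in range(20)]
avg_best = sum(sims) / len(sims)

print(f"bound for N=200, T=240   : {bound:.2f}")
print(f"average simulated best SR: {avg_best:.2f}")
```

The simulated average sits somewhat below the bound, as an upper bound should allow, but it is still well above the long-run equity premium Sharpe.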
For context, documented long-horizon post-cost factor Sharpes:
| Factor (1963 onwards, US) | Annualised Sharpe |
|---|---|
| US equity premium | \(\approx 0.40\) |
| HML (value) | \(\approx 0.30\) |
| SMB (size) | \(\approx 0.20\) |
| WML (momentum, pre-2000) | \(\approx 0.50\text{ to }0.80\) |
Two hundred naive tries already beat the equity premium. Correlation (\(\rho = 0.2\)) softens the bound but never closes the gap.
The question is not “is my Sharpe high?” It is: is my Sharpe high given that I searched \(N\) strategies?
Watch Selection Bias Inflate — Live
import numpy as np, sys, pathlib
sys.path.append(str(pathlib.Path().resolve().parent))
from scripts.utilities.overfit_metrics import generate_noise_strategies, sharpe_ratio

T = 240  # 20 years of monthly returns
ann = np.sqrt(12)
print(f"{'N':>5}{'best Sharpe (annualised) on pure noise':>40}")
for N in [1, 10, 50, 100, 200, 500]:
    X = generate_noise_strategies(T=T, N=N, rho=0.2, seed=1)
    best = max(sharpe_ratio(X[:, j]) for j in range(N))
    print(f"{N:>5}{best * ann:>40.2f}")
N best Sharpe (annualised) on pure noise
1 -0.15
10 0.22
50 0.21
100 0.24
200 0.26
500 0.51
Every one of these numbers comes from mean-zero noise. The growth is pure selection.
Backtest Overfitting — Defined
Backtest overfitting occurs when a strategy’s in-sample performance reflects the luck of the search rather than any persistent edge.
In-sample Sharpe >> out-of-sample Sharpe (the “best” rule fades)
The rule is unstable — small changes to parameters change the ranking
We need a number that captures this, before we ever see out-of-sample data.
Part II: Walk-Forward Validation
The Naive 80/20 Split
A student’s first instinct: train on the first 80%, test on the last 20%.
What goes wrong:
You still see the whole history when you design the rule, so parameters are picked with look-ahead.
A single test window is one draw from a distribution of possible regimes. 2008 looks different from 2015.
Many results are reported against one test set, so the community implicitly overfits to that set too.
Walk-Forward in Pictures
At each step: fit or select the strategy on only the training window, skip the embargo, then evaluate on the next untouched test window. Roll forward and repeat. Stitch the test results into a single out-of-sample return series.
The small grey gap between train and test is the embargo (López de Prado 2018, Ch. 7). It limits leakage from overlapping labels and from residual serial correlation, so the test window is closer to genuinely out-of-sample. Report the embargo length alongside the train and test lengths.
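The train, embargo, test, roll-forward loop can be sketched as an index generator. This is a minimal rolling-window version under assumed lengths (120 months of training, a 1-month embargo, 12-month test windows); the function name is illustrative, and an expanding-window variant would keep `start` fixed at 0.

```python
def walk_forward_splits(n_obs, train_len, embargo_len, test_len):
    """Yield (train_indices, test_indices) for a rolling walk-forward scheme."""
    start = 0
    while start + train_len + embargo_len + test_len <= n_obs:
        train_end = start + train_len
        test_start = train_end + embargo_len      # skip the embargo
        yield range(start, train_end), range(test_start, test_start + test_len)
        start += test_len                         # roll forward by one test window

splits = list(walk_forward_splits(n_obs=240, train_len=120, embargo_len=1, test_len=12))

for train, test in splits[:3]:
    print(f"train {train.start:>3}..{train.stop - 1:<3}  "
          f"test {test.start:>3}..{test.stop - 1:<3}")
print(f"number of splits: {len(splits)}")
```

Because each step advances by exactly one test length, the test windows tile the timeline without overlap and can be stitched into a single out-of-sample series.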
What to Report From a Walk-Forward
Out-of-sample Sharpe: the Sharpe computed only on test windows. If you re-tune on the test set, it is no longer out-of-sample.
Distribution across windows: each test window produces one Sharpe, one drawdown, one hit-rate. Report the mean and the spread, and ideally plot the individual windows. Good rules are consistent; fragile rules have one hero window and three disasters.
Worst window: the largest drawdown, lowest hit-rate, or highest turnover observed across windows. This is your practical risk budget (the number you tell a portfolio manager).
Parameter stability: does each refit keep picking roughly the same rule (same lookback, same threshold), or does the “best” rule drift every time? Drift is a tell for fitting to noise rather than signal.
Protocol parameters: the exact train length, embargo length, test length, and number of splits used. Changing any of these can flip the result, so they are part of the evidence, not footnotes.
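A per-window summary along these lines is easy to assemble once the test windows are in hand. The sketch below uses made-up monthly returns for three windows; the numbers and the `summarise_windows` helper are hypothetical, and drawdown, hit-rate and parameter stability would be added in the same way.

```python
import math
import statistics

def sharpe(rets, periods_per_year=12):
    """Annualised Sharpe ratio of a list of per-period returns."""
    return statistics.fmean(rets) / statistics.stdev(rets) * math.sqrt(periods_per_year)

def summarise_windows(window_returns):
    """window_returns: one list of out-of-sample returns per test window."""
    per_window = [sharpe(w) for w in window_returns]
    stitched = [r for w in window_returns for r in w]   # single OOS series
    return {
        "oos_sharpe": sharpe(stitched),
        "window_mean": statistics.fmean(per_window),
        "window_spread": statistics.stdev(per_window),
        "worst_window": min(per_window),
        "n_windows": len(per_window),
    }

# Hypothetical out-of-sample monthly returns for three test windows.
windows = [
    [0.02, 0.01, -0.01, 0.03],
    [0.00, -0.02, 0.01, 0.01],
    [0.04, -0.03, 0.02, 0.01],
]
report = summarise_windows(windows)
for key, value in report.items():
    print(f"{key:<14}{value:>8.2f}")
```

Reporting the spread and the worst window next to the stitched Sharpe is what exposes a "one hero window" result.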
Important
Rule of thumb: If in-sample Sharpe is more than about 2× out-of-sample Sharpe, the strategy is probably overfit.
From Walk-Forward to CSCV
Walk-forward validates one rule against multiple time windows.
But in practice we validate many rules — hundreds of parameter combinations in a single research project.
We need a procedure that asks:
Of the strategies that looked best in-sample, how many stay best out-of-sample?
This is what CSCV formalises.
Part III: Measuring Overfit — PBO via CSCV
CSCV — The Picture
Combinatorially Symmetric Cross-Validation (Bailey et al. 2015) works in four steps:
Chop the return history into \(k\) equal, contiguous time-blocks. Worked example: 240 monthly returns with \(k = 10\) gives ten 24-month slices \(S_1, S_2, \ldots, S_{10}\), laid end-to-end along the timeline.
Enumerate every way to deal the 10 slices into two halves of 5. There are \(\binom{10}{5} = 252\) such splits: one half is training, the other is test.
For each of the 252 splits: pick the in-sample winner (highest Sharpe on the 5 training slices), then record its rank among the candidates on the 5 test slices.
Count what fraction of the 252 splits have the in-sample winner scoring below the median out-of-sample.
That fraction is the Probability of Backtest Overfitting (PBO). PBO \(\approx 0.5\) is a coin flip; PBO \(\ll 0.5\) means the winner survives when the calendar is reshuffled.
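The four steps above can be sketched with the standard library. This is an illustrative reimplementation, not the course's `cscv_pbo`: the function name, the plain "below the out-of-sample median" test, and the independent Gaussian noise used in the demo are all assumptions.

```python
import itertools
import math
import random
import statistics

def sharpe(rets):
    """Per-period Sharpe ratio (mean over sample standard deviation)."""
    n = len(rets)
    mean = sum(rets) / n
    var = sum((r - mean) ** 2 for r in rets) / (n - 1)
    return mean / math.sqrt(var)

def cscv_pbo_sketch(strategies, n_blocks=10):
    """strategies: list of N return series of equal length T. Returns (pbo, n_splits)."""
    block_len = len(strategies[0]) // n_blocks    # any trailing remainder is ignored
    all_blocks = range(n_blocks)

    def gather(series, block_ids):
        return [r for b in block_ids for r in series[b * block_len:(b + 1) * block_len]]

    below_median, n_splits = 0, 0
    # Step 2: every way to deal the blocks into a training half and a test half.
    for train_ids in itertools.combinations(all_blocks, n_blocks // 2):
        test_ids = [b for b in all_blocks if b not in train_ids]
        # Step 3: pick the in-sample winner, then look at it out-of-sample.
        is_sr = [sharpe(gather(s, train_ids)) for s in strategies]
        oos_sr = [sharpe(gather(s, test_ids)) for s in strategies]
        winner = max(range(len(strategies)), key=lambda j: is_sr[j])
        # Step 4: count splits where the winner lands below the OOS median.
        if oos_sr[winner] < statistics.median(oos_sr):
            below_median += 1
        n_splits += 1
    return below_median / n_splits, n_splits

rng = random.Random(42)
noise = [[rng.gauss(0.0, 1.0) for _ in range(240)] for _ in range(12)]
pbo_noise, n_splits = cscv_pbo_sketch(noise)
print(f"splits: {n_splits}, PBO on independent noise: {pbo_noise:.3f}")
```

With only a dozen noise strategies a single run can land some way from 0.5, so it is worth repeating over several seeds before reading much into one number.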
Reading a PBO Number
| PBO | Interpretation |
|---|---|
| 0.0 – 0.2 | Credible evidence of persistence |
| 0.2 – 0.5 | Questionable; investigate further |
| 0.5 | Coin-flip: no signal detected |
| > 0.5 | Systematic anti-selection; worse than chance |
On pure noise, PBO centres on ~0.5 by construction. On real signal, PBO drops well below 0.5.
Noise vs. Signal — A Demo
import numpy as np
import sys, pathlib
sys.path.append(str(pathlib.Path().resolve().parent))
from scripts.utilities.overfit_metrics import generate_noise_strategies, cscv_pbo

T, N = 240, 200

# (a) Pure noise: no strategy has an edge
X_noise = generate_noise_strategies(T=T, N=N, rho=0.2, seed=42)
pbo_noise = cscv_pbo(X_noise, n_folds=10).pbo

# (b) Real signal: 10 of 200 strategies carry a genuine edge of ~0.5
# annualised Sharpe (a realistic factor-strength effect for this simulation).
X_edge = generate_noise_strategies(T=T, N=N, rho=0.2, seed=42)
X_edge[:, :10] += 0.50 / np.sqrt(12)
pbo_edge = cscv_pbo(X_edge, n_folds=10).pbo

print(f"PBO on pure noise: {pbo_noise:.3f}")
print(f"PBO with 10/200 edged: {pbo_edge:.3f}")
PBO on pure noise: 0.631
PBO with 10/200 edged: 0.210
On pure noise, the in-sample champion ranks below the OOS median in roughly half the splits, so PBO hovers around \(0.5\). Once a real edge is present, the champion tends to stay near the top OOS, so the fraction-below-median drops well under \(0.5\). That drop is the CSCV diagnostic signal you were looking for.
PBO Cannot Tell You Whether A Strategy Is “Good”
PBO only asks: does in-sample rank predict out-of-sample rank?
A strategy can have low PBO and a terrible Sharpe — it’s a consistent loser. PBO says “the loss is real”, not “the profit is real”.
We need a second lens on magnitude, adjusted for how many strategies we trialled.
That lens is PSR and DSR.
Part IV: Honest Sharpe — PSR & DSR
The Sharpe Ratio — Recap
\[\text{SR} = \frac{\bar{r}}{s} \qquad \text{(excess return over volatility)}\]
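The selection-adjusted benchmark \(SR^*\) is the Sharpe that the best of \(N\) skill-less trials is expected to reach (Bailey and López de Prado 2014):
\[SR^* = \sigma_{SR}\left[(1-\gamma)\,\Phi^{-1}\!\left(1-\frac{1}{N}\right) + \gamma\,\Phi^{-1}\!\left(1-\frac{1}{N e}\right)\right]\]
where \(N\) is the number of trials, \(\sigma_{SR}\) is the standard deviation of SRs across trials, \(\gamma\) is the Euler-Mascheroni constant, and \(\Phi^{-1}\) is the inverse standard normal CDF.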
You don’t need to memorise the formula. The intuition is all that matters:
The more rules you tried, the higher the bar the winner has to clear.
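For readers who want to see the bar rise numerically, here is a minimal sketch of PSR and DSR following the formulas in Bailey and López de Prado (2014). It is not the course's `overfit_metrics` implementation: the function names, the normal-returns defaults (skewness 0, kurtosis 3), the assumption of independent trials, and the choice \(\sigma_{SR} = 1/\sqrt{T}\) are all assumptions made for illustration.

```python
import math
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329
_norm = NormalDist()

def probabilistic_sharpe(sr, sr_benchmark, n_obs, skew=0.0, kurt=3.0):
    """PSR: probability that the true per-period Sharpe exceeds sr_benchmark."""
    denom = math.sqrt(1.0 - skew * sr + (kurt - 1.0) / 4.0 * sr ** 2)
    return _norm.cdf((sr - sr_benchmark) * math.sqrt(n_obs - 1) / denom)

def expected_max_sharpe(n_trials, sr_std):
    """SR*: expected best Sharpe among n_trials skill-less, independent trials."""
    return sr_std * (
        (1.0 - EULER_GAMMA) * _norm.inv_cdf(1.0 - 1.0 / n_trials)
        + EULER_GAMMA * _norm.inv_cdf(1.0 - 1.0 / (n_trials * math.e))
    )

def deflated_sharpe(sr, n_trials, sr_std, n_obs, skew=0.0, kurt=3.0):
    """DSR: PSR measured against the selection-adjusted benchmark SR*."""
    sr_star = expected_max_sharpe(n_trials, sr_std)
    return probabilistic_sharpe(sr, sr_star, n_obs, skew, kurt), sr_star

T = 240
sr_monthly = 0.50 / math.sqrt(12)          # an annualised 0.50, expressed per month
psr = probabilistic_sharpe(sr_monthly, 0.0, n_obs=T)
dsr, sr_star = deflated_sharpe(sr_monthly, n_trials=200,
                               sr_std=1.0 / math.sqrt(T), n_obs=T)

print(f"PSR vs 0            : {psr:.3f}")
print(f"SR* (annualised)    : {sr_star * math.sqrt(12):.2f}")
print(f"DSR vs SR*          : {dsr:.3f}")
```

All inputs are per-period (monthly) Sharpes; annualise only for display. Under these independence assumptions \(SR^*\) comes out higher than in a correlated search, so the resulting DSR is, if anything, conservative.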
PSR vs DSR — The Gap Is The Point
For a noise-only simulation (\(T=240\), \(N=200\), \(\rho=0.2\)):
| Metric | Typical value on noise |
|---|---|
| Best in-sample Sharpe (annualised) | ~ 0.5 |
| Selection-adjusted \(SR^*\) (annualised) | ~ 0.5 |
| PSR against \(SR^* = 0\) | > 0.95 (looks great) |
| DSR against selection-adjusted \(SR^*\) | ~ 0.5 (honest) |
| Gap (PSR − DSR) | Cost of selection |
The gap between PSR and DSR is the number we most want you to internalise. It is the price you pay for searching the haystack.
PSR vs DSR — Live, Noise vs Edge
import numpy as np, sys, pathlib
sys.path.append(str(pathlib.Path().resolve().parent))
from scripts.utilities.overfit_metrics import (
    generate_noise_strategies,
    sharpe_ratio,
    probabilistic_sharpe_ratio,
    deflated_sharpe_ratio,
)

T, N, rho = 240, 200, 0.2
ann = np.sqrt(12)
header = f"{'Scenario':<18}{'SR':>6}{'SR*':>6}{'PSR':>6}{'DSR':>6}"
print(header)
print("-" * len(header))
for label, add_edge in [("Pure noise", 0.0), ("10/200 with edge", 0.05 / np.sqrt(12))]:
    X = generate_noise_strategies(T=T, N=N, rho=rho, seed=42)
    if add_edge:
        X[:, :10] += add_edge
    srs = np.array([sharpe_ratio(X[:, j]) for j in range(N)])
    j = int(np.argmax(srs))
    psr = probabilistic_sharpe_ratio(srs[j], sr_benchmark=0.0, n_obs=T)
    dsr, sr_star = deflated_sharpe_ratio(srs[j], srs, n_obs=T, rho=rho)
    print(f"{label:<18}{srs[j]*ann:>6.2f}{sr_star*ann:>6.2f}{psr:>6.3f}{dsr:>6.3f}")
Scenario SR SR* PSR DSR
----------------------------------------------
Pure noise 0.50 0.50 0.987 0.493
10/200 with edge 0.50 0.50 0.987 0.494
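On pure noise, PSR \(\approx\) 1 but DSR \(\approx\) 0.5: the selection-adjusted benchmark has swallowed the observed Sharpe. A sufficiently strong real edge pushes DSR up toward PSR and the gap narrows. In the run above the injected edge is only 0.05 annualised Sharpe (a tenth of the 0.50 used in the PBO demo), which is too small to register, so the two rows are nearly identical.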
Part V: Honest Reporting
The Research Playbook
When you claim a backtest result, always declare:
Universe and sample: stocks, dates, rebalancing frequency
Trials: how many parameter combinations did you search? (\(N\))
References
Bailey, David H., Jonathan M. Borwein, Marcos López de Prado, and Qiji Jim Zhu. 2015. “The Probability of Backtest Overfitting.” Journal of Computational Finance. https://doi.org/10.2139/ssrn.2326253.
Bailey, David H., and Marcos López de Prado. 2014. “The Deflated Sharpe Ratio: Correcting for Selection Bias, Backtest Overfitting and Non-Normality.” Journal of Portfolio Management 40 (5): 94–107. https://doi.org/10.2139/ssrn.2460551.
Harvey, Campbell R., Yan Liu, and Heqing Zhu. 2016. “... And the Cross-Section of Expected Returns.” Review of Financial Studies 29 (1): 5–68. https://doi.org/10.1093/rfs/hhv059.
Harvey, Campbell R., and Yan Liu. 2020. “False (and Missed) Discoveries in Financial Economics.” Journal of Finance 75 (5): 2503–53. https://doi.org/10.1111/jofi.12960.