Week 10: Backtesting & Validation

Why most backtests lie, and how to tell if yours is one of them

Learning Objectives

After this session you will be able to:

Explain in plain language what a backtest is and what makes one trustworthy
Describe selection bias and backtest overfitting using a coin-flip intuition
Contrast a naive train/test split with walk-forward validation
Interpret the Probability of Backtest Overfitting (PBO) from CSCV
Report a Sharpe ratio honestly using PSR and DSR
Apply the honest-reporting template to the Coursework 2 momentum scaffold

Where We Are

Week	Question	Answer
3	Can we predict returns?	Barely — ARIMA R² ≈ 1%
4	Can we predict risk?	Yes — GARCH R² ≈ 15–40%
5	Can we build portfolios?	Mean-variance, with estimation error
9	Which stocks outperform?	Factor premia with HAC-robust t-tests
10	Would any of this have worked?	Today — honest backtesting

Every week so far has taught you a model. This week is a validation week: a research discipline that decides which of your models you should actually believe.

Part 0: What Even Is a Backtest?

What Is a Backtest?

A backtest is a simulation: you run today’s trading rule on yesterday’s data to see what would have happened.

Three ingredients:

A rule — e.g. “each month, buy the 30% cheapest stocks by book-to-market, short the 30% most expensive”
A history — prices, returns, fundamentals for a window (say 1990–2020)
A performance metric — cumulative return, Sharpe ratio, drawdown

The output is a single equity curve and a handful of numbers. It looks authoritative. That is the problem.

What You’re Actually Choosing

“Rule, history, metric” hides a pile of choices. Every lever below is a researcher decision, and every decision is another dimension of your effective search.

Sample

Universe: which assets, listed when, survivorship-cleaned?
Window: start/end dates, one regime or many?
Frequency: daily, weekly, monthly?

Strategy

Signal formula with its lookbacks, thresholds, z-score windows
Rebalance schedule (daily through annual)
Position sizing: equal-weight, inverse-vol, risk parity

Economics

Costs: commissions, slippage, bid-ask spread, market impact
Risk-free rate, benchmark, annualisation factor
Leverage and short-selling constraints

Validation and search

Train/test split, walk-forward window length, embargo
Parameter combinations tried (\(N\))
Stopping rule: when do you stop searching?

A “200-signal search” crossed with 5 cost assumptions and 3 rebalance schedules is a 3,000-trial search. That is the \(N\) the \(\sqrt{2 \log N}\) bound in Part I will be counting.
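The multiplication is easier to believe once the grid is written out. A minimal sketch, assuming placeholder names for the signals, cost levels and rebalance schedules (none of them come from the coursework scaffold):

```python
from itertools import product

# Hypothetical search dimensions; the labels are placeholders for illustration.
signals = [f"signal_{i}" for i in range(200)]
cost_bps = [0, 5, 10, 20, 50]
rebalance = ["daily", "weekly", "monthly"]

# Every combination is a separate backtest, so each one counts as a trial.
trials = list(product(signals, cost_bps, rebalance))
n_trials = len(trials)
print(f"Effective number of trials N = {n_trials}")
```

Any lever that was varied while looking at results belongs in this product, even if only the final setting is reported.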

What “Authoritative” Looks Like

This is the pitch deck. Sharpe \(\approx\) 1.2, steady compounding, shallow drawdowns.

The catch: the returns above are i.i.d. Gaussian with a positive drift. No rule, no signal, no skill — just np.random.normal(...). The visual authority is entirely decorative.
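A curve like that takes only a few lines to produce. The sketch below is an assumption about how such a figure could be generated, not the code behind the slide: the 3% monthly volatility and the seed are arbitrary, and the drift is set so that the population Sharpe is 1.2 annualised.

```python
import numpy as np

rng = np.random.default_rng(0)           # arbitrary seed
n_months = 240
target_sharpe = 1.2                      # annualised, as quoted on the slide
monthly_vol = 0.03                       # assumed 3% monthly volatility
monthly_mean = target_sharpe / np.sqrt(12) * monthly_vol

# i.i.d. Gaussian returns: positive drift, no signal, no rule.
returns = rng.normal(loc=monthly_mean, scale=monthly_vol, size=n_months)
equity = np.cumprod(1.0 + returns)       # growth of one unit of capital

realised_sharpe = returns.mean() / returns.std(ddof=1) * np.sqrt(12)
max_drawdown = (1.0 - equity / np.maximum.accumulate(equity)).max()
print(f"Realised annualised Sharpe: {realised_sharpe:.2f}")
print(f"Maximum drawdown:           {max_drawdown:.1%}")
```

Changing the seed changes the realised Sharpe and the drawdown, which is a useful reminder that a single equity curve is one draw from a distribution.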

Why We Backtest At All

Backtests do have legitimate uses. They:

Rule out obviously broken strategies (they lose money in-sample too)
Quantify turnover, drawdown, capacity for a rule that survives
Give a risk budget to a real portfolio manager: “at this size, this rule has historically drawn down 20%”

What they cannot do is prove a strategy will work in the future. A backtest is a filter, not an oracle.

Why Backtests Lie — A Small Story

You try one momentum rule (12-month lookback, 1-month skip). Sharpe = 0.8. Plausible.

You try ten variations (different lookbacks, skips, universes). You keep the best. Sharpe = 1.4. Impressive.

You try two hundred variations. Best Sharpe = 2.3. “Publishable.”

The last number is almost certainly a mirage. It is the Sharpe of the luckiest noise in your search, not of any real edge.

Part I: Selection Bias & Overfitting

The Coin-Flip Analogy

1,024 analysts, ten months, one monthly trade each. Heads = the call paid off (the trade made money); tails = it lost. The coins are fair: no skill, all luck.

Month 1: about half the desk is already down a trade.

Month 10: one or two analysts are still 10-for-10: every monthly trade profitable, zero losing months. They get promoted, featured in pitch decks, and written up as star stock-pickers.

Month 11: they lose money. The coin doesn’t remember it was lucky.

“Perfect” here means track-record-perfect, not skilful. Selection survives, skill does not.
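The arithmetic behind the story is short: each analyst has a \(0.5^{10} = 1/1024\) chance of a perfect record, so a desk of 1,024 expects exactly one. A small simulation sketch (the seed is arbitrary) makes the same point:

```python
import numpy as np

rng = np.random.default_rng(0)                    # arbitrary seed
n_analysts, n_months = 1024, 10

# True = profitable month. Every analyst flips a fair coin each month.
wins = rng.random((n_analysts, n_months)) < 0.5
perfect = int(wins.all(axis=1).sum())             # analysts who are 10-for-10

expected_perfect = n_analysts * 0.5 ** n_months   # 1024 / 2**10
p_at_least_one = 1 - (1 - 0.5 ** n_months) ** n_analysts
print(f"Perfect records in this simulation: {perfect}")
print(f"Expected number of perfect records: {expected_perfect:.2f}")
print(f"P(at least one perfect record):     {p_at_least_one:.2f}")
```

With no skill anywhere on the desk, at least one "star" appears in roughly 63% of such ten-month experiments.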

The Factor Zoo

Harvey, Liu & Zhu (2016) catalogue 316 factors published in top finance journals between 1967 and 2014.

Most were reported with a t-statistic of 2.0 or higher — the standard “statistically significant” bar for one hypothesis.

But if hundreds of researchers each test a new factor, finding \(t > 2\) by chance alone is almost guaranteed. The 5% false-positive rate compounds.

HLZ recommend a much sterner threshold: \(t > 3.0\) to declare a factor discovered.
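To see how quickly the 5% rate compounds, the sketch below assumes independent tests of true-null factors and a large-sample normal approximation to the t-statistic. The Bonferroni correction shown is one simple, deliberately conservative adjustment; it is not the procedure HLZ use, so its threshold need not match their \(t > 3.0\).

```python
from statistics import NormalDist

alpha = 0.05          # per-test false-positive rate (two-sided, roughly |t| > 2)
std_normal = NormalDist()

for m in [1, 10, 100, 316]:
    # Chance that at least one of m independent true-null tests looks significant.
    p_any = 1 - (1 - alpha) ** m
    # Bonferroni: shrink the per-test level to alpha / m and find the matching |t|.
    t_bonf = std_normal.inv_cdf(1 - alpha / (2 * m))
    print(f"m = {m:>3}: P(at least one false positive) = {p_any:.3f}, "
          f"Bonferroni |t| threshold = {t_bonf:.2f}")
```

Either way the direction is the same: the more hypotheses tested, the higher the bar a single t-statistic has to clear.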

Selection Bias: The Best of N

Search \(N\) strategies, keep the best Sharpe. With no true edge at all:

\[ \mathbb{E}\!\left[\max_{i \le N} \widehat{SR}_i\right] \;\approx\; \sqrt{\tfrac{2 \log N}{T}} \qquad\text{(per period, i.i.d. noise)} \]

Plug in \(T = 240\) months, \(N = 200\):

\[ \sqrt{2 \log 200}\;\approx\;3.26,\qquad \mathbb{E}[\max \widehat{SR}_{\text{mo}}]\;\approx\;\tfrac{3.26}{\sqrt{240}}\;\approx\;0.21,\qquad \text{annualised}\;\approx\;0.21\sqrt{12}\;\approx\;\mathbf{0.73}. \]

That Sharpe of 0.73 is on pure noise, after 200 tries over 20 years of monthly data.
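The same calculation for other search sizes, as a minimal sketch. The function name is hypothetical, and the formula is the independent-trials approximation from above, not a simulation:

```python
import math

def noise_best_sharpe(n_trials: int, n_periods: int, periods_per_year: int = 12) -> float:
    """Annualised sqrt(2 log N / T) approximation for the best Sharpe on pure noise."""
    per_period = math.sqrt(2 * math.log(n_trials) / n_periods)
    return per_period * math.sqrt(periods_per_year)

for n in [10, 50, 200, 1000, 3000]:
    print(f"N = {n:>4}: approx. best noise Sharpe = {noise_best_sharpe(n, 240):.2f}")
```

The bound grows only with \(\sqrt{\log N}\), so going from 200 to 3,000 trials raises the bar far less than going from 1 to 200 did. Most of the damage is done early in the search.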

For context, documented long-horizon post-cost factor Sharpes:

Factor (1963 onwards, US)	Annualised Sharpe
US equity premium	\(\approx 0.40\)
HML (value)	\(\approx 0.30\)
SMB (size)	\(\approx 0.20\)
WML (momentum, pre-2000)	\(\approx 0.50\text{ to }0.80\)

Two hundred naive tries already beat the equity premium. Correlation (\(\rho = 0.2\)) softens the bound but never closes the gap.

The question is not “is my Sharpe high?” It is: is my Sharpe high given that I searched \(N\) strategies?

Watch Selection Bias Inflate — Live

import numpy as np, sys, pathlib
sys.path.append(str(pathlib.Path().resolve().parent))
from scripts.utilities.overfit_metrics import generate_noise_strategies, sharpe_ratio

T = 240  # 20 years of monthly returns
ann = np.sqrt(12)
print(f"{'N':>5}  {'best Sharpe (annualised) on pure noise':>40}")
for N in [1, 10, 50, 100, 200, 500]:
    X = generate_noise_strategies(T=T, N=N, rho=0.2, seed=1)
    best = max(sharpe_ratio(X[:, j]) for j in range(N))
    print(f"{N:>5}  {best * ann:>40.2f}")

    N    best Sharpe (annualised) on pure noise
    1                                     -0.15
   10                                      0.22
   50                                      0.21
  100                                      0.24
  200                                      0.26
  500                                      0.51

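Every one of these numbers comes from mean-zero noise. The growth is pure selection. The values sit below the 0.73 computed earlier because \(\sqrt{2 \log N}\) is an upper approximation for independent trials, whereas these strategies are correlated (\(\rho = 0.2\)) and come from a single seed, which is also why the path from \(N = 10\) to \(N = 50\) is not perfectly monotone.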

Backtest Overfitting — Defined

Backtest overfitting occurs when a strategy’s in-sample performance reflects the luck of the search rather than any persistent edge.

— Bailey et al. (2015)

Two tell-tale symptoms:

In-sample Sharpe >> out-of-sample Sharpe (the “best” rule fades)
The rule is unstable — small changes to parameters change the ranking

We need a number that captures this, before we ever see out-of-sample data.

Part II: Walk-Forward Validation

The Naive 80/20 Split

A student’s first instinct: train on the first 80%, test on the last 20%.

What goes wrong:

You still see the whole history when you design the rule, so parameters are picked with look-ahead.
A single test window is one draw from a distribution of possible regimes. 2008 looks different from 2015.
Many results are reported against one test set, so the community implicitly overfits to that set too.

Walk-Forward in Pictures

At each step: fit or select the strategy on only the training window, skip the embargo, then evaluate on the next untouched test window. Roll forward and repeat. Stitch the test results into a single out-of-sample return series.

The small grey gap between train and test is the embargo (López de Prado 2018, Ch. 7). It limits leakage from overlapping labels and from residual serial correlation, so the test window is closer to genuinely out-of-sample. Report embargo length alongside train and test lengths.
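The roll-forward loop can be sketched as an index generator. This is a minimal illustration under stated assumptions: the function name is hypothetical, the training window is rolling (fixed length) rather than expanding, and the window lengths in the example are arbitrary.

```python
from typing import Iterator, Tuple
import numpy as np

def walk_forward_splits(n_obs: int, train_len: int, test_len: int,
                        embargo: int = 0) -> Iterator[Tuple[np.ndarray, np.ndarray]]:
    """Yield (train_idx, test_idx) pairs for a rolling walk-forward scheme."""
    start = 0
    while start + train_len + embargo + test_len <= n_obs:
        train_end = start + train_len
        test_start = train_end + embargo        # skip the embargo gap
        yield (np.arange(start, train_end),
               np.arange(test_start, test_start + test_len))
        start += test_len                       # roll forward by one test window

splits = list(walk_forward_splits(n_obs=240, train_len=120, test_len=24, embargo=1))
for train_idx, test_idx in splits:
    print(f"train {train_idx[0]:>3}-{train_idx[-1]:>3}   test {test_idx[0]:>3}-{test_idx[-1]:>3}")
```

Fitting or selecting the rule must use only `train_idx`; concatenating the returns earned on each `test_idx` gives the stitched out-of-sample series.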

What to Report From a Walk-Forward

Out-of-sample Sharpe: the Sharpe computed only on test windows. If you re-tune on the test set, it is no longer out-of-sample.
Distribution across windows: each test window produces one Sharpe, one drawdown, one hit-rate. Report the mean and the spread, and ideally plot the individual windows. Good rules are consistent; fragile rules have one hero window and three disasters.
Worst window: the largest drawdown, lowest hit-rate, or highest turnover observed across windows. This is your practical risk budget (the number you tell a portfolio manager).
Parameter stability: does each refit keep picking roughly the same rule (same lookback, same threshold), or does the “best” rule drift every time? Drift is a tell for fitting to noise rather than signal.
Protocol parameters: the exact train length, embargo length, test length, and number of splits used. Changing any of these can flip the result, so they are part of the evidence, not footnotes.
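The distribution-across-windows and worst-window items can be computed with a small helper. A sketch only: the function name is hypothetical and the four test windows are simulated placeholder noise, not coursework data.

```python
import numpy as np

def window_report(test_windows, periods_per_year: int = 12) -> dict:
    """Summarise annualised Sharpe across out-of-sample test windows."""
    sharpes = np.array([
        w.mean() / w.std(ddof=1) * np.sqrt(periods_per_year) for w in test_windows
    ])
    return {
        "per_window": sharpes,
        "mean": float(sharpes.mean()),
        "spread": float(sharpes.std(ddof=1)),
        "worst": float(sharpes.min()),
    }

# Placeholder data: four 24-month test windows of simulated noise returns.
rng = np.random.default_rng(0)
windows = [rng.normal(0.0, 0.04, size=24) for _ in range(4)]
report = window_report(windows)
print(np.round(report["per_window"], 2))
print(f"mean = {report['mean']:.2f}, spread = {report['spread']:.2f}, worst = {report['worst']:.2f}")
```

The same pattern extends to drawdown, hit-rate and turnover: compute one value per window, then report the mean, the spread and the worst case.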

Important

Rule of thumb: If in-sample Sharpe is more than about 2× out-of-sample Sharpe, the strategy is probably overfit.

From Walk-Forward to CSCV

Walk-forward validates one rule against multiple time windows.

But in practice we validate many rules — hundreds of parameter combinations in a single research project.

We need a procedure that asks:

Of the strategies that looked best in-sample, how many stay best out-of-sample?

This is what CSCV formalises.

Part III: Measuring Overfit — PBO via CSCV

CSCV — The Picture

Combinatorially Symmetric Cross-Validation (Bailey et al. 2015) works in four steps:

Chop the return history into \(k\) equal, contiguous time-blocks. Worked example: 240 monthly returns with \(k = 10\) gives ten 24-month slices \(S_1, S_2, \ldots, S_{10}\), laid end-to-end along the timeline.
Enumerate every way to deal the 10 slices into two halves of 5. There are \(\binom{10}{5} = 252\) such splits: one half is training, the other is test.
For each of the 252 splits: pick the in-sample winner (highest Sharpe on the 5 training slices), then record its rank among the candidates on the 5 test slices.
Count what fraction of the 252 splits have the in-sample winner scoring below the median out-of-sample.

That fraction is the Probability of Backtest Overfitting (PBO). PBO \(\approx 0.5\) is a coin flip; PBO \(\ll 0.5\) means the winner survives when the calendar is reshuffled.
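The four steps translate almost line by line into code. The sketch below follows the description on this slide (fraction of splits where the in-sample winner falls below the out-of-sample median). It is not the course's `cscv_pbo` utility, it omits the logit transform of the relative rank used in Bailey et al. (2015), and the function name and test data are hypothetical.

```python
from itertools import combinations
import numpy as np

def pbo_sketch(returns: np.ndarray, k: int = 10) -> float:
    """Share of splits where the in-sample winner lands below the
    out-of-sample median. `returns` has shape (T, N)."""
    n_obs = returns.shape[0]
    blocks = np.array_split(np.arange(n_obs), k)    # k contiguous time-blocks
    below_median = []
    for train_blocks in combinations(range(k), k // 2):
        test_blocks = [b for b in range(k) if b not in train_blocks]
        train = returns[np.concatenate([blocks[b] for b in train_blocks])]
        test = returns[np.concatenate([blocks[b] for b in test_blocks])]
        sr_train = train.mean(axis=0) / train.std(axis=0, ddof=1)
        sr_test = test.mean(axis=0) / test.std(axis=0, ddof=1)
        winner = int(np.argmax(sr_train))            # in-sample champion
        below_median.append(sr_test[winner] < np.median(sr_test))
    return float(np.mean(below_median))

rng = np.random.default_rng(0)                       # arbitrary seed
noise = rng.normal(0.0, 1.0, size=(240, 20))         # 20 independent noise strategies
with_edge = noise.copy()
with_edge[:, 0] += 1.0                               # one strategy with a very large edge

print(f"PBO, pure noise:        {pbo_sketch(noise):.3f}")
print(f"PBO, one dominant rule: {pbo_sketch(with_edge):.3f}")
```

The edge in the second case is exaggerated on purpose so that the same strategy wins every split; realistic edges give PBO values between the two extremes.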

Reading a PBO Number

PBO	Interpretation
0.0 – 0.2	Credible evidence of persistence
0.2 – 0.5	Questionable; investigate further
0.5	Coin-flip — no signal detected
> 0.5	Systematic anti-selection; worse than chance

On pure noise, PBO centres on ~0.5 by construction. On real signal, PBO drops well below 0.5.

Noise vs. Signal — A Demo

import numpy as np
import sys, pathlib
sys.path.append(str(pathlib.Path().resolve().parent))
from scripts.utilities.overfit_metrics import generate_noise_strategies, cscv_pbo

T, N = 240, 200

# (a) Pure noise: no strategy has an edge
X_noise = generate_noise_strategies(T=T, N=N, rho=0.2, seed=42)
pbo_noise = cscv_pbo(X_noise, n_folds=10).pbo

# (b) Real signal: 10 of 200 strategies carry a genuine edge of ~0.5
# annualised Sharpe (a realistic factor-strength effect for this simulation).
X_edge = generate_noise_strategies(T=T, N=N, rho=0.2, seed=42)
X_edge[:, :10] += 0.50 / np.sqrt(12)
pbo_edge = cscv_pbo(X_edge, n_folds=10).pbo

print(f"PBO on pure noise:       {pbo_noise:.3f}")
print(f"PBO with 10/200 edged:   {pbo_edge:.3f}")

PBO on pure noise:       0.631
PBO with 10/200 edged:   0.210

On pure noise, the in-sample champion ranks below the OOS median in roughly half the splits, so PBO hovers around \(0.5\). Once a real edge is present, the champion tends to stay near the top OOS, so the fraction-below-median drops well under \(0.5\). That drop is the CSCV diagnostic signal you were looking for.

PBO Cannot Tell You Whether A Strategy Is “Good”

PBO only asks: does in-sample rank predict out-of-sample rank?

A strategy can have low PBO and a terrible Sharpe — it’s a consistent loser. PBO says “the loss is real”, not “the profit is real”.

We need a second lens on magnitude, adjusted for how many strategies we trialled.

That lens is PSR and DSR.

Part IV: Honest Sharpe — PSR & DSR

The Sharpe Ratio — Recap

\[\text{SR} = \frac{\bar{r}}{s} \qquad \text{(excess return over volatility)}\]

Known pitfalls:

Annualising requires assuming i.i.d. returns (usually false)
A high SR estimated from little data (\(n < 60\) months) is barely informative
Selection bias inflates the best SR in a search
Skew and fat tails make the simple formula optimistic

The Probabilistic Sharpe Ratio

PSR asks: what is the probability that the true Sharpe exceeds some benchmark \(SR^*\), given what we observed?

Accounts for sample size (more data → sharper inference)
Accounts for skewness and kurtosis (fat tails penalise PSR)
Output is in \([0, 1]\) — a probability, not a point estimate
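For reference, the closed form from Bailey and López de Prado is \(\text{PSR}(SR^*) = \Phi\!\bigl((\widehat{SR} - SR^*)\sqrt{n-1}\,/\,\sqrt{1 - \gamma_3 \widehat{SR} + \tfrac{\gamma_4 - 1}{4}\widehat{SR}^2}\bigr)\), with \(\gamma_3\) the skewness and \(\gamma_4\) the raw kurtosis of returns. The sketch below implements that formula directly. The function name is hypothetical and this is not necessarily how the course's `probabilistic_sharpe_ratio` is written.

```python
from statistics import NormalDist
import math

def psr_sketch(sr_hat: float, sr_benchmark: float, n_obs: int,
               skew: float = 0.0, kurt: float = 3.0) -> float:
    """Probability that the true per-period Sharpe exceeds sr_benchmark.

    sr_hat and sr_benchmark are per-period (not annualised) Sharpe ratios;
    kurt is raw kurtosis, so 3.0 corresponds to a normal distribution.
    """
    denom = math.sqrt(1.0 - skew * sr_hat + (kurt - 1.0) / 4.0 * sr_hat ** 2)
    z = (sr_hat - sr_benchmark) * math.sqrt(n_obs - 1) / denom
    return NormalDist().cdf(z)

monthly_sr = 0.50 / math.sqrt(12)      # annualised Sharpe of 0.50 on monthly data
psr_normal = psr_sketch(monthly_sr, 0.0, 240)
psr_fat = psr_sketch(monthly_sr, 0.0, 240, skew=-1.0, kurt=8.0)
print(f"Normal returns:           PSR = {psr_normal:.3f}")
print(f"Negative skew, fat tails: PSR = {psr_fat:.3f}")
```

With normal returns, an annualised Sharpe of 0.50 over 240 months gives a PSR of about 0.987 against a zero benchmark. The same Sharpe with negative skew and fat tails scores lower.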

Tip

Reporting tip: A strategy with PSR > 0.95 against \(SR^* = 0\) is only saying “the true Sharpe is probably positive.” That is a very weak claim.

The Deflated Sharpe Ratio

DSR raises \(SR^*\) to account for how many strategies you trialled and how correlated they were (Bailey and López de Prado 2014).

\[SR^{*} = \sigma_{SR} \cdot \Bigl[ (1 - \gamma)\, \Phi^{-1}\!\bigl(1 - \tfrac{1}{N}\bigr) + \gamma\, \Phi^{-1}\!\bigl(1 - \tfrac{1}{Ne}\bigr) \Bigr]\]

where \(N\) is the number of trials, \(\sigma_{SR}\) is the standard deviation of SRs across trials, \(\gamma \approx 0.5772\) is the Euler-Mascheroni constant, \(e\) is Euler's number, and \(\Phi^{-1}\) is the inverse standard normal CDF.

You don’t need to memorise the formula. The intuition is all that matters:

The more rules you tried, the higher the bar the winner has to clear.
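A sketch of how the bar rises with \(N\), using the \(SR^*\) formula exactly as written above. Assumptions to note: the trials are treated as independent (no \(\rho\) adjustment, unlike the course's `deflated_sharpe_ratio`), returns are taken as normal, \(\sigma_{SR}\) is set to \(1/\sqrt{T}\) as a stand-in for the spread of noise Sharpes, and the function names are hypothetical.

```python
from statistics import NormalDist
import math

EULER_GAMMA = 0.5772156649015329   # Euler-Mascheroni constant

def selection_adjusted_benchmark(sigma_sr: float, n_trials: int) -> float:
    """SR* from the formula above (independent trials; needs n_trials >= 2)."""
    inv = NormalDist().inv_cdf
    return sigma_sr * ((1 - EULER_GAMMA) * inv(1 - 1 / n_trials)
                       + EULER_GAMMA * inv(1 - 1 / (n_trials * math.e)))

def dsr_sketch(sr_hat: float, sigma_sr: float, n_trials: int, n_obs: int) -> float:
    """PSR evaluated at SR*, assuming normal returns (skew 0, kurtosis 3)."""
    sr_star = selection_adjusted_benchmark(sigma_sr, n_trials)
    denom = math.sqrt(1.0 + 0.5 * sr_hat ** 2)
    return NormalDist().cdf((sr_hat - sr_star) * math.sqrt(n_obs - 1) / denom)

T = 240
sigma_sr = 1 / math.sqrt(T)            # assumed spread of per-period noise Sharpes
for n in [2, 10, 50, 200, 1000]:
    sr_star = selection_adjusted_benchmark(sigma_sr, n)
    print(f"N = {n:>4}: annualised SR* = {sr_star * math.sqrt(12):.2f}")
```

When the observed Sharpe equals \(SR^*\), `dsr_sketch` returns exactly 0.5, which is the "coin flip" reading of a deflated Sharpe.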

PSR vs DSR — The Gap Is The Point

For a noise-only simulation (\(T=240\), \(N=200\), \(\rho=0.2\)):

Metric	Typical value on noise
Best in-sample Sharpe (annualised)	~ 0.5
Selection-adjusted \(SR^*\) (annualised)	~ 0.5
PSR against \(SR^* = 0\)	> 0.95 (looks great)
DSR against selection-adjusted \(SR^*\)	~ 0.5 (honest)
Gap (PSR − DSR)	Cost of selection

The gap between PSR and DSR is the number we most want you to internalise. It is the price you pay for searching the haystack.

PSR vs DSR — Live, Noise vs Edge

import numpy as np, sys, pathlib
sys.path.append(str(pathlib.Path().resolve().parent))
from scripts.utilities.overfit_metrics import (
    generate_noise_strategies, sharpe_ratio,
    probabilistic_sharpe_ratio, deflated_sharpe_ratio,
)

T, N, rho = 240, 200, 0.2
ann = np.sqrt(12)
header = f"{'Scenario':<18} {'SR':>6} {'SR*':>6} {'PSR':>6} {'DSR':>6}"
print(header)
print("-" * len(header))
for label, add_edge in [("Pure noise", 0.0), ("10/200 with edge", 0.05 / np.sqrt(12))]:
    X = generate_noise_strategies(T=T, N=N, rho=rho, seed=42)
    if add_edge:
        X[:, :10] += add_edge
    srs = np.array([sharpe_ratio(X[:, j]) for j in range(N)])
    j = int(np.argmax(srs))
    psr = probabilistic_sharpe_ratio(srs[j], sr_benchmark=0.0, n_obs=T)
    dsr, sr_star = deflated_sharpe_ratio(srs[j], srs, n_obs=T, rho=rho)
    print(f"{label:<18} {srs[j]*ann:>6.2f} {sr_star*ann:>6.2f} {psr:>6.3f} {dsr:>6.3f}")

Scenario               SR    SR*    PSR    DSR
----------------------------------------------
Pure noise           0.50   0.50  0.987  0.493
10/200 with edge     0.50   0.50  0.987  0.494

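On pure noise, PSR \(\approx\) 1 but DSR \(\approx\) 0.5: the benchmark has swallowed the observed Sharpe. In this run the second row barely moves, because the injected edge is only 0.05 annualised (0.05 / np.sqrt(12)), a tenth of the 0.50 used in the PBO demo, so the best of the 200 is still a noise winner. With an edge large enough to change which strategy wins, DSR should climb towards PSR and the gap should narrow.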

Part V: Honest Reporting

The Research Playbook

When you claim a backtest result, always declare:

Universe and sample: stocks, dates, rebalancing frequency
Trials: how many parameter combinations did you search? (\(N\))
Correlation assumption: \(\rho\) among the trials
Walk-forward diagnostics: out-of-sample Sharpe, worst window, turnover
PBO from CSCV with \(k=10\)
PSR against \(SR^* = 0\) and DSR against selection-adjusted \(SR^*\)
Decision: promote, park, or discard — with reasoning

Promote, Park, or Discard

Outcome	PBO	DSR	Action
Green	< 0.2	> 0.95	Promote to paper-trade
Amber	0.2–0.4	0.7–0.95	Park — collect more data
Red	> 0.4	< 0.7	Discard

These thresholds are conventions, not laws. What matters is that you fixed them before looking at the result.
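Fixing the thresholds in advance is easiest when they live in code. A minimal sketch of the table as a function; the name is hypothetical, and one rule is an assumption the table does not state: when PBO and DSR fall in different bands, the worse band decides.

```python
def traffic_light(pbo: float, dsr: float) -> str:
    """Map (PBO, DSR) to promote / park / discard using the table's thresholds."""
    # Grade each metric separately: 0 = green, 1 = amber, 2 = red.
    pbo_grade = 0 if pbo < 0.2 else (1 if pbo <= 0.4 else 2)
    dsr_grade = 0 if dsr > 0.95 else (1 if dsr >= 0.7 else 2)
    # Assumption: when the two metrics disagree, the worse grade decides.
    return ["promote", "park", "discard"][max(pbo_grade, dsr_grade)]

for pbo, dsr in [(0.10, 0.97), (0.30, 0.80), (0.10, 0.60), (0.55, 0.99)]:
    print(f"PBO = {pbo:.2f}, DSR = {dsr:.2f} -> {traffic_light(pbo, dsr)}")
```

Committing a function like this before running the backtest is one practical way to pre-register the decision rule.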

Lab Preview

In labs/lab10_backtesting.qmd you will:

Generate \(N=200\) noise-only strategies with correlation \(\rho = 0.2\)
Compute the best in-sample Sharpe — watch it balloon
Run CSCV and report PBO
Compute PSR against \(SR^* = 0\) and DSR against the selection-adjusted benchmark
Add 10 strategies with a small true edge; re-run; watch PBO drop and DSR rise

Assessment Connection — CW2

Coursework 2 (both the fraud scaffold and the momentum scaffold) asks you to pick a model configuration from a grid of candidates.

Marking will reward:

Explicit declaration of \(N\) (the trial count)
A walk-forward evaluation, not a single train/test split
PBO + DSR reported alongside the headline Sharpe
A clear promote / park / discard decision with pre-declared thresholds

If you can defend your chosen configuration against these four criteria, you have done the job.

Summary

Key Takeaways

A backtest is a filter, not an oracle: it can rule bad ideas out, but it cannot rule good ones in.
Selection bias inflates the best-of-N Sharpe even when no strategy has any edge.
Use walk-forward instead of a single train/test split.
CSCV → PBO quantifies how often your in-sample winner fades out-of-sample. Aim for PBO < 0.2.
PSR attaches a probability to a Sharpe; DSR raises the benchmark to account for how many strategies you trialled. Always quote the gap.
Report \(N\), \(\rho\), PBO, PSR, DSR, and your decision rule. Pre-register the thresholds.

📖 Lab: Lab 10 — Backtesting & Validation

References

Bailey, David H., Jonathan M. Borwein, Marcos López de Prado, and Qiji Jim Zhu. 2015. “The Probability of Backtest Overfitting.” Journal of Computational Finance. https://doi.org/10.2139/ssrn.2326253.

Bailey, David H., and Marcos López de Prado. 2014. “The Deflated Sharpe Ratio: Correcting for Selection Bias, Backtest Overfitting and Non-Normality.” Journal of Portfolio Management 40 (5): 94–107. https://doi.org/10.2139/ssrn.2460551.

Hansen, Peter R. 2005. “A Test for Superior Predictive Ability.” Journal of Business & Economic Statistics 23 (4): 365–80. https://doi.org/10.1198/073500105000000063.

Harvey, Campbell R., Yan Liu, and Heqing Zhu. 2016. “… and the Cross-Section of Expected Returns.” Review of Financial Studies 29 (1): 5–68. https://doi.org/10.1093/rfs/hhv059.

———. 2020. “False (and Missed) Discoveries in Financial Economics.” Journal of Finance 75 (5): 2503–53. https://doi.org/10.1111/jofi.12960.

White, Halbert. 2000. “A Reality Check for Data Snooping.” Econometrica 68 (5): 1097–1126. https://doi.org/10.1111/1468-0262.00152.
