Backtesting and Validation

Week 10: Why Most Backtests Lie, and How to Tell If Yours Is One of Them

1 Why Validation Is Its Own Discipline

A standard factor-testing workflow runs a time-series regression, reports an alpha with HAC standard errors, and asks whether the measured premium is statistically different from zero. That is the discipline of testing a single, pre-specified hypothesis. It is the right discipline for a factor that someone else has already proposed, whose definition is inherited, and whose construction is not tuned by the current researcher.

This chapter is about the other half of quantitative research: the part in which the researcher specifies the rule, chooses the lookback and rebalancing schedule, tries multiple variations, and then must decide whether the reported result is signal or mirage. Most desk-level research notes and most proposed factors in the published literature live in this second world, and it is here that the usual tools of inference begin to mislead. The Sharpe ratio computed on the best of two hundred strategies is not a Sharpe ratio in the sense of Bailey et al. (2015) or Fama and French (1993). It is a selection statistic, and it obeys different distributional rules.

Bailey et al. (2015) open their paper with a deliberately jarring result. They generate a garden of random strategies, none of which has any true edge, and they follow standard research practice: rank by in-sample Sharpe, pick the winner, rebadge it as “the strategy we trade”, and track it out-of-sample. In their synthetic example, 100% of in-sample Sharpe ratios are positive and 78% of out-of-sample Sharpe ratios are negative. The procedure is not broken because the data were unusual. It is broken because the researcher searched.

This is not an exotic edge case. Harvey, Liu, and Zhu (2016) tabulate 316 factors proposed in top finance journals between 1967 and 2014, nearly all of them reported at the conventional two-sided five-per-cent bar (equivalently, large-sample $|t| \approx 2$), and argue that under any reasonable adjustment for the volume of testing conducted across the literature the appropriate threshold is closer to $|t| = 3$. Harvey, Liu, and Zhu (2020) extend the analysis using a Benjamini-Yekutieli false-discovery-rate procedure and conclude that a substantial fraction of the published factor literature is consistent with null data-generating processes. The problem is endemic; the remedy is a set of tools that make the act of searching explicit, put a cost on it, and produce a number that survives the accounting.

Note

Learning objectives. After completing this chapter, the reader should be able to: (1) explain what a backtest is and why a large in-sample Sharpe ratio is not, by itself, evidence of a persistent edge; (2) describe the selection-bias and multiple-testing problems in plain language, including the Harvey, Liu, and Zhu (2016) finding that the conventional $t > 2.0$ threshold is too lenient given the size of the published factor zoo; (3) compute and interpret the Probability of Backtest Overfitting (PBO) using Combinatorially Symmetric Cross-Validation (CSCV), and apply the $PBO > 0.05$ rejection threshold proposed by Bailey et al.; (4) compute the Deflated Sharpe Ratio (DSR) and interpret it as a selection-adjusted probability of genuine skill; (5) report a backtest using the López de Prado template in a reproducible and decision-ready format.

2 What a Backtest Is, and What It Is Not

A backtest is a simulation. A trading rule expressed in code is run against historical data, and the resulting return sequence is recorded. The output is a single equity curve and a small table of summary statistics: cumulative return, annualised Sharpe ratio, maximum drawdown, turnover, and perhaps a breakdown by year or regime. The whole exercise is stripped of most of the difficulties of live trading. There are no rejected orders, no information-leakage delays, no broker quirks, and no macroeconomic surprises. Every number is produced from data that already exists, and the researcher knows how every episode ends before the simulation starts.

Three ingredients enter a backtest. The rule specifies what the strategy does: for example, “at the end of each month, buy the thirty per cent of stocks with the highest book-to-market ratio, short the thirty per cent with the lowest, hold for one month, rebalance.” The history provides the data against which the rule is simulated: prices, volumes, fundamentals, and index memberships for a given universe over a specified window. The performance metric is the single number used to decide whether the result is good: usually the Sharpe ratio, sometimes the information ratio, occasionally a utility-adjusted alternative. The output of the backtest is then a story about the past, expressed in numbers that look quantitative and definitive.

It is worth dwelling on why backtests feel so rigorous. They use real prices on real dates. They produce real-looking equity curves with drawdowns dated to actual recessions. They run on the same Pandas dataframes that professional portfolio managers use. A reader meeting a backtest for the first time typically assumes that the exercise is a factual claim about the past: the strategy did earn that return, or it did not. The feeling is misleading. The numbers are not facts about the past. They are the result of choices made when specifying the rule, evaluated on a single historical realisation of an uncertain world. Change the universe, the rebalancing day, the definition of the signal, or the window, and every number moves. This is not a pathology. It is simply that the backtest has a large and unacknowledged degree of freedom, and the researcher’s job is to ensure that the number finally reported is not merely the luckiest corner of that degree of freedom.

The pedagogical point is then straightforward. Backtests are a filter, not an oracle. They are useful for ruling out strategies that cannot even survive in-sample, for quantifying the turnover and drawdown profile of a strategy that has already passed some other test, and for sizing risk budgets. They are dangerous when used to discover strategies that should be traded with real money, because the procedure of searching for the best rule systematically biases the output upward.

3 Why Backtests Lie

3.1 Two hundred strategies and one winner

Consider a concrete variant of the opening story. Suppose there are twenty years of monthly returns for a broad universe and a momentum-style rule. The lookback window is uncertain, as are the number of recent months to skip, the portfolio weighting, the rebalancing day, and the holding horizon. So multiple combinations are tried. Three lookbacks, three skips, two weightings, two rebalancing days, two holding horizons: $3 \times 3 \times 2 \times 2 \times 2 = 72$ versions. The best one is reported. Its Sharpe ratio is 1.8. A momentum premium is then claimed.

The difficulty is that, under the null hypothesis that none of the seventy-two rules has any true edge, the maximum of seventy-two i.i.d. sample Sharpe ratios has an expected value that is much larger than zero. The false-strategy theorem of Bailey et al. (2015), restated in Prado (2018a) §14.7, gives the order of magnitude: the expected maximum Sharpe across $N$ independent trials is bounded above by $\sigma_{SR} \sqrt{2 \log N}$ and is well approximated by $\sigma_{SR} \left[ (1-\gamma)\, \Phi^{-1}(1 - 1/N) + \gamma\, \Phi^{-1}(1 - 1/(N e)) \right]$, where $\sigma_{SR}$ is the standard deviation of the Sharpe ratios across trials and $\gamma$ is the Euler-Mascheroni constant. The growth in $N$ is slow but relentless: at $N = 200$ trials the expected best Sharpe on pure noise is roughly $2.8\,\sigma_{SR}$ and the ninety-fifth percentile is roughly $3.5\,\sigma_{SR}$, which puts the best annualised Sharpe on noise comfortably above one for backtests only a few years long. Bailey and Prado (2014) make the same point at a more concrete operational level: roughly twenty iterations is enough, at the conventional five-per-cent level, to discover a “significant” strategy from a martingale-distributed data stream.
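The approximation is easy to tabulate. The following is a minimal sketch, using only the Python standard library, of the expected-maximum formula and its upper bound. The function names `expected_max_sharpe` and `sharpe_upper_bound` are illustrative and are not part of the Lab 10 utilities, and the trials are assumed independent.

```python
import math
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def expected_max_sharpe(n_trials: int, sigma_sr: float = 1.0) -> float:
    """Approximate expected maximum of n_trials independent null Sharpe ratios."""
    if n_trials < 2:
        raise ValueError("the approximation needs at least two trials")
    inv = NormalDist().inv_cdf  # standard normal quantile function
    return sigma_sr * (
        (1.0 - EULER_GAMMA) * inv(1.0 - 1.0 / n_trials)
        + EULER_GAMMA * inv(1.0 - 1.0 / (n_trials * math.e))
    )


def sharpe_upper_bound(n_trials: int, sigma_sr: float = 1.0) -> float:
    """Upper bound sigma_SR * sqrt(2 log N) on the expected maximum."""
    return sigma_sr * math.sqrt(2.0 * math.log(n_trials))


# With sigma_sr = 1 the results are in units of the cross-trial
# standard deviation of the Sharpe ratio.
for n in (10, 20, 100, 200):
    approx = expected_max_sharpe(n)
    bound = sharpe_upper_bound(n)
    print(f"N={n:4d}  E[max] ~ {approx:.2f} sigma_SR   bound = {bound:.2f} sigma_SR")
```

Because both quantities scale linearly with $\sigma_{SR}$, passing the cross-trial standard deviation in the units of the reported Sharpe ratio (per-period or annualised) converts the output into those units.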

Bailey et al. (2015) illustrate the consequence by simulating two hundred mean-zero return streams, running the naive procedure, and observing that nearly every selected strategy would be advertised as a publishable result. They then track those same strategies out-of-sample. The in-sample “champions” fall below the out-of-sample median of their peers in 74% of splits, and 78% of their out-of-sample Sharpe ratios are negative. The procedure is not merely imprecise; it is actively anti-predictive, and for reasons we will make precise below.

3.2 Multiple testing in one line

The Bonferroni correction captures the intuition without any distributional assumptions. If a single hypothesis is tested at the five per cent level, the probability of a false positive is 0.05. If $m$ independent hypotheses are each tested at five per cent, the probability of at least one false positive is $1 - (1 - 0.05)^m$. At $m = 20$ this is already 0.64, and at $m = 100$ it is 0.994. To preserve an overall five per cent level across the family of tests, the per-test threshold must be lowered to $0.05 / m$, which for $m = 100$ is a $p$-value of $0.0005$, corresponding to a $t$-statistic of roughly 3.5. Bonferroni is conservative because real-world tests are not independent, but the direction of the adjustment is unambiguous: the more rules that are tried, the higher the bar the winner must clear.
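The arithmetic can be checked directly. A minimal sketch using the standard library follows; the function names are illustrative, and the conversion from a $p$-value to a $t$-statistic uses the large-sample normal approximation for a two-sided test.

```python
from statistics import NormalDist


def familywise_error_rate(m: int, alpha: float = 0.05) -> float:
    """Probability of at least one false positive across m independent tests."""
    return 1.0 - (1.0 - alpha) ** m


def bonferroni_threshold(m: int, alpha: float = 0.05) -> float:
    """Per-test p-value threshold that keeps the family-wise level at alpha."""
    return alpha / m


def two_sided_critical_value(p_value: float) -> float:
    """Large-sample |t| corresponding to a two-sided p-value."""
    return NormalDist().inv_cdf(1.0 - p_value / 2.0)


for m in (1, 20, 100):
    per_test = bonferroni_threshold(m)
    print(
        f"m={m:3d}  P(at least one false positive)={familywise_error_rate(m):.3f}  "
        f"Bonferroni p={per_test:.4f}  |t|~{two_sided_critical_value(per_test):.2f}"
    )
```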

3.3 The factor zoo: Harvey, Liu and Zhu

Harvey, Liu, and Zhu (2016) make this concrete for the published factor literature. They catalogue 316 factors proposed between 1967 and 2014, most reported with a $t$-statistic just above 2. Applying a multiple-testing correction that is appropriate for the population of tests actually conducted, they argue that the threshold for declaring a new factor “discovered” should be around $t = 3.0$ rather than 2.0. The effect is dramatic. A large fraction of the post-2000 factors fall below the adjusted bar and should be treated with scepticism. Harvey, Liu, and Zhu (2020) generalise the argument to a proper false-discovery-rate framework, using Benjamini-Yekutieli to handle dependence across tests, and arrive at a similar qualitative conclusion: the published literature contains both false discoveries (factors that look real and are not) and missed discoveries (factors that look weak but are actually real), and the appropriate statistical procedure is not a single $t$-test at 0.05 but an FDR-controlled ranking of candidates. For the present purpose the detail is less important than the headline: the conventional significance bar is far too low for the volume of testing that now occurs, both across the published literature and within any individual research project.

The corollary is unsettling for anyone choosing a factor to analyse. Any chosen factor comes from a population of candidates known to contain a substantial fraction of false discoveries. Critical analysis should engage with this reality rather than assume it away. The Jensen, Kelly, and Pedersen (2024) dataset already incorporates a first line of defence: standardised construction methodology and direct re-estimation on independent data reduce the most obvious sources of overstatement, and the factor premia they report are typically thirty to fifty per cent smaller than the original-paper claims.

3.4 Hold-out alone is not the answer

Students often reach for the simplest defensive move: set aside the last twenty per cent of the sample as a hold-out, fit on the first eighty, evaluate on the held-out tail. This is better than nothing, but Bailey et al. (2015) argue in their §1 that it is insufficient for at least three concrete reasons. First, if the data are public, the researcher has almost certainly seen the hold-out already, either directly or through widely reported stylised facts; nobody approaching UK equity data for the period 2015-2020 is genuinely surprised by the Brexit referendum or the COVID-19 drawdown. Second, a single hold-out is a single point estimate, and different hold-outs can give very different conclusions; the hold-out period becomes a new degree of freedom that is itself vulnerable to implicit community-level overfitting (everyone evaluates on the same “standard” window). Third, and most importantly for our purposes, hold-out ignores the number of trials conducted on the training set, which is the actual source of the selection bias. A researcher who tries two hundred rules on the first eighty per cent of the data and reports the best one evaluated on the final twenty per cent has not solved the problem; they have merely displaced it.

3.5 k-fold cross-validation leaks across time

A related instinct is to reach for k-fold cross-validation, a workhorse of general machine learning. On time-series data this is actively misleading. Standard k-fold shuffles observations randomly across folds, which means the training folds for any given test fold contain observations from both before and after the test period. In a financial setting where today’s features are constructed from rolling averages, momentum windows, or volatility estimates, this is look-ahead leakage in its most literal form: your training data already encodes information about the future you are claiming to predict. Even blocked k-fold (where folds are contiguous) suffers from label leakage when the predictive target spans multiple periods, as it does for any overlapping-return strategy. The correct resampling procedure for financial backtesting must respect time, must be aware of the overlap structure of labels, and must give us a distribution of out-of-sample outcomes rather than a single point estimate.
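The leakage is visible in the indices alone. The sketch below (NumPy only; the helper names are hypothetical) assigns observations to shuffled folds and measures, for each test fold, the share of training observations dated after the earliest test observation. It illustrates the ordering problem only and does not implement purging or embargoing of overlapping labels.

```python
import numpy as np


def shuffled_kfold_test_indices(n_obs, k, seed=0):
    """Randomly assign the time indices 0..n_obs-1 to k test folds."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n_obs), k)


def future_share_in_training(n_obs, test_idx):
    """Share of training observations dated after the earliest test date."""
    is_train = np.ones(n_obs, dtype=bool)
    is_train[test_idx] = False
    train_idx = np.flatnonzero(is_train)
    return float(np.mean(train_idx > test_idx.min()))


n_obs = 240  # for example, twenty years of monthly observations

shuffled = [
    future_share_in_training(n_obs, fold)
    for fold in shuffled_kfold_test_indices(n_obs, k=5)
]

# A chronological split: train on months 0..191, test on months 192..239.
chronological = future_share_in_training(n_obs, np.arange(192, 240))

print("shuffled k-fold, share of training data from the future:", np.round(shuffled, 2))
print("chronological split, share of training data from the future:", chronological)
```

Any share above zero means the fitted rule has seen observations that postdate part of the period it is being tested on.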

4 Walk-Forward Validation: The Minimum Viable Remedy

The simplest procedure that respects time is walk-forward validation. The data are divided chronologically into a training window and a test window. The rule is selected or its parameters fit on the training window, evaluated on the test window, and the result recorded. The windows are then rolled forward by one step and the process repeated. At the end, instead of a single point estimate, the researcher has a sequence of out-of-sample returns: one per rolled window. Walk-forward has two honest properties that a naive train/test split does not. The researcher never uses data that would not have been available at the point of the decision, and the out-of-sample observations can be pooled meaningfully because, provided the step equals the length of the test window, each observation is tested exactly once and the test windows do not overlap.
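The rolling scheme can be written as a small index generator. The following is a minimal sketch under stated assumptions: the name `walk_forward_splits` and its parameters are hypothetical, windows are measured in observations, and the default step equals the test-window length so that test windows do not overlap.

```python
import numpy as np


def walk_forward_splits(n_obs, train_len, test_len, step=None, expanding=False):
    """Yield (train_idx, test_idx) pairs in chronological order.

    With expanding=False the training window rolls; with expanding=True it
    always starts at observation 0 and grows by `step` each time.
    """
    step = test_len if step is None else step
    if min(train_len, test_len, step) <= 0:
        raise ValueError("window lengths and step must be positive")
    start = 0
    while start + train_len + test_len <= n_obs:
        train_end = start + train_len
        train_start = 0 if expanding else start
        train_idx = np.arange(train_start, train_end)
        test_idx = np.arange(train_end, train_end + test_len)
        yield train_idx, test_idx
        start += step


# Twenty years of monthly data, ten-year training window, one-year test window.
for train_idx, test_idx in walk_forward_splits(240, train_len=120, test_len=12):
    print(
        f"train {train_idx[0]:3d}..{train_idx[-1]:3d}   "
        f"test {test_idx[0]:3d}..{test_idx[-1]:3d}"
    )
```

Every training index precedes every test index in its pair, which is the property that shuffled resampling gives up.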

What to report from a walk-forward is then more than a single headline. The out-of-sample Sharpe ratio, averaged across windows, is the obvious first number. The distribution of out-of-sample Sharpe ratios across windows is more informative still, because it tells you whether the strategy’s premium is built on one lucky year or spread across the sample. The worst window is a rough proxy for the drawdown the strategy would have produced in its most hostile regime. The stability of the selected parameters, if you are re-selecting at each step, tells you whether the rule is genuinely one rule or a sequence of different rules that happen to share a name. A useful rule of thumb, consistent with the López de Prado reporting conventions, is that if the in-sample Sharpe ratio is more than twice the median out-of-sample Sharpe ratio, the strategy is probably overfit regardless of how many other boxes it ticks.
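These reporting items reduce to a few lines once the per-window Sharpe ratios exist. The sketch below is illustrative: `summarise_walk_forward` is a hypothetical helper, the input numbers are placeholders rather than results, and treating a non-positive median out-of-sample Sharpe as an automatic flag is an added assumption on top of the two-to-one rule of thumb.

```python
import numpy as np


def summarise_walk_forward(is_sharpe, oos_sharpes):
    """Summarise per-window out-of-sample Sharpe ratios against the in-sample one."""
    oos = np.asarray(oos_sharpes, dtype=float)
    median_oos = float(np.median(oos))
    return {
        "mean_oos": float(oos.mean()),
        "median_oos": median_oos,
        "worst_window": float(oos.min()),
        "share_positive_windows": float(np.mean(oos > 0)),
        # Rule of thumb: in-sample Sharpe more than twice the median
        # out-of-sample Sharpe suggests overfitting. A non-positive median
        # is flagged as well (an assumption of this sketch).
        "overfit_flag": bool(median_oos <= 0 or is_sharpe > 2.0 * median_oos),
    }


# Placeholder inputs for illustration only.
summary = summarise_walk_forward(is_sharpe=1.8, oos_sharpes=[0.9, 0.2, -0.4, 0.7, 0.5])
for key, value in summary.items():
    print(f"{key:24s}: {value}")
```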

Walk-forward handles one strategy evaluated through many time windows. It does not, on its own, handle the more common research situation of many strategies evaluated through the same time window, which is the situation the selection-bias literature is mostly concerned with. For that we need a resampling procedure that asks a different question: of the strategies that looked best in-sample, how many of them stay best out-of-sample?

It is also worth flagging a caveat that Prado (2018a) Ch. 11 emphasises: a walk-forward backtest is itself a single path through history, and if configurations are iterated on that path often enough, the walk-forward itself can be overfit. Walk-forward is a necessary discipline, not a sufficient one. The tools in the next two sections, CSCV and the deflated Sharpe ratio, are what close the residual gap.

5 CSCV and the Probability of Backtest Overfitting

5.1 The picture, before the formalism

The procedure Bailey et al. (2015) propose, Combinatorially Symmetric Cross-Validation, is easiest to understand as a picture. Divide the return history into $S$ contiguous slices of equal length; ten slices is a reasonable default. For every possible way of choosing $S/2$ slices to serve as the “in-sample” set, with the remaining $S/2$ serving as the “out-of-sample” set, do the following. Compute the in-sample Sharpe ratio of each strategy in the garden. Identify the in-sample champion (the strategy with the highest in-sample Sharpe). Compute the out-of-sample Sharpe of that strategy and record its rank among the full set of strategies out-of-sample. Repeat across all ${S \choose S/2}$ symmetric splits (or a random subsample of them, if the full enumeration is too large). The result is a distribution of out-of-sample ranks for the in-sample champion, one rank per split.

If the in-sample champions really are the best strategies, their out-of-sample ranks should cluster near one. If, on the other hand, the in-sample ranking is mostly noise, the out-of-sample ranks should scatter uniformly across the full range, and in the worst case the in-sample champions should systematically land below the median.
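The combinatorial step is short enough to sketch with the standard library. The generator name `cscv_splits` is hypothetical; it enumerates every way of choosing $S/2$ of the $S$ slices as the in-sample set and pairs each choice with its complement.

```python
from itertools import combinations
from math import comb


def cscv_splits(n_slices):
    """Yield (in_sample_slices, out_of_sample_slices) for every symmetric split."""
    if n_slices < 2 or n_slices % 2:
        raise ValueError("n_slices must be a positive even number")
    everything = set(range(n_slices))
    for in_sample in combinations(range(n_slices), n_slices // 2):
        out_of_sample = tuple(sorted(everything - set(in_sample)))
        yield in_sample, out_of_sample


splits = list(cscv_splits(10))
print(len(splits), comb(10, 5))  # number of splits equals C(10, 5)
print(splits[0])
```

The enumeration is symmetric in the sense the name suggests: whenever a pair of slice sets appears as (in-sample, out-of-sample), the reversed pair appears as well, so every slice is used equally often on each side.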

5.2 Defining PBO

The Probability of Backtest Overfitting is the fraction of splits in which the in-sample champion underperforms the median out-of-sample Sharpe. In symbols, let $r^{IS}_s$ denote the in-sample rank of strategy $s$ on a given split and $r^{OS}_s$ its out-of-sample rank, both normalised to lie between zero and one and ordered so that a higher rank means a higher Sharpe ratio. Then PBO is

\[ \phi = \Pr\!\left[\, r^{OS}_{s^*} < \text{median}(r^{OS}) \,\right], \qquad s^* = \arg\max_s r^{IS}_s, \]

evaluated as the empirical frequency across all CSCV splits. Equivalently, and in a numerically more stable form, one applies a logit transform to the champion's out-of-sample rank, $\tau = \log[r/(1 - r)]$ where $r = r^{OS}_{s^*}$ is the normalised rank, and reports PBO as the fraction of splits with $\tau < 0$. A strategy with no edge at all, and with the selection procedure applied honestly, produces a PBO close to 0.5. A strategy with a clear edge produces a PBO close to zero. A strategy whose selection procedure is actively chasing noise (our warning case below) produces a PBO substantially above 0.5.
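The definition translates into a short NumPy routine. What follows is a minimal sketch, not the Lab 10 `cscv_pbo` implementation: the names `pbo_sketch` and `column_sharpe` are hypothetical, all splits are enumerated rather than subsampled, ranks are computed so that a higher rank means a higher Sharpe ratio, and the rank is divided by $N + 1$ (an assumption of this sketch) so that the logit stays finite.

```python
from itertools import combinations

import numpy as np


def column_sharpe(block):
    """Per-period Sharpe ratio of each column of a (T, N) return block."""
    return block.mean(axis=0) / block.std(axis=0, ddof=1)


def pbo_sketch(returns, n_slices=10):
    """Return (pbo, logits) for a (T, N) matrix of strategy returns."""
    n_obs, n_strategies = returns.shape
    slices = np.array_split(np.arange(n_obs), n_slices)
    logits = []
    for in_ids in combinations(range(n_slices), n_slices // 2):
        out_ids = [k for k in range(n_slices) if k not in in_ids]
        in_rows = np.concatenate([slices[k] for k in in_ids])
        out_rows = np.concatenate([slices[k] for k in out_ids])
        sr_in = column_sharpe(returns[in_rows])
        sr_out = column_sharpe(returns[out_rows])
        champion = int(np.argmax(sr_in))
        # Out-of-sample rank of the champion: 1 = worst, N = best.
        rank = int(np.sum(sr_out <= sr_out[champion]))
        omega = rank / (n_strategies + 1)  # normalised rank in (0, 1)
        logits.append(np.log(omega / (1.0 - omega)))
    logits = np.array(logits)
    return float(np.mean(logits < 0)), logits


rng = np.random.default_rng(0)
noise = rng.standard_normal((240, 50))   # no strategy has an edge
edged = noise.copy()
edged[:, 0] += 1.0                       # one strategy with a very strong edge

print("PBO, pure noise      :", pbo_sketch(noise)[0])
print("PBO, one strong edge :", pbo_sketch(edged)[0])
```

The second array of the return value holds the logits, which is what a histogram of the rank distribution would be drawn from.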

5.3 Two anchor numbers from Bailey et al.

The intuition becomes concrete when you look at the two worked examples in Bailey et al. (2015) §3. In their synthetic overfit case, built from return streams with no true edge, PBO comes out at 0.74: the in-sample champion is below the out-of-sample median in three splits out of four. Alongside this, 100% of the in-sample Sharpe ratios are positive, and 78% of the out-of-sample Sharpe ratios are negative. The selection procedure is not merely producing noise; it is reliably selecting strategies that will underperform going forward. In their second example, using a real investment strategy with a genuine signal, PBO drops to 0.04: only 4% of in-sample champions underperform the out-of-sample median, and only 3% of out-of-sample Sharpes are negative. The same procedure, applied to the same reporting template, produces two PBO values that differ by more than an order of magnitude. That is the evidentiary gap PBO is designed to make visible.

5.4 The negative-slope phenomenon

There is a further fact, reported in Bailey et al. (2015) §3.2, that is routinely counter-intuitive on first reading. If one regresses the out-of-sample Sharpe ratio of the in-sample champion on its in-sample Sharpe ratio, across splits of an overfit backtest, the slope coefficient is negative. A higher in-sample Sharpe predicts a lower out-of-sample Sharpe. The reason is that in an overfit setting the in-sample Sharpe measures how aggressively a strategy has fitted past noise, and aggressive fitting to noise is exactly what prevents generalisation. The point can be read as a corollary of the bias-variance decomposition: a strategy that has maximised in-sample performance on limited data has typically spent most of its flexibility on features of that sample that will not recur. The more impressive the in-sample number, the more the procedure has been rewarded for over-fitting, and the more confidently one can predict that the out-of-sample number will be worse.

5.5 Interpreting PBO in Practice

Once the histogram is in hand, the interpretation follows Bailey et al. (2015) §3.1. A PBO below 0.05 meets the Neyman-Pearson-style threshold they propose: the procedure is producing a result consistent with a genuine edge and one can, with reasonable discipline, proceed to paper-trading. A PBO between 0.05 and 0.5 is a warning flag; something has selected a strategy that is barely better than random out-of-sample, and no conclusion about skill is defensible without much more context. A PBO above 0.5 is worse than a coin flip: the selection procedure is systematically anti-predictive and the reported backtest should be discarded.

5.6 A small worked demonstration

The utilities used by Lab 10 are small enough that we can reproduce the core statistics in a few lines. The function generate_noise_strategies simulates a garden of mean-zero return streams with a specified common correlation, cscv_pbo runs the CSCV machinery, and deflated_sharpe_ratio (used in the next section) inflates the benchmark. All three are plain NumPy, no mlfinlab required.

Show: PBO on pure noise vs weak edge

import numpy as np
from overfit_metrics import (
    generate_noise_strategies,
    sharpe_ratio,
    cscv_pbo,
)

T, N = 240, 200
rho = 0.2

# (a) pure noise: no strategy has an edge
X_noise = generate_noise_strategies(T=T, N=N, rho=rho, seed=42)
pbo_noise = cscv_pbo(X_noise, n_folds=10, max_splits=150).pbo

# (b) ten of the two hundred carry a genuine edge of roughly 0.5
# annualised Sharpe. At unit-variance monthly noise this corresponds
# to adding 0.50 / sqrt(12) ~= 0.144 per month to the returns of the
# first ten strategies; a factor-strength effect.
X_edge = generate_noise_strategies(T=T, N=N, rho=rho, seed=42)
X_edge[:, :10] += 0.50 / np.sqrt(12)
pbo_edge = cscv_pbo(X_edge, n_folds=10, max_splits=150).pbo

print(f"PBO on pure noise       : {pbo_noise:.3f}")
print(f"PBO with 10/200 edged   : {pbo_edge:.3f}")

PBO on pure noise       : 0.600
PBO with 10/200 edged   : 0.207

The first number sits near 0.5 (typically in the 0.35-0.60 band, depending on the random seed and the CSCV subsample), consistent with the theoretical prediction that the in-sample champion of a noise garden stays below the out-of-sample median roughly half the time. The second number is substantially lower, because once some strategies carry a genuine edge they tend to stay near the top of the ranking in every split, and the champion’s out-of-sample rank is much more often above the median. The difference is the diagnostic signal: when real structure is injected into the data, PBO responds; when it is not, PBO does not. Note that these numbers never reach the PBO = 0.74 of Bailey et al.’s synthetic example, because this garden has weaker cross-strategy correlation and a less aggressive selection rule. Their example is deliberately chosen to maximise the overfitting failure mode; this calibration is chosen to show the mechanism without requiring a pathological setup.

6 The Deflated Sharpe Ratio

6.1 PSR, the easy test

PBO answers a question about persistence: when you rank strategies in-sample, does that ranking carry over to out-of-sample? It does not, on its own, answer a question about magnitude: is the Sharpe ratio of the chosen strategy large enough to matter, given the sample size and the non-normality of returns? For that we need a different statistic.

Bailey and Prado (2014), building on Bailey et al. (2015) and earlier work, define the Probabilistic Sharpe Ratio as the probability that the true Sharpe ratio of a strategy exceeds some benchmark $SR^*$, given the sample Sharpe ratio $\widehat{SR}$, the sample size $n$, the sample skewness $\hat\gamma_3$, and the sample kurtosis $\hat\gamma_4$. The closed-form expression is

\[ \mathrm{PSR}(SR^*) = \Phi\!\left(\frac{(\widehat{SR} - SR^*)\sqrt{n-1}}{\sqrt{1 - \hat\gamma_3 \widehat{SR} + \tfrac{\hat\gamma_4 - 1}{4}\widehat{SR}^2}}\right), \]

where $\Phi$ is the standard normal CDF. The formula corrects three defects of the naive Sharpe ratio: it accounts for finite-sample noise (through the $\sqrt{n-1}$ factor in the numerator), for skewness (skewed returns inflate or deflate confidence in a point estimate of SR), and for fat tails (which are ubiquitous in financial returns and make the simple formula too optimistic).
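The closed form is a one-liner in code. The sketch below uses only the standard library; `psr_sketch` is a hypothetical name chosen to avoid confusion with the Lab 10 `probabilistic_sharpe_ratio`, the Sharpe ratios are assumed to be per-period (not annualised) so that they match `n_obs`, and `kurtosis` is the raw kurtosis (3 for a normal distribution), not the excess kurtosis.

```python
import math
from statistics import NormalDist


def psr_sketch(sr_hat, sr_benchmark, n_obs, skew=0.0, kurtosis=3.0):
    """Probability that the true Sharpe ratio exceeds sr_benchmark."""
    variance_term = 1.0 - skew * sr_hat + (kurtosis - 1.0) / 4.0 * sr_hat ** 2
    if variance_term <= 0:
        raise ValueError("skew, kurtosis and Sharpe imply a non-positive variance")
    z = (sr_hat - sr_benchmark) * math.sqrt(n_obs - 1) / math.sqrt(variance_term)
    return NormalDist().cdf(z)


# A monthly Sharpe of 0.144 (roughly 0.5 annualised) over 240 months.
print(f"normal returns          : {psr_sketch(0.144, 0.0, 240):.3f}")
print(f"negative skew, fat tails: {psr_sketch(0.144, 0.0, 240, skew=-1.0, kurtosis=6.0):.3f}")
```

Negative skewness and excess kurtosis both enlarge the denominator for a positive Sharpe ratio, so the same point estimate earns less confidence.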

When $SR^*$ is set to zero, PSR answers “how confident are we that the true Sharpe is positive?” This is a weak and easily satisfied null. On our noise simulation above, where the in-sample champion has an annualised Sharpe of roughly 0.5 over 240 months, $\mathrm{PSR}(0)$ comfortably exceeds 0.95. This is not because the strategy is any good. It is because the best of two hundred sample Sharpe ratios on mean-zero data is almost certain to be well above zero in finite samples, and $\mathrm{PSR}(0)$ is measuring exactly that.

6.2 Inflating the benchmark for selection

The Deflated Sharpe Ratio is the same formula with a more honest benchmark. Instead of asking whether the true Sharpe exceeds zero, it asks whether the true Sharpe exceeds the expected maximum Sharpe one would obtain from searching across $N$ trials with average pairwise correlation $\rho$ on data with no real edge. The benchmark is

\[ SR^* = \sigma_{SR} \cdot \left\{ (1 - \gamma) \cdot \Phi^{-1}\!\left(1 - \tfrac{1}{N}\right) + \gamma \cdot \Phi^{-1}\!\left(1 - \tfrac{1}{N e}\right) \right\}, \]

where $\gamma \approx 0.5772$ is the Euler-Mascheroni constant, $\sigma_{SR}$ is the cross-trial standard deviation of Sharpe ratios (shrunk for correlation among trials), and $N$ is the number of trials. The intuition is what matters: the more rules that were tried, and the less correlated those rules were with each other, the higher the selection-adjusted benchmark the winner must clear. DSR is then defined as $\mathrm{PSR}(SR^*)$ with this inflated benchmark. On a genuine edge, DSR will be close to one. On noise, DSR will be well below one, and typically well below PSR(0).
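Putting the two formulas together gives a compact DSR calculation. The following is a sketch under simplifying assumptions, not the Lab 10 `deflated_sharpe_ratio`: the helper names are hypothetical, $\sigma_{SR}$ is taken as the plain sample standard deviation of the trial Sharpe ratios with no shrinkage for correlation among trials, the trials are simulated as independent, and the champion's returns are treated as normal (skewness 0, kurtosis 3).

```python
import math
from statistics import NormalDist

import numpy as np

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant
STD_NORMAL = NormalDist()


def expected_max_sharpe(sigma_sr, n_trials):
    """Selection-adjusted benchmark SR* for n_trials independent null trials."""
    inv = STD_NORMAL.inv_cdf
    return sigma_sr * (
        (1.0 - EULER_GAMMA) * inv(1.0 - 1.0 / n_trials)
        + EULER_GAMMA * inv(1.0 - 1.0 / (n_trials * math.e))
    )


def psr(sr_hat, sr_benchmark, n_obs, skew=0.0, kurtosis=3.0):
    """Probabilistic Sharpe Ratio against an arbitrary benchmark."""
    denom = math.sqrt(1.0 - skew * sr_hat + (kurtosis - 1.0) / 4.0 * sr_hat ** 2)
    return STD_NORMAL.cdf((sr_hat - sr_benchmark) * math.sqrt(n_obs - 1) / denom)


def dsr_sketch(sr_hat, sr_trials, n_obs, skew=0.0, kurtosis=3.0):
    """Return (DSR, SR*), with SR* built from the spread of the trial Sharpes."""
    sigma_sr = float(np.std(sr_trials, ddof=1))
    sr_star = expected_max_sharpe(sigma_sr, len(sr_trials))
    return psr(sr_hat, sr_star, n_obs, skew, kurtosis), sr_star


# Two hundred independent mean-zero strategies over 240 periods.
rng = np.random.default_rng(0)
returns = rng.standard_normal((240, 200))
sr_trials = returns.mean(axis=0) / returns.std(axis=0, ddof=1)
sr_hat = float(sr_trials.max())

dsr, sr_star = dsr_sketch(sr_hat, sr_trials, n_obs=240)
psr_zero = psr(sr_hat, 0.0, 240)
print(f"champion Sharpe {sr_hat:.3f}, benchmark SR* {sr_star:.3f}")
print(f"PSR(0) {psr_zero:.3f}, DSR {dsr:.3f}")
```

Because $SR^*$ is positive whenever more than one trial was run, DSR can never exceed $\mathrm{PSR}(0)$ for the same strategy.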

6.3 The PSR-DSR gap as cost of selection

The number to report is less PSR or DSR on its own than the gap between them. PSR against a zero benchmark is always reassuring in any setting where you have searched across non-trivial numbers of trials, because the best of a large number of noisy draws is almost always positive. DSR strips that reassurance away by moving the goalposts to where they belonged in the first place. The gap between the two is the quantitative cost of having searched. Running the DSR machinery on our noise garden:

Show: PSR vs DSR on the noise garden

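import numpy as np
import pandas as pd
from overfit_metrics import (
    generate_noise_strategies,
    sharpe_ratio,
    probabilistic_sharpe_ratio,
    deflated_sharpe_ratio,
)

# Same garden as in Section 5.6, restated so that this block runs on its own.
T, N = 240, 200
rho = 0.2

X = generate_noise_strategies(T=T, N=N, rho=rho, seed=42)

# Per-period Sharpe of every strategy; the champion is the in-sample maximum.
sr_all = np.array([sharpe_ratio(X[:, j]) for j in range(N)])
j_star = int(np.argmax(sr_all))
x_star = X[:, j_star]
sr_hat = sr_all[j_star]
n_obs = len(x_star)

# pandas reports excess kurtosis; the PSR formula expects raw kurtosis
# (3 for a normal distribution), hence the + 3.0.
skew = float(pd.Series(x_star).skew())
kurt = float(pd.Series(x_star).kurtosis() + 3.0)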

psr_0 = probabilistic_sharpe_ratio(sr_hat, 0.0, n_obs, skew=skew, kurtosis=kurt)
dsr, sr_star = deflated_sharpe_ratio(
    sr_hat=sr_hat,
    sr_trials=sr_all,
    n_obs=n_obs,
    skew=skew,
    kurtosis=kurt,
    rho=rho,
)

print(f"Sample Sharpe (champion) : {sr_hat:.3f}")
print(f"Selection-adjusted SR*   : {sr_star:.3f}")
print(f"PSR against SR*=0        : {psr_0:.3f}")
print(f"DSR against SR*={sr_star:.3f}: {dsr:.3f}")
print(f"PSR - DSR (cost of selection): {psr_0 - dsr:+.3f}")

Sample Sharpe (champion) : 0.144
Selection-adjusted SR*   : 0.146
PSR against SR*=0        : 0.988
DSR against SR*=0.146: 0.493
PSR - DSR (cost of selection): +0.495

On a typical run of the noise garden the sample Sharpe of the champion (measured per-period, before annualisation) sits around 0.14, which annualises to roughly 0.5 on monthly data: enough for $\mathrm{PSR}(0)$ to comfortably exceed 0.95, because the finite-sample test of “is the true Sharpe greater than zero?” is easy to pass. But the selection-adjusted benchmark $SR^*$ is itself around 0.14 (because the expected maximum of two hundred correlated noise Sharpes is essentially what we observed), so DSR collapses to a value close to 0.5. The difference between $\mathrm{PSR}(0) \approx 0.99$ and $\mathrm{DSR} \approx 0.5$ is the quantitative statement that the Sharpe ratio was large enough to clear the “is it positive?” bar but nowhere near large enough to clear the “is it the best of a two-hundred-strategy search?” bar. On a genuine edge, as in the Bailey et al. (2015) real-strategy example, PSR(0) is still near one, but DSR also rises materially, and the gap shrinks. The gap, not the level, is the reporting number.

6.4 When DSR matters and when it does not

A natural question at this point is whether DSR is always appropriate. The honest answer is that DSR is the right statistic whenever the reported result was selected from a search, and “selected” includes more than formal hyperparameter tuning. If the universe is chosen because it looks cleaner, the rebalancing day because it gives smoother returns, or the lookback window because it makes the factor work, then $N$ is not one and a DSR-style correction is warranted. Conversely, if the reported rule is one published before the data were examined, implemented exactly as specified, and not tuned on the sample, then $N = 1$ and DSR reduces to PSR. Most applied research submissions are in the first category, not the second.

7 Honest Reporting and a Research Template

7.1 What a credible backtest write-up contains

The discipline of this chapter is easier to remember as a single reporting template than as a set of statistical facts. Whenever a backtest result is claimed, the following should be on the page. The universe and sample: which stocks, from which exchange, over which dates, at which rebalancing frequency. The trials: how many distinct configurations of the rule were evaluated before choosing the one reported, and (approximately) how correlated they were with each other. The walk-forward diagnostics: the out-of-sample Sharpe ratio, the distribution across windows, the worst window, and the turnover. The CSCV-based PBO, with the histogram of logit ranks and an explicit statement of whether the result clears the PBO < 0.05 threshold. The PSR against a zero benchmark and the DSR against the selection-adjusted benchmark, with an explicit statement of the PSR-DSR gap. Finally, a decision: given the numbers, should the strategy be promoted to paper-trading, parked pending more data, or discarded? The decision should be declared against thresholds fixed before computing the numbers, not after.
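One way to keep the template honest is to make it a data structure that cannot be filled in partially. The sketch below is purely illustrative: `BacktestReport` and its field names are hypothetical, the PBO, PSR, and DSR values are the noise-garden figures printed earlier in the chapter, and the walk-forward fields are left empty because no walk-forward was run on that garden.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class BacktestReport:
    universe: str
    sample: str
    rebalance_frequency: str
    n_trials: int                       # every configuration evaluated
    avg_trial_correlation: float        # approximate cross-trial correlation
    oos_sharpe_by_window: Tuple[float, ...]
    turnover: float
    pbo: float
    psr_zero: float
    dsr: float
    decision: str                       # "promote", "park" or "discard"

    def __post_init__(self):
        if self.n_trials < 1:
            raise ValueError("n_trials must count every configuration tried")
        for name in ("pbo", "psr_zero", "dsr"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be a probability, got {value}")
        if self.decision not in ("promote", "park", "discard"):
            raise ValueError(f"unknown decision: {self.decision}")

    @property
    def psr_dsr_gap(self) -> float:
        """The cost of selection: PSR(0) minus DSR."""
        return self.psr_zero - self.dsr

    @property
    def clears_pbo_threshold(self) -> bool:
        return self.pbo < 0.05


report = BacktestReport(
    universe="simulated noise garden",
    sample="240 monthly observations",
    rebalance_frequency="monthly",
    n_trials=200,
    avg_trial_correlation=0.2,
    oos_sharpe_by_window=(),            # not computed for the noise garden
    turnover=float("nan"),              # not computed for the noise garden
    pbo=0.600,
    psr_zero=0.988,
    dsr=0.493,
    decision="discard",
)
print(f"PSR-DSR gap: {report.psr_dsr_gap:+.3f}")
print(f"Clears PBO < 0.05: {report.clears_pbo_threshold}")
```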

A rigorous assessment should be built around this template, and should reward reports that apply it honestly. A result whose DSR is below 0.95 and whose PBO is above 0.5 is not a weaker version of a good result; it is a different claim, and the appropriate action is to report it as such and explain why you are not promoting the strategy. A report that presents an impressive in-sample Sharpe ratio without an accompanying PBO or DSR is methodologically incomplete, just as a factor regression without HAC standard errors is methodologically incomplete in Chapter 9.

López de Prado’s Third Law of Backtesting

“Every backtest result must be reported in conjunction with all the trials involved in its production. Absent that information, it is impossible to assess the backtest’s ‘false discovery’ probability.”

— Prado (2018a), Snippet 14.5

Any serious research workflow should treat this instruction literally. The number of trials you ran and an estimate of their average cross-correlation are as much a part of the reported result as the Sharpe ratio itself. A report that presents the latter without the former is reporting a statistic whose distribution under the null has not been pinned down, and consequently a statistic that cannot be evaluated.

7.2 A simple decision rule, pre-committed

One of the cleanest ways to demonstrate research discipline is to commit to thresholds in advance and report against them. A widely used (but deliberately conservative) convention is:

| PBO | DSR | Action |
|---|---|---|
| below 0.2 | above 0.95 | Promote to paper-trading |
| 0.2 to 0.4 | 0.7 to 0.95 | Park; collect more data before deploying |
| above 0.4 | below 0.7 | Discard |

These thresholds are conventions, not laws. What matters is that they are fixed before looking at the PBO and DSR numbers, and then followed when the result is unfavourable. A report that promotes a strategy with PBO above 0.4 on the grounds that “it is close to the cut-off and has a nice story” is exactly the kind of post-hoc rationalisation that this apparatus is designed to prevent.
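Writing the rule as a function before running the search is one way to make the pre-commitment concrete. The sketch below encodes the three rows of the table. The table does not say what to do when PBO and DSR fall in different rows, so the code assumes the more conservative of the two readings wins; that tie-breaking rule, the treatment of exact boundary values as "park", and the name `decide` are assumptions made for this illustration.

```python
ACTIONS = ("promote", "park", "discard")  # ordered from least to most conservative


def _pbo_band(pbo):
    """Row of the decision table implied by PBO alone (0, 1 or 2)."""
    if pbo < 0.2:
        return 0
    if pbo <= 0.4:
        return 1
    return 2


def _dsr_band(dsr):
    """Row of the decision table implied by DSR alone (0, 1 or 2)."""
    if dsr > 0.95:
        return 0
    if dsr >= 0.7:
        return 1
    return 2


def decide(pbo, dsr):
    """Apply the pre-committed rule; mixed readings take the stricter row.

    The stricter-row convention is an assumption of this sketch, not
    something the table itself specifies.
    """
    for name, value in (("pbo", pbo), ("dsr", dsr)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1]")
    return ACTIONS[max(_pbo_band(pbo), _dsr_band(dsr))]


print(decide(0.10, 0.97))  # both readings in the first row
print(decide(0.30, 0.80))  # both readings in the second row
print(decide(0.10, 0.60))  # mixed readings: the DSR row dominates
```

Committing this function to version control before the backtest is run leaves a timestamped record that the thresholds were not adjusted after the fact.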

7.3 The one-line version of the week

The apparatus is elaborate but the lesson it is built to teach is not. A great-looking backtest is not, on its own, evidence that a strategy will work live. The number that most people treat as evidence (the in-sample Sharpe ratio) is the number that is most sensitive to how hard you searched. The procedures introduced in this chapter are the minimum set of corrections that make the Sharpe ratio an honest statement rather than a selection statistic.
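The sensitivity of the in-sample Sharpe ratio to search effort can be seen in a few lines of simulation. This is a minimal sketch under stated assumptions: independent Gaussian monthly returns with zero mean, a twenty-year sample, and a hypothetical helper `best_sharpe_on_noise` that is not part of the lab utilities.

```python
import numpy as np


def best_sharpe_on_noise(n_strategies, n_months=240, n_sims=200, seed=0):
    """Average annualised Sharpe of the in-sample winner among pure-noise strategies."""
    rng = np.random.default_rng(seed)
    best = np.empty(n_sims)
    for i in range(n_sims):
        # Every strategy has zero true edge: i.i.d. standard normal monthly returns.
        returns = rng.standard_normal((n_months, n_strategies))
        sharpe = returns.mean(axis=0) / returns.std(axis=0, ddof=1) * np.sqrt(12)
        best[i] = sharpe.max()  # the number a naive report would quote
    return best.mean()


results = {n: best_sharpe_on_noise(n) for n in (1, 10, 50, 200)}
for n, value in results.items():
    print(f"trials = {n:>3}: mean best annualised Sharpe = {value:.2f}")
```

With a single trial the average sits near zero, as it should for noise; as the number of trials grows, the winner's Sharpe ratio rises even though no strategy has any edge. The only thing that changed between rows is how hard the search was.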

8 From Research to Live Trading

This chapter is about the statistical discipline that separates a credible backtest from a selection artefact. It is not about the engineering discipline that separates a credible backtest from a live trading system. That second discipline is substantial in its own right: it covers feature engineering under point-in-time constraints, model-versioning for rollback, drift detection under non-stationary distributions, and the governance frameworks that regulators require for models that move real money. Those concerns are covered in the sibling appendix on production ML pipelines, retained as reference material.

For present purposes, there is no need to build a production system. The immediate task is to write a backtest whose numbers can be defended. The most consequential failure mode in that direction is still statistical rather than operational: a strategy whose selection bias has not been priced cannot be rescued by any amount of pipeline engineering downstream. A strategy that has been priced honestly, on the other hand, can be engineered into a live system with the tools described in the appendix. This chapter and the appendix are meant to be read in that order.

9 Synthesis

A single pre-specified factor can be evaluated with HAC-robust standard errors. A strategy chosen from a search space the researcher controls requires more. The three corrections introduced in this chapter are stacked on top of that HAC baseline, giving four nested layers. HAC handles within-sample temporal dependence. Walk-forward replaces naive in-sample evaluation with chronologically ordered out-of-sample evaluation. CSCV with PBO handles the multiple-strategy selection problem. DSR handles the multiple-strategy magnitude problem. Each sits strictly on top of the one below it, and a credible research claim generally needs all four.
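The magnitude layer can be sketched with the Python standard library alone, using the PSR expression and the expected-maximum benchmark given earlier in the chapter. The sketch assumes independent trials (no shrinkage of $\sigma_{SR}$ for correlation), takes kurtosis as non-excess (3 for a normal distribution), and uses hypothetical function names that are deliberately different from the lab utilities. The input values are illustrative, not results.

```python
from math import e, sqrt
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329
_STD_NORMAL = NormalDist()


def probabilistic_sharpe(sr_hat, sr_benchmark, n_obs, skew=0.0, kurtosis=3.0):
    """Probability that the true Sharpe exceeds sr_benchmark (per-period units)."""
    spread = sqrt(1.0 - skew * sr_hat + (kurtosis - 1.0) / 4.0 * sr_hat**2)
    z = (sr_hat - sr_benchmark) * sqrt(n_obs - 1) / spread
    return _STD_NORMAL.cdf(z)


def expected_max_sharpe(sigma_sr, n_trials):
    """Selection-adjusted benchmark SR* for n_trials independent trials.

    Assumption of this sketch: trials are treated as independent, so
    sigma_sr is used as given, with no correlation adjustment.
    """
    if n_trials < 2:
        return 0.0  # a single pre-specified rule: DSR reduces to PSR against zero
    q1 = _STD_NORMAL.inv_cdf(1.0 - 1.0 / n_trials)
    q2 = _STD_NORMAL.inv_cdf(1.0 - 1.0 / (n_trials * e))
    return sigma_sr * ((1.0 - EULER_GAMMA) * q1 + EULER_GAMMA * q2)


def deflated_sharpe(sr_hat, sigma_sr, n_trials, n_obs, skew=0.0, kurtosis=3.0):
    """PSR evaluated against the selection-adjusted benchmark."""
    sr_star = expected_max_sharpe(sigma_sr, n_trials)
    return probabilistic_sharpe(sr_hat, sr_star, n_obs, skew, kurtosis)


# Illustrative inputs: a per-period Sharpe of 0.14 on 240 months, chosen
# as the best of 200 trials whose Sharpe ratios have a cross-trial
# standard deviation of 1/sqrt(240).
n_obs, n_trials, sr_hat = 240, 200, 0.14
sigma_sr = 1.0 / sqrt(n_obs)

psr_zero = probabilistic_sharpe(sr_hat, 0.0, n_obs)
dsr = deflated_sharpe(sr_hat, sigma_sr, n_trials, n_obs)
print(f"PSR(0) = {psr_zero:.3f}, DSR = {dsr:.3f}, gap = {psr_zero - dsr:+.3f}")
```

The same sample Sharpe ratio passes the test against zero and fails the test against the selection-adjusted benchmark; the gap between the two is the cost of having searched.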

The broader argument can now be stated compactly. Empirical finance demands both data-quality discipline and statistical discipline. Once search enters the workflow, the tension between the incentive to search and the cost of searching must be made explicit in the reported numbers. The tools in this chapter are the minimum technology required to keep that tension visible in practice.

The single most effective step is to apply the reporting template in §7 before doing anything else. Declare the universe, pre-declare thresholds, state the number of trials to be run and the assumed average pairwise correlation $\rho$ among them, and then run the search. If the headline result does not survive the template, report that outcome explicitly. Honest reporting is the core skill this chapter is designed to develop.

References

Bailey, David H., Jonathan M. Borwein, Marcos López de Prado, and Qiji Jim Zhu. 2015. “The Probability of Backtest Overfitting.” Journal of Computational Finance. https://doi.org/10.2139/ssrn.2326253.

Bailey, David H., and Marcos López de Prado. 2014. “The Deflated Sharpe Ratio: Correcting for Selection Bias, Backtest Overfitting and Non-Normality.” Journal of Portfolio Management 40 (5): 94–107. https://doi.org/10.2139/ssrn.2460551.

Fama, Eugene F., and Kenneth R. French. 1993. “Common Risk Factors in the Returns on Stocks and Bonds.” Journal of Financial Economics 33 (1): 3–56. https://doi.org/10.1016/0304-405X(93)90023-5.

Hansen, Peter R. 2005. “A Test for Superior Predictive Ability.” Journal of Business & Economic Statistics 23 (4): 365–80. https://doi.org/10.1198/073500105000000063.

Harvey, Campbell R., Yan Liu, and Heqing Zhu. 2016. “... And the Cross-Section of Expected Returns.” Review of Financial Studies 29 (1): 5–68. https://doi.org/10.1093/rfs/hhv059.

———. 2020. “False (and Missed) Discoveries in Financial Economics.” Journal of Finance 75 (5): 2503–53. https://doi.org/10.1111/jofi.12960.

Jensen, Theis I., Bryan T. Kelly, and Lasse Heje Pedersen. 2024. “Is There a Replication Crisis in Finance?” Journal of Finance. https://doi.org/10.1111/jofi.13249.

Prado, Marcos López de. 2018a. Advances in Financial Machine Learning. John Wiley & Sons.

———. 2018b. “The 7 Reasons Most Backtests Fail and How to Fix Them.” SSRN Electronic Journal. https://doi.org/10.2139/ssrn.3257419.

White, Halbert. 2000. “A Reality Check for Data Snooping.” Econometrica 68 (5): 1097–1126. https://doi.org/10.1111/1468-0262.00152.

--- title: "Backtesting and Validation" subtitle: "Week 10: Why Most Backtests Lie, and How to Tell If Yours Is One of Them" format: html: toc: true toc-depth: 3 number-sections: true code-fold: true code-summary: "Show code" bibliography: - ../resources/reading.bib - ../resources/reading_supp.bib execute: echo: true eval: true warning: false message: false jupyter: fin510 --- ```{python} #| include: false import sys from pathlib import Path for candidate in ( Path("scripts/utilities").resolve(), (Path.cwd().parent / "scripts" / "utilities").resolve(), ): if candidate.exists(): sys.path.insert(0, str(candidate)) break ``` ## Why Validation Is Its Own Discipline {#sec-why-validation} A standard factor-testing workflow runs a time-series regression, reports an alpha with HAC standard errors, and asks whether the measured premium is statistically different from zero. That is the discipline of testing a single, pre-specified hypothesis. It is the right discipline for a factor that someone else has already proposed, whose definition is inherited, and whose construction is not tuned by the current researcher. This chapter is about the other half of quantitative research: the part in which the researcher specifies the rule, chooses the lookback and rebalancing schedule, tries multiple variations, and then must decide whether the reported result is signal or mirage. Most desk-level research notes and most proposed factors in the published literature live in this second world, and it is here that the usual tools of inference begin to mislead. The Sharpe ratio computed on the best of two hundred strategies is not a Sharpe ratio in the sense of @bailey2015pbo or @fama1993common. It is a *selection statistic*, and it obeys different distributional rules. @bailey2015pbo open their paper with a deliberately jarring result. 
They generate a garden of random strategies, none of which has any true edge, and they follow standard research practice: rank by in-sample Sharpe, pick the winner, rebadge it as "the strategy we trade", and track it out-of-sample. In their synthetic example, **100% of in-sample Sharpe ratios are positive and 78% of out-of-sample Sharpe ratios are negative**. The procedure is not broken because the data were unusual. It is broken because the researcher searched. This is not an exotic edge case. @harvey2016and tabulate roughly three hundred and sixteen factors proposed in top finance journals between 1967 and 2014, nearly all of them reported at the conventional two-sided five-per-cent bar (equivalently, large-sample $|t| \approx 2$), and argue that under any reasonable adjustment for the volume of testing conducted across the literature the appropriate threshold is closer to three. @harvey2020false extend the analysis using a Benjamini-Yekutieli false-discovery-rate procedure and conclude that a substantial fraction of the published factor literature is consistent with null data-generating processes. The problem is endemic; the solution is a set of tools that make the act of searching explicit, cost it, and build a number that survives the accounting. 
::: {.callout-note} **Learning objectives.** After completing this chapter, the reader should be able to: (1) explain what a backtest is and why a large in-sample Sharpe ratio is not, by itself, evidence of a persistent edge; (2) describe the selection-bias and multiple-testing problems in plain language, including the @harvey2016and finding that the conventional $t > 2.0$ threshold is too lenient given the size of the published factor zoo; (3) compute and interpret the Probability of Backtest Overfitting (PBO) using Combinatorially Symmetric Cross-Validation (CSCV), and apply the $PBO > 0.05$ rejection threshold proposed by Bailey et al.; (4) compute the Deflated Sharpe Ratio (DSR) and interpret it as a selection-adjusted probability of genuine skill; (5) report a backtest using the López de Prado template in a reproducible and decision-ready format. ::: ## What a Backtest Is, and What It Is Not {#sec-what-is-a-backtest} A backtest is a simulation. A trading rule expressed in code is run against historical data, and the resulting return sequence is recorded. The output is a single equity curve and a small table of summary statistics: cumulative return, annualised Sharpe ratio, maximum drawdown, turnover, and perhaps a breakdown by year or regime. The whole exercise is stripped of most of the difficulties of live trading. There are no rejected orders, no information-leakage delays, no broker quirks, and no macroeconomic surprises. Every number is produced from data that already exists, and the researcher knows how every episode ends before the simulation starts. Three ingredients enter a backtest. The *rule* specifies what the strategy does: for example, "at the end of each month, buy the thirty per cent of stocks with the highest book-to-market ratio, short the thirty per cent with the lowest, hold for one month, rebalance." 
The *history* provides the data against which the rule is simulated: prices, volumes, fundamentals, and index memberships for a given universe over a specified window. The *performance metric* is the single number used to decide whether the result is good: usually the Sharpe ratio, sometimes the information ratio, occasionally a utility-adjusted alternative. The output of the backtest is then a story about the past, expressed in numbers that look quantitative and definitive. It is worth dwelling on why backtests *feel* so rigorous. They use real prices on real dates. They produce real-looking equity curves with drawdowns dated to actual recessions. They run on the same Pandas dataframes that professional portfolio managers use. A reader meeting a backtest for the first time typically assumes that the exercise is a factual claim about the past: the strategy did earn that return, or it did not. The feeling is misleading. The numbers are not facts about the past. They are *the result of choices made when specifying the rule*, evaluated on a single historical realisation of an uncertain world. Change the universe, the rebalancing day, the definition of the signal, or the window, and every number moves. This is not a pathology. It is simply that the backtest has a large and unacknowledged degree of freedom, and the researcher's job is to ensure that the number finally reported is not merely the luckiest corner of that degree of freedom. The pedagogical point is then straightforward. Backtests are a *filter*, not an *oracle*. They are useful for ruling out strategies that cannot even survive in-sample, for quantifying the turnover and drawdown profile of a strategy that has already passed some other test, and for sizing risk budgets. They are dangerous when used to discover strategies that should be traded with real money, because the procedure of searching for the best rule systematically biases the output upward. 
## Why Backtests Lie {#sec-why-backtests-lie} ### Two hundred strategies and one winner Consider a concrete variant of the opening story. Suppose there are twenty years of monthly returns for a broad universe and a momentum-style rule. The lookback window is uncertain, as is the amount of recent-month skip and whether to use an equal-weighted or value-weighted portfolio. So multiple combinations are tried. Three lookbacks, three skips, two weightings, two rebalancing days, two holding horizons: thirty-six versions. The best one is reported. Its Sharpe ratio is 1.8. A momentum premium is then claimed. The difficulty is that, under the null hypothesis that none of the thirty-six rules has any true edge, the *maximum* of thirty-six i.i.d. sample Sharpe ratios has an expected value that is much larger than zero. The false-strategy theorem of @bailey2015pbo, restated in @deprado2018advances §14.7, gives the exact order of magnitude: the expected maximum Sharpe across $N$ independent trials is bounded above by $\sigma_{SR} \sqrt{2 \log N}$ and approximated to leading order by a $(1-\gamma) \Phi^{-1}(1 - 1/N) + \gamma \Phi^{-1}(1 - 1/(N e))$ expression, with $\gamma$ the Euler-Mascheroni constant. The bound is slow but relentless: at $N = 200$ trials the expected best Sharpe on pure noise is comfortably above one, and the ninety-fifth percentile is above two. @lopezdeprado2014dsr make the same point at a more concrete operational level: roughly **twenty iterations** is enough, at the conventional five-per-cent level, to discover a "significant" strategy from a martingale-distributed data stream. @bailey2015pbo illustrate the consequence by simulating two hundred mean-zero return streams, running the naive procedure, and observing that nearly every selected strategy would be advertised as a publishable result. They then track those same strategies out-of-sample. The in-sample "champions" underperform the median of their peers in 78% of splits. 
The procedure is not merely imprecise; it is actively anti-predictive, and for reasons we will make precise below. ### Multiple testing in one line The Bonferroni correction captures the intuition without any distributional assumptions. If a single hypothesis is tested at the five per cent level, the probability of a false positive is 0.05. If $m$ independent hypotheses are each tested at five per cent, the probability of at least one false positive is $1 - (1 - 0.05)^m$. At $m = 20$ this is already 0.64, and at $m = 100$ it is 0.994. To preserve an overall five per cent level across the family of tests, the per-test threshold must be lowered to $0.05 / m$, which for $m = 100$ is a $p$-value of $0.0005$, corresponding to a $t$-statistic of roughly 3.5. Bonferroni is conservative because real-world tests are not independent, but the direction of the adjustment is unambiguous: the more rules that are tried, the higher the bar the winner must clear. ### The factor zoo: Harvey, Liu and Zhu @harvey2016and make this concrete for the published factor literature. They catalogue 316 factors proposed between 1967 and 2014, most reported with a $t$-statistic just above 2. Applying a multiple-testing correction that is appropriate for the population of tests actually conducted, they argue that the threshold for declaring a new factor "discovered" should be around $t = 3.0$ rather than 2.0. The effect is dramatic. A large fraction of the post-2000 factors fall below the adjusted bar and should be treated with scepticism. 
@harvey2020false generalise the argument to a proper false-discovery-rate framework, using Benjamini-Yekutieli to handle dependence across tests, and arrive at a similar qualitative conclusion: the published literature contains both false discoveries (factors that look real and are not) and missed discoveries (factors that look weak but are actually real), and the appropriate statistical procedure is not a single $t$-test at 0.05 but an FDR-controlled ranking of candidates. For the present purpose the detail is less important than the headline: the conventional significance bar is far too low for the volume of testing that now occurs, both across the published literature and within any individual research project. The corollary is unsettling for anyone choosing a factor to analyse. Any chosen factor comes from a population of candidates known to contain a substantial fraction of false discoveries. Critical analysis should engage with this reality rather than assume it away. The @jensen2024replication dataset already incorporates a first line of defence: standardised construction methodology and direct re-estimation on independent data reduce the most obvious sources of overstatement, and the factor premia they report are typically thirty to fifty per cent smaller than the original-paper claims. ### Hold-out alone is not the answer Students often reach for the simplest defensive move: set aside the last twenty per cent of the sample as a hold-out, fit on the first eighty, evaluate on the held-out tail. This is better than nothing, but @bailey2015pbo argue in their §1 that it is insufficient for at least three concrete reasons. First, if the data are public, the researcher has almost certainly seen the hold-out already, either directly or through widely reported stylised facts; nobody approaching UK equity data for the period 2015-2020 is genuinely surprised by the Brexit referendum or the COVID-19 drawdown. 
Second, a single hold-out is a single point estimate, and different hold-outs can give very different conclusions; the hold-out period becomes a new degree of freedom that is itself vulnerable to implicit community-level overfitting (everyone evaluates on the same "standard" window). Third, and most importantly for our purposes, hold-out ignores the number of trials conducted on the training set, which is the actual source of the selection bias. A researcher who tries two hundred rules on the first eighty per cent of the data and reports the best one evaluated on the final twenty per cent has not solved the problem; they have merely displaced it. ### k-fold cross-validation leaks across time A related instinct is to reach for k-fold cross-validation, a workhorse of general machine learning. On time-series data this is actively misleading. Standard k-fold shuffles observations randomly across folds, which means the training folds for any given test fold contain observations from both before *and after* the test period. In a financial setting where today's features are constructed from rolling averages, momentum windows, or volatility estimates, this is look-ahead leakage in its most literal form: your training data already encodes information about the future you are claiming to predict. Even blocked k-fold (where folds are contiguous) suffers from label leakage when the predictive target spans multiple periods, as it does for any overlapping-return strategy. The correct resampling procedure for financial backtesting must respect time, must be aware of the overlap structure of labels, and must give us a *distribution* of out-of-sample outcomes rather than a single point estimate. ## Walk-Forward Validation: The Minimum Viable Remedy {#sec-walk-forward} The simplest procedure that respects time is *walk-forward* validation. The data are divided chronologically into a training window and a test window. 
The rule is selected or its parameters fit on the training window, evaluated on the test window, and the result recorded. The windows are then rolled forward by one step and the process repeated. At the end, instead of a single point estimate, the researcher has a *sequence* of out-of-sample returns: one per rolled window. Walk-forward has two honest properties that a naive train/test split does not. The researcher never uses data that would not have been available at the point of the decision, and the number of out-of-sample observations is meaningful because they are non-overlapping in the relevant sense. What to report from a walk-forward is then more than a single headline. The out-of-sample Sharpe ratio, averaged across windows, is the obvious first number. The *distribution* of out-of-sample Sharpe ratios across windows is more informative still, because it tells you whether the strategy's premium is built on one lucky year or spread across the sample. The worst window is a rough proxy for the drawdown the strategy would have produced in its most hostile regime. The stability of the selected parameters, if you are re-selecting at each step, tells you whether the rule is genuinely one rule or a sequence of different rules that happen to share a name. A useful rule of thumb, consistent with the López de Prado reporting conventions, is that if the in-sample Sharpe ratio is more than twice the median out-of-sample Sharpe ratio, the strategy is probably overfit regardless of how many other boxes it ticks. Walk-forward handles one strategy evaluated through many time windows. It does not, on its own, handle the more common research situation of many strategies evaluated through the same time window, which is the situation the selection-bias literature is mostly concerned with. For that we need a resampling procedure that asks a different question: of the strategies that looked best in-sample, how many of them stay best out-of-sample? 
It is also worth flagging a caveat that @deprado2018advances Ch. 11 emphasises: a walk-forward backtest is itself a single path through history, and if configurations are iterated on that path often enough, the walk-forward itself can be overfit. Walk-forward is a necessary discipline, not a sufficient one. The tools in the next two sections, CSCV and the deflated Sharpe ratio, are what close the residual gap. ## CSCV and the Probability of Backtest Overfitting {#sec-pbo} ### The picture, before the formalism The procedure @bailey2015pbo propose, Combinatorially Symmetric Cross-Validation, is easiest to understand as a picture. Divide the return history into $S$ contiguous slices of equal length; ten slices is a reasonable default. For every possible way of choosing $S/2$ slices to serve as the "in-sample" set, with the remaining $S/2$ serving as the "out-of-sample" set, do the following. Compute the in-sample Sharpe ratio of each strategy in the garden. Identify the in-sample champion (the strategy with the highest in-sample Sharpe). Compute the out-of-sample Sharpe of *that* strategy and record its rank among the full set of strategies out-of-sample. Repeat across all ${S \choose S/2}$ symmetric splits (or a random subsample of them, if the full enumeration is too large). The result is a distribution of out-of-sample ranks for the in-sample champion, one rank per split. If the in-sample champions really are the best strategies, their out-of-sample ranks should cluster near one. If, on the other hand, the in-sample ranking is mostly noise, the out-of-sample ranks should scatter uniformly across the full range, and in the worst case the in-sample champions should systematically land below the median. ### Defining PBO The *Probability of Backtest Overfitting* is the fraction of splits in which the in-sample champion underperforms the median out-of-sample Sharpe. 
In symbols, with $r^{IS}_s$ denoting the in-sample rank of strategy $s$ on a given split and $r^{OS}_s$ its out-of-sample rank, PBO is $$ \phi = \Pr\!\left[\, r^{OS}_{s^*} > \text{median}(r^{OS}) \,\right], \qquad s^* = \arg\max_s r^{IS}_s, $$ evaluated as the empirical frequency across all CSCV splits. Equivalently, and more numerically stable, one applies a logit transform to the out-of-sample rank, $\tau_s = \log[r/(1 - r)]$ where $r$ is the normalised rank, and reports PBO as the fraction of splits with $\tau < 0$. A strategy with no edge at all, and with the selection procedure applied honestly, produces a PBO close to 0.5. A strategy with a clear edge produces a PBO close to zero. A strategy whose selection procedure is *actively* chasing noise (our warning case below) produces a PBO substantially above 0.5. ### Two anchor numbers from Bailey et al. The intuition becomes concrete when you look at the two worked examples in @bailey2015pbo §3. In their synthetic overfit case, built from return streams with no true edge, PBO comes out at **0.74**: the in-sample champion is below the out-of-sample median in three splits out of four. Alongside this, **100% of the in-sample Sharpe ratios are positive**, and **78% of the out-of-sample Sharpe ratios are negative**. The selection procedure is not merely producing noise; it is reliably selecting strategies that will underperform going forward. In their second example, using a real investment strategy with a genuine signal, PBO drops to **0.04%**. Only 3% of out-of-sample Sharpes are negative and only 4% of in-sample champions underperform the out-of-sample median. The same procedure, applied to the same reporting template, produces two results that differ by three orders of magnitude. That is the evidentiary gap PBO is designed to make visible. ### The negative-slope phenomenon There is a further fact, reported in @bailey2015pbo §3.2, that is routinely counter-intuitive on first reading. 
If one regresses the out-of-sample Sharpe ratio of the in-sample champion on its in-sample Sharpe ratio, across splits of an overfit backtest, the slope coefficient is *negative*. A higher in-sample Sharpe predicts a *lower* out-of-sample Sharpe. The reason is that in an overfit setting the in-sample Sharpe measures how aggressively a strategy has fitted past noise, and aggressive fitting to noise is exactly what prevents generalisation. The point can be read as a corollary of the bias-variance decomposition: a strategy that has maximised in-sample performance on limited data has typically spent most of its flexibility on features of that sample that will not recur. The more impressive the in-sample number, the more the procedure has been rewarded for over-fitting, and the more confidently one can predict that the out-of-sample number will be worse. ### Interpreting PBO in Practice Once the histogram is in hand, the interpretation follows @bailey2015pbo §3.1. A PBO below 0.05 meets the Neyman-Pearson-style threshold they propose: the procedure is producing a result consistent with a genuine edge and one can, with reasonable discipline, proceed to paper-trading. A PBO between 0.05 and 0.5 is a warning flag; something has selected a strategy that is barely better than random out-of-sample, and no conclusion about skill is defensible without much more context. A PBO above 0.5 is worse than a coin flip: the selection procedure is systematically anti-predictive and the reported backtest should be discarded. ### A small worked demonstration The utilities used by Lab 10 are small enough that we can reproduce the core statistics in a few lines. The function `generate_noise_strategies` simulates a garden of mean-zero return streams with a specified common correlation, `cscv_pbo` runs the CSCV machinery, and `deflated_sharpe_ratio` (used in the next section) inflates the benchmark. All three are plain NumPy, no mlfinlab required. 
```{python} #| code-fold: true #| code-summary: "Show: PBO on pure noise vs weak edge" import numpy as np from overfit_metrics import ( generate_noise_strategies, sharpe_ratio, cscv_pbo, ) T, N = 240, 200 rho = 0.2 # (a) pure noise: no strategy has an edge X_noise = generate_noise_strategies(T=T, N=N, rho=rho, seed=42) pbo_noise = cscv_pbo(X_noise, n_folds=10, max_splits=150).pbo # (b) ten of the two hundred carry a genuine edge of roughly 0.5 # annualised Sharpe. At unit-variance monthly noise this corresponds # to adding 0.50 / sqrt(12) ~= 0.144 per month to the returns of the # first ten strategies; a factor-strength effect. X_edge = generate_noise_strategies(T=T, N=N, rho=rho, seed=42) X_edge[:, :10] += 0.50 / np.sqrt(12) pbo_edge = cscv_pbo(X_edge, n_folds=10, max_splits=150).pbo print(f"PBO on pure noise : {pbo_noise:.3f}") print(f"PBO with 10/200 edged : {pbo_edge:.3f}") ``` The first number sits near 0.5 (typically in the 0.35-0.60 band, depending on the random seed and the CSCV subsample), consistent with the theoretical prediction that the in-sample champion of a noise garden stays below the out-of-sample median roughly half the time. The second number is substantially lower, because once some strategies carry a genuine edge they tend to stay near the top of the ranking in every split, and the champion's out-of-sample rank is much more often above the median. The difference is the diagnostic signal: when real structure is injected into the data, PBO responds; when it is not, PBO does not. Note that these numbers never reach the PBO = 0.74 of Bailey et al.'s synthetic example, because this garden has weaker cross-strategy correlation and a less aggressive selection rule. Their example is deliberately chosen to maximise the overfitting failure mode; this calibration is chosen to show the mechanism without requiring a pathological setup. 
## The Deflated Sharpe Ratio {#sec-dsr} ### PSR, the easy test PBO answers a question about *persistence*: when you rank strategies in-sample, does that ranking carry over to out-of-sample? It does not, on its own, answer a question about *magnitude*: is the Sharpe ratio of the chosen strategy large enough to matter, given the sample size and the non-normality of returns? For that we need a different statistic. @lopezdeprado2014dsr, building on @bailey2015pbo and earlier work, define the *Probabilistic Sharpe Ratio* as the probability that the true Sharpe ratio of a strategy exceeds some benchmark $SR^*$, given the sample Sharpe ratio $\widehat{SR}$, the sample size $n$, the sample skewness $\hat\gamma_3$, and the sample kurtosis $\hat\gamma_4$. The closed-form expression is $$ \mathrm{PSR}(SR^*) = \Phi\!\left(\frac{(\widehat{SR} - SR^*)\sqrt{n-1}}{\sqrt{1 - \hat\gamma_3 \widehat{SR} + \tfrac{\hat\gamma_4 - 1}{4}\widehat{SR}^2}}\right), $$ where $\Phi$ is the standard normal CDF. The formula corrects three defects of the naive Sharpe ratio: it accounts for finite-sample noise (through the $\sqrt{n-1}$ factor in the numerator), for skewness (skewed returns inflate or deflate confidence in a point estimate of SR), and for fat tails (which are ubiquitous in financial returns and make the simple formula too optimistic). When $SR^*$ is set to zero, PSR answers "how confident are we that the true Sharpe is positive?" This is a weak and easily satisfied null. On our noise simulation above, where the in-sample champion has a Sharpe near 1.8, $\mathrm{PSR}(0)$ will comfortably exceed 0.95 for any reasonable sample size. This is not because the strategy is any good. It is because the *best of two hundred* sample Sharpe ratios on mean-zero data is almost certain to be well above zero in finite samples, and $\mathrm{PSR}(0)$ is measuring exactly that. ### Inflating the benchmark for selection The Deflated Sharpe Ratio is the same formula with a more honest benchmark. 
Instead of asking whether the true Sharpe exceeds zero, it asks whether the true Sharpe exceeds the expected maximum Sharpe one would obtain from searching across $N$ trials with average pairwise correlation $\rho$ on data with no real edge. The benchmark is $$ SR^* = \sigma_{SR} \cdot \left\{ (1 - \gamma) \cdot \Phi^{-1}\!\left(1 - \tfrac{1}{N}\right) + \gamma \cdot \Phi^{-1}\!\left(1 - \tfrac{1}{N e}\right) \right\}, $$ where $\gamma \approx 0.5772$ is the Euler-Mascheroni constant, $\sigma_{SR}$ is the cross-trial standard deviation of Sharpe ratios (shrunk for correlation among trials), and $N$ is the number of trials. The intuition is what matters: the more rules that were tried, and the less correlated those rules were with each other, the higher the selection-adjusted benchmark the winner must clear. DSR is then defined as $\mathrm{PSR}(SR^*)$ with this inflated benchmark. On a genuine edge, DSR will be close to one. On noise, DSR will be well below one, and typically well below PSR(0). ### The PSR-DSR gap as cost of selection The number to report is less PSR or DSR on its own than the *gap* between them. PSR against a zero benchmark is always reassuring in any setting where you have searched across non-trivial numbers of trials, because the best of a large number of noisy draws is almost always positive. DSR strips that reassurance away by moving the goalposts to where they belonged in the first place. The gap between the two is the quantitative cost of having searched. 
Running the DSR machinery on our noise garden:

```{python}
#| code-fold: true
#| code-summary: "Show: PSR vs DSR on the noise garden"
import numpy as np
import pandas as pd
from overfit_metrics import probabilistic_sharpe_ratio, deflated_sharpe_ratio

# T, N, rho, generate_noise_strategies and sharpe_ratio are assumed to be
# defined in the earlier simulation cells.
X = generate_noise_strategies(T=T, N=N, rho=rho, seed=42)
sr_all = np.array([sharpe_ratio(X[:, j]) for j in range(N)])

# Select the in-sample champion and collect the inputs PSR and DSR need.
j_star = int(np.argmax(sr_all))
x_star = X[:, j_star]
sr_hat = sr_all[j_star]
n_obs = len(x_star)
skew = float(pd.Series(x_star).skew())
kurt = float(pd.Series(x_star).kurtosis() + 3.0)  # pandas returns excess kurtosis

psr_0 = probabilistic_sharpe_ratio(sr_hat, 0.0, n_obs, skew=skew, kurtosis=kurt)
dsr, sr_star = deflated_sharpe_ratio(
    sr_hat=sr_hat,
    sr_trials=sr_all,
    n_obs=n_obs,
    skew=skew,
    kurtosis=kurt,
    rho=rho,
)

print(f"Sample Sharpe (champion)     : {sr_hat:.3f}")
print(f"Selection-adjusted SR*       : {sr_star:.3f}")
print(f"PSR against SR*=0            : {psr_0:.3f}")
print(f"DSR against SR*={sr_star:.3f}        : {dsr:.3f}")
print(f"PSR - DSR (cost of selection): {psr_0 - dsr:+.3f}")
```

On a typical run of the noise garden the sample Sharpe of the champion (measured per-period, before annualisation) sits around 0.14, which annualises to roughly 0.5 on monthly data: enough for $\mathrm{PSR}(0)$ to comfortably exceed 0.95, because the finite-sample test of "is the true Sharpe greater than zero?" is easy to pass. But the selection-adjusted benchmark $SR^*$ is itself around 0.14 (because the expected maximum of two hundred correlated noise Sharpes is essentially what we observed), so DSR collapses to a value close to 0.5. The difference between $\mathrm{PSR}(0) \approx 0.98$ and $\mathrm{DSR} \approx 0.5$ is the quantitative statement that the Sharpe ratio was large enough to clear the "is it positive?" bar but nowhere near large enough to clear the "is it the best of a two-hundred-strategy search?" bar. On a genuine edge, as in the @bailey2015pbo real-strategy example, $\mathrm{PSR}(0)$ is still near one, but DSR also rises materially, and the gap shrinks.
The gap, not the level, is the reporting number.

### When DSR matters and when it does not

A natural question at this point is whether DSR is always appropriate. The honest answer is that DSR is the right statistic whenever the reported result was *selected* from a search, and "selected" includes more than formal hyperparameter tuning. If the universe is chosen because it looks cleaner, the rebalancing day because it gives smoother returns, or the lookback window because it makes the factor work, then $N$ is not one and a DSR-style correction is warranted. Conversely, if the reported rule is one published before the data were examined, implemented exactly as specified, and not tuned on the sample, then $N = 1$ and DSR reduces to PSR. Most applied research submissions are in the first category, not the second.

## Honest Reporting and a Research Template {#sec-reporting}

### What a credible backtest write-up contains

The discipline of this chapter is easier to remember as a single reporting template than as a set of statistical facts. Whenever a backtest result is claimed, the following should be on the page. The *universe and sample*: which stocks, from which exchange, over which dates, at which rebalancing frequency. The *trials*: how many distinct configurations of the rule were evaluated before choosing the one reported, and (approximately) how correlated they were with each other. The *walk-forward diagnostics*: the out-of-sample Sharpe ratio, the distribution across windows, the worst window, and the turnover. The *CSCV-based PBO*, with the histogram of logit ranks and an explicit statement of whether the result clears the PBO < 0.05 threshold. The *PSR against a zero benchmark* and the *DSR against the selection-adjusted benchmark*, with an explicit statement of the PSR-DSR gap. Finally, a *decision*: given the numbers, should the strategy be promoted to paper-trading, parked pending more data, or discarded?
The decision should be declared against thresholds fixed *before* computing the numbers, not after.

A rigorous assessment should be built around this template, and should reward reports that apply it honestly. A result whose DSR is below 0.95 and whose PBO is above 0.5 is not a weaker version of a good result; it is a different claim, and the appropriate action is to report it as such and explain why you are not promoting the strategy. A report that presents an impressive in-sample Sharpe ratio without an accompanying PBO or DSR is methodologically incomplete, just as a factor regression without HAC standard errors is methodologically incomplete in Chapter 9.

::: {.callout-tip}
## López de Prado's Third Law of Backtesting

> "Every backtest result must be reported in conjunction with all the trials involved in its production. Absent that information, it is impossible to assess the backtest's 'false discovery' probability."
>
> --- @deprado2018advances, Snippet 14.5

Any serious research workflow should treat this instruction literally. The number of trials you ran and an estimate of their average cross-correlation are as much a part of the reported result as the Sharpe ratio itself. A report that presents the latter without the former is reporting a statistic whose distribution under the null has not been pinned down, and consequently a statistic that cannot be evaluated.
:::

### A simple decision rule, pre-committed

One of the cleanest ways to demonstrate research discipline is to commit to thresholds in advance and report against them. A widely used (but deliberately conservative) convention is:

| PBO | DSR | Action |
|-----|-----|--------|
| below 0.2 | above 0.95 | Promote to paper-trading |
| 0.2 to 0.4 | 0.7 to 0.95 | Park; collect more data before deploying |
| above 0.4 | below 0.7 | Discard |

These thresholds are conventions, not laws. What matters is that they are fixed before looking at the DSR number, and then followed when the result is unfavourable.
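
One way to make the pre-commitment concrete is to freeze the cut-offs in code before any results exist. The sketch below is a hypothetical helper (the names `Thresholds` and `decide` are not from the chapter's codebase). The table does not say what to do when PBO and DSR fall in different rows, so the sketch assumes the less favourable tier wins; that tie-breaking rule is an assumption and should itself be declared in advance.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    """Cut-offs fixed before results are computed (values from the table)."""
    pbo_promote: float = 0.2   # PBO must be below this to promote
    pbo_discard: float = 0.4   # PBO above this means discard
    dsr_promote: float = 0.95  # DSR must be above this to promote
    dsr_discard: float = 0.7   # DSR below this means discard


def decide(pbo: float, dsr: float, t: Thresholds = Thresholds()) -> str:
    """Map (PBO, DSR) to an action.

    Assumption: when the two statistics disagree, the worse tier governs.
    """
    if pbo > t.pbo_discard or dsr < t.dsr_discard:
        return "discard"
    if pbo < t.pbo_promote and dsr > t.dsr_promote:
        return "promote"
    return "park"


# Example calls with made-up statistics.
print(decide(pbo=0.10, dsr=0.97))
print(decide(pbo=0.30, dsr=0.97))
print(decide(pbo=0.45, dsr=0.99))
```

Because the dataclass is frozen, the thresholds cannot be nudged after the fact without an explicit, visible change to the defaults.
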
A report that promotes a strategy with PBO above 0.4 on the grounds that "it is close to the cut-off and has a nice story" is exactly the kind of post-hoc rationalisation that this apparatus is designed to prevent.

### The one-line version of the week

The apparatus is elaborate but the lesson it is built to teach is not. A great-looking backtest is not, on its own, evidence that a strategy will work live. The number that most people treat as evidence (the in-sample Sharpe ratio) is the number that is most sensitive to how hard you searched. The procedures introduced in this chapter are the minimum set of corrections that make the Sharpe ratio an honest statement rather than a selection statistic.

## From Research to Live Trading {#sec-live-trading}

This chapter is about the statistical discipline that separates a credible backtest from a selection artefact. It is not about the engineering discipline that separates a credible backtest from a live trading system. That second discipline is substantial in its own right: it covers feature engineering under point-in-time constraints, model-versioning for rollback, drift detection under non-stationary distributions, and the governance frameworks that regulators require for models that move real money. Those concerns are covered in the sibling appendix on production ML pipelines, retained as reference material.

For present purposes, there is no need to build a production system. The immediate task is to write a backtest whose numbers can be defended. The most consequential failure mode in that direction is still statistical rather than operational: a strategy whose selection bias has not been priced cannot be rescued by any amount of pipeline engineering downstream. A strategy that has been priced honestly, on the other hand, can be engineered into a live system with the tools described in the appendix. The two chapters are meant to be read in that order.
## Synthesis {#sec-synthesis}

A single pre-specified factor can be evaluated with HAC-robust standard errors. A strategy chosen from a search space the researcher controls requires more. The four corrections are nested. HAC handles within-sample temporal dependence. Walk-forward handles naive in-sample evaluation. CSCV with PBO handles the multiple-strategy selection problem. DSR handles the multiple-strategy *magnitude* problem. Each sits strictly on top of the one below it, and a credible research claim generally needs all four.

The broader argument can now be stated compactly. Empirical finance demands both data-quality discipline and statistical discipline. Once search enters the workflow, the tension between the incentive to search and the cost of searching must be made explicit in the reported numbers. The tools in this chapter are the minimum technology required to keep that tension visible in practice.

The single most effective step is to apply the reporting template in §5 before doing anything else. Declare the universe, pre-declare thresholds, state the number of trials to be run and the assumed $\rho$, and then run the search. If the headline result does not survive the template, report that outcome explicitly. Honest reporting is the core skill this chapter is designed to develop.

## Further Reading {.unnumbered}

- @deprado2018advances is the canonical book-length treatment. Chapter 11 develops the backtesting-as-selection-problem framing, §11.4 articulates the walk-forward critique discussed in §@sec-walk-forward, and §14.7 derives the PSR, DSR and the selection-adjusted benchmark $SR^*$ used in §@sec-dsr. The "Seven Reasons Most Backtests Fail" working paper [@lopez2018backtesting] is a short companion read.
- @bailey2015pbo is the primary reference for CSCV and PBO.
Sections 1 (why hold-out fails), 2 (the CSCV procedure), and 3 (the two worked examples and the negative-slope scatter) are the most directly student-accessible, and together they justify nearly every empirical claim in §4 of this chapter.
- @lopezdeprado2014dsr is the primary reference for PSR and DSR. The selection-adjusted benchmark formula with the Euler-Mascheroni constant is eq. 10; the PSR correction for finite samples and non-normality is eq. 4.
- @harvey2016and establishes the factor-zoo diagnosis and the $t > 3$ recommendation. For factor selection and replication work, this is a key companion reading on the factor-zoo problem.
- @harvey2020false is the formal FDR successor to @harvey2016and, with a Benjamini-Yekutieli procedure that handles dependence across factor tests.
- @jensen2024replication provides a large-scale replication dataset and quantifies the typical thirty-to-fifty-per-cent attenuation of published factor premia under standardised construction.
- @white2000realitycheck and @hansen2005spa are the earlier predecessors to the CSCV/PBO literature, oriented around the reality-check and superior-predictive-ability tests. They are worth reading for the continuity of the idea more than for their specific implementations.
- For the engineering side, see the sibling appendix on production ML pipelines and the references therein.

A companion lab applies CSCV, PBO, PSR and DSR on both a pure-noise garden and a weak-edge contrast, using a lightweight implementation with standard scientific-Python dependencies.

## References {.unnumbered}

::: {#refs}
:::