Lab 6B: Backtest Overfitting (CSCV, PBO, PSR/DSR)

Before You Code: The Big Picture

The #1 problem in quantitative finance: your backtest looks great, but it fails in live trading. Why? Overfitting: you optimized parameters on the same data you tested on. Your “alpha” is actually selection bias.

The Backtest Overfitting Problem

The Scenario: You test 200 trading strategies on 20 years of data. One strategy has a Sharpe ratio of 2.5: amazing! You deploy it with real money. It loses money immediately. What happened?

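The Problem:
- With 200 tries, at least one is almost certain to look good by pure luck (multiple testing)
- In-sample optimization + in-sample testing = an upward-biased performance estimate
- Traditional cross-validation doesn’t detect this: it scores each model but not the act of picking the best of many, and shuffled folds can leak information across time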

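The Solution (Bailey & López de Prado):
1. CSCV (Combinatorially Symmetric Cross-Validation): Splits the sample into blocks and evaluates every symmetric in-sample/out-of-sample combination of those blocks (this is not a walk-forward scheme)
2. PBO (Probability of Backtest Overfitting): Estimates how often the in-sample winner ranks below the out-of-sample median
3. PSR (Probabilistic Sharpe Ratio): Estimates the probability that the true Sharpe exceeds a benchmark (e.g. 0), given sample length, skewness and kurtosis
4. DSR (Deflated Sharpe Ratio): PSR measured against a benchmark that is raised to account for multiple testing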

The Evidence: Harvey, Liu & Zhu (2016, RFS) argue that, once multiple testing is accounted for, many published factors are likely false discoveries driven by p-hacking. PBO/PSR help detect this before losing real money.
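The “200 tries” arithmetic can be made concrete. The sketch below assumes independent trials and a 5% test level, both simplifications introduced here for illustration; the function name is hypothetical and not part of the lab utilities. Correlated strategies would lower the effective number of trials.

```python
def prob_at_least_one_false_positive(n_trials: int, alpha: float = 0.05) -> float:
    """P(at least one of n_trials independent true-null tests passes at level alpha)."""
    return 1.0 - (1.0 - alpha) ** n_trials


# How quickly a "lucky winner" becomes near-certain as the search widens
for n in (1, 10, 50, 200):
    p = prob_at_least_one_false_positive(n)
    print(f"{n:>3} trials -> P(at least one lucky winner) = {p:.4f}")
```

Under these assumptions the probability rises monotonically with the number of trials, which is why the number of trials should always be disclosed alongside the winning Sharpe ratio.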

What You’ll Build Today

By the end of this lab, you will have:

✅ Understanding of why standard backtesting fails
✅ CSCV implementation for honest validation
✅ PBO calculation showing selection bias
✅ PSR/DSR metrics for performance significance
✅ Critical perspective on published trading strategies

Time estimate: 90-120 minutes (this is advanced material: take your time)

Why This Matters for Coursework 2

Your factor replication must use walk-forward validation and report PBO/PSR. Otherwise, your Sharpe ratio is meaningless: it’s just in-sample optimization parading as out-of-sample performance. This lab shows you how to do it right.

Objectives

Diagnose backtest overfitting with combinatorially symmetric cross‑validation (CSCV)
Estimate Probability of Backtest Overfitting (PBO)
Quantify performance significance via Probabilistic Sharpe Ratio (PSR); discuss Deflated Sharpe Ratio (DSR)

Note

This lab follows Bailey & López de Prado’s approach to selection bias: CSCV → PBO and PSR/DSR. We implement lightweight utilities and show how to compare against mlfinlab if available.

Setup

Part A: A garden of strategies on pure noise

We simulate N=200 strategies with no true edge. In a finite sample, one will “win” in‑sample by chance.

Observation: Even with zero true edge, the best in‑sample Sharpe can look compelling.
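For readers without the lab utilities at hand, here is a minimal self-contained sketch of a noise garden. It is not the lab's `generate_noise_strategies`; `simulate_noise_garden` is a hypothetical stand-in that assumes a one-factor structure to produce a common pairwise correlation `rho`, and the monthly volatility of 4% is an arbitrary illustrative choice.

```python
import numpy as np


def simulate_noise_garden(T=240, N=200, rho=0.2, vol=0.04, seed=123):
    """Zero-mean returns for N strategies with pairwise correlation ~rho (assumed one-factor model)."""
    rng = np.random.default_rng(seed)
    common = rng.standard_normal((T, 1))   # shared shock, one per period
    idio = rng.standard_normal((T, N))     # strategy-specific shocks
    z = np.sqrt(rho) * common + np.sqrt(1.0 - rho) * idio
    return vol * z


def per_period_sharpe(x):
    """Column-wise Sharpe ratio in per-period (non-annualized) units."""
    x = np.asarray(x, dtype=float)
    return x.mean(axis=0) / x.std(axis=0, ddof=1)


X = simulate_noise_garden()
sr = per_period_sharpe(X)
best = int(np.argmax(sr))
print(f"median SR = {np.median(sr):.3f}, best SR = {sr[best]:.3f} (strategy {best})")
```

Every column has a true Sharpe of zero by construction, so any gap between the best and the median Sharpe is pure selection effect.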

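Part B: CSCV and Probability of Backtest Overfitting (PBO)

We split the time axis into contiguous folds, assign half of the folds to the in‑sample set and the remaining half to the out‑of‑sample set, and repeat over many such combinations. For each split we pick the in‑sample “champion” and measure its out‑of‑sample rank. PBO is the fraction of splits where the champion lands below the out‑of‑sample median (negative logit rank).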

(0.6, 150)

Interpretation: A high PBO indicates that selecting the in‑sample “winner” is likely to disappoint out‑of‑sample.
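The procedure described above can be sketched in a few lines. This is an illustrative reimplementation under stated assumptions, not the lab's `cscv_pbo`: the name `cscv_pbo_sketch`, the use of the per-period Sharpe ratio as the selection metric, and the rank convention (1 = worst, N = best, relative rank = rank / (N + 1)) are choices made here.

```python
from itertools import combinations

import numpy as np


def _sharpe(x):
    """Column-wise per-period Sharpe ratio."""
    return x.mean(axis=0) / x.std(axis=0, ddof=1)


def cscv_pbo_sketch(X, n_folds=10, max_splits=None, seed=0):
    """Return (PBO, logits) from combinatorially symmetric splits of a T x N return matrix."""
    X = np.asarray(X, dtype=float)
    if n_folds % 2 != 0:
        raise ValueError("n_folds must be even so both halves have equal size")
    T, N = X.shape
    folds = np.array_split(np.arange(T), n_folds)           # contiguous time blocks
    splits = list(combinations(range(n_folds), n_folds // 2))
    if max_splits is not None and max_splits < len(splits):
        rng = np.random.default_rng(seed)                    # subsample splits for speed
        keep = rng.choice(len(splits), size=max_splits, replace=False)
        splits = [splits[i] for i in keep]

    logits = []
    for is_ids in splits:
        chosen = set(is_ids)
        is_rows = np.concatenate([folds[k] for k in is_ids])
        oos_rows = np.concatenate([folds[k] for k in range(n_folds) if k not in chosen])
        champion = int(np.argmax(_sharpe(X[is_rows])))       # best strategy in-sample
        oos_sr = _sharpe(X[oos_rows])
        rank = int((oos_sr <= oos_sr[champion]).sum())       # 1 = worst, N = best
        omega = rank / (N + 1.0)                             # relative rank in (0, 1)
        logits.append(np.log(omega / (1.0 - omega)))
    logits = np.array(logits)
    return float(np.mean(logits < 0)), logits


rng = np.random.default_rng(42)
noise = rng.standard_normal((240, 50))
pbo, logits = cscv_pbo_sketch(noise, n_folds=8)
print(f"PBO on pure noise = {pbo:.2f} over {len(logits)} splits")
```

With 10 folds there are 252 symmetric splits, so subsampling via `max_splits` trades precision of the PBO estimate for run time.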

Part C: PSR and discussion of DSR

We compute the Probabilistic Sharpe Ratio (PSR) of the champion against a 0 benchmark. DSR additionally deflates for selection bias by using a higher benchmark Sharpe (selection threshold). If mlfinlab is installed, we compare against its DSR.

0.9994357939306875
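As a reference point, the PSR formula from Bailey & López de Prado can be written out directly: PSR(SR*) = Φ((SR − SR*)·√(n − 1) / √(1 − skew·SR + (kurt − 1)/4·SR²)), where SR is the estimated per-period Sharpe ratio and kurt is the non-excess kurtosis. The sketch below is a standalone illustration; `psr_sketch` is a hypothetical name, not the lab's utility, and it assumes the Sharpe ratio and `n_obs` refer to the same sampling frequency.

```python
import math


def normal_cdf(z: float) -> float:
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def psr_sketch(sr_hat, sr_benchmark, n_obs, skew=0.0, kurtosis=3.0):
    """Probability that the true per-period Sharpe exceeds sr_benchmark.

    kurtosis is the raw (non-excess) kurtosis, so 3.0 corresponds to a normal distribution.
    """
    radicand = 1.0 - skew * sr_hat + (kurtosis - 1.0) / 4.0 * sr_hat ** 2
    if radicand <= 0:
        raise ValueError("non-positive variance term; check skew/kurtosis inputs")
    z = (sr_hat - sr_benchmark) * math.sqrt(n_obs - 1) / math.sqrt(radicand)
    return normal_cdf(z)


# Same estimated Sharpe, different return shapes: negative skew and fat tails lower PSR
print(f"normal returns:      {psr_sketch(0.2, 0.0, 240):.4f}")
print(f"skew=-1, kurtosis=6: {psr_sketch(0.2, 0.0, 240, skew=-1.0, kurtosis=6.0):.4f}")
```

A PSR close to 1 against a benchmark of 0 only says the champion's Sharpe is unlikely to be zero if it had been the sole strategy tested; it does not correct for the search that produced it.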

Optional: compare with mlfinlab’s DSR (if available). Note DSR uses an elevated benchmark Sharpe that accounts for the number of trials and their correlation (see paper for details).
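To make the “elevated benchmark” concrete, the sketch below uses the expected-maximum-Sharpe approximation from Bailey & López de Prado (2014): SR0 = √V · ((1 − γ)·Φ⁻¹(1 − 1/N) + γ·Φ⁻¹(1 − 1/(N·e))), where V is the variance of Sharpe estimates across trials, N is the number of independent trials and γ is the Euler-Mascheroni constant. The function names are hypothetical and this is not mlfinlab's implementation. The formula assumes independent trials, so with correlated strategies N should be read as an effective number of trials.

```python
import math
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329


def expected_max_sharpe(sr_variance: float, n_trials: int) -> float:
    """Approximate expected maximum Sharpe among n_trials independent zero-edge strategies."""
    if n_trials < 2:
        raise ValueError("n_trials must be at least 2")
    nd = NormalDist()
    q1 = nd.inv_cdf(1.0 - 1.0 / n_trials)
    q2 = nd.inv_cdf(1.0 - 1.0 / (n_trials * math.e))
    return math.sqrt(sr_variance) * ((1.0 - EULER_GAMMA) * q1 + EULER_GAMMA * q2)


def dsr_sketch(sr_hat, sr_variance, n_trials, n_obs, skew=0.0, kurtosis=3.0):
    """PSR evaluated against the expected-maximum benchmark (kurtosis is non-excess)."""
    sr0 = expected_max_sharpe(sr_variance, n_trials)
    radicand = 1.0 - skew * sr_hat + (kurtosis - 1.0) / 4.0 * sr_hat ** 2
    z = (sr_hat - sr0) * math.sqrt(n_obs - 1) / math.sqrt(radicand)
    return NormalDist().cdf(z)


# Illustrative inputs: 200 trials, 240 monthly observations, and a cross-trial
# Sharpe variance of 1/240 (roughly the sampling variance of a zero Sharpe)
sr0 = expected_max_sharpe(sr_variance=1.0 / 240, n_trials=200)
print(f"benchmark SR0 = {sr0:.3f}")
print(f"DSR for SR_hat = 0.20: {dsr_sketch(0.20, 1.0 / 240, 200, 240):.3f}")
```

The benchmark grows with the number of trials, so the same observed Sharpe earns a lower DSR the wider the search was.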

Optional: Empirical Selection Benchmark (SR*)

An intuitive (but approximate) benchmark SR* is the selection threshold you would have used to promote a strategy, e.g., the 95th percentile of candidate SRs or the top‑k cutoff used in model selection. This inflates the benchmark to reflect the search.

(np.float64(0.13389655969594796), 0.8896047744919638)

Tip

Guidance: PSR answers “what is the probability that the true SR > benchmark SR*?”. DSR raises the benchmark SR* to deflate for selection bias (many trials and correlation among them). When reporting results, disclose the number of trials and use CSCV/PBO to evidence robustness.

Extension: Replace noise with weak‑edge signals

Modify the simulation so a small subset of strategies has a slight positive mean. Re‑run CSCV/PBO and PSR to see how both respond when a genuine but small edge is present.

0.6071428571428571
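One way to see why a tiny edge barely moves PBO is to ask how often the in-sample champion is actually one of the strategies with a real edge. The sketch below is a hypothetical helper, not part of the lab utilities, and assumes independent unit-variance noise, so the added mean equals the per-period Sharpe of the edge strategies.

```python
import numpy as np


def champion_hit_rate(edge_sharpe, n_edge=10, T=240, N=200, n_sims=100, seed=0):
    """Share of simulations in which the best in-sample Sharpe belongs to a true-edge strategy."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_sims):
        X = rng.standard_normal((T, N))
        X[:, :n_edge] += edge_sharpe              # first n_edge columns carry the edge
        sr = X.mean(axis=0) / X.std(axis=0, ddof=1)
        hits += int(np.argmax(sr) < n_edge)
    return hits / n_sims


for edge in (0.0, 0.05 / np.sqrt(12), 0.2):
    rate = champion_hit_rate(edge)
    print(f"per-period edge Sharpe {edge:.3f}: champion is a true-edge strategy in {rate:.0%} of runs")
```

When the edge is small relative to the sampling noise of the Sharpe estimate, the champion is usually still a lucky noise strategy, so a PBO near the pure-noise level is the expected outcome rather than a failure of the method.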

Deliverables

Report the observed PBO and interpret its meaning
Report PSR for the selected strategy; if available, compare with DSR
Describe how your result changes when a few strategies have a genuine (small) edge

How to Report (Template)

Trials: We evaluated N strategies/hyper‑parameters (comment on similarity/correlation if relevant).
Selection: In‑sample selection metric = [Sharpe/alpha/etc.] with CSCV splits (k=10).
Robustness: PBO = X.XX across S splits (show logit rank histogram).
Significance: PSR = X.XX vs SR*=0 (skew=…, kurt=…, n=…)
- Optional: DSR = X.XX (assumptions: trials=N, rho=…, length=n).
Data: period, universe, costs/slippage, vintages/release timing.
Decision: [Promote/Park], rationale and next steps (e.g., live paper trading).

References

Bailey et al. (2015): Probability of Backtest Overfitting (PBO) and CSCV
Bailey and López de Prado (2014): Deflated Sharpe Ratio (DSR)
López de Prado, M.: Deflated Sharpe Ratio (DSR), SSRN
White (2000): Reality Check for data snooping
Hansen (2005): Superior Predictive Ability (SPA) test

Bailey, David H., Jonathan M. Borwein, Marcos López de Prado, and Qiji Jim Zhu. 2015. “The Probability of Backtest Overfitting.” Journal of Computational Finance. https://doi.org/10.2139/ssrn.2326253.

Bailey, David H., and Marcos López de Prado. 2014. “The Deflated Sharpe Ratio: Correcting for Selection Bias, Backtest Overfitting and Non-Normality.” Journal of Portfolio Management 40 (5): 94–107. https://doi.org/10.2139/ssrn.2460551.

---
title: "Lab 6B: Backtest Overfitting (CSCV, PBO, PSR/DSR)"
format:
  html:
    toc: true
bibliography: ../resources/reading.bib
execute:
  echo: false
  warning: false
  message: false
---

## Before You Code: The Big Picture

The #1 problem in quantitative finance: **your backtest looks great, but it fails in live trading**. Why? **Overfitting**: you optimized parameters on the same data you tested on. Your "alpha" is actually selection bias.

::: {.callout-note}
## The Backtest Overfitting Problem

**The Scenario:** You test 200 trading strategies on 20 years of data. One strategy has a Sharpe ratio of 2.5: amazing! You deploy it with real money. It loses money immediately. What happened?

**The Problem:**

- With 200 tries, **at least one is almost certain to look good by pure luck** (multiple testing)
- In-sample optimization + in-sample testing = an upward-biased performance estimate
- Traditional cross-validation doesn't detect this: it scores each model but not the act of picking the best of many, and shuffled folds can leak information across time

**The Solution (Bailey & López de Prado):**

1. **CSCV (Combinatorially Symmetric Cross-Validation)**: Splits the sample into blocks and evaluates every symmetric in-sample/out-of-sample combination of those blocks (this is not a walk-forward scheme)
2. **PBO (Probability of Backtest Overfitting)**: Estimates how often the in-sample winner ranks below the out-of-sample median
3. **PSR (Probabilistic Sharpe Ratio)**: Estimates the probability that the true Sharpe exceeds a benchmark (e.g. 0), given sample length, skewness and kurtosis
4. **DSR (Deflated Sharpe Ratio)**: PSR measured against a benchmark that is raised to account for multiple testing

**The Evidence:** Harvey, Liu & Zhu (2016, RFS) argue that, once multiple testing is accounted for, many published factors are likely false discoveries driven by p-hacking. PBO/PSR help detect this **before** losing real money.
:::

### What You'll Build Today

By the end of this lab, you will have:

- ✅ Understanding of why standard backtesting fails
- ✅ CSCV implementation for honest validation
- ✅ PBO calculation showing selection bias
- ✅ PSR/DSR metrics for performance significance
- ✅ Critical perspective on published trading strategies

**Time estimate:** 90-120 minutes (this is advanced material: take your time)

::: {.callout-important}
## Why This Matters for Coursework 2

Your factor replication **must** use walk-forward validation and report PBO/PSR. Otherwise, your Sharpe ratio is meaningless: it's just in-sample optimization parading as out-of-sample performance. This lab shows you how to do it right.
:::

# Objectives

- Diagnose backtest overfitting with combinatorially symmetric cross‑validation (CSCV)
- Estimate Probability of Backtest Overfitting (PBO)
- Quantify performance significance via Probabilistic Sharpe Ratio (PSR); discuss Deflated Sharpe Ratio (DSR)

::: {.callout-note}
This lab follows Bailey & López de Prado's approach to selection bias: CSCV → PBO and PSR/DSR. We implement lightweight utilities and show how to compare against `mlfinlab` if available.
:::

# Setup

```{python}
import sys, pathlib

# Ensure project root (parent of labs/) is on the Python path for `scripts/`
sys.path.append(str(pathlib.Path().resolve().parent))

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from scripts.utilities.overfit_metrics import (
    generate_noise_strategies,
    sharpe_ratio,
    probabilistic_sharpe_ratio,
    cscv_pbo,
)

# Optional: compare with mlfinlab if installed
try:
    from mlfinlab.backtest_statistics import deflated_sharpe_ratio as dsr_mlfinlab
except Exception:
    dsr_mlfinlab = None

np.random.seed(123)
```

# Part A: A garden of strategies on pure noise

We simulate `N=200` strategies with no true edge. In a finite sample, one will “win” in‑sample by chance.

```{python}
T, N = 240, 200  # 20 years of monthly returns (approx)
X = generate_noise_strategies(T=T, N=N, rho=0.2, seed=123)

# In‑sample Sharpe ratios across strategies
sr_all = np.array([sharpe_ratio(X[:, j]) for j in range(N)])
j_star = int(np.argmax(sr_all))
sr_star = sr_all[j_star]

fig, ax = plt.subplots(1, 1, figsize=(7, 4))
ax.hist(sr_all, bins=30, color='tab:gray', alpha=0.8)
ax.axvline(sr_star, color='r', linestyle='--', label=f'Winner SR≈{sr_star:.2f}')
ax.set_title('In‑sample Sharpe across noise strategies')
ax.legend()
plt.tight_layout()
plt.show()
```

Observation: Even with zero true edge, the best in‑sample Sharpe can look compelling.

# Part B: CSCV and Probability of Backtest Overfitting (PBO)

We split the time axis into contiguous folds, assign half of the folds to the in‑sample set and the remaining half to the out‑of‑sample set, and repeat over many such combinations. For each split we pick the in‑sample “champion” and measure its out‑of‑sample rank. PBO is the fraction of splits where the champion lands below the out‑of‑sample median (negative logit rank).

```{python}
res = cscv_pbo(X, n_folds=10, max_splits=150)  # subsample CSCV splits for speed; increase if time allows
res.pbo, res.splits_used
```

```{python}
fig, ax = plt.subplots(1, 2, figsize=(10, 4))

# Left: logit ranks; mass to the left of zero is what PBO counts
ax[0].hist(res.taus, bins=30, color='tab:blue', alpha=0.8)
ax[0].axvline(0, color='r', linestyle='--', label='tau=0')
ax[0].legend()
ax[0].set_title('Logit ranks of in‑sample champion (CSCV)')

# Right: out-of-sample ranks of the champion, zoomed to the first 40 ranks
ax[1].hist(res.oos_ranks, bins=np.arange(1, X.shape[1] + 2) - 0.5, color='tab:orange', alpha=0.8)
ax[1].set_title('OOS ranks of in‑sample champion')
ax[1].set_xlim(0.5, min(40.5, X.shape[1] + 0.5))

plt.tight_layout()
plt.show()
```

Interpretation: A high PBO indicates that selecting the in‑sample “winner” is likely to disappoint out‑of‑sample.

# Part C: PSR and discussion of DSR

We compute the Probabilistic Sharpe Ratio (PSR) of the champion against a 0 benchmark. DSR additionally deflates for selection bias by using a higher benchmark Sharpe (selection threshold). If `mlfinlab` is installed, we compare against its DSR.

```{python}
# Champion’s in‑sample series and summary stats
x_star = X[:, j_star]
sr_hat = sharpe_ratio(x_star)
n_obs = len(x_star)

# Sample skewness and kurtosis of the champion's returns
skew = pd.Series(x_star).skew()
kurt = pd.Series(x_star).kurtosis() + 3  # pandas returns excess kurtosis

psr_0 = probabilistic_sharpe_ratio(sr_hat, 0.0, n_obs, skew=skew, kurtosis=kurt)
psr_0
```

Optional: compare with `mlfinlab`’s DSR (if available). Note DSR uses an elevated benchmark Sharpe that accounts for the number of trials and their correlation (see paper for details).

```{python}
if dsr_mlfinlab is not None:
    # Example parameters: you should set n_trials and correlation based on your research context
    n_trials = N
    corr = 0.2
    # NOTE: keyword names are as given in this lab; check them against the
    # signature of deflated_sharpe_ratio in your installed mlfinlab version
    dsr_val = dsr_mlfinlab(observed_sr=sr_hat, number_of_trials=n_trials, skew=skew,
                           kurtosis=kurt, rho=corr, length=n_obs)
    print(dsr_val)  # print explicitly: a bare expression inside `if` is not displayed
else:
    print('mlfinlab not available in this environment')
```

## Optional: Empirical Selection Benchmark (SR*)

An intuitive (but approximate) benchmark SR* is the selection threshold you would have used to promote a strategy, e.g., the 95th percentile of candidate SRs or the top‑k cutoff used in model selection. This inflates the benchmark to reflect the search.

```{python}
# Naive empirical SR* from the in-sample garden (use with caution)
sr_garden = sr_all  # in-sample Sharpe across candidates
sr_star_empirical = np.quantile(sr_garden, 0.95)
psr_emp = probabilistic_sharpe_ratio(sr_hat, sr_star_empirical, n_obs, skew=skew, kurtosis=kurt)
sr_star_empirical, psr_emp
```

::: {.callout-tip}
Guidance: PSR answers “what is the probability that the true SR > benchmark SR\*?”. DSR raises the benchmark SR\* to deflate for selection bias (many trials and correlation among them). When reporting results, disclose the number of trials and use CSCV/PBO to evidence robustness.
:::

# Extension: Replace noise with weak‑edge signals

Modify the simulation so a small subset of strategies has a slight positive mean. Re‑run CSCV/PBO and PSR to see how both respond when a genuine but small edge is present.

```{python}
# Example: 10 strategies have a small true edge
T, N = 240, 200
X = generate_noise_strategies(T=T, N=N, rho=0.2, seed=777)
edge_idx = np.arange(10)
# Per-month shift of 0.05/sqrt(12) ≈ 0.014. The sqrt(12) scaling corresponds to an
# annualized Sharpe of ~0.05 for unit-volatility noise; a 5% annual mean return
# spread evenly over months would instead be 0.05/12.
X[:, edge_idx] += 0.05 / np.sqrt(12)
res2 = cscv_pbo(X, n_folds=10)
res2.pbo
```

# Deliverables

- Report the observed PBO and interpret its meaning
- Report PSR for the selected strategy; if available, compare with DSR
- Describe how your result changes when a few strategies have a genuine (small) edge

## How to Report (Template)

- Trials: We evaluated N strategies/hyper‑parameters (comment on similarity/correlation if relevant).
- Selection: In‑sample selection metric = [Sharpe/alpha/etc.] with CSCV splits (k=10).
- Robustness: PBO = X.XX across S splits (show logit rank histogram).
- Significance: PSR = X.XX vs SR*=0 (skew=..., kurt=..., n=...)
  - Optional: DSR = X.XX (assumptions: trials=N, rho=..., length=n).
- Data: period, universe, costs/slippage, vintages/release timing.
- Decision: [Promote/Park], rationale and next steps (e.g., live paper trading).

# References

- @bailey2015pbo: Probability of Backtest Overfitting (PBO) and CSCV
- @lopezdeprado2014dsr: Deflated Sharpe Ratio (DSR)
- López de Prado, M.: Deflated Sharpe Ratio (DSR), SSRN
- White (2000): Reality Check for data snooping
- Hansen (2005): Superior Predictive Ability (SPA) test