Explain factor replication as a research methodology for testing published findings
Interpret HAC (Heteroskedasticity and Autocorrelation Consistent) standard errors and understand why time-series autocorrelation matters
Evaluate factor performance using multiple metrics (Sharpe, alpha, robustness)
Identify sources of selection bias and overfitting in factor research
Apply critical thinking to assess whether documented factors are exploitable
Agenda
Part I : What is factor replication? Research methodology foundations
Part II : Statistical foundations: HAC errors, alpha tests, robustness
Part III : Selection bias and the replication crisis in finance
Part IV : Critical analysis: What makes interpretation rigorous?
Part V : Preparation for Coursework 2: Principles, not templates
Part I : Factor Replication as Research Methodology
What Are Factors?
Factors are the building blocks of modern quantitative investing. Rather than picking individual stocks, factor strategies systematically buy characteristics that historically generate excess returns.
Definition: Characteristics that explain cross-sectional variation in stock returns
Classic examples:
Value (HML): High Minus Low book-to-market; buy undervalued stocks, sell overvalued stocks (1992)
Factors are constructed as long-short portfolios: simultaneously buying one group and selling another. This isolates factor exposure from market movements.
Mechanics:
Long leg: Buy stocks with desired characteristic (e.g., high book-to-market = value)
Short leg: Sell stocks with opposite characteristic (e.g., low book-to-market = growth)
Equal weights: Long and short legs have equal notional amounts (e.g., £100 each)
Net investment: £0 (long purchases offset short sales)
Example: Value Factor (HML)
| Component | Valuation | Action | Investment | Return |
|---|---|---|---|---|
| Long | Undervalued (high B/M) | Buy £100 value stocks | -£100 | +£5 (5%) |
| Short | Overvalued (low B/M) | Sell £100 growth stocks | +£100 | -£2 (-2%) |
| Net | Market-neutral | Long-short portfolio | £0 | +£7 (7%) |
Factor return = Long-leg stock return - Short-leg stock return = 5% - (-2%) = 7% (growth stocks fall 2%, so the short position gains £2)
Show calculation: Long-short factor return
import pandas as pd

# Worked example: value factor (HML) construction

# Long leg: value stocks (high book-to-market)
value_return = 0.05  # value stocks returned +5%

# Short leg: growth stocks (low book-to-market)
growth_return = -0.02  # growth stocks returned -2%
short_return = -growth_return  # a short position gains when the stock falls: +2%

# Factor return = long-leg stock return - short-leg stock return
factor_return = value_return - growth_return  # 5% - (-2%) = 7%

# Summary table
construction = pd.DataFrame({
    'Component': ['Long (Value)', 'Short (Growth)', 'Factor (HML)'],
    'Valuation': ['Undervalued (high B/M)', 'Overvalued (low B/M)', 'Market-neutral'],
    'Action': ['Buy £100 value', 'Sell £100 growth', 'Net portfolio'],
    'Investment': ['-£100', '+£100', '£0'],
    'Position return (%)': [value_return * 100, short_return * 100, factor_return * 100],
    'P&L (£)': ['+£5', '+£2', '+£7'],
})

print("=== Long-Short Factor Construction Example ===\n")
print(construction.to_string(index=False))
print("\n📊 Key Insight:")
print(f"  Factor return ({factor_return*100:.1f}%) = Value return ({value_return*100:.1f}%) - Growth return ({growth_return*100:.1f}%)")
print("  Net investment = £0 (long purchase funded by short sale proceeds)")
print("  Factor isolates value premium, independent of market movements")
Why Zero Investment?
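Zero net investment means the return reflects the difference between the two legs, not the direction of the market. If the market rises 10% and both legs have similar market exposure, both legs move together and the factor return isolates the difference. In practice short positions require margin, so some capital is still tied up.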
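A minimal sketch of why a common market move cancels in a long-short portfolio. It assumes both legs have the same market beta, which the slides do not state; the function and variable names are illustrative.

```python
def long_short_return(long_ret: float, short_ret: float) -> float:
    """Return of a zero-investment portfolio: long-leg return minus short-leg return."""
    return long_ret - short_ret


# Characteristic-driven returns from the HML example
value_specific = 0.05    # value stocks: +5%
growth_specific = -0.02  # growth stocks: -2%

# Assumption: both legs have the same market beta
beta = 1.0

for market_move in (-0.10, 0.0, 0.10):
    long_leg = value_specific + beta * market_move
    short_leg = growth_specific + beta * market_move
    factor = long_short_return(long_leg, short_leg)
    print(f"market {market_move:+.0%}: long {long_leg:+.1%}, "
          f"short {short_leg:+.1%}, factor {factor:+.1%}")
```

If the two legs have different betas, the market move no longer cancels. That is one reason the factor return is later regressed on the market to separate alpha from beta.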
Factor Replication: What Does It Mean?
Replication is core scientific practice. In medicine, we demand multiple trials before approving drugs. In finance, if a factor works 1970-1990, we demand it works 1991-2020.
Replication = reproduce published findings using independent data or time periods
Why replicate?
Test whether published results are real or data mining artifacts
Assess out-of-sample performance (does it work on new data?)
Evaluate economic significance (after costs, is there exploitable profit?)
Understand robustness (does it work across markets, time periods, specifications?)
Jensen, Kelly & Pedersen (2024): “Is There a Replication Crisis in Finance?”
Tested 153 published factors using consistent methodology
Many factors show 50% decline in out-of-sample performance
Cross-region replication often fails (US factors don’t work in Europe/Asia)
Conclusion: Published literature significantly overstates factor performance
Factor Replication Workflow
This is a conceptual framework, not a mechanical recipe. The scaffold notebook implements these steps, but understanding why each matters separates a pass from a distinction.
Conceptual steps:
Choose factor: Select published factor with theoretical motivation
Obtain data: Download returns from JKP portal (https://jkpfactors.com)
Interpretation: Is factor real? Exploitable after costs? What are limitations?
Each step requires judgment: what robustness checks matter depends on your specific factor
Part II : Statistical Foundations for Rigorous Replication
Signal and Noise in Financial Returns
Financial returns are inherently noisy. Even if a factor has true alpha, observed returns mix signal (predictable component) with noise (random variation). Standard errors help us distinguish signal from noise.
The challenge:
Signal: True factor alpha (e.g., value stocks genuinely outperform)
Noise: Random variation (luck, market shocks, measurement error)
Observed return = Signal + Noise
Why this matters:
With 20 years of monthly data (240 observations), noise can create spurious patterns
A factor might appear significant just by chance (noise masquerading as signal)
Standard errors quantify how much noise contaminates our signal estimate
Example: True alpha = 0% (no factor), but observed alpha = 0.5% monthly
→ Is this signal (real factor) or noise (lucky sample)?
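One way to get a feel for this question is to simulate many 20-year histories of a factor whose true alpha is zero and see how far the sample mean wanders. The sketch below assumes i.i.d. normal returns with 3% monthly volatility. Both are illustrative choices, not properties of any real factor.

```python
import random
import statistics

random.seed(0)

N_MONTHS = 240      # 20 years of monthly returns
MONTHLY_VOL = 0.03  # assumed 3% monthly volatility (illustrative)
THRESHOLD = 0.005   # 0.5% monthly, as in the example
N_SAMPLES = 1000    # number of simulated 20-year histories


def sample_mean_alpha() -> float:
    """Average monthly return of a factor whose true alpha is zero."""
    returns = [random.gauss(0.0, MONTHLY_VOL) for _ in range(N_MONTHS)]
    return statistics.fmean(returns)


means = [sample_mean_alpha() for _ in range(N_SAMPLES)]

# Spread of the estimated alpha across histories (an empirical standard error)
spread = statistics.stdev(means)

# Share of histories where pure noise produces |alpha| of at least 0.5% monthly
share_lucky = sum(abs(m) >= THRESHOLD for m in means) / N_SAMPLES

print(f"Std. dev. of estimated alpha across histories: {spread:.4%}")
print(f"Share of histories with |alpha| >= 0.5%: {share_lucky:.1%}")
```

Under these assumptions the spread should sit close to 0.03 / sqrt(240), roughly 0.19% per month. That is the standard error that any observed alpha has to be judged against.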
The Fundamental Problem
Financial returns have low signal-to-noise ratio. Real Bloomberg data (2018-2025) shows: SPY has signal-to-noise = 0.042 (noise is 24× larger than signal). Only 0.2% of variance is signal; 99.8% is noise. This makes statistical inference challenging.
Measuring Signal-to-Noise: Methodology
To quantify signal vs noise, we decompose return variance into predictable (signal) and unpredictable (noise) components using conditional expectations.
Econometrically Rigorous Construction:
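For returns \(r_t\), we define signal as the predictable component conditional on available information \(\mathcal{I}\), and noise as the remainder:
\[ r_t = E[r_t \mid \mathcal{I}] + \varepsilon_t, \qquad \text{Signal fraction} = \frac{\operatorname{Var}(E[r_t \mid \mathcal{I}])}{\operatorname{Var}(r_t)} = R^2, \qquad \text{Noise fraction} = \frac{\operatorname{Var}(\varepsilon_t)}{\operatorname{Var}(r_t)} = 1 - R^2 \]
In the CAPM case, \(E[r_t \mid \text{market}_t] = \alpha + \beta \, \text{market}_t\).
Example: SPY regressed on itself (market proxy)
\(R^2 \approx 1.0\) (SPY explains itself perfectly)
For individual stocks: \(R^2 \approx 0.3-0.6\) (30-60% signal, 40-70% noise)
For factors: \(R^2\) typically lower (most variance is idiosyncratic noise)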
Show calculation: Econometrically rigorous signal-to-noise
import numpy as np
import pandas as pd
import statsmodels.api as sm

# load_bloomberg() is the course data helper and must be defined before this cell runs
df = load_bloomberg()

# Get SPY and market (SPY as market proxy)
spy_data = df[df['ticker'] == 'SPY'].sort_values('date')
market = spy_data['return'].values
asset = spy_data['return'].values  # SPY regressed on itself for demonstration
dates = spy_data['date'].values

# Econometrically rigorous approach: CAPM regression
X = sm.add_constant(market)
model = sm.OLS(asset, X).fit()

# Signal = predicted returns (conditional expectation)
predicted = model.fittedvalues
signal_var = np.var(predicted)

# Noise = residuals (unpredictable component)
residuals = model.resid
noise_var = np.var(residuals)

# Total variance
total_var = np.var(asset)

# Signal fraction = R² (variance explained by model)
signal_fraction = model.rsquared
noise_fraction = 1 - model.rsquared

# Signal-to-noise ratio (using conditional expectation)
signal_mean = np.abs(predicted.mean())
noise_std = np.std(residuals)
signal_noise_ratio = signal_mean / noise_std if noise_std > 0 else np.nan

print("=== Econometrically Rigorous Signal-to-Noise ===\n")
print("Method: CAPM Regression (Conditional Expectation)")
print("Model: r_t = α + β × market_t + ε_t\n")
print("Regression Results:")
print(f"  R² (Signal Fraction): {signal_fraction:.4f} ({signal_fraction:.1%})")
print(f"  1 - R² (Noise Fraction): {noise_fraction:.4f} ({noise_fraction:.1%})")
print(f"  α (intercept): {model.params[0]*100:.4f}% daily")
print(f"  β (market exposure): {model.params[1]:.4f}")
print("\nVariance Decomposition:")
print(f"  Total variance: {total_var*10000:.4f} (%², i.e. variance × 10,000)")
print(f"  Signal variance: {signal_var*10000:.4f} (Var(E[r|I]))")
print(f"  Noise variance: {noise_var*10000:.4f} (Var(ε))")
print(f"  Signal fraction: {signal_fraction:.1%}")
print(f"  Noise fraction: {noise_fraction:.1%}")
print("\nSignal-to-Noise Ratio:")
print(f"  Signal mean: {signal_mean*100:.4f}% daily")
print(f"  Noise std: {noise_std*100:.4f}% daily")
print(f"  Signal/Noise ratio: {signal_noise_ratio:.4f}")
print("\n💡 Econometric Interpretation:")
print("  Signal = predictable component E[r_t | market_t]")
print("  Noise = residual ε_t (unpredictable given market)")
print(f"  {signal_fraction:.1%} of variance is explained by market exposure")
print(f"  {noise_fraction:.1%} is idiosyncratic noise")
print("  This is more rigorous than unconditional mean approach!")
Econometric Rigor
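This approach uses the conditional expectation \(E[r_t | \mathcal{I}]\) rather than the unconditional mean: signal is whatever is explained by the conditioning information (market returns, factors, etc.). Note that the CAPM regressions in this section condition on the contemporaneous market return, so “predictable” here means “explained by market exposure”, not forecastable in advance. Conditioning on lagged information \(\mathcal{I}_{t-1}\) would additionally capture time-varying expected returns and autocorrelation.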
Why This Matters
For individual assets, \(R^2\) from CAPM is typically 30-60% (signal), meaning 40-70% is noise. For factors themselves (long-short portfolios), \(R^2\) is often lower because most variance is idiosyncratic. This quantifies why detecting true factors is challenging: even with market information, much variance remains unpredictable.
Real-World Signal-to-Noise: Bloomberg Data
Using real financial data (2018-2025) from Bloomberg Terminal, we apply the econometrically rigorous approach: signal = predictable component from CAPM regression.
Method: Regress each asset on market (SPY) to extract conditional expectation \(E[r_t | \text{market}_t]\)
Signal Fraction = R² from CAPM (variance explained by market exposure)
Show visualization: Professional signal-to-noise analysis
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from pathlib import Path

# data_root comes from the course setup; fall back to a relative path if the file is not there
_csv_path = data_root / "bloomberg_database" / "signal_noise_metrics.csv"
if not _csv_path.exists():
    _csv_path = Path("data/bloomberg_database/signal_noise_metrics.csv")
metrics = pd.read_csv(_csv_path)

# Select example assets
example_assets = ['SPY', 'AAPL', 'VIX', 'BTCUSD']
display_metrics = metrics[metrics['asset'].isin(example_assets)].copy()

# Create professional visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: Signal-to-Noise Ratio Comparison
ax1 = axes[0]
colors = {'Equity': '#2E86AB', 'Risk Gauge': '#F24236', 'Crypto': '#06A77D'}
for asset_type in display_metrics['asset_type'].unique():
    subset = display_metrics[display_metrics['asset_type'] == asset_type]
    ax1.scatter(subset['signal_noise_ratio'], subset['sharpe_ratio'],
                label=asset_type, alpha=0.8, s=150,
                color=colors.get(asset_type, 'gray'),
                edgecolors='white', linewidth=2)

# Add asset labels
for _, row in display_metrics.iterrows():
    ax1.annotate(row['asset'], (row['signal_noise_ratio'], row['sharpe_ratio']),
                 xytext=(5, 5), textcoords='offset points',
                 fontsize=9, fontweight='bold')
ax1.set_xlabel('Signal-to-Noise Ratio (|Mean| / Std)', fontsize=11, fontweight='bold')
ax1.set_ylabel('Sharpe Ratio (Annual)', fontsize=11, fontweight='bold')
ax1.set_title('Signal-to-Noise vs Risk-Adjusted Returns', fontsize=12, fontweight='bold', pad=10)
ax1.legend(loc='upper left', frameon=True, fancybox=True, shadow=True)
ax1.grid(True, alpha=0.3, linestyle='--')

# Plot 2: Signal vs Noise Fraction (Stacked Bar Chart)
ax2 = axes[1]
assets = display_metrics['asset'].values
signal_fracs = display_metrics['signal_fraction'].values * 100
noise_fracs = display_metrics['noise_fraction'].values * 100
x_pos = np.arange(len(assets))
width = 0.6
bars1 = ax2.barh(x_pos, signal_fracs, width, label='Signal', color='#2E86AB', alpha=0.8)
bars2 = ax2.barh(x_pos, noise_fracs, width, left=signal_fracs, label='Noise', color='#F24236', alpha=0.8)

# Add percentage labels
for i, (s, n) in enumerate(zip(signal_fracs, noise_fracs)):
    if s > 0.1:  # Only label if signal is visible
        ax2.text(s / 2, i, f'{s:.2f}%', ha='center', va='center',
                 fontweight='bold', fontsize=8, color='white')
    ax2.text(s + n / 2, i, f'{n:.1f}%', ha='center', va='center',
             fontweight='bold', fontsize=9, color='white')
ax2.set_yticks(x_pos)
ax2.set_yticklabels(assets, fontweight='bold')
ax2.set_xlabel('Variance Fraction (%)', fontsize=11, fontweight='bold')
ax2.set_title('Signal vs Noise Decomposition', fontsize=12, fontweight='bold', pad=10)
ax2.legend(loc='lower right', frameon=True, fancybox=True, shadow=True)
ax2.grid(True, alpha=0.3, axis='x', linestyle='--')
ax2.set_xlim(0, 100)

plt.tight_layout()
plt.show()
Show calculation: Real-world signal-to-noise from Bloomberg data
import pandas as pd
import numpy as np
import statsmodels.api as sm
from pathlib import Path

# load_bloomberg() and data_root come from the course setup and must already be defined
df = load_bloomberg()
_csv_path = data_root / "bloomberg_database" / "signal_noise_metrics.csv"
if not _csv_path.exists():
    _csv_path = Path("data/bloomberg_database/signal_noise_metrics.csv")
metrics = pd.read_csv(_csv_path)

# Select example assets and recalculate using CAPM approach
example_assets = ['SPY', 'AAPL', 'VIX', 'BTCUSD']

# Get market returns (SPY)
market_data = df[df['ticker'] == 'SPY'].sort_values('date')
market_returns = market_data['return'].values

# Recalculate metrics using CAPM regression (econometrically rigorous)
recalculated_metrics = []
for asset_name in example_assets:
    asset_data = df[df['ticker'] == asset_name].sort_values('date')
    if len(asset_data) < 30:
        continue

    # Merge to align dates
    merged = pd.merge(
        asset_data[['date', 'return']],
        market_data[['date', 'return']],
        on='date',
        suffixes=('_asset', '_market')
    )
    if len(merged) < 30:
        continue

    asset_ret = merged['return_asset'].values
    market_ret = merged['return_market'].values

    # CAPM regression
    X = sm.add_constant(market_ret)
    model = sm.OLS(asset_ret, X).fit()

    # Signal fraction = R² (variance explained by market)
    signal_fraction = model.rsquared
    noise_fraction = 1 - model.rsquared

    # Signal-to-noise ratio
    predicted = model.fittedvalues
    residuals = model.resid
    signal_mean = np.abs(predicted.mean())
    noise_std = np.std(residuals)
    signal_noise_ratio = signal_mean / noise_std if noise_std > 0 else np.nan

    recalculated_metrics.append({
        'asset': asset_name,
        'asset_type': asset_data['asset_type'].iloc[0] if 'asset_type' in asset_data.columns else 'Unknown',
        'signal_fraction': signal_fraction,
        'noise_fraction': noise_fraction,
        'signal_noise_ratio': signal_noise_ratio,
        'r_squared': signal_fraction,
        'beta': model.params[1] if len(model.params) > 1 else np.nan
    })

display_metrics = pd.DataFrame(recalculated_metrics)

# Display metrics table
print("\n=== Real-World Signal-to-Noise Metrics (CAPM-based) ===\n")
if 'r_squared' in display_metrics.columns:
    print(display_metrics[['asset', 'asset_type', 'r_squared', 'signal_fraction',
                           'noise_fraction', 'beta']].round(4).to_string(index=False))
else:
    print(display_metrics[['asset', 'asset_type', 'signal_fraction',
                           'noise_fraction']].round(4).to_string(index=False))

print("\n📊 Interpretation (CAPM-based):")
print("  - Signal fraction = R² from CAPM regression")
print("  - SPY: R² ≈ 1.0 (perfectly explained by itself)")
print("  - Individual stocks: R² ≈ 0.5-0.6 (50-60% explained by market)")
print("  - Factors/Crypto: R² ≈ 0.1-0.2 (10-20% explained, rest is noise)")
print("  - This captures predictable component conditional on market information!")

# Show AAPL example in detail (more interesting than SPY)
aapl_data = df[df['ticker'] == 'AAPL'].sort_values('date')
merged = pd.merge(aapl_data[['date', 'return']], market_data[['date', 'return']],
                  on='date', suffixes=('_aapl', '_market'))
X = sm.add_constant(merged['return_market'].values)
model = sm.OLS(merged['return_aapl'].values, X).fit()

print("\n🔍 AAPL Example (CAPM Regression):")
print(f"  R² (Signal Fraction): {model.rsquared:.1%}")
print(f"  1-R² (Noise Fraction): {1-model.rsquared:.1%}")
print(f"  β (Market Exposure): {model.params[1]:.2f}")
print(f"  α (Intercept): {model.params[0]*100:.4f}% daily")
print(f"  → {model.rsquared:.1%} of variance is predictable from market")
print(f"  → {1-model.rsquared:.1%} is idiosyncratic noise")
The Reality Check
Using conditional expectation (CAPM), signal fractions vary dramatically by asset type. Individual equities: 50-60% signal (market exposure explains most variance). Factors and crypto: 10-20% signal (80-90% noise). This approach captures what is explained by market exposure, which is much more informative than the unconditional mean.
Why Standard Errors Matter
Standard errors quantify uncertainty in estimates. In factor replication, we’re testing whether observed alpha is signal (true factor premium) or noise (random variation).
|t| < 1.96 → cannot reject null hypothesis (could be random chance)
Harvey (2017) recommends t > 3 for finance (multiple testing correction)
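Why factors need higher t-statistics: With an 80-90% noise fraction, standard errors are large and true premia are hard to detect. Because hundreds of candidate factors have been tested, some will clear |t| > 2 by luck alone, so t > 3 is needed to confidently distinguish signal from noise.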
Common Mistake
Alpha = 1% monthly with t = 0.5 is not significant (likely noise). Alpha = 0.3% monthly with t = 3 is significant (likely signal). The t-statistic matters more than the magnitude, especially for factors with a high noise fraction.
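The two cases can be compared through the standard errors they imply (SE = alpha / t) and the resulting 95% confidence intervals. This is a small illustrative sketch; the helper names are not from the course notebook.

```python
def t_statistic(alpha: float, std_error: float) -> float:
    """t-statistic for the null hypothesis alpha = 0."""
    return alpha / std_error


def confidence_interval(alpha: float, std_error: float, z: float = 1.96):
    """Approximate 95% confidence interval using the normal critical value."""
    return alpha - z * std_error, alpha + z * std_error


# Standard errors implied by the two cases in the text (SE = alpha / t)
cases = {
    "large but noisy": (0.010, 0.010 / 0.5),    # alpha = 1%,   t = 0.5 -> SE = 2%
    "small but precise": (0.003, 0.003 / 3.0),  # alpha = 0.3%, t = 3   -> SE = 0.1%
}

for label, (alpha, se) in cases.items():
    low, high = confidence_interval(alpha, se)
    print(f"{label}: alpha = {alpha:.2%}, SE = {se:.2%}, "
          f"t = {t_statistic(alpha, se):.1f}, 95% CI = [{low:.2%}, {high:.2%}]")
```

The interval for the 1% alpha comfortably includes zero, while the interval for the 0.3% alpha does not.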
How Standard Errors Are Constructed
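Standard errors measure the precision of an estimate. They are computed from the estimated variance-covariance matrix of the coefficients, and their frequentist interpretation is the standard deviation of the estimator's sampling distribution under hypothetical repeated sampling. High noise variance → large standard errors → imprecise estimates. For the slope in a regression of returns on a single predictor \(X_t\):
\[ \text{SE}(\hat{\beta}) = \sqrt{\frac{\hat{\sigma}^2}{\sum_{t=1}^{n} (X_t - \bar{X})^2}} \]
where \(\hat{\sigma}^2\) is estimated error variance. The denominator grows with both the sample size \(n\) and the variation in \(X\).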
Key components:
Error variance (\(\hat{\sigma}^2\)): How much returns deviate from predicted values
Sample size (\(n\)): More observations → smaller SE (more precision)
Variation in X: More variation in predictor → smaller SE (better identification)
For financial returns:
High error variance: Returns are volatile (noise is large)
Limited sample size: Only 20-30 years of monthly data available
Result: Standard errors are relatively large, making significance hard to achieve
Intuition: If returns were perfectly predictable, error variance = 0, SE = 0. But returns are noisy, so SE > 0. Standard errors tell us how much uncertainty remains.
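A back-of-the-envelope sketch of how much data a given alpha needs. It assumes i.i.d. returns, so the standard error of the mean is vol / sqrt(n), and ignores the market regressor. The alpha and volatility values are illustrative.

```python
def months_for_target_t(alpha: float, vol: float, target_t: float) -> float:
    """Number of months n at which alpha / (vol / sqrt(n)) reaches target_t."""
    return (target_t * vol / alpha) ** 2


alpha = 0.003  # 0.3% monthly alpha (illustrative)
vol = 0.03     # 3% monthly volatility (illustrative)

for target_t in (1.96, 3.0):
    months = months_for_target_t(alpha, vol, target_t)
    print(f"t = {target_t}: about {months:.0f} months ({months / 12:.0f} years)")
```

With these inputs, reaching t = 3 takes on the order of 900 months. That is far more than the 20-30 years typically available, which is why genuine but modest premia are hard to confirm.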
Show calculation: Standard error components
import numpy as np
import pandas as pd

# Simulate factor returns
np.random.seed(42)
n = 240  # 20 years monthly
true_alpha = 0.003  # 0.3% monthly true alpha
market = np.random.normal(0.008, 0.04, n)
factor = true_alpha + 0.2 * market + np.random.normal(0, 0.03, n)

# OLS regression
X = np.column_stack([np.ones(n), market])
beta_hat = np.linalg.lstsq(X, factor, rcond=None)[0]
residuals = factor - X @ beta_hat
sigma_sq = np.var(residuals, ddof=2)  # Error variance (two estimated parameters)

# Standard error calculation
X_centered = market - market.mean()
sum_sq_X = np.sum(X_centered ** 2)
se_alpha = np.sqrt(sigma_sq * (1 / n + market.mean() ** 2 / sum_sq_X))  # Intercept
se_beta = np.sqrt(sigma_sq / sum_sq_X)  # Slope

# Display components
components = pd.DataFrame({
    'Component': ['Sample size (n)', 'Error variance (σ²)', 'Sum of squares (X)', 'SE(alpha)', 'SE(beta)'],
    'Value': [n, f'{sigma_sq:.6f}', f'{sum_sq_X:.4f}', f'{se_alpha:.4f}', f'{se_beta:.4f}']
})

print("=== Standard Error Construction Components ===\n")
print(components.to_string(index=False))
print("\n📊 Interpretation:")
print(f"  Error variance = {sigma_sq:.6f} (high → noisy returns)")
print(f"  Sample size = {n} (limited → less precision)")
print(f"  SE(alpha) = {se_alpha:.4f} ({se_alpha*100:.2f}% monthly)")
print(f"  t-statistic = {beta_hat[0]/se_alpha:.2f}")
if abs(beta_hat[0] / se_alpha) > 1.96:
    print("  ✓ Alpha is statistically significant")
else:
    print("  ✗ Alpha is NOT statistically significant")
Why Financial Data Is Challenging
High error variance + limited sample size = large standard errors. This makes it hard to detect true factors (signal) when noise dominates. HAC standard errors account for additional complications (autocorrelation, heteroskedasticity), making SEs even larger.
Time-Series Data: Autocorrelation Problem
Financial time-series violate a key OLS assumption: independence of observations. If monthly returns are correlated, you don’t have 120 independent observations over 10 years: you have fewer “effective” observations.
Impact: HAC (Newey-West) standard errors typically 1.5-2× larger than OLS for monthly factors
A factor with OLS t = 2.5 and HAC standard errors 1.5× larger has HAC t ≈ 1.7 (no longer significant at 5%). Always use HAC for time-series financial data.
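The idea of “effective” observations can be made concrete with a standard large-sample approximation for the mean of an AR(1) series: n_eff ≈ n(1 − ρ)/(1 + ρ). The sketch below applies it; the AR(1) assumption and the ρ values are illustrative, not estimates from the course data.

```python
import math


def effective_sample_size(n: int, rho: float) -> float:
    """Approximate number of independent observations when estimating a mean
    from an AR(1) series with first-order autocorrelation rho."""
    return n * (1 - rho) / (1 + rho)


def se_inflation(rho: float) -> float:
    """Factor by which the naive (i.i.d.) standard error of the mean is too small."""
    return math.sqrt((1 + rho) / (1 - rho))


n = 120  # 10 years of monthly returns
for rho in (0.0, 0.1, 0.2, 0.4):
    print(f"rho = {rho:.1f}: effective n ≈ {effective_sample_size(n, rho):.0f}, "
          f"SE inflation ≈ {se_inflation(rho):.2f}x")
```

Positive autocorrelation shrinks the effective sample, so a t-statistic computed with i.i.d. standard errors overstates the evidence.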
Detecting Autocorrelation: Bloomberg Data Evidence
Using real financial data, we can measure autocorrelation and test whether it’s statistically significant. This demonstrates why HAC corrections are essential.
Autocorrelation Function (ACF): Correlation between \(r_t\) and \(r_{t-k}\) for lags \(k = 1, 2, ...\)
Ljung-Box Test: Tests null hypothesis of no autocorrelation up to lag \(k\)
H₀: No autocorrelation (ρ₁ = ρ₂ = … = ρₖ = 0)
If p-value < 0.05 → reject H₀ → autocorrelation present → OLS SEs are wrong
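To show what the Ljung-Box test computes, here is a hand-rolled sketch of the Q statistic, Q = n(n+2) Σ ρ̂ₖ²/(n−k), applied to simulated series. The AR(1) simulation and its parameters are illustrative. In practice `acorr_ljungbox` from statsmodels does this and also returns p-values.

```python
import random


def autocorr(x, k):
    """Sample autocorrelation of x at lag k."""
    n = len(x)
    mean = sum(x) / n
    denom = sum((v - mean) ** 2 for v in x)
    num = sum((x[t] - mean) * (x[t - k] - mean) for t in range(k, n))
    return num / denom


def ljung_box_q(x, max_lag):
    """Ljung-Box Q statistic for autocorrelation up to max_lag."""
    n = len(x)
    return n * (n + 2) * sum(autocorr(x, k) ** 2 / (n - k)
                             for k in range(1, max_lag + 1))


def simulate_ar1(n, rho):
    """AR(1) series with standard normal shocks (rho = 0 gives white noise)."""
    x = [random.gauss(0, 1)]
    for _ in range(n - 1):
        x.append(rho * x[-1] + random.gauss(0, 1))
    return x


random.seed(1)
CRITICAL_5PCT_10_LAGS = 18.307  # chi-square(10) critical value at the 5% level

q_noise = ljung_box_q(simulate_ar1(500, 0.0), 10)
q_ar = ljung_box_q(simulate_ar1(500, 0.4), 10)
print(f"White noise: Q = {q_noise:.1f} (reject H0 if Q > {CRITICAL_5PCT_10_LAGS})")
print(f"AR(1), rho = 0.4: Q = {q_ar:.1f}")
```

Under H₀, Q is approximately chi-square distributed with as many degrees of freedom as lags tested, so a large Q indicates autocorrelation.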
Show analysis: Detecting autocorrelation in real returns
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import pearsonr
from statsmodels.stats.diagnostic import acorr_ljungbox
from statsmodels.graphics.tsaplots import plot_acf

# load_bloomberg() is the course data helper and must be defined before this cell runs
df = load_bloomberg()

# Analyze autocorrelation for SPY
spy_returns = df[df['ticker'] == 'SPY']['return'].dropna().values

# Create visualization
fig, axes = plt.subplots(1, 3, figsize=(12, 4))

# Plot 1: ACF for returns
ax1 = axes[0]
plot_acf(spy_returns, lags=20, ax=ax1, alpha=0.05, title='SPY Returns: Autocorrelation Function')
ax1.set_xlabel('Lag (days)', fontsize=10)
ax1.set_ylabel('Autocorrelation', fontsize=10)

# Plot 2: ACF for squared returns (volatility clustering)
ax2 = axes[1]
plot_acf(spy_returns**2, lags=20, ax=ax2, alpha=0.05, title='SPY Squared Returns: Volatility Clustering')
ax2.set_xlabel('Lag (days)', fontsize=10)
ax2.set_ylabel('Autocorrelation', fontsize=10)

# Plot 3: Ljung-Box test results
ax3 = axes[2]
lags = list(range(1, 21))
lb_returns = acorr_ljungbox(spy_returns, lags=lags, return_df=True)
lb_squared = acorr_ljungbox(spy_returns**2, lags=lags, return_df=True)
ax3.plot(lags, lb_returns['lb_pvalue'], 'b-o', label='Returns', markersize=4)
ax3.plot(lags, lb_squared['lb_pvalue'], 'r-s', label='Squared Returns', markersize=4)
ax3.axhline(y=0.05, color='gray', linestyle='--', label='5% significance')
ax3.set_xlabel('Lag (days)', fontsize=10)
ax3.set_ylabel('p-value', fontsize=10)
ax3.set_title('Ljung-Box Test: p-values by Lag', fontsize=11, fontweight='bold')
ax3.legend(loc='upper right', fontsize=8)
ax3.set_ylim(0, 1)

plt.tight_layout()
plt.show()

# Print test results
print("=== Autocorrelation Analysis: SPY Daily Returns ===\n")

# First-order autocorrelation
if len(spy_returns) > 1:
    acf1, _ = pearsonr(spy_returns[:-1], spy_returns[1:])
    print(f"First-order autocorrelation (ρ₁): {acf1:.4f}")

# Ljung-Box at lag 10
lb_10 = acorr_ljungbox(spy_returns, lags=[10], return_df=True)
lb_10_sq = acorr_ljungbox(spy_returns**2, lags=[10], return_df=True)
print("\nLjung-Box Test (lag 10):")
print(f"  Returns: Q = {lb_10['lb_stat'].values[0]:.2f}, p = {lb_10['lb_pvalue'].values[0]:.4f}")
print(f"  Squared Returns: Q = {lb_10_sq['lb_stat'].values[0]:.2f}, p = {lb_10_sq['lb_pvalue'].values[0]:.4f}")

print("\n📊 Interpretation:")
if lb_10['lb_pvalue'].values[0] < 0.05:
    print("  ✗ Returns show significant autocorrelation (p < 0.05)")
    print("  → OLS standard errors are biased downward")
else:
    print("  ✓ Returns show no significant autocorrelation (p ≥ 0.05)")
    print("  → OLS standard errors may be acceptable for returns")
if lb_10_sq['lb_pvalue'].values[0] < 0.05:
    print("  ✗ Squared returns show significant autocorrelation (p < 0.05)")
    print("  → Volatility clustering present → heteroskedasticity")
    print("  → OLS standard errors are still biased → use HAC!")
Key Finding
Even if return autocorrelation is weak, squared returns (volatility) typically show strong autocorrelation. This is volatility clustering (GARCH): high/low volatility periods persist. HAC corrections address both autocorrelation AND heteroskedasticity.
HAC vs OLS Standard Errors: Practical Impact
Let’s compare OLS and HAC standard errors using a CAPM regression on real Bloomberg data. The difference shows why HAC is essential.
Methodology: Regress asset returns on market (SPY) using both OLS and HAC standard errors
Show comparison: OLS vs HAC standard errors
import pandas as pd
import numpy as np
import statsmodels.api as sm

# load_bloomberg() is the course data helper and must be defined before this cell runs
df = load_bloomberg()

# Get market (SPY) and asset (AAPL) returns
spy_data = df[df['ticker'] == 'SPY'].sort_values('date')
aapl_data = df[df['ticker'] == 'AAPL'].sort_values('date')

# Merge to align dates
merged = pd.merge(
    aapl_data[['date', 'return']],
    spy_data[['date', 'return']],
    on='date',
    suffixes=('_aapl', '_market')
).dropna()
y = merged['return_aapl'].values
X = sm.add_constant(merged['return_market'].values)

# OLS regression (standard errors assume i.i.d. errors)
model_ols = sm.OLS(y, X).fit()

# HAC regression (Newey-West standard errors, lag = 10)
model_hac = sm.OLS(y, X).fit(cov_type='HAC', cov_kwds={'maxlags': 10})

# Create comparison table
results = pd.DataFrame({
    'Parameter': ['Alpha (α)', 'Beta (β)'],
    'Estimate': [model_ols.params[0], model_ols.params[1]],
    'OLS SE': [model_ols.bse[0], model_ols.bse[1]],
    'HAC SE': [model_hac.bse[0], model_hac.bse[1]],
    'OLS t-stat': [model_ols.tvalues[0], model_ols.tvalues[1]],
    'HAC t-stat': [model_hac.tvalues[0], model_hac.tvalues[1]],
})

# Calculate SE inflation factor
results['SE Inflation'] = results['HAC SE'] / results['OLS SE']

print("=== OLS vs HAC Standard Errors: CAPM Regression ===\n")
print("Asset: AAPL regressed on Market (SPY)")
print(f"Sample: {len(y):,} daily observations\n")
print("Parameter Estimates and Standard Errors:")
print("-" * 75)
print(f"{'Parameter':<12}{'Estimate':>10}{'OLS SE':>10}{'HAC SE':>10}{'OLS t':>8}{'HAC t':>8}{'SE Ratio':>10}")
print("-" * 75)
for _, row in results.iterrows():
    print(f"{row['Parameter']:<12}{row['Estimate']:>10.5f}{row['OLS SE']:>10.5f} "
          f"{row['HAC SE']:>10.5f}{row['OLS t-stat']:>8.2f}{row['HAC t-stat']:>8.2f} "
          f"{row['SE Inflation']:>10.2f}x")
print("-" * 75)

# Interpretation
alpha_ols_sig = abs(model_ols.tvalues[0]) > 1.96
alpha_hac_sig = abs(model_hac.tvalues[0]) > 1.96

print("\n📊 Key Findings:")
print(f"  HAC SE / OLS SE ratio: {results['SE Inflation'].mean():.2f}x on average")
print("  Alpha significance:")
print(f"    OLS: |t| = {abs(model_ols.tvalues[0]):.2f} → {'Significant' if alpha_ols_sig else 'Not significant'} at 5%")
print(f"    HAC: |t| = {abs(model_hac.tvalues[0]):.2f} → {'Significant' if alpha_hac_sig else 'Not significant'} at 5%")

if alpha_ols_sig and not alpha_hac_sig:
    print("\n⚠️ CRITICAL: Alpha appears significant with OLS but NOT with HAC!")
    print("  This is a FALSE POSITIVE prevented by using HAC standard errors.")
elif alpha_ols_sig and alpha_hac_sig:
    print("\n✓ Alpha is significant with both OLS and HAC.")
    print("  But t-statistic is lower with HAC: more conservative inference.")
else:
    print("\n  Alpha not significant with either method.")
    print("  HAC gives more reliable inference regardless.")

print("\n💡 Lesson:")
print("  Always use HAC (cov_type='HAC') for time-series financial regressions.")
print("  OLS standard errors understate uncertainty → inflate t-statistics → false positives.")
Practical Implementation
In statsmodels: model.fit(cov_type='HAC', cov_kwds={'maxlags': 10}) gives Newey-West HAC standard errors. For monthly data, use maxlags=6; for daily data, maxlags=20-30.
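To see what the Newey-West correction does under the hood, the sketch below computes a HAC standard error for a simple mean by hand, using Bartlett kernel weights 1 − k/(L+1). The simulated AR(1) series and the lag choice are illustrative, and this is not the statsmodels implementation itself. Treat it as a simplified illustration of the same idea.

```python
import math
import random


def autocovariance(dev, k):
    """Sample autocovariance at lag k of a demeaned series (divides by n)."""
    n = len(dev)
    return sum(dev[t] * dev[t - k] for t in range(k, n)) / n


def naive_se_of_mean(x):
    """Standard error of the mean assuming independent observations."""
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    return math.sqrt(autocovariance(dev, 0) / n)


def newey_west_se_of_mean(x, max_lags):
    """HAC standard error of the mean with Bartlett (Newey-West) weights."""
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    long_run_var = autocovariance(dev, 0)
    for k in range(1, max_lags + 1):
        weight = 1 - k / (max_lags + 1)  # Bartlett kernel: linearly declining weights
        long_run_var += 2 * weight * autocovariance(dev, k)
    return math.sqrt(long_run_var / n)


# Illustrative data: positively autocorrelated AR(1) series
random.seed(7)
series = [random.gauss(0, 1)]
for _ in range(599):
    series.append(0.5 * series[-1] + random.gauss(0, 1))

se_naive = naive_se_of_mean(series)
se_hac = newey_west_se_of_mean(series, max_lags=6)
print(f"Naive SE: {se_naive:.4f}, Newey-West SE: {se_hac:.4f}, "
      f"ratio: {se_hac / se_naive:.2f}x")
```

With `max_lags=0` the two functions coincide. Adding lags lets positive autocovariances enlarge the long-run variance, and that is the source of the wider standard errors.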
Alpha Tests: CAPM Regression
The CAPM alpha test decomposes factor returns into two components: market exposure (β) and excess return beyond market (α). Only alpha matters: beta just tells you market risk.
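The decomposition can be sketched with closed-form single-regressor OLS: mean factor return = α + β × mean market return. The simulated data and parameter values below are illustrative only. Judging whether α is significant would still require HAC standard errors.

```python
import random
import statistics


def capm_fit(factor, market):
    """Closed-form OLS of factor returns on market returns: returns (alpha, beta)."""
    mx = statistics.fmean(market)
    my = statistics.fmean(factor)
    sxx = sum((x - mx) ** 2 for x in market)
    sxy = sum((x - mx) * (y - my) for x, y in zip(market, factor))
    beta = sxy / sxx
    alpha = my - beta * mx
    return alpha, beta


# Illustrative simulation: true alpha 0.3% monthly, true beta 0.2
random.seed(42)
market = [random.gauss(0.008, 0.04) for _ in range(240)]
factor = [0.003 + 0.2 * m + random.gauss(0, 0.03) for m in market]

alpha, beta = capm_fit(factor, market)
mean_factor = statistics.fmean(factor)
market_part = beta * statistics.fmean(market)

print(f"Mean factor return: {mean_factor:.4%}")
print(f"  from market exposure (beta x mean market): {market_part:.4%}")
print(f"  alpha (return beyond market exposure): {alpha:.4%}")
```

The two components add up exactly to the mean factor return, which follows from OLS with an intercept.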
A single significant result is weak evidence. Researchers have many degrees of freedom, which Gelman and Loken (2014) call the “garden of forking paths”: if you try enough specifications, one will appear significant by chance. Robustness checks guard against false discoveries.
Robustness checks test if results hold under alternative specifications:
Sample split: Does factor work in first half AND second half? (minimum check)
Subperiod analysis: Does alpha remain positive in each decade?
Alternative construction: Tertiles vs. quintiles, value-weighted vs. equal-weighted?
Transaction costs: Is net alpha positive after 0.2-0.5% monthly costs?
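Two of these checks, the sample split and the transaction-cost adjustment, reduce to simple arithmetic. The sketch below uses a short hypothetical return series and a gross alpha of 0.5% monthly purely for illustration. A real check would also re-estimate alpha with HAC standard errors in each half.

```python
import statistics


def split_half_means(returns):
    """Mean return in the first and second half of the sample."""
    mid = len(returns) // 2
    return statistics.fmean(returns[:mid]), statistics.fmean(returns[mid:])


def net_alpha(gross_alpha: float, monthly_cost: float) -> float:
    """Alpha remaining after subtracting trading costs (both per month)."""
    return gross_alpha - monthly_cost


# Hypothetical monthly factor returns (illustrative numbers only)
returns = [0.012, -0.004, 0.009, 0.001, 0.007, -0.002, 0.003, 0.006]
first, second = split_half_means(returns)
print(f"First half mean: {first:.2%}, second half mean: {second:.2%}")

gross = 0.005  # hypothetical 0.5% monthly gross alpha
for cost in (0.002, 0.003, 0.004, 0.005):
    print(f"cost {cost:.1%}/month -> net alpha {net_alpha(gross, cost):.1%}/month")
```

At the top of the 0.2-0.5% cost range, a 0.5% gross alpha disappears entirely, so the cost assumption deserves as much scrutiny as the alpha estimate.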
Ethical Econometrics
Don’t cherry-pick checks that passed. If factor works 2000-2010 but not 2010-2020, report both. Selective reporting is a breach of research ethics: transparent, complete disclosure is fundamental to responsible empirical practice and is rewarded in the 35% Critical Analysis component.
Part III : Selection Bias and the Replication Crisis
The Multiple Testing Problem
This is the core problem behind the replication crisis. With a 5% significance threshold, testing 100 hypotheses generates about 5 false positives on average even when every null is true.
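The arithmetic behind this can be sketched directly. The code assumes independent tests and uses a plain Bonferroni correction with a normal approximation. That is a simple stand-in, not the specific adjustment Harvey proposes.

```python
from statistics import NormalDist


def expected_false_positives(n_tests: int, alpha: float = 0.05) -> float:
    """Expected number of rejections when every null hypothesis is true."""
    return n_tests * alpha


def prob_at_least_one(n_tests: int, alpha: float = 0.05) -> float:
    """Probability of at least one false positive, assuming independent tests."""
    return 1 - (1 - alpha) ** n_tests


def bonferroni_critical_t(n_tests: int, alpha: float = 0.05) -> float:
    """Two-sided critical value (normal approximation) after Bonferroni correction."""
    return NormalDist().inv_cdf(1 - alpha / (2 * n_tests))


for n_tests in (1, 10, 100, 300):
    print(f"{n_tests:>3} tests: expected false positives = "
          f"{expected_false_positives(n_tests):.1f}, "
          f"P(at least one) = {prob_at_least_one(n_tests):.1%}, "
          f"Bonferroni |t| threshold = {bonferroni_critical_t(n_tests):.2f}")
```

With a single test the threshold is the familiar 1.96. As the number of tests grows into the hundreds it rises above 3, the same direction as the t > 3 recommendation.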
Academic research process (the problem):
Researcher tests 100 potential factors
95 don’t work (α ≈ 0, not significant)
5 appear significant (α > 0, t > 2) by chance (5% false positive rate), and only these are written up
Result: Published literature massively overrepresents spurious findings
Harvey (2017) estimates over 300 equity factors published, but only ~10-15 are genuinely robust (95% are questionable)
Simulation: The Multiple Testing Problem in Action
Setup: 1,000 researchers each test 10 factors. All factors are pure noise (true α = 0). At 5% significance level, how many “discoveries” emerge?
Show simulation: False discoveries from pure noise
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats

np.random.seed(42)

# Simulation parameters
n_researchers = 1000
factors_per_researcher = 10
n_months = 240  # 20 years of monthly data
significance_level = 0.05
true_alpha = 0.0  # ALL factors are noise (null is true)

# Simulate: each factor is pure noise
total_factors = n_researchers * factors_per_researcher
t_statistics = []
p_values = []
for i in range(total_factors):
    # Generate random factor returns (mean = 0, sd = 3% monthly)
    factor_returns = np.random.normal(true_alpha, 0.03, n_months)
    # One-sample t-test: is mean significantly different from 0?
    t_stat, p_val = stats.ttest_1samp(factor_returns, 0)
    t_statistics.append(t_stat)
    p_values.append(p_val)
t_statistics = np.array(t_statistics)
p_values = np.array(p_values)

# Count "significant" results (false discoveries)
significant_5pct = np.sum(p_values < 0.05)
significant_1pct = np.sum(p_values < 0.01)
significant_harvey = np.sum(np.abs(t_statistics) > 3.0)  # Harvey's threshold

# Create visualisation
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Plot 1: Distribution of t-statistics
ax1 = axes[0]
ax1.hist(t_statistics, bins=50, density=True, alpha=0.7, color='steelblue', edgecolor='white')
# Overlay theoretical t-distribution
x = np.linspace(-5, 5, 100)
ax1.plot(x, stats.t.pdf(x, df=n_months - 1), 'r-', linewidth=2, label='Theoretical t-dist')
# Mark significance thresholds
ax1.axvline(x=1.96, color='orange', linestyle='--', linewidth=2, label='t = ±1.96 (5%)')
ax1.axvline(x=-1.96, color='orange', linestyle='--', linewidth=2)
ax1.axvline(x=3.0, color='red', linestyle='--', linewidth=2, label='t = ±3.0 (Harvey)')
ax1.axvline(x=-3.0, color='red', linestyle='--', linewidth=2)
ax1.set_xlabel('t-statistic', fontsize=11, fontweight='bold')
ax1.set_ylabel('Density', fontsize=11, fontweight='bold')
ax1.set_title('Distribution of t-statistics\n(All 10,000 factors are PURE NOISE)', fontsize=12, fontweight='bold')
ax1.legend(loc='upper right', fontsize=9)
ax1.set_xlim(-5, 5)

# Plot 2: False discovery counts
ax2 = axes[1]
categories = ['p < 0.05\n(Standard)', 'p < 0.01\n(Stricter)', '|t| > 3\n(Harvey)']
counts = [significant_5pct, significant_1pct, significant_harvey]
expected = [total_factors * 0.05, total_factors * 0.01, total_factors * 0.0027]  # 0.27% for |t|>3
x_pos = np.arange(len(categories))
width = 0.35
bars1 = ax2.bar(x_pos - width / 2, counts, width, label='Observed False Discoveries', color='crimson', alpha=0.8)
bars2 = ax2.bar(x_pos + width / 2, expected, width, label='Expected (Theory)', color='steelblue', alpha=0.8)
# Add count labels
for bar, count in zip(bars1, counts):
    ax2.text(bar.get_x() + bar.get_width() / 2, bar.get_height() + 10, f'{int(count)}',
             ha='center', fontweight='bold', fontsize=11)
ax2.set_ylabel('Number of "Significant" Results', fontsize=11, fontweight='bold')
ax2.set_title('False Discoveries from Pure Noise\n(All 10,000 factors have TRUE α = 0)', fontsize=12, fontweight='bold')
ax2.set_xticks(x_pos)
ax2.set_xticklabels(categories, fontsize=10)
ax2.legend(loc='upper right', fontsize=9)
ax2.set_ylim(0, max(counts) * 1.2)

plt.tight_layout()
plt.show()

# Print summary
print("=" * 70)
print("MULTIPLE TESTING SIMULATION: THE REPLICATION CRISIS IN ACTION")
print("=" * 70)
print("\nSetup:")
print(f"  Researchers: {n_researchers:,}")
print(f"  Factors per researcher: {factors_per_researcher}")
print(f"  Total factors tested: {total_factors:,}")
print(f"  True alpha (all factors): {true_alpha} (ALL ARE NOISE)")
print(f"  Sample size per factor: {n_months} months")
print("\nFalse Discoveries (Type I Errors):")
print(f"  At p < 0.05: {significant_5pct:,} factors appear significant ({significant_5pct/total_factors*100:.1f}%)")
print(f"  At p < 0.01: {significant_1pct:,} factors appear significant ({significant_1pct/total_factors*100:.1f}%)")
print(f"  At |t| > 3: {significant_harvey:,} factors appear significant ({significant_harvey/total_factors*100:.2f}%)")
print("\n📊 Key Insight:")
print(f"  {significant_5pct} papers could be published claiming 'significant alpha'")
print(f"  ALL {significant_5pct} are FALSE DISCOVERIES (true α = 0)")
print(f"  Using t > 3 (Harvey's threshold) reduces false discoveries to {significant_harvey}")
print("\n⚠️ This is why the replication crisis exists!")
print("  Published literature is full of noise masquerading as signal.")
The Sobering Reality
From 10,000 pure-noise factors, approximately 500 will appear significant at the 5% level. If only these are published, the literature shows 500 “discoveries”, yet every one of them is a false positive. Harvey’s |t| > 3 threshold reduces this to about 27.
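The counts of roughly 500 and 27 follow from the two-sided tail probability of a t-statistic under the null. A minimal sketch of that arithmetic, assuming t-statistics are approximately standard normal (reasonable for long monthly samples); the function name is illustrative:

```python
import math

def two_sided_tail(t_threshold: float) -> float:
    """P(|Z| > t_threshold) for a standard normal Z."""
    return math.erfc(t_threshold / math.sqrt(2))

total_factors = 10_000  # all pure noise, as in the simulation

for label, threshold in [("|t| > 1.96", 1.96), ("|t| > 3", 3.0)]:
    p = two_sided_tail(threshold)
    print(f"{label}: tail probability = {p:.4f}, "
          f"expected false discoveries = {total_factors * p:.0f}")
```

The tail probabilities come out at about 0.05 and 0.0027, which is where the 500 and 27 come from.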
Guarding Against Selection Bias
Selection bias is hard to eliminate but can be mitigated with rigorous practices. These separate good research from bad.
Best practices in research:
Pre-registration: Specify hypothesis before seeing data (medical trials standard)
Out-of-sample testing: Test on data not available when factor was published
Cross-region replication: Factors should work globally if they’re real
Multiple testing corrections: Use a |t| > 3 threshold (Harvey 2017) instead of |t| > 2
Economic theory: Value/momentum have theoretical foundations; “vowel tickers outperform” doesn’t
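To see how a stricter threshold changes the count of false discoveries, here is a small sketch on simulated pure-noise t-statistics. The Bonferroni correction shown alongside |t| > 3 is one standard textbook correction added for comparison; it is not something this lecture prescribes, and the seed and number of tests are arbitrary assumptions.

```python
import math
import numpy as np

rng = np.random.default_rng(0)           # arbitrary seed for reproducibility
n_tests = 10_000                         # hypothetical number of factors tried
t_stats = rng.standard_normal(n_tests)   # pure-noise t-statistics (true alpha = 0)

def two_sided_p(t: float) -> float:
    """Two-sided p-value under a standard normal approximation."""
    return math.erfc(abs(t) / math.sqrt(2))

p_values = np.array([two_sided_p(t) for t in t_stats])

raw_hits = int((p_values < 0.05).sum())                    # conventional 5% threshold
harvey_hits = int((np.abs(t_stats) > 3).sum())             # |t| > 3 rule of thumb
bonferroni_hits = int((p_values < 0.05 / n_tests).sum())   # Bonferroni: 5% divided by number of tests

print(f"p < 0.05:      {raw_hits} false discoveries")
print(f"|t| > 3:       {harvey_hits} false discoveries")
print(f"Bonferroni 5%: {bonferroni_hits} false discoveries")
```

Each stricter rule keeps a subset of what the looser rule kept, so the counts can only fall as the threshold rises. The price is lower power to detect real factors.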
For Coursework 2: Intellectual Honesty Earns Marks
If you tried 3 factors, disclose that (don’t pretend you tested only 1)
Report robustness failures, not just successes
If alpha is t = 2.1, acknowledge it’s marginal, not “strong evidence”
The 35% Critical Analysis component explicitly rewards honest limitation discussion
Part IV : Critical Analysis: What Makes Interpretation Rigorous?
Beyond Reporting Numbers: Ask Questions
Weak analysis (reporting):
“Value factor earns 0.5% monthly alpha with t = 2.3 (significant). Sharpe ratio is 0.4. Results are robust to sample split.”
Strong analysis (interpretation):
“Value earns 0.5% monthly alpha (6% annualised). This is economically meaningful but modest. Statistical significance (t = 2.3) suggests it’s not pure luck, but it is close to the threshold. Sample split shows alpha is stable (0.6% first half, 0.4% second half), increasing confidence. However, transaction costs (~0.2% monthly for value rebalancing) would reduce net alpha to 0.3% (3.6% annualised). Is 3.6% net alpha sufficient to compensate for tracking error and implementation frictions? The original paper reported 8% annualised; our replication shows 25% lower gross alpha (6% vs 8%) and 55% lower alpha net of costs, consistent with the post-publication decline documented by Jensen et al. (2024).”
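The numbers in the strong-analysis example can be checked with a few lines of arithmetic. This sketch uses simple (non-compounded) annualisation, matching the 0.5% × 12 = 6% convention in the example; the variable names are illustrative.

```python
gross_alpha_monthly = 0.005   # 0.5% per month, from the example
cost_monthly = 0.002          # ~0.2% per month trading cost, from the example
original_annual = 0.08        # 8% annualised, the example's "original paper" figure

gross_annual = gross_alpha_monthly * 12                  # simple annualisation
net_annual = (gross_alpha_monthly - cost_monthly) * 12   # alpha after costs

gross_shortfall = 1 - gross_annual / original_annual     # gap vs original, before costs
net_shortfall = 1 - net_annual / original_annual         # gap vs original, after costs

print(f"Gross alpha: {gross_annual:.1%} annualised")
print(f"Net alpha:   {net_annual:.1%} annualised")
print(f"Shortfall vs original: {gross_shortfall:.0%} gross, {net_shortfall:.0%} net")
```

Gross alpha is 6.0%, net alpha is 3.6%, and the shortfall against the 8% benchmark is 25% before costs and 55% after. Stating which comparison you are making avoids ambiguity.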
Interpreting Your Factor Results
When you analyse your factor’s performance, interrogate your conclusions:
Statistical vs economic significance: A t-stat of 2.1 clears the 1.96 threshold, but how confident would you be investing real money on that evidence?
Scale matters: What does 0.1% monthly alpha actually mean for an investor over a year? Is that worth pursuing?
Benchmarking performance: If the market delivers a Sharpe ratio around 0.4, what should you conclude about a factor with Sharpe of 0.3?
Robustness integrity: If some of your robustness tests pass and others fail, what story does that tell about your factor?
From paper to portfolio: What happens between calculating returns on a spreadsheet and actually implementing a trading strategy?
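For the benchmarking question, it helps to compute Sharpe ratios on the same footing. A minimal sketch, assuming monthly excess returns and the usual √12 annualisation (which itself assumes returns are roughly uncorrelated over time); the simulated series and their parameters are hypothetical, not coursework data.

```python
import numpy as np

def annualised_sharpe(monthly_excess_returns) -> float:
    """Annualised Sharpe ratio from monthly excess returns (sqrt(12) scaling)."""
    r = np.asarray(monthly_excess_returns, dtype=float)
    return float(r.mean() / r.std(ddof=1) * np.sqrt(12))

rng = np.random.default_rng(1)                  # arbitrary seed
factor_returns = rng.normal(0.003, 0.03, 240)   # hypothetical long-short factor returns
market_excess = rng.normal(0.005, 0.045, 240)   # hypothetical market excess returns

print(f"Factor Sharpe: {annualised_sharpe(factor_returns):.2f}")
print(f"Market Sharpe: {annualised_sharpe(market_excess):.2f}")
```

A long-short factor is a zero-investment portfolio, so its return is already an excess return and no risk-free rate is subtracted.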
The Implementation Gap
Academic factor returns assume frictionless trading. Real portfolios face transaction costs, market impact, and timing constraints. How might these affect your conclusions?
Part V : Preparation for Coursework 2: Principles, Not Templates
What the Scaffold Provides vs. What You Must Provide
The scaffold notebook is deliberately comprehensive: we want you to focus on understanding and interpretation, not debugging code. The 35% Critical Analysis component is where marks are won or lost.
Scaffold provides (execution):
Working code for data loading, alpha regression, robustness checks
All necessary functions pre-written (HAC standard errors, sample splits)
Publication-quality tables and figures ready for your report
You must provide (interpretation grounded in YOUR results):
Numerical engagement: “My alpha is X bp/month (t = Y). The original paper found Z bp. This N% difference likely reflects…”
Specific robustness narrative: Which tests passed? Which failed? What does that specific pattern tell you?
Your judgment, defended: Would you invest £10,000 of your own money in this factor? Why or why not, given YOUR numbers?
Process reflection: What did you expect to find? What surprised you? What would you do differently?
Generic explanations of “why HAC matters” or “what limitations exist” won’t earn marks. Examiners want to see you grapple with your specific results.
Strategic focus: Spend 1-2 hours on code, 8-10 hours making sense of what YOUR numbers mean
Questions to Ask About YOUR Results
When you have your output, interrogate it. These questions connect today’s principles to YOUR specific analysis.
About your methodology:
Did your robustness tests pass or fail? What does that specific pattern suggest?
How does your sample period compare to the original paper’s? Does that explain any differences?
About your statistics:
Is your t-stat comfortably above 2, or hovering near the threshold? What’s the practical difference?
Your alpha is X bp monthly. What does that mean for a £10m portfolio over a year?
About your judgment:
Given YOUR numbers, would you recommend this factor to a pension fund? Why or why not?
If your results are weaker than the original paper, is that replication failure, or exactly what you’d expect?
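The £10m question is simple arithmetic once units are explicit. A small sketch, assuming simple ×12 annualisation and ignoring costs and compounding; the function name and the example alphas are hypothetical.

```python
def annual_pounds(alpha_bp_per_month: float, portfolio_gbp: float) -> float:
    """Approximate annual alpha in pounds: basis points per month -> GBP per year."""
    return alpha_bp_per_month / 10_000 * 12 * portfolio_gbp

for bp in (10, 28, 50):  # hypothetical monthly alphas in basis points
    print(f"{bp} bp/month on £10m ≈ £{annual_pounds(bp, 10_000_000):,.0f} per year")
```

Substitute your own X bp and then ask whether that sum survives the costs and risks you identified.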
Using AI Tools Appropriately
AI tools like ChatGPT and Copilot are permitted, but how you use them determines whether they help or hurt your work.
AI can help you:
Understand concepts: “Explain HAC standard errors in simple terms”
Debug code: “Why is this pandas merge failing?”
Learn techniques: “Show me how to calculate Newey-West standard errors”
AI cannot help you:
Interpret YOUR specific results: It doesn’t know your alpha is 0.28% with t = 1.9
Explain YOUR robustness pattern: It can’t see that your early-sample passed but late-sample failed
Defend YOUR recommendation: Generic “factors can be useful” isn’t a position
The Specificity Test
If your critical analysis section could have been written without ever looking at your actual output, that’s a problem, regardless of whether AI wrote it or you did.
The test: Would an examiner reading your analysis know which specific factor you studied and what YOUR results showed?
Academic Integrity: Detection and Verification
To maintain fairness for all students, I have developed a multi-model GenAI detection architecture that analyses submission patterns across multiple dimensions.
What this means:
All coursework submissions are processed through this system
The system flags submissions with characteristics suggesting over-reliance on AI-generated content
Flags are reviewed by me personally: the system assists, it doesn’t decide
Oral Examination Rights
I reserve the right to orally examine any student whose submission is flagged by this system. You may be asked to explain your analysis, walk through your reasoning, and demonstrate understanding of your own work.
This is not about catching you out: it’s about ensuring your degree means something. Students who genuinely engage with their analysis have nothing to worry about.
Demonstration: HAC Standard Errors in Practice
Show code: OLS vs HAC comparison
import pandas as pd
import numpy as np
import statsmodels.api as sm
from statsmodels.regression.linear_model import OLS
from statsmodels.stats.sandwich_covariance import cov_hac

# Simulate factor returns with autocorrelation (illustrative)
np.random.seed(42)
n = 240  # 20 years monthly
market = np.random.normal(0.008, 0.04, n)

# Factor with positive alpha and autocorrelation
factor = 0.005 + 0.3 * market + np.random.normal(0, 0.03, n)
for i in range(1, n):
    factor[i] += 0.3 * factor[i-1]  # Autocorrelation

# Regression
X = sm.add_constant(market)
model_ols = OLS(factor, X).fit()

# HAC standard errors (Newey-West with 6 lags)
cov_hac_matrix = cov_hac(model_ols, nlags=6)
se_hac = np.sqrt(np.diag(cov_hac_matrix))
t_hac = model_ols.params / se_hac

# Comparison table
comparison = pd.DataFrame({
    'Coefficient': model_ols.params,
    'OLS SE': model_ols.bse,
    'HAC SE': se_hac,
    'OLS t-stat': model_ols.tvalues,
    'HAC t-stat': t_hac
}, index=['Alpha', 'Beta'])

print("=== OLS vs HAC Standard Errors ===\n")
print(comparison.round(4))
print(f"\n📊 Interpretation:")
print(f" Alpha = {model_ols.params[0]*100:.2f}% monthly ({model_ols.params[0]*12*100:.1f}% annualised)")
print(f" OLS: t = {model_ols.tvalues[0]:.2f} (significant at 5% if |t| > 1.96)")
print(f" HAC: t = {t_hac[0]:.2f} (adjusts for autocorrelation)")
print(f" HAC standard error is {se_hac[0]/model_ols.bse[0]:.1f}x larger than OLS")
if abs(model_ols.tvalues[0]) > 1.96 and abs(t_hac[0]) < 1.96:
    print(f" ⚠️ Result is significant with OLS but NOT with HAC!")
    print(f" Using OLS would lead to false positive. HAC prevents this.")
What this demonstration shows:
HAC standard errors are typically 1.3-2× larger than OLS for autocorrelated data
t-statistics drop correspondingly (OLS t = 2.3 might become HAC t = 1.7)
Results “significant” with OLS can become insignificant with HAC
This is why HAC is required for honest time-series inference
Coursework Requirement
Using OLS for time-series factor data loses marks for incorrect methodology. Always use HAC. The scaffold implements this automatically: you just need to understand why it matters for interpretation.
Next Steps: Week 11 Preview
Week 11 complements today by covering the prediction pathway (Coursework 2 Option B). Same principle-focused approach: understanding concepts, not copying templates.
Week 11 focus: Market prediction using factors
Predict next-month market return using lagged factor data
Compare OLS vs regularised models (ridge regression handles multicollinearity)
Walk-forward validation for honest out-of-sample testing (prevents look-ahead bias)
Evaluate predictive power: R² OOS, directional accuracy, economic value
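As a preview of the idea behind walk-forward validation, here is a minimal sketch on simulated data, not the Week 11 material itself. It assumes an expanding training window, a single hypothetical predictor already aligned so that `signal[t]` is known before `returns[t]` is realised, and an out-of-sample R² measured against a historical-mean benchmark (a common convention, assumed here).

```python
import numpy as np

rng = np.random.default_rng(3)                   # arbitrary seed
n = 240
signal = rng.standard_normal(n)                  # hypothetical lagged predictor
returns = 0.005 + 0.002 * signal + rng.normal(0, 0.04, n)  # simulated market returns

min_train = 120                                  # first forecast uses 10 years of history
preds, bench, actual = [], [], []
for t in range(min_train, n):
    x_train, y_train = signal[:t], returns[:t]   # only data available before month t
    slope, intercept = np.polyfit(x_train, y_train, 1)  # OLS fit on the training window
    preds.append(intercept + slope * signal[t])  # model forecast for month t
    bench.append(y_train.mean())                 # historical-mean forecast for month t
    actual.append(returns[t])

preds, bench, actual = map(np.array, (preds, bench, actual))
r2_oos = 1 - np.sum((actual - preds) ** 2) / np.sum((actual - bench) ** 2)
hit_rate = np.mean(np.sign(preds) == np.sign(actual))

print(f"Out-of-sample R²: {r2_oos:.4f}")
print(f"Directional accuracy: {hit_rate:.1%}")
```

Because every forecast uses only earlier observations, nothing from the future leaks into the fit. A negative out-of-sample R² means the model did worse than simply forecasting the historical mean.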
Same pedagogical philosophy:
Principles and understanding over step-by-step instructions
Critical interpretation over mechanical execution
Preparation for 35% Critical Analysis component
Connection: Replication (today) tests if factors exist. Prediction (next week) tests if factors forecast returns. Both require rigorous methodology and critical interpretation.
Robustness checks (sample split, subperiod, costs) separate signal from noise
Critical analysis earns marks:
Interpret economically: 0.5% monthly = 6% annualised, but is it meaningful after costs?
Acknowledge uncertainty: t = 2.1 is marginally significant, not “strong evidence”
Honest limitations discussion: what didn’t you test? Selection bias? Post-publication decline?
Start coursework early: read Jensen et al. (2024), run scaffold, focus on interpretation
Banz, Rolf W. 1981. “The Relationship Between Return and Market Value of Common Stocks.” Journal of Financial Economics 9 (1): 3–18. https://doi.org/10.1016/0304-405X(81)90018-0.
Gelman, Andrew, and Eric Loken. 2014. “The Statistical Crisis in Science.” American Scientist 102 (6): 460–65. https://doi.org/10.1511/2014.111.460.
Harvey, Campbell R. 2017. “Presidential Address: The Scientific Outlook in Financial Economics.” Journal of Finance 72 (4): 1399–1440. https://doi.org/10.1111/jofi.12530.
Harvey, Campbell R., Yan Liu, and Heqing Zhu. 2020. “False (and Missed) Discoveries in Financial Economics.” Journal of Finance 75 (5): 2503–53. https://doi.org/10.1111/jofi.12960.
Jegadeesh, Narasimhan, and Sheridan Titman. 1993. “Returns to Buying Winners and Selling Losers: Implications for Stock Market Efficiency.” Journal of Finance 48 (1): 65–91. https://doi.org/10.1111/j.1540-6261.1993.tb04702.x.
Jensen, Theis I., Bryan T. Kelly, and Lasse Heje Pedersen. 2024. “Is There a Replication Crisis in Finance?” Journal of Finance. https://doi.org/10.1111/jofi.13249.