Week 10: Factor Replication : Principles & Critical Analysis

Learning Objectives

  • Explain factor replication as a research methodology for testing published findings
  • Interpret HAC (Heteroskedasticity and Autocorrelation Consistent) standard errors and understand why time-series autocorrelation matters
  • Evaluate factor performance using multiple metrics (Sharpe, alpha, robustness)
  • Identify sources of selection bias and overfitting in factor research
  • Apply critical thinking to assess whether documented factors are exploitable

Agenda

Part I : What is factor replication? Research methodology foundations
Part II : Statistical foundations: HAC errors, alpha tests, robustness
Part III : Selection bias and the replication crisis in finance
Part IV : Critical analysis: What makes interpretation rigorous?
Part V : Preparation for Coursework 2: Principles, not templates

Part I : Factor Replication as Research Methodology

What Are Factors?

Factors are the building blocks of modern quantitative investing. Rather than picking individual stocks, factor strategies systematically buy characteristics that historically generate excess returns.

Definition: Characteristics that explain cross-sectional variation in stock returns

Classic examples:

  • Value (HML): High Minus Low book-to-market : buy undervalued stocks, sell overvalued (Fama and French 1992)
  • Momentum (MOM): Buy past 6-12 month winners, sell losers (behavioural persistence) (Jegadeesh and Titman 1993)
  • Size (SMB): Small Minus Big : small-cap premium (though weakening post-publication) (Banz 1981; Fama and French 1992)
  • Quality (RMW): Robust Minus Weak profitability : sustainable competitive advantages (Novy-Marx 2013; Fama and French 2015)

Long-Short Construction: Zero-Investment Portfolios

Factors are constructed as long-short portfolios: simultaneously buying one group and selling another. This isolates factor exposure from market movements.

Mechanics:

  • Long leg: Buy stocks with desired characteristic (e.g., high book-to-market = value)
  • Short leg: Sell stocks with opposite characteristic (e.g., low book-to-market = growth)
  • Equal weights: Long and short legs have equal monetary amounts (£100 each in the example below)
  • Net investment: £0 (long purchases offset short sales)

Example: Value Factor (HML)

Component  Valuation               Action                   Investment  Stock return  Position P&L
Long       Undervalued (high B/M)  Buy £100 value stocks    -£100       +5%           +£5
Short      Overvalued (low B/M)    Sell £100 growth stocks  +£100       -2%           +£2 (short gains when growth falls)
Net        Market-neutral          Long-short portfolio     £0                        +£7 (7% of £100 per leg)

Factor return = Long return - Short return = 5% - (-2%) = 7%

Show calculation: Long-short factor return
import pandas as pd
import numpy as np

# Illustrative example: value factor construction (numbers are made up)
notional = 100  # £ invested in each leg

# Long leg: value stocks (high book-to-market) rise 5%
value_return = 0.05

# Short leg: growth stocks (low book-to-market) fall 2%,
# so the short position gains 2%
growth_return = -0.02
short_return = -growth_return

# Factor return = long-leg stock return minus short-leg stock return
factor_return = value_return - growth_return

# Create visualization table
construction = pd.DataFrame({
    'Component': ['Long (Value)', 'Short (Growth)', 'Factor (HML)'],
    'Valuation': ['Undervalued (high B/M)', 'Overvalued (low B/M)', 'Market-neutral'],
    'Action': ['Buy £100 value', 'Sell £100 growth', 'Net portfolio'],
    'Investment': ['-£100', '+£100', '£0'],
    'Return (%)': [round(value_return*100, 1), round(short_return*100, 1),
                   round(factor_return*100, 1)],
    'P&L': [f'+£{notional * value_return:.0f}',
            f'+£{notional * short_return:.0f}',
            f'+£{notional * factor_return:.0f}']
})

print("=== Long-Short Factor Construction Example ===\n")
print(construction.to_string(index=False))
print(f"\n📊 Key Insight:")
print(f"   Factor return ({factor_return*100:.1f}%) = Value return ({value_return*100:.1f}%) - Growth return ({growth_return*100:.1f}%)")
print(f"   Net investment = £0 (long purchase funded by short sale proceeds)")
print(f"   Factor isolates value premium, independent of market movements")

Why Zero Investment?

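Because the short sale funds the long purchase, the portfolio needs no net capital in principle (in practice, short positions still require margin). Returns therefore reflect factor exposure rather than market risk: if the market rises 10% and both legs have similar betas, both legs rise with it and the factor return isolates the difference between them.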
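The cancellation of the market component can be checked with a few lines of arithmetic. This is a minimal sketch with made-up numbers; it assumes both legs have exactly the same market beta, which real portfolios only approximate.

```python
# Illustrative numbers only: both legs are ASSUMED to have the same market beta.
market_move = 0.10        # market rises 10%
beta_long = 1.0
beta_short = 1.0
value_specific = 0.03     # value stocks' return beyond the market (hypothetical)
growth_specific = -0.04   # growth stocks' return beyond the market (hypothetical)

long_return = beta_long * market_move + value_specific          # 13%
short_leg_return = beta_short * market_move + growth_specific   # 6%

# Long minus short: the 10% market move appears in both legs and cancels.
factor_return = long_return - short_leg_return

print(f"Long leg stocks:  {long_return:.1%}")
print(f"Short leg stocks: {short_leg_return:.1%}")
print(f"Factor return:    {factor_return:.1%}")
```

If the betas differ, a residual market exposure of (beta_long - beta_short) × market move remains, which is one reason the alpha test later regresses the factor on the market.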

Factor Replication: What Does It Mean?

Replication is core scientific practice. In medicine, we demand multiple trials before approving drugs. In finance, if a factor works 1970-1990, we demand it works 1991-2020.

Replication = reproduce published findings using independent data or time periods

Why replicate?

  • Test whether published results are real or data mining artifacts
  • Assess out-of-sample performance (does it work on new data?)
  • Evaluate economic significance (after costs, is there exploitable profit?)
  • Understand robustness (does it work across markets, time periods, specifications?)

Jensen, Kelly & Pedersen (2024): “Is There a Replication Crisis in Finance?”

  • Tested 153 published factors across 93 countries using consistent methodology
  • Finds that the majority of factors can be replicated and also work out-of-sample in the global data
  • Argues that earlier "replication crisis" claims are overstated, making it a counterpoint to more pessimistic studies
  • Takeaway: Replication verdicts depend on methodology, so always check how a factor was constructed and tested

Factor Replication Workflow

This is a conceptual framework, not a mechanical recipe. The scaffold notebook implements these steps, but understanding why each matters separates a pass from a distinction.

Conceptual steps:

  1. Choose factor: Select published factor with theoretical motivation
  2. Obtain data: Download returns from JKP portal (https://jkpfactors.com)
  3. Descriptive analysis: Mean, volatility, Sharpe ratio, cumulative returns
  4. Alpha test: Regress factor on market using HAC standard errors
  5. Robustness checks: Sample splits, subperiod analysis, cost adjustments
  6. Interpretation: Is factor real? Exploitable after costs? What are limitations?

Each step requires judgment: which robustness checks matter depends on your specific factor.
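Step 3 (descriptive analysis) can be sketched as follows. The returns here are simulated, not JKP data, and the 0.4% mean and 3% monthly volatility are arbitrary assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical monthly long-short factor returns (simulated, 20 years)
monthly = rng.normal(loc=0.004, scale=0.03, size=240)

ann_mean = monthly.mean() * 12                  # annualised mean return
ann_vol = monthly.std(ddof=1) * np.sqrt(12)     # annualised volatility
# A long-short return is already a zero-investment (excess) return,
# so no risk-free rate is subtracted here.
sharpe = ann_mean / ann_vol
cumulative = np.prod(1 + monthly) - 1           # compounded total return

print(f"Annualised mean:       {ann_mean:.2%}")
print(f"Annualised volatility: {ann_vol:.2%}")
print(f"Sharpe ratio:          {sharpe:.2f}")
print(f"Cumulative return:     {cumulative:.2%}")
```

With real data, replace the simulated array with the factor return series downloaded in step 2.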

Part II : Statistical Foundations for Rigorous Replication

Signal and Noise in Financial Returns

Financial returns are inherently noisy. Even if a factor has true alpha, observed returns mix signal (predictable component) with noise (random variation). Standard errors help us distinguish signal from noise.

The challenge:

  • Signal: True factor alpha (e.g., value stocks genuinely outperform)
  • Noise: Random variation (luck, market shocks, measurement error)
  • Observed return = Signal + Noise

Why this matters:

  • With 20 years of monthly data (240 observations), noise can create spurious patterns
  • A factor might appear significant just by chance (noise masquerading as signal)
  • Standard errors quantify how much noise contaminates our signal estimate

Example: True alpha = 0% (no factor), but observed alpha = 0.5% monthly
→ Is this signal (real factor) or noise (lucky sample)?
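One way to build intuition for this question is to simulate many factors whose true alpha is exactly zero and count how often a 20-year sample still shows an average of 0.5% per month or more in absolute value. This is a sketch under assumed inputs: the 3% monthly volatility and the normal distribution are illustrative choices, not estimates.

```python
import numpy as np

rng = np.random.default_rng(1)
n_months, n_sims = 240, 10_000
monthly_vol = 0.03   # ASSUMED factor volatility (3% per month)

# Every simulated factor has a true alpha of exactly zero.
samples = rng.normal(0.0, monthly_vol, size=(n_sims, n_months))
observed_alpha = samples.mean(axis=1)

# Standard error of the sample mean under these assumptions
se_mean = monthly_vol / np.sqrt(n_months)

# Share of pure-noise samples that look like a 0.5%/month factor
share_lucky = np.mean(np.abs(observed_alpha) >= 0.005)

print(f"Standard error of the mean: {se_mean:.3%} per month")
print(f"Share of zero-alpha samples with |mean| >= 0.5%: {share_lucky:.2%}")
```

Under these assumptions 0.5% is roughly 2.6 standard errors from zero, so a single lucky sample is rare. The picture changes with higher volatility, shorter samples, or many factors tested at once.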

The Fundamental Problem

Financial returns have a low signal-to-noise ratio. In real Bloomberg data (2018-2025), SPY has a signal-to-noise ratio (|mean| / standard deviation of returns) of 0.042, so noise is about 24× larger than signal. In variance terms, only about 0.2% is signal (0.042² ≈ 0.002) and 99.8% is noise. This makes statistical inference challenging.

Measuring Signal-to-Noise: Methodology

To quantify signal vs noise, we decompose return variance into predictable (signal) and unpredictable (noise) components using conditional expectations.

Econometrically Rigorous Construction:

For returns \(r_t\), we define signal as the component explained by the conditioning information, here the contemporaneous market return \(r_{m,t}\):

  1. Conditional Expectation Model: \(E[r_t | r_{m,t}] = \alpha + \beta \cdot r_{m,t}\)
    • Explained component based on market exposure (CAPM)
    • Varies over time with the market return, rather than being a constant mean
  2. Variance Decomposition:
    • Total Variance: \(\text{Var}(r_t) = \sigma^2\)
    • Signal Variance: \(\text{Var}(E[r_t | r_{m,t}])\) (variance of conditional expectation)
    • Noise Variance: \(\text{Var}(r_t - E[r_t | r_{m,t}])\) (variance of residuals)
    • Signal Fraction: \(R^2\) from the regression (proportion of variance explained)
    • Noise Fraction: \(1 - R^2\) (proportion unexplained)

Why This Is More Rigorous:

  • Uses conditional expectation \(E[r_t | r_{m,t}]\) rather than unconditional mean
  • Captures the component explained by market exposure; a richer information set (lagged returns, other factors) would be needed to capture autocorrelation or time-varying expected returns
  • Signal = variance explained by the conditioning information; Noise = residual variance
  • Note: "predictable" here means explained by the contemporaneous market return, not forecastable in advance

Implementation via CAPM Regression:

Regress asset returns on market: \(r_t = \alpha + \beta \cdot r_{m,t} + \varepsilon_t\)

  • Signal = \(\hat{\alpha} + \hat{\beta} \cdot r_{m,t}\) (predicted returns)
  • Noise = \(\hat{\varepsilon}_t\) (residuals)
  • Signal Fraction = \(R^2\) (variance explained by market)
  • Noise Fraction = \(1 - R^2\) (unexplained variance)

Example: SPY regressed on itself (market proxy)

  • SPY: \(R^2 \approx 1.0\) (SPY explains itself perfectly)
  • Individual stocks: \(R^2 \approx 0.3-0.6\) (30-60% signal, 40-70% noise)
  • Factors: \(R^2\) typically lower (most variance is idiosyncratic noise)

Show calculation: Econometrically rigorous signal-to-noise
import numpy as np
import pandas as pd
import statsmodels.api as sm

df = load_bloomberg()
# Get SPY and market (SPY as market proxy)
spy_data = df[df['ticker'] == 'SPY'].sort_values('date')
market = spy_data['return'].values
asset = spy_data['return'].values  # SPY regressed on itself for demonstration
dates = spy_data['date'].values

# Econometrically rigorous approach: CAPM regression
X = sm.add_constant(market)
model = sm.OLS(asset, X).fit()

# Signal = predicted returns (conditional expectation)
predicted = model.fittedvalues
signal_var = np.var(predicted)

# Noise = residuals (unpredictable component)
residuals = model.resid
noise_var = np.var(residuals)

# Total variance
total_var = np.var(asset)

# Signal fraction = R² (variance explained by model)
signal_fraction = model.rsquared
noise_fraction = 1 - model.rsquared

# Signal-to-noise ratio (using conditional expectation)
signal_mean = np.abs(predicted.mean())
noise_std = np.std(residuals)
signal_noise_ratio = signal_mean / noise_std if noise_std > 0 else np.nan

print("=== Econometrically Rigorous Signal-to-Noise ===\n")
print("Method: CAPM Regression (Conditional Expectation)")
print(f"Model: r_t = α + β × market_t + ε_t\n")

print(f"Regression Results:")
print(f"  R² (Signal Fraction):  {signal_fraction:.4f} ({signal_fraction:.1%})")
print(f"  1 - R² (Noise Fraction): {noise_fraction:.4f} ({noise_fraction:.1%})")
print(f"  α (intercept):         {model.params[0]*100:.4f}% daily")
print(f"  β (market exposure):    {model.params[1]:.4f}")

print(f"\nVariance Decomposition:")
print(f"  Total variance:        {total_var*10000:.4f} (variance × 10⁴, i.e. in %²)")
print(f"  Signal variance:       {signal_var*10000:.4f} (Var(E[r|I]))")
print(f"  Noise variance:        {noise_var*10000:.4f} (Var(ε))")
print(f"  Signal fraction:       {signal_fraction:.1%}")
print(f"  Noise fraction:        {noise_fraction:.1%}")

print(f"\nSignal-to-Noise Ratio:")
print(f"  Signal mean:           {signal_mean*100:.4f}% daily")
print(f"  Noise std:             {noise_std*100:.4f}% daily")
print(f"  Signal/Noise ratio:    {signal_noise_ratio:.4f}")

print(f"\n💡 Econometric Interpretation:")
print(f"   Signal = predictable component E[r_t | market_t]")
print(f"   Noise = residual ε_t (unpredictable given market)")
print(f"   {signal_fraction:.1%} of variance is explained by market exposure")
print(f"   {noise_fraction:.1%} is idiosyncratic noise")
print(f"   This is more rigorous than unconditional mean approach!")

Econometric Rigor

This approach uses the conditional expectation \(E[r_t | r_{m,t}]\) rather than the unconditional mean. Signal is what is explained by the conditioning information (here the market return; other factors could be added). In this CAPM implementation it captures market exposure only; capturing autocorrelation or time-varying expected returns would require lagged predictors.

Why This Matters

For individual assets, \(R^2\) from CAPM is typically 30-60% (signal), meaning 40-70% is noise. For factors themselves (long-short portfolios), \(R^2\) is often lower because most variance is idiosyncratic. This quantifies why detecting true factors is challenging: even with market information, much variance remains unpredictable.

Real-World Signal-to-Noise: Bloomberg Data

Using real financial data (2018-2025) from Bloomberg Terminal, we apply the econometrically rigorous approach: signal = predictable component from CAPM regression.

Method: Regress each asset on market (SPY) to extract conditional expectation \(E[r_t | \text{market}_t]\)

Signal Fraction = R² from CAPM (variance explained by market exposure)

Show visualization: Professional signal-to-noise analysis
from pathlib import Path

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# data_root is assumed to be defined in the notebook setup

_csv_path = data_root / "bloomberg_database" / "signal_noise_metrics.csv"
if not _csv_path.exists():
    _csv_path = Path("data/bloomberg_database/signal_noise_metrics.csv")
metrics = pd.read_csv(_csv_path)

# Select example assets
example_assets = ['SPY', 'AAPL', 'VIX', 'BTCUSD']
display_metrics = metrics[metrics['asset'].isin(example_assets)].copy()

# Create professional visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: Signal-to-Noise Ratio Comparison
ax1 = axes[0]
colors = {'Equity': '#2E86AB', 'Risk Gauge': '#F24236', 'Crypto': '#06A77D'}
for asset_type in display_metrics['asset_type'].unique():
    subset = display_metrics[display_metrics['asset_type'] == asset_type]
    ax1.scatter(subset['signal_noise_ratio'], subset['sharpe_ratio'], 
               label=asset_type, alpha=0.8, s=150, color=colors.get(asset_type, 'gray'),
               edgecolors='white', linewidth=2)

# Add asset labels
for _, row in display_metrics.iterrows():
    ax1.annotate(row['asset'], 
                (row['signal_noise_ratio'], row['sharpe_ratio']),
                xytext=(5, 5), textcoords='offset points', fontsize=9, fontweight='bold')

ax1.set_xlabel('Signal-to-Noise Ratio (|Mean| / Std)', fontsize=11, fontweight='bold')
ax1.set_ylabel('Sharpe Ratio (Annual)', fontsize=11, fontweight='bold')
ax1.set_title('Signal-to-Noise vs Risk-Adjusted Returns', fontsize=12, fontweight='bold', pad=10)
ax1.legend(loc='upper left', frameon=True, fancybox=True, shadow=True)
ax1.grid(True, alpha=0.3, linestyle='--')

# Plot 2: Signal vs Noise Fraction (Stacked Bar Chart)
ax2 = axes[1]
assets = display_metrics['asset'].values
signal_fracs = display_metrics['signal_fraction'].values * 100
noise_fracs = display_metrics['noise_fraction'].values * 100

x_pos = np.arange(len(assets))
width = 0.6

bars1 = ax2.barh(x_pos, signal_fracs, width, label='Signal', color='#2E86AB', alpha=0.8)
bars2 = ax2.barh(x_pos, noise_fracs, width, left=signal_fracs, label='Noise', 
                 color='#F24236', alpha=0.8)

# Add percentage labels
for i, (s, n) in enumerate(zip(signal_fracs, noise_fracs)):
    if s > 0.1:  # Only label if signal is visible
        ax2.text(s/2, i, f'{s:.2f}%', ha='center', va='center', fontweight='bold', 
                fontsize=8, color='white')
    ax2.text(s + n/2, i, f'{n:.1f}%', ha='center', va='center', fontweight='bold',
            fontsize=9, color='white')

ax2.set_yticks(x_pos)
ax2.set_yticklabels(assets, fontweight='bold')
ax2.set_xlabel('Variance Fraction (%)', fontsize=11, fontweight='bold')
ax2.set_title('Signal vs Noise Decomposition', fontsize=12, fontweight='bold', pad=10)
ax2.legend(loc='lower right', frameon=True, fancybox=True, shadow=True)
ax2.grid(True, alpha=0.3, axis='x', linestyle='--')
ax2.set_xlim(0, 100)

plt.tight_layout()
plt.show()

Show calculation: Real-world signal-to-noise from Bloomberg data
from pathlib import Path

import pandas as pd
import numpy as np

df = load_bloomberg()
_csv_path = data_root / "bloomberg_database" / "signal_noise_metrics.csv"
if not _csv_path.exists():
    _csv_path = Path("data/bloomberg_database/signal_noise_metrics.csv")
metrics = pd.read_csv(_csv_path)

# Select example assets and recalculate using CAPM approach
example_assets = ['SPY', 'AAPL', 'VIX', 'BTCUSD']

# Recalculate metrics using CAPM regression (econometrically rigorous)
import statsmodels.api as sm

# Get market returns (SPY)
market_data = df[df['ticker'] == 'SPY'].sort_values('date')
market_returns = market_data['return'].values

recalculated_metrics = []
for asset_name in example_assets:
    asset_data = df[df['ticker'] == asset_name].sort_values('date')
    if len(asset_data) < 30:
        continue
    
    # Merge to align dates
    merged = pd.merge(
        asset_data[['date', 'return']],
        market_data[['date', 'return']],
        on='date',
        suffixes=('_asset', '_market')
    )
    
    if len(merged) < 30:
        continue
    
    asset_ret = merged['return_asset'].values
    market_ret = merged['return_market'].values
    
    # CAPM regression
    X = sm.add_constant(market_ret)
    model = sm.OLS(asset_ret, X).fit()
    
    # Signal fraction = R² (variance explained by market)
    signal_fraction = model.rsquared
    noise_fraction = 1 - model.rsquared
    
    # Signal-to-noise ratio
    predicted = model.fittedvalues
    residuals = model.resid
    signal_mean = np.abs(predicted.mean())
    noise_std = np.std(residuals)
    signal_noise_ratio = signal_mean / noise_std if noise_std > 0 else np.nan
    
    recalculated_metrics.append({
        'asset': asset_name,
        'asset_type': asset_data['asset_type'].iloc[0] if 'asset_type' in asset_data.columns else 'Unknown',
        'signal_fraction': signal_fraction,
        'noise_fraction': noise_fraction,
        'signal_noise_ratio': signal_noise_ratio,
        'r_squared': signal_fraction,
        'beta': model.params[1] if len(model.params) > 1 else np.nan
    })

display_metrics = pd.DataFrame(recalculated_metrics)

# Display metrics table
print("\n=== Real-World Signal-to-Noise Metrics (CAPM-based) ===\n")
if 'r_squared' in display_metrics.columns:
    print(display_metrics[['asset', 'asset_type', 'r_squared', 
                          'signal_fraction', 'noise_fraction', 'beta']].round(4).to_string(index=False))
else:
    print(display_metrics[['asset', 'asset_type', 'signal_fraction', 
                          'noise_fraction']].round(4).to_string(index=False))

print("\n📊 Interpretation (CAPM-based):")
print("   - Signal fraction = R² from CAPM regression")
print("   - SPY: R² ≈ 1.0 (perfectly explained by itself)")
print("   - Individual stocks: R² ≈ 0.5-0.6 (50-60% explained by market)")
print("   - Factors/Crypto: R² ≈ 0.1-0.2 (10-20% explained, rest is noise)")
print("   - This captures predictable component conditional on market information!")

# Show AAPL example in detail (more interesting than SPY)
aapl_data = df[df['ticker'] == 'AAPL'].sort_values('date')
merged = pd.merge(aapl_data[['date', 'return']], market_data[['date', 'return']],
                 on='date', suffixes=('_aapl', '_market'))

X = sm.add_constant(merged['return_market'].values)
model = sm.OLS(merged['return_aapl'].values, X).fit()

print(f"\n🔍 AAPL Example (CAPM Regression):")
print(f"   R² (Signal Fraction): {model.rsquared:.1%}")
print(f"   1-R² (Noise Fraction): {1-model.rsquared:.1%}")
print(f"   β (Market Exposure): {model.params[1]:.2f}")
print(f"   α (Intercept): {model.params[0]*100:.4f}% daily")
print(f"   → {model.rsquared:.1%} of variance is predictable from market")
print(f"   → {1-model.rsquared:.1%} is idiosyncratic noise")

The Reality Check

Using conditional expectation (CAPM), signal fractions vary dramatically by asset type. Individual equities: 50-60% signal (market exposure explains most variance). Factors and crypto: 10-20% signal (80-90% noise). This econometrically rigorous approach captures what’s truly predictable given market information: much more informative than unconditional mean.

Why Standard Errors Matter

Standard errors quantify uncertainty in estimates. In factor replication, we’re testing whether observed alpha is signal (true factor premium) or noise (random variation).

Connection to signal-to-noise analysis:

  • Individual stocks (AAPL): R² ≈ 0.55 → 45% noise → larger standard errors
  • Factors (long-short portfolios): R² ≈ 0.1-0.2 → 80-90% noise → much larger standard errors
  • Implication: Factor alpha estimates are less precise than stock alpha estimates

Statistical significance = “Is observed alpha signal or noise?”

t-statistic = Alpha / Standard Error

  • |t| > 1.96 → statistically significant at 5% level (conventional threshold)
  • |t| < 1.96 → cannot reject null hypothesis (could be random chance)
  • Harvey (2017) recommends t > 3 for finance (multiple testing correction)

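Why factors need higher t-statistics: The t-statistic already scales alpha by its standard error, so noise alone does not justify a higher hurdle. The case for t > 3 is multiple testing: when hundreds of candidate factors are tried, some will clear 1.96 by chance, and with 80-90% noise fractions such lucky draws are easy to mistake for signal.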
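The multiple-testing motivation for the t > 3 hurdle can be illustrated by simulation. This sketch assumes 300 independent candidate factors with zero true alpha and 3% monthly volatility; all of these numbers are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n_factors, n_months = 300, 240

# 300 hypothetical factors, none of which has any true alpha.
returns = rng.normal(0.0, 0.03, size=(n_factors, n_months))

# Plain t-statistic of the mean return for each factor
means = returns.mean(axis=1)
std_errs = returns.std(axis=1, ddof=1) / np.sqrt(n_months)
t_stats = means / std_errs

false_pos_196 = int(np.sum(np.abs(t_stats) > 1.96))
false_pos_3 = int(np.sum(np.abs(t_stats) > 3.0))

print(f"Zero-alpha factors with |t| > 1.96: {false_pos_196} of {n_factors}")
print(f"Zero-alpha factors with |t| > 3.00: {false_pos_3} of {n_factors}")
```

About 5% of pure-noise factors are expected to pass the 1.96 threshold, whereas very few pass 3. Real factors are correlated with each other, so the independence assumption here is a simplification.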

Common Mistake

Alpha = 1% monthly with t = 0.5 is not significant (likely noise). Alpha = 0.3% monthly with t = 3 is significant (likely signal). For judging whether alpha is distinguishable from noise, the t-statistic matters more than the magnitude, especially for factors with a high noise fraction.
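The two cases above can be turned into implied standard errors and approximate 95% confidence intervals, using only the relation t = alpha / SE. A minimal sketch:

```python
# (alpha per month, t-statistic) taken from the two cases above
cases = {
    "Alpha 1.0%, t = 0.5": (0.010, 0.5),
    "Alpha 0.3%, t = 3.0": (0.003, 3.0),
}

intervals = {}
for name, (alpha, t_stat) in cases.items():
    se = alpha / t_stat                 # implied standard error
    lower = alpha - 1.96 * se           # approximate 95% interval
    upper = alpha + 1.96 * se
    intervals[name] = (lower, upper)
    print(f"{name}: SE = {se:.2%}, 95% CI = [{lower:.2%}, {upper:.2%}]")
```

The interval for the large alpha comfortably includes zero, while the interval for the small alpha does not.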

How Standard Errors Are Constructed

Standard errors measure both estimation precision and sampling variability. They are calculated from the variance-covariance matrix (error variance ÷ sample size), but their frequentist interpretation is as the standard deviation of the sampling distribution under hypothetical repeated sampling. High noise variance → large standard errors → imprecise estimates.

Basic OLS standard error formula:

For regression coefficient \(\hat{\beta}\):

\[ SE(\hat{\beta}) = \sqrt{\frac{\hat{\sigma}^2}{\sum (X_i - \bar{X})^2}} \]

where \(\hat{\sigma}^2\) is estimated error variance.

Key components:

  • Error variance (\(\hat{\sigma}^2\)): How much returns deviate from predicted values
  • Sample size (\(n\)): More observations → smaller SE (more precision)
  • Variation in X: More variation in predictor → smaller SE (better identification)

For financial returns:

  • High error variance: Returns are volatile (noise is large)
  • Limited sample size: Only 20-30 years of monthly data available
  • Result: Standard errors are relatively large, making significance hard to achieve

Intuition: If returns were perfectly predictable, error variance = 0, SE = 0. But returns are noisy, so SE > 0. Standard errors tell us how much uncertainty remains.

Show calculation: Standard error components
import numpy as np
import pandas as pd

# Simulate factor returns
np.random.seed(42)
n = 240  # 20 years monthly
true_alpha = 0.003  # 0.3% monthly true alpha
market = np.random.normal(0.008, 0.04, n)
factor = true_alpha + 0.2 * market + np.random.normal(0, 0.03, n)

# OLS regression
X = np.column_stack([np.ones(n), market])
beta_hat = np.linalg.lstsq(X, factor, rcond=None)[0]
residuals = factor - X @ beta_hat
sigma_sq = np.var(residuals, ddof=2)  # Error variance

# Standard error calculation
X_centered = market - market.mean()
sum_sq_X = np.sum(X_centered ** 2)
# Intercept SE includes a term for the non-zero mean of the regressor
se_alpha = np.sqrt(sigma_sq * (1 / n + market.mean() ** 2 / sum_sq_X))
se_beta = np.sqrt(sigma_sq / sum_sq_X)  # For slope

# Display components
components = pd.DataFrame({
    'Component': ['Sample size (n)', 'Error variance (σ²)', 'Sum of squares (X)', 
                  'SE(alpha)', 'SE(beta)'],
    'Value': [n, f'{sigma_sq:.6f}', f'{sum_sq_X:.4f}', 
              f'{se_alpha:.4f}', f'{se_beta:.4f}']
})

print("=== Standard Error Construction Components ===\n")
print(components.to_string(index=False))
print(f"\n📊 Interpretation:")
print(f"   Error variance = {sigma_sq:.6f} (high → noisy returns)")
print(f"   Sample size = {n} (limited → less precision)")
print(f"   SE(alpha) = {se_alpha:.4f} ({se_alpha*100:.2f}% monthly)")
print(f"   t-statistic = {beta_hat[0]/se_alpha:.2f}")
if abs(beta_hat[0]/se_alpha) > 1.96:
    print(f"   ✓ Alpha is statistically significant")
else:
    print(f"   ✗ Alpha is NOT statistically significant")

Why Financial Data Is Challenging

High error variance + limited sample size = large standard errors. This makes it hard to detect true factors (signal) when noise dominates. HAC standard errors additionally account for autocorrelation and heteroskedasticity, which typically makes them larger still.

Time-Series Data: Autocorrelation Problem

Financial time-series violate a key OLS assumption: independence of observations. If monthly returns are correlated, you don’t have 120 independent observations over 10 years: you have fewer “effective” observations.

Financial returns exhibit serial correlation:

  • Momentum: Positive returns predict future positive returns (6-12 months)
  • Volatility clustering: High volatility today predicts high volatility tomorrow
  • Market regimes: Bull and bear markets persist over time

Problem for inference:

  • Standard OLS assumes observations are independent (εₜ and εₜ₊₁ uncorrelated)
  • Autocorrelation breaks this assumption
  • Result: With positively autocorrelated errors, OLS understates standard errors → inflates t-statistics → false positives

Impact: HAC (Newey-West) standard errors typically 1.5-2× larger than OLS for monthly factors

A factor with OLS t = 2.5 might have HAC t = 1.8 (no longer significant). Always use HAC for time-series financial data.
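To see where the larger HAC standard error comes from, the Newey-West estimator for a simple sample mean can be written out by hand. This is a sketch, not the statsmodels implementation: the function name is hypothetical, the series is a simulated AR(1) process with an assumed persistence of 0.5, and the sample is deliberately long so that the ratio is stable.

```python
import numpy as np

def newey_west_se_of_mean(x, maxlags):
    """Newey-West (Bartlett kernel) standard error of a sample mean."""
    x = np.asarray(x, dtype=float)
    n = x.size
    u = x - x.mean()
    long_run_var = u @ u / n                      # lag-0 autocovariance
    for lag in range(1, maxlags + 1):
        weight = 1.0 - lag / (maxlags + 1)        # Bartlett weight
        gamma = u[lag:] @ u[:-lag] / n            # lag-k autocovariance
        long_run_var += 2.0 * weight * gamma
    return np.sqrt(long_run_var / n)

# Simulated AR(1) series: x_t = rho * x_{t-1} + shock_t (assumed rho = 0.5)
rng = np.random.default_rng(3)
n, rho = 2400, 0.5
shocks = rng.normal(0.0, 0.03, n)
x = np.empty(n)
x[0] = shocks[0]
for t in range(1, n):
    x[t] = rho * x[t - 1] + shocks[t]

naive_se = x.std(ddof=1) / np.sqrt(n)             # assumes independence
hac_se = newey_west_se_of_mean(x, maxlags=6)

print(f"Naive SE: {naive_se:.5f}")
print(f"HAC SE:   {hac_se:.5f}")
print(f"Ratio:    {hac_se / naive_se:.2f}x")
```

The positive autocovariances add to the long-run variance, so the HAC standard error exceeds the naive one. With negative autocorrelation the adjustment would go the other way.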

Detecting Autocorrelation: Bloomberg Data Evidence

Using real financial data, we can measure autocorrelation and test whether it’s statistically significant. This demonstrates why HAC corrections are essential.

Autocorrelation Function (ACF): Correlation between \(r_t\) and \(r_{t-k}\) for lags \(k = 1, 2, ...\)

Ljung-Box Test: Tests null hypothesis of no autocorrelation up to lag \(k\)

  • H₀: No autocorrelation (ρ₁ = ρ₂ = … = ρₖ = 0)
  • If p-value < 0.05 → reject H₀ → autocorrelation present → OLS SEs are wrong

Show analysis: Detecting autocorrelation in real returns
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.stats.diagnostic import acorr_ljungbox
from statsmodels.graphics.tsaplots import plot_acf

df = load_bloomberg()

# Analyze autocorrelation for SPY
spy_returns = df[df['ticker'] == 'SPY']['return'].dropna().values

# Create visualization
fig, axes = plt.subplots(1, 3, figsize=(12, 4))

# Plot 1: ACF for returns
ax1 = axes[0]
plot_acf(spy_returns, lags=20, ax=ax1, alpha=0.05, 
         title='SPY Returns: Autocorrelation Function')
ax1.set_xlabel('Lag (days)', fontsize=10)
ax1.set_ylabel('Autocorrelation', fontsize=10)

# Plot 2: ACF for squared returns (volatility clustering)
ax2 = axes[1]
plot_acf(spy_returns**2, lags=20, ax=ax2, alpha=0.05,
         title='SPY Squared Returns: Volatility Clustering')
ax2.set_xlabel('Lag (days)', fontsize=10)
ax2.set_ylabel('Autocorrelation', fontsize=10)

# Plot 3: Ljung-Box test results
ax3 = axes[2]
lb_returns = acorr_ljungbox(spy_returns, lags=range(1, 21), return_df=True)
lb_squared = acorr_ljungbox(spy_returns**2, lags=range(1, 21), return_df=True)

lags = range(1, 21)
ax3.plot(lags, lb_returns['lb_pvalue'], 'b-o', label='Returns', markersize=4)
ax3.plot(lags, lb_squared['lb_pvalue'], 'r-s', label='Squared Returns', markersize=4)
ax3.axhline(y=0.05, color='gray', linestyle='--', label='5% significance')
ax3.set_xlabel('Lag (days)', fontsize=10)
ax3.set_ylabel('p-value', fontsize=10)
ax3.set_title('Ljung-Box Test: p-values by Lag', fontsize=11, fontweight='bold')
ax3.legend(loc='upper right', fontsize=8)
ax3.set_ylim(0, 1)

plt.tight_layout()
plt.show()

# Print test results
print("=== Autocorrelation Analysis: SPY Daily Returns ===\n")

# First-order autocorrelation
from scipy.stats import pearsonr
if len(spy_returns) > 1:
    acf1, _ = pearsonr(spy_returns[:-1], spy_returns[1:])
    print(f"First-order autocorrelation (ρ₁): {acf1:.4f}")

# Ljung-Box at lag 10
lb_10 = acorr_ljungbox(spy_returns, lags=[10], return_df=True)
lb_10_sq = acorr_ljungbox(spy_returns**2, lags=[10], return_df=True)

print(f"\nLjung-Box Test (lag 10):")
print(f"  Returns:         Q = {lb_10['lb_stat'].values[0]:.2f}, p = {lb_10['lb_pvalue'].values[0]:.4f}")
print(f"  Squared Returns: Q = {lb_10_sq['lb_stat'].values[0]:.2f}, p = {lb_10_sq['lb_pvalue'].values[0]:.4f}")

print(f"\n📊 Interpretation:")
if lb_10['lb_pvalue'].values[0] < 0.05:
    print(f"   ✗ Returns show significant autocorrelation (p < 0.05)")
    print(f"     → OLS standard errors are biased downward")
else:
    print(f"   ✓ Returns show no significant autocorrelation (p ≥ 0.05)")
    print(f"     → OLS standard errors may be acceptable for returns")

if lb_10_sq['lb_pvalue'].values[0] < 0.05:
    print(f"   ✗ Squared returns show significant autocorrelation (p < 0.05)")
    print(f"     → Volatility clustering present → heteroskedasticity")
    print(f"     → OLS standard errors are still biased → use HAC!")

Key Finding

Even if return autocorrelation is weak, squared returns (volatility) typically show strong autocorrelation. This is volatility clustering (GARCH): high/low volatility periods persist. HAC corrections address both autocorrelation AND heteroskedasticity.

HAC vs OLS Standard Errors: Practical Impact

Let’s compare OLS and HAC standard errors using a CAPM regression on real Bloomberg data. The difference shows why HAC is essential.

Methodology: Regress asset returns on market (SPY) using both OLS and HAC standard errors

Show comparison: OLS vs HAC standard errors
import pandas as pd
import numpy as np
import statsmodels.api as sm

df = load_bloomberg()

# Get market (SPY) and asset (AAPL) returns
spy_data = df[df['ticker'] == 'SPY'].sort_values('date')
aapl_data = df[df['ticker'] == 'AAPL'].sort_values('date')

# Merge to align dates
merged = pd.merge(
    aapl_data[['date', 'return']],
    spy_data[['date', 'return']],
    on='date',
    suffixes=('_aapl', '_market')
).dropna()

y = merged['return_aapl'].values
X = sm.add_constant(merged['return_market'].values)

# OLS regression (standard errors assume i.i.d. errors)
model_ols = sm.OLS(y, X).fit()

# HAC regression (Newey-West standard errors, lag = 10)
model_hac = sm.OLS(y, X).fit(cov_type='HAC', cov_kwds={'maxlags': 10})

# Create comparison table
results = pd.DataFrame({
    'Parameter': ['Alpha (α)', 'Beta (β)'],
    'Estimate': [model_ols.params[0], model_ols.params[1]],
    'OLS SE': [model_ols.bse[0], model_ols.bse[1]],
    'HAC SE': [model_hac.bse[0], model_hac.bse[1]],
    'OLS t-stat': [model_ols.tvalues[0], model_ols.tvalues[1]],
    'HAC t-stat': [model_hac.tvalues[0], model_hac.tvalues[1]],
})

# Calculate SE inflation factor
results['SE Inflation'] = results['HAC SE'] / results['OLS SE']

print("=== OLS vs HAC Standard Errors: CAPM Regression ===\n")
print(f"Asset: AAPL regressed on Market (SPY)")
print(f"Sample: {len(y):,} daily observations\n")

print("Parameter Estimates and Standard Errors:")
print("-" * 75)
print(f"{'Parameter':<12} {'Estimate':>10} {'OLS SE':>10} {'HAC SE':>10} {'OLS t':>8} {'HAC t':>8} {'SE Ratio':>10}")
print("-" * 75)

for _, row in results.iterrows():
    print(f"{row['Parameter']:<12} {row['Estimate']:>10.5f} {row['OLS SE']:>10.5f} "
          f"{row['HAC SE']:>10.5f} {row['OLS t-stat']:>8.2f} {row['HAC t-stat']:>8.2f} "
          f"{row['SE Inflation']:>10.2f}x")

print("-" * 75)

# Interpretation
alpha_ols_sig = abs(model_ols.tvalues[0]) > 1.96
alpha_hac_sig = abs(model_hac.tvalues[0]) > 1.96

print(f"\n📊 Key Findings:")
print(f"   HAC SE / OLS SE ratio: {results['SE Inflation'].mean():.2f}x on average")
print(f"   Alpha significance:")
print(f"     OLS: |t| = {abs(model_ols.tvalues[0]):.2f} → {'Significant' if alpha_ols_sig else 'Not significant'} at 5%")
print(f"     HAC: |t| = {abs(model_hac.tvalues[0]):.2f} → {'Significant' if alpha_hac_sig else 'Not significant'} at 5%")

if alpha_ols_sig and not alpha_hac_sig:
    print(f"\n⚠️  CRITICAL: Alpha appears significant with OLS but NOT with HAC!")
    print(f"   This is a FALSE POSITIVE prevented by using HAC standard errors.")
elif alpha_ols_sig and alpha_hac_sig:
    print(f"\n✓  Alpha is significant with both OLS and HAC.")
    print(f"   But t-statistic is lower with HAC: more conservative inference.")
else:
    print(f"\n   Alpha not significant with either method.")
    print(f"   HAC gives more reliable inference regardless.")

print(f"\n💡 Lesson:")
print(f"   Always use HAC (cov_type='HAC') for time-series financial regressions.")
print(f"   OLS standard errors understate uncertainty → inflate t-statistics → false positives.")

Practical Implementation

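In statsmodels, sm.OLS(y, X).fit(cov_type='HAC', cov_kwds={'maxlags': 10}) gives Newey-West HAC standard errors. The coefficient estimates are unchanged; only the standard errors (and therefore the t-statistics) differ. As a rule of thumb, use maxlags=6 for monthly data and maxlags=20-30 for daily data.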

Alpha Tests: CAPM Regression

The CAPM alpha test decomposes factor returns into two components: market exposure (β) and excess return beyond market (α). Only alpha matters: beta just tells you market risk.

Capital Asset Pricing Model regression:

\[ R_{factor,t} = \alpha + \beta \cdot R_{market,t} + \varepsilon_t \]

Interpretation:

  • Alpha (α): Excess return not explained by market exposure (“skill” component)
  • Beta (β): Factor’s sensitivity to market movements
  • R²: Fraction of factor variance explained by market
  • Null hypothesis: α = 0 (no excess return beyond market)

Example: Momentum earns 1.2% monthly, beta = 0.2, market earns 0.8% monthly

→ CAPM predicts momentum return = α + 0.2 × 0.8% = α + 0.16%
→ Observed return 1.2%, so α = 1.04% monthly (if HAC t > 1.96, it’s significant)
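
The same arithmetic as a short script, in case you want to plug in your own numbers. The HAC standard error used at the end is a made-up value for illustration only; in practice it comes from your regression output.

```python
# Worked version of the momentum example (all figures in % per month)
factor_return = 1.2     # observed average momentum return
beta = 0.2              # estimated market beta
market_return = 0.8     # average market return

market_component = beta * market_return       # return explained by market exposure
alpha = factor_return - market_component      # return left over after market exposure

hac_se = 0.40                                 # HYPOTHETICAL HAC standard error of alpha
t_stat = alpha / hac_se                       # t-statistic for H0: alpha = 0

print(f"Market component: {market_component:.2f}% per month")
print(f"Alpha:            {alpha:.2f}% per month")
print(f"HAC t-statistic:  {t_stat:.2f} (using the hypothetical SE)")
```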

Robustness: Why One Test Isn’t Enough

A single significant result is weak evidence. Researchers have many degrees of freedom, which Gelman and Loken (2014) call the “garden of forking paths”: if you try enough specifications, one will appear significant by chance. Robustness checks guard against false discoveries.

Robustness checks test if results hold under alternative specifications:

  • Sample split: Does factor work in first half AND second half? (minimum check)
  • Subperiod analysis: Does alpha remain positive in each decade?
  • Alternative construction: Tertiles vs. quintiles, value-weighted vs. equal-weighted?
  • Cross-region: US factors often don’t replicate in Europe/Asia (Jensen, Kelly, and Pedersen 2024)
  • Transaction costs: Is net alpha positive after 0.2-0.5% monthly costs?

Ethical Econometrics

Don’t cherry-pick checks that passed. If a factor works in 2000-2010 but not in 2010-2020, report both. Selective reporting is a breach of research ethics. Transparent, complete disclosure is fundamental to responsible empirical practice and is rewarded in the 35% Critical Analysis component.

Part III : Selection Bias and the Replication Crisis

The Multiple Testing Problem

This is the core problem behind the replication crisis. With a 5% significance threshold, testing 100 hypotheses generates ~5 false positives on average even if all nulls are true.

Academic research process (the problem):

  1. Researcher tests 100 potential factors
  2. 95 don’t work (α ≈ 0, not significant)
  3. 5 appear significant (α > 0, t > 2) by chance (5% false positive rate)
  4. Researcher publishes the 5 “successful” factors
  5. Failed tests go in file drawer (never published)
  6. Journals prefer positive results; null results don’t advance careers

Result: Published literature massively overrepresents spurious findings

Harvey (2017) estimates that over 300 equity factors have been published, but only ~10-15 are genuinely robust (95% are questionable)
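
To see how quickly chance findings accumulate, the sketch below computes the expected number of false positives and the probability of at least one, assuming the tests are independent and every null is true (both simplifying assumptions; real factor tests are correlated).

```python
alpha_level = 0.05   # per-test significance threshold

def expected_false_positives(n_tests, alpha=alpha_level):
    """Expected number of tests that look significant when every null is true."""
    return n_tests * alpha

def prob_at_least_one(n_tests, alpha=alpha_level):
    """Probability that at least one of n independent true-null tests is significant."""
    return 1 - (1 - alpha) ** n_tests

for m in (1, 10, 50, 100):
    print(f"{m:>3} tests: expected false positives = {expected_false_positives(m):4.1f}, "
          f"P(at least one) = {prob_at_least_one(m):.1%}")
```

With just 10 tests, the chance of at least one spurious "discovery" is already about 40%.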

Simulation: The Multiple Testing Problem in Action

Setup: 1,000 researchers each test 10 factors. All factors are pure noise (true α = 0). At 5% significance level, how many “discoveries” emerge?

Show simulation: False discoveries from pure noise
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats

np.random.seed(42)

# Simulation parameters
n_researchers = 1000
factors_per_researcher = 10
n_months = 240  # 20 years of monthly data
significance_level = 0.05
true_alpha = 0.0  # ALL factors are noise (null is true)

# Simulate: each factor is pure noise
total_factors = n_researchers * factors_per_researcher
t_statistics = []
p_values = []

for i in range(total_factors):
    # Generate random factor returns (mean = 0, sd = 3% monthly)
    factor_returns = np.random.normal(true_alpha, 0.03, n_months)
    
    # One-sample t-test: is mean significantly different from 0?
    t_stat, p_val = stats.ttest_1samp(factor_returns, 0)
    t_statistics.append(t_stat)
    p_values.append(p_val)

t_statistics = np.array(t_statistics)
p_values = np.array(p_values)

# Count "significant" results (false discoveries)
significant_5pct = np.sum(p_values < 0.05)
significant_1pct = np.sum(p_values < 0.01)
significant_harvey = np.sum(np.abs(t_statistics) > 3.0)  # Harvey's threshold

# Create visualisation
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Plot 1: Distribution of t-statistics
ax1 = axes[0]
ax1.hist(t_statistics, bins=50, density=True, alpha=0.7, color='steelblue', edgecolor='white')

# Overlay theoretical t-distribution
x = np.linspace(-5, 5, 100)
ax1.plot(x, stats.t.pdf(x, df=n_months-1), 'r-', linewidth=2, label='Theoretical t-dist')

# Mark significance thresholds
ax1.axvline(x=1.96, color='orange', linestyle='--', linewidth=2, label='t = ±1.96 (5%)')
ax1.axvline(x=-1.96, color='orange', linestyle='--', linewidth=2)
ax1.axvline(x=3.0, color='red', linestyle='--', linewidth=2, label='t = ±3.0 (Harvey)')
ax1.axvline(x=-3.0, color='red', linestyle='--', linewidth=2)

ax1.set_xlabel('t-statistic', fontsize=11, fontweight='bold')
ax1.set_ylabel('Density', fontsize=11, fontweight='bold')
ax1.set_title('Distribution of t-statistics\n(All 10,000 factors are PURE NOISE)', fontsize=12, fontweight='bold')
ax1.legend(loc='upper right', fontsize=9)
ax1.set_xlim(-5, 5)

# Plot 2: False discovery counts
ax2 = axes[1]
categories = ['p < 0.05\n(Standard)', 'p < 0.01\n(Stricter)', '|t| > 3\n(Harvey)']
counts = [significant_5pct, significant_1pct, significant_harvey]
expected = [total_factors * 0.05, total_factors * 0.01, total_factors * 0.0027]  # 0.27% for |t|>3

x_pos = np.arange(len(categories))
width = 0.35

bars1 = ax2.bar(x_pos - width/2, counts, width, label='Observed False Discoveries', color='crimson', alpha=0.8)
bars2 = ax2.bar(x_pos + width/2, expected, width, label='Expected (Theory)', color='steelblue', alpha=0.8)

# Add count labels
for bar, count in zip(bars1, counts):
    ax2.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 10, 
             f'{int(count)}', ha='center', fontweight='bold', fontsize=11)

ax2.set_ylabel('Number of "Significant" Results', fontsize=11, fontweight='bold')
ax2.set_title('False Discoveries from Pure Noise\n(All 10,000 factors have TRUE α = 0)', fontsize=12, fontweight='bold')
ax2.set_xticks(x_pos)
ax2.set_xticklabels(categories, fontsize=10)
ax2.legend(loc='upper right', fontsize=9)
ax2.set_ylim(0, max(counts) * 1.2)

plt.tight_layout()
plt.show()

# Print summary
print("=" * 70)
print("MULTIPLE TESTING SIMULATION: THE REPLICATION CRISIS IN ACTION")
print("=" * 70)
print(f"\nSetup:")
print(f"  Researchers:           {n_researchers:,}")
print(f"  Factors per researcher: {factors_per_researcher}")
print(f"  Total factors tested:   {total_factors:,}")
print(f"  True alpha (all factors): {true_alpha} (ALL ARE NOISE)")
print(f"  Sample size per factor:  {n_months} months")

print(f"\nFalse Discoveries (Type I Errors):")
print(f"  At p < 0.05: {significant_5pct:,} factors appear significant ({significant_5pct/total_factors*100:.1f}%)")
print(f"  At p < 0.01: {significant_1pct:,} factors appear significant ({significant_1pct/total_factors*100:.1f}%)")
print(f"  At |t| > 3:  {significant_harvey:,} factors appear significant ({significant_harvey/total_factors*100:.2f}%)")

print(f"\n📊 Key Insight:")
print(f"  {significant_5pct} papers could be published claiming 'significant alpha'")
print(f"  ALL {significant_5pct} are FALSE DISCOVERIES (true α = 0)")
print(f"  Using t > 3 (Harvey's threshold) reduces false discoveries to {significant_harvey}")
print(f"\n⚠️  This is why the replication crisis exists!")
print(f"  Published literature is full of noise masquerading as signal.")

The Sobering Reality

From 10,000 pure noise factors, approximately 500 will appear significant at the 5% level. If only these are published, the literature looks like 500 “discoveries”, but 100% are false positives. Harvey’s t > 3 threshold reduces this to ~27.

Guarding Against Selection Bias

Selection bias is hard to eliminate but can be mitigated with rigorous practices. These separate good research from bad.

Best practices in research:

  • Pre-registration: Specify hypothesis before seeing data (medical trials standard)
  • Out-of-sample testing: Test on data not available when factor was published
  • Cross-region replication: Factors should work globally if they’re real
  • Multiple testing corrections: Use t > 3 threshold (Harvey 2017) instead of t > 2
  • Economic theory: Value/momentum have theoretical foundations; “vowel tickers outperform” doesn’t
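
The t > 3 rule is one way to correct for multiple testing. For comparison, the sketch below applies two textbook corrections, Bonferroni and Benjamini-Hochberg, to pure-noise factors. The function names are hypothetical, and the p-values use a normal approximation, which is an assumption that is reasonable for 240 monthly observations.

```python
import math
import numpy as np

def two_sided_p(t):
    """Two-sided p-value via the normal approximation (fine for large samples)."""
    return math.erfc(abs(t) / math.sqrt(2))

def bonferroni(p_values, alpha=0.05):
    """Reject only if p < alpha / (number of tests)."""
    p = np.asarray(p_values)
    return p < alpha / p.size

def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure (controls the false discovery rate)."""
    p = np.asarray(p_values)
    m = p.size
    order = np.argsort(p)                          # indices from smallest to largest p
    thresholds = alpha * np.arange(1, m + 1) / m   # rank-dependent cut-offs
    passed = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if passed.any():
        k = np.max(np.nonzero(passed)[0])          # largest rank meeting its cut-off
        reject[order[:k + 1]] = True               # reject everything up to that rank
    return reject

rng = np.random.default_rng(42)
n_factors, n_months = 1000, 240
returns = rng.normal(0.0, 0.03, size=(n_factors, n_months))   # pure noise, true alpha = 0
t_stats = returns.mean(axis=1) / (returns.std(axis=1, ddof=1) / np.sqrt(n_months))
p_vals = np.array([two_sided_p(t) for t in t_stats])

naive = p_vals < 0.05
print(f"Naive p < 0.05:     {naive.sum()} 'discoveries'")
print(f"|t| > 3:            {(np.abs(t_stats) > 3).sum()}")
print(f"Bonferroni:         {bonferroni(p_vals).sum()}")
print(f"Benjamini-Hochberg: {benjamini_hochberg(p_vals).sum()}")
```

Bonferroni is the strictest of the three because its cut-off shrinks with the number of tests, whereas a fixed t > 3 hurdle does not adapt to how many factors were actually tried.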

For Coursework 2: Intellectual Honesty Earns Marks

  • If you tried 3 factors, disclose that (don’t pretend you tested only 1)
  • Report robustness failures, not just successes
  • If alpha has t = 2.1, acknowledge it’s marginal, not “strong evidence”
  • The 35% Critical Analysis component explicitly rewards honest limitation discussion

Part IV : Critical Analysis: What Makes Interpretation Rigorous?

Beyond Reporting Numbers: Ask Questions

Weak analysis (reporting):

“Value factor earns 0.5% monthly alpha with t = 2.3 (significant). Sharpe ratio is 0.4. Results are robust to sample split.”

Strong analysis (interpretation):

“Value earns 0.5% monthly alpha (6% annualised). This is economically meaningful but modest. Statistical significance (t = 2.3) suggests it’s not pure luck, but it is close to the threshold. Sample split shows alpha is stable (0.6% first half, 0.4% second half), increasing confidence. However, transaction costs (~0.2% monthly for value rebalancing) would reduce net alpha to 0.3% (3.6% annualised). Is 3.6% net alpha sufficient to compensate for tracking error and implementation frictions? The original paper reported 8% annualised: our replication shows alpha 25% lower before costs (6% vs 8%) and 55% lower after costs (3.6% vs 8%), consistent with the post-publication decline documented by Jensen et al. (2024).”

Interpreting Your Factor Results

When you analyse your factor’s performance, interrogate your conclusions:

  • Statistical vs economic significance: A t-stat of 2.1 clears the 1.96 threshold, but how confident would you be investing real money on that evidence?
  • Scale matters: What does 0.1% monthly alpha actually mean for an investor over a year? Is that worth pursuing?
  • Benchmarking performance: If the market delivers a Sharpe ratio around 0.4, what should you conclude about a factor with Sharpe of 0.3?
  • Robustness integrity: If some of your robustness tests pass and others fail, what story does that tell about your factor?
  • From paper to portfolio: What happens between calculating returns on a spreadsheet and actually implementing a trading strategy?

The Implementation Gap

Academic factor returns assume frictionless trading. Real portfolios face transaction costs, market impact, and timing constraints. How might these affect your conclusions?
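
As a starting point for that question, the sketch below converts a gross monthly alpha into net annual figures. The function name is hypothetical, the inputs are example values rather than results from any paper, and the annualisation is a simple ×12 that ignores compounding.

```python
def net_alpha_summary(gross_monthly_pct, cost_monthly_pct, portfolio_value):
    """Convert a gross monthly alpha (%) into net and annualised figures.

    Uses simple x12 annualisation (no compounding) and a flat monthly cost,
    both simplifying assumptions.
    """
    net_monthly = gross_monthly_pct - cost_monthly_pct
    return {
        "gross_annual_pct": gross_monthly_pct * 12,
        "net_monthly_pct": net_monthly,
        "net_annual_pct": net_monthly * 12,
        "net_annual_value": portfolio_value * net_monthly * 12 / 100,
    }

# Example inputs: 0.5% gross monthly alpha, 0.2% monthly costs, £10m portfolio
summary = net_alpha_summary(0.5, 0.2, 10_000_000)

print(f"Gross alpha: {summary['gross_annual_pct']:.1f}% per year")
print(f"Net alpha:   {summary['net_monthly_pct']:.2f}% per month "
      f"= {summary['net_annual_pct']:.1f}% per year")
print(f"On £10m:     £{summary['net_annual_value']:,.0f} per year")
```

Try rerunning it with costs at 0.5% per month: the net alpha for this example goes to zero, which is the implementation gap in one line.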

Part V : Preparation for Coursework 2: Principles, Not Templates

What the Scaffold Provides vs. What You Must Provide

The scaffold notebook is deliberately comprehensive: we want you to focus on understanding and interpretation, not debugging code. The 35% Critical Analysis component is where marks are won or lost.

Scaffold provides (execution):

  • Working code for data loading, alpha regression, robustness checks
  • All necessary functions pre-written (HAC standard errors, sample splits)
  • Publication-quality tables and figures ready for your report

You must provide (interpretation grounded in YOUR results):

  • Numerical engagement: “My alpha is X bp/month (t = Y). The original paper found Z bp. This N% difference likely reflects…”
  • Specific robustness narrative: Which tests passed? Which failed? What does that specific pattern tell you?
  • Your judgment, defended: Would you invest £10,000 of your own money in this factor? Why or why not, given YOUR numbers?
  • Process reflection: What did you expect to find? What surprised you? What would you do differently?

Generic explanations of “why HAC matters” or “what limitations exist” won’t earn marks. Examiners want to see you grapple with your specific results.

Strategic focus: Spend 1-2 hours on code, 8-10 hours making sense of what YOUR numbers mean

Questions to Ask About YOUR Results

When you have your output, interrogate it. These questions connect today’s principles to YOUR specific analysis.

About your methodology:

  • Did your robustness tests pass or fail? What does that specific pattern suggest?
  • How does your sample period compare to the original paper’s? Does that explain any differences?

About your statistics:

  • Is your t-stat comfortably above 2, or hovering near the threshold? What’s the practical difference?
  • Your alpha is X bp monthly. What does that mean for a £10m portfolio over a year?

About your judgment:

  • Given YOUR numbers, would you recommend this factor to a pension fund? Why or why not?
  • If your results are weaker than the original paper, is that replication failure, or exactly what you’d expect?

Using AI Tools Appropriately

AI tools like ChatGPT and Copilot are permitted, but how you use them determines whether they help or hurt your work.

AI can help you:

  • Understand concepts: “Explain HAC standard errors in simple terms”
  • Debug code: “Why is this pandas merge failing?”
  • Learn techniques: “Show me how to calculate Newey-West standard errors”

AI cannot help you:

  • Interpret YOUR specific results: It doesn’t know your alpha is 0.28% with t = 1.9
  • Explain YOUR robustness pattern: It can’t see that your early-sample passed but late-sample failed
  • Defend YOUR recommendation: Generic “factors can be useful” isn’t a position

The Specificity Test

If your critical analysis section could have been written without ever looking at your actual output, that’s a problem, regardless of whether AI wrote it or you did.

AI Use: What Helps vs. What Hurts

Appropriate use (helps your learning):

| Task | Example prompt | Why it’s fine |
| --- | --- | --- |
| Concept clarification | “Explain why autocorrelation inflates t-statistics” | Builds understanding |
| Code debugging | “Why does my HAC calculation give NaN?” | Technical problem-solving |
| Writing feedback | “Is this paragraph clear?” | Improves communication |

Problematic use (hurts your marks):

| Task | Example prompt | Why it fails |
| --- | --- | --- |
| Generic interpretation | “Write a limitations section for a factor study” | Not specific to YOUR results |
| Boilerplate analysis | “Explain what alpha significance means” | Could apply to ANY study |
| Wholesale generation | “Write my critical analysis section” | No engagement with YOUR numbers |

The test: Would an examiner reading your analysis know which specific factor you studied and what YOUR results showed?

Academic Integrity: Detection and Verification

To maintain fairness for all students, I have developed a multi-model GenAI detection architecture that analyses submission patterns across multiple dimensions.

What this means:

  • All coursework submissions are processed through this system
  • The system flags submissions with characteristics suggesting over-reliance on AI-generated content
  • Flags are reviewed by me personally: the system assists, it doesn’t decide

Oral Examination Rights

I reserve the right to orally examine any student whose submission is flagged by this system. You may be asked to explain your analysis, walk through your reasoning, and demonstrate understanding of your own work.

This is not about catching you out: it’s about ensuring your degree means something. Students who genuinely engage with their analysis have nothing to worry about.

Demonstration: HAC Standard Errors in Practice

Show code: OLS vs HAC comparison
import pandas as pd
import numpy as np
import statsmodels.api as sm
from statsmodels.regression.linear_model import OLS
from statsmodels.stats.sandwich_covariance import cov_hac

# Simulate factor returns with autocorrelation (illustrative)
np.random.seed(42)
n = 240  # 20 years monthly
market = np.random.normal(0.008, 0.04, n)
# Factor with positive alpha and autocorrelation
factor = 0.005 + 0.3 * market + np.random.normal(0, 0.03, n)
for i in range(1, n):
    factor[i] += 0.3 * factor[i-1]  # Autocorrelation

# Regression
X = sm.add_constant(market)
model_ols = OLS(factor, X).fit()

# HAC standard errors (Newey-West with 6 lags)
cov_hac_matrix = cov_hac(model_ols, nlags=6)
se_hac = np.sqrt(np.diag(cov_hac_matrix))
t_hac = model_ols.params / se_hac

# Comparison table
comparison = pd.DataFrame({
    'Coefficient': model_ols.params,
    'OLS SE': model_ols.bse,
    'HAC SE': se_hac,
    'OLS t-stat': model_ols.tvalues,
    'HAC t-stat': t_hac
}, index=['Alpha', 'Beta'])

print("=== OLS vs HAC Standard Errors ===\n")
print(comparison.round(4))
print(f"\n📊 Interpretation:")
print(f"   Alpha = {model_ols.params[0]*100:.2f}% monthly ({model_ols.params[0]*12*100:.1f}% annualised)")
print(f"   OLS: t = {model_ols.tvalues[0]:.2f} (significant at 5% if |t| > 1.96)")
print(f"   HAC: t = {t_hac[0]:.2f} (adjusts for autocorrelation)")
print(f"   HAC standard error is {se_hac[0]/model_ols.bse[0]:.1f}x larger than OLS")
if abs(model_ols.tvalues[0]) > 1.96 and abs(t_hac[0]) < 1.96:
    print(f"   ⚠️  Result is significant with OLS but NOT with HAC!")
    print(f"       Using OLS would lead to false positive. HAC prevents this.")

What this demonstration shows:

  • HAC standard errors are typically 1.3-2× larger than OLS for autocorrelated data
  • t-statistics drop correspondingly (OLS t = 2.3 might become HAC t = 1.7)
  • Results “significant” with OLS can become insignificant with HAC
  • This is why HAC is required for honest time-series inference
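
Where does a range like 1.3-2× come from? For the mean of a series that follows an AR(1) process with coefficient ρ, the large-sample standard error is inflated by √((1+ρ)/(1−ρ)) relative to the naive formula. The sketch below evaluates that ratio; treating regression residuals as AR(1) is a simplifying assumption, so read it as a rough guide rather than an exact prediction of your HAC output.

```python
import numpy as np

def ar1_se_inflation(rho):
    """Approximate ratio of the autocorrelation-robust SE to the naive SE
    for the mean of an AR(1) series with coefficient rho (large-sample result)."""
    return np.sqrt((1 + rho) / (1 - rho))

for rho in (0.0, 0.1, 0.3, 0.6):
    print(f"rho = {rho:.1f} -> SE inflation of about {ar1_se_inflation(rho):.2f}x")
```

An autocorrelation of 0.3, as used in the demonstration, gives a ratio of roughly 1.36, and 0.6 gives 2.0.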

Coursework Requirement

Reporting conventional OLS standard errors for time-series factor data loses marks for incorrect methodology. Always use HAC standard errors. The scaffold implements this automatically: you just need to understand why it matters for interpretation.

Next Steps: Week 11 Preview

Week 11 complements today by covering the prediction pathway (Coursework 2 Option B). Same principle-focused approach: understanding concepts, not copying templates.

Week 11 focus: Market prediction using factors

  • Predict next-month market return using lagged factor data
  • Compare OLS vs regularised models (ridge regression handles multicollinearity)
  • Walk-forward validation for honest out-of-sample testing (prevents look-ahead bias)
  • Evaluate predictive power: R² OOS, directional accuracy, economic value

Same pedagogical philosophy:

  • Principles and understanding over step-by-step instructions
  • Critical interpretation over mechanical execution
  • Preparation for 35% Critical Analysis component

Connection: Replication (today) tests if factors exist. Prediction (next week) tests if factors forecast returns. Both require rigorous methodology and critical interpretation.

Summary: Week 10 Key Takeaways

Today provided foundational principles for factor replication. The core message: understanding principles enables critical analysis. Scaffold gives outputs; understanding gives interpretation.

Methodology:

  • Factor replication tests if published findings are real (out-of-sample, robustness required)
  • Jensen et al. (2024): Many factors show 50% decline in performance due to selection bias

Statistical foundations:

  • HAC standard errors correct for autocorrelation (typically 1.3-2× larger than OLS)
  • Alpha isolates excess returns beyond market exposure
  • Robustness checks (sample split, subperiod, costs) separate signal from noise

Critical analysis earns marks:

  • Interpret economically: 0.5% monthly = 6% annualised, but is it meaningful after costs?
  • Acknowledge uncertainty: t = 2.1 is marginally significant, not “strong evidence”
  • Honest limitations discussion: what didn’t you test? Selection bias? Post-publication decline?

Start coursework early: read Jensen et al. (2024), run scaffold, focus on interpretation

Banz, Rolf W. 1981. “The Relationship Between Return and Market Value of Common Stocks.” Journal of Financial Economics 9 (1): 3–18. https://doi.org/10.1016/0304-405X(81)90018-0.
Fama, Eugene F., and Kenneth R. French. 1992. “The Cross-Section of Expected Stock Returns.” Journal of Finance 47 (2): 427–65. https://doi.org/10.1111/j.1540-6261.1992.tb04398.x.
———. 2015. “A Five-Factor Asset Pricing Model.” Journal of Financial Economics 116 (1): 1–22. https://doi.org/10.1016/j.jfineco.2014.10.010.
Gelman, Andrew, and Eric Loken. 2014. “The Statistical Crisis in Science.” American Scientist 102 (6): 460–65. https://doi.org/10.1511/2014.111.460.
Harvey, Campbell R. 2017. “Presidential Address: The Scientific Outlook in Financial Economics.” Journal of Finance 72 (4): 1399–1440. https://doi.org/10.1111/jofi.12530.
Harvey, Campbell R., Yan Liu, and Heqing Zhu. 2020. “False (and Missed) Discoveries in Financial Economics.” Journal of Finance 75 (5): 2503–53. https://doi.org/10.1111/jofi.12960.
Jegadeesh, Narasimhan, and Sheridan Titman. 1993. “Returns to Buying Winners and Selling Losers: Implications for Stock Market Efficiency.” Journal of Finance 48 (1): 65–91. https://doi.org/10.1111/j.1540-6261.1993.tb04702.x.
Jensen, Theis I., Bryan T. Kelly, and Lasse Heje Pedersen. 2024. “Is There a Replication Crisis in Finance?” Journal of Finance. https://doi.org/10.1111/jofi.13249.
Novy-Marx, Robert. 2013. “The Other Side of Value: The Gross Profitability Premium.” Journal of Financial Economics 108 (1): 1–28. https://doi.org/10.1016/j.jfineco.2013.01.003.