Week 10: Factor Replication : Principles & Critical Analysis

Learning Objectives

Explain factor replication as a research methodology for testing published findings
Interpret HAC (Heteroskedasticity and Autocorrelation Consistent) standard errors and understand why time-series autocorrelation matters
Evaluate factor performance using multiple metrics (Sharpe, alpha, robustness)
Identify sources of selection bias and overfitting in factor research
Apply critical thinking to assess whether documented factors are exploitable

Opening frame (90 seconds)

Welcome to Week 10. Over the next two weeks, we pivot toward preparing you for Coursework 2, but not by giving you step-by-step instructions. The scaffold notebook already provides working code. These weeks are about understanding principles so you can use that code intelligently and write the critical analysis that makes up 35% of your mark.

Big question: When you read that “value stocks outperform growth stocks by 5% annually,” how do you know if that’s real or just data mining? Factor replication is the methodology we use to test published findings with intellectual honesty.

Connect to earlier weeks: Remember the bias-variance tradeoff from Week 1? Factor research is plagued by overfitting: researchers mine hundreds of characteristics, publish the ones that worked in-sample, and ignore out-of-sample failure. Jensen, Kelly & Pedersen (2024) document this systematically.

What we’re NOT doing today: We’re not providing templates for Coursework 2. We’re not walking through submission-ready tables. We’re teaching you how to think about factor replication so you can interpret results, identify limitations, and write critical analysis.

Learning objectives roadmap: By the end, you’ll understand why HAC standard errors matter, what alpha tests reveal, how to interpret robustness checks, and what questions to ask when evaluating factor research.

Assessment relevance: Coursework 2 marking emphasises Critical Analysis & Originality (35%): interpretation, robustness, honest discussion of limits. Today prepares you for that component by teaching principles, not execution.

Engagement: “How many of you have read finance papers claiming ‘this strategy beats the market’? How many worked when you tried them?” (Probably few. That’s the replication crisis.)

Transition: “Let’s start with what factor replication actually means.”

Agenda

Part I : What is factor replication? Research methodology foundations
Part II : Statistical foundations: HAC errors, alpha tests, robustness
Part III : Selection bias and the replication crisis in finance
Part IV : Critical analysis: What makes interpretation rigorous?
Part V : Preparation for Coursework 2: Principles, not templates

Part I : Factor Replication as Research Methodology

What Are Factors?

Factors are the building blocks of modern quantitative investing. Rather than picking individual stocks, factor strategies systematically buy characteristics that historically generate excess returns.

Definition: Characteristics that explain cross-sectional variation in stock returns

Classic examples:

Value (HML): High Minus Low book-to-market : buy undervalued stocks, sell overvalued (Fama and French 1992)
Momentum (MOM): Buy past 6-12 month winners, sell losers (behavioural persistence) (Jegadeesh and Titman 1993)
Size (SMB): Small Minus Big : small-cap premium (though weakening post-publication) (Banz 1981; Fama and French 1992)
Quality (RMW): Robust Minus Weak profitability : sustainable competitive advantages (Novy-Marx 2013; Fama and French 2015)

Foundation concepts (120 seconds)

Factors are the building blocks of modern asset pricing. Rather than picking individual stocks, factor investing involves buying characteristics that historically generate excess returns. This is the intellectual foundation of quantitative investing and robo-advisory portfolio construction.

Value (HML): High Minus Low book-to-market ratio. Buy stocks trading below book value (cheap), sell stocks trading above book value (expensive). Fama & French documented this in 1992. Theory: cheap stocks are undervalued or riskier; either way, they should earn premium returns.

Momentum (MOM): Buy stocks that rose over the past 6-12 months, sell stocks that fell. Jegadeesh & Titman (1993) documented this. Theory: either behavioural (investors underreact to news, creating drift) or risk-based (momentum stocks are riskier in crashes).

Size (SMB): Small Minus Big. Buy small-cap stocks, sell large-cap stocks. Documented by Banz (1981), popularised by Fama & French (1992). Theory: small firms are riskier (less diversified, less liquid), so they earn premium. But evidence has weakened post-publication.

Quality (RMW): Robust Minus Weak profitability. Buy firms with high operating profitability, sell low profitability. Novy-Marx (2013), Fama & French (2015). Theory: profitable firms have sustainable competitive advantages.

Long-short construction: Factors are zero-investment portfolios. You don’t need capital: buy £100 of value stocks, short £100 of growth stocks, net investment = £0. Returns represent pure factor exposure, isolated from market movements. This is how academics test whether characteristics matter.

Why factors matter: If factors are real (risk-based or behavioural), you can harvest them systematically. If factors are spurious (data mining), they won’t persist out-of-sample. Factor replication tests which is true.

Student engagement: “Have you heard claims like ‘buy small-cap stocks, they outperform’? That’s a factor claim. Today we learn how to evaluate if it’s real.”

Transition: “Factors are characteristics. But how are they constructed? Let’s understand long-short portfolios.”

Long-Short Construction: Zero-Investment Portfolios

Factors are constructed as long-short portfolios: simultaneously buying one group and selling another. This isolates factor exposure from market movements.

Mechanics:

Long leg: Buy stocks with desired characteristic (e.g., high book-to-market = value)
Short leg: Sell stocks with opposite characteristic (e.g., low book-to-market = growth)
Equal weights: Long and short legs have equal dollar amounts
Net investment: £0 (long purchases offset short sales)

Example: Value Factor (HML)

Component	Valuation	Action	Cash flow	Stock return	Position P&L
Long	Undervalued (high B/M)	Buy £100 value stocks	-£100	+5%	+£5
Short	Overvalued (low B/M)	Sell £100 growth stocks	+£100	-2%	+£2
Net	Market-neutral	Long-short portfolio	£0	5% - (-2%) = 7%	+£7

Factor return = Value return - Growth return = 5% - (-2%) = 7%

Show calculation: Long-short factor return

import pandas as pd
import numpy as np

# Simulated example: Value factor construction
np.random.seed(42)

# Long leg: Value stocks (high book-to-market)
value_return = 0.05  # 5% return

# Short leg: Growth stocks (low book-to-market)
# Note: a short position earns the negative of the growth-stock return
growth_return = -0.02  # Growth stocks returned -2%
short_return = -growth_return  # Short position earns +2%

# Factor return = Value return - Growth return = 5% - (-2%) = 7%
factor_return = value_return - growth_return

# Create visualization table
construction = pd.DataFrame({
    'Component': ['Long (Value)', 'Short (Growth)', 'Factor (HML)'],
    'Valuation': ['Undervalued (high B/M)', 'Overvalued (low B/M)', 'Market-neutral'],
    'Action': ['Buy £100 value', 'Sell £100 growth', 'Net portfolio'],
    'Investment': ['-£100', '+£100', '£0'],
    'Return (%)': [value_return*100, short_return*100, factor_return*100],
    'Dollar P&L': ['+£5', '+£2', '+£7']
})

print("=== Long-Short Factor Construction Example ===\n")
print(construction.to_string(index=False))
print(f"\n📊 Key Insight:")
print(f"   Factor return ({factor_return*100:.1f}%) = Value return ({value_return*100:.1f}%) - Growth return ({growth_return*100:.1f}%)")
print(f"   Net investment = £0 (long purchase funded by short sale proceeds)")
print(f"   Factor isolates value premium, independent of market movements")

Why Zero Investment?

No capital required means returns represent pure factor exposure, not market risk. If market rises 10%, both legs move together: factor return isolates the difference.

Long-short construction explained (150 seconds)

This is a crucial concept that students often struggle with. Long-short portfolios are the foundation of factor investing, but the mechanics aren’t always intuitive. Let’s break it down step-by-step.

What “long” means: Buying stocks. You invest capital, own the stocks, earn returns if prices rise. Standard investing.

What “short” means: Selling stocks you don’t own. You borrow stocks, sell them immediately, then buy them back later. If prices fall, you profit (buy back cheaper). If prices rise, you lose (must buy back at higher price). Shorting requires margin (collateral), but conceptually you’re betting against the stock.

Long-short combination: Buy £100 of value stocks, short £100 of growth stocks. Net investment = £0 (long purchase uses proceeds from short sale). This is a zero-investment portfolio: you don’t need capital to implement it (in theory; in practice, margin requirements exist).

Why equal weights? Long and short legs have equal dollar amounts (£100 each). This ensures the factor return isolates the characteristic difference, not market exposure. If both legs are £100 and have similar market sensitivity, market movements affect both equally, so the factor return captures only the difference between value and growth performance.

Return calculation: Factor return = Long return - Short return. If value stocks return 5% and growth stocks return -2%, factor return = 5% - (-2%) = 7%. This 7% represents pure value premium, isolated from market movements.

Market neutrality: If market rises 10%, both value and growth stocks likely rise. But if value rises 12% and growth rises 8%, factor return = 12% - 8% = 4%. Factor return captures relative performance, not absolute market performance. This is why factors have low beta (market exposure): they’re hedged.
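The market-neutrality arithmetic above can be sketched in a few lines. This is an illustration only: the 10%, 12% and 8% figures come from the text, while the leg-specific returns and the assumption that both legs have a market beta of exactly 1 are hypothetical choices made to reproduce those figures.

```python
# Illustrative numbers only: each leg's return = market move + leg-specific return.
# Assumption (not from the lecture): both legs have a market beta of exactly 1.
market_move = 0.10        # market rises 10%
value_specific = 0.02     # hypothetical value-specific return
growth_specific = -0.02   # hypothetical growth-specific return

value_return = market_move + value_specific     # 12%
growth_return = market_move + growth_specific   # 8%

# Long value, short growth in equal amounts: the common market move cancels.
factor_return = value_return - growth_return

print(f"Value leg:     {value_return:.1%}")
print(f"Growth leg:    {growth_return:.1%}")
print(f"Factor return: {factor_return:.1%}")

# Repeat for different market moves: the factor return does not change.
for move in (-0.20, 0.0, 0.10):
    long_leg = move + value_specific
    short_leg = move + growth_specific
    print(f"market {move:+.0%} -> factor {long_leg - short_leg:+.1%}")
```

The cancellation is exact only because both legs are given the same market sensitivity. If the legs have different betas, some market exposure remains in the factor return, which is one reason the alpha test regresses the factor on the market.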

Valuation explanation: Value stocks are undervalued: they trade at low prices relative to book value (high book-to-market ratio). Growth stocks are overvalued: they trade at high prices relative to book value (low book-to-market ratio). The value factor bets that undervalued stocks will outperform overvalued stocks as markets correct mispricing.

Visual example: Use the table on the slide. Walk through each row: Valuation (undervalued vs overvalued), Long action (buy value), Short action (sell growth), Net investment (£0), Returns (5% vs -2%), Factor return (7%). Make it concrete: value stocks are “cheap” relative to fundamentals, growth stocks are “expensive.”

Why this matters: Academic factor research uses long-short construction to test whether characteristics matter independently of market risk. If value factor earns positive returns, it means value stocks outperform growth stocks even after controlling for market movements. This is what alpha tests measure.

Practical implementation: In reality, shorting requires margin (collateral), borrowing costs, and transaction costs. But conceptually, long-short construction isolates factor exposure. For coursework, you’ll use pre-constructed factor returns from JKP: they’ve already done the long-short construction.

Student confusion points: (1) “How can you invest £0?” Answer: Long purchase uses proceeds from short sale. (2) “What if both legs lose money?” Answer: Factor return can be negative: if value underperforms growth, factor return is negative. (3) “Why not just buy value stocks?” Answer: That would include market risk. Long-short isolates factor risk.

Connection to coursework: JKP provides pre-constructed factor returns. You don’t need to build long-short portfolios yourself. But understanding the construction helps you interpret what factor returns mean: they’re relative performance, not absolute returns.

Transition: “Long-short construction isolates factor exposure. Now let’s discuss what replication means: testing whether these factors are real.”

Factor Replication: What Does It Mean?

Replication is core scientific practice. In medicine, we demand multiple trials before approving drugs. In finance, if a factor works 1970-1990, we demand it works 1991-2020.

Replication = reproduce published findings using independent data or time periods

Why replicate?

Test whether published results are real or data mining artifacts
Assess out-of-sample performance (does it work on new data?)
Evaluate economic significance (after costs, is there exploitable profit?)
Understand robustness (does it work across markets, time periods, specifications?)

Jensen, Kelly & Pedersen (2024): “Is There a Replication Crisis in Finance?”

Tested 153 published factors using consistent methodology
Many factors show 50% decline in out-of-sample performance
Cross-region replication often fails (US factors don’t work in Europe/Asia)
Conclusion: Published literature significantly overstates factor performance

Replication methodology (150 seconds)

Replication is a core scientific practice. In medicine, if a drug works in one trial, we demand it works in other trials before approval. In finance, if a factor works 1970-1990, we demand it works 1991-2020. Replication tests robustness and guards against false discoveries.

What replication tests: (1) Out-of-sample performance: does the factor work on data not available when the paper was published? (2) Cross-region robustness: if documented in US, does it work in Europe, Asia? (3) Specification robustness: if you change breakpoints (e.g., sort stocks into thirds vs. fifths), do results hold? (4) Economic significance: after transaction costs, taxes, and implementation frictions, is there still exploitable profit?

Why replication matters: Academic publishing has a selection bias problem. Researchers test hundreds of characteristics, publish the ones that “work,” and file-drawer the failures. This creates false discovery rates far above the nominal 5%. Harvey, Liu & Zhu (2016) catalogue over 300 published factors and argue that most are likely spurious.
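A minimal simulation sketch of the selection problem described above. Every simulated "factor" has a true mean of zero, so any significant result is a false discovery. The factor count (300), sample length (240 months) and 3% monthly volatility are illustrative assumptions, not estimates from any dataset.

```python
import numpy as np

rng = np.random.default_rng(3)
n_factors, n_months = 300, 240   # illustrative sizes, not real data

# Every simulated "factor" has a true mean return of zero.
returns = rng.normal(0.0, 0.03, size=(n_factors, n_months))

# Simple t-statistic of the mean for each factor (iid standard errors).
means = returns.mean(axis=1)
std_errors = returns.std(axis=1, ddof=1) / np.sqrt(n_months)
t_stats = means / std_errors

n_significant = int((np.abs(t_stats) > 1.96).sum())
best = float(np.abs(t_stats).max())

print(f"Factors with |t| > 1.96: {n_significant} of {n_factors}")
print(f"Largest |t| among pure-noise factors: {best:.2f}")
```

If only the factors that clear the significance threshold are written up, the published record consists entirely of noise that happened to look like signal. This is why a t-statistic near 2 is weak evidence when many characteristics have been tried.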

Jensen, Kelly & Pedersen (2024): This is the definitive replication study in finance. They collect 153 published equity factors (value, momentum, profitability, investment, etc.), implement them consistently using the same data and methodology, and test out-of-sample performance. Key findings: (1) Many factors show weaker out-of-sample performance than in-sample. (2) Cross-region replication often fails: US factors don’t work in Europe/Asia. (3) Transaction costs eliminate profits for many factors. (4) Selection bias is pervasive: the published literature overstates true factor performance.

JKP data portal: https://jkpfactors.com provides factor returns for 153 published factors across multiple regions and time periods. This is the dataset you’ll use for Coursework 2. It’s the gold standard for factor replication research.

Replication crisis: This isn’t unique to finance. Psychology, medicine, economics all face replication crises. The lesson: published findings are hypotheses, not facts. Replication tests whether they generalise.

Student reflection: “If you read a paper claiming ‘this strategy earns 10% alpha,’ should you invest your money based on that paper alone?” (No! Replicate it first. Test out-of-sample. Check transaction costs. Most published strategies fail these tests.)

Assessment connection: Coursework 2 asks you to replicate a factor. The 35% Critical Analysis component rewards you for discussing limitations, robustness, and whether results are exploitable. This requires understanding replication methodology, not just running code.

Transition: “Replication is the methodology. Now let’s examine the statistical tools that make replication rigorous.”

Factor Replication Workflow

This is a conceptual framework, not a mechanical recipe. The scaffold notebook implements these steps, but understanding why each matters separates a pass from a distinction.

Conceptual steps:

Choose factor: Select published factor with theoretical motivation
Obtain data: Download returns from JKP portal (https://jkpfactors.com)
Descriptive analysis: Mean, volatility, Sharpe ratio, cumulative returns
Alpha test: Regress factor on market using HAC standard errors
Robustness checks: Sample splits, subperiod analysis, cost adjustments
Interpretation: Is factor real? Exploitable after costs? What are limitations?

Each step requires judgment: what robustness checks matter depends on your specific factor
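As a sketch of the descriptive-analysis step, the block below computes the mean, volatility, annualised Sharpe ratio and cumulative return for a monthly factor series. The returns are simulated (0.4% mean, 3% volatility are arbitrary assumptions), and the variable names are illustrative. This is not the scaffold notebook's code.

```python
import numpy as np
import pandas as pd

# Simulated monthly factor returns (NOT real JKP data).
rng = np.random.default_rng(0)
months = pd.period_range("2000-01", periods=240, freq="M")
factor = pd.Series(rng.normal(0.004, 0.03, size=240), index=months, name="factor")

mean_monthly = factor.mean()
vol_monthly = factor.std()                 # sample standard deviation (ddof=1)
# Long-short factor returns are already excess returns, so no risk-free rate is subtracted.
sharpe_annual = mean_monthly / vol_monthly * np.sqrt(12)
cumulative = (1 + factor).cumprod() - 1    # compounded cumulative return

print(f"Mean (monthly):        {mean_monthly:.2%}")
print(f"Volatility (monthly):  {vol_monthly:.2%}")
print(f"Sharpe (annualised):   {sharpe_annual:.2f}")
print(f"Cumulative return:     {cumulative.iloc[-1]:.1%}")
```

With real data, the simulated series would be replaced by the factor return column from the downloaded CSV. The calculations themselves do not change.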

Workflow as thinking process (120 seconds)

This is a conceptual workflow, not a recipe. The scaffold notebook implements these steps, but understanding why each step matters is what separates a pass from a distinction.

Step 1: Choose factor: Don’t pick randomly. Choose a factor with theoretical motivation (why should it work?) and empirical documentation (published in reputable journal). For Coursework 2, value (HML) and momentum (MOM) are classic choices because they’re well-documented and theoretically motivated. But you could choose quality, size, investment, or others from JKP.

Step 2: Obtain data: JKP portal provides pre-constructed factor returns. You register (free), select region (e.g., Global, USA, Europe), select frequency (monthly), and download the CSV. Alternative: construct factors yourself from raw stock data (requires WRDS or similar), which is more work but demonstrates deeper understanding. For coursework, using JKP is fine.

Step 3: Descriptive analysis: Calculate summary statistics: mean return, standard deviation, and Sharpe ratio (mean / std, multiplied by √12 to annualise monthly data; long-short factor returns are already excess returns, so no risk-free rate is subtracted). Plot cumulative returns to visualise performance over time. This answers: “Does the factor earn positive returns on average? Is it volatile? How does it compare to market?”

Step 4: Alpha test: This is the core test. Regress factor returns on market returns (CAPM). Alpha (intercept) measures excess return not explained by market exposure. If alpha is positive and statistically significant, factor earns returns beyond market risk. If alpha is zero, factor is just levered market exposure. Use HAC standard errors (Newey-West) because returns are autocorrelated. We’ll discuss HAC in depth shortly.

Step 5: Robustness checks: Split sample into two periods: does factor work in both? Change factor construction (e.g., sort into thirds instead of quintiles): do results hold? Test in different regions: does US factor work in Europe? Robustness separates real patterns from lucky noise.

Step 6: Interpretation: This is where critical thinking matters. Don’t just report “alpha = 0.5%, t = 2.1.” Ask: (1) Is 0.5% monthly economically meaningful after transaction costs? (2) Is the factor stable across sub-periods, or does it work only in some decades? (3) What’s the theoretical explanation: risk or mispricing? (4) Are there implementation frictions (liquidity, short-selling constraints)? (5) Has factor performance declined post-publication (as arbitrageurs exploit it)? These questions comprise the 35% Critical Analysis component.

Judgment, not checklist: The workflow is a guide, not a recipe. Every factor requires different robustness checks. Momentum might need tests during crises (momentum crashes in 2009). Value might need tests during tech bubbles (value underperformed 1995-2000). Quality might need profitability definitions compared. Think about what matters for your factor.
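One of the simplest robustness checks, the sample split, might look like the sketch below. The factor is simulated so that its mean is deliberately weaker in the second half (0.6% vs 0.1% monthly, both arbitrary assumptions); `mean_and_tstat` is a hypothetical helper that uses plain iid standard errors for brevity.

```python
import numpy as np

def mean_and_tstat(returns):
    """Mean and simple t-statistic (iid standard error, no HAC adjustment)."""
    returns = np.asarray(returns, dtype=float)
    std_error = returns.std(ddof=1) / np.sqrt(len(returns))
    return returns.mean(), returns.mean() / std_error

# Simulated factor whose premium decays in the second half (NOT real data).
rng = np.random.default_rng(2)
factor = np.concatenate([
    rng.normal(0.006, 0.03, size=120),   # first 10 years
    rng.normal(0.001, 0.03, size=120),   # last 10 years
])

midpoint = len(factor) // 2
samples = [
    ("Full sample", factor),
    ("First half", factor[:midpoint]),
    ("Second half", factor[midpoint:]),
]
for label, sample in samples:
    mean, t_stat = mean_and_tstat(sample)
    print(f"{label:<12} mean = {mean:.2%}  t = {t_stat:.2f}")
```

A full-sample result that is driven by one sub-period is a prompt for interpretation, not a failure: the report should ask what changed between the two halves.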

Student task: “If you replicate momentum and find it earns 1% monthly alpha with t = 3, what questions would you ask before concluding ‘momentum is exploitable’?” (Transaction costs? Post-publication decline? Crash risk? Implementation constraints?)

Assessment: The scaffold handles steps 1-5 mechanically. Your report’s interpretation (step 6) determines whether you earn 60% or 75%.

Transition: “The workflow is conceptual. Now let’s examine the statistical foundations that make replication rigorous, starting with HAC standard errors.”

Part II : Statistical Foundations for Rigorous Replication

Signal and Noise in Financial Returns

Financial returns are inherently noisy. Even if a factor has true alpha, observed returns mix signal (predictable component) with noise (random variation). Standard errors help us distinguish signal from noise.

The challenge:

Signal: True factor alpha (e.g., value stocks genuinely outperform)
Noise: Random variation (luck, market shocks, measurement error)
Observed return = Signal + Noise

Why this matters:

With 20 years of monthly data (240 observations), noise can create spurious patterns
A factor might appear significant just by chance (noise masquerading as signal)
Standard errors quantify how much noise contaminates our signal estimate

Example: True alpha = 0% (no factor), but observed alpha = 0.5% monthly
→ Is this signal (real factor) or noise (lucky sample)?

The Fundamental Problem

Financial returns have a low signal-to-noise ratio. In real Bloomberg data (2018-2025), SPY's daily signal-to-noise ratio (|mean| / std) is 0.042, so noise is about 24× larger than signal. On this mean-based measure, only 0.2% of return variation is signal; 99.8% is noise. This makes statistical inference challenging.

Signal and noise foundations (120 seconds)

This slide sets up why standard errors matter in finance. Financial returns are fundamentally noisy: even if a factor has true alpha, observed returns mix signal (the predictable component) with noise (random variation). Standard errors help us distinguish between these.

The signal-noise problem: In finance, signal-to-noise ratios are low. Monthly market returns have mean ≈ 0.8% and volatility ≈ 4%. This means returns are mostly noise: signal is small relative to variation. This makes statistical inference challenging: it’s hard to detect true patterns when noise dominates.

Why this matters for factor replication: Even if a factor has zero true alpha (no signal), you might observe positive alpha in-sample just by luck. With 20 years of monthly data (240 observations), random variation can create spurious patterns that look like real factors. Standard errors quantify this uncertainty: they tell you how much noise contaminates your signal estimate.

Example: Suppose true alpha is zero (no factor exists). But in your sample, you observe alpha = 0.5% monthly. Is this signal (real factor) or noise (lucky sample)? Standard errors answer this: if SE = 0.2%, t = 2.5 (significant). If SE = 0.5%, t = 1.0 (not significant). Same observed alpha, different conclusions based on uncertainty.

Student engagement: “If I flip a coin 10 times and get 7 heads, is the coin biased?” (Maybe, but could be luck. Need more flips to be sure. Same logic: need to account for noise in financial returns.)
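The coin-flip question has an exact answer under the null of a fair coin, sketched below with the binomial distribution. `prob_at_least` is a hypothetical helper name introduced for this example.

```python
from math import comb

def prob_at_least(k, n, p=0.5):
    """Probability of at least k successes in n independent trials."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k, n + 1))

p_small_sample = prob_at_least(7, 10)     # 7 or more heads in 10 flips
p_large_sample = prob_at_least(70, 100)   # same 70% share, ten times the flips

print(f"P(>= 7 heads in 10 fair flips)    = {p_small_sample:.3f}")
print(f"P(>= 70 heads in 100 fair flips)  = {p_large_sample:.6f}")
```

Seven or more heads in ten flips happens about 17% of the time with a fair coin, so it is not evidence of bias. The same 70% share over 100 flips would be far harder to attribute to luck, which is the sample-size logic behind standard errors.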

Connection to factor research: Many published factors might be noise masquerading as signal. Standard errors help us identify which factors are real (signal) vs. spurious (noise). This is why rigorous statistical inference is essential: it prevents false discoveries.

Transition: “Signal and noise are mixed in observed returns. How do we quantify this? Let’s understand the methodology first.”

Measuring Signal-to-Noise: Methodology

To quantify signal vs noise, we decompose return variance into predictable (signal) and unpredictable (noise) components using conditional expectations.

Econometrically Rigorous Construction:

For returns \(r_t\), we define signal as the component explained by the conditioning information, here the contemporaneous market return:

Conditional Expectation Model: \(E[r_t | \text{market}_t] = \alpha + \beta \cdot \text{market}_t\)
- Explained component based on market exposure (CAPM)
- Varies with the market return, unlike a constant unconditional mean
Variance Decomposition:
- Total Variance: \(\text{Var}(r_t) = \sigma^2\)
- Signal Variance: \(\text{Var}(E[r_t | \text{market}_t])\) (variance of conditional expectation)
- Noise Variance: \(\text{Var}(r_t - E[r_t | \text{market}_t])\) (variance of residuals)
- Signal Fraction: \(R^2\) from the regression (proportion of variance explained)
- Noise Fraction: \(1 - R^2\) (proportion unexplained)

Why This Is More Rigorous:

Uses conditional expectation \(E[r_t | \text{market}_t]\) rather than unconditional mean
Captures the component explained by market exposure; a richer conditioning set (lagged returns, other factors) could also capture autocorrelation and time-varying expected returns
Signal = variance explained by information set; Noise = residual variance
Aligns with econometric theory: signal is what the conditioning information explains

Implementation via CAPM Regression:

Regress asset returns on market: \(r_t = \alpha + \beta \cdot r_{m,t} + \varepsilon_t\)

Signal = \(\hat{\alpha} + \hat{\beta} \cdot r_{m,t}\) (predicted returns)
Noise = \(\hat{\varepsilon}_t\) (residuals)
Signal Fraction = \(R^2\) (variance explained by market)
Noise Fraction = \(1 - R^2\) (unexplained variance)

Example:
- SPY regressed on itself (market proxy): \(R^2 \approx 1.0\) (SPY explains itself perfectly)
- For individual stocks: \(R^2 \approx 0.3-0.6\) (30-60% signal, 40-70% noise)
- For factors: \(R^2\) typically lower (most variance is idiosyncratic noise)

Show calculation: Econometrically rigorous signal-to-noise

import numpy as np
import pandas as pd
import statsmodels.api as sm

df = load_bloomberg()
# Get SPY and market (SPY as market proxy)
spy_data = df[df['ticker'] == 'SPY'].sort_values('date')
market = spy_data['return'].values
asset = spy_data['return'].values  # SPY regressed on itself for demonstration
dates = spy_data['date'].values

# Econometrically rigorous approach: CAPM regression
X = sm.add_constant(market)
model = sm.OLS(asset, X).fit()

# Signal = predicted returns (conditional expectation)
predicted = model.fittedvalues
signal_var = np.var(predicted)

# Noise = residuals (unpredictable component)
residuals = model.resid
noise_var = np.var(residuals)

# Total variance
total_var = np.var(asset)

# Signal fraction = R² (variance explained by model)
signal_fraction = model.rsquared
noise_fraction = 1 - model.rsquared

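# Signal-to-noise ratio (using conditional expectation)
signal_mean = np.abs(predicted.mean())
noise_std = np.std(residuals)
# Regressing SPY on itself leaves residuals that are zero up to floating-point
# error, so treat a numerically zero noise_std as "no noise" instead of dividing by it
signal_noise_ratio = signal_mean / noise_std if not np.isclose(noise_std, 0.0) else np.nan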

print("=== Econometrically Rigorous Signal-to-Noise ===\n")
print("Method: CAPM Regression (Conditional Expectation)")
print(f"Model: r_t = α + β × market_t + ε_t\n")

print(f"Regression Results:")
print(f"  R² (Signal Fraction):  {signal_fraction:.4f} ({signal_fraction:.1%})")
print(f"  1 - R² (Noise Fraction): {noise_fraction:.4f} ({noise_fraction:.1%})")
print(f"  α (intercept):         {model.params[0]*100:.4f}% daily")
print(f"  β (market exposure):    {model.params[1]:.4f}")

print(f"\nVariance Decomposition:")
print(f"  Total variance:        {total_var*10000:.4f} (%², i.e. variance × 10,000)")
print(f"  Signal variance:       {signal_var*10000:.4f} (Var(E[r|I]))")
print(f"  Noise variance:        {noise_var*10000:.4f} (Var(ε))")
print(f"  Signal fraction:       {signal_fraction:.1%}")
print(f"  Noise fraction:        {noise_fraction:.1%}")

print(f"\nSignal-to-Noise Ratio:")
print(f"  Signal mean:           {signal_mean*100:.4f}% daily")
print(f"  Noise std:             {noise_std*100:.4f}% daily")
print(f"  Signal/Noise ratio:    {signal_noise_ratio:.4f}")

print(f"\n💡 Econometric Interpretation:")
print(f"   Signal = predictable component E[r_t | market_t]")
print(f"   Noise = residual ε_t (unpredictable given market)")
print(f"   {signal_fraction:.1%} of variance is explained by market exposure")
print(f"   {noise_fraction:.1%} is idiosyncratic noise")
print(f"   This is more rigorous than unconditional mean approach!")

Econometric Rigor

This approach uses the conditional expectation \(E[r_t | \text{market}_t]\) rather than the unconditional mean. Signal is what the conditioning information explains. Here that information is the market return alone; adding other factors or lagged returns would extend the same logic to factor exposures, autocorrelation, and time-varying expected returns.

Why This Matters

For individual assets, \(R^2\) from CAPM is typically 30-60% (signal), meaning 40-70% is noise. For factors themselves (long-short portfolios), \(R^2\) is often lower because most variance is idiosyncratic. This quantifies why detecting true factors is challenging: even with market information, much variance remains unpredictable.

Signal-to-noise methodology (150 seconds)

This slide explains HOW we construct signal-to-noise metrics. Students need to understand the methodology before seeing results; otherwise the numbers are meaningless.

Mathematical foundation: We decompose return variation into signal (predictable) and noise (unpredictable) components. The slide does this with a CAPM regression, where the signal fraction is R². The simplest benchmark, used for the SPY numbers below, treats the constant mean return as the signal and deviations from the mean as noise.

Signal-to-noise ratio: Simple ratio of mean to standard deviation. If mean = 0.05% and std = 1.23%, ratio = 0.041. This means the signal is only 4.1% of the size of the noise: noise dominates.

Variance decomposition: With a constant mean, the mean-square return splits as E[r²] = μ² + σ². Signal component = μ² (squared mean). Noise component = σ² (variance around the mean). Signal fraction = μ²/(μ² + σ²), which is approximately μ²/σ² when μ is small relative to σ; noise fraction = 1 - signal fraction.

Why μ² is the signal component: A constant mean has zero variance, so this is a split of the second moment E[r²], not of Var(r). If returns were perfectly predictable at μ, every squared return would equal μ² and σ² would be zero. The gap between E[r²] and μ² is therefore the noise.

Example calculation: Walk through SPY example step-by-step. Mean = 0.05%, std = 1.23% (daily). Signal-to-noise = 0.05 / 1.23 ≈ 0.041. Signal component = (0.05%)² = 0.0025 %². Noise component = (1.23%)² ≈ 1.51 %². Signal fraction ≈ 0.0025 / 1.51 ≈ 0.002 = 0.2%. Noise fraction ≈ 99.8%.

Student engagement: “If signal fraction is 0.2%, what does that mean for factor replication?” (It means 99.8% of variance is noise: very hard to detect true factors. Need large samples and proper standard errors.)

Connection to standard errors: Signal-to-noise ratio directly relates to t-statistics. Low signal-to-noise means large standard errors relative to mean, making significance hard to achieve.
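The link between signal-to-noise and t-statistics can be made concrete: for iid returns, the t-statistic of the sample mean is (mean / std) × √T. The sketch below uses the 0.041 daily ratio from the SPY example; the helper names and the 252 trading days per year are assumptions for illustration, and real returns are not iid, which is why HAC standard errors are needed in practice.

```python
import math

def t_stat_of_mean(signal_to_noise, n_obs):
    """t-statistic of a sample mean under iid returns: (mean / std) * sqrt(T)."""
    return signal_to_noise * math.sqrt(n_obs)

def obs_needed(signal_to_noise, target_t=2.0):
    """Smallest number of observations for which the t-statistic reaches target_t."""
    return math.ceil((target_t / signal_to_noise) ** 2)

daily_ratio = 0.041        # mean / std from the SPY example
days_per_year = 252        # assumed number of trading days per year

for years in (1, 5, 10, 20):
    n_obs = days_per_year * years
    print(f"{years:>2} years -> t = {t_stat_of_mean(daily_ratio, n_obs):.2f}")

needed = obs_needed(daily_ratio)
print(f"Observations needed for t = 2: {needed} (about {needed / days_per_year:.1f} years)")
```

Even for the equity market itself, close to a decade of daily data is needed before the mean return is distinguishable from zero at conventional levels. Factor premia are typically smaller, so the sample-length problem is worse.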

Transition: “Now that we understand the methodology, let’s apply it to real Bloomberg data.”

Real-World Signal-to-Noise: Bloomberg Data

Using real financial data (2018-2025) from Bloomberg Terminal, we apply the econometrically rigorous approach: signal = predictable component from CAPM regression.

Method: Regress each asset on market (SPY) to extract conditional expectation \(E[r_t | \text{market}_t]\)

Signal Fraction = R² from CAPM (variance explained by market exposure)

Show visualization: Professional signal-to-noise analysis

from pathlib import Path

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

_csv_path = data_root / "bloomberg_database" / "signal_noise_metrics.csv"
if not _csv_path.exists():
    _csv_path = Path("data/bloomberg_database/signal_noise_metrics.csv")
metrics = pd.read_csv(_csv_path)

# Select example assets
example_assets = ['SPY', 'AAPL', 'VIX', 'BTCUSD']
display_metrics = metrics[metrics['asset'].isin(example_assets)].copy()

# Create professional visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: Signal-to-Noise Ratio Comparison
ax1 = axes[0]
colors = {'Equity': '#2E86AB', 'Risk Gauge': '#F24236', 'Crypto': '#06A77D'}
for asset_type in display_metrics['asset_type'].unique():
    subset = display_metrics[display_metrics['asset_type'] == asset_type]
    ax1.scatter(subset['signal_noise_ratio'], subset['sharpe_ratio'], 
               label=asset_type, alpha=0.8, s=150, color=colors.get(asset_type, 'gray'),
               edgecolors='white', linewidth=2)

# Add asset labels
for _, row in display_metrics.iterrows():
    ax1.annotate(row['asset'], 
                (row['signal_noise_ratio'], row['sharpe_ratio']),
                xytext=(5, 5), textcoords='offset points', fontsize=9, fontweight='bold')

ax1.set_xlabel('Signal-to-Noise Ratio (|Mean| / Std)', fontsize=11, fontweight='bold')
ax1.set_ylabel('Sharpe Ratio (Annual)', fontsize=11, fontweight='bold')
ax1.set_title('Signal-to-Noise vs Risk-Adjusted Returns', fontsize=12, fontweight='bold', pad=10)
ax1.legend(loc='upper left', frameon=True, fancybox=True, shadow=True)
ax1.grid(True, alpha=0.3, linestyle='--')

# Plot 2: Signal vs Noise Fraction (Stacked Bar Chart)
ax2 = axes[1]
assets = display_metrics['asset'].values
signal_fracs = display_metrics['signal_fraction'].values * 100
noise_fracs = display_metrics['noise_fraction'].values * 100

x_pos = np.arange(len(assets))
width = 0.6

bars1 = ax2.barh(x_pos, signal_fracs, width, label='Signal', color='#2E86AB', alpha=0.8)
bars2 = ax2.barh(x_pos, noise_fracs, width, left=signal_fracs, label='Noise', 
                 color='#F24236', alpha=0.8)

# Add percentage labels
for i, (s, n) in enumerate(zip(signal_fracs, noise_fracs)):
    if s > 0.1:  # Only label if signal is visible
        ax2.text(s/2, i, f'{s:.2f}%', ha='center', va='center', fontweight='bold', 
                fontsize=8, color='white')
    ax2.text(s + n/2, i, f'{n:.1f}%', ha='center', va='center', fontweight='bold',
            fontsize=9, color='white')

ax2.set_yticks(x_pos)
ax2.set_yticklabels(assets, fontweight='bold')
ax2.set_xlabel('Variance Fraction (%)', fontsize=11, fontweight='bold')
ax2.set_title('Signal vs Noise Decomposition', fontsize=12, fontweight='bold', pad=10)
ax2.legend(loc='lower right', frameon=True, fancybox=True, shadow=True)
ax2.grid(True, alpha=0.3, axis='x', linestyle='--')
ax2.set_xlim(0, 100)

plt.tight_layout()
plt.show()

Show calculation: Real-world signal-to-noise from Bloomberg data

import pandas as pd
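import numpy as np
from pathlib import Path  # needed for the fallback path below; data_root is assumed to come from the notebook setup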

df = load_bloomberg()
_csv_path = data_root / "bloomberg_database" / "signal_noise_metrics.csv"
if not _csv_path.exists():
    _csv_path = Path("data/bloomberg_database/signal_noise_metrics.csv")
metrics = pd.read_csv(_csv_path)

# Select example assets and recalculate using CAPM approach
example_assets = ['SPY', 'AAPL', 'VIX', 'BTCUSD']

# Recalculate metrics using CAPM regression (econometrically rigorous)
import statsmodels.api as sm

# Get market returns (SPY)
market_data = df[df['ticker'] == 'SPY'].sort_values('date')
market_returns = market_data['return'].values

recalculated_metrics = []
for asset_name in example_assets:
    asset_data = df[df['ticker'] == asset_name].sort_values('date')
    if len(asset_data) < 30:
        continue
    
    # Merge to align dates
    merged = pd.merge(
        asset_data[['date', 'return']],
        market_data[['date', 'return']],
        on='date',
        suffixes=('_asset', '_market')
    )
    
    if len(merged) < 30:
        continue
    
    asset_ret = merged['return_asset'].values
    market_ret = merged['return_market'].values
    
    # CAPM regression
    X = sm.add_constant(market_ret)
    model = sm.OLS(asset_ret, X).fit()
    
    # Signal fraction = R² (variance explained by market)
    signal_fraction = model.rsquared
    noise_fraction = 1 - model.rsquared
    
    # Signal-to-noise ratio
    predicted = model.fittedvalues
    residuals = model.resid
    signal_mean = np.abs(predicted.mean())
    noise_std = np.std(residuals)
    signal_noise_ratio = signal_mean / noise_std if noise_std > 0 else np.nan
    
    recalculated_metrics.append({
        'asset': asset_name,
        'asset_type': asset_data['asset_type'].iloc[0] if 'asset_type' in asset_data.columns else 'Unknown',
        'signal_fraction': signal_fraction,
        'noise_fraction': noise_fraction,
        'signal_noise_ratio': signal_noise_ratio,
        'r_squared': signal_fraction,
        'beta': model.params[1] if len(model.params) > 1 else np.nan
    })

display_metrics = pd.DataFrame(recalculated_metrics)

# Display metrics table
print("\n=== Real-World Signal-to-Noise Metrics (CAPM-based) ===\n")
if 'r_squared' in display_metrics.columns:
    print(display_metrics[['asset', 'asset_type', 'r_squared', 
                          'signal_fraction', 'noise_fraction', 'beta']].round(4).to_string(index=False))
else:
    print(display_metrics[['asset', 'asset_type', 'signal_fraction', 
                          'noise_fraction']].round(4).to_string(index=False))

print("\n📊 Interpretation (CAPM-based):")
print("   - Signal fraction = R² from CAPM regression")
print("   - SPY: R² ≈ 1.0 (perfectly explained by itself)")
print("   - Individual stocks: R² ≈ 0.5-0.6 (50-60% explained by market)")
print("   - Factors/Crypto: R² ≈ 0.1-0.2 (10-20% explained, rest is noise)")
print("   - This captures predictable component conditional on market information!")

# Show AAPL example in detail (more interesting than SPY)
aapl_data = df[df['ticker'] == 'AAPL'].sort_values('date')
merged = pd.merge(aapl_data[['date', 'return']], market_data[['date', 'return']],
                 on='date', suffixes=('_aapl', '_market'))

X = sm.add_constant(merged['return_market'].values)
model = sm.OLS(merged['return_aapl'].values, X).fit()

print(f"\n🔍 AAPL Example (CAPM Regression):")
print(f"   R² (Signal Fraction): {model.rsquared:.1%}")
print(f"   1-R² (Noise Fraction): {1-model.rsquared:.1%}")
print(f"   β (Market Exposure): {model.params[1]:.2f}")
print(f"   α (Intercept): {model.params[0]*100:.4f}% daily")
print(f"   → {model.rsquared:.1%} of variance is predictable from market")
print(f"   → {1-model.rsquared:.1%} is idiosyncratic noise")

The Reality Check

Using conditional expectation (CAPM), signal fractions vary dramatically by asset type. Individual equities: 50-60% signal (market exposure explains most variance). Factors and crypto: 10-20% signal (80-90% noise). This approach measures the share of variance explained by the same period’s market return, which is much more informative than the unconditional mean. Note that “predictable” here means explained given the contemporaneous market return, not forecastable in advance.

Real-world signal-to-noise demonstration (150 seconds)

This slide uses actual Bloomberg Terminal data to demonstrate signal-to-noise ratios in real financial markets. This makes the abstract concept concrete and shows students that the problem isn’t theoretical: it’s the reality of financial data.

Data source: Bloomberg database with 9 assets (5 equities, 3 risk gauges, 1 crypto) from 2018-2025. This is professional-grade data, not simulated.

Key findings using CAPM approach: Signal fractions vary dramatically by asset type. SPY regressed on itself: R² ≈ 1.0 (100% signal: perfect prediction). Individual stocks (AAPL): R² ≈ 0.5-0.6 (50-60% signal from market exposure, 40-50% noise). Factors and crypto: R² ≈ 0.1-0.2 (10-20% signal, 80-90% noise).

AAPL example: Regressed on market (SPY), R² ≈ 0.55. This means 55% of variance is predictable from market exposure (signal), 45% is idiosyncratic noise. This is more informative than unconditional mean approach: it captures what’s predictable given market information.

Why this matters: This quantifies why factor replication is so challenging. Even if a factor has true alpha, noise dominates. Standard errors help us determine whether observed alpha is signal (real factor) or noise (lucky sample).

Student engagement: “Look at these numbers. If 80-90% of a factor’s variance is noise, how confident can you be that a factor with t = 2.0 is real?” (Not very confident. Need t > 3, or better yet, out-of-sample validation.)

Connection to coursework: When students replicate factors in Coursework 2, they’ll see similar patterns. Understanding signal-to-noise helps them interpret results correctly: a small t-statistic does not prove the factor premium is zero; it means the estimate cannot be distinguished from zero given how large the noise is.

Transition: “Real data confirms: noise dominates. Standard errors quantify this uncertainty. But how are they constructed?”

Why Standard Errors Matter

Standard errors quantify uncertainty in estimates. In factor replication, we’re testing whether observed alpha is signal (true factor premium) or noise (random variation).

Connection to signal-to-noise analysis:

Individual stocks (AAPL): R² ≈ 0.55 → 45% of variance is residual noise
Factors (long-short portfolios): R² ≈ 0.1-0.2 → 80-90% of variance is residual noise
Implication: SE(alpha) scales with residual volatility, so for a given total volatility a higher noise fraction means a less precise alpha estimate

Statistical significance = “Is observed alpha signal or noise?”

t-statistic = Alpha / Standard Error

|t| > 1.96 → statistically significant at 5% level (conventional threshold)
|t| < 1.96 → cannot reject null hypothesis (could be random chance)
Harvey (2017) recommends t > 3 for finance (multiple testing correction)

Why factors need higher t-statistics: With 80-90% noise fraction, standard errors are large. Need t > 3 to confidently distinguish signal from noise.

Common Mistake

Alpha = 1% monthly with t = 0.5 is not significant (likely noise). Alpha = 0.3% monthly with t = 3 is significant (likely signal). The t-statistic matters more than the magnitude: especially for factors with high noise fraction.

Statistical inference foundations (120 seconds)

Standard errors quantify uncertainty in estimates. They are calculated from the variance-covariance matrix of the estimator (depending on error variance and sample size), but their frequentist interpretation is as the standard deviation of the sampling distribution: how much estimates would vary across hypothetical repeated samples from the same population. This gives us both a measure of estimation precision and a basis for hypothesis testing.

Why it matters in factor replication: From our signal-to-noise analysis, factors have 80-90% noise fraction. This means factor alpha estimates have large standard errors. Even if true alpha is zero, you might observe positive alpha in-sample just by luck (noise). Statistical significance tests whether observed alpha is larger than we’d expect from pure noise.

Connection to CAPM findings: Individual stocks (R² ≈ 0.55) have a smaller noise fraction than factors (R² ≈ 0.1-0.2). For a given total volatility, a larger noise fraction means a larger standard error on alpha. Combined with the multiple testing problem, this is why factor replication calls for higher t-statistics.

t-statistic interpretation: Divide estimate by its standard error. If |t| > 1.96, a result at least this extreme would occur in fewer than 5% of samples if the null hypothesis of no effect were true. (This is not the same as a 5% probability that the result is due to chance.) This is the conventional significance threshold in finance. Some fields use |t| > 2.5 or higher due to multiple testing concerns (Harvey (2017) recommends t > 3 for finance).

Common mistake: Ignoring standard errors and focusing only on point estimates. Alpha = 1% with t = 0.5 is not significant (could be luck). Alpha = 0.3% with t = 3 is significant (unlikely to be luck). The t-statistic matters more than the magnitude.

Confidence intervals: 95% confidence interval = estimate ± 1.96 × SE. This gives range of plausible values. If alpha = 0.5% with SE = 0.2%, CI = [0.1%, 0.9%]. This tells you alpha is positive but uncertain in magnitude.
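The t-statistic and confidence-interval arithmetic above can be checked in a few lines. This is a minimal sketch: the helper name `summarize_estimate` is hypothetical, and the standard errors for the two "common mistake" cases (2.0% and 0.1%) are back-solved from the quoted t-statistics, not taken from data.

```python
def summarize_estimate(alpha, se, z=1.96):
    """Return (t-statistic, lower CI bound, upper CI bound) for an estimate."""
    t_stat = alpha / se
    return t_stat, alpha - z * se, alpha + z * se

# (label, alpha in % per month, SE in % per month).
# SEs for the first two cases are implied by the t-statistics in the text.
cases = [
    ("Large alpha, noisy", 1.0, 2.0),
    ("Small alpha, precise", 0.3, 0.1),
    ("CI example", 0.5, 0.2),
]

for label, alpha, se in cases:
    t_stat, lower, upper = summarize_estimate(alpha, se)
    verdict = "significant" if abs(t_stat) > 1.96 else "not significant"
    print(f"{label:<22} t = {t_stat:4.2f}  "
          f"95% CI = [{lower:5.2f}%, {upper:5.2f}%]  {verdict}")
```

The first case has the largest point estimate but an interval that comfortably includes zero, which is the point of the "common mistake" warning.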

Type I vs. Type II errors: Type I = false positive (conclude factor works when it doesn’t). Type II = false negative (conclude factor doesn’t work when it does). Significance tests control Type I error at 5%. But with many factors tested, even 5% false positive rate generates many spurious findings (multiple testing problem).
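The multiple testing problem can be illustrated with a small simulation. This is a sketch under stated assumptions: 1,000 hypothetical candidate factors, each with a true mean return of exactly zero, 240 monthly observations, and 3% monthly volatility. None of these numbers come from real data.

```python
import numpy as np

rng = np.random.default_rng(0)
n_factors, n_months = 1000, 240   # hypothetical: 1,000 candidate factors, 20 years monthly
monthly_vol = 0.03                # assumed 3% monthly volatility

# Every simulated factor has a TRUE mean of zero,
# so any "significant" result is a false positive by construction.
returns = rng.normal(0.0, monthly_vol, size=(n_factors, n_months))
means = returns.mean(axis=1)
ses = returns.std(axis=1, ddof=1) / np.sqrt(n_months)
t_stats = means / ses

false_pos_196 = int(np.sum(np.abs(t_stats) > 1.96))
false_pos_300 = int(np.sum(np.abs(t_stats) > 3.0))

print(f"Factors tested:          {n_factors}")
print(f"|t| > 1.96 (false hits): {false_pos_196}")
print(f"|t| > 3.00 (false hits): {false_pos_300}")
```

Roughly 5% of the useless factors should clear the 1.96 threshold, while far fewer clear 3.0. That gap is the motivation for the stricter hurdle.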

Student engagement: “If I flip a coin 100 times and get 60 heads, is the coin biased?” (Maybe. Need to calculate if 60 is significantly different from 50, accounting for sampling variance. Same logic applies to factor returns.)
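The coin-flip question can be answered with the same "estimate divided by standard error" logic. The sketch below uses the normal approximation to the binomial, which is an assumption; an exact binomial test gives a slightly larger p-value.

```python
import math

n_flips, n_heads = 100, 60
p_null = 0.5                                       # H0: the coin is fair

expected = n_flips * p_null                        # heads expected under H0
se = math.sqrt(n_flips * p_null * (1 - p_null))    # SD of the head count under H0
z = (n_heads - expected) / se                      # analogue of a t-statistic

# Two-sided p-value from the standard normal distribution
p_value = math.erfc(abs(z) / math.sqrt(2))

print(f"z = {z:.2f}, two-sided p = {p_value:.4f}")
```

The result sits right at the conventional 5% boundary and well short of a t > 3 hurdle, so "maybe" is the honest answer.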

Assessment: Understanding standard errors is essential for interpreting results. Don’t just report t-statistics: explain what they mean and what uncertainty remains.

Transition: “Standard errors quantify uncertainty. But how are they actually calculated? Understanding the construction helps us see why they matter for financial data.”

How Standard Errors Are Constructed

Standard errors measure both estimation precision and sampling variability. They are calculated from the variance-covariance matrix (error variance ÷ sample size), but their frequentist interpretation is as the standard deviation of the sampling distribution under hypothetical repeated sampling. High noise variance → large standard errors → imprecise estimates.

Basic OLS standard error formula:

For regression coefficient \(\hat{\beta}\):

\[ SE(\hat{\beta}) = \sqrt{\frac{\hat{\sigma}^2}{\sum (X_i - \bar{X})^2}} \]

where \(\hat{\sigma}^2\) is estimated error variance.
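The formula above is for the slope. For the intercept (the alpha in a CAPM regression), the corresponding textbook result for a regression with one predictor is given here as a supplement. It uses the same \(\hat{\sigma}^2\) and makes the sample size \(n\) explicit:

\[ SE(\hat{\alpha}) = \sqrt{\hat{\sigma}^2 \left( \frac{1}{n} + \frac{\bar{X}^2}{\sum (X_i - \bar{X})^2} \right)} \]

When \(\bar{X}\) is close to zero, as with monthly market returns, this is approximately \(\sqrt{\hat{\sigma}^2 / n}\).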

Key components:

Error variance (\(\hat{\sigma}^2\)): How much returns deviate from predicted values
Sample size (\(n\)): More observations → larger sum in the denominator → smaller SE (more precision)
Variation in X: More variation in predictor → smaller SE (better identification)

For financial returns:

High error variance: Returns are volatile (noise is large)
Limited sample size: Only 20-30 years of monthly data available
Result: Standard errors are relatively large, making significance hard to achieve

Intuition: If returns were perfectly predictable, error variance = 0, SE = 0. But returns are noisy, so SE > 0. Standard errors tell us how much uncertainty remains.

Show calculation: Standard error components

import numpy as np
import pandas as pd

# Simulate factor returns
np.random.seed(42)
n = 240  # 20 years monthly
true_alpha = 0.003  # 0.3% monthly true alpha
market = np.random.normal(0.008, 0.04, n)
factor = true_alpha + 0.2 * market + np.random.normal(0, 0.03, n)

# OLS regression
X = np.column_stack([np.ones(n), market])
beta_hat = np.linalg.lstsq(X, factor, rcond=None)[0]
residuals = factor - X @ beta_hat
sigma_sq = np.var(residuals, ddof=2)  # Error variance

# Standard error calculation
X_centered = market - market.mean()
sum_sq_X = np.sum(X_centered ** 2)
se_alpha = np.sqrt(sigma_sq * (1 / n + market.mean() ** 2 / sum_sq_X))  # Exact OLS SE for the intercept
se_beta = np.sqrt(sigma_sq / sum_sq_X)  # For slope

# Display components
components = pd.DataFrame({
    'Component': ['Sample size (n)', 'Error variance (σ²)', 'Sum of squares (X)', 
                  'SE(alpha)', 'SE(beta)'],
    'Value': [n, f'{sigma_sq:.6f}', f'{sum_sq_X:.4f}', 
              f'{se_alpha:.4f}', f'{se_beta:.4f}']
})

print("=== Standard Error Construction Components ===\n")
print(components.to_string(index=False))
print(f"\n📊 Interpretation:")
print(f"   Error variance = {sigma_sq:.6f} (high → noisy returns)")
print(f"   Sample size = {n} (limited → less precision)")
print(f"   SE(alpha) = {se_alpha:.4f} ({se_alpha*100:.2f}% monthly)")
print(f"   t-statistic = {beta_hat[0]/se_alpha:.2f}")
if abs(beta_hat[0]/se_alpha) > 1.96:
    print(f"   ✓ Alpha is statistically significant")
else:
    print(f"   ✗ Alpha is NOT statistically significant")

Why Financial Data Is Challenging

High error variance + limited sample size = large standard errors. This makes it hard to detect true factors (signal) when noise dominates. HAC standard errors account for additional complications (autocorrelation, heteroskedasticity), which typically makes SEs larger still.

Standard error construction (150 seconds)

This slide goes deeper into how standard errors are actually constructed. Understanding the formula helps students see why financial data is challenging and why standard errors matter.

Basic OLS formula: Standard error for regression coefficient depends on three components: (1) Error variance (how much returns deviate from predicted values), (2) Sample size (more observations = more precision), (3) Variation in predictor (more variation = better identification).

For financial returns: All three components work against us. (1) Error variance is high: returns are volatile, so residuals are large. (2) Sample size is limited: we only have 20-30 years of monthly data (240-360 observations), not thousands. (3) Market variation exists but is moderate: not extreme variation that would help identification.

Result: Standard errors are relatively large. This makes it hard to achieve statistical significance. Even if true alpha exists, we might not detect it due to noise. This is why rigorous statistical methods (HAC) are essential: they account for additional complications that make inference even harder.

Code demonstration: Shows the actual calculation. Students see error variance, sample size, and how they combine to produce standard errors. The point: financial data has high noise relative to signal, making standard errors large and significance hard to achieve.

Intuition: If returns were perfectly predictable (error variance = 0), standard errors would be zero: we’d know alpha exactly. But returns are noisy, so standard errors are positive. They quantify how much uncertainty remains in our estimates.

Connection to factor research: Many factors might have small true alpha, but we can’t detect it because standard errors are too large (noise dominates). This is why replication is essential: it tests whether observed patterns persist across samples.

Student engagement: “Why do we need 20+ years of data to test factors? Why not just use 1 year?” (Sample size affects standard errors. More data = smaller SE = easier to detect signal. But even 20 years might not be enough if noise is high.)
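A back-of-envelope calculation makes the sample-size point concrete. It is a sketch, assuming the simplified relationship SE(alpha) ≈ residual volatility / √n with i.i.d. errors, a true alpha of 0.3% per month, and residual volatility of 3% per month. The helper name `months_needed` is hypothetical.

```python
import math

def months_needed(alpha_monthly, resid_vol_monthly, t_target):
    """Months of data needed for alpha / SE to reach t_target,
    using the simplified SE(alpha) ~ resid_vol / sqrt(n)."""
    return (t_target * resid_vol_monthly / alpha_monthly) ** 2

alpha = 0.003   # assumed true alpha: 0.3% per month
vol = 0.03      # assumed residual volatility: 3% per month

for t_target in (1.96, 3.0):
    n_months = months_needed(alpha, vol, t_target)
    print(f"t > {t_target}: about {n_months:.0f} months ({n_months / 12:.0f} years)")

# One year of data gives an expected t-statistic far below any threshold
expected_t_one_year = alpha / (vol / math.sqrt(12))
print(f"Expected t with 12 months: {expected_t_one_year:.2f}")
```

Under these assumptions, even a genuine 0.3% monthly alpha needs decades of data to reach conventional significance, and HAC adjustments would typically lengthen that further.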

Transition: “Standard errors quantify uncertainty. But in time-series data, the calculation is more complex: autocorrelation and heteroskedasticity require HAC adjustments.”

Time-Series Data: Autocorrelation Problem

Financial time-series violate a key OLS assumption: independence of observations. If monthly returns are correlated, you don’t have 120 independent observations over 10 years: you have fewer “effective” observations.

Financial returns exhibit serial correlation:

Momentum: Positive returns predict future positive returns (6-12 months)
Volatility clustering: High volatility today predicts high volatility tomorrow
Market regimes: Bull and bear markets persist over time

Problem for inference:

Standard OLS assumes observations are independent (εₜ and εₜ₊₁ uncorrelated)
Autocorrelation breaks this assumption
Result: With positively autocorrelated errors, OLS understates standard errors → inflates t-statistics → false positives

Impact: HAC (Newey-West) standard errors typically 1.5-2× larger than OLS for monthly factors

A factor with OLS t = 2.5 might have HAC t = 1.8 (no longer significant). Always use HAC for time-series financial data.

Autocorrelation and inference (150 seconds)

Autocorrelation means time-series observations are correlated with their own past values. Financial returns exhibit multiple forms of autocorrelation: (1) Momentum: positive returns predict future positive returns over 6-12 months. (2) Mean reversion: over longer horizons (3-5 years), returns revert to mean. (3) Volatility clustering: high volatility today predicts high volatility tomorrow (GARCH). (4) Market regimes: bull markets and bear markets persist.

Why it matters for inference: Standard OLS regression assumes observations are independent (i.e., εₜ and εₜ₊₁ are uncorrelated). When returns are autocorrelated, this assumption fails. Consequence: OLS standard errors are too small (underestimate true uncertainty), which means t-statistics are too large (overstate significance). You might conclude a factor is significant when it’s actually just noise.
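A small Monte Carlo experiment shows the understatement directly. This is a sketch on simulated data, assuming an AR(1) process with coefficient 0.3 and 120 observations. It compares the true sampling variability of the mean with what the i.i.d. formula reports.

```python
import numpy as np

rng = np.random.default_rng(1)
n_obs, n_sims, rho = 120, 2000, 0.3   # assumed: 10 years monthly, AR(1) coefficient 0.3

sample_means = np.empty(n_sims)
naive_ses = np.empty(n_sims)
for s in range(n_sims):
    shocks = rng.normal(0.0, 1.0, n_obs)
    x = np.empty(n_obs)
    x[0] = shocks[0] / np.sqrt(1 - rho**2)   # start from the stationary distribution
    for t in range(1, n_obs):
        x[t] = rho * x[t - 1] + shocks[t]    # today's value depends on yesterday's
    sample_means[s] = x.mean()
    naive_ses[s] = x.std(ddof=1) / np.sqrt(n_obs)   # i.i.d. formula, ignores autocorrelation

true_sd = sample_means.std(ddof=1)       # actual sampling variability of the mean
avg_naive = naive_ses.mean()             # what the i.i.d. formula reports on average
n_eff = n_obs * (1 - rho) / (1 + rho)    # large-sample effective sample size for an AR(1) mean

print(f"True SD of sample mean: {true_sd:.4f}")
print(f"Average naive SE:       {avg_naive:.4f}")
print(f"Ratio (true / naive):   {true_sd / avg_naive:.2f}")
print(f"Effective observations: {n_eff:.0f} of {n_obs}")
```

The naive standard error is too small because 120 autocorrelated observations carry roughly the information of 65 independent ones in this setup.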

Transition: “Let’s see actual autocorrelation in real financial data from Bloomberg.”

Detecting Autocorrelation: Bloomberg Data Evidence

Using real financial data, we can measure autocorrelation and test whether it’s statistically significant. This demonstrates why HAC corrections are essential.

Autocorrelation Function (ACF): Correlation between \(r_t\) and \(r_{t-k}\) for lags \(k = 1, 2, ...\)

Ljung-Box Test: Tests null hypothesis of no autocorrelation up to lag \(k\)

H₀: No autocorrelation (ρ₁ = ρ₂ = … = ρₖ = 0)
If p-value < 0.05 → reject H₀ → autocorrelation present → OLS SEs are wrong
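To see what the test actually computes, the sketch below implements the sample ACF and the Ljung-Box statistic \(Q = n(n+2)\sum_{k} \hat{\rho}_k^2 / (n-k)\) by hand with NumPy. The function names are hypothetical and the data are simulated (the volatility pattern is an arbitrary illustration, not a fitted GARCH model). Q is compared with 18.307, the 5% critical value of a chi-square distribution with 10 degrees of freedom.

```python
import numpy as np

def sample_acf(x, max_lag):
    """Sample autocorrelations rho_1 .. rho_max_lag."""
    x = np.asarray(x, dtype=float)
    d = x - x.mean()
    denom = np.sum(d**2)
    return np.array([np.sum(d[k:] * d[:-k]) / denom for k in range(1, max_lag + 1)])

def ljung_box_q(x, max_lag):
    """Ljung-Box Q = n(n+2) * sum_k rho_k^2 / (n - k)."""
    n_x = len(x)
    rho = sample_acf(x, max_lag)
    lags = np.arange(1, max_lag + 1)
    return n_x * (n_x + 2) * np.sum(rho**2 / (n_x - lags))

rng = np.random.default_rng(2)
n = 1000
white_noise = rng.normal(0, 0.01, n)   # simulated i.i.d. "returns"

# Simulated returns whose volatility moves slowly between calm and turbulent regimes
vol = 0.01 * np.exp(0.5 * np.sin(np.linspace(0, 8 * np.pi, n)))
clustered = rng.normal(0, 1, n) * vol

CHI2_CRIT_10DF_5PCT = 18.307   # 5% critical value, chi-square with 10 degrees of freedom

for name, series in [("i.i.d. returns", white_noise),
                     ("clustered, returns", clustered),
                     ("clustered, squared returns", clustered**2)]:
    q = ljung_box_q(series, 10)
    print(f"{name:<28} Q(10) = {q:8.2f}  reject H0: {q > CHI2_CRIT_10DF_5PCT}")
```

The squared series should produce a very large Q even though the returns themselves were drawn independently, which is the signature of volatility clustering.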

Show analysis: Detecting autocorrelation in real returns

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.stats.diagnostic import acorr_ljungbox
from statsmodels.graphics.tsaplots import plot_acf

df = load_bloomberg()

# Analyze autocorrelation for SPY
spy_returns = df[df['ticker'] == 'SPY'].sort_values('date')['return'].dropna().values  # sort by date: ACF depends on time order

# Create visualization
fig, axes = plt.subplots(1, 3, figsize=(12, 4))

# Plot 1: ACF for returns
ax1 = axes[0]
plot_acf(spy_returns, lags=20, ax=ax1, alpha=0.05, 
         title='SPY Returns: Autocorrelation Function')
ax1.set_xlabel('Lag (days)', fontsize=10)
ax1.set_ylabel('Autocorrelation', fontsize=10)

# Plot 2: ACF for squared returns (volatility clustering)
ax2 = axes[1]
plot_acf(spy_returns**2, lags=20, ax=ax2, alpha=0.05,
         title='SPY Squared Returns: Volatility Clustering')
ax2.set_xlabel('Lag (days)', fontsize=10)
ax2.set_ylabel('Autocorrelation', fontsize=10)

# Plot 3: Ljung-Box test results
ax3 = axes[2]
lb_returns = acorr_ljungbox(spy_returns, lags=range(1, 21), return_df=True)
lb_squared = acorr_ljungbox(spy_returns**2, lags=range(1, 21), return_df=True)

lags = range(1, 21)
ax3.plot(lags, lb_returns['lb_pvalue'], 'b-o', label='Returns', markersize=4)
ax3.plot(lags, lb_squared['lb_pvalue'], 'r-s', label='Squared Returns', markersize=4)
ax3.axhline(y=0.05, color='gray', linestyle='--', label='5% significance')
ax3.set_xlabel('Lag (days)', fontsize=10)
ax3.set_ylabel('p-value', fontsize=10)
ax3.set_title('Ljung-Box Test: p-values by Lag', fontsize=11, fontweight='bold')
ax3.legend(loc='upper right', fontsize=8)
ax3.set_ylim(0, 1)

plt.tight_layout()
plt.show()

# Print test results
print("=== Autocorrelation Analysis: SPY Daily Returns ===\n")

# First-order autocorrelation
from scipy.stats import pearsonr
if len(spy_returns) > 1:
    acf1, _ = pearsonr(spy_returns[:-1], spy_returns[1:])
    print(f"First-order autocorrelation (ρ₁): {acf1:.4f}")

# Ljung-Box at lag 10
lb_10 = acorr_ljungbox(spy_returns, lags=[10], return_df=True)
lb_10_sq = acorr_ljungbox(spy_returns**2, lags=[10], return_df=True)

print(f"\nLjung-Box Test (lag 10):")
print(f"  Returns:         Q = {lb_10['lb_stat'].values[0]:.2f}, p = {lb_10['lb_pvalue'].values[0]:.4f}")
print(f"  Squared Returns: Q = {lb_10_sq['lb_stat'].values[0]:.2f}, p = {lb_10_sq['lb_pvalue'].values[0]:.4f}")

print(f"\n📊 Interpretation:")
if lb_10['lb_pvalue'].values[0] < 0.05:
    print(f"   ✗ Returns show significant autocorrelation (p < 0.05)")
    print(f"     → OLS standard errors are biased downward")
else:
    print(f"   ✓ Returns show no significant autocorrelation (p ≥ 0.05)")
    print(f"     → OLS standard errors may be acceptable for returns")

if lb_10_sq['lb_pvalue'].values[0] < 0.05:
    print(f"   ✗ Squared returns show significant autocorrelation (p < 0.05)")
    print(f"     → Volatility clustering present → heteroskedasticity")
    print(f"     → OLS standard errors are still biased → use HAC!")

Key Finding

Even if return autocorrelation is weak, squared returns (volatility) typically show strong autocorrelation. This is volatility clustering (GARCH): high/low volatility periods persist. HAC corrections address both autocorrelation AND heteroskedasticity.

Autocorrelation detection with real data (120 seconds)

This slide shows students how to detect autocorrelation in real financial data using two tools: the Autocorrelation Function (ACF) and the Ljung-Box test.

ACF interpretation: Each bar shows correlation between returns at time t and time t-k. Blue shaded region is 95% confidence interval under null of no autocorrelation. Bars outside the shaded region indicate significant autocorrelation at that lag.

Returns vs squared returns: Returns themselves often show weak autocorrelation (efficient markets quickly incorporate information). But squared returns (proxy for volatility) show strong autocorrelation: this is volatility clustering. High volatility days cluster together, low volatility days cluster together.

Ljung-Box test: Tests null hypothesis that there’s no autocorrelation up to a given lag. If p-value < 0.05, reject null: autocorrelation is present. For squared returns, p-values are typically very small (strong evidence of volatility clustering).

Why this matters for HAC: Even if returns appear uncorrelated, heteroskedasticity (volatility clustering) still violates OLS assumptions. HAC corrects for BOTH: that’s why it’s called Heteroskedasticity and Autocorrelation Consistent.

Student engagement: “Look at the squared returns ACF: see how it stays high even at long lags? That’s volatility clustering. When you run CAPM regressions, the residuals inherit this structure. OLS doesn’t account for it; HAC does.”

Transition: “We’ve detected autocorrelation. Now let’s see how HAC standard errors differ from OLS in practice.”

HAC vs OLS Standard Errors: Practical Impact

Let’s compare OLS and HAC standard errors using a CAPM regression on real Bloomberg data. The difference shows why HAC is essential.

Methodology: Regress asset returns on market (SPY) using both OLS and HAC standard errors

Show comparison: OLS vs HAC standard errors

import pandas as pd
import numpy as np
import statsmodels.api as sm

df = load_bloomberg()

# Get market (SPY) and asset (AAPL) returns
spy_data = df[df['ticker'] == 'SPY'].sort_values('date')
aapl_data = df[df['ticker'] == 'AAPL'].sort_values('date')

# Merge to align dates
merged = pd.merge(
    aapl_data[['date', 'return']],
    spy_data[['date', 'return']],
    on='date',
    suffixes=('_aapl', '_market')
).dropna()

y = merged['return_aapl'].values
X = sm.add_constant(merged['return_market'].values)

# OLS regression (standard errors assume i.i.d. errors)
model_ols = sm.OLS(y, X).fit()

# HAC regression (Newey-West standard errors, lag = 10)
model_hac = sm.OLS(y, X).fit(cov_type='HAC', cov_kwds={'maxlags': 10})

# Create comparison table
results = pd.DataFrame({
    'Parameter': ['Alpha (α)', 'Beta (β)'],
    'Estimate': [model_ols.params[0], model_ols.params[1]],
    'OLS SE': [model_ols.bse[0], model_ols.bse[1]],
    'HAC SE': [model_hac.bse[0], model_hac.bse[1]],
    'OLS t-stat': [model_ols.tvalues[0], model_ols.tvalues[1]],
    'HAC t-stat': [model_hac.tvalues[0], model_hac.tvalues[1]],
})

# Calculate SE inflation factor
results['SE Inflation'] = results['HAC SE'] / results['OLS SE']

print("=== OLS vs HAC Standard Errors: CAPM Regression ===\n")
print(f"Asset: AAPL regressed on Market (SPY)")
print(f"Sample: {len(y):,} daily observations\n")

print("Parameter Estimates and Standard Errors:")
print("-" * 75)
print(f"{'Parameter':<12} {'Estimate':>10} {'OLS SE':>10} {'HAC SE':>10} {'OLS t':>8} {'HAC t':>8} {'SE Ratio':>10}")
print("-" * 75)

for _, row in results.iterrows():
    print(f"{row['Parameter']:<12} {row['Estimate']:>10.5f} {row['OLS SE']:>10.5f} "
          f"{row['HAC SE']:>10.5f} {row['OLS t-stat']:>8.2f} {row['HAC t-stat']:>8.2f} "
          f"{row['SE Inflation']:>10.2f}x")

print("-" * 75)

# Interpretation
alpha_ols_sig = abs(model_ols.tvalues[0]) > 1.96
alpha_hac_sig = abs(model_hac.tvalues[0]) > 1.96

print(f"\n📊 Key Findings:")
print(f"   HAC SE / OLS SE ratio: {results['SE Inflation'].mean():.2f}x on average")
print(f"   Alpha significance:")
print(f"     OLS: |t| = {abs(model_ols.tvalues[0]):.2f} → {'Significant' if alpha_ols_sig else 'Not significant'} at 5%")
print(f"     HAC: |t| = {abs(model_hac.tvalues[0]):.2f} → {'Significant' if alpha_hac_sig else 'Not significant'} at 5%")

if alpha_ols_sig and not alpha_hac_sig:
    print(f"\n⚠️  CRITICAL: Alpha appears significant with OLS but NOT with HAC!")
    print(f"   This is a FALSE POSITIVE prevented by using HAC standard errors.")
elif alpha_ols_sig and alpha_hac_sig:
    print(f"\n✓  Alpha is significant with both OLS and HAC.")
    print(f"   But t-statistic is lower with HAC: more conservative inference.")
else:
    print(f"\n   Alpha not significant with either method.")
    print(f"   HAC gives more reliable inference regardless.")

print(f"\n💡 Lesson:")
print(f"   Always use HAC (cov_type='HAC') for time-series financial regressions.")
print(f"   OLS standard errors understate uncertainty → inflate t-statistics → false positives.")

Practical Implementation

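In statsmodels: model.fit(cov_type='HAC', cov_kwds={'maxlags': 10}) gives Newey-West HAC standard errors. Rule of thumb: for monthly data, use maxlags=6; for daily data, maxlags=20-30. The daily example above uses maxlags=10, so it is worth checking that conclusions hold with a longer lag window.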

HAC vs OLS comparison (120 seconds)

This slide demonstrates the practical impact of using HAC vs OLS standard errors in a real CAPM regression. Students see actual numbers, not just theory.

What we’re doing: Regress AAPL returns on market (SPY) returns. Compare standard errors from OLS (assumes i.i.d. errors) vs HAC (accounts for autocorrelation and heteroskedasticity).

Typical findings: HAC standard errors are 1.2-2.0x larger than OLS. This means t-statistics drop by roughly 15-50%. A result that looks significant with OLS may not be significant with HAC.

Why the difference: HAC estimates the variance of the coefficient while allowing for correlation between nearby residuals and for changing volatility. Intuitively, if residuals are autocorrelated, adjacent observations are not independent, so the effective sample size is smaller than the actual sample size. Smaller effective sample → larger standard errors → lower t-statistics.

Implementation: In statsmodels, just add cov_type='HAC' to .fit(). The maxlags parameter controls how many lags of autocorrelation to account for. Rule of thumb: monthly data → 6 lags; daily data → 20-30 lags.
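To show what the maxlags parameter does, here is a hand-rolled Newey-West estimator for the simplest case, the standard error of a sample mean (an intercept-only regression). It is a sketch using Bartlett weights on a simulated AR(1) series with an assumed coefficient of 0.3. The function name is hypothetical, and it is not claimed to match statsmodels output exactly, since small-sample corrections can differ.

```python
import numpy as np

def newey_west_se_of_mean(x, maxlags):
    """Newey-West (Bartlett kernel) standard error of the sample mean.
    maxlags=0 reproduces the i.i.d. formula (with a 1/n variance)."""
    x = np.asarray(x, dtype=float)
    n_x = len(x)
    d = x - x.mean()
    long_run_var = np.sum(d * d) / n_x                 # lag-0 autocovariance
    for j in range(1, maxlags + 1):
        gamma_j = np.sum(d[j:] * d[:-j]) / n_x         # lag-j autocovariance
        weight = 1.0 - j / (maxlags + 1.0)             # Bartlett weight shrinks with lag
        long_run_var += 2.0 * weight * gamma_j
    return np.sqrt(long_run_var / n_x)

# Simulated AR(1) series standing in for autocorrelated regression errors
rng = np.random.default_rng(3)
n, rho = 2000, 0.3
x = np.empty(n)
x[0] = rng.normal()
for t in range(1, n):
    x[t] = rho * x[t - 1] + rng.normal()

for lags in (0, 1, 6, 20):
    print(f"maxlags = {lags:>2}: SE of mean = {newey_west_se_of_mean(x, lags):.4f}")
```

With positive autocorrelation, the standard error rises as more lags are included and then levels off once the lag window covers the range over which the series is meaningfully correlated. Checking that plateau is one way to judge whether a chosen maxlags is long enough.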

Student takeaway: Never use plain OLS for time-series financial regressions. HAC is not optional: it’s required for honest inference. In Coursework 2, you will lose marks for using OLS standard errors instead of HAC.

Connection to signal-to-noise: HAC standard errors are typically larger because the effective sample size is smaller. The noise fraction itself does not change, but our uncertainty about alpha grows, making true signal detection even harder. Need higher t-statistics (Harvey recommends t > 3) to confidently identify real factors.

Transition: “Now we understand why HAC matters. Let’s apply this to factor alpha tests.”

Alpha Tests: CAPM Regression

The CAPM alpha test decomposes factor returns into two components: market exposure (β) and excess return beyond market (α). Only alpha matters: beta just tells you market risk.

Capital Asset Pricing Model regression:

\[ R_{factor,t} = \alpha + \beta \cdot R_{market,t} + \varepsilon_t \]

Interpretation:

Alpha (α): Excess return not explained by market exposure (“skill” component)
Beta (β): Factor’s sensitivity to market movements
R²: Fraction of factor variance explained by market
Null hypothesis: α = 0 (no excess return beyond market)

Example: Momentum earns 1.2% monthly, beta = 0.2, market earns 0.8% monthly

→ CAPM predicts momentum return = α + 0.2 × 0.8% = α + 0.16%
→ Observed return 1.2%, so α = 1.04% monthly (if HAC t > 1.96, it’s significant)

CAPM alpha foundations (150 seconds)

The CAPM alpha test is the workhorse of factor research. It decomposes factor returns into two components: (1) market exposure (β) and (2) excess return beyond market (α). Only alpha matters: beta just tells you how much market risk the factor takes.

Why we care about alpha: If a factor has high returns but also high beta, it’s just levered market exposure, not an independent source of return. Example: a long-only portfolio of small-cap stocks has β ≈ 1.2 (more volatile than market). Part of its return comes from that market exposure. Alpha isolates returns beyond what market exposure would predict.

Economic interpretation: Alpha represents mispricing (if markets are inefficient) or compensation for unmeasured risk (if markets are efficient but CAPM is incomplete). Either way, positive alpha means the factor generates returns that can’t be explained by a simple market benchmark.

Statistical test: Null hypothesis H₀: α = 0 (no excess return). With a two-tailed alternative H₁: α ≠ 0, we reject H₀ at the 5% level if the HAC |t| > 1.96. With a one-tailed alternative H₁: α > 0, the 5% cutoff is t > 1.645. Finance typically uses two-tailed tests because negative alpha (underperformance) is also informative.

Example numbers: Suppose momentum factor has mean return = 1.2% monthly, beta = 0.2, market return = 0.8% monthly. CAPM predicts momentum return = α + 0.2 × 0.8% = α + 0.16%. Observed return is 1.2%, so α = 1.2% - 0.16% = 1.04% monthly. If SE(α) = 0.4% (HAC), then t = 1.04 / 0.4 = 2.6, which is significant. Momentum earns 1% monthly alpha beyond market exposure.
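The arithmetic in this example follows from the OLS identity that the estimated intercept equals the mean of the dependent variable minus beta times the mean of the regressor. A minimal sketch reproducing the numbers, with all inputs taken from the example above:

```python
# Inputs from the momentum example, all in % per month
mean_factor_return = 1.2
beta = 0.2
mean_market_return = 0.8
se_alpha_hac = 0.4          # HAC standard error assumed in the example

market_component = beta * mean_market_return       # return explained by market exposure
alpha = mean_factor_return - market_component      # what the market cannot explain
t_stat = alpha / se_alpha_hac

print(f"Market component: {market_component:.2f}% per month")
print(f"Alpha:            {alpha:.2f}% per month")
print(f"t-statistic:      {t_stat:.1f}")
print(f"Significant at 5% (|t| > 1.96): {abs(t_stat) > 1.96}")
print(f"Clears the t > 3 hurdle:        {abs(t_stat) > 3}")
```

The alpha passes the conventional 5% test but not the stricter t > 3 hurdle discussed earlier, so a careful write-up would report both.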

Beta interpretation: β > 1 means factor is riskier than market (amplifies market moves). β < 1 means factor is defensive. β ≈ 0 means factor is market-neutral (true long-short). Most academic factors have β ≈ 0 by construction because they’re long-short portfolios.

R-squared: Measures fraction of factor variance explained by market. High R² (e.g., 0.8) means factor is mostly market exposure. Low R² (e.g., 0.1) means factor is largely independent of market. For long-short factors, R² is typically low (0.05-0.15) because market exposure is hedged out.

Multi-factor models: CAPM uses only market as benchmark. Fama-French uses market + size + value. Fama-French-Carhart adds momentum. Adding factors usually reduces alpha (the extra benchmarks absorb part of the return), making significance harder to achieve. For coursework, CAPM is sufficient, but you could extend to multi-factor for robustness.

Common pitfall: Reporting alpha without beta. Always report both: alpha is meaningless without knowing market exposure. Also report R² to show how much variance is market-driven.
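
A minimal sketch of how alpha, beta, R² and a HAC t-statistic can come out of one regression, using only NumPy and simulated data. The function name `capm_alpha_hac`, the choice of 6 lags, and the simulated parameters are illustrative assumptions, not the scaffold's implementation; the standard errors use the Newey-West (Bartlett kernel) estimator without a small-sample correction.

```python
import numpy as np

def capm_alpha_hac(factor_excess, market_excess, max_lags=6):
    """OLS of factor returns on market excess returns with Newey-West SEs."""
    y = np.asarray(factor_excess, dtype=float)
    x = np.asarray(market_excess, dtype=float)
    X = np.column_stack([np.ones_like(x), x])   # intercept (alpha), slope (beta)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef

    # Newey-West "meat": lag-0 term plus Bartlett-weighted autocovariances
    scores = X * resid[:, None]                 # x_t * u_t for each month
    meat = scores.T @ scores
    for lag in range(1, max_lags + 1):
        weight = 1.0 - lag / (max_lags + 1.0)
        gamma = scores[lag:].T @ scores[:-lag]
        meat += weight * (gamma + gamma.T)

    bread = np.linalg.inv(X.T @ X)
    cov = bread @ meat @ bread                  # sandwich covariance matrix
    se = np.sqrt(np.diag(cov))
    r_squared = 1.0 - resid.var() / y.var()
    return {"alpha": coef[0], "beta": coef[1], "se_alpha": se[0],
            "t_alpha": coef[0] / se[0], "r_squared": r_squared}

# Simulated example (hypothetical numbers): true alpha = 0.4% monthly, beta = 0.2
rng = np.random.default_rng(0)
market = rng.normal(0.006, 0.045, 240)
factor = 0.004 + 0.2 * market + rng.normal(0.0, 0.03, 240)
result = capm_alpha_hac(factor, market)
print({name: round(float(value), 4) for name, value in result.items()})
```

Returning all three quantities together makes it harder to report alpha without the beta and R² that give it context.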

Student engagement: “If a factor has 1.5% monthly return and beta = 2, should you invest?” (Not necessarily! High return might just be leveraged market exposure. If market crashes, factor crashes twice as hard. Alpha is what matters for risk-adjusted performance.)

Assessment: Coursework 2 requires alpha regression with HAC standard errors. Interpret alpha economically (0.5% monthly = 6% annualized excess return). Discuss statistical significance. Consider whether alpha justifies transaction costs.
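
The "0.5% monthly = 6% annualised" conversion multiplies by 12; compounding gives a slightly higher figure. A small sketch of both conventions follows (the helper name `annualise_monthly` is hypothetical); whichever you use, state it in the report.

```python
def annualise_monthly(monthly_rate):
    """Return (simple, compounded) annualised versions of a monthly rate."""
    simple = 12 * monthly_rate
    compounded = (1 + monthly_rate) ** 12 - 1
    return simple, compounded

simple, compounded = annualise_monthly(0.005)   # 0.5% monthly alpha
print(f"simple: {simple:.2%}, compounded: {compounded:.2%}")
```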

Transition: “Alpha tests are essential. But one test isn’t enough: we need robustness checks to separate signal from noise.”

Robustness: Why One Test Isn’t Enough

A single significant result is weak evidence. Researchers have many degrees of freedom, which Gelman and Loken (2014) call the “garden of forking paths”: if you try enough specifications, one will appear significant by chance. Robustness checks guard against false discoveries.

Robustness checks test if results hold under alternative specifications:

Sample split: Does factor work in first half AND second half? (minimum check)
Subperiod analysis: Does alpha remain positive in each decade?
Alternative construction: Tertiles vs. quintiles, value-weighted vs. equal-weighted?
Cross-region: US factors often don’t replicate in Europe/Asia (Jensen, Kelly, and Pedersen 2024)
Transaction costs: Is net alpha positive after 0.2-0.5% monthly costs?

Ethical Econometrics

Don’t cherry-pick checks that passed. If factor works 2000-2010 but not 2010-2020, report both. Selective reporting is a breach of research ethics: transparent, complete disclosure is fundamental to responsible empirical practice and is rewarded in the 35% Critical Analysis component.

Robustness philosophy and ethical econometrics (150 seconds)

A single significant result is weak evidence. This connects directly to Gelman and Loken’s (2014) “garden of forking paths”: researchers face countless analytical choices: data period, factor construction, breakpoints, weighting schemes, outlier treatment, lag specifications. Each choice is a fork in the path; different forks lead to different conclusions. Even without intentional p-hacking, the sheer number of researcher degrees of freedom means some specification will appear significant by chance.

Ethical econometrics: This is not merely about “honesty”: it’s about research ethics and responsible empirical practice. Selective reporting (only showing specifications that work) is a fundamental breach of scientific integrity. The American Statistical Association, the Royal Statistical Society, and leading econometricians increasingly emphasise pre-registration and complete disclosure as ethical obligations, not optional best practices.

Sample split: Divide data at midpoint. Run alpha test on first half and second half separately. If factor works in both, that’s evidence of robustness. If it works only in first half, it might be sample-specific (perhaps driven by one unusual decade). For coursework, this is the minimum robustness check.
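
A rough sketch of the midpoint split on simulated data. As a simplifying assumption it uses a plain t-statistic on the mean return in each half rather than a HAC alpha regression; the function names and the simulated "strong early, weak late" factor are hypothetical.

```python
import numpy as np

def mean_t_stat(returns):
    """Plain t-statistic for the mean return (no autocorrelation adjustment)."""
    r = np.asarray(returns, dtype=float)
    return r.mean() / (r.std(ddof=1) / np.sqrt(len(r)))

def sample_split(returns):
    """Split the sample at the midpoint and summarise each half."""
    r = np.asarray(returns, dtype=float)
    mid = len(r) // 2
    halves = {"first_half": r[:mid], "second_half": r[mid:]}
    return {name: {"mean": float(h.mean()), "t_stat": float(mean_t_stat(h))}
            for name, h in halves.items()}

# Hypothetical factor: 0.6% monthly mean in the first 10 years, 0.1% afterwards
rng = np.random.default_rng(7)
returns = np.concatenate([rng.normal(0.006, 0.03, 120),
                          rng.normal(0.001, 0.03, 120)])
for name, stats_ in sample_split(returns).items():
    print(f"{name}: mean = {stats_['mean']:.4f}, t = {stats_['t_stat']:.2f}")
```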

Subperiod analysis: Go beyond two-period split. Test factor in each decade (2000s, 2010s, 2020s). Does alpha remain positive and significant in each? Momentum works well in normal times but can crash when markets rebound sharply after a crisis (e.g., 2009). Subperiod analysis reveals such instability.

Alternative breakpoints: Most factors sort stocks into quintiles (5 groups). What if you use tertiles (3 groups) or deciles (10 groups)? If results are robust, choice shouldn’t matter much. If results flip, factor is fragile: the finding depends on an arbitrary analytical choice (a fork in the garden path).
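
The breakpoint check can be sketched for a single cross-section as below. This assumes equal-weighted groups and a simulated signal; `long_short_spread` is a hypothetical helper, and a real test would repeat the sort every month and average the resulting spreads.

```python
import numpy as np

def long_short_spread(signal, next_return, n_groups):
    """Top-group minus bottom-group mean return after sorting on the signal."""
    signal = np.asarray(signal, dtype=float)
    next_return = np.asarray(next_return, dtype=float)
    order = np.argsort(signal)                 # stock indices, lowest signal first
    groups = np.array_split(order, n_groups)   # near-equal-sized groups
    return next_return[groups[-1]].mean() - next_return[groups[0]].mean()

# Hypothetical cross-section: 900 stocks with a weak link from signal to return
rng = np.random.default_rng(1)
signal = rng.normal(size=900)
next_return = 0.01 * signal + rng.normal(0.0, 0.08, 900)

spreads = {n: long_short_spread(signal, next_return, n) for n in (3, 5, 10)}
for n_groups, spread in spreads.items():
    print(f"{n_groups:>2} groups: long-short spread = {spread:.4f}")
```

If the sign or rough size of the spread changes with the number of groups, that is the fragility described above.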

Cross-region replication: US factors don’t always work internationally. Momentum, for example, is strong in the US and Europe but has historically been weak in Japan, whereas value has held up well there. Testing cross-region replication assesses generalisability. Jensen et al. (2024) show many factors fail cross-region tests.

Transaction costs: Academic papers report gross returns (before trading costs). Real investors face bid-ask spreads, commissions, market impact. For high-turnover factors (momentum rebalances monthly), transaction costs can eliminate alpha. Robustness requires testing net returns after realistic cost assumptions (e.g., 0.5% round-trip cost).
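
A back-of-envelope sketch of the cost adjustment, assuming the monthly cost drag is simply turnover times the round-trip cost. The 0.5% gross alpha and the turnover levels are hypothetical, and `net_alpha` is an illustrative helper; a fuller treatment would model each leg of the long-short portfolio and market impact.

```python
def net_alpha(gross_alpha, monthly_turnover, round_trip_cost):
    """Gross monthly alpha minus the cost drag (turnover x round-trip cost)."""
    return gross_alpha - monthly_turnover * round_trip_cost

gross = 0.005   # 0.5% monthly gross alpha (hypothetical)
cost = 0.005    # 0.5% round-trip cost
for turnover in (0.1, 0.5, 1.0):   # fraction of the portfolio replaced per month
    net = net_alpha(gross, turnover, cost)
    print(f"turnover {turnover:.0%}: net alpha {net:.3%} per month")
```

Under these assumptions, a strategy that turns over its whole portfolio every month gives up all of its gross alpha.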

Multiple testing and judgment: If you run 10 robustness checks and 1 fails, that’s okay: no factor is perfect. But if 5 fail, results aren’t robust. Use judgment. The point isn’t to pass every test but to transparently assess whether the factor is real and exploitable.

Ethical reporting obligation: Don’t cherry-pick robustness checks. Report both passes and failures. If factor works 2000-2010 but not 2010-2020, say so. Selective reporting (presenting only favourable results) is precisely the behaviour that created the replication crisis in finance and psychology. Transparent, complete disclosure is rewarded in the 35% Critical Analysis component precisely because it demonstrates understanding of responsible empirical practice.

Student reflection: “Imagine a medical trial where drug works in one hospital but not others. Would you trust it?” (No. Same in finance: one significant result isn’t enough. And hiding the negative results would be unethical.)

Assessment: Coursework 2 requires at least one robustness check (sample split is easiest). But for high marks, discuss multiple robustness dimensions and interpret what failures mean. Does subperiod instability suggest structural change? Does transaction cost erosion suggest arbitrage has eliminated alpha? Critical engagement with failures, not just celebration of passes, demonstrates statistical maturity.

Transition: “Robustness checks test replication within your analysis. But there’s a broader problem: selection bias in the published literature.”

Part III : Selection Bias and the Replication Crisis

The Multiple Testing Problem

This is the core problem creating the replication crisis. With 5% significance threshold, testing 100 hypotheses generates ~5 false positives even if all nulls are true.

Academic research process (the problem):

Researcher tests 100 potential factors
95 don’t work (α ≈ 0, not significant)
5 appear significant (|t| > 2) by chance (5% false positive rate)
Researcher publishes the 5 “successful” factors
Failed tests go in file drawer (never published)
Journals prefer positive results; null results don’t advance careers

Result: Published literature massively overrepresents spurious findings

Harvey (2017) estimates over 300 equity factors published, but only ~10-15 are genuinely robust (95% are questionable)

Simulation: The Multiple Testing Problem in Action

Setup: 1,000 researchers each test 10 factors. All factors are pure noise (true α = 0). At 5% significance level, how many “discoveries” emerge?

Show simulation: False discoveries from pure noise

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats

np.random.seed(42)

# Simulation parameters
n_researchers = 1000
factors_per_researcher = 10
n_months = 240  # 20 years of monthly data
significance_level = 0.05
true_alpha = 0.0  # ALL factors are noise (null is true)

# Simulate: each factor is pure noise
total_factors = n_researchers * factors_per_researcher
t_statistics = []
p_values = []

for i in range(total_factors):
    # Generate random factor returns (mean = 0, sd = 3% monthly)
    factor_returns = np.random.normal(true_alpha, 0.03, n_months)
    
    # One-sample t-test: is mean significantly different from 0?
    t_stat, p_val = stats.ttest_1samp(factor_returns, 0)
    t_statistics.append(t_stat)
    p_values.append(p_val)

t_statistics = np.array(t_statistics)
p_values = np.array(p_values)

# Count "significant" results (false discoveries)
significant_5pct = np.sum(p_values < 0.05)
significant_1pct = np.sum(p_values < 0.01)
significant_harvey = np.sum(np.abs(t_statistics) > 3.0)  # Harvey's threshold

# Create visualisation
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Plot 1: Distribution of t-statistics
ax1 = axes[0]
ax1.hist(t_statistics, bins=50, density=True, alpha=0.7, color='steelblue', edgecolor='white')

# Overlay theoretical t-distribution
x = np.linspace(-5, 5, 100)
ax1.plot(x, stats.t.pdf(x, df=n_months-1), 'r-', linewidth=2, label='Theoretical t-dist')

# Mark significance thresholds
ax1.axvline(x=1.96, color='orange', linestyle='--', linewidth=2, label='t = ±1.96 (5%)')
ax1.axvline(x=-1.96, color='orange', linestyle='--', linewidth=2)
ax1.axvline(x=3.0, color='red', linestyle='--', linewidth=2, label='t = ±3.0 (Harvey)')
ax1.axvline(x=-3.0, color='red', linestyle='--', linewidth=2)

ax1.set_xlabel('t-statistic', fontsize=11, fontweight='bold')
ax1.set_ylabel('Density', fontsize=11, fontweight='bold')
ax1.set_title('Distribution of t-statistics\n(All 10,000 factors are PURE NOISE)', fontsize=12, fontweight='bold')
ax1.legend(loc='upper right', fontsize=9)
ax1.set_xlim(-5, 5)

# Plot 2: False discovery counts
ax2 = axes[1]
categories = ['p < 0.05\n(Standard)', 'p < 0.01\n(Stricter)', '|t| > 3\n(Harvey)']
counts = [significant_5pct, significant_1pct, significant_harvey]
expected = [total_factors * 0.05, total_factors * 0.01, total_factors * 0.0027]  # 0.27% for |t|>3

x_pos = np.arange(len(categories))
width = 0.35

bars1 = ax2.bar(x_pos - width/2, counts, width, label='Observed False Discoveries', color='crimson', alpha=0.8)
bars2 = ax2.bar(x_pos + width/2, expected, width, label='Expected (Theory)', color='steelblue', alpha=0.8)

# Add count labels
for bar, count in zip(bars1, counts):
    ax2.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 10, 
             f'{int(count)}', ha='center', fontweight='bold', fontsize=11)

ax2.set_ylabel('Number of "Significant" Results', fontsize=11, fontweight='bold')
ax2.set_title('False Discoveries from Pure Noise\n(All 10,000 factors have TRUE α = 0)', fontsize=12, fontweight='bold')
ax2.set_xticks(x_pos)
ax2.set_xticklabels(categories, fontsize=10)
ax2.legend(loc='upper right', fontsize=9)
ax2.set_ylim(0, max(counts) * 1.2)

plt.tight_layout()
plt.show()

# Print summary
print("=" * 70)
print("MULTIPLE TESTING SIMULATION: THE REPLICATION CRISIS IN ACTION")
print("=" * 70)
print(f"\nSetup:")
print(f"  Researchers:           {n_researchers:,}")
print(f"  Factors per researcher: {factors_per_researcher}")
print(f"  Total factors tested:   {total_factors:,}")
print(f"  True alpha (all factors): {true_alpha} (ALL ARE NOISE)")
print(f"  Sample size per factor:  {n_months} months")

print(f"\nFalse Discoveries (Type I Errors):")
print(f"  At p < 0.05: {significant_5pct:,} factors appear significant ({significant_5pct/total_factors*100:.1f}%)")
print(f"  At p < 0.01: {significant_1pct:,} factors appear significant ({significant_1pct/total_factors*100:.1f}%)")
print(f"  At |t| > 3:  {significant_harvey:,} factors appear significant ({significant_harvey/total_factors*100:.2f}%)")

print(f"\n📊 Key Insight:")
print(f"  {significant_5pct} papers could be published claiming 'significant alpha'")
print(f"  ALL {significant_5pct} are FALSE DISCOVERIES (true α = 0)")
print(f"  Using |t| > 3 (Harvey's threshold) reduces false discoveries to {significant_harvey}")
print(f"\n⚠️  This is why the replication crisis exists!")
print(f"  Published literature is full of noise masquerading as signal.")

The Sobering Reality

From 10,000 pure noise factors, approximately 500 will appear significant at the 5% level. If only these are published, the literature looks like 500 “discoveries”: but 100% are false positives. Harvey’s t > 3 threshold reduces this to ~27.

Multiple testing simulation (120 seconds)

This simulation makes the abstract problem concrete. Students can see that even when ALL factors are pure noise (true α = 0), a predictable number will appear “significant” by chance.

The mathematics: E[false positives] = n × α, where α here is the significance level (not the regression alpha). With 10,000 tests at α = 0.05, expected false positives = 500. This is not an approximation: it follows directly from the definition of the Type I error rate when every null is true.

Why |t| > 3 helps: At |t| > 3, the false positive rate drops to ~0.27% (for large samples). So expected false positives = 10,000 × 0.0027 = 27. Still not zero, but roughly a 95% reduction relative to the standard |t| > 1.96 threshold.

What this means for literature: Harvey estimates 300+ factors published. If true base rate of real factors is low (say 5%), then even with t > 2 threshold, most published factors are likely false positives. This is Bayesian reasoning: the posterior probability that a published factor is real depends on the prior probability AND on the test’s power and false positive rate.
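
The Bayesian argument can be made concrete with a short calculation. The 5% prior comes from the text; the 80% power figure is an assumption chosen for illustration, and holding it fixed at the stricter threshold flatters that threshold, since power falls as the bar rises. The function name is hypothetical.

```python
def prob_factor_is_real(prior, power, false_positive_rate):
    """Posterior probability that a 'significant' factor is genuinely real."""
    true_positives = prior * power
    false_positives = (1 - prior) * false_positive_rate
    return true_positives / (true_positives + false_positives)

prior = 0.05   # share of tested factors that are real (from the text)
power = 0.80   # chance a real factor passes the test (assumption)
for label, fpr in (("|t| > 1.96", 0.05), ("|t| > 3", 0.0027)):
    posterior = prob_factor_is_real(prior, power, fpr)
    print(f"{label}: P(real | significant) = {posterior:.2f}")
```

Under these assumptions, fewer than half of the factors clearing the conventional threshold are real, which is the sense in which most published factors may be false positives.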

Student engagement: “Run this simulation yourself: change the seed and watch the numbers change. But the EXPECTED count is always n × α. That’s not luck: it’s mathematics.”

Connection to coursework: When you replicate a published factor and find weaker performance, this simulation explains why. The original paper may have been one of the lucky noise factors that passed significance threshold.

Transition: “Now we understand why selection bias is so pervasive. How do we guard against it?”

Guarding Against Selection Bias

Selection bias is hard to eliminate but can be mitigated with rigorous practices. These separate good research from bad.

Best practices in research:

Pre-registration: Specify hypothesis before seeing data (medical trials standard)
Out-of-sample testing: Test on data not available when factor was published
Cross-region replication: Factors should work globally if they’re real
Multiple testing corrections: Use t > 3 threshold (Harvey 2017) instead of t > 2
Economic theory: Value/momentum have theoretical foundations; “vowel tickers outperform” doesn’t

For Coursework 2: Intellectual Honesty Earns Marks

If you tried 3 factors, disclose that (don’t pretend you tested only 1)
Report robustness failures, not just successes
If alpha is t = 2.1, acknowledge it’s marginal, not “strong evidence”
The 35% Critical Analysis component explicitly rewards honest limitation discussion

Guarding against bias (120 seconds)

Selection bias is hard to eliminate but can be mitigated with rigorous practices. These are what separate good research from bad.

Pre-registration: Commit to hypothesis and analysis plan before seeing results. In medical trials, this is standard: researchers must register trials before recruiting patients. In finance, pre-registration is rare but growing (platforms like OSF, AsPredicted). For coursework, you can’t pre-register (you see data first), but you can document your process: if you tried multiple factors, disclose how many and why you chose the one you report.

Out-of-sample testing: Test factor on data not available when factor was published. If value factor was documented using 1960-1990 data, test it on 1991-2020. Jensen et al. (2024) do this systematically: they split the sample at each factor’s publication date and test post-publication performance. Most factors weaken post-publication (either due to statistical bias or arbitrage eroding profits).

Cross-region replication: Factors should work globally if they’re real (risk-based or behavioural). If value works in US but not Europe or Asia, it might be US-specific (data mining or institutional feature). Jensen et al. show many factors fail cross-region tests. For coursework, JKP data covers multiple regions: you could test this as robustness (advanced).

Multiple testing corrections: If you test 20 factors, on average one will appear significant at the 5% level by chance even if none is real. Solution: use stricter threshold. Harvey (2017) recommends t > 3 for finance (roughly equivalent to a two-sided p < 0.003). This reduces false positive rate but increases false negative rate (might miss real factors). Trade-off: better to be conservative and miss some true factors than to accept many false factors.
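
A short sketch of why 20 tests change the picture, assuming independent tests and a normal approximation to the t-distribution. The Bonferroni correction shown here is one standard adjustment, used for illustration rather than as the method behind the t > 3 recommendation; the function names are hypothetical.

```python
from statistics import NormalDist

def familywise_error_rate(n_tests, alpha=0.05):
    """Probability of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** n_tests

def bonferroni_critical_value(n_tests, alpha=0.05):
    """Two-sided critical value when each test is run at level alpha / n_tests."""
    per_test_alpha = alpha / n_tests
    return NormalDist().inv_cdf(1 - per_test_alpha / 2)

n_tests = 20
print(f"Expected false positives: {n_tests * 0.05:.1f}")
print(f"P(at least one false positive): {familywise_error_rate(n_tests):.2f}")
print(f"Bonferroni critical value: {bonferroni_critical_value(n_tests):.2f}")
```

With 20 tests the Bonferroni critical value lands close to 3, which gives some intuition for why a threshold near t = 3 is a reasonable bar once several factors have been tried.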

Economic theory: Factors grounded in theory (risk or mispricing) are more plausible than atheoretical data mining. Value has theoretical foundation (risk compensation or behavioural overreaction). Momentum has behavioural foundation (underreaction to news). A factor like “stocks with tickers starting with vowels outperform” has no theory: pure data mining. Theory isn’t proof, but it increases the prior probability that the factor is real.

Honest reporting for coursework: The 35% Critical Analysis component rewards intellectual honesty. If you tested 3 factors and report only the one that worked, disclose that. If your robustness check failed (factor didn’t work in second half), report it and discuss what it means. If your alpha is marginally significant (t = 2.1), acknowledge uncertainty: don’t oversell it as “strong evidence.” Honesty demonstrates critical thinking.

What assessors look for: (1) Acknowledgment of multiple testing if applicable. (2) Transparent reporting of robustness failures. (3) Discussion of selection bias in published literature. (4) Recognition that your results are uncertain and might not replicate on future data. This intellectual humility is what distinguishes A-grade work.

Student reflection: “If you’re writing your report and tempted to hide a failed robustness check, ask: would a skeptical reader trust my analysis if they knew?” (Probably not. Honesty builds credibility.)

Assessment: Mark scheme explicitly rewards honest discussion of limitations (part of 35% component). Don’t fear reporting failures: they demonstrate understanding of replication challenges.

Transition: “Selection bias is a systemic problem. Now let’s discuss what makes critical analysis rigorous: the skill that earns high marks.”

Part IV : Critical Analysis: What Makes Interpretation Rigorous?

Beyond Reporting Numbers: Ask Questions

Weak analysis (reporting):

“Value factor earns 0.5% monthly alpha with t = 2.3 (significant). Sharpe ratio is 0.4. Results are robust to sample split.”

Strong analysis (interpretation):

“Value earns 0.5% monthly alpha (6% annualised). This is economically meaningful but modest. Statistical significance (t = 2.3) suggests it’s not pure luck, but close to threshold. Sample split shows alpha is stable (0.6% first half, 0.4% second half), increasing confidence. However, transaction costs (~0.2% monthly for value rebalancing) would reduce net alpha to 0.3% (3.6% annualised). Is 3.6% net alpha sufficient to compensate for tracking error and implementation frictions? Original paper reported 8% annualised: our replication shows 25% lower gross alpha (55% lower net of costs), consistent with post-publication decline documented by Jensen et al. (2024).”

Critical analysis principles (150 seconds)

This is the difference between a pass (50-59%) and a distinction (70%+). Weak analysis reports statistics without interpretation. Strong analysis asks questions, contextualises results, acknowledges uncertainty, and discusses implications.

What strong analysis includes:

Economic interpretation: Convert monthly returns to annualised. Put magnitudes in context. Is 0.5% monthly (6% annualised) large or small? Compare to market returns (~8% annualised) and risk-free rate (~2%). A 6% excess return is meaningful but not exceptional.
Statistical nuance: Don’t treat t = 2.3 as “proof.” It’s marginal (just above 1.96 threshold). Acknowledge uncertainty. “Results are significant at 5% level but not at 1% level, suggesting moderate evidence.”
Robustness interpretation: Don’t just say “robust to sample split.” Quantify: alpha is 0.6% first half, 0.4% second half. This shows weakening over time, worth discussing. Is this statistical noise or structural decline?
Transaction cost realism: Academic papers report gross returns. Real investors face costs. Estimate costs (0.1-0.5% monthly depending on turnover and liquidity) and subtract from alpha. If net alpha is near zero, factor isn’t exploitable.
Literature comparison: Compare your results to original paper. If original reported 10% alpha and you find 5%, that’s a 50% replication gap: discuss why (post-publication decline, different sample, data quality, selection bias in original).
Limitations discussion: Acknowledge what you didn’t test. “Sample size is limited (20 years). Cross-region replication not tested. Transaction cost estimate is rough. Results may not hold during crises (not tested).” This shows you understand research limitations.
Investment implications: What would a practitioner conclude? “Value factor shows positive alpha but modest magnitude and weakening trend. After costs, net alpha is ~3%. This might justify small allocation in diversified portfolio but not concentrated bet. Implementation would require careful transaction cost management.”

What weak analysis looks like: “Alpha is 0.5%, t = 2.3, significant. Sharpe ratio is 0.4. Cumulative return is 150% over 20 years. Robustness check passed. Conclusion: value factor works.” (This just reports numbers without interpretation, context, or critical thinking. It earns 50-59%.)

What strong analysis looks like: See the quoted example above. It interprets magnitudes, acknowledges uncertainty, compares to literature, discusses costs, and draws nuanced conclusions. This earns 70%+.

Assessment rubric: The formal Blackboard rubric (see assessments/FIN510_CW2_Rubric.md) specifies performance bands for each criterion. For the 35% Critical Analysis component: 70%+ requires genuine interpretation connecting results to investment implications, engagement with selection bias literature, and original insights beyond the scaffold. 60-69% requires good interpretation with appropriate limitations discussion. Below 60% is descriptive only with limited insight.

Student task: “Take the statement ‘momentum has alpha = 1% monthly, t = 3.’ Write one paragraph of critical analysis.” (Should include: 12% annualised is large; highly significant; but momentum crashes during crises; transaction costs high due to monthly rebalancing; post-publication performance weaker; requires risk management.)

Transition: “Critical analysis is about asking questions and interpreting honestly. Let’s connect this to Coursework 2 preparation.”

Interpreting Your Factor Results

When you analyse your factor’s performance, interrogate your conclusions:

Statistical vs economic significance: A t-stat of 2.1 clears the 1.96 threshold: but how confident would you be investing real money on that evidence?
Scale matters: What does 0.1% monthly alpha actually mean for an investor over a year? Is that worth pursuing?
Benchmarking performance: If the market delivers a Sharpe ratio around 0.4, what should you conclude about a factor with Sharpe of 0.3?
Robustness integrity: If some of your robustness tests pass and others fail, what story does that tell about your factor?
From paper to portfolio: What happens between calculating returns on a spreadsheet and actually implementing a trading strategy?

The Implementation Gap

Academic factor returns assume frictionless trading. Real portfolios face transaction costs, market impact, and timing constraints. How might these affect your conclusions?

Interpreting results: facilitated discussion (2-3 minutes)

This slide is deliberately framed as questions rather than answers. The goal is to prompt students to think critically about their own analysis, not provide a checklist they can mechanically apply.

Facilitation approach: Pause on each question and invite responses. Let students discover the issues rather than telling them.

Statistical vs economic significance: Ask: “If you had £10,000 to invest and saw a t-stat of 2.1, would you feel confident?” Most will hesitate. Draw out that significance is a continuum: barely clearing 1.96 is weak evidence, not strong. t > 3 represents much stronger evidence.

Scale matters: Work through an example: “0.1% monthly is 1.2% annually. After typical fund fees of 0.5-1%, what’s left for the investor?” Students will recognise this is economically trivial even if statistically significant.

Benchmarking: Ask: “If you can get Sharpe ≈ 0.4 from a passive index fund, what does Sharpe = 0.3 from your active strategy tell you?” Let them conclude that matching or underperforming the market isn’t compelling.

Robustness integrity: Pose a scenario: “You run three robustness checks. Two pass, one fails. What do you write?” Some will say report only the passes: push back gently. What would a reader think if they later discovered the unreported failure?

Implementation gap: This is where many analyses fall short. Ask: “What happens when you try to trade this strategy in real markets?” Draw out: transaction costs, market impact, timing lags, liquidity constraints. A factor rebalancing monthly might face 0.3-0.5% monthly costs.

Post-publication decline: Mention that Jensen et al. (2024) document systematic weakening of factors after publication. If students find weaker results than the original paper, that’s actually expected and worth discussing: it may reflect market learning or selection bias in the original study, not a failure of their replication.

Key insight: The coursework tests whether students can think critically about what their numbers mean: not whether they can run the calculations. The analysis is the easy part; the interpretation separates strong from weak submissions.

Transition: “We’ve covered principles: methodology, statistics, bias, critical thinking. Now let’s connect this to Coursework 2 preparation.”

Part V : Preparation for Coursework 2: Principles, Not Templates

What the Scaffold Provides vs. What You Must Provide

The scaffold notebook is deliberately comprehensive: we want you to focus on understanding and interpretation, not debugging code. The 35% Critical Analysis component is where marks are won or lost.

Scaffold provides (execution):

Working code for data loading, alpha regression, robustness checks
All necessary functions pre-written (HAC standard errors, sample splits)
Publication-quality tables and figures ready for your report

You must provide (interpretation grounded in YOUR results):

Numerical engagement: “My alpha is X bp/month (t = Y). The original paper found Z bp. This N% difference likely reflects…”
Specific robustness narrative: Which tests passed? Which failed? What does that specific pattern tell you?
Your judgment, defended: Would you invest £10,000 of your own money in this factor? Why or why not, given YOUR numbers?
Process reflection: What did you expect to find? What surprised you? What would you do differently?

Generic explanations of “why HAC matters” or “what limitations exist” won’t earn marks. Examiners want to see you grapple with your specific results.

Strategic focus: Spend 1-2 hours on code, 8-10 hours making sense of what YOUR numbers mean

Coursework 2 philosophy (120 seconds)

The scaffold notebook handles execution: you focus on making sense of YOUR results. This isn’t about coding skill; it’s about analytical judgment. The 35% Critical Analysis component rewards students who engage deeply with their specific findings, not those who produce generic explanations.

What scaffold does: Loads JKP data, calculates summary statistics, runs CAPM regression with HAC standard errors, implements robustness checks, produces publication-quality outputs. You change the factor name and run it. That’s the easy part.

What scaffold doesn’t do: Tell you what YOUR alpha means. Explain why YOUR robustness tests showed the pattern they did. Defend whether YOU would invest real money given YOUR numbers. That’s the hard part, and it’s where the 35% Critical Analysis marks are earned.

The specificity test: When marking, examiners ask: “Could this report have been written about ANY factor, or does it engage with THIS student’s specific results?” A report that says “alpha is statistically significant therefore the factor works” could apply to any analysis. A report that says “my alpha of 0.28% monthly (t = 2.1) is 40% lower than the original paper’s 0.47%, likely reflecting post-publication arbitrage” demonstrates genuine engagement.
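
The replication gap in that example is simple arithmetic; a tiny sketch (the helper name `replication_gap` is hypothetical) shows the calculation to apply to your own numbers.

```python
def replication_gap(original_alpha, replicated_alpha):
    """Fraction by which the replicated alpha falls short of the original."""
    return (original_alpha - replicated_alpha) / original_alpha

gap = replication_gap(0.47, 0.28)   # monthly alphas in percent, from the example
print(f"Replicated alpha is {gap:.0%} lower than the original")
```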

Why this matters in the GenAI era: Generic content (“HAC matters because returns exhibit autocorrelation”) is trivially generated by AI tools. But interpreting specific numerical patterns requires you to actually look at your output and think about what it means. Examiners will recognise the difference. A report that reads like it could apply to any factor analysis will score poorly on Critical Analysis regardless of how polished the prose.

Strategic focus: Spend 1-2 hours on code. Spend 8-10 hours on: (1) Reading the original paper your factor comes from. (2) Comparing your numbers to theirs. (3) Thinking about why they differ. (4) Deciding what you’d actually recommend. (5) Writing analysis grounded in YOUR specific results.

Common mistakes to avoid: (1) Writing generic explanations that don’t reference your actual numbers. (2) Spending days on code refinements while rushing the report. (3) Producing limitations sections that could apply to any factor study. (4) Avoiding a clear investment recommendation because you’re unsure: take a position and defend it.

What examiners look for: Numerical specificity throughout. “My t-stat of 1.8 suggests…” not “the t-stat suggests…”. Explicit comparison to published results. A clear recommendation you’d be willing to defend verbally. Evidence you actually thought about what your specific pattern of results means.

Mark scheme reality: Technical (25%) = scaffold handles most of this. Application (25%) = connect to real investment decisions. Critical Analysis (35%) = THIS IS WHERE MARKS ARE WON OR LOST: requires specificity and genuine engagement. Communication (15%) = clear presentation, proper citations.

Student question: “Can I use AI tools?” (You can use them for coding help, proofreading, or explaining concepts. But the interpretation of YOUR specific results must be YOUR thinking. If your critical analysis section could have been written without ever looking at your output, that’s a problem.)

Transition: “The scaffold handles execution. Today’s lecture gave you the principles. Your job is to connect principles to YOUR specific numbers.”

Questions to Ask About YOUR Results

When you have your output, interrogate it. These questions connect today’s principles to YOUR specific analysis.

About your methodology:

Did your robustness tests pass or fail? What does that specific pattern suggest?
How does your sample period compare to the original paper’s? Does that explain any differences?

About your statistics:

Is your t-stat comfortably above 2, or hovering near the threshold? What’s the practical difference?
Your alpha is X bp monthly. What does that mean for a £10m portfolio over a year?

About your judgment:

Given YOUR numbers, would you recommend this factor to a pension fund? Why or why not?
If your results are weaker than the original paper, is that replication failure, or exactly what you’d expect?

From principles to YOUR analysis (90 seconds)

This slide shifts from teaching concepts to applying them. The questions on screen are what students should ask when staring at their own output.

Facilitation approach: Read each question aloud and pause. These aren’t rhetorical: students should mentally rehearse answering them with their specific results. “When you run the scaffold and see your t-stat, you’ll need to interpret what that specific number means.”

About methodology: The robustness pattern matters. If early-sample works but late-sample fails, that suggests post-publication decline. If both fail, the factor may be data-mined. If both pass, you have stronger evidence. The pattern tells a story: students need to narrate it.

About statistics: Work through the economic magnitude concretely. “Your alpha is 0.3% monthly. On £10m, that’s £30,000 per month, £360,000 per year. After transaction costs of maybe £100,000-150,000, you’re left with £210,000-260,000. Is that worth the complexity and risk?” This grounds abstract percentages in real money.
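
The same money arithmetic as a reusable sketch, so students can plug in their own alpha. The function name and the cost figures are illustrative assumptions; alpha is annualised by simple multiplication, with no compounding.

```python
def annual_alpha_in_pounds(portfolio_value, monthly_alpha, annual_costs):
    """Return (gross, net) annual alpha in pounds, without compounding."""
    gross = portfolio_value * monthly_alpha * 12
    return gross, gross - annual_costs

portfolio = 10_000_000   # £10m portfolio
for costs in (100_000, 150_000):   # hypothetical annual transaction costs
    gross, net = annual_alpha_in_pounds(portfolio, 0.003, costs)
    print(f"costs £{costs:,}: gross £{gross:,.0f}, net £{net:,.0f}")
```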

About judgment: The pension fund question forces a commitment. Students can’t hide behind “results are mixed.” They must take a position: recommend or not, and defend it. This is exactly what examiners look for in the Critical Analysis component.

Weaker results than original paper: This is crucial. Jensen et al. (2024) show most factors weaken post-publication. If students find weaker results, that’s expected, not a failure. But they need to articulate this. “My alpha is 40% lower than the original paper, consistent with post-publication arbitrage documented by Jensen et al.”
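A statement like “my alpha is 40% lower” should come from an explicit calculation. The sketch below uses the illustrative figures quoted elsewhere in these notes (0.28% replicated vs 0.47% original); the helper `relative_decline` is a hypothetical name, not a scaffold function.

```python
def relative_decline(original, replicated):
    """Fractional shortfall of the replicated estimate relative to the original."""
    if original <= 0:
        raise ValueError("a percentage decline is only meaningful for a positive original estimate")
    return (original - replicated) / original


original_alpha = 0.47     # % per month, illustrative "original paper" figure
replicated_alpha = 0.28   # % per month, illustrative replication figure

decline = relative_decline(original_alpha, replicated_alpha)
print(f"Replicated alpha is {decline:.0%} below the original")
```

With these inputs the shortfall is about 40%. Students should report both underlying numbers alongside the percentage.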

What distinguishes strong submissions: Students who can answer these questions specifically, with numbers, showing they understand what their particular pattern of results means. Not “robustness is important” but “my early-sample t-stat of 2.8 dropped to 1.4 in the late sample, suggesting the factor has weakened over time.”

Transition: “These questions guide your interpretation. The scaffold gives you outputs: your job is to make sense of what YOUR specific outputs mean.”

Using AI Tools Appropriately

AI tools like ChatGPT and Copilot are permitted, but how you use them determines whether they help or hurt your work.

AI can help you:

Understand concepts: “Explain HAC standard errors in simple terms”
Debug code: “Why is this pandas merge failing?”
Learn techniques: “Show me how to calculate Newey-West standard errors”

AI cannot help you:

Interpret YOUR specific results: It doesn’t know your alpha is 0.28% with t = 1.9
Explain YOUR robustness pattern: It can’t see that your early-sample passed but late-sample failed
Defend YOUR recommendation: Generic “factors can be useful” isn’t a position

The Specificity Test

If your critical analysis section could have been written without ever looking at your actual output, that’s a problem, regardless of whether AI wrote it or you did.

AI tools in context (90 seconds)

This slide addresses the elephant in the room. Students will use AI tools: the question is whether they use them well or poorly.

Framing: We’re not anti-AI. We’re pro-thinking. AI tools are like calculators: useful for mechanics, useless for judgment. The scaffold already handles the mechanical parts. What’s left, interpretation, is precisely what AI can’t do well for YOUR specific results.

What AI can do: Explain concepts, debug code, provide examples. These are legitimate uses. If a student doesn’t understand HAC, asking ChatGPT “explain HAC standard errors” is fine. If their code throws an error, using Copilot to debug is fine.

What AI cannot do: Interpret specific numerical results it hasn’t seen. If you ask ChatGPT “my alpha is 0.28% with t = 1.9, what does this mean?”, it can give generic interpretation. But it doesn’t know: What factor is this? What did the original paper find? What’s the sample period? What do the robustness tests show? The meaningful interpretation requires context only YOU have.

The specificity test: This is the key insight. Generic analysis, whether AI-generated or human-written, fails because it could apply to any factor study. Strong analysis is specific: “My alpha of 0.28% is 40% below the original paper’s 0.47%, consistent with post-publication decline.” AI can’t generate this without being fed all your specific results, at which point you’ve done the thinking anyway.

Practical guidance: Use AI to learn, debug, and understand. Write your interpretation yourself, grounded in your specific numbers. If you find yourself copying AI output into your report, ask: “Is this specific to MY results or could it apply to any factor analysis?”

Disclosure: Per course policy, AI assistance must be disclosed. But more importantly, disclosed AI-generated generic analysis still scores poorly. The goal isn’t avoiding detection: it’s producing genuinely insightful analysis.

AI Use: What Helps vs. What Hurts

Appropriate use (helps your learning):

Task	Example prompt	Why it’s fine
Concept clarification	“Explain why autocorrelation inflates t-statistics”	Builds understanding
Code debugging	“Why does my HAC calculation give NaN?”	Technical problem-solving
Writing feedback	“Is this paragraph clear?”	Improves communication

Problematic use (hurts your marks):

Task	Example prompt	Why it fails
Generic interpretation	“Write a limitations section for a factor study”	Not specific to YOUR results
Boilerplate analysis	“Explain what alpha significance means”	Could apply to ANY study
Wholesale generation	“Write my critical analysis section”	No engagement with YOUR numbers

The test: Would an examiner reading your analysis know which specific factor you studied and what YOUR results showed?

Concrete examples (60 seconds)

This slide makes the distinction concrete with examples. The table format helps students see the pattern.

Appropriate use column: These are all legitimate. Learning concepts, fixing bugs, getting feedback on clarity: these supplement your thinking. You still have to do the analysis; AI helps you do it better.

Problematic use column: These shortcut the thinking. A generic limitations section reads the same whether you studied momentum, value, or quality. Boilerplate analysis about “significance” doesn’t engage with whether YOUR t-stat of 1.9 is meaningful. Wholesale generation means you haven’t thought about your specific results at all.

The test at the bottom: This is what examiners actually look for. Reading a critical analysis section, they ask: “Does this student understand their specific results?” Generic analysis, whatever its source, fails this test.

Student question anticipation: “What if I use AI to draft and then edit?” That’s fine IF your editing makes it specific. But if you start with generic AI output and just change a few words, the underlying analysis is still generic. Better to write from scratch, looking at your numbers, and use AI only for polishing.

Transition: “These guidelines apply throughout your coursework. One more thing about how we handle this…”

Academic Integrity: Detection and Verification

To maintain fairness for all students, I have developed a multi-model GenAI detection architecture that analyses submission patterns across multiple dimensions.

What this means:

All coursework submissions are processed through this system
The system flags submissions with characteristics suggesting over-reliance on AI-generated content
Flags are reviewed by me personally: the system assists, it doesn’t decide

Oral Examination Rights

I reserve the right to orally examine any student whose submission is flagged by this system. You may be asked to explain your analysis, walk through your reasoning, and demonstrate understanding of your own work.

This is not about catching you out: it’s about ensuring your degree means something. Students who genuinely engage with their analysis have nothing to worry about.

Detection and verification (60 seconds)

This slide is about transparency and deterrence. Students should know that detection systems exist and that oral examination is a possibility.

Framing: Present this matter-of-factly, not threateningly. The goal is to encourage genuine engagement, not create anxiety. Students who do their own thinking have nothing to fear.

Multi-model architecture: You don’t need to explain technical details. The key point is that it’s not a single simple check: it analyses multiple dimensions of the submission. This makes it harder to game.

Oral examination: This is the real deterrent. Students can submit AI-generated text, but can they defend it? If asked “Why did you conclude the factor is economically meaningful?”, a student who didn’t do the thinking will struggle. Make clear this is about verification, not punishment.

Fairness framing: Emphasise this protects students who do genuine work. It’s unfair to them if others get equal marks for AI-generated submissions. The system levels the playing field.

Student question anticipation: “What triggers a flag?” Don’t reveal specific details: that enables gaming. Say: “Various textual and analytical patterns. Students who engage genuinely with their own results don’t exhibit these patterns.”

Tone: Confident but not aggressive. You’re explaining a quality assurance process, not issuing threats. The oral examination option is a reasonable verification mechanism, not a punishment.

Demonstration: HAC Standard Errors in Practice

Show code: OLS vs HAC comparison

import pandas as pd
import numpy as np
import statsmodels.api as sm
from statsmodels.regression.linear_model import OLS
from statsmodels.stats.sandwich_covariance import cov_hac

# Simulate factor returns with autocorrelation (illustrative)
np.random.seed(42)
n = 240  # 20 years monthly
market = np.random.normal(0.008, 0.04, n)
# Factor with positive alpha and autocorrelation
factor = 0.005 + 0.3 * market + np.random.normal(0, 0.03, n)
for i in range(1, n):
    factor[i] += 0.3 * factor[i-1]  # Autocorrelation

# Regression
X = sm.add_constant(market)
model_ols = OLS(factor, X).fit()

# HAC standard errors (Newey-West with 6 lags)
cov_hac_matrix = cov_hac(model_ols, nlags=6)
se_hac = np.sqrt(np.diag(cov_hac_matrix))
t_hac = model_ols.params / se_hac

# Comparison table
comparison = pd.DataFrame({
    'Coefficient': model_ols.params,
    'OLS SE': model_ols.bse,
    'HAC SE': se_hac,
    'OLS t-stat': model_ols.tvalues,
    'HAC t-stat': t_hac
}, index=['Alpha', 'Beta'])

print("=== OLS vs HAC Standard Errors ===\n")
print(comparison.round(4))
print(f"\n📊 Interpretation:")
print(f"   Alpha = {model_ols.params[0]*100:.2f}% monthly ({model_ols.params[0]*12*100:.1f}% annualised)")
print(f"   OLS: t = {model_ols.tvalues[0]:.2f} (significant at 5% if |t| > 1.96)")
print(f"   HAC: t = {t_hac[0]:.2f} (adjusts for autocorrelation)")
print(f"   HAC standard error is {se_hac[0]/model_ols.bse[0]:.1f}x larger than OLS")
if abs(model_ols.tvalues[0]) > 1.96 and abs(t_hac[0]) < 1.96:
    print(f"   ⚠️  Result is significant with OLS but NOT with HAC!")
    print(f"       OLS standard errors overstate the evidence here. HAC corrects for this.")

What this demonstration shows:

HAC standard errors are typically 1.3-2× larger than OLS for autocorrelated data
t-statistics drop correspondingly (OLS t = 2.3 might become HAC t = 1.7)
Results “significant” with OLS can become insignificant with HAC
This is why HAC is required for honest time-series inference

Coursework Requirement

Using unadjusted OLS standard errors for time-series factor data loses marks for incorrect methodology. Always use HAC standard errors. The scaffold implements this automatically: you just need to understand why it matters for interpretation.

Code demonstration pedagogy (90 seconds)

This demonstration makes HAC concrete. Students see simulated data with autocorrelation, run both OLS and HAC regressions, and observe how standard errors and t-statistics change.

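What the simulation does: Creates factor returns with a 0.5% monthly constant term, a market beta of 0.3, and autocorrelation (AR(1) coefficient = 0.3 on the factor’s own lag). Because the lagged factor feeds back into the level, the CAPM intercept the regression estimates sits above 0.5% (about 0.8% monthly in expectation). This mimics real factor data. The code then runs a CAPM regression and compares OLS and HAC standard errors.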

Expected output: HAC standard error for alpha is ~1.5x larger than OLS. OLS t-stat might be 2.3 (significant), but HAC t-stat is 1.7 (not significant). This shows how ignoring autocorrelation leads to false positives.

Student takeaway: HAC isn’t optional: it’s essential. Relying on unadjusted OLS standard errors in a time-series regression is methodologically incorrect and leads to spurious significance. Always use HAC standard errors for financial time series.

Code pedagogy: Code is folded by default. Students can expand it if interested, but they don’t need to understand implementation details. The output table and interpretation are what matter. This aligns with principle-focused teaching: we’re demonstrating concepts, not training coders.

Assessment: Students who report unadjusted OLS standard errors in coursework will lose marks. Students who use HAC and explain why (in the report) demonstrate understanding and earn full marks.

Engagement: “Before seeing output, predict: will the HAC standard error be larger or smaller than OLS?” (Larger, because positive autocorrelation means fewer effective independent observations, increasing uncertainty.)
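The “fewer effective independent observations” intuition can be quantified with a standard large-sample result for the mean of an AR(1) series with coefficient rho. The effective sample size is roughly n(1 - rho)/(1 + rho), so the standard error is inflated by about sqrt((1 + rho)/(1 - rho)). This sketch is only an approximation for a regression intercept, and HAC estimates the adjustment from the data instead of assuming an AR(1) form; the function names are hypothetical.

```python
import math


def ar1_effective_n(n, rho):
    """Large-sample effective sample size for the mean of an AR(1) series."""
    if not -1 < rho < 1:
        raise ValueError("rho must lie strictly between -1 and 1")
    return n * (1 - rho) / (1 + rho)


def ar1_se_inflation(rho):
    """Approximate ratio of the true SE of the mean to the naive i.i.d. SE."""
    if not -1 < rho < 1:
        raise ValueError("rho must lie strictly between -1 and 1")
    return math.sqrt((1 + rho) / (1 - rho))


n = 240  # 20 years of monthly data, as in the demonstration
for rho in (0.0, 0.3, 0.6):
    print(f"rho = {rho:.1f}: effective n = {ar1_effective_n(n, rho):.0f}, "
          f"SE inflation = {ar1_se_inflation(rho):.2f}x")
```

With rho = 0.3, 240 monthly observations behave like roughly 129 independent ones and the standard error grows by about 1.36x; with rho = 0.6 the inflation is 2x.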

Transition: “This demonstrates HAC in practice. The scaffold notebook implements this automatically: you don’t need to code it. But understanding why it matters enables you to interpret results correctly.”

Next Steps: Week 11 Preview

Week 11 complements today by covering the prediction pathway (Coursework 2 Option B). Same principle-focused approach: understanding concepts, not copying templates.

Week 11 focus: Market prediction using factors

Predict next-month market return using lagged factor data
Compare OLS vs regularised models (ridge regression handles multicollinearity)
Walk-forward validation for honest out-of-sample testing (prevents look-ahead bias)
Evaluate predictive power: R² OOS, directional accuracy, economic value
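As a preview of what walk-forward validation and R² OOS mean in practice, here is a minimal sketch on simulated data. It assumes a single, already-lagged predictor, an expanding estimation window, and the historical mean as the benchmark forecast. The function `walk_forward_oos_r2`, the 60-month minimum window, and the simulated series are illustrative assumptions, not the Week 11 scaffold.

```python
import numpy as np


def walk_forward_oos_r2(x, y, min_train=60):
    """Expanding-window one-step-ahead forecasts; returns out-of-sample R^2.

    x[t] must already be lagged, i.e. known at the time y[t] is forecast.
    """
    preds, bench, actual = [], [], []
    for t in range(min_train, len(y)):
        # Fit only on observations strictly before t (no look-ahead)
        X_train = np.column_stack([np.ones(t), x[:t]])
        coef, *_ = np.linalg.lstsq(X_train, y[:t], rcond=None)
        preds.append(coef[0] + coef[1] * x[t])
        bench.append(y[:t].mean())   # historical-mean benchmark forecast
        actual.append(y[t])
    preds, bench, actual = map(np.asarray, (preds, bench, actual))
    sse_model = np.sum((actual - preds) ** 2)
    sse_bench = np.sum((actual - bench) ** 2)
    return 1.0 - sse_model / sse_bench


# Simulated data (illustrative only): a weak but genuine predictive signal
rng = np.random.default_rng(2)
n = 240
signal = rng.normal(0, 1, n)   # hypothetical lagged predictor
ret = 0.005 + 0.004 * signal + rng.normal(0, 0.04, n)

print(f"R2 OOS = {walk_forward_oos_r2(signal, ret):.3f}")
```

A negative R² OOS means the model forecast worse than the historical mean, which is a common and reportable outcome.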

Same pedagogical philosophy:

Principles and understanding over step-by-step instructions
Critical interpretation over mechanical execution
Preparation for 35% Critical Analysis component

Connection: Replication (today) tests if factors exist. Prediction (next week) tests if factors forecast returns. Both require rigorous methodology and critical interpretation.

Week 11 preview (60 seconds)

Week 11 complements today by covering the prediction pathway (Coursework 2 Option B). Same principle-focused approach: we teach you why walk-forward validation matters, how to interpret R² OOS, when regularisation helps. Scaffold notebook provides code; your report provides interpretation.

Connection: Replication (today) tests if factors exist. Prediction (next week) tests if factors forecast market returns. Both require rigorous methodology, honest out-of-sample testing, and critical interpretation.

Assessment: The factor investing methodology covered today underpins CW2 Scaffold B. Choose your scaffold based on interest, not perceived difficulty: all three are equally challenging at the interpretation level.

Transition: “Today provided foundations for factor replication. Practice interpreting results critically. Read Jensen et al. (2024). Start thinking about which factor you’ll replicate. See you next week for prediction methodology.”

Summary: Week 10 Key Takeaways

Today provided foundational principles for factor replication. The core message: understanding principles enables critical analysis. Scaffold gives outputs; understanding gives interpretation.

Methodology:

Factor replication tests if published findings are real (out-of-sample, robustness required)
Jensen et al. (2024): Many factors show 50% decline in performance due to selection bias

Statistical foundations:

HAC standard errors correct for autocorrelation (typically 1.3-2× larger than OLS)
Alpha isolates excess returns beyond market exposure
Robustness checks (sample split, subperiod, costs) separate signal from noise

Critical analysis earns marks:

Interpret economically: 0.5% monthly = 6% annualised, but is it meaningful after costs?
Acknowledge uncertainty: t = 2.1 is marginally significant, not “strong evidence”
Honest limitations discussion: what didn’t you test? Selection bias? Post-publication decline?

Start coursework early: read Jensen et al. (2024), run scaffold, focus on interpretation

Closing (60 seconds)

We’ve covered a lot today, but the core message is simple: understanding principles enables critical analysis. The scaffold gives you outputs; understanding gives you interpretation. Focus your effort on thinking deeply about what results mean, not on perfecting code.

Action items: (1) Read Jensen et al. (2024): it’s the foundation for understanding the replication crisis. (2) Run the scaffold notebook this week to familiarise yourself with its outputs. (3) Choose your factor (value, momentum, quality, etc.) based on interest. (4) Start drafting interpretation paragraphs: what will you say about alpha? Robustness? Limitations?

Mindset: Coursework 2 is not a coding exercise. It’s a critical thinking exercise. The code is provided; the thinking is yours. Embrace uncertainty, acknowledge limitations, interpret honestly. That’s what earns high marks.

Questions: Office hours are available for conceptual questions (not debugging). Focus questions on interpretation: “How do I discuss transaction costs?” “What does robustness failure mean?” “How do I compare to the published paper?”

Final thought: Factor replication is intellectually humbling. You’ll likely find weaker results than published papers. That’s not failure: it’s discovery. The replication crisis is real, and your honest replication contributes to cleaner science in finance.

Next week: Market prediction methodology. Same principle-focused approach. See you then.

Banz, Rolf W. 1981. “The Relationship Between Return and Market Value of Common Stocks.” Journal of Financial Economics 9 (1): 3–18. https://doi.org/10.1016/0304-405X(81)90018-0.

Fama, Eugene F., and Kenneth R. French. 1992. “The Cross-Section of Expected Stock Returns.” Journal of Finance 47 (2): 427–65. https://doi.org/10.1111/j.1540-6261.1992.tb04398.x.

———. 2015. “A Five-Factor Asset Pricing Model.” Journal of Financial Economics 116 (1): 1–22. https://doi.org/10.1016/j.jfineco.2014.10.010.

Gelman, Andrew, and Eric Loken. 2014. “The Statistical Crisis in Science.” American Scientist 102 (6): 460–65. https://doi.org/10.1511/2014.111.460.

Harvey, Campbell R. 2017. “Presidential Address: The Scientific Outlook in Financial Economics.” Journal of Finance 72 (4): 1399–1440. https://doi.org/10.1111/jofi.12530.

Harvey, Campbell R., Yan Liu, and Heqing Zhu. 2020. “False (and Missed) Discoveries in Financial Economics.” Journal of Finance 75 (5): 2503–53. https://doi.org/10.1111/jofi.12960.

Jegadeesh, Narasimhan, and Sheridan Titman. 1993. “Returns to Buying Winners and Selling Losers: Implications for Stock Market Efficiency.” Journal of Finance 48 (1): 65–91. https://doi.org/10.1111/j.1540-6261.1993.tb04702.x.

Jensen, Theis I., Bryan T. Kelly, and Lasse Heje Pedersen. 2024. “Is There a Replication Crisis in Finance?” Journal of Finance. https://doi.org/10.1111/jofi.13249.

Novy-Marx, Robert. 2013. “The Other Side of Value: The Gross Profitability Premium.” Journal of Financial Economics 108 (1): 1–28. https://doi.org/10.1016/j.jfineco.2013.01.003.