Lab 9: Factor Replication — Exploratory Analysis

Open in Colab

Objective

This lab develops understanding of factor replication principles through exploratory exercises. You’ll investigate concepts that underpin rigorous factor analysis — HAC standard errors, alpha interpretation, robustness thinking — without producing submission-ready outputs.

Important: This is not a template for Coursework 2. The scaffold notebook (open in Colab) provides that. This lab teaches you how to think about factor replication so you can interpret scaffold outputs intelligently and write critical analysis.

Time estimate: 60-90 minutes

Learning Goals

By the end of this lab, you should be able to:

  • Explain why HAC standard errors matter in time-series regression
  • Interpret alpha tests and distinguish statistical vs. economic significance
  • Evaluate robustness by comparing results across subsamples
  • Identify limitations in factor research (selection bias, transaction costs)
  • Ask critical questions about factor replicability

Setup

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
import warnings
warnings.filterwarnings('ignore')

pd.set_option('display.float_format', '{:.4f}'.format)
np.random.seed(42)

print("✓ Libraries loaded successfully")
✓ Libraries loaded successfully

About HAC standard errors. Throughout this lab we use HAC (Heteroskedasticity- and Autocorrelation-Consistent) standard errors, also known as Newey–West. These widen the confidence intervals around regression coefficients when the residuals are not independent and identically distributed — which is almost always the case in financial time series. The idiom in statsmodels is:

model = sm.OLS(y, X).fit(cov_type='HAC', cov_kwds={'maxlags': 6})
# model.bse       -> HAC standard errors already
# model.tvalues   -> HAC t-statistics already
# model.pvalues   -> HAC p-values already

We use 6 monthly lags, a common choice for monthly equity factors. Lab Part 1 explores the sensitivity of standard errors to this choice.

Part 1: Understanding Autocorrelation and HAC Standard Errors

Exercise 1.1: Visualising Autocorrelation

Financial returns often exhibit serial correlation. Let’s explore what this means and why it matters for statistical inference.

About AR(1). An AR(1) process (autoregressive of order 1) generates each observation as a weighted combination of the previous observation and fresh noise. We use the mean-deviation form below: each month’s return equals the long-run mean (0.8%) plus 40% of how far last month was from the mean, plus a new shock. The coefficient 0.4 is the persistence — how much memory the series carries from one period to the next.

# Generate two synthetic return series: one independent, one autocorrelated
n = 240  # 20 years of monthly observations

# Series 1: Independent returns (no autocorrelation) ----------
returns_iid = np.random.normal(0.008, 0.04, n)

# Series 2: AR(1) returns with persistence coefficient 0.4 ----------
returns_ar = np.zeros(n)
returns_ar[0] = np.random.normal(0.008, 0.04)
for i in range(1, n):
    returns_ar[i] = 0.008 + 0.4 * (returns_ar[i-1] - 0.008) + np.random.normal(0, 0.04)

dates = pd.date_range('2004-01-01', periods=n, freq='MS')
returns_iid = pd.Series(returns_iid, index=dates, name='IID Returns')
returns_ar  = pd.Series(returns_ar,  index=dates, name='AR(1) Returns')

# Plot both side by side
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(12, 7))

ax1.plot(returns_iid.index, returns_iid.values, alpha=0.7, linewidth=1)
ax1.axhline(y=0, color='black', linestyle='--', alpha=0.3)
ax1.set_ylabel('Monthly Return')
ax1.set_title('Simulated: Independent Returns (no autocorrelation)', fontsize=12)
ax1.grid(alpha=0.3)

ax2.plot(returns_ar.index, returns_ar.values, alpha=0.7, linewidth=1, color='red')
ax2.axhline(y=0, color='black', linestyle='--', alpha=0.3)
ax2.set_xlabel('Date')
ax2.set_ylabel('Monthly Return')
ax2.set_title('Simulated: Autocorrelated Returns (AR(1), ρ = 0.4)', fontsize=12)
ax2.grid(alpha=0.3)

plt.tight_layout()
plt.show()

print("=== Simulated series: lag-1 autocorrelation ===")
print(f"  Independent:   {returns_iid.autocorr(1):+.4f}")
print(f"  AR(1), ρ=0.4:  {returns_ar.autocorr(1):+.4f}")

=== Simulated series: lag-1 autocorrelation ===
  Independent:   -0.0465
  AR(1), ρ=0.4:  +0.4151

But how realistic is the AR(1) simulation? Let’s compare to genuine monthly return series.

# Load real US factor returns from the JKP master dataset
from pathlib import Path
import urllib.request

JKP_CANDIDATES = [
    Path('jkp_master_global_monthly.csv'),
    Path('../labs/jkp_master_global_monthly.csv'),
    Path('labs/jkp_master_global_monthly.csv'),
]
JKP_URL = ('https://raw.githubusercontent.com/quinfer/fin510-colab-notebooks/'
           'main/resources/jkp_master_global_monthly.csv')

jkp_path = next((p for p in JKP_CANDIDATES if p.exists()), None)
if jkp_path is None:
    jkp_path = Path('jkp_master_global_monthly.csv')
    print("Downloading JKP data from GitHub ...")
    urllib.request.urlretrieve(JKP_URL, jkp_path)

jkp = pd.read_csv(jkp_path, parse_dates=['date'])
us = (jkp[jkp['country'] == 'usa']
      .set_index('date')
      .sort_index()
      .loc['1963':, ['MKT', 'HML', 'MOM']]
      .dropna())

# Build a comparison table of lag-1 autocorrelations
rows = []
for name, s in [('Simulated IID',          returns_iid),
                ('Simulated AR(1) ρ=0.4',  returns_ar),
                ('Real US MKT (1963–2023)',      us['MKT']),
                ('Real US HML (1963–2023)',      us['HML']),
                ('Real US MOM (1963–2023)',      us['MOM']),
                ('Real US MKT² (squared)',       us['MKT']**2),
                ('Real US MOM² (squared)',       us['MOM']**2)]:
    rows.append({'Series':           name,
                 'Observations':     len(s),
                 'Lag-1 autocorr':   s.autocorr(1),
                 'Std dev (monthly %)': s.std() * 100})

print("=== Lag-1 autocorrelation: simulated vs real ===\n")
print(pd.DataFrame(rows).to_string(index=False,
                                    formatters={'Lag-1 autocorr': '{:+.4f}'.format,
                                                'Std dev (monthly %)': '{:.3f}'.format}))
=== Lag-1 autocorrelation: simulated vs real ===

                 Series  Observations Lag-1 autocorr Std dev (monthly %)
          Simulated IID           240        -0.0465               3.883
  Simulated AR(1) ρ=0.4           240        +0.4151               4.351
Real US MKT (1963–2023)           732        +0.0896               5.007
Real US HML (1963–2023)           732        +0.1439               3.225
Real US MOM (1963–2023)           732        -0.0204               3.179
 Real US MKT² (squared)           732        +0.1338               0.495
 Real US MOM² (squared)           732        +0.3309               0.369

Reading the table. The AR(1) simulation with ρ = 0.4 is a stylised, deliberately strong proxy. In real monthly data:

  • Market returns have weak lag-1 autocorrelation (typically ≈ 0.05–0.10) — a single month’s return tells you little about next month’s.
  • Factor returns (HML, MOM) often show moderate serial dependence, especially during momentum crashes and value droughts.
  • Squared returns show strong serial dependence (≈ 0.2–0.4) — this is volatility clustering, the phenomenon Week 4’s GARCH models address.

The key point for inference: it doesn’t matter whether the iid violation shows up in levels or in volatility — both cause OLS standard errors to understate uncertainty. HAC fixes both. Our AR(1) simulation is a clean way to see the effect at work, even though the empirical mechanism in real data is more often volatility clustering than literal autocorrelation of returns.

Discussion questions:

  1. What visual differences do you notice between the two simulated series?
  2. How does autocorrelation affect the “smoothness” of the return series?
  3. If returns are autocorrelated, are observations truly independent?
  4. Looking at the table: which real factor has the highest lag-1 autocorrelation in levels? In squared returns? What does this tell you about which factor most needs HAC inference?

Visual differences: The autocorrelated returns series appears smoother and more persistent than the independent returns. Positive returns tend to follow positive returns, and negative returns cluster together. The independent series shows more random “jumps” with less persistence: each observation appears independent of the previous one.

Smoothness effect: Autocorrelation creates smoother time series because each observation depends partially on the previous observation. With AR(1) coefficient of 0.4, about 40% of each period’s return carries forward from the previous period, creating momentum-like patterns. This smoothness is visually apparent: the autocorrelated series has fewer sharp reversals and more extended trends.

Independence violation: No, autocorrelated observations are not truly independent. Independence requires that knowing one observation provides no information about another. With autocorrelation, knowing the previous return helps predict the current return (correlation ~0.4). This violates the OLS assumption of independent errors, which is why standard errors need adjustment (HAC).

Real-data comparison (question 4): In levels, the momentum (MOM) factor typically shows the highest lag-1 autocorrelation — a residue of the short-term reversal / medium-term continuation mechanism that drives it. In squared returns, MKT² is very high (volatility clustering at the market level), and MOM² is also elevated (momentum crashes cluster). So both channels point to momentum as the factor most in need of HAC inference: its levels have non-trivial persistence and its volatility clusters strongly. HML sits between MKT and MOM on both measures. The AR(1) simulation with ρ = 0.4 overstates persistence in levels for most real series but provides a clean, visible demonstration of the same econometric problem that volatility clustering generates in practice.

Exercise 1.2: Impact of Autocorrelation on Standard Errors

Now let’s see how autocorrelation affects statistical inference. We’ll regress both series on a constant (testing if mean return is significantly different from zero).

def test_mean_return(returns, name):
    """Test whether mean return differs from zero under both OLS and HAC SEs."""
    # Regress returns on a constant only -> coefficient IS the sample mean
    X = pd.DataFrame({'const': np.ones(len(returns))}, index=returns.index)

    ols = sm.OLS(returns.values, X).fit()
    hac = sm.OLS(returns.values, X).fit(cov_type='HAC', cov_kwds={'maxlags': 6})

    return pd.DataFrame({
        'Estimate (% monthly)':  [ols.params['const'] * 100],
        'OLS SE (%)':            [ols.bse['const']    * 100],
        'HAC SE (%)':            [hac.bse['const']    * 100],
        'OLS t-stat':            [ols.tvalues['const']],
        'HAC t-stat':            [hac.tvalues['const']],
        'SE Ratio (HAC/OLS)':    [hac.bse['const'] / ols.bse['const']],
    }, index=[name])

results_iid = test_mean_return(returns_iid, 'Independent')
results_ar  = test_mean_return(returns_ar,  'Autocorrelated')

print("=== Mean Return Tests: OLS vs HAC ===\n")
print(pd.concat([results_iid, results_ar]).round(4))
=== Mean Return Tests: OLS vs HAC ===

                Estimate (% monthly)  OLS SE (%)  HAC SE (%)  OLS t-stat  \
Independent                   0.7904      0.2507      0.2353      3.1532   
Autocorrelated                1.0104      0.2808      0.3956      3.5979   

                HAC t-stat  SE Ratio (HAC/OLS)  
Independent         3.3594              0.9386  
Autocorrelated      2.5541              1.4087  

Discussion questions:

  1. For independent returns, how do OLS and HAC standard errors compare? Why?
  2. For autocorrelated returns, is HAC standard error larger or smaller than OLS? Why?
  3. Does the conclusion about statistical significance change when using HAC vs OLS?
  4. What would happen if we used OLS for autocorrelated data without realising it?

Independent returns comparison: For independent returns, OLS and HAC standard errors should be very similar (ratio close to 1.0). This occurs because OLS assumptions are satisfied: no autocorrelation means no serial dependence to correct for. HAC is designed to handle autocorrelation, but when none exists, it reduces to standard OLS standard errors.

Autocorrelated returns: HAC standard errors are larger than OLS standard errors for autocorrelated data. The ratio (HAC/OLS) typically exceeds 1.0, often 1.2-1.5x. This happens because autocorrelation reduces effective sample size: consecutive observations provide less independent information than assumed. HAC accounts for this by inflating standard errors to reflect reduced information content.

Significance changes: Yes, conclusions can change dramatically. With autocorrelated data, OLS standard errors are understated, making t-statistics appear larger than they truly are. A result might appear significant with OLS (t > 1.96) but fail with HAC (t < 1.96). This is the core problem: using OLS creates false positives.

Consequences of ignoring autocorrelation: Using OLS on autocorrelated data leads to overconfident inference. You’ll claim statistical significance when results are actually noise. In factor replication, this means falsely concluding factors have alpha when they don’t. This contributes to the replication crisis: many published factors are false discoveries due to incorrect standard errors.

Key insight: Autocorrelation violates OLS independence assumption, causing standard errors to be understated. HAC corrects this by adjusting for serial dependence. Using OLS on financial time-series data leads to false positives (claiming significance when results are actually noise).

Part 2: Alpha Tests and Economic Interpretation

Recap: a factor return is the monthly spread between two portfolios — buy the top group (e.g. cheapest 30% of stocks by book-to-market), short-sell the bottom group (most expensive 30%). The factor’s return is the difference. See the Week 9 deck for a full treatment.

Exercise 2.1: Simulating Factor Returns

Let’s create a simulated factor and test for alpha using CAPM regression.

# Simulate market and factor returns
n = 240  # 20 years monthly

# Market returns
market = np.random.normal(0.008, 0.04, n)

# Factor returns with:
# - Positive alpha (0.003 = 0.3% monthly = 3.6% annualised)
# - Beta = 0.2 (some market exposure)
# - Autocorrelation (realistic for factors)
factor = np.zeros(n)
factor[0] = 0.003 + 0.2 * market[0] + np.random.normal(0, 0.03)
for i in range(1, n):
    factor[i] = 0.003 + 0.2 * market[i] + 0.3 * factor[i-1] + np.random.normal(0, 0.03)

# Convert to pandas
dates = pd.date_range('2004-01-01', periods=n, freq='MS')
market = pd.Series(market, index=dates, name='Market')
factor = pd.Series(factor, index=dates, name='Factor')

# Summary statistics
summary = pd.DataFrame({
    'Mean (% monthly)': [market.mean() * 100, factor.mean() * 100],
    'Std Dev (%)': [market.std() * 100, factor.std() * 100],
    'Sharpe Ratio': [(market.mean() / market.std()) * np.sqrt(12), 
                     (factor.mean() / factor.std()) * np.sqrt(12)]
}, index=['Market', 'Factor'])

print("=== Summary Statistics ===\n")
print(summary.round(3))
=== Summary Statistics ===

        Mean (% monthly)  Std Dev (%)  Sharpe Ratio
Market            0.5460       3.9980        0.4730
Factor            1.1070       3.0960        1.2390

Discussion questions:

  1. Which has higher mean return: market or factor?
  2. Which has higher volatility (risk)?
  3. Which has better risk-adjusted return (Sharpe ratio)?
  4. Can you tell from summary stats alone whether factor has significant alpha?

Mean return comparison: The factor typically has higher mean return than the market (e.g., 0.6-0.8% monthly vs 0.8% monthly). This reflects the simulated positive alpha (0.3% monthly) plus market exposure (beta × market return).

Volatility comparison: The factor usually has lower volatility than the market (e.g., 3-3.5% vs 4% monthly). This occurs because the factor has lower market exposure (beta = 0.2) and idiosyncratic volatility that’s smaller than market volatility. Lower volatility doesn’t necessarily mean lower risk: factor risk includes both market and factor-specific components.

Sharpe ratio: The factor typically has a higher Sharpe ratio than the market due to higher mean return and lower volatility. Sharpe ratio = (mean / std) × √12 for annualised comparison. Higher Sharpe indicates better risk-adjusted returns, but this doesn’t guarantee statistical significance of alpha.

Summary stats limitations: No, summary statistics alone cannot determine whether alpha is statistically significant. You need regression analysis with proper standard errors (HAC) to test if alpha differs significantly from zero. Summary stats show magnitude but not statistical reliability. A factor could have high mean return but zero true alpha (just luck), or low mean return but significant alpha (after controlling for market exposure).

Exercise 2.2: CAPM Alpha Regression

Now test for alpha by regressing factor on market.

# CAPM regression with HAC inference from the start
X = sm.add_constant(pd.DataFrame({'MKT': market.values}, index=market.index))
model_ols = sm.OLS(factor.values, X).fit()
model_hac = sm.OLS(factor.values, X).fit(cov_type='HAC', cov_kwds={'maxlags': 6})

alpha_results = pd.DataFrame({
    'Coefficient': model_hac.params,
    'OLS SE':      model_ols.bse,
    'HAC SE':      model_hac.bse,
    'HAC t-stat':  model_hac.tvalues,
    'HAC p-value': model_hac.pvalues,
}, index=['Alpha (const)', 'Beta (MKT)'])

alpha_results.loc['R²', 'Coefficient'] = model_hac.rsquared

print("=== CAPM Alpha Test ===")
print("Regression: Factor = Alpha + Beta × Market + error\n")
print(alpha_results.round(4))

alpha_monthly = model_hac.params['const']
alpha_annual  = alpha_monthly * 12
t_stat        = model_hac.tvalues['const']

print(f"\nINTERPRETATION")
print(f"  Alpha       = {alpha_monthly*100:.3f}% monthly ({alpha_annual*100:.2f}% annualised)")
print(f"  Beta        = {model_hac.params['MKT']:.3f}")
print(f"  HAC t-stat  = {t_stat:.2f}")

if abs(t_stat) > 1.96:
    print("  ✓ Alpha is statistically significant at the 5% level (|t| > 1.96).")
    print("    Factor earns excess return beyond market exposure.")
else:
    print("  ✗ Alpha is NOT statistically significant (|t| < 1.96).")
    print("    Cannot reject null hypothesis of zero alpha.")

print(f"\n  R² = {model_hac.rsquared:.3f} ({model_hac.rsquared*100:.1f}% of factor variance "
      "explained by the market)")
print("\n  Note: the Harvey–Liu–Zhu (2016) correction for multiple testing recommends "
      "a stricter threshold of |t| > 3.0 for newly proposed factors.")
=== CAPM Alpha Test ===
Regression: Factor = Alpha + Beta × Market + error

               Coefficient  OLS SE  HAC SE  HAC t-stat  HAC p-value
Alpha (const)          NaN     NaN     NaN         NaN          NaN
Beta (MKT)             NaN     NaN     NaN         NaN          NaN
R²                  0.0720     NaN     NaN         NaN          NaN

INTERPRETATION
  Alpha       = 0.993% monthly (11.92% annualised)
  Beta        = 0.208
  HAC t-stat  = 3.94
  ✓ Alpha is statistically significant at the 5% level (|t| > 1.96).
    Factor earns excess return beyond market exposure.

  R² = 0.072 (7.2% of factor variance explained by the market)

  Note: the Harvey–Liu–Zhu (2016) correction for multiple testing recommends a stricter threshold of |t| > 3.0 for newly proposed factors.

Discussion questions:

  1. What is the estimated alpha? Is it positive or negative?
  2. Is alpha statistically significant (using HAC t-statistic)?
  3. What does beta tell you about factor’s market exposure?
  4. What does R² tell you about how much factor is driven by market?
  5. Even if alpha is statistically significant, is it economically meaningful? (Consider transaction costs of ~0.2-0.5% monthly for real portfolios)

Estimated alpha: Alpha is typically positive (around 0.2-0.4% monthly, 2.4-4.8% annualised), reflecting the simulated factor design with true alpha of 0.3% monthly. Positive alpha indicates the factor earns excess returns beyond what market exposure explains.

Statistical significance: Alpha may or may not be significant depending on the simulation draw. With HAC t-statistic, significance requires |t| > 1.96 at 5% level. If significant, we reject null hypothesis of zero alpha: factor generates excess returns. If not significant, we cannot conclude factor has true alpha (could be noise).

Beta interpretation: Beta around 0.2 indicates low market exposure: factor returns move with market but weakly. A beta of 0.2 means if market rises 1%, factor rises only 0.2% on average (plus alpha). Low beta suggests factor captures factor-specific risk/return, not just market movements.

R² interpretation: R² typically ranges 0.05-0.15 (5-15%), meaning market explains only a small fraction of factor variance. Most factor variation is idiosyncratic (factor-specific). Low R² is expected for factors: they’re designed to capture non-market sources of return. High R² would suggest factor is just a market proxy.

Economic meaningfulness: Statistical significance ≠ economic exploitability. Even with significant alpha of 3% annualised, transaction costs of 0.2-0.5% monthly (2.4-6% annualised) can eliminate net alpha. High-turnover factors require frequent rebalancing, increasing costs. Net alpha after costs determines whether strategy is implementable. Also consider: implementation constraints (short-selling limits), capacity constraints (strategy size), and risk (factor can underperform for extended periods).

Key insight: Alpha measures excess return not explained by market. Positive and significant alpha suggests factor generates returns beyond market exposure. But statistical significance doesn’t guarantee economic exploitability: must consider transaction costs, implementation constraints, and risk.

Part 3: Robustness Thinking

Exercise 3.1: Sample Split

A single significant result could be luck. Robustness checks test if results hold under alternative specifications. Let’s start with sample split.

# Split data at midpoint
midpoint = len(factor) // 2
dates_split = factor.index[midpoint]

# First half
market_1 = market.iloc[:midpoint]
factor_1 = factor.iloc[:midpoint]

# Second half
market_2 = market.iloc[midpoint:]
factor_2 = factor.iloc[midpoint:]

def run_alpha_test(factor_ret, market_ret, period_name):
    """Run CAPM alpha test with HAC standard errors."""
    X = sm.add_constant(pd.DataFrame({'MKT': market_ret.values}, index=market_ret.index))
    model = sm.OLS(factor_ret.values, X).fit(cov_type='HAC', cov_kwds={'maxlags': 6})

    return {
        'Period':            period_name,
        'Alpha (% monthly)': model.params['const'] * 100,
        'HAC SE':            model.bse['const']    * 100,
        't-statistic':       model.tvalues['const'],
        'Significant?':      'Yes' if abs(model.tvalues['const']) > 1.96 else 'No',
        'Beta':              model.params['MKT'],
        'R²':                model.rsquared,
        'Observations':      len(factor_ret),
    }

# Run tests
results_1 = run_alpha_test(factor_1, market_1, 
                           f"{factor.index[0].year}-{factor.index[midpoint-1].year}")
results_2 = run_alpha_test(factor_2, market_2, 
                           f"{factor.index[midpoint].year}-{factor.index[-1].year}")

# Combine
robustness = pd.DataFrame([results_1, results_2]).set_index('Period')

print("=== Robustness Check: Sample Split ===\n")
print(robustness.round(3))
=== Robustness Check: Sample Split ===

           Alpha (% monthly)  HAC SE  t-statistic Significant?   Beta     R²  \
Period                                                                         
2004-2013             0.6220  0.3680       1.6930           No 0.2050 0.0610   
2014-2023             1.3730  0.3220       4.2670          Yes 0.1980 0.0770   

           Observations  
Period                   
2004-2013           120  
2014-2023           120  

Discussion questions:

  1. Is alpha significant in both periods?
  2. Is alpha magnitude similar across periods, or does it change?
  3. If alpha was significant in first period but not second, what might that suggest?
  4. If alpha is positive in one period but negative in another, is the factor robust?
  5. What conclusions would you draw about factor stability from this test?

Significance across periods: Ideally, alpha should be significant in both periods for a robust factor. If significant in only one period, this raises concerns about sample-specificity or regime dependence. The factor might have worked in one market environment but failed in another.

Alpha magnitude stability: Robust factors show similar alpha magnitudes across periods (e.g., 3-4% annualised in both halves). Large variation (e.g., 5% in first period, 1% in second) suggests instability. Declining alpha might indicate arbitrage (traders exploiting factor, eroding profits) or changing market structure.

Significance in first period only: If alpha is significant in first period but not second, possible explanations include: (1) Sample-specific luck: factor worked by chance in first period, (2) Regime change: market structure shifted, factor no longer works, (3) Arbitrage: traders discovered and exploited factor, eliminating profits, (4) Publication bias: original paper tested during favourable period.

Sign reversal: If alpha is positive in one period but negative in another, the factor is not robust. Sign reversal suggests factor performance is unstable and unreliable. This could indicate: (1) Factor is spurious (no true edge), (2) Factor is regime-dependent (works only in specific conditions), (3) Factor has been arbitraged away (negative alpha reflects over-exploitation).

Stability conclusions: A robust factor should show consistent positive alpha across subsamples. If results vary substantially or reverse sign, factor reliability is questionable. Instability doesn’t definitively prove factor is spurious (could be genuine but regime-dependent), but it raises serious concerns about generalisability and exploitability. For investment purposes, unstable factors are risky: investors cannot rely on historical performance.

Key insight: Robust factors show consistent performance across subsamples. If results hold only in one period, the factor might be sample-specific or driven by a unique market regime. Instability doesn’t definitively prove factor is spurious, but it raises questions about reliability and generalisability.

Exercise 3.2: Rolling Window Analysis

Instead of just two periods, let’s look at how alpha evolves over time using rolling windows.

# Calculate rolling alpha (3-year window)
window = 36  # months

rolling_alpha = []
rolling_dates = []

for i in range(window, len(factor)):
    factor_window = factor.iloc[i-window:i]
    market_window = market.iloc[i-window:i]

    X = sm.add_constant(pd.DataFrame({'MKT': market_window.values},
                                     index=market_window.index))
    model = sm.OLS(factor_window.values, X).fit()

    rolling_alpha.append(model.params['const'] * 12 * 100)  # annualised %
    rolling_dates.append(factor.index[i])

rolling_alpha = pd.Series(rolling_alpha, index=rolling_dates)

# Plot
plt.figure(figsize=(12, 6))
plt.plot(rolling_alpha.index, rolling_alpha.values, linewidth=2, label='Rolling Alpha (3-year window)')
plt.axhline(y=0, color='red', linestyle='--', alpha=0.5, label='Zero Alpha')
plt.fill_between(rolling_alpha.index, 0, rolling_alpha.values, 
                 where=(rolling_alpha >= 0), alpha=0.3, label='Positive Alpha')
plt.fill_between(rolling_alpha.index, 0, rolling_alpha.values, 
                 where=(rolling_alpha < 0), alpha=0.3, color='red', label='Negative Alpha')
plt.xlabel('Date', fontsize=12)
plt.ylabel('Annualised Alpha (%)', fontsize=12)
plt.title('Factor Alpha Over Time (Rolling 3-Year Windows)', fontsize=14, fontweight='bold')
plt.legend(fontsize=10)
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

print(f"Rolling alpha statistics:")
print(f"  Mean: {rolling_alpha.mean():.2f}%")
print(f"  Std Dev: {rolling_alpha.std():.2f}%")
print(f"  Min: {rolling_alpha.min():.2f}% ({rolling_alpha.idxmin().year})")
print(f"  Max: {rolling_alpha.max():.2f}% ({rolling_alpha.idxmax().year})")
print(f"  % periods with positive alpha: {(rolling_alpha > 0).mean()*100:.1f}%")

Rolling alpha statistics:
  Mean: 13.69%
  Std Dev: 8.35%
  Min: -5.45% (2007)
  Max: 32.99% (2020)
  % periods with positive alpha: 96.1%

Discussion questions:

  1. How stable is alpha over time? Does it stay consistently positive/negative?
  2. Are there periods where alpha turns negative?
  3. What might explain variation in rolling alpha? (Market regimes? Structural changes? Arbitrage?)
  4. If you observed this pattern in real data, would you conclude the factor is reliably exploitable?

Alpha stability: Rolling alpha typically shows some variation over time: it doesn’t stay perfectly constant. However, robust factors maintain consistently positive alpha (e.g., 2-5% annualised) across most rolling windows. If alpha fluctuates wildly or frequently crosses zero, this indicates instability.

Negative alpha periods: Yes, even factors with overall positive alpha can have periods where rolling alpha turns negative. This might occur during: (1) Market stress periods (crises, crashes), (2) Regime shifts (changing market structure), (3) Temporary arbitrage pressure (traders exploiting factor, temporarily reversing it). Occasional negative periods don’t necessarily invalidate factor, but frequent or prolonged negative periods are concerning.

Variation explanations: Rolling alpha variation can reflect: (1) Market regimes: factors work better in certain market conditions (bull vs bear, high vs low volatility), (2) Structural changes: market microstructure evolves (electronic trading, algorithmic execution), reducing factor effectiveness, (3) Arbitrage: as factor becomes known, traders exploit it, eroding profits over time, (4) Sample-specific luck: early periods benefited from favourable conditions that don’t persist.

Exploitability conclusion: If rolling alpha shows: (a) Consistent positive values across most windows → factor appears reliably exploitable, (b) Declining trend over time → arbitrage concern, factor may be eroding, (c) High volatility, frequent sign changes → factor unreliable, not exploitable, (d) Early positive, later negative → publication bias or arbitrage. For investment purposes, prefer factors with stable positive rolling alpha. High variation suggests factor is risky and unreliable: investors cannot confidently expect future performance.

Important caveat — sampling variation vs. factor decay. Each 36-month rolling window contains only 36 observations, so individual point estimates are noisy by construction — standard errors on a 3-year alpha are typically ±2–3% annualised. A dip from 4% to 1% in the rolling chart could be either a genuine regime shift or simply sampling variation. Distinguishing the two requires formal structural-break tests (e.g. Chow tests, Bai-Perron breakpoints) and longer out-of-sample windows.

Key insight: Rolling analysis reveals time-variation in factor performance. Factors can work in some periods and fail in others. Stable alpha is more convincing than alpha that fluctuates wildly or reverses sign. Instability might reflect changing market conditions, arbitrage eroding profits, or sample-specific luck.

Part 4: Critical Thinking About Factor Research

Exercise 4.1: Transaction Cost Reality Check

Academic papers often report gross returns (before trading costs). But real investors face costs. Let’s explore how costs affect exploitability.

# Assume our factor requires monthly rebalancing.
# Transaction costs depend on liquidity, portfolio size, and execution strategy.
# Realistic ROUND-TRIP cost per monthly rebalance (incl. bid-ask, commissions, market impact):
#   Large-cap, low-turnover (e.g. value, size): 0.05% - 0.10% per month
#   Mid-turnover (profitability, investment):   0.10% - 0.20% per month
#   High-turnover (momentum, short-term rev.):  0.20% - 0.50% per month
# These add up to 0.6% - 6% per year, consistent with the Week 9 slides and
# Detzel, Novy-Marx & Velikov (2023).

cost_scenarios_monthly = [0.00, 0.05, 0.10, 0.20, 0.30, 0.50]

# Recompute CAPM alpha so this cell is self-contained
X_tc = sm.add_constant(pd.DataFrame({'MKT': market.values}, index=market.index))
model_tc = sm.OLS(factor.values, X_tc).fit(cov_type='HAC', cov_kwds={'maxlags': 6})
alpha_tc = model_tc.params['const']

results_costs = []
for cost in cost_scenarios_monthly:
    gross_alpha = alpha_tc * 12 * 100           # annualised %
    net_alpha   = gross_alpha - (cost * 12)     # subtract annualised cost

    results_costs.append({
        'Monthly Cost (%)': cost,
        'Annual Cost (%)':  cost * 12,
        'Gross Alpha (%)':  gross_alpha,
        'Net Alpha (%)':    net_alpha,
        'Exploitable?':     'Yes' if net_alpha > 1.0 else 'Marginal' if net_alpha > 0 else 'No',
    })

costs_df = pd.DataFrame(results_costs)
print("=== Transaction Cost Impact on Alpha ===\n")
print(costs_df.round(2))

# Visualise
plt.figure(figsize=(10, 6))
plt.plot(costs_df['Monthly Cost (%)'], costs_df['Net Alpha (%)'], 
         marker='o', linewidth=2, markersize=8)
plt.axhline(y=0, color='red', linestyle='--', alpha=0.5, label='Break-even')
plt.axhline(y=1, color='orange', linestyle='--', alpha=0.5, label='Minimum meaningful alpha')
plt.xlabel('Monthly Transaction Cost (%)', fontsize=12)
plt.ylabel('Net Annualised Alpha (%)', fontsize=12)
plt.title('How Transaction Costs Erode Factor Alpha', fontsize=14, fontweight='bold')
plt.legend(fontsize=10)
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()
=== Transaction Cost Impact on Alpha ===

   Monthly Cost (%)  Annual Cost (%)  Gross Alpha (%)  Net Alpha (%)  \
0            0.0000           0.0000          11.9200        11.9200   
1            0.0500           0.6000          11.9200        11.3200   
2            0.1000           1.2000          11.9200        10.7200   
3            0.2000           2.4000          11.9200         9.5200   
4            0.3000           3.6000          11.9200         8.3200   
5            0.5000           6.0000          11.9200         5.9200   

  Exploitable?  
0          Yes  
1          Yes  
2          Yes  
3          Yes  
4          Yes  
5          Yes  

Discussion questions:

  1. At what transaction cost level does net alpha turn negative?
  2. Even if net alpha is positive, is it large enough to justify implementation?
  3. How do transaction costs affect the “break-even” account size for implementing a factor?
  4. Why might high-turnover factors (like momentum) be harder to exploit than low-turnover factors (like value)?
  5. If you read a paper reporting 8% gross alpha, what questions would you ask about exploitability?

Break-even cost level: Net alpha turns negative when transaction costs exceed gross alpha. If gross alpha is 3.6% annualised (0.3% monthly), net alpha becomes negative when monthly costs exceed 0.3% (annualised 3.6%). This threshold depends on gross alpha magnitude: higher alpha can absorb higher costs.

Implementation justification: Even positive net alpha may be insufficient. Consider: (1) Minimum meaningful alpha: practitioners typically require 1-2% net alpha to justify implementation (covers operational costs, risk, effort), (2) Risk-adjusted returns: net alpha must compensate for factor-specific risk (volatility, drawdowns), (3) Opportunity cost: could capital be deployed more profitably elsewhere? Net alpha of 0.5% might not justify implementation if alternatives offer better risk-adjusted returns.

Break-even account size: Transaction costs are often fixed per trade (commissions) plus proportional (bid-ask spreads, market impact). Larger accounts benefit from economies of scale: fixed costs per trade become smaller percentage of capital. Break-even occurs when gross alpha exceeds total costs (fixed + proportional). Small accounts may be unprofitable due to fixed costs; large accounts face market impact (moving prices). Optimal account size balances these effects.

High-turnover vs low-turnover: High-turnover factors (momentum, rebalance monthly) face costs more frequently: 12 rebalancings annually vs 1-2 for low-turnover (value, rebalance annually). If monthly cost is 0.2%, annual cost is 2.4% for momentum vs 0.2-0.4% for value. High-turnover factors need much larger gross alpha to remain profitable. Additionally, high-turnover increases market impact risk (frequent trading moves prices against you).

Questions about 8% gross alpha: Critical questions: (1) Turnover frequency: how often does strategy rebalance? (monthly = 12× costs vs annual = 1× costs), (2) Transaction cost assumptions: did paper account for bid-ask spreads, commissions, market impact? (3) Net alpha: what is alpha after realistic costs? (4) Capacity: can strategy scale without eroding alpha? (5) Implementation constraints: short-selling limits? Liquidity constraints? (6) Out-of-sample testing: did paper test on data not used for development? (7) Robustness: do results hold across subsamples, regions, time periods?

Key insight: Gross alpha overstates exploitability. Always consider transaction costs, especially for high-turnover strategies. Even statistically significant alpha can be economically meaningless after costs. This is a major reason many published factors fail in real-world implementation.

Exercise 4.2: Selection Bias Simulation

Let’s simulate the multiple testing problem that creates false discoveries in factor research.

# Simulate many researchers each testing a potential "factor"
# None of these factors have true alpha (all are pure noise)
# But some will appear significant just by luck (5% false positive rate)

num_researchers = 100
num_obs = 240
significance_level = 0.05

# Storage for results
p_values = []
significant_factors = []

for i in range(num_researchers):
    # Simulated "factor" has zero true alpha (pure noise).
    # Market has realistic mean (~0.8% monthly) so the CAPM intercept is
    # identified correctly; only the FACTOR alpha is null by construction.
    random_factor = np.random.normal(0.000, 0.04, num_obs)   # true α = 0
    market_sim    = np.random.normal(0.008, 0.04, num_obs)   # realistic MKT

    X = sm.add_constant(pd.DataFrame({'MKT': market_sim}))
    model = sm.OLS(random_factor, X).fit()

    p_val = model.pvalues['const']
    p_values.append(p_val)

    if p_val < significance_level:
        significant_factors.append({
            'Factor':             i + 1,
            'Alpha (% monthly)':  model.params['const']  * 100,
            't-statistic':        model.tvalues['const'],
            'p-value':            p_val,
        })

# How many false positives?
num_significant = len(significant_factors)
expected_false = num_researchers * significance_level

print(f"=== Selection Bias Simulation ===")
print(f"Number of 'factors' tested: {num_researchers}")
print(f"True alpha for all factors: 0% (pure noise)")
print(f"Significance level: {significance_level*100}%")
print(f"\nRESULTS:")
print(f"  Number appearing 'significant': {num_significant}")
print(f"  Expected false positives: {expected_false:.1f}")
print(f"  All significant findings are FALSE POSITIVES (Type I errors)")

if significant_factors:
    print(f"\n'Published' Factors:")
    print(pd.DataFrame(significant_factors).round(3))

print(f"\nIMPLICATION:")
print(f"  If only 'significant' factors get published, the literature")
print(f"  overrepresents spurious findings by construction.")
print(f"  This is the source of the replication crisis.")

# Distribution of p-values
plt.figure(figsize=(10, 6))
plt.hist(p_values, bins=20, edgecolor='black', alpha=0.7)
plt.axvline(x=0.05, color='red', linestyle='--', linewidth=2, label='Significance threshold')
plt.xlabel('p-value', fontsize=12)
plt.ylabel('Frequency', fontsize=12)
plt.title('Distribution of p-values for Null Factors (True Alpha = 0)', fontsize=14, fontweight='bold')
plt.legend(fontsize=10)
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

print("\nNote: under the null hypothesis of zero true alpha, p-values should be "
      "UNIFORMLY distributed on [0, 1]. The ~5% that fall below 0.05 are false "
      "positives. In real research, a left-skewed p-value distribution (many small "
      "values) indicates either genuine effects OR p-hacking.")
=== Selection Bias Simulation ===
Number of 'factors' tested: 100
True alpha for all factors: 0% (pure noise)
Significance level: 5.0%

RESULTS:
  Number appearing 'significant': 5
  Expected false positives: 5.0
  All significant findings are FALSE POSITIVES (Type I errors)

'Published' Factors:
   Factor  Alpha (% monthly)  t-statistic  p-value
0      45            -0.6440      -2.6130   0.0100
1      52             0.5530       2.1320   0.0340
2      59            -0.5880      -2.3550   0.0190
3      82             0.6280       2.3460   0.0200
4      97            -0.8510      -3.4110   0.0010

IMPLICATION:
  If only 'significant' factors get published, the literature
  overrepresents spurious findings by construction.
  This is the source of the replication crisis.


Note: under the null hypothesis of zero true alpha, p-values should be UNIFORMLY distributed on [0, 1]. The ~5% that fall below 0.05 are false positives. In real research, a left-skewed p-value distribution (many small values) indicates either genuine effects OR p-hacking.

Discussion questions:

  1. How many “significant” results appeared even though true alpha is zero?
  2. Is this close to the expected 5% false positive rate?
  3. If journals only publish significant results, what fraction of published factors are false positives?
  4. How does this simulation relate to the ~300 published equity factors in real finance literature?
  5. What practices could reduce false discovery rate? (Pre-registration? Higher significance thresholds? Out-of-sample testing?)

False positive count: With 100 factors tested and 5% significance level, we expect ~5 false positives (100 × 0.05). The simulation typically produces 3-7 significant results despite zero true alpha. This matches expected false positive rate: even with perfect null hypothesis, 5% of tests appear significant by chance.

Expected false positive rate: Yes, results should be close to 5% false positive rate. This is the definition of Type I error: when null hypothesis is true (alpha = 0), we incorrectly reject it 5% of the time. The simulation demonstrates this fundamental statistical property: multiple testing guarantees false discoveries.

Published false positive fraction: If journals only publish significant results, the fraction of false positives depends on: (1) True discovery rate: if 10% of tested factors have true alpha, then 10 true + 5 false = 15 published, false positive rate = 5/15 = 33%, (2) If no true factors exist: all published factors are false positives (100%), (3) File drawer problem: null results remain unpublished, inflating apparent success rate. Real literature likely has 20-50% false positives due to selection bias.

Relation to ~300 published factors: The simulation directly models the factor discovery process. With hundreds of researchers testing factors, many false positives get published whilst null results remain unpublished. Jensen et al. (2024) document that most published factors fail replication: consistent with high false discovery rate. The ~300 published factors likely include 60-150 false positives (20-50% false discovery rate), explaining replication failures.

Reducing false discovery rate: Practices to reduce false positives: (1) Pre-registration: commit to hypotheses and tests before seeing data, prevents p-hacking, (2) Higher significance thresholds: use t > 3 (instead of 1.96) or Bonferroni correction for multiple tests, (3) Out-of-sample testing: require replication on fresh data before publication, (4) Disclosure: publish all tested factors, not just significant ones, (5) Bayesian methods: incorporate prior beliefs, require stronger evidence, (6) Replication requirements: journals require independent replication before publication. These practices reduce false discoveries but don’t eliminate them: fundamental tradeoff between discovery and false positive control.

Key insight: Multiple testing creates false discoveries even under rigorous statistical practices. With hundreds of factors tested, many spurious findings get published whilst null results remain in file drawers. This selection bias inflates published returns and explains why many factors fail replication. Jensen et al. (2024) document this systematically.

Part 5: Advanced Extension — What Is Inside a Factor? (Optional)

NoteAbout this extension

This section is optional and designed for students who want to deepen their understanding of factor construction. It is directly relevant to the Data quality and preparation decisions component of your Coursework 2 report. Allow approximately 30–40 minutes.

Motivation

Parts 1–4 used pre-built factor return series (simulated or from JKP). But where do those factor returns come from? Every long-short factor portfolio is constructed from individual stock returns: sorting stocks into groups based on some characteristic, then taking the return difference between the top and bottom groups. Understanding this construction process reveals hidden design choices (weighting, rebalancing frequency, universe selection, survivorship treatment) that affect factor returns.

In this extension, you will build a simple momentum factor from individual S&P 500 stock prices and compare it to the pre-built JKP MOM factor. The differences are instructive: they show how construction choices affect results and why data quality matters.

Setup: Load stock-level data

The dataset below contains monthly returns for S&P 500 constituents with point-in-time membership filtering — a stock appears only in months when it was actually in the index. This prevents look-ahead bias: without it, we would implicitly assume a 1995 portfolio “knew” which stocks would be in the S&P 500 twenty years later, which is impossible in real time.

The cell below tries to load the data locally; if it isn’t present (e.g. when running in Colab), it downloads from the course’s raw-data repository:

from pathlib import Path
import urllib.request

DATA_URL = ('https://raw.githubusercontent.com/quinfer/fin510-colab-notebooks/'
            'main/resources/sp500_monthly_returns.parquet')

LOCAL_CANDIDATES = [
    Path('sp500_monthly_returns.parquet'),
    Path('../labs/sp500_monthly_returns.parquet'),
    Path('labs/sp500_monthly_returns.parquet'),
]
sp500_path = next((p for p in LOCAL_CANDIDATES if p.exists()), None)
if sp500_path is None:
    sp500_path = Path('sp500_monthly_returns.parquet')
    print("Downloading S&P 500 monthly returns from GitHub ...")
    urllib.request.urlretrieve(DATA_URL, sp500_path)
    print(f"  saved to {sp500_path.resolve()}")

stocks = pd.read_parquet(sp500_path).sort_values(['ticker', 'date'])

print(f"\nS&P 500 monthly returns panel")
print(f"  Rows:     {len(stocks):,}")
print(f"  Tickers:  {stocks['ticker'].nunique():,}")
print(f"  Period:   {stocks['date'].min().date()}{stocks['date'].max().date()}")

monthly_n = stocks.groupby('date')['ticker'].nunique()
print(f"  Stocks per month: {monthly_n.median():.0f} median, "
      f"{monthly_n.min()} min, {monthly_n.max()} max")

S&P 500 monthly returns panel
  Rows:     110,617
  Tickers:  627
  Period:   1996-01-31 → 2023-12-31
  Stocks per month: 342 median, 200 min, 477 max

Discussion: Why does the number of stocks per month vary? What does “point-in-time membership filtering” prevent?

Why does N vary? The S&P 500 is not a static list of 500 tickers. Constituents are added and removed over time as firms grow, shrink, merge, or delist. Additionally, the historical return file may have missing months for a given ticker (IPO occurred mid-sample, temporary suspension of trading, corporate action). So N is roughly the number of firms that (a) were in the S&P 500 during that month and (b) had a usable price observation that month.

What does point-in-time filtering prevent? It prevents look-ahead bias (also called survivorship bias in this context). If, in January 1995, we filtered the universe to today’s S&P 500 members, we would be implicitly assuming a 1995 portfolio manager “knew” which firms would enter the index over the next 30 years. Those future entrants are mostly winners by construction — they survived and grew big enough to be admitted. Backtests that use “current-constituent” membership typically overstate returns by 1–2% per year.

Socratic probe: Under point-in-time filtering, when Enron (then an S&P 500 name) collapsed in 2001, our dataset should still contain its pre-collapse returns and its collapse itself. Would removing Enron from the panel (because it no longer trades) bias our momentum factor’s returns upward or downward? Why?

Exercise 5.0: Data quality check

Before using the data, inspect it for anomalies. Free stock price datasets often have issues with split adjustments and corporate actions.

Calibration point: a typical S&P 500 stock returns around ±10% on its best or worst month in a given year. Monthly returns above +50% or below –50% are almost always data errors (failed split adjustment, ticker reassignment, corporate action) rather than real price movements.

print("Monthly return distribution:")
print(stocks['monthly_ret'].describe().round(4))

extreme_pos = stocks[stocks['monthly_ret'] >  1.0]
extreme_neg = stocks[stocks['monthly_ret'] < -0.8]
print(f"\nReturns > +100%: {len(extreme_pos)}")
print(f"Returns <  -80%: {len(extreme_neg)}")

if len(extreme_pos) > 0:
    print("\nLargest monthly returns (inspect for data errors):")
    print(extreme_pos.nlargest(5, 'monthly_ret')[['date', 'ticker', 'monthly_ret']].to_string(index=False))
Monthly return distribution:
count   110617.0000
mean         0.0101
std          0.0962
min         -0.8000
25%         -0.0396
50%          0.0108
75%          0.0590
max          2.0000
Name: monthly_ret, dtype: float64

Returns > +100%: 8
Returns <  -80%: 0

Largest monthly returns (inspect for data errors):
      date ticker  monthly_ret
2009-08-31    AIG       2.0000
2020-04-30    APA       2.0000
2020-07-31     GE       2.0000
2008-08-31    MBI       1.7352
2009-05-31    GNW       1.5085

Discussion: Are any of these extreme returns plausible for S&P 500 stocks? What might cause a +200% or -90% monthly return in this dataset? (Hint: think about stock splits, ticker reassignments, and mergers.) The prep script has already removed the worst offenders, but always check your data before building on it.

Are they plausible? Almost never for a large-cap S&P 500 constituent. The worst single-month return in the 2008 crisis for a surviving blue-chip was typically –30% to –50%, and the largest single-month rallies (e.g. GameStop-style retail squeezes) happened in small-caps, not in the index’s long-standing members. Monthly returns above +100% or below –80% in a well-behaved S&P 500 panel are usually data artefacts.

Typical causes:

  1. Failed split adjustments — a 3-for-1 stock split that is recorded in price but not in the return calculation produces a spurious +200% “return.”
  2. Ticker reassignments — a ticker that was retired and later reissued to an unrelated company. The panel appears to show continuous history for “ticker XYZ,” but two distinct firms are being glued together.
  3. Mergers and spin-offs — the acquired firm’s last observation may show an artificially large one-period return if the buyout premium and a distribution-in-kind are both booked in the same month without proper adjustment.
  4. Delisting returns — a firm that delists at $0.10 after a crash may show a final –99% month that reflects bankruptcy, not ongoing operations. Academic convention is to keep the delisting return (CRSP assigns a specific delisting return code), because dropping it understates downside risk (this is a form of survivorship bias).

Methodological consequence: in production-quality research you would either (a) clip returns at sensible limits (e.g. ±80% monthly) with a logged diagnostic, (b) cross-validate against a second data source, or (c) inspect the underlying price series around the event and correct manually. The lab’s build script does option (a) for the worst outliers.

Socratic probe: suppose you clip a +300% return down to +80%. In a long-short factor that happens to rank that stock in the “long” leg that month, are you understating or overstating the factor’s true return? Does your answer change if the +300% is a genuine corporate event (buyout premium) rather than a data error?

Exercise 5.1: Build a momentum signal

Academic momentum is typically defined as the cumulative return from month t-12 to t-2 (the most recent month is skipped to avoid short-term reversal effects). We then sort stocks into terciles (top, middle, bottom) each month and compute the return of the top-minus-bottom portfolio in month t.

def build_momentum_signal(df):
    """
    Compute the academic 12-1 momentum signal for each stock-month.

    The signal for month t is the cumulative return from t-12 to t-2 (inclusive).
    Month t-1 is skipped to purge the short-term reversal anomaly.

    Example (month = Dec 2020):
        Include returns from Dec 2019 through Oct 2020 (11 months).
        Skip     November 2020                 (the t-1 reversal month).
        Evaluate portfolio performance in     December 2020.

    The prior-11-month cumulative return is computed at each (t-1), then lagged
    one more month to align with the formation date.
    """
    df = df.sort_values(['ticker', 'date']).copy()

    df['cum_11m'] = (
        df.groupby('ticker')['monthly_ret']
          .transform(lambda x: (1 + x).rolling(11, min_periods=11).apply(np.prod, raw=True) - 1)
    )
    df['mom_signal'] = df.groupby('ticker')['cum_11m'].shift(1)
    return df

stocks = build_momentum_signal(stocks)

valid = stocks.dropna(subset=['mom_signal'])
valid_per_month = valid.groupby('date')['ticker'].nunique()
print(f"Months with valid momentum signal: {len(valid_per_month)}")
print(f"Stocks per month with signal:      {valid_per_month.median():.0f} median")
Months with valid momentum signal: 325
Stocks per month with signal:      331 median

Exercise 5.2: Sort into portfolios and compute factor return

Note on weighting: academic factor construction typically uses value weighting — each stock’s contribution to the portfolio return is proportional to its market capitalisation. Our DIY version below uses equal weighting for simplicity (we don’t have historical market caps), which gives more influence to smaller stocks within each tercile. Since momentum is known to be stronger in smaller stocks but our universe is only large-cap S&P 500 names, the two approaches differ less here than they would in the full US cross-section.

def momentum_factor_return(df, n_groups=3):
    """
    Each month, sort stocks into n_groups by momentum signal.
    Return the equal-weighted top-minus-bottom portfolio return.

    We skip any month where we cannot form n_groups distinct terciles
    (too few stocks, or too many ties in the signal).
    """
    df = df.dropna(subset=['mom_signal', 'monthly_ret']).copy()

    results, skipped = [], 0
    for date, group in df.groupby('date'):
        if len(group) < n_groups * 5:
            skipped += 1
            continue

        group['tercile'] = pd.qcut(group['mom_signal'], n_groups,
                                   labels=False, duplicates='drop')
        if group['tercile'].nunique() < n_groups:
            skipped += 1
            continue

        top    = group.loc[group['tercile'] == n_groups - 1, 'monthly_ret'].mean()
        bottom = group.loc[group['tercile'] == 0,             'monthly_ret'].mean()
        results.append({'date': date, 'mom_diy': top - bottom})

    print(f"  Months skipped (insufficient coverage): {skipped}")
    return pd.DataFrame(results).set_index('date')

mom_diy = momentum_factor_return(stocks, n_groups=3)
print(f"\nDIY momentum factor")
print(f"  Months: {len(mom_diy)}")
print(f"  Mean monthly return: {mom_diy['mom_diy'].mean()*100:.3f}%")
print(f"  Annualised Sharpe:   "
      f"{mom_diy['mom_diy'].mean() / mom_diy['mom_diy'].std() * np.sqrt(12):.3f}")
  Months skipped (insufficient coverage): 0

DIY momentum factor
  Months: 325
  Mean monthly return: -0.083%
  Annualised Sharpe:   -0.058

Exercise 5.3: Compare to JKP MOM

Now load the JKP pre-built momentum factor and compare.

# Load JKP US momentum (fetches from GitHub if not found locally)
JKP_URL_US = ('https://raw.githubusercontent.com/quinfer/fin510-colab-notebooks/'
              'main/resources/jkp_master_global_monthly.csv')

jkp_path = next((p for p in [Path('jkp_master_global_monthly.csv'),
                              Path('../labs/jkp_master_global_monthly.csv'),
                              Path('labs/jkp_master_global_monthly.csv')]
                 if p.exists()), None)
if jkp_path is None:
    jkp_path = Path('jkp_master_global_monthly.csv')
    print("Downloading JKP monthly master from GitHub ...")
    urllib.request.urlretrieve(JKP_URL_US, jkp_path)

jkp = pd.read_csv(jkp_path, parse_dates=['date'])
jkp_us = (jkp[jkp['country'] == 'usa'][['date', 'MOM']]
          .dropna()
          .set_index('date'))

common = mom_diy.join(jkp_us, how='inner')
print(f"Overlapping months: {len(common)}")
print(f"\n{'':25s} {'DIY MOM':>10s} {'JKP MOM':>10s}")
print(f"{'Mean (% monthly)':<25s} {common['mom_diy'].mean()*100:>10.3f} {common['MOM'].mean()*100:>10.3f}")
print(f"{'Std (% monthly)':<25s} {common['mom_diy'].std()*100:>10.3f} {common['MOM'].std()*100:>10.3f}")
print(f"{'Annualised Sharpe':<25s} "
      f"{common['mom_diy'].mean()/common['mom_diy'].std()*np.sqrt(12):>10.3f} "
      f"{common['MOM'].mean()/common['MOM'].std()*np.sqrt(12):>10.3f}")
print(f"{'Correlation':<25s} {common['mom_diy'].corr(common['MOM']):>10.3f}")

fig, ax = plt.subplots(figsize=(11, 5))
(1 + common['mom_diy']).cumprod().plot(ax=ax, label='DIY momentum (SP500 tercile)', linewidth=1.5)
(1 + common['MOM']).cumprod().plot(ax=ax, label='JKP MOM (US, all stocks)', linewidth=1.5)
ax.axhline(1, color='gray', linewidth=0.8, linestyle='--')
ax.set_ylabel('Cumulative return ($1 invested)')
ax.set_title('DIY vs JKP Momentum Factor')
ax.legend()
plt.tight_layout()
plt.show()
Overlapping months: 325

                             DIY MOM    JKP MOM
Mean (% monthly)              -0.083      0.338
Std (% monthly)                4.969      3.963
Annualised Sharpe             -0.058      0.295
Correlation                    0.845

You should observe: (a) the two series are highly correlated (~0.85), confirming they measure the same underlying phenomenon, but (b) the DIY factor has a substantially weaker mean return (near zero) compared to JKP MOM (~0.3% per month). This gap is the lesson.

Discussion questions (directly relevant to your Coursework 2 report):

  1. Correlation vs level: The two factors are highly correlated but the DIY version earns a weaker premium. What does this tell you about the direction versus the magnitude of factor returns across different constructions?
  2. Universe: Your DIY factor uses only S&P 500 stocks (~340 per month). JKP uses the full US cross-section (thousands of stocks including small-caps). The momentum premium is known to be stronger among smaller stocks. How does restricting to large-caps affect the premium?
  3. Weighting: Your DIY factor is equal-weighted within terciles; JKP is value-weighted. Which approach gives more weight to smaller stocks within each group? Why does this matter for momentum specifically?
  4. Rebalancing: Both rebalance monthly. What happens if you rebalance quarterly instead? (Hint: think about transaction costs versus signal decay.)
  5. Survivorship: The S&P 500 membership file provides point-in-time filtering. What would happen if you simply used “all stocks that are in the S&P 500 today” and looked up their historical prices? Why is this dangerous?
  6. Data quality: The source dataset required cleaning (removal of tickers with split-adjustment errors, clipping of extreme returns). How would you describe these data quality decisions in a reflective report? Which decisions affect your results most?

These answers deliberately emphasise method over conclusion — the specific magnitudes will depend on your sample window and any choices you tweak. The point is to show you how to frame the comparison for your CW2 report.

1. Correlation vs level. High correlation (~0.85) tells us the two factors agree strongly on the direction of monthly returns — when JKP MOM is up, so is the DIY version, and vice versa. They are measuring the same underlying phenomenon (medium-term momentum). But correlation is invariant to a constant difference in means, so it tells us nothing about magnitude. The fact that the DIY mean return is near zero while JKP’s is ~0.3% per month indicates the two constructions disagree systematically on the size of the premium, even though they agree on its sign. Methodological takeaway: report both correlation and a mean-difference test before claiming two datasets “replicate.”

2. Universe. Restricting to the S&P 500 (≈340–500 stocks) eliminates almost all small-cap names, which is exactly where momentum has historically been strongest (Fama & French 2012; Novy-Marx 2013). The momentum premium in large-caps is roughly 40–60% smaller than in the full cross-section, and considerably less statistically significant once HAC is applied. Your DIY factor is therefore a large-cap-only version of MOM, and the “missing premium” is substantially the small-cap tail. Write-up hint: rather than calling this a replication failure, call it a universe-restricted replication and quantify how much of the gap is attributable to universe choice vs other design differences.

3. Weighting. Equal weighting gives every stock in the tercile the same portfolio weight, so smaller stocks within each group receive proportionally more weight than they would under value weighting. This matters for momentum specifically because (i) small stocks drive the premium, so equal-weighting should help the factor, but (ii) small stocks also have higher transaction costs and lower capacity. Paradoxically, equal weighting typically raises the reported gross premium but shrinks the net premium once realistic trading costs are subtracted. Value weighting is preferred in academic work because it represents what a large investor can actually implement.

4. Rebalancing. Moving from monthly to quarterly rebalancing:

  • Lower turnover → lower transaction costs — roughly a third of the monthly figure (though not exactly: momentum holdings are more persistent over quarterly horizons than over month-to-month).
  • Signal decay — the academic 12-1 signal is strongest at the one-month holding horizon; premia decay roughly linearly from t+1 to t+12. So at a 3-month holding horizon the factor captures perhaps 70% of the monthly gross premium.
  • Net trade-off: the breakeven point depends on your assumed TC level. For high-cost environments (retail trading, small-cap stocks) quarterly is typically superior; for low-cost environments (institutional, large-cap) monthly remains optimal. Jegadeesh & Titman (1993) originally worked at 3- and 6-month horizons partly for this reason.

5. Survivorship. Using today’s S&P 500 membership to filter historical prices is a classic survivorship bias: the firms that are in today’s index have all survived, grown large enough to be admitted, and not been relegated. Backtests built this way systematically overstate returns by 1–2% per year and understate drawdown risk (because failed firms are excluded). For momentum specifically, you would most severely bias the long leg upward (winners in our filtered sample are more likely to be true winners, because future losers are excluded ex post), creating a phantom premium. Point-in-time membership fixes this by only including a stock in month t if it was in the S&P 500 as of month t.

6. Data quality. A reflective report should be transparent (list every decision), quantitative (report the magnitude of each adjustment), and falsifiable (show results with and without the adjustment). Good template:

  • Clipping extreme returns at ±80% monthly — typically removes ~20 observations out of ~350,000; negligible impact on factor-level returns but important for tail-risk measures (Value-at-Risk, max drawdown).
  • Excluding tickers with known split-adjustment failures — high-impact because these tickers would otherwise generate spurious factor-return spikes; document the specific tickers removed.
  • Treatment of delisting returns — often the single highest-impact decision; dropping delisted observations is the canonical form of survivorship bias.

Rank decisions by magnitude of effect on your headline result (run the analysis with and without each adjustment). Decisions that don’t move the result need only be mentioned; decisions that do need to be defended in detail.

Overall framing for the CW2 report. The lesson is not that the DIY replication “failed” — it is that construction choices are methodological variables, not neutral defaults. Different universe, weighting, rebalancing, and data-cleaning conventions produce systematically different factor returns, and academic “factor premia” are really shorthand for “the premium under a specific construction convention.” Jensen, Kelly & Pedersen (2023) and Harvey & Liu (2016) both make this point at length; your DIY exercise is a hands-on demonstration of exactly what they document.

TipCoursework 2 connection

If you complete this extension, you have first-hand experience with the data quality and preparation decisions that the CW2 report asks about. In your report, you can discuss:

  • How factor returns depend on construction choices (universe, weighting, rebalancing)
  • Why the JKP dataset embeds specific methodological assumptions
  • What “replication” really means when construction details differ
  • How survivorship bias arises and why point-in-time filtering matters

This moves your report from abstract discussion to evidence-based critique, which is what distinguishes strong submissions.


Part 6: Connecting to Coursework 2

What You’ve Learned vs. What You’ll Apply

Today’s lab explored concepts:

  • HAC standard errors correct for autocorrelation (prevents false positives)
  • Alpha tests isolate excess returns beyond market exposure
  • Robustness checks (sample split, rolling windows) test stability
  • Transaction costs reduce net alpha (affects exploitability)
  • Selection bias creates false discoveries (replication crisis)

For Coursework 2, you’ll:

  1. Use scaffold notebook to run these analyses on real JKP factor data
  2. Interpret outputs using understanding developed today
  3. Write critical analysis discussing significance, robustness, limitations
  4. Engage with Jensen et al. (2024) on selection bias and replication
  5. Provide investment recommendations considering costs and risks

Critical Analysis Questions to Ask

When interpreting your Coursework 2 results, ask:

Statistical questions:

  • Is alpha statistically significant using HAC standard errors?
  • How does significance change if you use higher threshold (t > 3)?
  • Are results robust to sample split? If not, why?

Economic questions:

  • What is alpha magnitude after transaction costs?
  • Is net alpha large enough to justify implementation?
  • How does your result compare to original published paper?

Methodological questions:

  • Could selection bias explain why original paper found stronger results?
  • Has factor performance declined post-publication (arbitrage erosion)?
  • What limitations affect your replication (data, sample period, methodology)?
  • Are your HAC standard errors sensitive to the lag choice (6 vs 12 vs 24 months)? If the conclusion about significance flips when you vary the lag, how confident can you be?

Investment questions:

  • Would you recommend implementing this factor in a real portfolio?
  • What risks and constraints would practitioners face?
  • What further robustness tests would you want before making investment decision?

Next Steps

  1. Read Jensen et al. (2024): “Is There a Replication Crisis in Finance?” — essential for understanding replication methodology and selection bias
  2. Run scaffold notebook: See what actual outputs look like with real JKP data
  3. Choose factor: Value (HML) or Momentum (MOM) are well-documented; Quality (RMW) or Size (SMB) also work
  4. Draft interpretation: Practice writing paragraphs interpreting alpha, robustness, limitations
  5. Office hours: Ask conceptual questions about interpretation (not code debugging)

Summary

Today’s lab developed principles for factor replication:

  • HAC standard errors are essential for honest inference in time-series data
  • Alpha tests reveal excess returns, but significance ≠ exploitability
  • Robustness checks separate real patterns from sample-specific noise
  • Transaction costs can eliminate seemingly significant alpha
  • Selection bias inflates published results, explaining replication failures

These principles enable critical analysis: the 35% component of Coursework 2. Scaffold provides outputs; understanding provides interpretation. Focus your effort on thinking deeply about what results mean, not on perfecting code.

Week 11 preview: Market prediction using factors. Same principle-focused approach: understanding concepts, not copying templates.