---
title: "Factor Investing and Factor Replication"
subtitle: "Week 9: From Theory to Disciplined Evidence"
format:
  html:
    toc: true
    toc-depth: 3
    number-sections: true
    code-fold: true
    code-summary: "Show code"
bibliography:
  - ../resources/reading.bib
  - ../resources/reading_supp.bib
execute:
  echo: true
  warning: false
  message: false
jupyter: fin510
---
```{python}
#| include: false
import sys
from pathlib import Path
sys.path.insert(0, str(Path("scripts").resolve()))
from bloomberg_loader import load_bloomberg
```
## Why Factor Investing Matters {#sec-why-factors}
Suppose you are managing a pension fund for UK university staff, and your mandate is to deliver returns above inflation over a thirty-year horizon. You could buy an index tracker and accept market returns. You could hire a stock picker and hope for skill. Or you could ask a different question entirely: are there measurable characteristics of firms, observable today, that systematically predict which stocks will outperform in the future? If so, you could tilt your portfolio toward those characteristics, harvesting a premium that relies neither on market timing nor on the judgement of any single analyst.
This question defines the intellectual programme of factor investing, and it has consumed the best minds in academic finance for over half a century. The answer, in short, is yes, with caveats that occupy the rest of this chapter.
@cochrane2011presidential framed the modern research agenda in his presidential address to the American Finance Association: the central puzzle in asset pricing is not whether average returns vary across assets (they clearly do), but *why* they vary. Stocks with low prices relative to book value earn higher average returns than glamorous growth stocks. Past winners continue to outperform past losers. Smaller firms tend to outperform larger ones, though less reliably than once believed. Profitable firms outperform unprofitable ones. These patterns, documented across decades and across countries, are what we call factors.
For practitioners, the implications are concrete. Wesley Gray, a former US Marine officer turned quantitative fund manager, built his firm Alpha Architect around the principle that systematic, rules-based factor strategies can outperform discretionary stock-picking, provided one has the discipline to follow the rules during painful drawdowns. His books, *Quantitative Value* [@gray2012quantitative] and *Quantitative Momentum* [@gray2016momentum], serve as practitioner translations of the academic evidence. Gray's central message is instructive for students: the edge in factor investing comes not from discovering secret signals, but from disciplined implementation of well-documented ones. Most investors lack the patience to hold a value portfolio through a five-year drawdown, and that behavioural gap is itself part of the premium.
This chapter takes you through three layers of understanding. First, you will build practitioner intuition about what factors are, how they are constructed, and why they might earn premia. Second, you will learn the canonical academic models that formalise this intuition, from the Capital Asset Pricing Model through to the modern five-factor and q-factor frameworks. Third, and most critically for your assessment, you will encounter the frontier literature that challenges naive factor claims on grounds of false discovery, weak identification, overfitting, and implementation costs. By the end, you should be equipped not only to replicate a factor portfolio, but to evaluate honestly whether the result is credible, and to write the kind of critical analysis that distinguishes strong assessment from mechanical output.
::: {.callout-note}
**Learning objectives.** After completing this chapter, you should be able to: (1) explain the economic logic behind the major equity factors; (2) construct and interpret a long-short factor portfolio using JKP data with HAC standard errors; (3) evaluate competing factor models using out-of-sample evidence; (4) identify the principal threats to factor claims (multiple testing, weak factors, test-asset sensitivity, transaction costs); (5) connect academic factor research to practitioner implementation; (6) write a reflective critical analysis of a factor replication exercise.
:::
*Evidence confidence: High. This section draws on canonical sources [@cochrane2011presidential; @fama1992cross; @fama2015five] and practitioner works [@gray2012quantitative; @gray2016momentum] with no frontier-dependent claims.*
## From CAPM to Multi-Factor Models {#sec-capm-to-multifactor}
### The CAPM and its failures
The starting point for any discussion of factor investing is the model it superseded. The Capital Asset Pricing Model, developed independently by Sharpe, Lintner, and Mossin in the 1960s, makes a powerful prediction: the only characteristic of a stock that should determine its expected return is its sensitivity to the market portfolio, measured by beta. High-beta stocks bear more systematic risk and should earn higher returns. Low-beta stocks bear less and should earn less. All other characteristics, including the firm's size, its valuation ratios, and its recent price performance, should be irrelevant once you control for beta.
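
In symbols, the prediction described above can be written as follows. The notation is the standard textbook one and is introduced here for illustration: $R_i$ is the return on stock $i$, $R_m$ the return on the market portfolio, and $R_f$ the risk-free rate.

$$
\mathbb{E}[R_i] - R_f = \beta_i \left( \mathbb{E}[R_m] - R_f \right),
\qquad
\beta_i = \frac{\operatorname{Cov}(R_i, R_m)}{\operatorname{Var}(R_m)}
$$

Nothing other than $\beta_i$ appears on the right-hand side, which is exactly what makes the model testable: if size or valuation ratios help to explain average returns once $\beta_i$ is accounted for, the equation fails.
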
The prediction is elegant, but the data rejected it decisively. By the early 1980s, Banz [-@banz1981relationship] had documented that small-capitalisation stocks earned higher average returns than large-capitalisation stocks, even after adjusting for market beta. This could not be explained within the CAPM framework without redefining what "risk" means.
### The Fama-French revolution
The decisive empirical blow came from @fama1992cross, who showed that two characteristics, firm size and the ratio of book value to market value, absorbed nearly all of the cross-sectional variation in average stock returns that beta was supposed to explain. Their subsequent three-factor model [-@fama1993common] formalised this finding by constructing tradable factor portfolios:
- **Market (MKT):** the excess return on a broad equity index over the risk-free rate.
- **Size (SMB, Small Minus Big):** the return on small-capitalisation stocks minus the return on large-capitalisation stocks.
- **Value (HML, High Minus Low):** the return on high book-to-market (value) stocks minus the return on low book-to-market (growth) stocks.
The three-factor model was extended to five factors by @fama2015five, adding:
- **Profitability (RMW, Robust Minus Weak):** the return on firms with high operating profitability minus those with low profitability, building on the insight from @novymarx2013other that profitability predicts returns as powerfully as value.
- **Investment (CMA, Conservative Minus Aggressive):** the return on firms with low asset growth (conservative investors) minus those with high asset growth (aggressive investors).
In parallel, @jegadeesh1993returns documented the momentum effect: stocks that performed well over the past three to twelve months continued to outperform over the following three to twelve months, while past losers continued to underperform. @carhart1997persistence incorporated momentum into a four-factor model and showed that much of the apparent persistence in mutual fund performance was explained by momentum exposure rather than genuine skill. The q-factor model of @hou2015digesting offered an alternative investment-based motivation, replacing the value factor with investment and profitability factors grounded in the production side of the economy.
These models collectively define the canonical vocabulary of factor investing. When a practitioner says "we have a value tilt", they mean the portfolio loads positively on HML. When an analyst says "the fund's alpha is zero after controlling for factors", they mean the fund's returns are fully explained by its exposures to market, size, value, momentum, and profitability, with no residual return attributable to skill.
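
The statement that "alpha is zero after controlling for factors" refers to the intercept of a time-series regression of a fund's excess return on the factor returns. The sketch below illustrates the idea on simulated data, using a plain-NumPy OLS with Newey-West (Bartlett-kernel) HAC standard errors. The function name `ols_newey_west`, the lag choice of 6, and every number in the simulation are assumptions made for illustration; none of them comes from the course dataset or the cited papers.

```python
import numpy as np

def ols_newey_west(y, X, max_lags=6):
    """OLS with an intercept and Newey-West (Bartlett-kernel) HAC standard errors."""
    y = np.asarray(y, dtype=float)
    X = np.column_stack([np.ones(len(y)), np.asarray(X, dtype=float)])
    xtx_inv = np.linalg.inv(X.T @ X)
    beta = xtx_inv @ X.T @ y
    resid = y - X @ beta
    # Score contributions x_t * e_t; their long-run variance is the HAC "meat".
    scores = X * resid[:, None]
    meat = scores.T @ scores
    for lag in range(1, max_lags + 1):
        weight = 1.0 - lag / (max_lags + 1.0)        # Bartlett kernel weight
        gamma = scores[lag:].T @ scores[:-lag]       # lag-l autocovariance of the scores
        meat += weight * (gamma + gamma.T)
    cov = xtx_inv @ meat @ xtx_inv
    se = np.sqrt(np.diag(cov))
    return beta, se, beta / se

# Simulated example: a hypothetical fund with factor exposure and zero true alpha.
rng = np.random.default_rng(0)
n_months = 600
factors = rng.normal(0.005, 0.03, size=(n_months, 3))    # stand-ins for MKT, SMB, HML
true_betas = np.array([1.0, 0.3, 0.5])
fund_excess = factors @ true_betas + rng.normal(0.0, 0.01, n_months)

beta, se, tstat = ols_newey_west(fund_excess, factors)
print(f"alpha (monthly): {beta[0]:.5f}, HAC t-stat: {tstat[0]:.2f}")
print("estimated loadings:", np.round(beta[1:], 2))
```

Because the simulated fund has no skill by construction, the estimated alpha should be close to zero and the loadings close to `true_betas`. With real data you would replace `factors` with the relevant factor columns and `fund_excess` with the fund's return in excess of the risk-free rate.
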
### Why might factors earn premia?
Two broad families of explanation compete for each factor. The risk-based view holds that factors compensate investors for bearing systematic risks that are painful in bad states of the world. Value stocks, on this view, tend to be financially distressed firms that perform especially poorly during recessions, and the value premium is compensation for that conditional risk. Small stocks are less liquid and harder to sell in a crisis. The behavioural view holds that factors exploit persistent investor mistakes. Momentum, on this reading, arises because investors underreact to new information, and prices drift gradually toward fundamental value. Value stocks are cheap because investors have overreacted to bad news and pushed prices below fundamental value.
The honest answer is that neither explanation fully accounts for all factors, and the frontier literature has shown that disentangling the two is harder than once believed. @kozak2018interpreting demonstrate formally that reduced-form factor models are observationally equivalent under both rational and behavioural data-generating processes: a model that prices the cross-section well tells us *what* predicts returns, not *why*. This interpretation boundary is critical for assessment: when you report that a factor earns a significant premium, you should resist the temptation to assert that the premium is "compensation for risk" or "exploitation of a bias" without additional structural evidence that your regression cannot provide.
*Evidence confidence: High. CAPM, FF3, FF5, momentum, and q-factor are canonical with extensive replication. The interpretation-boundary claim (Kozak, Nagel, and Santosh, 2018) is KB core_034, full-text coded, strong claim strength.*
## How Factors Are Constructed {#sec-factor-construction}
Understanding factor construction is essential for interpreting factor returns and for identifying the practical decisions that affect results. The mechanics are straightforward in principle, but the details matter more than students typically expect.
### The long-short portfolio
A factor portfolio is a zero-investment, long-short strategy. At each rebalancing date (typically monthly or annually), the researcher:
1. Ranks all stocks in the investable universe by a characteristic (e.g., book-to-market ratio for value).
2. Assigns stocks to groups. The simplest approach is a median split (top half vs. bottom half); the standard academic approach uses quintiles or deciles.
3. Forms a long position in the stocks with the desired characteristic (high book-to-market for value) and a short position in those with the opposite characteristic (low book-to-market).
4. Computes the factor return as the difference in returns between the long and short legs.
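The resulting return series is intended to isolate the premium associated with that characteristic. Because the long and short positions are dollar-matched, broad market movements affect both legs in a similar way and largely cancel, so the factor return mainly reflects the differential performance of the two legs. The cancellation is only approximate: dollar-neutral is not the same as beta-neutral, and if the two legs have different market betas the factor retains some market exposure.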
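
The four steps above can be sketched in a few lines of pandas. This is a minimal illustration on simulated data, not the construction used by Fama and French or by JKP: the portfolios are equal-weighted, breakpoints are taken over the whole simulated universe, and the function name `long_short_factor` and all parameter values are assumptions. Row `t` of `returns` is assumed to be the return earned after the characteristic in row `t` of `signal` was observed.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n_months, n_stocks = 120, 500
dates = pd.period_range("2010-01", periods=n_months, freq="M")

# Simulated panel: a characteristic (think book-to-market) and subsequent returns.
signal = pd.DataFrame(rng.normal(size=(n_months, n_stocks)), index=dates)
market = rng.normal(0.006, 0.04, size=(n_months, 1))      # common market shock
noise = rng.normal(0.0, 0.08, size=(n_months, n_stocks))
returns = pd.DataFrame(market + 0.002 * signal.to_numpy() + noise, index=dates)

def long_short_factor(signal, returns, n_groups=5):
    """Equal-weighted top-group minus bottom-group return at each date."""
    factor = []
    for i in range(len(signal)):
        # Steps 1-2: rank on the characteristic and assign groups (0 = lowest).
        groups = pd.qcut(signal.iloc[i], n_groups, labels=False)
        r = returns.iloc[i]
        # Steps 3-4: long the top group, short the bottom group, take the difference.
        factor.append(r[groups == n_groups - 1].mean() - r[groups == 0].mean())
    return pd.Series(factor, index=signal.index, name="long_minus_short")

factor = long_short_factor(signal, returns)            # quintile sort
factor_terciles = long_short_factor(signal, returns, n_groups=3)
print(f"quintile spread: {factor.mean() * 100:.2f}% per month")
print(f"tercile spread:  {factor_terciles.mean() * 100:.2f}% per month")
```

In this simulation the market shock cancels exactly because every stock is given the same market exposure; real long and short legs need not be so well matched. Changing `n_groups` is a quick way to see how much a result depends on the sorting choice.
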
### Construction choices matter
A recurring theme in this chapter is that apparently minor construction choices can have material effects on results. @fama2015five use NYSE breakpoints and value-weighted portfolios within each leg. The q-factor model uses different breakpoints and different definitions of investment. The JKP dataset [@jensen2024replication] standardises construction across 153 factors, making comparisons more reliable, but even here, choices about weighting, breakpoints, and rebalancing frequency affect magnitudes.
Gray emphasises this point from a practitioner perspective. In *Quantitative Value*, he documents how different value metrics (book-to-market, earnings yield, EBITDA/EV) can produce meaningfully different portfolio compositions and returns [@gray2012quantitative]. The lesson for students is not that one metric is "correct", but that construction sensitivity should be reported and discussed. If your factor result changes substantially when you switch from quintile to tercile sorts, that instability is itself informative and should appear in your critical analysis.
### Using JKP and Bloomberg data
For this chapter and for the assessment, you will work with two complementary datasets. The JKP factor data portal (jkpfactors.com) provides pre-constructed monthly factor returns for 153 published factors across multiple regions. The course dataset (`jkp_master_global_monthly.csv`) covers 71 countries and 14 factor themes; the US sample runs from January 1926 to December 2023 and the UK sample from January 1986 to December 2023. This is your primary dataset for factor replication and comparison exercises. Bloomberg data, loaded via the course `load_bloomberg()` utility, provides market context: benchmark index returns (SPY, UKX), ETF performance (ISF, VUKE), and the real-world implementation perspective that JKP's academic factors intentionally abstract away from.
**What this teaches students:** JKP data lets you focus on interpretation rather than data engineering. But the abstraction is deliberate: pre-constructed factors hide the construction choices that affect results. When you use JKP returns, you are inheriting a particular set of decisions about breakpoints, weighting, and universe. Acknowledging this in your assessment demonstrates methodological awareness.
```{python}
#| eval: true
#| code-fold: true
#| code-summary: "Show: loading Bloomberg benchmark data"
import pandas as pd
import numpy as np
bb = load_bloomberg(tickers=["SPY", "UKX"])
spy = (bb.loc[bb["ticker"] == "SPY"]
         .set_index("date")["PX_LAST"]
         .pct_change().dropna())
ukx = (bb.loc[bb["ticker"] == "UKX"]
         .set_index("date")["PX_LAST"]
         .pct_change().dropna())
print(f"SPY daily obs: {len(spy)}, mean daily return: {spy.mean()*100:.4f}%")
print(f"UKX daily obs: {len(ukx)}, mean daily return: {ukx.mean()*100:.4f}%")
```
*Evidence confidence: High. Factor construction mechanics are textbook-level [@fama1993common; @fama2015five; @jensen2024replication]. Gray's practitioner observations are consistent with academic evidence on construction sensitivity.*
## Factor Performance: What the Data Shows {#sec-factor-performance}
Before subjecting factors to critical scrutiny, it is useful to establish what the raw empirical record looks like. Using JKP factor returns, we can examine the major factors' average returns, volatilities, Sharpe ratios, and drawdown profiles.
### Summary statistics
The canonical factors display heterogeneous performance profiles. Over the full JKP US sample (January 1926 to December 2023 in the course dataset), the market factor has the highest average return (approximately 9% annualised) but also the highest volatility (approximately 21% annualised), yielding an annualised Sharpe ratio of about 0.43. Among the long-short factors, investment (CMA) has the highest Sharpe ratio at roughly 0.40, followed by momentum (MOM) and profitability (RMW) at approximately 0.28 each, and value (HML) at about 0.27. Size (SMB) has the weakest premium, with a Sharpe ratio of approximately 0.23. Over the commonly used post-1963 subsample, these magnitudes are somewhat higher (HML and MOM both approach 0.4, and CMA rises to approximately 0.45), but the ordering is broadly preserved.
```{python}
#| eval: true
#| code-fold: true
#| code-summary: "Show: JKP factor summary statistics (US)"
from pathlib import Path
import pandas as pd
import numpy as np
# Look for the course dataset relative to the project root or the chapter folder
candidate_paths = [
    Path("labs/jkp_master_global_monthly.csv"),
    Path("../labs/jkp_master_global_monthly.csv"),
]
jkp_path = next((p for p in candidate_paths if p.exists()), None)
if jkp_path is None:
    raise FileNotFoundError("Could not locate jkp_master_global_monthly.csv")
jkp_all = pd.read_csv(jkp_path, parse_dates=["date"])
jkp_us = jkp_all[jkp_all["country"] == "usa"].set_index("date").sort_index()
factors = ["MKT", "SMB", "HML", "MOM", "RMW", "CMA"]
full = jkp_us[factors].dropna()
post1963 = jkp_us.loc["1963-01-01":, factors].dropna()

def annualised_stats(df):
    """Annualised mean, volatility and Sharpe ratio from monthly returns."""
    return pd.DataFrame({
        "ann_mean_%": df.mean() * 12 * 100,
        "ann_vol_%": df.std() * np.sqrt(12) * 100,
        "ann_sharpe": (df.mean() / df.std()) * np.sqrt(12),
    }).round(3)
print("=== Full sample (1926-2023) ===")
print(annualised_stats(full).to_string())
print("\n=== Post-1963 sample ===")
print(annualised_stats(post1963).to_string())
```
These full-sample statistics, however, conceal substantial time variation. The value premium was strongly positive from the 1960s through the early 2000s but turned sharply negative from approximately 2017 to 2020, as growth and technology stocks dominated equity markets. @arnott2021reports examined this period and argued that the value premium's disappearance was largely a valuation-spread expansion (growth stocks became even more expensive relative to value stocks) rather than a permanent structural change. Whether this interpretation is correct remains debated, but the episode illustrates a critical point: even well-documented factors can experience prolonged drawdowns that test investor discipline.
Momentum experienced an even more dramatic failure. In the spring of 2009, as equity markets recovered from the global financial crisis, momentum portfolios suffered a drawdown exceeding 50% in a single quarter as previous losers (beaten-down financials and cyclicals) surged and previous winners fell. @asness2013value document that value and momentum tend to be negatively correlated, so a combined value-plus-momentum portfolio would have been partially hedged, but the point stands: factor investing is not risk-free arbitrage.
**What this teaches students:** Summary statistics are necessary but not sufficient. A factor can have an attractive full-sample Sharpe ratio while experiencing multi-year drawdowns that would cause most investors to abandon it. Gray's central practitioner insight is relevant here: the premium exists, in part, *because* it is painful to harvest. If it were easy, everyone would do it, and the premium would disappear.
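
Drawdowns of the kind discussed above can be computed directly from a return series. The helper below is a minimal sketch; the name `max_drawdown` and the hand-made return series are illustrative assumptions, not course utilities or real factor data.

```python
import pandas as pd

def max_drawdown(returns):
    """Largest peak-to-trough loss of the cumulative wealth path (a negative number)."""
    wealth = (1.0 + returns).cumprod()
    # Treat the starting wealth of 1.0 as the first peak.
    running_peak = wealth.cummax().clip(lower=1.0)
    drawdown = wealth / running_peak - 1.0
    return drawdown.min()

# Hand-made monthly returns: two good months, two bad ones, then a recovery.
toy = pd.Series([0.02, 0.03, -0.10, -0.20, 0.05, 0.04, 0.06, 0.03])
print(f"mean monthly return: {toy.mean():.2%}")
print(f"maximum drawdown:    {max_drawdown(toy):.1%}")
```

A 10% loss followed by a 20% loss compounds to a 28% drawdown, which is what the function reports for this series. Applied to a column of monthly factor returns, the same function gives the worst peak-to-trough loss an investor holding that factor would have experienced.
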
### Subsample instability
A robust factor should deliver positive returns across different subsamples, not just in the full sample. When you split the data at the midpoint (or at a structurally meaningful date such as the dot-com crash or the global financial crisis), do the Sharpe ratios hold up? For your assessment, this subsample analysis is one of the most important robustness exercises. Shorter samples have less statistical power, so a lower t-statistic in any single subsample is to be expected; what matters is whether the sign and rough magnitude of the premium persist. A factor whose alpha is statistically significant over the full sample (1926 to 2023 in the course dataset) but insignificant, or of inconsistent sign, in every decade-long subsample should be treated with scepticism.
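
A subsample check of this kind can be sketched as follows. The function name `subsample_table` and the simulated factor, whose premium is switched off halfway through by construction, are assumptions for illustration. The t-statistic is the simple i.i.d. version; for the assessment you would use HAC standard errors instead.

```python
import numpy as np
import pandas as pd

def subsample_table(returns, n_splits=2, periods_per_year=12):
    """Mean, t-statistic and Sharpe ratio over consecutive, equal-length subsamples."""
    rows = []
    for positions in np.array_split(np.arange(len(returns)), n_splits):
        chunk = returns.iloc[positions]
        mean, sd, n = chunk.mean(), chunk.std(), len(chunk)
        rows.append({
            "start": str(chunk.index[0]),
            "end": str(chunk.index[-1]),
            "n_obs": n,
            "ann_mean_%": round(mean * periods_per_year * 100, 2),
            "t_stat": round(mean / (sd / np.sqrt(n)), 2),     # simple i.i.d. t-stat
            "ann_sharpe": round(mean / sd * np.sqrt(periods_per_year), 2),
        })
    return pd.DataFrame(rows)

# Simulated factor: a 0.6% monthly premium that vanishes after the first 360 months.
rng = np.random.default_rng(1)
dates = pd.period_range("1964-01", periods=720, freq="M")
premium = np.where(np.arange(720) < 360, 0.006, 0.0)
toy_factor = pd.Series(premium + rng.normal(0.0, 0.02, 720), index=dates)

table = subsample_table(toy_factor, n_splits=2)
print(table.to_string(index=False))
print(subsample_table(toy_factor, n_splits=1).to_string(index=False))
```

The full-sample row can look healthy even though all of the premium comes from the first half. Setting `n_splits=6` on a 60-year sample gives decade-long subsamples.
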
*Evidence confidence: High. Performance statistics are directly observable from JKP data. The value drawdown is documented in Arnott et al. (2021). Momentum crash dynamics are documented in Asness, Moskowitz, and Pedersen (2013).*
## The Factor Zoo Problem {#sec-factor-zoo}
### Four hundred factors and counting
@harvey2016and gave their paper a provocative title, "... and the Cross-Section of Expected Returns", and documented that over 300 characteristics had been published as "significant" predictors of stock returns. By the time of their updated count, the number exceeded 400. The implication was stark: the finance literature was producing factors far faster than the economy could plausibly produce distinct sources of systematic risk. Many of these factors, perhaps most, must be false discoveries arising from multiple testing, data mining, and publication bias.
The logic is straightforward. If you test 400 candidate factors at the conventional 5% significance level, you expect 20 to appear significant by chance alone, even if none of them truly predicts returns. If journals preferentially publish significant results and file away null findings, the published literature will systematically overstate the true number of genuine factors.
Harvey, Liu, and Zhu proposed a practical correction: raising the significance threshold from the conventional t-statistic of 1.96 (corresponding to a 5% p-value) to approximately 3.0, which adjusts for the estimated number of tests conducted across the literature. The threshold is derived from formal multiple-testing adjustments (Bonferroni, Holm, and Benjamini-Hochberg-Yekutieli), but because the true number of tests, including those never published, can only be estimated, a single cut-off of 3.0 is best read as a rule of thumb rather than an exact correction. It captures the spirit of the problem: in a world of many tests, stronger evidence should be required before a new factor is declared real.
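
The arithmetic behind this argument is easy to check by simulation. The sketch below draws 400 "factors" whose true mean return is zero and counts how many clear each threshold. The sample length, volatility, and seed are arbitrary assumptions, and the simulated factors are independent, whereas real candidate factors are correlated.

```python
import numpy as np

rng = np.random.default_rng(2016)
n_factors, n_months = 400, 600

# Every simulated factor has a true mean of zero, so any "significant"
# t-statistic is a false discovery by construction.
null_returns = rng.normal(0.0, 0.03, size=(n_months, n_factors))
means = null_returns.mean(axis=0)
std_errors = null_returns.std(axis=0, ddof=1) / np.sqrt(n_months)
t_stats = means / std_errors

counts = {thr: int((np.abs(t_stats) > thr).sum()) for thr in (1.96, 3.0)}
for thr, n_sig in counts.items():
    print(f"|t| > {thr}: {n_sig} of {n_factors} null factors look significant")
```

At the 1.96 threshold the count should be in the neighbourhood of 20, in line with the 5% calculation above, while only a handful at most should survive a threshold of 3.0.
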
### Disciplined approaches to the zoo
The frontier literature has developed more formal approaches. @feng2020taming propose a double-selection LASSO procedure that tests the incremental contribution of a new factor while controlling for a high-dimensional set of existing factors. Their finding is sobering: most proposed factors are redundant once you properly control for the factors that are already known. The apparent novelty of many published factors reflects omitted-variable bias in their original tests, not genuinely new information about the cross-section.
@giglio2021thousands develop a formal false-discovery-rate (FDR) framework for large-scale alpha testing, incorporating latent factor adjustment and bootstrap validity. Their procedures allow researchers to control the expected proportion of false discoveries among the set of factors declared significant. The key pedagogical message is that FDR control represents a principled alternative to ad hoc t-statistic thresholds, though its practical implementation requires careful attention to benchmark specification and finite-sample properties.
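
To see what controlling the false discovery rate means mechanically, the sketch below applies the classic Benjamini-Hochberg step-up rule to a set of made-up t-statistics. This is a deliberately simpler technique than the procedure in @giglio2021thousands, which adds latent factor adjustment and bootstrap inference; the function names and the t-statistics are illustrative assumptions.

```python
import math

def two_sided_p(t_stat):
    """Two-sided p-value under a standard normal approximation."""
    return math.erfc(abs(t_stat) / math.sqrt(2.0))

def benjamini_hochberg(p_values, fdr=0.05):
    """Return a list of booleans: True where the hypothesis is rejected."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])    # indices, smallest p first
    k_max = 0
    for rank, i in enumerate(order, start=1):
        # The k-th smallest p-value is compared with the threshold fdr * k / m.
        if p_values[i] <= fdr * rank / m:
            k_max = rank
    rejected = set(order[:k_max])       # reject everything up to the largest passing rank
    return [i in rejected for i in range(m)]

t_stats = [4.2, 3.5, 2.9, 2.3, 2.1, 1.7, 1.2, 0.8, 0.4, 0.1]    # made-up values
p_vals = [two_sided_p(t) for t in t_stats]
keep = benjamini_hochberg(p_vals, fdr=0.05)

naive = sum(p < 0.05 for p in p_vals)
print(f"significant at the naive 5% level: {naive}")
print(f"significant after BH at 5% FDR:    {sum(keep)}")
```

In this example five of the ten candidates pass the conventional 5% test, but only the three with the largest t-statistics survive the Benjamini-Hochberg rule.
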
@mclean2016does show that anomaly returns are roughly 26% lower out of sample and 58% lower after academic publication than in the original sample period; the 32-percentage-point difference is their estimate of the effect of publication itself, consistent with informed trading eroding mispricings once they become widely known. @jensen2024replication provide perhaps the most comprehensive subsequent assessment. Using 153 published factors and a consistent methodology, they show that most factors replicate directionally (the sign of the premium is correct) but with attenuated magnitudes, typically 30 to 50 per cent lower than originally reported. Cross-regional replication is weaker still: many US-documented factors do not survive in European or Asian data. The JKP dataset you use for assessment is the output of this replication exercise, which means you are already working with more honestly estimated factor returns than those in the original papers.
**What this teaches students:** When you select a factor for the assessment, you are choosing from a population that includes an unknown proportion of false discoveries. Your critical analysis should engage with this reality. Ask: does my factor survive the higher t-statistic threshold? Is its post-publication performance weaker than its original-sample performance? Would it survive the Feng, Giglio, and Xiu double-selection test against known factors?
*Evidence confidence: High. Harvey, Liu, and Zhu (2016) is canonical. Feng, Giglio, and Xiu (2020) and Giglio, Liao, and Xiu (2021) are KB core_012 and core_001 respectively, both full-text coded with strong claim strength. Jensen, Kelly, and Pedersen (2023) is canonical replication evidence.*
## Model Comparison Discipline {#sec-model-comparison}
### Why in-sample fit is not enough
Suppose you estimate a five-factor model and find that it explains 95% of the variation in a set of test portfolio returns, with all factors statistically significant. Should you conclude that you have found the correct model of the cross-section? The frontier literature gives a clear answer: not necessarily, and possibly not at all.
@kan2024insample show that in-sample Sharpe ratios of multi-factor models systematically overstate attainable out-of-sample performance. Estimation risk, the unavoidable noise in estimated factor loadings and risk premia, creates an upward bias in IS Sharpe ratios that can be large in practical sample sizes. Their framework provides exact finite-sample and asymptotic distributions for both IS and OOS Sharpe ratios, and the gap between the two is often economically significant: a model that looks impressive in-sample can be mediocre or worse out-of-sample.
@barillas2018choosing formalise the model-selection problem by proposing a unified framework based on the maximum squared Sharpe ratio of each model's factors. Their key contribution is a bootstrap procedure that compares IS and OOS rankings, revealing that IS-optimal models are frequently not the best OOS. The practical implication is that any factor-model comparison should include explicit OOS evaluation, not just IS fit statistics.
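
The direction of the bias can be illustrated with a small Monte Carlo experiment. This is a sketch under simplifying assumptions (five uncorrelated, normally distributed factors with identical hypothetical premia), not the exact finite-sample results of @kan2024insample. In each simulated sample, tangency weights are estimated from the data and then evaluated twice: on the sample that produced them, and under the true moments.

```python
import numpy as np

rng = np.random.default_rng(7)
n_factors, n_months, n_sims = 5, 240, 500
true_mean = np.full(n_factors, 0.003)            # hypothetical monthly premia
true_vol = 0.03                                  # monthly volatility of each factor
true_cov = np.eye(n_factors) * true_vol**2       # uncorrelated by assumption
true_sharpe = np.sqrt(true_mean @ np.linalg.solve(true_cov, true_mean))

is_sharpes, oos_sharpes = [], []
for _ in range(n_sims):
    sample = rng.normal(true_mean, true_vol, size=(n_months, n_factors))
    mu_hat = sample.mean(axis=0)
    cov_hat = np.cov(sample, rowvar=False)
    weights = np.linalg.solve(cov_hat, mu_hat)             # estimated tangency weights
    # Sharpe ratio the estimation sample appears to promise.
    is_sharpes.append(np.sqrt(mu_hat @ weights))
    # Sharpe ratio the same weights earn under the true moments.
    oos_sharpes.append(weights @ true_mean / np.sqrt(weights @ true_cov @ weights))

print(f"true maximum Sharpe (monthly): {true_sharpe:.3f}")
print(f"mean in-sample Sharpe:         {np.mean(is_sharpes):.3f}")
print(f"mean out-of-sample Sharpe:     {np.mean(oos_sharpes):.3f}")
```

The in-sample figure overstates the true maximum because the weights are fitted to sampling noise, while the out-of-sample figure can never exceed it because estimated weights are not the optimal ones. Increasing `n_factors` or shortening `n_months` widens the gap.
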
### Bayesian approaches and their pitfalls
The Bayesian literature offers an alternative framework for model comparison that naturally incorporates parameter uncertainty and model uncertainty. @bryzgalova2023bayesian run what they describe as "two quadrillion models" using Bayesian model averaging (BMA) with spike-and-slab priors. Their central finding is that no single sparse factor model reliably dominates: the BMA-weighted SDF consistently outperforms any individual sparse model in both IS and OOS evaluation. The pedagogical implication is powerful: the search for "the" correct factor model may be misguided, and honest uncertainty quantification (averaging across models rather than selecting one) produces more robust pricing.
However, even Bayesian approaches require care. Barillas and Shanken [-@barillas2019comparing] demonstrate that the marginal-likelihood comparisons used in earlier Bayesian model-comparison work required correctly constructed priors to be valid; improperly specified priors could produce misleading rankings. This is a useful teaching moment: Bayesian methods are not automatically more reliable than frequentist ones; they simply relocate the assumptions from test design to prior specification.
### Spurious fit and identification failures
Perhaps the most unsettling finding comes from @bryzgalova2018spurious, who shows that standard maximum-likelihood estimation can select factors that are entirely spurious. Under weak identification (when factors have low correlation with the true pricing kernel), the estimator can produce perfect in-sample fit with factors that have no genuine pricing content. Rank tests and misspecification diagnostics are prerequisites to model comparison, not optional supplements. This finding reinforces a simple but important principle for students: impressive R-squared values and significant t-statistics are necessary but not sufficient evidence that a model is correct.
**What this teaches students:** When you present factor regression results in your assessment, discuss not only what the model fits, but what it might be overfitting. Acknowledge that IS performance overstates OOS performance, and, where possible, report both. If your factor's alpha is significant in the full sample but not in a held-out subsample, say so honestly. The evaluation framework rewards intellectual honesty, not clean results.
*Evidence confidence: High. All claims in this section are grounded in KB papers with full-text coding and strong or moderate-strong claim strength: core_000 [@kan2024insample], core_010 [@barillas2018choosing], core_002 [@bryzgalova2023bayesian], core_003 [@barillas2019comparing], and core_033 [@bryzgalova2018spurious].*
## Test Assets, Weak Factors, and What We Can Actually Learn {#sec-test-assets}
### The test-asset problem
Factor models are evaluated against a set of test assets, typically portfolios sorted by characteristics (25 size-and-value portfolios, for instance). A finding that is less widely appreciated among students is that the choice of test assets is not neutral: it can materially affect which models appear to "win" and which factors appear to be priced.
@anatolyev2024test formalise this insight by showing that factor weakness is test-asset-relative. A factor that appears weak (low correlation with the pricing kernel) when evaluated against one set of test portfolios can appear strong against a different set. This is not a statistical anomaly; it reflects the fact that different test assets span different parts of the return space, and a factor's apparent relevance depends on whether its information content is represented in the test portfolios used for evaluation. Their proposed supervised PCA (SPCA) approach improves inference robustness under weak-factor conditions, but the deeper lesson is structural: conclusions about factor relevance are conditional on the test-asset design.
### Interaction-aware test-asset design
@bryzgalova2025forest push this idea further with what they call "AP Trees", a decision-tree-based approach to constructing test assets that captures characteristic interactions. Traditional portfolio sorts (sort by size, then independently by value) assume that the effects of size and value are additive. But if the value premium is stronger among small stocks (an interaction effect), additive sorts will miss this structure and the resulting test assets will fail to span the true pricing kernel. AP Trees allow the data to discover the most informative interactions, producing small sets of interpretable managed portfolios that span the stochastic discount factor better than standard sorts in OOS comparisons. The evidence is US-centric, however, and cost integration is not central to the paper, so claims about global applicability or implementability should be treated cautiously.
The related work of @kozak2019shrinking demonstrates that robust OOS SDF recovery requires shrinkage, and that sparsity in the space of principal components is more plausible than sparsity in the space of individual characteristics. This has implications for how we think about factor selection: rather than asking "which individual factor is best?", a more productive question may be "which low-dimensional projection of the characteristic space best captures the pricing kernel?"
**What this teaches students:** The test assets you use to evaluate a model are not a passive backdrop; they are an active part of the research design. If you evaluate your factor against the standard 25 Fama-French portfolios and find strong results, those results might weaken with a different portfolio set. Acknowledging this in your assessment is a mark of sophistication. You do not need to construct AP Trees for your assignment, but you should be aware that the standard evaluation framework has known limitations.
*Evidence confidence: High. Anatolyev and Mikusheva (2024), Bryzgalova, Pelger, and Zhu (2025), and Kozak, Nagel, and Santosh (2019) are KB core_036, core_007, and core_011 respectively, all full-text coded with strong claim strength. The caveat about US-centricity and cost integration for core_007 is stated in the KB annotation.*
## Implementability: Costs, Capacity, and Real-World Constraints {#sec-implementability}
### When costs change everything
The transition from academic backtests to investable strategies is where many factor premia disappear. @detzel2023transaction provide the most comprehensive treatment, showing that accounting for transaction costs can reverse the ranking of factor models. A model that appears superior when evaluated on gross returns may be inferior when evaluated net of the trading costs required to maintain its factor exposures. High-turnover factors (momentum, in particular, with annual turnover often exceeding 100%) are the most vulnerable. Low-turnover factors (value, quality) fare better, but are not immune.
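
A back-of-the-envelope calculation shows how costs can reverse a ranking. The sketch assumes a simple proportional cost per unit of turnover; the function name `net_return` and all inputs are invented for illustration and are not estimates from @detzel2023transaction.

```python
def net_return(gross_pct, annual_turnover, cost_pct):
    """Annual return in % after a proportional cost on each unit of turnover."""
    return gross_pct - annual_turnover * cost_pct

# Illustrative inputs only. A turnover of 4.0 means 400% of portfolio value traded per year.
strategies = {
    "high-turnover factor": {"gross_pct": 8.0, "annual_turnover": 4.0},
    "low-turnover factor": {"gross_pct": 5.0, "annual_turnover": 0.5},
}
cost_pct = 1.0    # assumed cost of trading, in % of the value traded

net = {
    name: net_return(s["gross_pct"], s["annual_turnover"], cost_pct)
    for name, s in strategies.items()
}
for name, s in strategies.items():
    print(f"{name}: gross {s['gross_pct']:.1f}%, net {net[name]:.1f}%")
```

On gross returns the high-turnover strategy leads by three percentage points; after costs it trails by half a point. Real trading costs vary by stock, trade size, and period, so a constant `cost_pct` is only a first approximation.
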
@demiguel2024comparing extend this analysis to price-impact costs, which increase with the scale of the investor's portfolio. Their finding is that the "best" factor model depends on the investor's risk aversion and portfolio size: a model that is optimal for a small retail investor may be suboptimal for a large institutional fund facing material price impact. This investor-dependence of model rankings challenges the search for a single "best" model and introduces a heterogeneity that most academic comparisons ignore.
### Practitioner discipline
Gray's work is particularly instructive here. In *Quantitative Momentum*, he devotes extensive discussion to the gap between backtested momentum returns and implementable momentum returns, documenting how rebalancing frequency, execution timing, and capacity constraints all erode the premium [@gray2016momentum]. His practical recommendation, to rebalance less frequently than the academic literature suggests and to accept some tracking error in exchange for lower costs, reflects a tension that students should understand: the academically optimal strategy and the implementable strategy are often different, and the practitioner's job is to navigate this gap.
### Smart-beta ETFs and the implementation gap
The smart-beta industry, which packages factor exposures into exchange-traded funds, provides a natural laboratory for studying the implementation gap. Products like iShares Edge MSCI USA Value Factor ETF or Vanguard Value ETF offer systematic value exposure at low cost (0.04 to 0.20% annually). But even these products make construction choices (weighting schemes, rebalancing schedules, universe definitions) that can cause their returns to diverge materially from the academic HML factor. Comparing ETF returns against JKP factor returns is an instructive exercise: it reveals the practical costs and compromises involved in translating an academic concept into an investable product.
For UK students, the relevant context includes workplace pension default funds (many of which now incorporate modest factor tilts), ISA-eligible ETFs listed on the London Stock Exchange, and the UK regulatory framework under which product providers must justify the costs of active strategies, including factor tilts, relative to simple index tracking.
*Evidence confidence: High. Detzel, Novy-Marx, and Velikov (2023) and DeMiguel et al. (2024) are KB core_026 and core_043 respectively, both full-text coded. Gray's practitioner evidence is not peer-reviewed in the same sense but is consistent with the academic findings.*
## Factor Investing in FinTech {#sec-fintech}
Factor investing is one of the most natural applications of the FinTech themes you have studied throughout this module. The entire logic of factor investing, identifying systematic patterns, constructing rules-based portfolios, rebalancing algorithmically, and monitoring exposures continuously, maps directly onto the capabilities of robo-advisory platforms and algorithmic trading systems.
Robo-advisors such as Betterment and Wealthfront initially offered factor tilts (value, small-cap) as premium portfolio options. The subsequent retreat of some providers from aggressive factor tilts, with Wealthfront notably scaling back its smart-beta offering after several years of underperformance, illustrates a real-world lesson: factor premia are volatile, clients lack patience, and a product that relies on a multi-year time horizon is difficult to sell to investors who check their portfolios daily.
Dimensional Fund Advisors (DFA) and AQR Capital Management represent the institutional end of the spectrum, managing hundreds of billions in systematic factor strategies. Their approach combines academic rigour (both firms have deep ties to the academic literature) with implementation expertise (patient trading, capacity management, cost control). For students, the DFA and AQR models illustrate what disciplined factor implementation looks like at scale.
The connection to earlier chapters is direct. The robo-advisor discussion in Chapter 4 introduced portfolio optimisation and algorithmic allocation. Factor investing provides the return-generating model that feeds into those optimisers. The discussion of data quality in Chapter 2 applies directly to factor construction: garbage characteristics in, garbage factor returns out. And the backtesting and production ML discussion in Chapter 10 addresses the deployment challenges that factor strategies face in live markets.
*Evidence confidence: High. This section draws on publicly observable industry information and connections to earlier course content. No frontier-dependent claims.*
## Interpreting Factor Results: A Framework for Critical Analysis {#sec-interpretation}
### What regressions tell us, and what they do not
When you run a time-series regression of a factor's returns on the market factor and obtain a significant alpha, you have established that the factor earned returns beyond what its market exposure would predict. This is a necessary first step, but it is far from the last.
What the regression does not tell you is *why* the premium exists. @kozak2018interpreting make this point formally: tests of reduced-form factor models and horse races between characteristics and covariances cannot discriminate between rational risk-based and behavioural models of investor beliefs. A factor model that prices the cross-section well is consistent with both a world where investors are rational and demand compensation for bearing systematic risk, and a world where investors are irrational and systematically misprice certain types of firms. This observational equivalence is a fundamental limitation of the empirical approach, not a flaw in any particular study.
For your assessment, the practical implication is clear: do not overclaim. If your value factor has significant alpha, you may say "value-tilted portfolios earned returns that are not explained by their market exposure during this period." You should not say "investors are compensated for the higher distress risk of value stocks" or "the market systematically overreacts to bad news about value stocks" unless you have additional structural evidence (duration analysis, event studies, or causal designs) that your factor regression cannot provide.
### HAC standard errors and why they matter
Factor returns, like most financial time series, can exhibit both serial correlation and time-varying volatility. Positive returns tend to cluster with positive returns, negative with negative, and volatile months with volatile months. Conventional OLS standard errors assume independent, homoskedastic errors, so when returns are positively autocorrelated they tend to be too small. T-statistics computed from them are inflated, and significance is overstated.
Heteroskedasticity and autocorrelation consistent (HAC) standard errors, most commonly computed using the Newey-West estimator with a specified number of lags, correct for heteroskedasticity and autocorrelation simultaneously. The coefficient estimates are unchanged; only the standard errors differ. With positively autocorrelated returns the HAC standard errors are typically larger, the t-statistics are smaller, and some results that appeared significant under OLS disappear.
```{python}
#| eval: true
#| code-fold: true
#| code-summary: "Show: OLS vs HAC alpha test"
from pathlib import Path
import pandas as pd
import numpy as np
import statsmodels.api as sm
# Load JKP data and filter to US
candidate_paths = [
    Path("labs/jkp_master_global_monthly.csv"),
    Path("../labs/jkp_master_global_monthly.csv"),
]
jkp_path = next((p for p in candidate_paths if p.exists()), None)
if jkp_path is None:
    raise FileNotFoundError("Could not locate jkp_master_global_monthly.csv")
jkp_all = pd.read_csv(jkp_path, parse_dates=["date"])
jkp = jkp_all[jkp_all["country"] == "usa"].set_index("date").sort_index()
# Alpha test: regress HML on a constant and MKT
y = jkp["HML"].dropna()
X = sm.add_constant(jkp["MKT"].reindex(y.index))
# OLS with default (non-robust) standard errors
ols_result = sm.OLS(y, X, missing="drop").fit()
# HAC standard errors (Newey-West, 6 lags)
hac_result = sm.OLS(y, X, missing="drop").fit(
    cov_type="HAC", cov_kwds={"maxlags": 6}
)
print("=== OLS standard errors ===")
print(f" Alpha: {ols_result.params['const']:.4f} "
      f"SE: {ols_result.bse['const']:.4f} "
      f"t: {ols_result.tvalues['const']:.2f}")
print("=== HAC (Newey-West, 6 lags) ===")
print(f" Alpha: {hac_result.params['const']:.4f} "
      f"SE: {hac_result.bse['const']:.4f} "
      f"t: {hac_result.tvalues['const']:.2f}")
```
For your assessment, using HAC standard errors is not optional. Any factor regression reported without HAC correction is methodologically incomplete. The lab exercises provide worked examples of the difference between OLS and HAC inference, and you should report both (or at minimum, report HAC) and discuss what the comparison reveals about the error structure of your factor returns.
### Statistical versus economic significance
A factor alpha of 0.3% per month with a HAC t-statistic of 2.5 is statistically significant at conventional levels. But is it economically meaningful? After transaction costs of 0.1 to 0.3% per month (realistic for a medium-turnover strategy), the net alpha may be between 0.0% and 0.2% per month, or 0 to 2.4% annualised. Whether this justifies the complexity, risk, and operational cost of implementing the strategy is a judgement call, not a statistical one.
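The arithmetic behind these figures is short enough to write out. The sketch uses the numbers from the paragraph above and reports both the simple annualisation used in the text (monthly figure times twelve) and a compounded alternative for comparison. The variable names are illustrative.

```{python}
#| eval: true
#| code-fold: true
#| code-summary: "Show: net alpha after costs"
gross_alpha_monthly_pct = 0.3        # gross alpha, % per month (from the text)
cost_range_monthly_pct = (0.1, 0.3)  # assumed trading costs, % per month (from the text)

net = {}
for cost in cost_range_monthly_pct:
    net_monthly = gross_alpha_monthly_pct - cost
    net[cost] = {
        "monthly_pct": net_monthly,
        "annual_simple_pct": 12 * net_monthly,  # simple annualisation, as in the text
        "annual_compound_pct": ((1 + net_monthly / 100) ** 12 - 1) * 100,
    }

for cost, values in net.items():
    print(
        f"cost {cost:.1f}%/month -> net {values['monthly_pct']:.2f}%/month, "
        f"{values['annual_simple_pct']:.2f}% simple annualised, "
        f"{values['annual_compound_pct']:.2f}% compounded"
    )
```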
This distinction matters for your critical analysis. The best assessment submissions go beyond statistical significance to discuss economic significance: is the premium large enough, after costs, to be worth implementing? Is it stable enough to rely on? How does it compare to the simpler alternative of holding a market index fund?
*Evidence confidence: High. The interpretation-boundary claim (Kozak, Nagel, and Santosh, 2018) is KB core_034. HAC standard errors are standard econometric methodology [@newey1987simple]. The distinction between statistical and economic significance is well-established in the methodological literature.*
## Synthesis and Progression {#sec-synthesis}
### What you have learned
This chapter has taken you through three layers of understanding about factor investing. At the practitioner level, you now know what factors are, how they are constructed, and why disciplined implementation matters more than signal discovery. At the canonical level, you can place the major factors (market, size, value, momentum, profitability, investment) within the evolution from CAPM through to the modern multi-factor frameworks. At the frontier level, you understand the principal threats to factor claims: multiple testing and false discovery, in-sample overfitting, test-asset sensitivity, weak-factor identification, and implementation costs.
### Key principles for assessment
The following principles should guide your assessment analysis:
1. **Report HAC standard errors.** OLS standard errors are inappropriate for time-series factor regressions. Always use Newey-West or an equivalent HAC estimator, and discuss what the correction implies about your factor's error structure.
2. **Perform subsample analysis.** A factor that is significant in the full sample but in none of its subsamples is less credible than one that performs consistently across periods. Report subsample results honestly, even when they weaken your headline finding.
3. **Discuss out-of-sample performance.** If the factor was originally documented in a specific sample (e.g., US equities 1963 to 1990), its performance in subsequent data (1991 to present) is out-of-sample evidence. Comparing the two is one of the most informative analyses you can perform.
4. **Acknowledge the factor zoo.** Your factor was selected from a population that includes many false discoveries. Engage with this reality: what is the evidence that your factor is genuine rather than a product of data mining?
5. **Consider implementation costs.** Even a statistically significant factor may be economically meaningless after transaction costs. Discuss what costs your factor would face and whether the net premium is sufficient to justify implementation.
6. **Do not overclaim about mechanisms.** Your regression tells you what predicts returns, not why. Resist the temptation to assert risk-based or behavioural explanations without structural evidence beyond the scope of your analysis.
### Looking ahead
The factor framework provides the return-generating model that connects to nearly every subsequent topic in this module. In Week 10, you will examine backtesting and production ML, where the factor model serves as the benchmark against which machine-learning strategies are evaluated. The same discipline of out-of-sample testing, false-discovery control, and implementation-cost awareness that you have learned here applies, with even greater force, to the more flexible models that follow.
*Evidence confidence: High. This synthesis section integrates principles from the preceding sections without introducing new frontier claims.*
## References {.unnumbered}
::: {#refs}
:::