Week 1: Foundations of Financial Data Science

Data Science as Statistical Science

Part I: Data Science Is Statistical Science

What Is Data Science, Really?

Data science = the disciplined study of variation and uncertainty in data.

Variation: what we model (differences across units, over time, between groups)
Uncertainty: what we quantify (limits of what we can know from finite, noisy data)
Not just coding, algorithms, or “AI” : those are tools in service of this study
Finance amplifies both: noisy signals, non-stationary processes, strategic behaviour

Three Challenges of Statistical Inference

Gelman, Hill, and Vehtari (2020) frame all statistical work around three challenges:

Sample → Population: Can 100 stocks tell us about the market?
Treatment → Control: Did the strategy cause the improvement?
Measurement → Construct: Does volatility capture actual risk?

All three are prediction problems under uncertainty.

The Model Choice Dilemma

Model A: five curated signals, 85% accuracy on training data
Model B: fifty engineered signals, 92% on same data
Question: Which model wins out-of-sample?
Without disciplined validation, the higher score might just be noise

Learning Objectives

Decide when complex models earn their variance by validating and regularising rigorously
Use the bias-variance perspective to diagnose model behaviour for returns and risk
Quantify and communicate uncertainty with the right statistical tools
Contrast frequentist and Bayesian reasoning; pick the frame that answers the question
Follow reproducible research practices: transparent code, documented assumptions, evidence-based claims

Part II: Variation : The Heart of Statistical Science

Financial Returns 101

Simple return: \(r_t = \frac{P_t - P_{t-1}}{P_{t-1}}\); log return: \(g_t = \ln P_t - \ln P_{t-1}\)
Excess return: asset return minus a benchmark (risk-free or market), e.g., \(r^e_t = r_t - r_{f,t}\)
Why returns (not prices)? Stationarity and comparability across assets
Our targets: market (\(\mathrm{MKT}\)) and factor returns (e.g., \(\mathrm{MOM}\))

Stylised Facts: What Markets Give Us

Heavy tails and mild skewness : extreme events happen more than Normal predicts
Weak linear autocorrelation in raw returns : hard to forecast
Volatility clustering and leverage effects : calm follows calm, storms follow storms
Time-varying factors, non-stationarity, and microstructure noise

Why These Facts Arise

Information arrival is uneven; volatility is a mixture of states
Traders update at different speeds, creating persistence in |r|
Microstructure frictions distort high-frequency data
Structural shifts (policy, technology) move the goalposts; expect regime change

Volatility Clustering Snapshot

Part III: Regression : Coefficients as Comparisons

The Classical Linear Regression Model

\[Y_i = \beta_0 + \beta_1 X_{1i} + \cdots + \beta_k X_{ki} + \varepsilon_i\]

OLS: minimises sum of squared residuals
Under CLRM assumptions: OLS is BLUE (Best Linear Unbiased Estimator)

Property	Meaning
Best	Minimum variance among linear unbiased estimators
Linear	\(\hat{\beta}\) is linear function of Y
Unbiased	\(\mathbb{E}[\hat{\beta}] = \beta\) on average

Assumptions Ranked by Importance

Gelman, Hill, and Vehtari (2020) argue we focus on the wrong assumptions:

Rank	Assumption	Why It Matters
1	Validity	Does your model address your research question?
2	Representativeness	Is your sample representative of target population?
3	Additivity & Linearity	Most important mathematical assumption
4	Independence of errors	Violated in time series, spatial, multilevel data
5	Equal variance	Heteroscedasticity rarely changes conclusions
6	Normality of errors	“Barely important at all” for estimation

Coefficients as Comparisons, Not Effects

Example: ESG and Stock Returns

annual_return = 8.2 + 0.15 × ESG_score + 0.02 × market_cap + error

Comparison (✓)	Effect (✗)
“Firms with higher ESG scores have 15bp higher returns, on average”	“Improving ESG causes returns to increase by 15bp”

Why the Distinction Matters

High-ESG companies may differ systematically:

Better management overall
Stronger governance structures
More patient investors
Greater financial resources

The observed return difference could reflect these omitted factors, not ESG itself.

More Finance Examples

Research Finding	Comparison (✓)	Effect (✗)
β = 0.3 on analyst coverage	More coverage → higher returns observed	Adding analysts causes higher returns
β = -0.05 on leverage	Leveraged firms → lower returns	Reducing leverage increases returns
β = 0.02 on insider ownership	Higher ownership → better performance	Giving managers shares improves performance

Causal claims require different evidence: RCTs, IV, natural experiments.

The Regression Fallacy

Warning

Regression to the mean is often mistaken for a causal effect.

A company performs poorly last year, improves this year. Did the new CEO cause the improvement? Or would regression to the mean have produced similar results anyway?

Extreme observations contain signal + luck. On repetition, luck averages out.

When Assumptions Fail

Violation	OLS Unbiased?	Standard Errors Valid?	Remedy
Heteroscedasticity	✓	✗	White (HC) SEs
Autocorrelation	✓	✗	Newey-West (HAC)
Multicollinearity	✓	✓ (but imprecise)	Regularisation
Endogeneity	✗	✗	IV methods

Connection to Machine Learning

Tip

ML methods extend classical econometrics:

Regularisation (ridge, lasso) → addresses multicollinearity
Tree-based methods → handle nonlinearity and interactions automatically
Sequence learning → explicitly models temporal dependencies

These are not departures from statistics : they are extensions.

Part IV: Uncertainty : What R² Can and Cannot Tell Us

The Limits of R²

\[R^2 = 1 - \frac{\text{Residual variance}}{\text{Total variance}}\]

What R² tells us:

How much variance our predictors explain
Whether adding variables improves explanatory power

What R² does NOT tell us:

Whether the model is correct (wrong model can have high R²)
Whether the model is useful for prediction (low R² can still be actionable)
Whether relationships are causal

Low R² Can Still Be Informative

Gelman, Hill, and Vehtari (2020) note: predicting earnings from height yields R² ≈ 0.10.

This means 90% of variance has nothing to do with height. Yet the regression is still informative : it reveals a genuine association.

In finance, R² values of 0.01-0.05 are common when predicting returns. This reflects fundamental difficulty, not model failure.

But where you look matters: Volatility (~25% R²) and cross-sectional variation (~10% R²) are more predictable. We’ll see why in Week 3.

Interactions Require 4× Sample Size

A crucial but overlooked insight from Gelman, Hill, and Vehtari (2020):

Estimating interaction effects requires roughly four times the sample size of main effects at the same precision.

Why? Standard error of interaction ≈ 2× standard error of main effect.

Implications:

Study powered for “size effect” is underpowered for “does size effect vary by industry?”
Significant interactions in exploratory analysis are likely overestimated
Studying “does momentum work differently in bull vs. bear markets?” requires substantially more data

Part V: Five Pitfalls of Statistical Significance

Pitfall 1: Significance ≠ Practical Importance

A result can be “statistically significant” yet trivially small.

Example: Strategy earns 0.001% excess return with SE 0.0003%

t-statistic ≈ 3.3 (statistically significant!)
Economically meaningless after transaction costs

Ask: “Is the effect large enough to matter?” not just “Is p < 0.05?”

Pitfall 2: Non-Significance ≠ Zero Effect

Failure to reject the null does not mean the effect is zero.

Example: Estimate of 5% ± 8%

Not statistically significant (confidence interval includes zero)
But consistent with large positive or small negative effect
Data are inconclusive, not proof of no effect

Pitfall 3: Comparing Significant vs Non-Significant

This subtle error pervades finance research:

Scenario:

Strategy A: significant alpha (t = 2.1)
Strategy B: not significant (t = 1.8)

Wrong conclusion: A and B differ meaningfully

Why? To compare them, test the difference : SE is roughly √2 times larger.

The difference between “significant” and “not significant” is not itself significant.

Pitfall 4: P-Hacking and Forking Paths

With enough flexibility in:

Data processing
Variable selection
Model specification
Sample period choice

…researchers can achieve p < 0.05 from almost any dataset : even pure noise.

The problem is not always conscious “fishing” but the accumulation of small, defensible choices.

Pitfall 5: Publication Bias

When only “significant” results get published:

Literature systematically overstates effect sizes
A strategy that “works” in one study may be the lucky draw
Meta-analyses inherit this bias

Remedy: Pre-registration, report all specifications tried, track trial counts.

The Multiple Testing Alarm

Even pure noise yields “significant” hits when you try enough ideas.

Part VI: The Bias-Variance Tradeoff

When Complexity Wins

High-dimensional signals can reduce bias enough to dominate variance when markets are noisy and persistent (Kelly, Malamud, and Zhou (2024))
Richer feature spaces capture stylised facts (volatility clustering, asymmetry) that simple models miss
In finance, complexity often rescues weak signals : but only with disciplined regularisation

Kelly-Malamud-Zhou (2024): Key Finding

Setting: Return prediction with high-dimensional predictors

Core result: Under realistic financial DGPs, bias reduction from richer models can outweigh variance increases → better out-of-sample performance

Caveat: Complexity must be disciplined:

Regularisation
Time-aware validation
Governance guardrails

Bias-Variance Map (Illustrative)

Decision Flow: Should We Go Complex?

Diagnose signal: Do stylised facts or theory justify richer structure?
Prototype simple baseline; record validated error
Escalate complexity with shrinkage controls and time-aware validation
Stress-test governance: interpretability, robustness, operational load
Document evidence that complex model wins before promoting it

Underfit vs Good Fit vs Overfit

Part VII: Uncertainty Quantification

Frequentist vs Bayesian: A Pragmatic View

Frequentist	Bayesian
Parameters fixed, data repeatable	Parameters random, data observed
Control long-run error rates (tests, CIs)	Update beliefs via priors and likelihood
Great for regulatory benchmarks	Great when prior information and decisions matter
Confidence intervals have coverage guarantees	Credible intervals express posterior belief

Both use the same Kolmogorov axioms : choosing is about the question asked.

CI ≠ Belief

Confidence interval: procedure property, not belief about the parameter
95% CI means: if we repeated sampling, 95% of intervals would contain the true value
It does NOT mean: “I’m 95% confident the parameter is in this interval”

For belief statements, use Bayesian credible intervals.

95% CI: Many Intervals, One True Value

95% = in the long run, 95% of such intervals contain the true parameter : a procedure property, not a probability about this interval.

Takeaway: Each row is one sample’s 95% CI. The black line is the true value. Most intervals cross it; a few miss : that’s what “95%” means.

Bootstrap Confidence Interval

When textbook assumptions fail (e.g., skewed data), bootstrap provides evidence-based intervals.

Bayesian Update Example

Prior × Likelihood → Posterior: data updates our beliefs.

When Wide Credible Intervals Are Informative

Question: How much of return variance is predictable?

Fit Bayesian AR(1) to daily returns. Posterior for R² (squared autocorrelation):

Asset	Median R²	95% Credible Interval
SPY	1.66%	[0.08%, 6.24%]
AAPL	0.73%	[0.02%, 2.94%]
BTCUSD	0.04%	[0.00%, 0.41%]

Wide intervals are the message, not a weakness:

Even at the optimistic upper bound, >93% of variance is noise
We can’t precisely measure how little predictability there is
Point estimates hide this uncertainty; posteriors reveal it

Part VIII: Validation and Governance

Time-Aware Validation Workflow

Profile data; lock in feature engineering rules
Design rolling or expanding splits; document dates
Run walk-forward validation; log metrics and diagnostics
Review stability, governance notes, decision memos

Walk-Forward Validation (Illustrative)

Common Pitfalls in Financial Modelling

Look-ahead/leakage: future information in features or targets
Survivorship bias: current constituents only; delistings omitted
Data snooping: p-hacking; post-selection inference ignored
Unrealistic frictions: zero costs; fills at non-tradable prices
Time alignment: mismatched timestamps; timezone/holiday drift

Backtest Overfitting: CSCV and PBO

CSCV: split time into contiguous folds; pick in-sample “winner,” test OOS
PBO: fraction of splits where the winner underperforms out-of-sample
High PBO = fragile discovery; reconsider selection or improve validation

Model Validation: Governance Checklist

Define target leakage tests before modelling
Align validation window with business decision horizon
Stress-test hyper-parameters and features for stability
Keep benchmark models live (naive, linear) for accountability

Part IX: Practical Tools and Tips

Fake-Data Simulation

Before trusting real results, test your procedure on fake data:

Specify data-generating process with known parameters
Simulate data from this process
Apply your estimation procedure
Compare estimates to true values
Repeat many times to assess variability

If your procedure can’t recover known effects from fake data, don’t trust it with real data.

Ten Tips for Better Regression Modelling

Think about variation in the data
Forget statistical significance; focus on practical significance
Graph the data relevant to your analysis
Interpret coefficients as comparisons, not effects
Understand your methods using fake-data simulation

Ten Tips (continued)

Fit many models to understand sensitivity
Set up a computational workflow that supports iteration
Use transformations to improve linearity and reduce outlier influence
Do targeted causal inference only when supported by design
Learn methods through real worked examples

Part X: From Here to the Module

From Primer to Weeks 1-4

Week 1: Apply complexity logic to cost persistence and regulatory data
Week 2: Data quality, measurement, survivorship bias
Week 3: Volatility modelling and time series foundations
Week 4: Robo-advice models with reproducibility + governance checklists

Datasets We’ll Use

Bloomberg database (practice and labs)
JKP factors (Coursework 2)
Simulated data (for understanding methods)

Directed Learning & Reflection

Core: Read Kelly, Malamud, and Zhou (2024) and Efron and Hastie (2016) overview; rerun one demo; summarise bias-variance insights
Optional extension: Add Gelman, Hill, and Vehtari (2020) causal framing; write a short reflection on when complexity is virtuous

Recap: Data Science as Statistical Science

Data science is statistical science : the disciplined study of variation and uncertainty
Regression coefficients are comparisons, not effects
Complexity can win, but only with validation and governance
Communicate uncertainty honestly; avoid the five pitfalls of significance

Exit Ticket

Write down one dataset where you will trial a complex model : and why
Capture the validation design you will use to prove it beats the baseline
Note a governance or ethical concern you will monitor as you scale

References

Efron, Bradley, and Trevor Hastie. 2016. Computer Age Statistical Inference: Algorithms, Evidence, and Data Science. Cambridge University Press. https://hastie.su.stanford.edu/CASI/.

Gelman, Andrew, Jennifer Hill, and Aki Vehtari. 2020. Regression and Other Stories. Cambridge, UK: Cambridge University Press. https://avehtari.github.io/ROS-Examples/.

Kelly, Bryan T., Semyon Malamud, and Kangying Zhou. 2024. “The Virtue of Complexity in Return Prediction.” Journal of Finance 79 (1): 459–503. https://doi.org/10.1111/jofi.13298.