Data Science as Statistical Science
Data science = the disciplined study of variation and uncertainty in data.
Gelman, Hill, and Vehtari (2020) frame all statistical work around three challenges:
All three are prediction problems under uncertainty.
\[Y_i = \beta_0 + \beta_1 X_{1i} + \cdots + \beta_k X_{ki} + \varepsilon_i\]
| Property | Meaning |
|---|---|
| Best | Minimum variance among linear unbiased estimators |
| Linear | \(\hat{\beta}\) is a linear function of \(Y\) |
| Unbiased | \(\mathbb{E}[\hat{\beta}] = \beta\): correct on average across repeated samples |
Gelman, Hill, and Vehtari (2020) argue we focus on the wrong assumptions:
| Rank | Assumption | Why It Matters |
|---|---|---|
| 1 | Validity | Does your model address your research question? |
| 2 | Representativeness | Is your sample representative of the target population? |
| 3 | Additivity & Linearity | Most important mathematical assumption |
| 4 | Independence of errors | Violated in time series, spatial, multilevel data |
| 5 | Equal variance | Heteroscedasticity rarely changes conclusions |
| 6 | Normality of errors | “Barely important at all” for estimation |
Example: ESG and Stock Returns
annual_return = 8.2 + 0.15 × ESG_score + 0.02 × market_cap + error
| Comparison (✓) | Effect (✗) |
|---|---|
| “Firms with a one-point higher ESG score have 15bp higher annual returns, on average, at the same market cap” | “Improving ESG by one point causes returns to increase by 15bp” |
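The comparison reading can be checked by plugging two firms into the fitted equation. This is a minimal sketch: the two firms and their ESG and market-cap values are hypothetical, and it assumes returns are measured in percent per year, so a coefficient of 0.15 corresponds to 15 basis points.

```python
# Fitted equation from the example; returns assumed to be in percent per year.
def predicted_return(esg_score: float, market_cap: float) -> float:
    return 8.2 + 0.15 * esg_score + 0.02 * market_cap

# Two hypothetical firms with the same market cap, one ESG point apart.
firm_a = predicted_return(esg_score=60.0, market_cap=10.0)
firm_b = predicted_return(esg_score=61.0, market_cap=10.0)

gap_pct = firm_b - firm_a   # difference in percentage points
gap_bp = gap_pct * 100      # 1 percentage point = 100 basis points
print(f"Predicted gap: {gap_pct:.2f} pct points = {gap_bp:.0f} bp")
```

The gap describes how the two firms differ in the data. It says nothing about what would happen to firm A if its own ESG score were raised.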
High-ESG companies may differ systematically:
The observed return difference could reflect these omitted factors, not ESG itself.
| Research Finding | Comparison (✓) | Effect (✗) |
|---|---|---|
| β = 0.3 on analyst coverage | More coverage → higher returns observed | Adding analysts causes higher returns |
| β = -0.05 on leverage | Leveraged firms → lower returns | Reducing leverage increases returns |
| β = 0.02 on insider ownership | Higher ownership → better performance | Giving managers shares improves performance |
Causal claims require different evidence: RCTs, IV, natural experiments.
Warning
Regression to the mean is often mistaken for a causal effect.
A company performs poorly last year, improves this year. Did the new CEO cause the improvement? Or would regression to the mean have produced similar results anyway?
Extreme observations contain signal + luck. On repetition, luck averages out.
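A small simulation makes the "signal + luck" decomposition concrete. It is a sketch under assumed numbers (10,000 hypothetical firms, skill and luck both standard normal); no intervention of any kind happens between the two years.

```python
import random
from statistics import mean

rng = random.Random(0)
n_firms = 10_000

# Each firm's result = persistent skill (signal) + fresh luck every year.
skill = [rng.gauss(0, 1) for _ in range(n_firms)]
year1 = [s + rng.gauss(0, 1) for s in skill]
year2 = [s + rng.gauss(0, 1) for s in skill]

# Select roughly the worst 10% of firms in year 1; nothing changes for them.
cutoff = sorted(year1)[n_firms // 10]
worst = [i for i in range(n_firms) if year1[i] <= cutoff]

worst_y1 = mean(year1[i] for i in worst)
worst_y2 = mean(year2[i] for i in worst)
print(f"Worst decile, year 1: {worst_y1:.2f}")
print(f"Same firms,   year 2: {worst_y2:.2f}")
```

The worst performers improve in year 2 even though no CEO was replaced: their bad luck is not repeated, while their below-average skill persists.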
| Violation | OLS Unbiased? | Standard Errors Valid? | Remedy |
|---|---|---|---|
| Heteroscedasticity | ✓ | ✗ | White (HC) SEs |
| Autocorrelation | ✓ | ✗ | Newey-West (HAC) |
| Multicollinearity | ✓ | ✓ (but imprecise) | Regularisation |
| Endogeneity | ✗ | ✗ | IV methods |
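The heteroscedasticity row can be illustrated with a simple regression on simulated data. This is a sketch, not a reference implementation: the data-generating process (error spread proportional to |x|) is an assumption chosen to make the effect visible, and the robust formula shown is the basic HC0 variant for a single regressor.

```python
import math
import random

rng = random.Random(1)
n = 2_000

# Simulated data: true slope 0.5, error spread grows with |x| (heteroscedastic).
x = [rng.gauss(0, 1) for _ in range(n)]
y = [1.0 + 0.5 * xi + rng.gauss(0, abs(xi)) for xi in x]

x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
intercept = y_bar - slope * x_bar
resid = [yi - intercept - slope * xi for xi, yi in zip(x, y)]

# Classical SE: assumes one common error variance for every observation.
s2 = sum(e ** 2 for e in resid) / (n - 2)
se_classical = math.sqrt(s2 / sxx)

# White (HC0) SE: weights each squared residual by (x_i - x_bar)^2,
# so no common variance is assumed.
meat = sum((xi - x_bar) ** 2 * e ** 2 for xi, e in zip(x, resid))
se_robust = math.sqrt(meat / sxx ** 2)

print(f"slope = {slope:.3f}")
print(f"classical SE = {se_classical:.4f}, robust SE = {se_robust:.4f}")
```

The slope estimate stays close to 0.5, but the classical standard error understates the uncertainty in this setup. In practice a library such as statsmodels would normally be used for robust covariance estimates.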
Tip
ML methods extend classical econometrics:
These are not departures from statistics: they are extensions.
\[R^2 = 1 - \frac{\text{Residual variance}}{\text{Total variance}}\]
What R² tells us:
What R² does NOT tell us:
Gelman, Hill, and Vehtari (2020) note: predicting earnings from height yields R² ≈ 0.10.
This means roughly 90% of the variance in earnings is not explained by height. Yet the regression is still informative: it reveals a genuine association.
In finance, R² values of 0.01-0.05 are common when predicting returns. This reflects fundamental difficulty, not model failure.
But where you look matters: Volatility (~25% R²) and cross-sectional variation (~10% R²) are more predictable. We’ll see why in Week 3.
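A low R² and a clearly detectable association can coexist. The sketch below uses a hypothetical data-generating process (slope 0.33, unit-variance noise) chosen so that the population R² is roughly 0.10, then computes R² as one minus the ratio of residual to total variation.

```python
import math
import random

rng = random.Random(2)
n = 5_000

# Hypothetical DGP: a real slope of 0.33 buried in unit-variance noise.
x = [rng.gauss(0, 1) for _ in range(n)]
y = [0.33 * xi + rng.gauss(0, 1) for xi in x]

x_bar, y_bar = sum(x) / n, sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
intercept = y_bar - slope * x_bar

# R^2 = 1 - residual sum of squares / total sum of squares.
rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
tss = sum((yi - y_bar) ** 2 for yi in y)
r_squared = 1 - rss / tss

# The slope is nonetheless estimated precisely.
se_slope = math.sqrt(rss / (n - 2) / sxx)
t_stat = slope / se_slope
print(f"R^2 = {r_squared:.3f}, slope = {slope:.3f}, t = {t_stat:.1f}")
```

Most of the variation in `y` is noise, yet the slope is many standard errors from zero: R² measures how much is explained, not whether the relationship is real.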
A crucial but overlooked insight from Gelman, Hill, and Vehtari (2020):
Estimating an interaction to the same precision as a main effect requires roughly four times the sample size.
Why? The standard error of an interaction is about 2× that of a main effect, and standard errors shrink with √n, so halving the SE takes 4× the data.
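One way to see where the factor of 2 comes from is the balanced two-by-two case. This sketch assumes two binary factors, \(n\) observations split evenly across the four cells, and a common error standard deviation \(\sigma\); these symbols are introduced here for illustration. A main effect compares two halves of the data, while the interaction is a difference of differences across four quarters:

\[
\operatorname{SE}_{\text{main}} = \sqrt{\frac{\sigma^2}{n/2} + \frac{\sigma^2}{n/2}} = \frac{2\sigma}{\sqrt{n}},
\qquad
\operatorname{SE}_{\text{interaction}} = \sqrt{4 \cdot \frac{\sigma^2}{n/4}} = \frac{4\sigma}{\sqrt{n}} = 2\,\operatorname{SE}_{\text{main}}
\]

Matching the main-effect precision with a sample of size \(n'\) requires \(4\sigma/\sqrt{n'} = 2\sigma/\sqrt{n}\), which gives \(n' = 4n\).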
Implications:
A result can be “statistically significant” yet trivially small.
Example: A strategy earns a 0.001% excess return with a standard error of 0.0003%: more than three standard errors from zero, yet economically negligible.
Ask: “Is the effect large enough to matter?” not just “Is p < 0.05?”
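The two questions can be asked side by side. In this sketch the estimate and standard error come from the example above, while the trading cost is a hypothetical yardstick that is not in the source and is assumed to apply over the same horizon as the excess return. The p-value uses a normal approximation.

```python
import math

estimate_pct = 0.001   # excess return from the example, in percent
se_pct = 0.0003        # its standard error, in percent

# Statistical question: how many standard errors from zero?
z = estimate_pct / se_pct
p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided, normal approximation

# Practical question: does it clear a hypothetical cost of 5 basis points?
cost_pct = 0.05  # assumption for illustration only
clears_costs = estimate_pct > cost_pct

print(f"z = {z:.2f}, p = {p_value:.4f}")
print(f"Large enough to cover costs: {clears_costs}")
```

The result passes the p < 0.05 test comfortably and still fails the practical one under the assumed cost.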
Failure to reject the null does not mean the effect is zero.
Example: Estimate of 5% ± 8%
This subtle error pervades finance research:
Scenario:
Wrong conclusion: A and B differ meaningfully
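Why? To compare them, test the difference directly. For two independent estimates with similar standard errors, the SE of the difference is roughly √2 times larger than either one.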
The difference between “significant” and “not significant” is not itself significant.
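A numerical sketch shows how this plays out. The estimates below (25 ± 10 for A, 10 ± 10 for B) are hypothetical, the two estimates are assumed independent, and p-values use a normal approximation.

```python
import math

def z_and_p(estimate: float, se: float) -> tuple[float, float]:
    """Return the z-score and two-sided normal p-value."""
    z = estimate / se
    return z, math.erfc(abs(z) / math.sqrt(2))

# Hypothetical estimates and standard errors for two strategies.
est_a, se_a = 25.0, 10.0
est_b, se_b = 10.0, 10.0

z_a, p_a = z_and_p(est_a, se_a)   # "significant"
z_b, p_b = z_and_p(est_b, se_b)   # "not significant"

# Test the difference itself; variances add under independence.
diff = est_a - est_b
se_diff = math.sqrt(se_a ** 2 + se_b ** 2)
z_diff, p_diff = z_and_p(diff, se_diff)

print(f"A: z = {z_a:.2f}, p = {p_a:.3f}")
print(f"B: z = {z_b:.2f}, p = {p_b:.3f}")
print(f"A - B: z = {z_diff:.2f}, p = {p_diff:.3f}")
```

A clears the 5% threshold and B does not, but the data cannot distinguish A from B.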
With enough flexibility in:
…researchers can achieve p < 0.05 from almost any dataset, even pure noise.
The problem is not always conscious “fishing” but the accumulation of small, defensible choices.
When only “significant” results get published:
Remedy: Pre-registration, report all specifications tried, track trial counts.
Even pure noise yields “significant” hits when you try enough ideas.
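This is easy to demonstrate by simulation. The sketch below tests 1,000 hypothetical strategies whose daily returns are pure noise with zero mean (250 days each, 1% daily volatility, all assumed values) and counts how many look "significant" at the usual threshold.

```python
import math
import random
from statistics import mean, stdev

rng = random.Random(3)
n_strategies = 1_000
n_days = 250

def t_stat(returns: list[float]) -> float:
    """t-statistic for the null hypothesis of zero mean return."""
    return mean(returns) / (stdev(returns) / math.sqrt(len(returns)))

# Every strategy has a true mean return of exactly zero.
t_stats = [
    t_stat([rng.gauss(0, 0.01) for _ in range(n_days)])
    for _ in range(n_strategies)
]

n_hits = sum(abs(t) > 1.96 for t in t_stats)
best_t = max(abs(t) for t in t_stats)
print(f"'Significant' strategies: {n_hits} of {n_strategies}")
print(f"Largest |t| found: {best_t:.2f}")
```

Around 5% of the strategies pass by chance alone, and the best one looks impressive in isolation. Reporting only that winner hides the number of trials behind it.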
Setting: Return prediction with high-dimensional predictors
Core result: Under realistic financial DGPs, bias reduction from richer models can outweigh variance increases → better out-of-sample performance
Caveat: Complexity must be disciplined:
| Frequentist | Bayesian |
|---|---|
| Parameters fixed, data repeatable | Parameters random, data observed |
| Control long-run error rates (tests, CIs) | Update beliefs via priors and likelihood |
| Great for regulatory benchmarks | Great when prior information and decisions matter |
| Confidence intervals have coverage guarantees | Credible intervals express posterior belief |
Both use the same Kolmogorov axioms; the choice depends on the question asked.
For belief statements, use Bayesian credible intervals.
95% = in the long run, 95% of such intervals contain the true parameter. This is a property of the procedure, not a probability about this particular interval.
Takeaway: Each row is one sample’s 95% CI. The black line is the true value. Most intervals cross it; a few miss. That’s what “95%” means.
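The long-run reading can be verified by repeating the experiment many times. This is a minimal sketch with assumed values (true mean 0.5, known σ = 2, samples of size 50); treating σ as known keeps the interval formula to a single line.

```python
import math
import random
from statistics import NormalDist, mean

rng = random.Random(4)
true_mean, sigma = 0.5, 2.0      # assumed "truth" for the simulation
n, n_samples = 50, 2_000

# Half-width of a 95% interval for the mean when sigma is known.
z = NormalDist().inv_cdf(0.975)
half_width = z * sigma / math.sqrt(n)

covered = 0
for _ in range(n_samples):
    sample_mean = mean(rng.gauss(true_mean, sigma) for _ in range(n))
    # Does this sample's interval contain the true value?
    if abs(sample_mean - true_mean) <= half_width:
        covered += 1

coverage = covered / n_samples
print(f"Share of intervals containing the truth: {coverage:.3f}")
```

Each individual interval either contains the true mean or it does not; the 95% describes the share that do across repetitions.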
When textbook assumptions fail (e.g., skewed data), the bootstrap provides intervals built from the data's own empirical distribution.
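A percentile bootstrap is one common variant and takes only a few lines. The sketch below uses simulated exponential data as a stand-in for a skewed sample; the sample size and number of resamples are arbitrary choices.

```python
import random
from statistics import mean

rng = random.Random(5)

# Skewed stand-in data: exponential draws with true mean 1.0.
data = [rng.expovariate(1.0) for _ in range(200)]

# Resample with replacement and recompute the mean each time.
n_boot = 2_000
boot_means = sorted(
    mean(rng.choices(data, k=len(data))) for _ in range(n_boot)
)

# Percentile interval: the middle 95% of the bootstrap means.
lower = boot_means[int(0.025 * n_boot)]
upper = boot_means[int(0.975 * n_boot) - 1]
print(f"Sample mean: {mean(data):.3f}")
print(f"95% bootstrap interval: [{lower:.3f}, {upper:.3f}]")
```

No distributional formula is needed: the spread of the resampled means stands in for the sampling distribution.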
Prior × Likelihood → Posterior: data updates our beliefs.
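The simplest worked case is a conjugate Beta-Binomial update, where the posterior has a closed form. All numbers here are hypothetical: a weak Beta(2, 2) prior on the probability that a strategy beats its benchmark in a given month, followed by 14 winning and 6 losing months.

```python
# Prior belief about p = P(strategy beats benchmark in a month).
prior_a, prior_b = 2.0, 2.0    # hypothetical weak prior centred on 0.5

# Observed data (hypothetical).
wins, losses = 14, 6

# Conjugate update: a Beta prior with binomial data gives a Beta posterior.
post_a = prior_a + wins
post_b = prior_b + losses

prior_mean = prior_a / (prior_a + prior_b)
post_mean = post_a / (post_a + post_b)
data_only = wins / (wins + losses)

print(f"Prior mean:     {prior_mean:.3f}")
print(f"Data alone:     {data_only:.3f}")
print(f"Posterior mean: {post_mean:.3f}")
```

The posterior mean sits between the prior and the raw win rate, and moves closer to the data as more months accumulate.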
Question: How much of return variance is predictable?
Fit a Bayesian AR(1) model to daily returns. Posterior for R² (the squared autocorrelation coefficient):
| Asset | Median R² | 95% Credible Interval |
|---|---|---|
| SPY | 1.66% | [0.08%, 6.24%] |
| AAPL | 0.73% | [0.02%, 2.94%] |
| BTCUSD | 0.04% | [0.00%, 0.41%] |
Wide intervals are the message, not a weakness:
Before trusting real results, test your procedure on fake data:
If your procedure can’t recover known effects from fake data, don’t trust it with real data.
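A fake-data check might look like the sketch below: pick a known slope, simulate data from it, run the estimation procedure, and repeat. The helper `simulate_and_fit`, the true slope of 0.3, and the sample sizes are all illustrative assumptions, with simple OLS standing in for whatever procedure is being tested.

```python
import math
import random

def simulate_and_fit(rng, true_slope, n=200, noise_sd=1.0):
    """Simulate y = true_slope * x + noise, then fit OLS; return (slope, SE)."""
    x = [rng.gauss(0, 1) for _ in range(n)]
    y = [true_slope * xi + rng.gauss(0, noise_sd) for xi in x]
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    se = math.sqrt(rss / (n - 2) / sxx)
    return slope, se

rng = random.Random(6)
true_slope = 0.3   # the known effect planted in the fake data
n_reps = 500

estimates = []
hits = 0
for _ in range(n_reps):
    slope_hat, se = simulate_and_fit(rng, true_slope)
    estimates.append(slope_hat)
    # Is the planted slope within two standard errors of the estimate?
    hits += abs(slope_hat - true_slope) <= 2 * se

avg_estimate = sum(estimates) / n_reps
coverage = hits / n_reps
print(f"Average estimate: {avg_estimate:.3f} (truth: {true_slope})")
print(f"Truth within 2 SE: {coverage:.1%} of runs")
```

If the average estimate drifted away from the planted value, or the ±2 SE intervals covered it far less than about 95% of the time, that would point to a bug or a misspecified procedure before any real data is involved.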
FinTech & Data Science