Understanding Where Your Numbers Come From
Gelman, Hill, and Vehtari (2020) put it bluntly:
“Before fitting a model… it is a good idea to understand where your numbers are coming from.”
This is not preliminary work; it is the analysis.
In finance, data problems lead to:
The most sophisticated model built on flawed data produces flawed results, often with false confidence.
By the end of this session, you should be able to:
Every dataset is the output of some data generating process (DGP), the mechanism that determines:
Understanding the DGP reveals assumptions embedded in your data before you add modelling assumptions.
The prices you observe are the result of:
| Component | What It Determines | Theory Connection |
|---|---|---|
| Exchange mechanisms | How orders are matched, which prices are recorded | Market microstructure (order flow, liquidity) |
| Data vendor processing | How raw tick data is aggregated, cleaned, distributed | Aggregation creates information loss |
| Selection rules | Which securities are included, for how long | Survivorship bias, sample selection |
| Timing conventions | When prices are measured (close, VWAP, bid-ask midpoint) | Price discovery process, bid-ask spread |
Each choice affects downstream analysis and embeds theoretical assumptions.
Every time a trader wants to buy or sell, they face a strategic choice:
Make liquidity (patience): post a limit order and wait
Take liquidity (urgency): hit an existing order now
The spread is the price of this choice. Patient traders earn it; urgent traders pay it. Every “price” in your dataset is the result of someone choosing to take.
Every price you observe in financial data is the outcome of this make-or-take negotiation. The makers set the terms:
These limit orders accumulate in the order book, a live record of supply and demand:
A taker closes the gap: a buyer pays the ask price, or a seller accepts the bid price. That crossing is the “price” in your dataset.
Consider a stock trading near £100. The book at one instant:
| Side | Price | Volume | Meaning |
|---|---|---|---|
| Ask (sell) | £100.25 | 1,400 | “I’ll sell 1,400 shares at £100.25 or higher” |
| Ask (sell) | £100.15 | 1,000 | “I’ll sell 1,000 shares at £100.15 or higher” |
| Ask (sell) | £100.05 | 600 | ← Best Ask (cheapest available seller) |
| Spread | £0.10 | | No one agrees in this gap |
| Bid (buy) | £99.95 | 500 | ← Best Bid (most generous buyer) |
| Bid (buy) | £99.90 | 800 | “I’ll buy 800 shares at £99.90 or lower” |
| Bid (buy) | £99.85 | 1,200 | “I’ll buy 1,200 shares at £99.85 or lower” |
Reading this table: Bids are sorted highest-first, so the most aggressive buyer sits at the top of the bid block. Asks are listed with the lowest price nearest the spread, so the most aggressive seller sits at the bottom of the ask block. Both sides converge on the spread, the frontier of disagreement.
Why it matters for data: The “price” you see depends on which price is recorded (last trade, bid-ask midpoint, closing auction, or VWAP), and each carries different information.
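As a minimal sketch, the summary quantities for the example book can be computed directly from the table. It uses plain Python lists of `(price, volume)` pairs, which is an illustrative representation and not the format used in the labs.

```python
# Order book from the table above: (price in GBP, volume in shares).
asks = [(100.25, 1400), (100.15, 1000), (100.05, 600)]
bids = [(99.95, 500), (99.90, 800), (99.85, 1200)]

best_ask = min(price for price, _ in asks)   # cheapest available seller
best_bid = max(price for price, _ in bids)   # most generous buyer

spread = best_ask - best_bid
mid_price = (best_ask + best_bid) / 2
spread_bps = spread / mid_price * 10_000     # 1 basis point = 0.01%

print(f"Best bid: {best_bid:.2f}  Best ask: {best_ask:.2f}")
print(f"Spread: {spread:.2f} ({spread_bps:.1f} bps)  Mid: {mid_price:.2f}")
```

The spread here is quoted relative to the mid-price, one common convention; quoting it relative to the best bid or the last trade would give a slightly different figure.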
On a trading terminal, the order book appears as a Depth of Market (DOM) ladder: a compact vertical strip with bids on the left and asks on the right, converging at the spread.
This shows the same data as the table: green depth bars grow from the centre to show bid volume, and red bars show ask volume. The best bid and best ask are highlighted where they meet the spread.
The depth chart is this same order book rendered visually, with demand (bids) extending left and supply (asks) extending right, just like the supply-demand diagrams from economics.
=== Order Book Summary ===
Best Bid (highest buy price): £99.95 | Volume: 500 shares
Best Ask (lowest sell price): £100.05 | Volume: 600 shares
Bid-Ask Spread: £0.10 (10.0 basis points)
Mid-price: £100.00
Total bid volume (demand): 12,300 shares
Total ask volume (supply): 13,100 shares
Reading the chart: Bids (green, left) are demand: buyers waiting below the price. Asks (red, right) are supply: sellers waiting above. The deeper the bars, the more volume it takes to move the price through that level.
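To make "the more volume it takes to move the price" concrete, here is a hedged sketch of a market buy order walking up the ask side of the small example book from the table. The function `market_buy` is a hypothetical helper, and real matching engines add rules (time priority, hidden orders, fees) that are ignored here.

```python
def market_buy(asks, quantity):
    """Fill a market buy against resting asks, cheapest level first.

    Returns (average fill price, last price level touched).
    """
    if quantity <= 0:
        raise ValueError("quantity must be positive")
    remaining = quantity
    cost = 0.0
    last_price = None
    for price, volume in sorted(asks):       # best (lowest) ask first
        if remaining == 0:
            break
        filled = min(remaining, volume)
        cost += filled * price
        remaining -= filled
        last_price = price
    if remaining > 0:
        raise ValueError("order is larger than the visible book")
    return cost / quantity, last_price

asks = [(100.25, 1400), (100.15, 1000), (100.05, 600)]
avg_small, last_small = market_buy(asks, 500)    # fits inside the best ask
avg_large, last_large = market_buy(asks, 1200)   # exhausts the best ask level
print(f"500 shares:  average {avg_small:.2f}, last trade at {last_small:.2f}")
print(f"1200 shares: average {avg_large:.2f}, last trade at {last_large:.2f}")
```

The small order trades entirely at the best ask. The larger one consumes all 600 shares there and takes the rest from the next level, so the recorded last-trade price moves up even though no new information arrived.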
Order imbalance: When buy volume ≠ sell volume, price adjusts.
=== Order Imbalance & Price Impact ===
Scenario 1 (Balanced):
Bid volume: 10,000 | Ask volume: 10,000 | Imbalance: 0
→ Price change: +0.00 (no pressure)
Scenario 2 (Buy Pressure):
Bid volume: 15,000 | Ask volume: 7,000 | Imbalance: +8,000
→ Price change: +0.50 (upward pressure)
Scenario 3 (Sell Pressure):
Bid volume: 7,000 | Ask volume: 15,000 | Imbalance: -8,000
→ Price change: -0.50 (downward pressure)
Key insight: Order flow reveals information → prices adjust to clear imbalance
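The three scenarios above are consistent with a simple linear price-impact rule, sketched below. The linear form and the coefficient (backed out from Scenario 2 as 0.50 per 8,000 shares of imbalance) are assumptions for illustration; the source does not state which impact model produced its numbers.

```python
# Assumption: impact is linear in the signed imbalance, with the slope
# backed out from Scenario 2 above (+8,000 shares -> +0.50).
IMPACT_PER_SHARE = 0.50 / 8_000

def price_change(bid_volume, ask_volume, impact=IMPACT_PER_SHARE):
    """Return (signed imbalance, implied price change)."""
    imbalance = bid_volume - ask_volume
    return imbalance, impact * imbalance

scenarios = [("Balanced", 10_000, 10_000),
             ("Buy pressure", 15_000, 7_000),
             ("Sell pressure", 7_000, 15_000)]
for label, bid_vol, ask_vol in scenarios:
    imbalance, change = price_change(bid_vol, ask_vol)
    print(f"{label:13s} imbalance {imbalance:+6d} -> price change {change:+.2f}")
```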
Price discovery mechanism:
Data implication: Intraday prices reflect continuous discovery; daily close is just one snapshot.
Key principle: Use curated databases, not ad-hoc API calls.
Why this matters:
In this course: the Bloomberg database contains ~30 securities with daily adjusted prices for 2015-2024.
In labs, you’ll load from this database using a fallback cascade (local file → Colab URL → synthetic).
Critical distinction: adjusted prices account for corporate actions (splits, dividends).
Returns calculated on unadjusted: 176.2%
Returns calculated on adjusted: 452.5%
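A minimal sketch of why the two figures differ, using a single stock split. All prices are invented for illustration, dividends are ignored, and back-adjusting by the split ratio is one common vendor convention, assumed here.

```python
# Hypothetical closing prices around a 2-for-1 split (illustrative numbers only).
raw_close = [200.0, 210.0, 105.0, 110.0]
split_ratio = 2.0    # two new shares for each old share
split_index = 2      # first observation quoted on the post-split basis

# Back-adjust: divide every pre-split price by the split ratio.
adjusted_close = [
    price / split_ratio if i < split_index else price
    for i, price in enumerate(raw_close)
]

def total_return(prices):
    """Simple return from the first to the last price."""
    return prices[-1] / prices[0] - 1

raw_ret = total_return(raw_close)        # reads the split as a price collapse
adj_ret = total_return(adjusted_close)   # return a shareholder actually earned
print(f"Unadjusted: {raw_ret:+.1%}   Adjusted: {adj_ret:+.1%}")
```

On raw prices the holder appears to lose 45%; on adjusted prices the same holder gains 10%, because each old share became two new ones.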
Bloomberg database structure:
This DGP enables: Long-term return analysis, portfolio backtests, survivorship studies
This DGP rules out: intraday pattern analysis, microstructure analysis, order flow studies
We rarely observe what we truly want to study:
| Latent Construct | Observable Proxy | Measurement Gap |
|---|---|---|
| True risk | Historical volatility | Backward-looking; regime-dependent |
| Information asymmetry | Bid-ask spread | Also reflects inventory, competition |
| Market sentiment | Text analysis, word counts | Ignores semantic meaning, context |
| Firm productivity | Accounting ratios (ROA, ROE) | Ignores size, treats firms as homogeneous (Fox Paradox) |
This is the measurement problem in statistical science.
Modern example: Sentiment analysis counts words (“bullish”, “bearish”) but misses context and irony. Transformer models (BERT, GPT) help by capturing semantic meaning.
Problem: Accounting ratios (ROA, ROE) treat all firms as homogeneous, ignoring scale.
Example: Two firms, both with 10% ROA:
=== Fox Paradox Demonstration ===
Firm Assets (£m) Profit (£m) ROA (%)
Small Corp 10 1 10
Large Corp 1000 100 10
Average ROA (simple mean): 10.0%
Portfolio ROA (total profit / total assets): 10.0%
Difference: 0.0 percentage points
→ Average of ratios ≠ Ratio of averages
→ Large Corp dominates portfolio performance (99% of assets)
→ Simple ROA average treats £10m firm = £1bn firm
The paradox: Here the simple average ROA and the portfolio ROA are both 10%, but only because the two firms have identical ROAs by construction. When ROAs differ, the two measures diverge: small firms get equal weight in the simple average despite contributing trivially to total profit.
Implication for measurement: Ratios ignore economic significance. Better: use size-weighted metrics or separate analysis by scale.
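The divergence only appears once the ROAs differ. The sketch below changes the profits so that the small firm earns 30% and the large firm 5%; these numbers are hypothetical and chosen only to make the gap visible.

```python
# Hypothetical firms with different ROAs (illustrative numbers, GBP millions).
firms = {
    "Small Corp": {"assets": 10.0, "profit": 3.0},      # ROA = 30%
    "Large Corp": {"assets": 1000.0, "profit": 50.0},   # ROA = 5%
}

# Average of ratios: every firm gets equal weight.
roas = [f["profit"] / f["assets"] for f in firms.values()]
simple_mean_roa = sum(roas) / len(roas)

# Ratio of totals: firms are implicitly weighted by assets.
total_profit = sum(f["profit"] for f in firms.values())
total_assets = sum(f["assets"] for f in firms.values())
portfolio_roa = total_profit / total_assets

print(f"Simple mean ROA: {simple_mean_roa:.1%}")
print(f"Portfolio ROA:   {portfolio_roa:.1%}")
```

The simple mean is 17.5% while the asset-weighted figure is about 5.2%, so the headline number depends almost entirely on which average is reported.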
When proxy \(x\) measures true variable \(x^*\) with noise:
\[x = x^* + u\]
where \(u \perp x^*\) (classical error assumption).
Consequence when the true model is \(y = \alpha + \beta x^* + \varepsilon\) but we regress \(y\) on the noisy proxy \(x\):
\[\operatorname{plim}\, \hat{\beta}_{OLS} = \beta \cdot \underbrace{\frac{\text{Var}(x^*)}{\text{Var}(x^*) + \text{Var}(u)}}_{\text{signal-to-total ratio}}\]
Since \(0 < \text{Var}(x^*) / [\text{Var}(x^*) + \text{Var}(u)] < 1\) whenever \(\text{Var}(u) > 0\), in large samples \(|\hat{\beta}_{OLS}| < |\beta|\) (for \(\beta \neq 0\)).
This is attenuation bias: measurement error shrinks coefficients toward zero.
Attenuation increases with measurement noise:
σᵤ = 0.0: β̂ = 2.92 (97% of true)
σᵤ = 0.5: β̂ = 2.29 (76% of true)
σᵤ = 1.0: β̂ = 1.51 (50% of true)
σᵤ = 2.0: β̂ = 0.60 (20% of true)
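The pattern in these numbers can be checked with a small simulation, sketched below in plain Python. The true slope of 3, the unit variance of \(x^*\), the sample size, and the seed are assumptions (the slide output is consistent with a true slope near 3 but does not state the setup).

```python
import random

def ols_slope(x, y):
    """Slope of a simple OLS regression of y on x: Cov(x, y) / Var(x)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    return cov_xy / var_x

def simulate_slope(sigma_u, beta=3.0, n=20_000, seed=0):
    """Regress y on a noisy proxy x = x* + u and return the estimated slope."""
    rng = random.Random(seed)
    x_true = [rng.gauss(0.0, 1.0) for _ in range(n)]             # Var(x*) = 1
    y = [1.0 + beta * xs + rng.gauss(0.0, 1.0) for xs in x_true]
    x_obs = [xs + rng.gauss(0.0, sigma_u) for xs in x_true]      # classical error
    return ols_slope(x_obs, y)

results = {}
for sigma_u in (0.0, 0.5, 1.0, 2.0):
    predicted = 3.0 / (1.0 + sigma_u ** 2)   # beta * Var(x*) / (Var(x*) + Var(u))
    results[sigma_u] = (simulate_slope(sigma_u), predicted)
    print(f"sigma_u = {sigma_u}: estimated {results[sigma_u][0]:.2f}, "
          f"formula {predicted:.2f}")
```

With \(\text{Var}(x^*) = 1\), the formula predicts slopes of 3.0, 2.4, 1.5, and 0.6, which the simulated estimates should track closely at this sample size.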
Validity: Does your measure capture the construct you intend to study?
Reliability: Are your measurements consistent and reproducible?
Both matter, but validity comes first: a reliable but invalid measure is consistently wrong.
Example: High-frequency volatility estimates are highly reliable (consistent) but may not be valid measures of “risk” (depends on your risk concept).
Reliability question: Are findings robust to reasonable variations in measurement?
Example metric: Sharpe ratio = (Return - Risk-free rate) / Volatility
This measures risk-adjusted performance (higher is better). But is it reliable across time periods?
Sharpe ratio range: 0.18 to 0.56
Relative difference: 206%
Interpretation: Same strategy looks 'good' or 'bad' depending on measurement period.
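One way to run this reliability check is to compute the Sharpe ratio separately for each sub-period, as sketched below. The daily returns are synthetic, and annualising by the square root of 252 trading days is a common convention assumed here, not something the source specifies.

```python
import math
import random
import statistics

def sharpe_ratio(returns, rf_per_period=0.0, periods_per_year=252):
    """Annualised Sharpe ratio: mean excess return over its standard deviation."""
    excess = [r - rf_per_period for r in returns]
    per_period = statistics.mean(excess) / statistics.stdev(excess)
    return per_period * math.sqrt(periods_per_year)

# Synthetic daily returns for four years of 252 trading days (assumed parameters).
rng = random.Random(42)
daily = [rng.gauss(0.0004, 0.01) for _ in range(252 * 4)]

# Same strategy, same return process, measured one year at a time.
by_year = [sharpe_ratio(daily[i:i + 252]) for i in range(0, len(daily), 252)]
for year, value in enumerate(by_year, start=1):
    print(f"Year {year}: Sharpe {value:.2f}")
print(f"Range: {min(by_year):.2f} to {max(by_year):.2f}")
```

Even though every year is drawn from the same distribution, the yearly Sharpe ratios differ through sampling noise alone, which is why a single-period figure is a fragile summary.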
Human Development Index (HDI): UN measure combining life expectancy, education, and income.
Sounds comprehensive, but Gelman, Hill, and Vehtari (2020) show, for a version of the index computed across US states:
“The map is pretty much a map of state income with a mysterious transformation and a catchy name.”
Lesson: Most of the variation in this index comes from income alone; the other components add little information.
Finance parallels:
Always ask: What is my variable actually measuring?
Definition: Your sample differs systematically from the population you want to study.
Four common forms in finance:
| Bias Type | What Happens | Direction |
|---|---|---|
| Survivorship | Failed entities disappear from data | Upward |
| Availability | Only easy-to-get data is used | Varies |
| Reporting | Voluntary disclosure is strategic | Upward |
| Look-ahead | Future information leaks into past | Inflates performance |
Databases of currently listed stocks exclude companies that:
Result: The worst performers are removed → upward bias in measured returns.
Magnitude from research: Academic studies find survivorship bias of ~0.5-1.5% per year in US mutual funds; crisis periods show much larger biases (5-10%+ in UK banking 2008).
Simulation setup: 100 funds, 5 years monthly returns (mean 0.5%, volatility 4% monthly). Funds with cumulative loss > 50% “fail” and exit the database.
True mean return (all 100 funds): 5.85% per year
Biased mean (only 100 survivors): 5.85% per year
Survivorship bias: 0.00 percentage points per year
Failed funds: 0 (0%)
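With the parameters above almost no fund breaches the 50% loss threshold (the run shows zero failures), so the bias is invisible. The sketch below keeps the same mechanism but raises monthly volatility to 10% and uses 500 funds; both changes, and the seed, are assumptions made only so that exits actually occur.

```python
import random

def simulate_funds(n_funds=500, n_months=60, mean=0.005, vol=0.10,
                   fail_level=0.5, seed=1):
    """Return (total returns of all funds, total returns of survivors only)."""
    rng = random.Random(seed)
    all_returns, survivor_returns = [], []
    for _ in range(n_funds):
        wealth, failed = 1.0, False
        for _ in range(n_months):
            wealth *= 1.0 + rng.gauss(mean, vol)
            if wealth < fail_level:      # cumulative loss exceeds 50%
                failed = True
                break                    # fund exits and stops reporting
        total = wealth - 1.0             # return up to exit or end of sample
        all_returns.append(total)
        if not failed:
            survivor_returns.append(total)
    return all_returns, survivor_returns

all_r, surv_r = simulate_funds()
true_mean = sum(all_r) / len(all_r)
survivor_mean = sum(surv_r) / len(surv_r)
bias = survivor_mean - true_mean
print(f"Funds: {len(all_r)}  Survivors: {len(surv_r)}")
print(f"Mean total return, all funds: {true_mean:.1%}")
print(f"Mean total return, survivors: {survivor_mean:.1%}")
```

Every failed fund ends below the threshold and every survivor ends above it, so dropping the failures can only raise the measured average.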
How to check your data source for bias:
Example diagnostic:
Professional databases (Bloomberg, CRSP, FactSet) maintain delisting information and point-in-time records to enable survivorship-free analysis.
The most dangerous form in backtesting: using information that was not available at the time.
Why each type creates bias:
All three share a common flaw: the decision at time t uses information from time t+1 or later.
The Golden Rule
At any point in your backtest, you may only use information available at that point in time.
Simulation: 100 stocks, 10 years monthly (120 periods), each with random returns (mean 0.05%, vol 2% monthly). We select top 10 performers.
Wrong approach: Select top 10 over full 10 years, then evaluate performance in second half → uses future information.
Right approach: Select top 10 over first 5 years only, then evaluate in second half → only uses past information.
=== Test Period Performance (2020-2024) ===
Look-ahead selection: +19.1%
Proper selection: +5.4%
Benchmark (all stocks): +1.9%
Look-ahead bias: 13.7 pp
Interpretation: Look-ahead portfolio performs better because we
selected winners AFTER seeing their full-period performance.
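The two selection rules can be sketched as follows, using the same dimensions as the simulation described above (100 stocks, 120 months, top 10). The seed and the use of compounded total return as the ranking criterion are assumptions, so the printed figures will not match the slide output.

```python
import random

def total_return(monthly):
    """Compound a list of simple monthly returns."""
    wealth = 1.0
    for r in monthly:
        wealth *= 1.0 + r
    return wealth - 1.0

def top_k(returns_by_stock, start, end, k=10):
    """Indices of the k stocks with the highest total return over [start, end)."""
    ranked = sorted(range(len(returns_by_stock)),
                    key=lambda i: total_return(returns_by_stock[i][start:end]),
                    reverse=True)
    return ranked[:k]

def mean_return(returns_by_stock, picks, start, end):
    """Equal-weighted mean total return of the picked stocks over [start, end)."""
    return sum(total_return(returns_by_stock[i][start:end]) for i in picks) / len(picks)

rng = random.Random(7)
n_stocks, n_months, split = 100, 120, 60
returns = [[rng.gauss(0.0005, 0.02) for _ in range(n_months)]
           for _ in range(n_stocks)]

lookahead_picks = top_k(returns, 0, n_months)   # wrong: ranking sees the test period
proper_picks = top_k(returns, 0, split)         # right: ranking uses the first half only

lookahead_test = mean_return(returns, lookahead_picks, split, n_months)
proper_test = mean_return(returns, proper_picks, split, n_months)
benchmark_test = mean_return(returns, range(n_stocks), split, n_months)
print(f"Look-ahead selection: {lookahead_test:+.1%}")
print(f"Proper selection:     {proper_test:+.1%}")
print(f"Benchmark:            {benchmark_test:+.1%}")
```

Because the returns are pure noise, any outperformance of the look-ahead portfolio in the test period is manufactured by the selection step, not by skill.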
Selection bias violates Gelman’s Challenge 1 (Sample → Population):
Consequence: Inferences don’t generalise from sample to population.
Solution: Survivorship-bias-free databases, point-in-time datasets, temporal validation.
Professional tools: Bloomberg maintains point-in-time databases that record:
This enables look-ahead-free backtesting.
John Tukey’s philosophy: look at data before modelling it.
EDA goals:
Gelman, Hill, and Vehtari (2020): “All graphs are comparisons.”
Remember: EDA generates questions; modelling provides answers.
Always start with basic inspection:
=== Data Structure ===
Shape: (1462, 5)
Date range: 2020-01-01 00:00:00 to 2024-01-01 00:00:00
Data types: [dtype('float64')]
=== First 3 rows ===
AAPL MSFT GOOGL META AMZN
2020-01-01 120.662789 144.087901 133.109004 107.659211 122.803641
2020-01-02 125.289229 145.103732 131.834117 104.471862 121.400290
2020-01-03 124.609702 142.219353 129.343588 102.419741 124.289865
=== Last 3 rows ===
AAPL MSFT GOOGL META AMZN
2023-12-30 887.221891 364.569372 53.245292 61.460758 98.329511
2023-12-31 882.953509 363.957197 53.862049 59.609061 99.013962
2024-01-01 864.999320 365.400883 54.849232 57.277103 95.752719
=== Basic Statistics ===
AAPL MSFT GOOGL META AMZN
count 1462.00 1462.00 1462.00 1462.00 1462.00
mean 263.34 255.78 113.43 99.06 114.76
std 195.73 71.42 32.00 17.93 22.46
min 87.13 138.00 52.37 57.28 72.65
25% 120.47 214.80 96.84 85.84 98.16
50% 150.15 232.91 107.94 96.99 109.01
75% 391.59 287.94 129.37 112.08 134.45
max 916.80 454.91 229.27 154.63 168.51
Why check: Missing data patterns reveal systematic issues (delistings, thin trading, data vendor problems).
=== Missing Data Summary ===
AAPL: 50.0% missing
MSFT: 2.0% missing
GOOGL: 0.0% missing
META: 0.0% missing
AMZN: 0.0% missing
AAPL: 50.0% missing (delisting pattern)
MSFT: 2.0% missing (random gaps)
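A hedged sketch of how the two patterns might be told apart: count the missing values and check whether they all sit at the end of the series. The helper `missing_summary`, the tickers, and the use of `None` for a missing price are illustrative assumptions; in the labs a DataFrame would be the natural container.

```python
def missing_summary(column):
    """Count missing values (None) and the run of missing values at the end."""
    n_missing = sum(value is None for value in column)
    trailing = 0
    for value in reversed(column):
        if value is not None:
            break
        trailing += 1
    return {
        "missing_pct": 100.0 * n_missing / len(column),
        "trailing_missing": trailing,
        # All gaps sit at the end of the series: consistent with a delisting.
        "delisting_like": n_missing > 0 and trailing == n_missing,
    }

# Hypothetical price columns; None marks a missing observation.
prices = {
    "DELISTED": [10.0, 10.5, 11.0, None, None, None],   # stops reporting for good
    "GAPPY":    [20.0, None, 20.4, 20.1, None, 20.3],   # isolated gaps
}
for name, column in prices.items():
    print(name, missing_summary(column))
```

The percentage alone cannot distinguish the two cases; the position of the gaps is what points to the mechanism.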
=== Return Statistics ===
AAPL MSFT GOOGL META AMZN
count 1461.0000 1461.0000 1461.0000 1461.0000 1461.0000
mean 0.0016 0.0008 -0.0004 -0.0002 0.0000
std 0.0243 0.0168 0.0222 0.0222 0.0194
min -0.0761 -0.0499 -0.0631 -0.0683 -0.0581
25% -0.0152 -0.0104 -0.0155 -0.0145 -0.0137
50% 0.0015 0.0003 -0.0004 -0.0003 -0.0003
75% 0.0170 0.0118 0.0136 0.0148 0.0136
max 0.0994 0.0695 0.0735 0.0724 0.0685
=== Skewness (negative = left tail longer) ===
AAPL 0.136
MSFT 0.093
GOOGL 0.021
META 0.095
AMZN 0.070
dtype: float64
=== Excess Kurtosis (>0 = fat tails) ===
AAPL 0.029
MSFT 0.218
GOOGL -0.133
META 0.192
AMZN -0.139
dtype: float64
Static correlations hide non-stationarity: Correlations increase during crises (when you need diversification most).
Rolling correlation: Calculate correlation over moving window (e.g., 60 trading days).
=== Rolling Correlation Analysis ===
AAPL-MSFT:
Mean correlation: -0.073
Std deviation: 0.128
Min: -0.530 | Max: 0.298
Range: 0.828
GOOGL-META:
Mean correlation: -0.023
Std deviation: 0.150
Min: -0.373 | Max: 0.357
Range: 0.730
AAPL-GOOGL:
Mean correlation: 0.012
Std deviation: 0.114
Min: -0.273 | Max: 0.287
Range: 0.559
Key insight: Correlations are non-stationary and increase during market stress
→ Static correlation underestimates crisis risk
→ Diversification benefits disappear when needed most
Stylised fact (crisis correlations): correlations spike during market downturns.
Portfolio implication: Using historical average correlation for risk management underestimates tail risk.
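The rolling correlation described above can be sketched in plain Python as below. The two synthetic series are built with a calm regime followed by a stress regime that shares a common shock; that construction, its parameters, and the seed are assumptions made purely to show a correlation shift.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def rolling_correlation(x, y, window=60):
    """Correlation over each trailing window; one value per complete window."""
    return [pearson(x[i - window:i], y[i - window:i])
            for i in range(window, len(x) + 1)]

rng = random.Random(3)
# Calm regime: independent returns. Stress regime: a large common shock.
calm = [(rng.gauss(0, 0.01), rng.gauss(0, 0.01)) for _ in range(250)]
stress = []
for _ in range(250):
    common = rng.gauss(0, 0.03)
    stress.append((common + rng.gauss(0, 0.01), common + rng.gauss(0, 0.01)))
a, b = zip(*(calm + stress))

rolling = rolling_correlation(a, b, window=60)
static = pearson(a, b)
print(f"Static correlation: {static:.2f}")
print(f"Rolling: first window {rolling[0]:.2f}, last window {rolling[-1]:.2f}")
```

A single full-sample correlation blends the two regimes, whereas the rolling series shows how different they are. A window with zero variance would make `pearson` divide by zero, which a production version would need to guard against.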
IQR method (Interquartile Range): Standard statistical approach for outlier detection.
Logic:
=== Outlier Detection (IQR Method) ===
Outliers by security:
AAPL: 11 outliers (0.8%)
MSFT: 18 outliers (1.2%)
GOOGL: 11 outliers (0.8%)
META: 19 outliers (1.3%)
AMZN: 5 outliers (0.3%)
Total outliers: 64 (0.88%)
Extreme returns (|r| > 5% daily):
AAPL: 59 days
MSFT: 6 days
GOOGL: 33 days
META: 39 days
AMZN: 12 days
Most extreme single-day return:
AAPL: 9.94%
MSFT: 6.95%
GOOGL: 7.35%
META: 7.24%
AMZN: 6.85%
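The IQR rule is conventionally applied by flagging values below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR. The sketch below assumes that 1.5 multiplier (Tukey's usual choice) and the "inclusive" quantile method; the source's own settings are not shown, so treat both as assumptions.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return (outliers, (lower fence, upper fence)) under the IQR rule."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    outliers = [v for v in values if v < lower or v > upper]
    return outliers, (lower, upper)

sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
outliers, (lower, upper) = iqr_outliers(sample)
print(f"Fences: [{lower}, {upper}]  Outliers: {outliers}")
```

Because the fences are built from quartiles, a single extreme value barely moves them, unlike a rule based on the mean and standard deviation.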
Before deep analysis, check if your data exhibits the known empirical facts:
These are empirical facts, not assumptions.
ACF (Autocorrelation Function): Measures correlation between a series and its lagged values.
For returns: most lags should fall within the confidence band around zero, meaning past returns carry little information about future returns.
Lag-1 autocorrelation: -0.0173
Interpretation: Near zero → today's return doesn't predict tomorrow's
Key insight: Returns are unpredictable, but volatility (squared returns) shows persistence.
Returns ACF(1): 0.0364 (near zero)
Squared returns ACF(1): 0.2198 (strong positive)
Interpretation: Returns unpredictable, but volatility persists.
High volatility today → high volatility tomorrow.
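A minimal sketch of the comparison: compute the lag-1 autocorrelation of returns and of squared returns, and compare each with the approximate 95% band of ±1.96/√n. The returns are simulated from a simple ARCH(1)-style process whose parameters and seed are assumptions chosen for illustration; it is not the model used to produce the output above.

```python
import math
import random

def autocorr(series, lag=1):
    """Sample autocorrelation of a series at the given lag."""
    n = len(series)
    mean = sum(series) / n
    dev = [v - mean for v in series]
    num = sum(dev[t] * dev[t + lag] for t in range(n - lag))
    den = sum(d * d for d in dev)
    return num / den

# ARCH(1)-style returns: today's variance depends on yesterday's squared return.
rng = random.Random(11)
omega, alpha = 0.0001, 0.4          # assumed parameters
returns, prev = [], 0.0
for _ in range(2000):
    sigma = math.sqrt(omega + alpha * prev ** 2)
    prev = sigma * rng.gauss(0.0, 1.0)
    returns.append(prev)

band = 1.96 / math.sqrt(len(returns))           # approximate 95% band under no autocorrelation
acf_returns = autocorr(returns)
acf_squared = autocorr([r * r for r in returns])
print(f"Band: +/-{band:.3f}")
print(f"Returns ACF(1):         {acf_returns:+.3f}")
print(f"Squared returns ACF(1): {acf_squared:+.3f}")
```

In a process like this the sign of the next return is unpredictable, so the first figure would typically sit near the band, while the size of the next move depends on the last one, so the second would typically sit well outside it.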
Excess kurtosis: 0.03 (Normal = 0)
Normality test p-value: 0.1008
→ Cannot reject normality
Negative skewness: Left tail (losses) extends further than right tail (gains).
Interpretation: Extreme losses are more likely than extreme gains (crashes vs rallies).
Skewness: -0.243
→ Negative skewness confirmed: left tail longer
Left tail (5th %ile): -0.0522
Right tail (95th %ile): 0.0466
Tail asymmetry ratio: 1.12
Interpretation: Extreme losses (5.22%) larger
than extreme gains (4.66%)
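The skewness, excess kurtosis, and tail asymmetry figures can be computed along the following lines. This sketch uses simple population-moment formulas and the "inclusive" percentile method, both assumptions: library routines such as those in pandas apply small-sample corrections, so their values will differ slightly.

```python
import statistics

def skew_and_kurtosis(returns):
    """Moment-based skewness and excess kurtosis (no small-sample correction)."""
    n = len(returns)
    mean = sum(returns) / n
    m2 = sum((r - mean) ** 2 for r in returns) / n
    m3 = sum((r - mean) ** 3 for r in returns) / n
    m4 = sum((r - mean) ** 4 for r in returns) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0

def tail_asymmetry(returns):
    """Size of the 5th percentile loss relative to the 95th percentile gain."""
    cuts = statistics.quantiles(returns, n=20, method="inclusive")
    return abs(cuts[0]) / cuts[-1]

symmetric = [-2.0, -1.0, 0.0, 1.0, 2.0]
skewness, excess_kurtosis = skew_and_kurtosis(symmetric)
ratio = tail_asymmetry(symmetric)
print(f"Skewness: {skewness:.3f}  Excess kurtosis: {excess_kurtosis:.3f}  "
      f"Tail ratio: {ratio:.2f}")
```

For this symmetric toy sample the skewness is zero and the tail ratio is one; a ratio above one, as in the output above, means the loss tail reaches further than the gain tail.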
These facts constrain sensible model specifications:
| Stylised Fact | Modelling Implication |
|---|---|
| Weak return autocorrelation | Simple AR (AutoRegressive) models won’t forecast returns |
| Volatility clustering | Need GARCH (Generalised AutoRegressive Conditional Heteroskedasticity) |
| Fat tails | Normal-based VaR underestimates risk |
| Negative skewness | Symmetric distributions mis-specify downside |
| Leverage effect | Volatility increases when prices fall; symmetric GARCH misses this asymmetry |
Simple linear models assuming i.i.d. normal errors will be severely misspecified.
Coming weeks: We’ll cover AR models (Week 3) and GARCH models (Week 4) that properly handle these properties.
Even Bloomberg data can have issues:
Price errors:
Timing errors:
Automated validation: Run systematic checks before every analysis.
Four key checks:
Pattern: Define a validator class with one method per check, then run all checks and report any issues.
In labs: You’ll implement this full validation pipeline as a reusable class.
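A minimal sketch of the validator pattern is shown below. The class name `DataValidator`, the four checks chosen (missing values, non-positive prices, extreme returns, date gaps), and the default thresholds are all assumptions for illustration; the lab version may use different checks and a DataFrame rather than plain lists.

```python
from datetime import date

class DataValidator:
    """Runs a fixed set of sanity checks and collects readable issue messages."""

    def __init__(self, dates, prices, max_abs_return=0.25, max_gap_days=5):
        self.dates = dates                  # list of datetime.date, ascending
        self.prices = prices                # dict: ticker -> list of float or None
        self.max_abs_return = max_abs_return
        self.max_gap_days = max_gap_days

    def check_missing(self):
        return [f"{t}: {sum(p is None for p in col)} missing values"
                for t, col in self.prices.items() if any(p is None for p in col)]

    def check_positive(self):
        return [f"{t}: non-positive price on {d}"
                for t, col in self.prices.items()
                for d, p in zip(self.dates, col) if p is not None and p <= 0]

    def check_extreme_returns(self):
        issues = []
        for t, col in self.prices.items():
            for d, prev, cur in zip(self.dates[1:], col, col[1:]):
                if prev is None or cur is None or prev <= 0:
                    continue                # cannot form a return here
                ret = cur / prev - 1
                if abs(ret) > self.max_abs_return:
                    issues.append(f"{t}: return of {ret:+.1%} on {d}")
        return issues

    def check_date_gaps(self):
        return [f"gap of {(b - a).days} days before {b}"
                for a, b in zip(self.dates, self.dates[1:])
                if (b - a).days > self.max_gap_days]

    def run_all(self):
        checks = [self.check_missing, self.check_positive,
                  self.check_extreme_returns, self.check_date_gaps]
        return [issue for check in checks for issue in check()]

# Hypothetical data with one missing value, one price error, and one date gap.
dates = [date(2024, 1, 2), date(2024, 1, 3), date(2024, 1, 4), date(2024, 1, 15)]
prices = {"AAA": [100.0, 101.0, None, 102.0], "BBB": [50.0, 5.0, 5.1, 5.2]}
issues = DataValidator(dates, prices).run_all()
for issue in issues:
    print(issue)
```

Keeping each check in its own method makes it easy to add or remove checks and to report which one fired.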
=== Data Provenance Log ===
dataset: Bloomberg Sample
date_range: 2020-01-01 00:00:00 to 2024-01-01 00:00:00
shape: (1462, 5)
securities: 5
observations: 1462
missing_values: 0
missing_pct: 0.00%
extreme_returns: 0
max_date_gap_days: 1
✔ Validation log saved to data_validation_log.json
Not all missing data is equal:
| Pattern | Description | Implication | Financial Example |
|---|---|---|---|
| MCAR | Missing Completely At Random | Safe to ignore (no bias) | Random data recording glitches |
| MAR | Missing At Random (given observed) | Imputation may work | Small-cap stocks missing during holidays |
| MNAR | Missing Not At Random | Serious bias risk | Failed funds stop reporting; illiquid assets gap during stress |
In finance, missing data is rarely MCAR:
Key question: Why is it missing? The mechanism matters.
Quality gate: Automated pass/fail check before analysis proceeds.
Logic:
Example thresholds:
This prevents silent failures: Better to catch issues early than discover them after modelling.
In labs, you’ll implement this as a reusable function with customisable thresholds.
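A hedged sketch of such a gate is below. The function name `quality_gate` and the default limits are hypothetical placeholders, not the course's thresholds; the statistic names are borrowed from the provenance log shown earlier.

```python
def quality_gate(stats, thresholds=None):
    """Return (passed, failures) after comparing statistics with upper limits."""
    # Hypothetical default limits; override per project via `thresholds`.
    limits = {"missing_pct": 5.0, "extreme_returns": 10, "max_date_gap_days": 7}
    if thresholds:
        limits.update(thresholds)
    failures = [f"{name} = {stats[name]} exceeds limit {limit}"
                for name, limit in limits.items() if stats[name] > limit]
    return not failures, failures

# Values taken from the provenance log above.
log_stats = {"missing_pct": 0.0, "extreme_returns": 0, "max_date_gap_days": 1}
passed, failures = quality_gate(log_stats)
print("PASS" if passed else f"FAIL: {failures}")

# The same gate applied to a dataset with half its values missing.
bad_stats = dict(log_stats, missing_pct=50.0)
bad_passed, bad_failures = quality_gate(bad_stats)
print("PASS" if bad_passed else f"FAIL: {bad_failures}")
```

Returning the list of failures alongside the pass/fail flag means the gate can both halt a pipeline and explain why.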
Recall from Week 1: statistical science is the study of variation and uncertainty.
Data quality issues introduce systematic uncertainty:
These aren’t random noise; they’re systematic biases that invalidate inference.
Gelman, Hill, and Vehtari (2020)’s three fundamental challenges apply directly to data work:
Challenge 1: Generalisation (Sample → Population)
Challenge 2: Causal Inference
Look-ahead bias is a form of post-treatment bias: using information generated after the “treatment” (investment decision) to evaluate outcomes.
Challenge 3: Validity
Attenuation bias: measurement error leads us to systematically underestimate true relationships.
Key insight: Data cleaning and preparation aren’t preliminary tasks; they’re integral parts of statistical inference.
Every data decision affects uncertainty:
From Week 1: Variation and uncertainty propagate through your entire analysis pipeline.
Question: How sensitive are conclusions to data cleaning choices?
Sensitivity analysis: Test multiple reasonable outlier treatments, compare results.
=== Sensitivity to Outlier Treatment ===
Keep All: 28.88% annualised
Winsorise: 24.47% annualised
Drop Extremes: 18.70% annualised
Clip ±5%: 26.15% annualised
Range: 10.18 percentage points
Conclusion: 10.2pp spread shows moderate sensitivity.
Report range in analysis; don't hide cleaning impact.
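The four treatments named above can be sketched as small functions applied to the same return series. The 1st/99th percentile winsorising limits, the ±5% cut-off for dropping, the arithmetic annualisation, and the synthetic data are all assumptions, so the numbers will not reproduce the output above.

```python
import random
import statistics

def winsorise(returns, lower_pct=1, upper_pct=99):
    """Pull values beyond the given percentiles back to those percentiles."""
    cuts = statistics.quantiles(returns, n=100, method="inclusive")
    lo, hi = cuts[lower_pct - 1], cuts[upper_pct - 1]
    return [min(max(r, lo), hi) for r in returns]

def drop_extremes(returns, limit=0.05):
    """Discard observations whose absolute return exceeds the limit."""
    return [r for r in returns if abs(r) <= limit]

def clip(returns, limit=0.05):
    """Cap returns at +/- limit while keeping every observation."""
    return [min(max(r, -limit), limit) for r in returns]

def annualised_mean(returns, periods_per_year=252):
    return statistics.mean(returns) * periods_per_year

# Synthetic daily returns plus two injected extreme days.
rng = random.Random(5)
sample = [rng.gauss(0.0005, 0.015) for _ in range(1000)] + [0.12, -0.15]

results = {
    "Keep All": annualised_mean(sample),
    "Winsorise": annualised_mean(winsorise(sample)),
    "Drop Extremes": annualised_mean(drop_extremes(sample)),
    "Clip ±5%": annualised_mean(clip(sample)),
}
for name, value in results.items():
    print(f"{name:14s} {value:.2%} annualised")
spread = max(results.values()) - min(results.values())
print(f"Range: {spread * 100:.2f} percentage points")
```

Reporting the whole table, not a single preferred row, lets a reader see how much of the headline figure rests on the cleaning choice.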
Six principles from Week 1, applied to data:
This is what “data science as statistical science” means.
In your coursework, you will:
Quality of data work directly affects quality of inference.
Track 1: Colab/Home (Lab 02 APIs)
Track 2: Bloomberg Terminal (Lab 02 Survivorship Bias)
By the end of the labs, you should have:
These are portable skills; use them in every future analysis.
Five principles for responsible data science:
Remember: Data understanding is the analysis, not preparation for it.
Core readings (course textbook):
Extension (recommended):
Further reading (if interested): Gelman, Hill, and Vehtari (2020) Ch. 2; Tsay (2010) Ch. 1
Quick Check: Data & Measurement Concepts
Answer these three questions before leaving:
Q1: Which type of selection bias is created when you use restated earnings data in a backtest?
Q2: What does it mean if ACF of returns is near zero but ACF of squared returns is high and persistent?
Q3: A hedge fund database shows average returns of 12% per year. Why might this overestimate true performance?
Discussion: Bring one data quality question from your own coursework area to next session.
FinTech & Data Science