Part I. Pipeline architecture: data ingestion, transformation, orchestration, serving
Part II. Production deployment: patterns, versioning, A/B testing, rollback strategies
Part III. Monitoring & drift: performance tracking, data/concept drift, alerting
Part IV. Feature engineering: feature stores, temporal features, scale challenges
Part V. Rigorous backtesting: multiple testing, overfitting detection, validation
Data Sources → Ingestion → Storage → Feature Engineering → Training →
Validation → Deployment → Serving → Monitoring → Retraining
Key components:
Monolithic architecture:
- Single application handling all components
- Simple deployment, challenging scaling
- Common for small teams/applications
Microservices architecture:
- Separate services for ingestion, training, serving
- Independent scaling, complex operations
- Standard for large organizations
Serverless architecture:
- Cloud functions triggered by events
- Auto-scaling, pay-per-use, limited control
- Growing adoption for simple pipelines
Key trade-offs: Complexity vs flexibility, cost vs control, simplicity vs scalability
Batch prediction:
- Precompute predictions periodically
- Store in database, serve via lookup
- High throughput, but predictions can be stale between runs
Online serving:
- Compute predictions on-demand
- REST API, gRPC, or embedded model
- Fresh predictions, but subject to latency constraints
Shadow mode:
- New model runs in parallel with production
- Its predictions are not used, only monitored
- Safe validation before cutover
A/B testing:
- Split traffic between model versions
- Compare performance statistically
- Roll out gradually to limit risk
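The shadow-mode and A/B patterns above can be sketched in a few lines. This is a minimal illustration, not a production router: the function names, the `salt` parameter, and the 10% default treatment share are assumptions made for the example, and the models are plain callables.

```python
import hashlib

def assign_variant(user_id: str, treatment_share: float = 0.1,
                   salt: str = "exp-1") -> str:
    """Deterministically map a user to 'control' or 'treatment'.

    Hashing (salt, user_id) keeps each user on the same variant across
    requests; changing the salt reshuffles users for a new experiment.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # value in [0, 1)
    return "treatment" if bucket < treatment_share else "control"

def serve_with_shadow(user_id, features, prod_model, candidate_model, shadow_log):
    """Shadow mode: return the production prediction, only log the candidate's."""
    prod_pred = prod_model(features)
    try:
        shadow_log.append((user_id, prod_pred, candidate_model(features)))
    except Exception as exc:  # a failing candidate must never break serving
        shadow_log.append((user_id, prod_pred, repr(exc)))
    return prod_pred
```

Hash-based assignment is used here instead of a random draw per request so that a user does not flip between model versions mid-experiment, which would blur the statistical comparison.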
Why version models:
- Track which model produced which predictions
- Enable rollback if new model degrades
- Support A/B testing and gradual rollouts
- Meet regulatory audit requirements
Versioning strategies:
- Semantic versioning (major.minor.patch)
- Git commit hashes
- Timestamps
- Combined, e.g. v2.3.1-a4f8e90-20250101
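A combined tag of this shape can be built and parsed mechanically. The sketch below assumes the layout `v<semver>-<7-char commit>-<YYYYMMDD>` inferred from the example tag; the helper names are hypothetical.

```python
from datetime import date, datetime

def build_version(semver: str, commit_hash: str, build_date: date) -> str:
    """Combine semantic version, short commit hash, and date into one tag."""
    major, minor, patch = (int(p) for p in semver.split("."))  # rejects malformed input
    return f"v{major}.{minor}.{patch}-{commit_hash[:7]}-{build_date:%Y%m%d}"

def parse_version(tag: str) -> dict:
    """Split a combined tag back into its three components."""
    body = tag[1:] if tag.startswith("v") else tag
    semver, commit, stamp = body.split("-")
    return {
        "semver": tuple(int(p) for p in semver.split(".")),
        "commit": commit,
        "date": datetime.strptime(stamp, "%Y%m%d").date(),
    }

tag = build_version("2.3.1", "a4f8e90c1d2", date(2025, 1, 1))
print(tag)  # v2.3.1-a4f8e90-20250101
```

Storing this tag alongside every prediction is what makes the first bullet (tracing which model produced which prediction) and rollback practical.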
Rollback scenarios:
- Performance degradation detected
- Production incidents or errors
- Regulatory compliance issues
- Business metric deterioration
Metrics to track:
Predictive performance:
- Accuracy, precision, recall, F1, AUC
- By segment (demographic, time period)
- Compared to baseline and previous versions
Operational metrics:
- Latency (p50, p95, p99)
- Throughput (predictions per second)
- Error rates
- Resource usage (CPU, memory, GPU)
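As an illustration of the operational metrics, the sketch below summarizes one monitoring window using only the standard library. The function name and the input shape are assumptions: `latencies_ms` is taken to hold one latency per request in the window, including failed requests.

```python
from statistics import quantiles

def summarize_requests(latencies_ms, n_errors, window_seconds):
    """Latency percentiles, throughput, and error rate for one window."""
    # n=100 yields the 99 cut points p1..p99; index k-1 is the k-th percentile.
    cuts = quantiles(latencies_ms, n=100, method="inclusive")
    n_requests = len(latencies_ms)
    return {
        "p50_ms": cuts[49],
        "p95_ms": cuts[94],
        "p99_ms": cuts[98],
        "throughput_per_s": n_requests / window_seconds,
        "error_rate": n_errors / n_requests,
    }
```

Tail percentiles (p95, p99) are tracked next to the median because a small share of slow requests can violate a latency budget while the median still looks healthy.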
Business metrics:
- Revenue impact
- Cost savings
- Customer satisfaction
- Fraud losses
Challenges:
Training-serving skew:
- Features computed differently in training vs serving
- Causes accuracy degradation in production
- Mitigated by feature stores that share one feature definition across both paths
Temporal leakage:
- Using future information in features
- Creates unrealistically high training accuracy
- Model fails in production, where that information is unavailable
Scale:
- Computing features for billions of examples
- Low-latency requirements (< 100 ms)
- Cost efficiency
Feature stores: Centralized repository for feature definitions and values
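A minimal sketch of temporal correctness: the feature for time t is computed only from observations strictly before t. The function name and the trailing-mean feature are illustrative assumptions; the point is that a single shared definition, called by both the training and the serving path, also removes one source of training-serving skew.

```python
def trailing_mean_features(values, window):
    """Trailing mean over the previous `window` observations.

    The feature at index t uses values[t-window:t] only, never values[t]
    or anything later, so it cannot leak future information.
    """
    feats = []
    for t in range(len(values)):
        past = values[max(0, t - window):t]
        feats.append(sum(past) / len(past) if past else None)
    return feats

print(trailing_mean_features([1, 2, 3, 4], window=2))  # [None, 1.0, 1.5, 2.5]
```

A quick leakage check follows from the definition: changing the last observation must leave every feature unchanged, because no feature is allowed to see its own or a later time step.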
Scenario: Testing 100 features for predictive power at α = 0.05
Expected false positives: 100 × 0.05 = 5 features appear significant purely by chance, even if none has real predictive power
Danger: Select these features, backtest strategy, publish results
Reality: Strategy performs no better than random (or worse)
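The scenario can be checked by simulation. The sketch below assumes every feature is pure noise, so each p-value is uniform on [0, 1] (which holds for a continuous test statistic under the null); function names and the seed are arbitrary choices for the example.

```python
import random

def count_false_positives(n_features=100, alpha=0.05, rng=None):
    """Count null features whose p-value falls below alpha in one experiment."""
    if rng is None:
        rng = random.Random()
    # Under the null hypothesis a p-value is a uniform draw on [0, 1).
    return sum(rng.random() < alpha for _ in range(n_features))

def average_false_positives(n_trials=2000, n_features=100, alpha=0.05, seed=0):
    """Average the false-positive count over many repeated experiments."""
    rng = random.Random(seed)
    counts = [count_false_positives(n_features, alpha, rng) for _ in range(n_trials)]
    return sum(counts) / n_trials

print(average_false_positives())  # close to n_features * alpha = 5
```

Any strategy built from the features that pass this screen inherits the problem: the selection step itself guarantees an impressive in-sample result.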
Multiple testing corrections:
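Two corrections named in the exercises are Bonferroni and FDR control. The sketch below is one plain-Python way to write them, using the Benjamini-Hochberg procedure as the FDR method (an assumption, since the outline does not say which FDR procedure is intended). Both functions return one reject/keep flag per input p-value.

```python
def bonferroni(p_values, alpha=0.05):
    """Controls the family-wise error rate: reject only if p <= alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def benjamini_hochberg(p_values, alpha=0.05):
    """Controls the false discovery rate with the step-up BH procedure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * alpha / m:
            k_max = rank  # largest rank whose p-value clears its threshold
    reject = [False] * m
    for i in order[:k_max]:
        reject[i] = True
    return reject

p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(sum(p_i < 0.05 for p_i in p))   # 5 look significant without correction
print(sum(bonferroni(p)))             # 1 survives Bonferroni
print(sum(benjamini_hochberg(p)))     # 2 survive Benjamini-Hochberg
```

Bonferroni is the stricter of the two; Benjamini-Hochberg accepts a controlled share of false discoveries in exchange for more power, which is often the better trade when screening many candidate features.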
Key insight: Most published investment strategies likely overfit
Using JKP factor data (Jensen, Kelly, Pedersen):
Exercise 1: Pipeline Implementation (30 min)
- Data ingestion from CSV
- Feature engineering with temporal correctness
- Training pipeline with versioning
- Orchestration using simple scheduler
Exercise 2: Backtesting with Multiple Testing Correction (45 min)
- Implement factor models
- Apply Bonferroni and FDR corrections
- Compute combinatorial purged CV
- Calculate probability of backtest overfitting
Exercise 3: Production Monitoring (25 min)
- Deploy model in simulation
- Monitor performance over time
- Detect data drift in factors
- Implement alerting for degradation
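For the drift-detection step in Exercise 3, one common starting point is the Population Stability Index (PSI), which compares a feature's binned distribution in a reference sample with a recent sample. This is a sketch under assumptions, not the exercise's required method: the equal-width binning, the `eps` floor for empty bins, and the 0.25 alert threshold (a widely quoted rule of thumb) are all choices made for illustration.

```python
import math

def psi(expected, actual, n_bins=10, eps=1e-6):
    """Population Stability Index between a reference and a recent sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant reference

    def shares(xs):
        counts = [0] * n_bins
        for x in xs:
            idx = int((x - lo) / width)
            counts[min(max(idx, 0), n_bins - 1)] += 1  # clip out-of-range values
        return [max(c / len(xs), eps) for c in counts]  # avoid log(0)

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

def drift_alert(expected, actual, threshold=0.25):
    """Return True when PSI exceeds the (assumed) alerting threshold."""
    return psi(expected, actual) > threshold
```

Identical samples give a PSI of exactly zero, and the value grows as mass moves between bins. In practice the threshold would be tuned per feature against the false-alarm rate the team can tolerate.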
Problem: Testing 50 features at α = 0.05 → expect 50 × 0.05 = 2.5 false discoveries by chance, even if no feature carries real signal!
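The arithmetic behind this number, written out. Here m is the number of tests and V the number of false discoveries, assuming every null hypothesis is true; the second line additionally assumes the tests are independent.

```latex
\begin{align*}
\mathbb{E}[V] &= m\,\alpha = 50 \times 0.05 = 2.5 \\
P(V \ge 1) &= 1 - (1 - \alpha)^{m} = 1 - 0.95^{50} \approx 0.92 \\
\alpha_{\text{Bonferroni}} &= \frac{\alpha}{m} = \frac{0.05}{50} = 0.001
\end{align*}
```

So with 50 uncorrected tests, at least one spurious "discovery" is close to certain, and a Bonferroni correction would require p ≤ 0.001 per test.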
Connection to production ML:
- Pipeline tests 100+ models, 1,000+ features → many “significant” results by luck
- Backtest 50 trading strategies → some appear profitable by chance (not skill)
Solutions from Week 1, §0.5.2:
Connection to Week 1, §0.5.2: Multiple Testing & Ch 10: Multiple Testing in ML
Gelman’s critique: p < 0.05 is weak evidence when testing many hypotheses!
Production ML multiple testing:
- Feature selection: Testing 1,000 features at α = 0.05 → expect 50 “significant” by chance
- Hyperparameter tuning: Testing 100 configs → some look good only because they overfit the validation data
- Model selection: Testing 20 algorithms → the winner's score is likely inflated by selection
Solution: Honest out-of-sample validation on completely held-out test set (no peeking!)
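A minimal sketch of such a split, assuming time-ordered data where each row carries a sortable `date` key (the key name, the fractions, and the three-way layout are assumptions for illustration). Features, hyperparameters, and models are chosen using train and validation only; the test block is evaluated once, at the end.

```python
def chronological_split(rows, val_frac=0.2, test_frac=0.2):
    """Split rows into train / validation / test blocks in time order.

    The test block is the most recent data and must not influence any
    selection decision; touching it more than once reintroduces the
    multiple-testing problem.
    """
    ordered = sorted(rows, key=lambda r: r["date"])
    n = len(ordered)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    train = ordered[: n - n_val - n_test]
    val = ordered[n - n_val - n_test : n - n_test]
    test = ordered[n - n_test :]
    return train, val, test
```

Splitting by time rather than at random keeps future observations out of the training set, which matters for the same reason as temporal leakage in features.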
1. Production ML is an engineering discipline: pipeline architecture, deployment patterns, monitoring, and maintenance
2. Data drift and concept drift silently degrade models: continuous monitoring and retraining are essential
3. Feature engineering drives success: training-serving consistency, temporal correctness, scale challenges
4. Overfitting is pervasive in financial ML: multiple testing corrections and rigorous validation are required
5. Regulatory compliance shapes architecture: model risk management, explainability, auditability
6. MLOps practices enable reliable systems: versioning, testing, monitoring, incident response
FinTech & Data Science