Week 10: Data Science Pipelines & Production ML

Learning Objectives

  • Design end-to-end data science pipelines for financial applications
  • Implement production ML systems with monitoring and maintenance
  • Detect and respond to model drift and data quality issues
  • Apply feature engineering techniques at scale
  • Understand MLOps principles for financial technology
  • Evaluate model performance in production versus development
  • Navigate regulatory requirements for ML in finance (model risk management)
  • Implement backtesting frameworks avoiding overfitting pitfalls

Agenda

Part I : Pipeline architecture: Data ingestion, transformation, orchestration, serving
Part II : Production deployment: Patterns, versioning, A/B testing, rollback strategies
Part III : Monitoring & drift: Performance tracking, data/concept drift, alerting
Part IV : Feature engineering: Feature stores, temporal features, scale challenges
Part V : Rigorous backtesting: Multiple testing, overfitting detection, validation

Part I : Pipeline Architecture: Data Ingestion, Transformation, Orchestration, Serving

End-to-End ML Pipeline Components

Data Sources → Ingestion → Storage → Feature Engineering → Training → 
  Validation → Deployment → Serving → Monitoring → Retraining

Key components:

  • Data ingestion: APIs, databases, streaming (Kafka, Kinesis)
  • Data storage: Data lakes (S3), warehouses (Snowflake, BigQuery)
  • Feature engineering: Transformation, aggregation, encoding
  • Model training: Distributed training, hyperparameter optimization
  • Model validation: Cross-validation, backtesting, A/B testing
  • Model serving: REST APIs, batch prediction, streaming inference
  • Monitoring: Performance metrics, drift detection, alerting
  • Orchestration: Workflow management (Airflow, Prefect, Kubeflow)
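
To make the orchestration component concrete, the sketch below runs the stages of the flow above in dependency order using only the Python standard library. The stage names and the do-nothing placeholder tasks are assumptions for illustration; tools such as Airflow or Prefect add scheduling, retries, and logging on top of the same idea.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical stage names mirroring the flow above; each stage maps to
# the set of stages it depends on.
PIPELINE_DAG = {
    "ingest": set(),
    "store": {"ingest"},
    "features": {"store"},
    "train": {"features"},
    "validate": {"train"},
    "deploy": {"validate"},
    "monitor": {"deploy"},
}


def run_pipeline(dag, tasks):
    """Run each task once, only after all of its upstream stages have run."""
    executed = []
    for stage in TopologicalSorter(dag).static_order():
        tasks[stage]()  # a real orchestrator would add retries and logging here
        executed.append(stage)
    return executed


# Placeholder tasks that do nothing; real ones would call ingestion code, etc.
tasks = {name: (lambda: None) for name in PIPELINE_DAG}
order = run_pipeline(PIPELINE_DAG, tasks)
print(order)
```

Declaring dependencies rather than a fixed call sequence is what lets an orchestrator rerun only the stages downstream of a failure.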

ML System Architecture Patterns

Monolithic architecture:

  • Single application handling all components
  • Simple deployment, challenging scaling
  • Common for small teams/applications

Microservices architecture:

  • Separate services for ingestion, training, serving
  • Independent scaling, complex operations
  • Standard for large organizations

Serverless architecture:

  • Cloud functions triggered by events
  • Auto-scaling, pay-per-use, limited control
  • Growing adoption for simple pipelines

Key trade-offs: Complexity vs flexibility, cost vs control, simplicity vs scalability

Part II : Production Deployment: Patterns, Versioning, A/B Testing, Rollback Strategies

Model Deployment Patterns

Batch prediction:

  • Precompute predictions periodically
  • Store in database, serve via lookup
  • High throughput, stale predictions

Online serving:

  • Compute predictions on-demand
  • REST API, gRPC, or embedded model
  • Fresh predictions, latency constraints

Shadow mode:

  • New model runs parallel to production
  • Predictions not used, only monitored
  • Safe validation before cutover

A/B testing:

  • Split traffic between model versions
  • Compare performance statistically
  • Gradual rollout reducing risk
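
The A/B and shadow patterns above can be sketched in a few lines. This is a minimal illustration, assuming string user IDs and models that are plain callables; `assign_variant`, `predict_with_shadow`, and the 10% default share are hypothetical names and values, not part of any serving framework.

```python
import hashlib


def assign_variant(user_id: str, treatment_share: float = 0.1) -> str:
    """Deterministically map a user to 'A' (current model) or 'B' (candidate)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # value in [0, 1)
    return "B" if bucket < treatment_share else "A"


def predict_with_shadow(features, production_model, shadow_model, shadow_log):
    """Serve the production prediction; only record the shadow prediction."""
    served = production_model(features)
    shadow_log.append((features, served, shadow_model(features)))
    return served


# Toy models standing in for real ones.
log = []
result = predict_with_shadow(3.0, lambda x: 2 * x, lambda x: 2 * x + 1, log)
print(assign_variant("user-42"), result, log)
```

Hashing the user ID instead of drawing a random number keeps each user on the same variant across requests, which the statistical comparison relies on.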

Model Versioning and Rollback

Why version models:

  • Track which model produced which predictions
  • Enable rollback if new model degrades
  • Support A/B testing and gradual rollouts
  • Meet regulatory audit requirements

Versioning strategies:

  • Semantic versioning (major.minor.patch)
  • Git commit hashes
  • Timestamps
  • Combine all three: v2.3.1-a4f8e90-20250101

Rollback scenarios:

  • Performance degradation detected
  • Production incidents or errors
  • Regulatory compliance issues
  • Business metric deterioration
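
As a sketch of the combined tag format and of what a rollback amounts to, the code below builds a tag like v2.3.1-a4f8e90-20250101 and keeps a promotion history. `ModelRegistry` is a toy in-memory class invented for illustration (real registries such as MLflow persist artifacts and metadata), and the earlier version in the usage lines is made up.

```python
from datetime import date


def make_version_tag(semver: str, commit_hash: str, build_date: date) -> str:
    """Combine semantic version, short commit hash, and build date into one tag."""
    return f"v{semver}-{commit_hash[:7]}-{build_date:%Y%m%d}"


class ModelRegistry:
    """Toy in-memory registry; the last promoted tag is the live model."""

    def __init__(self):
        self._history = []  # tags in promotion order

    def promote(self, tag: str) -> None:
        self._history.append(tag)

    def rollback(self) -> str:
        """Retire the live version and return the tag that is now serving."""
        if len(self._history) < 2:
            raise RuntimeError("no earlier version to roll back to")
        self._history.pop()
        return self._history[-1]

    @property
    def live(self) -> str:
        return self._history[-1]


registry = ModelRegistry()
registry.promote(make_version_tag("2.3.0", "9c1b2d7", date(2024, 12, 1)))  # hypothetical
registry.promote(make_version_tag("2.3.1", "a4f8e90", date(2025, 1, 1)))
print(registry.live)
```

Logging the live tag alongside every prediction is what makes the audit-trail and rollback bullets above workable in practice.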

Part III : Monitoring & Drift: Performance Tracking, Data/Concept Drift, Alerting

Model Performance Monitoring

Metrics to track:

Predictive performance:

  • Accuracy, precision, recall, F1, AUC
  • By segment (demographic, time period)
  • Compared to baseline and previous versions

Operational metrics:

  • Latency (p50, p95, p99)
  • Throughput (predictions per second)
  • Error rates
  • Resource usage (CPU, memory, GPU)
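
For the latency percentiles listed above, a minimal sketch using the standard library is shown here. It assumes latencies are collected as a plain list of milliseconds; the function name is illustrative.

```python
import statistics


def latency_percentiles(latencies_ms):
    """Return p50, p95 and p99 of a list of request latencies in milliseconds."""
    # quantiles(n=100) returns the 99 cut points p1..p99.
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


# A mostly fast service with a slow tail: the mean hides what p99 reveals.
sample = [12.0] * 95 + [250.0] * 5
print(latency_percentiles(sample), statistics.mean(sample))
```

Tail percentiles are tracked because a small fraction of slow requests can breach a latency budget while the average still looks healthy.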

Business metrics:

  • Revenue impact
  • Cost savings
  • Customer satisfaction
  • Fraud losses
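
For the data drift named in this part's title, one widely used check is the population stability index (PSI), which compares the binned distribution of a feature at training time with its distribution in production. The sketch below is one possible implementation, not a method prescribed by these notes; the decile binning and the 1e-6 floor are assumptions.

```python
import math
import random


def population_stability_index(reference, current, n_bins=10):
    """PSI between a reference sample and a current sample of one feature."""
    ref_sorted = sorted(reference)
    # Bin edges at reference quantiles, so each bin holds about 1/n_bins of it.
    edges = [ref_sorted[len(ref_sorted) * i // n_bins] for i in range(1, n_bins)]

    def bin_shares(values):
        counts = [0] * n_bins
        for v in values:
            counts[sum(v > e for e in edges)] += 1
        # A small floor avoids log(0) when a bin is empty.
        return [max(c / len(values), 1e-6) for c in counts]

    ref_p, cur_p = bin_shares(reference), bin_shares(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_p, cur_p))


rng = random.Random(0)
train = [rng.gauss(0, 1) for _ in range(5000)]
same = [rng.gauss(0, 1) for _ in range(5000)]
shifted = [rng.gauss(1, 1) for _ in range(5000)]  # mean moved by one std dev
psi_same = population_stability_index(train, same)
psi_shifted = population_stability_index(train, shifted)
print(psi_same, psi_shifted)
```

A commonly quoted rule of thumb treats PSI below 0.1 as stable and above 0.25 as a material shift; those cut-offs are convention and should be tuned per feature before wiring them to alerts.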

Part IV : Feature Engineering: Feature Stores, Temporal Features, Scale Challenges

Feature Engineering at Scale

Challenges:

Training-serving skew:

  • Features computed differently in training vs serving
  • Causes accuracy degradation in production
  • Mitigated by feature stores that serve one feature definition to both paths

Temporal leakage:

  • Using future information in features
  • Creates unrealistically high training accuracy
  • Fails in production, where that future information is not yet available
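
A small sketch of a leakage-safe temporal feature: a trailing mean that, at position t, uses only observations strictly before t. The function name and the choice to return None for warm-up positions are assumptions for illustration.

```python
def trailing_mean(values, window):
    """Mean of the `window` observations strictly before each position.

    Position t uses values[t - window : t], so the current and all later
    observations are excluded. Positions without a full window get None.
    """
    out = []
    for t in range(len(values)):
        if t < window:
            out.append(None)
        else:
            past = values[t - window:t]
            out.append(sum(past) / window)
    return out


prices = [1, 2, 3, 4, 5]
print(trailing_mean(prices, 2))  # -> [None, None, 1.5, 2.5, 3.5]
```

A quick leakage test is to change a future observation and confirm that earlier feature values do not move.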

Scale:

  • Computing features for billions of examples
  • Low-latency requirements (< 100ms)
  • Cost efficiency

Feature stores: Centralized repository for feature definitions and values
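
To illustrate how a feature store addresses both skew and leakage, here is a toy in-memory version with a point-in-time read. It is a sketch under simplifying assumptions (timestamps are sortable values such as ISO date strings, everything fits in memory); the class, the ticker, and the feature name `vol_20d` are hypothetical and unrelated to Feast's API.

```python
import bisect
from collections import defaultdict


class InMemoryFeatureStore:
    """Toy store: each feature is written once and read as of a timestamp."""

    def __init__(self):
        # (entity, feature) -> parallel, time-sorted lists
        self._times = defaultdict(list)
        self._values = defaultdict(list)

    def write(self, entity, feature, timestamp, value):
        key = (entity, feature)
        pos = bisect.bisect_right(self._times[key], timestamp)
        self._times[key].insert(pos, timestamp)
        self._values[key].insert(pos, value)

    def read(self, entity, feature, as_of):
        """Latest value with timestamp <= as_of, or None if none exists yet."""
        key = (entity, feature)
        pos = bisect.bisect_right(self._times[key], as_of)
        return self._values[key][pos - 1] if pos else None


store = InMemoryFeatureStore()
store.write("AAPL", "vol_20d", "2025-01-03", 0.25)
store.write("AAPL", "vol_20d", "2025-01-02", 0.21)
print(store.read("AAPL", "vol_20d", as_of="2025-01-02"))  # -> 0.21
```

Training reads with `as_of` set to each historical label date and serving reads with `as_of` set to now, so both paths share one definition and neither sees the future.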

Part V : Rigorous Backtesting: Multiple Testing, Overfitting Detection, Validation

The Multiple Testing Problem

Scenario: Testing 100 features for predictive power at α = 0.05

Expected false positives: about 5 features (100 × 0.05) appear significant purely by chance, even if none has real predictive power

Danger: Select these features, backtest strategy, publish results

Reality: Strategy performs no better than random (or worse)
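
The scenario above can be checked with a short simulation. This sketch tests 100 pure-noise "features" for a non-zero mean with a two-sided z-test, which is a stand-in for whatever predictive-power test is actually used; the sample size, seed, and independence between tests are assumptions.

```python
import random
from statistics import NormalDist


def count_false_positives(n_features=100, n_obs=250, alpha=0.05, seed=0):
    """Test pure-noise features for a non-zero mean; count 'significant' ones."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    hits = 0
    for _ in range(n_features):
        sample = [rng.gauss(0, 1) for _ in range(n_obs)]
        z = (sum(sample) / n_obs) * n_obs ** 0.5    # variance is 1, so z ~ N(0, 1)
        p_value = 2 * (1 - std_normal.cdf(abs(z)))  # two-sided test
        hits += p_value < alpha
    return hits


expected = 100 * 0.05              # 5 false positives on average
p_any = 1 - (1 - 0.05) ** 100      # chance of at least one, if tests are independent
print(count_false_positives(), expected, round(p_any, 4))
```

With independent tests the chance of at least one false positive is about 99.4%, so finding some "significant" feature among 100 candidates says very little on its own.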

Multiple testing corrections:

  • Bonferroni: α_adjusted = α / n (very conservative)
  • Benjamini-Hochberg: Controls false discovery rate (less conservative)
  • Combinatorial purged CV: López de Prado's cross-validation scheme (a validation method rather than a p-value correction)
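
The two p-value corrections above can be written directly from their definitions. The sketch below is a plain-Python version for illustration; the ten p-values in the usage lines are made up.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Reject H0 for p-values at or below alpha / n."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]


def benjamini_hochberg_reject(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure controlling the false discovery rate."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest rank k (1-based) whose sorted p-value is <= (k / m) * alpha.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            k_max = rank
    reject = [False] * m
    for idx in order[:k_max]:  # reject everything up to that rank
        reject[idx] = True
    return reject


# Hypothetical p-values from ten feature tests.
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(sum(p <= 0.05 for p in pvals))          # 5 pass the naive 0.05 cut-off
print(sum(bonferroni_reject(pvals)))          # 1 survives Bonferroni
print(sum(benjamini_hochberg_reject(pvals)))  # 2 survive Benjamini-Hochberg
```

The example shows the ordering claimed above: Bonferroni keeps the fewest discoveries, Benjamini-Hochberg keeps somewhat more, and the uncorrected threshold keeps the most.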

Key insight: Most published investment strategies likely overfit

Lab 10 Preview: Factor Analysis with Production Considerations

Using JKP factor data (Jensen, Kelly, Pedersen):

Exercise 1: Pipeline Implementation (30 min)

  • Data ingestion from CSV
  • Feature engineering with temporal correctness
  • Training pipeline with versioning
  • Orchestration using simple scheduler

Exercise 2: Backtesting with Multiple Testing Correction (45 min)

  • Implement factor models
  • Apply Bonferroni and FDR corrections
  • Compute combinatorial purged CV
  • Calculate probability of backtest overfitting

Exercise 3: Production Monitoring (25 min)

  • Deploy model in simulation
  • Monitor performance over time
  • Detect data drift in factors
  • Implement alerting for degradation
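
As orientation for the combinatorial purged CV step in Exercise 2, the sketch below generates the train/test index splits only. It is a simplification and not the lab's reference solution: observations are cut into contiguous time-ordered groups, every combination of test groups is used, and purging is approximated by dropping training indices within a fixed `embargo` of any test block. The full method purges based on each label's time span instead.

```python
from itertools import combinations


def combinatorial_purged_splits(n_obs, n_groups=6, n_test_groups=2, embargo=0):
    """Yield (train_idx, test_idx) for every choice of test groups.

    Training indices within `embargo` positions of a test block are dropped.
    """
    bounds = [n_obs * g // n_groups for g in range(n_groups + 1)]
    groups = [range(bounds[g], bounds[g + 1]) for g in range(n_groups)]
    for chosen in combinations(range(n_groups), n_test_groups):
        test_idx = [i for g in chosen for i in groups[g]]
        blocked = set()
        for g in chosen:
            # Test block plus a gap on each side.
            blocked.update(range(groups[g].start - embargo, groups[g].stop + embargo))
        train_idx = [i for i in range(n_obs) if i not in blocked]
        yield train_idx, test_idx


splits = list(combinatorial_purged_splits(60, n_groups=6, n_test_groups=2, embargo=2))
print(len(splits))  # C(6, 2) = 15 splits
```

Because every group appears in several test sets, the splits can be recombined into multiple backtest paths rather than the single path a walk-forward test gives.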

Statistical Foundation: The Multiple Testing Problem

Problem: Testing 50 features at α=0.05 → expect 2.5 false discoveries by chance!

Connection to production ML:

  • Pipeline tests 100+ models, 1,000+ features → many “significant” results by luck
  • Backtest 50 trading strategies → some appear profitable by chance (not skill)

Solutions from Week 1, §0.5.2:

  1. Bonferroni correction: Divide α by number of tests (α = 0.05/50 = 0.001)
    • Conservative: reduces false positives, increases false negatives
  2. False Discovery Rate (FDR, Benjamini-Hochberg): Control % of discoveries that are false
    • More powerful than Bonferroni for exploratory analysis
  3. Out-of-sample validation: Test on held-out data (best approach!)
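
For the out-of-sample option, time-ordered data should be split chronologically rather than shuffled. A minimal sketch follows, assuming the rows are already sorted by time; the 60/20/20 proportions are an arbitrary illustrative choice.

```python
def chronological_split(rows, train_frac=0.6, val_frac=0.2):
    """Split time-ordered rows into train / validation / test without shuffling."""
    n = len(rows)
    train_end = round(n * train_frac)
    val_end = train_end + round(n * val_frac)
    return rows[:train_end], rows[train_end:val_end], rows[val_end:]


train, val, test = chronological_split(list(range(10)))
print(train, val, test)  # -> [0, 1, 2, 3, 4, 5] [6, 7] [8, 9]
```

Model and feature choices are tuned on the validation slice; the test slice is evaluated once, at the end.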

Gelman’s critique: p < 0.05 is weak evidence when testing many hypotheses!

Production ML multiple testing:

  • Feature selection: Testing 1,000 features → expect 50 “significant” by chance
  • Hyperparameter tuning: Testing 100 configs → some look good by overfitting
  • Model selection: Testing 20 algorithms → winner likely overfit

Solution: Honest out-of-sample validation on completely held-out test set (no peeking!)
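
The value of a held-out set can be seen by selecting the best of many zero-skill strategies and re-checking it on fresh data. The simulation below is a sketch: daily returns are drawn as independent Gaussian noise with zero mean, which is an assumption, and the function name is illustrative.

```python
import random


def best_of_many(n_strategies=100, n_days=250, seed=1):
    """Pick the best of many zero-skill strategies in-sample, then re-check it."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_strategies):
        in_sample = [rng.gauss(0, 0.01) for _ in range(n_days)]
        out_sample = [rng.gauss(0, 0.01) for _ in range(n_days)]
        results.append((sum(in_sample) / n_days, sum(out_sample) / n_days))
    best_in, best_out = max(results)  # tuples compare by in-sample mean first
    return best_in, best_out


best_in, best_out = best_of_many()
print(f"in-sample mean: {best_in:.5f}, held-out mean: {best_out:.5f}")
```

The winner's in-sample mean return is inflated by the selection step itself, while its held-out mean is an unbiased draw around zero; only the held-out number reflects what production would deliver.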

Summary and Key Takeaways

1. Production ML is an engineering discipline: pipeline architecture, deployment patterns, monitoring, and maintenance

2. Data drift and concept drift silently degrade models: continuous monitoring and retraining are essential

3. Feature engineering drives success: training-serving consistency, temporal correctness, scale challenges

4. Overfitting is pervasive in financial ML: multiple testing corrections and rigorous validation are required

5. Regulatory compliance shapes architecture: model risk management, explainability, auditability

6. MLOps practices enable reliable systems: versioning, testing, monitoring, incident response

References

Core readings:

  • Bailey, D. H., & López de Prado, M. (2014). “The Deflated Sharpe Ratio: Correcting for Selection Bias, Backtest Overfitting and Non-Normality,” Journal of Portfolio Management
  • López de Prado, M. (2018). Advances in Financial Machine Learning. Wiley (especially Ch. 7 on cross-validation and Ch. 11-12 on backtesting)
  • Sculley, D., et al. (2015). “Hidden Technical Debt in Machine Learning Systems,” NIPS
  • Breck, E., et al. (2019). “Data Validation for Machine Learning,” MLSys
  • Polyzotis, N., et al. (2018). “Data Lifecycle Challenges in Production Machine Learning,” SIGMOD

Additional resources:

  • MLflow: mlflow.org (model tracking and registry)
  • Feast: feast.dev (feature store)
  • Great Expectations: greatexpectations.io (data validation)
  • TensorFlow Extended: tensorflow.org/tfx (production ML pipelines)
  • Evidently AI: evidentlyai.com (drift detection)