Chapter 10 Appendix: Production ML Pipelines
From Research Prototypes to Reliable Systems (FIN510/FIN720 Reference Material)
This appendix preserves the production-ML pipeline material that previously formed the main Chapter 10. It is retained as reference reading for FIN510 and FIN720 students, and for FIN306 students who want to see how the statistical validation ideas in the new Chapter 10 (“Backtesting & Validation”) are wired into live systems. It sits alongside, not in place of, the main chapter.
For the core FIN306 Level-5 learning outcomes (selection bias, CSCV/PBO, PSR/DSR, honest reporting), read the main chapter on Backtesting and Validation. The material below is a forward reference: once you can produce a credible backtest, these are the engineering and governance layers that turn it into a deployable model.
1 Learning Objectives
After completing this chapter, you will be able to:
- Design end-to-end machine learning pipelines for financial applications
- Implement temporally correct feature engineering preventing look-ahead bias
- Apply multiple testing corrections to avoid false discoveries
- Execute combinatorial purged cross-validation for rigorous backtesting
- Monitor production models and detect data drift
- Navigate regulatory requirements for ML in finance (model risk management)
- Evaluate the gap between research findings and production viability
2 Introduction: The Seventy Percent Failure Rate
Gartner estimates that 70% of machine learning projects fail to progress beyond proof-of-concept to production deployment. This statistic reflects not technical impossibility but the substantial gap between building models in notebooks and operating reliable systems serving predictions continuously. Academic machine learning optimises accuracy on static test sets; production machine learning maintains systems adapting to evolving data whilst satisfying latency requirements, providing explainability for regulatory compliance, and recovering from failures without human intervention (Sculley et al. 2015).
Financial machine learning faces particularly acute challenges. Market data is non-stationary: distributions shift as economic regimes change, correlations vary with market stress, and volatility clusters (Cont 2001). Models trained on pre-2020 data often failed spectacularly during the 2020 pandemic or the 2022 interest rate regime shift. Overfitting is rampant: researchers test thousands of features and model configurations, selecting winners based on backtest performance. As Bailey and Prado (2014) argue, many published investment strategies may exhibit a probability of backtest overfitting exceeding 50%, suggesting that impressive historical performance often reflects data mining rather than genuine predictive power.
Moreover, financial ML systems operate under regulatory scrutiny absent from most consumer applications. The Federal Reserve’s SR 11-7 guidance on Model Risk Management establishes expectations for model development, validation, governance, and documentation (Board of Governors of the Federal Reserve System 2011). The European Union’s GDPR restricts solely automated decisions that significantly affect individuals and requires that those individuals receive meaningful information about the logic involved. Fair lending regulations require demonstrating that credit models don’t discriminate against protected classes. These requirements add substantial engineering and compliance overhead beyond technical model development (Brundage et al. 2020).
Moving machine learning models to production requires addressing concerns that academic research typically ignores:
Technical Requirements:
- Data pipeline reliability (handling missing data, API failures, schema changes)
- Serving latency (fraud detection < 100ms; algorithmic trading < 1ms)
- Scalability (processing millions of predictions daily without degradation)
- Monitoring (detecting data drift, concept drift, and performance degradation)
- Versioning (tracking model versions, enabling rollback on failures)
Regulatory Requirements:
- Documentation (data provenance, model development decisions, validation results)
- Explainability (providing reasons for individual predictions to regulators and customers)
- Auditability (maintaining reproducible records of model behaviour over time)
- Bias testing (demonstrating fair treatment across demographic groups)
- Governance (approval processes, change management, risk assessment)
Operational Requirements:
- Incident response (detecting and recovering from failures rapidly)
- Continuous validation (comparing production performance to expectations)
- Retraining (adapting to new data whilst avoiding catastrophic forgetting)
- A/B testing (safely evaluating model changes before full deployment)
- Stakeholder communication (reporting performance to business and compliance teams)
This chapter addresses these requirements systematically, providing both technical implementations and conceptual frameworks for production ML in financial contexts.
Our exploration proceeds through the machine learning lifecycle from data ingestion through continuous monitoring. We begin with pipeline architecture: understanding components and their interactions enables designing maintainable, scalable systems. We then address feature engineering with temporal correctness, ensuring features use only information available at prediction time. The critical challenge of overfitting receives extended treatment: we implement multiple testing corrections and combinatorial purged cross-validation to evaluate model performance rigorously. Model monitoring and drift detection help maintain reliability as data distributions evolve. Finally, we examine regulatory requirements and documentation practices that financial ML systems must satisfy.
Throughout, we maintain realistic perspective. Perfect systems are impossible: data will have errors, models will sometimes fail, and distributions will shift unexpectedly. The goal isn’t eliminating all failure modes but designing systems that degrade gracefully, detect problems quickly, recover automatically when possible, and escalate to humans when necessary. This engineering discipline, combined with statistical rigour in validation, separates research prototypes from production systems managing billions in assets.
3 Pipeline Architecture: Components and Orchestration
Machine learning systems comprise multiple interconnected components processing data from sources through transformations to predictions. Understanding this architecture enables designing systems that are maintainable (engineers can modify components without breaking dependencies), scalable (components can be upgraded for higher throughput without redesigning the entire system), and monitorable (failures in specific components can be detected and diagnosed). Financial ML pipelines have particular requirements around data consistency, audit trails, and error handling that influence architectural choices (Paleyes, Urma, and Lawrence 2022).
3.1 Core Pipeline Components
A typical ML pipeline consists of several stages, each with specific responsibilities. Data ingestion pulls raw data from sources: market data APIs, transaction databases, customer information systems, external datasets. Ingestion must handle rate limits, schema variations, missing data, and failures where sources are temporarily unavailable. Financial systems might ingest millions of transactions hourly, tick-by-tick market data for thousands of securities, and batch updates from external providers on varying schedules.
Data storage provides durable, reliable access to both raw data and processed features. Modern architectures often use data lakes (object storage like S3 or Azure Data Lake Storage) for raw data given low cost per gigabyte, and data warehouses (Snowflake, BigQuery, Redshift) for structured queryable data supporting analytics. Feature stores (specialised databases caching computed features) enable consistency between training (using historical features) and serving (computing features for new predictions in real time).
Feature engineering transforms raw data into model inputs through aggregations (30-day transaction average), encoding (one-hot encoding categorical variables), domain-specific computations (technical indicators from price series), and temporal features (day of week, time since last transaction). This code must execute identically in training and serving environments: inconsistencies create training-serving skew, where the model receives different feature distributions in production than during development, degrading accuracy (Polyzotis et al. 2017).
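One common way to reduce training-serving skew is to route both paths through a single feature function. The sketch below is illustrative only: `transaction_features`, the column names, and the sample data are hypothetical and do not correspond to any particular feature-store API.

```python
import pandas as pd


def transaction_features(history: pd.DataFrame) -> dict:
    """Compute features from one customer's past transactions.

    Hypothetical helper: both the training job and the serving path call
    this same function, so the feature definitions cannot diverge.
    """
    amounts = history["amount"]
    return {
        "txn_count": int(len(amounts)),
        "avg_amount": float(amounts.mean()),
        "max_amount": float(amounts.max()),
    }


# Illustrative data (made up for this sketch).
history = pd.DataFrame(
    {
        "customer_id": ["a", "a", "b", "b", "b"],
        "amount": [10.0, 30.0, 5.0, 15.0, 40.0],
    }
)

# Training path: build one feature row per customer in batch.
training_rows = {
    cid: transaction_features(group)
    for cid, group in history.groupby("customer_id")
}

# Serving path: compute features for a single customer at request time.
serving_row = transaction_features(history[history["customer_id"] == "b"])

# Same code path, same result: no skew between training and serving.
assert serving_row == training_rows["b"]
print(serving_row)
```

In a real system the shared function would typically live in a versioned package imported by both the training job and the serving service, so that a feature change is deployed to both at once.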
Model training executes periodically (daily for fraud models, weekly for credit scoring, monthly for portfolio allocation) or when triggered by performance degradation. Distributed training frameworks (Ray, Dask, Spark MLlib) enable processing datasets exceeding single-machine memory. Hyperparameter optimisation tools (Optuna, Hyperopt) automate configuration search. Importantly, training must use only data available at that historical point: including future data creates look-ahead bias inflating apparent performance.
Model serving provides predictions via APIs (REST or gRPC), batch processing (generating predictions for all customers overnight), or streaming (processing transactions as they arrive). Latency requirements vary dramatically: fraud detection might require sub-100ms response including feature computation; portfolio rebalancing can tolerate minutes. The serving architecture must handle load spikes, provide graceful degradation when overloaded, and version models to enable A/B testing.
Orchestration coordinates component execution through workflow management tools like Apache Airflow, Prefect, or Dagster. These define pipelines as directed acyclic graphs where nodes represent tasks and edges represent dependencies. If data ingestion fails, downstream feature engineering doesn’t execute. If model training completes successfully, the new model version deploys automatically after validation. Orchestration handles retries, provides observability into execution status, and enables debugging failures.
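The dependency behaviour described above can be sketched without any orchestration framework. The example below uses only the standard library's `graphlib`; `run_pipeline` and the task names are hypothetical, and real tools such as Airflow, Prefect, or Dagster add scheduling, retries, and observability on top of the same idea.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, Set


def run_pipeline(
    tasks: Dict[str, Callable[[], None]],
    dependencies: Dict[str, Set[str]],
) -> Dict[str, str]:
    """Run tasks in dependency order; skip any task whose upstream did not succeed."""
    status: Dict[str, str] = {}
    # static_order() yields every node after all of its predecessors.
    for name in TopologicalSorter(dependencies).static_order():
        upstream = dependencies.get(name, set())
        if any(status[u] != "success" for u in upstream):
            status[name] = "skipped"
            continue
        try:
            tasks[name]()
            status[name] = "success"
        except Exception:
            status[name] = "failed"
    return status


def ingest() -> None:
    # Simulated outage of an upstream data source.
    raise ConnectionError("market data API unavailable")


def noop() -> None:
    pass


tasks = {"ingest": ingest, "features": noop, "train": noop, "report": noop}
# Each task maps to the set of tasks it depends on.
dependencies = {
    "ingest": set(),
    "features": {"ingest"},
    "train": {"features"},
    "report": set(),
}

status = run_pipeline(tasks, dependencies)
print(status)
```

Because `features` and `train` sit downstream of the failed `ingest` task, they are skipped, while the independent `report` task still runs.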
Here’s a simplified pipeline implementation demonstrating key patterns:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
from typing import Dict, List, Tuple
import joblib
from pathlib import Path


class TemporalFeatureEngineer:
    """
    Feature engineering with strict temporal correctness.

    Ensures features use only information available at prediction time,
    preventing look-ahead bias that inflates backtest performance.

    Parameters
    ----------
    lookback_days : int
        Historical window for feature computation (e.g., 30 days)
    """

    def __init__(self, lookback_days: int = 30):
        self.lookback_days = lookback_days
        self._feature_registry = {}

    def register_feature(self, name: str, function):
        """Register feature computation function."""
        self._feature_registry[name] = function

    def compute_features(self, df: pd.DataFrame,
                         as_of_date: datetime) -> pd.DataFrame:
        """
        Compute features using only data available as of as_of_date.

        Critical for preventing look-ahead bias in backtesting.

        Parameters
        ----------
        df : pd.DataFrame
            Historical data with 'date' column
        as_of_date : datetime
            Point-in-time for feature computation

        Returns
        -------
        features : pd.DataFrame
            Computed features for as_of_date
        """
        # Filter to data available as of date (strict inequality)
        available_data = df[df['date'] < as_of_date].copy()

        # Restrict to lookback window
        cutoff_date = as_of_date - timedelta(days=self.lookback_days)
        window_data = available_data[available_data['date'] >= cutoff_date]

        if len(window_data) == 0:
            raise ValueError(f"Insufficient data for {as_of_date}")

        # Compute registered features
        features = {}
        for name, func in self._feature_registry.items():
            features[name] = func(window_data)

        result = pd.DataFrame([features])
        result['as_of_date'] = as_of_date
        return result

class ModelVersion:
    """
    Versioned model with metadata for tracking and rollback.

    Tracks model artifacts, performance metrics, and deployment status.
    """

    def __init__(self, version_id: str, model, metrics: Dict[str, float]):
        self.version_id = version_id
        self.model = model
        self.metrics = metrics
        self.created_at = datetime.now()
        self.deployed_at = None
        self.is_active = False

    def save(self, path: Path):
        """Persist model version to disk."""
        model_path = path / f"model_{self.version_id}.pkl"
        metadata_path = path / f"metadata_{self.version_id}.json"

        # Save model artifact
        joblib.dump(self.model, model_path)

        # Save metadata
        metadata = {
            'version_id': self.version_id,
            'metrics': self.metrics,
            'created_at': self.created_at.isoformat(),
            'deployed_at': self.deployed_at.isoformat() if self.deployed_at else None,
            'is_active': self.is_active
        }
        import json
        with open(metadata_path, 'w') as f:
            json.dump(metadata, f, indent=2)

    @classmethod
    def load(cls, path: Path, version_id: str):
        """Load model version from disk."""
        model_path = path / f"model_{version_id}.pkl"
        metadata_path = path / f"metadata_{version_id}.json"

        model = joblib.load(model_path)

        import json
        with open(metadata_path, 'r') as f:
            metadata = json.load(f)

        version = cls(version_id, model, metadata['metrics'])
        version.created_at = datetime.fromisoformat(metadata['created_at'])
        version.deployed_at = datetime.fromisoformat(metadata['deployed_at']) if metadata['deployed_at'] else None
        version.is_active = metadata['is_active']
        return version

class DriftDetector:
    """
    Detect data drift using statistical tests.

    Monitors feature distributions in production, alerting when they
    diverge significantly from training distribution.
    """

    def __init__(self, reference_data: pd.DataFrame, alpha: float = 0.05):
        """
        Parameters
        ----------
        reference_data : pd.DataFrame
            Training data distribution to compare against
        alpha : float
            Significance level for drift detection (default 0.05)
        """
        self.reference_data = reference_data
        self.alpha = alpha
        self.reference_stats = self._compute_stats(reference_data)

    def _compute_stats(self, df: pd.DataFrame) -> Dict:
        """Compute distributional statistics."""
        return {
            col: {
                'mean': df[col].mean(),
                'std': df[col].std(),
                'quantiles': df[col].quantile([0.25, 0.5, 0.75]).to_dict()
            }
            for col in df.select_dtypes(include=[np.number]).columns
        }

    def detect_drift(self, production_data: pd.DataFrame) -> Dict[str, bool]:
        """
        Detect drift using Kolmogorov-Smirnov test.

        Parameters
        ----------
        production_data : pd.DataFrame
            Recent production data to test for drift

        Returns
        -------
        drift_detected : Dict[str, bool]
            Mapping from feature name to whether drift detected
        """
        from scipy.stats import ks_2samp

        drift_results = {}
        for col in self.reference_data.select_dtypes(include=[np.number]).columns:
            if col not in production_data.columns:
                continue

            # Kolmogorov-Smirnov test comparing distributions
            stat, pvalue = ks_2samp(
                self.reference_data[col].dropna(),
                production_data[col].dropna()
            )

            # Drift detected if p-value < alpha
            drift_results[col] = pvalue < self.alpha

        return drift_results

This code demonstrates production patterns: temporal correctness in feature engineering (only using past data), model versioning enabling rollback, and drift detection monitoring distribution changes. Real implementations add complexity handling edge cases, but the core principles remain.
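To see why the strict point-in-time filter matters, the following sketch compares a rolling-mean feature that accidentally includes the current observation with one that is shifted to use past data only. The data are simulated i.i.d. noise, so any apparent predictive relationship is purely an artefact of leakage; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Simulated daily returns with no predictable structure at all.
returns = pd.Series(rng.normal(0.0, 0.01, 500))

# Leaky feature: the 5-day mean at day t includes the return on day t itself.
leaky = returns.rolling(5).mean()

# Point-in-time feature: shifting by one day means day t only sees days t-5..t-1.
safe = returns.rolling(5).mean().shift(1)

# Correlation of each feature with the return it is supposed to "predict".
corr_leaky = leaky.corr(returns)
corr_safe = safe.corr(returns)

print(f"leaky feature correlation: {corr_leaky:.2f}")
print(f"safe feature correlation:  {corr_safe:.2f}")
```

The leaky feature shows a substantial correlation with the target even though the series is pure noise, while the shifted feature shows roughly none. A backtest built on the leaky version would report skill that cannot exist in live trading.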
4 Combating Overfitting: Multiple Testing and Rigorous Validation
The replication crisis in quantitative finance stems largely from overfitting: researchers test many hypotheses, publish significant results, and fail to validate on truly independent data. Harvey, Liu, and Zhu (2016) analysed 315 published factors claiming predictive power for stock returns, finding that adjusting for multiple testing reduces significance dramatically. Bailey and Prado (2014) introduced the Probability of Backtest Overfitting (PBO) metric quantifying overfitting risk, showing that many published strategies likely exhibit spurious performance.
4.1 The Multiple Testing Problem
This is exactly the problem Gelman warned about in Week 1: When testing many hypotheses, some will appear significant by chance.
We’ve seen this repeatedly:
- Ch 05: Testing alternative data features for credit scoring
- Ch 07: Liu, Tsyvinski, and Wu (2022) testing 24 cryptocurrency factors
- Ch 10 (here): Testing 100+ trading strategies
Same problem, increasing severity in each application. Financial ML is particularly vulnerable because:
- Researchers test thousands of features/models
- Only “successful” strategies get published (publication bias)
- In-sample backtest performance is overoptimistic
Correction methods (covered below): Bonferroni, FDR, CPCV, out-of-sample holdout.
When testing \(m\) independent hypotheses at significance level \(\alpha\), the probability of at least one false positive is \(1 - (1 - \alpha)^m\). Testing 20 factors at \(\alpha = 0.05\) yields approximately a 64% chance of finding at least one spuriously significant result. Financial researchers routinely test hundreds of features and dozens of model configurations, creating an enormous multiple testing burden (Harvey, Liu, and Zhu 2016).
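The family-wise error rate formula is easy to check numerically. The sketch below evaluates \(1 - (1 - \alpha)^m\) for a few values of \(m\) and then confirms the \(m = 20\) case by simulation, assuming independent tests whose p-values are uniform under the null. The function name is illustrative.

```python
import numpy as np


def family_wise_error_rate(m: int, alpha: float = 0.05) -> float:
    """Probability of at least one false positive across m independent tests."""
    return 1.0 - (1.0 - alpha) ** m


for m in (1, 5, 20, 100):
    print(f"m = {m:>3}: FWER = {family_wise_error_rate(m):.3f}")

# Monte Carlo check: under the null, p-values are uniform on [0, 1].
rng = np.random.default_rng(42)
n_trials, m, alpha = 20_000, 20, 0.05
pvalues = rng.uniform(size=(n_trials, m))

# Fraction of simulated "research projects" with at least one p < alpha.
simulated = float(np.mean((pvalues < alpha).any(axis=1)))
print(f"simulated FWER for m = 20: {simulated:.3f}")
```

Real strategy tests are rarely independent (many candidate signals share the same underlying data), so the formula should be read as a guide to the scale of the problem rather than an exact rate.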
The Bonferroni correction addresses this by dividing the significance threshold by the number of tests: reject a null hypothesis only if \(p < \alpha / m\). This controls the family-wise error rate but is conservative: statistical power decreases substantially with many tests. The False Discovery Rate (FDR) approach of Benjamini and Hochberg (1995) provides more power whilst controlling the expected proportion of false discoveries among rejected hypotheses. For \(m\) tests with ordered p-values \(p_1 \le p_2 \le \dots \le p_m\), the procedure rejects hypotheses \(1, \dots, k\), where \(k\) is the largest index satisfying \(p_k \le \frac{k}{m}\alpha\).
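A small worked example makes the difference between the two corrections concrete. The five p-values below are made up for illustration; they are chosen so that the second p-value fails its own Benjamini-Hochberg threshold but is still rejected, because a larger p-value further down the list passes (the "largest \(k\)" rule).

```python
import numpy as np

alpha = 0.05
# Illustrative p-values, already sorted in ascending order.
pvalues = np.array([0.001, 0.025, 0.028, 0.035, 0.5])
m = len(pvalues)

# Bonferroni: one common threshold alpha / m = 0.01 for every test.
bonferroni_reject = pvalues < alpha / m

# Benjamini-Hochberg: threshold for the k-th smallest p-value is (k / m) * alpha.
ranks = np.arange(1, m + 1)
bh_thresholds = ranks / m * alpha
passes = pvalues <= bh_thresholds

# k is the LARGEST rank whose p-value passes; every hypothesis up to k is rejected.
k = int(np.max(np.nonzero(passes)[0])) + 1 if passes.any() else 0
bh_reject = ranks <= k

for p, t, ok, rej in zip(pvalues, bh_thresholds, passes, bh_reject):
    print(f"p = {p:.3f}  threshold = {t:.3f}  passes own threshold: {ok}  rejected: {rej}")
print("Bonferroni rejections:", int(bonferroni_reject.sum()))
print("BH rejections:", int(bh_reject.sum()))
```

Bonferroni keeps only the smallest p-value, whereas the step-up rule rejects the first four hypotheses.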
import numpy as np
from scipy import stats


def bonferroni_correction(pvalues: np.ndarray, alpha: float = 0.05) -> np.ndarray:
    """
    Apply Bonferroni correction for multiple testing.

    Parameters
    ----------
    pvalues : np.ndarray
        Array of p-values from hypothesis tests
    alpha : float
        Desired family-wise error rate

    Returns
    -------
    significant : np.ndarray
        Boolean array indicating significant tests after correction
    """
    m = len(pvalues)
    bonferroni_threshold = alpha / m
    return pvalues < bonferroni_threshold


def benjamini_hochberg_fdr(pvalues: np.ndarray, alpha: float = 0.05) -> np.ndarray:
    """
    Apply Benjamini-Hochberg FDR correction.

    More powerful than Bonferroni whilst controlling false discovery rate.

    Parameters
    ----------
    pvalues : np.ndarray
        Array of p-values from hypothesis tests
    alpha : float
        Desired false discovery rate

    Returns
    -------
    significant : np.ndarray
        Boolean array indicating significant tests after FDR control
    """
    m = len(pvalues)

    # Sort p-values while tracking original indices
    sorted_indices = np.argsort(pvalues)
    sorted_pvalues = pvalues[sorted_indices]

    # Find largest k where p_k <= (k/m) * alpha
    thresholds = (np.arange(1, m + 1) / m) * alpha
    significant_sorted = sorted_pvalues <= thresholds

    if not np.any(significant_sorted):
        return np.zeros(m, dtype=bool)

    k = np.max(np.where(significant_sorted)[0])

    # Unsort to match original order
    result = np.zeros(m, dtype=bool)
    result[sorted_indices[:k+1]] = True
    return result

4.2 Combinatorial Purged Cross-Validation
Standard cross-validation assumes independent observations, but financial data exhibits temporal dependencies and overlapping labels (today’s return depends on last week’s information). Prado (2018) introduced Combinatorial Purged Cross-Validation (CPCV), which addresses these issues through: 1) purging: removing training observations whose labels overlap in time with the test set, to prevent leakage; 2) embargo: additionally excluding training observations that immediately follow the test period, because serial correlation can carry test-set information forward; 3) combinatorial testing: evaluating many train-test splits rather than a single path, to assess strategy stability.
CPCV divides the data into \(k\) folds chronologically. For each combination of folds held out as the test set (the symmetric variant used for PBO takes \(k/2\) folds for training and the remaining \(k/2\) for testing), it purges training observations too close temporally to the test set, applies the embargo, trains the model, and evaluates performance. The distribution of out-of-sample performances quantifies strategy robustness: a narrow distribution around a positive value is consistent with genuine predictive power, whereas a wide distribution suggests overfitting.
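The splitting logic can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the reference implementation: it assumes every label spans a fixed `label_horizon` of observations (the book's version works from per-observation label end times), and the function and parameter names are hypothetical.

```python
from itertools import combinations
from typing import Iterator, Tuple

import numpy as np


def cpcv_splits(
    n_obs: int,
    n_groups: int = 6,
    n_test_groups: int = 2,
    label_horizon: int = 5,
    embargo: int = 5,
) -> Iterator[Tuple[np.ndarray, np.ndarray]]:
    """Yield (train_idx, test_idx) for every combination of test groups."""
    # Contiguous chronological groups of (almost) equal size.
    groups = np.array_split(np.arange(n_obs), n_groups)
    for test_ids in combinations(range(n_groups), n_test_groups):
        is_test = np.zeros(n_obs, dtype=bool)
        keep_train = np.ones(n_obs, dtype=bool)
        for g in test_ids:
            start, end = int(groups[g][0]), int(groups[g][-1])
            is_test[start : end + 1] = True
            # Purge: training labels that start within label_horizon
            # observations before the test block overlap it in time.
            keep_train[max(0, start - label_horizon) : start] = False
            # Embargo: drop training observations right after the test block.
            keep_train[end + 1 : end + 1 + embargo] = False
        keep_train &= ~is_test
        yield np.flatnonzero(keep_train), np.flatnonzero(is_test)


splits = list(cpcv_splits(n_obs=600))
print("number of train/test combinations:", len(splits))
train_idx, test_idx = splits[0]
print("first split: test covers", test_idx[0], "to", test_idx[-1],
      "and training starts at", train_idx[0])
```

With six groups and two test groups there are 15 combinations. In the first split the test set covers the first two groups, and the training set begins only after the embargo window that follows them.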
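The Probability of Backtest Overfitting extends this by comparing in-sample versus out-of-sample performance across all combinations. For each combination \(s = 1, \dots, S\), select the strategy with the highest in-sample Sharpe ratio \(R^{IS}_s\) and record that strategy’s out-of-sample Sharpe ratio \(R^{OS}_s\); let \(\text{median}(R^{OS})\) denote the median out-of-sample Sharpe ratio across all candidate strategies on the same combination. The PBO is:
\[ PBO = P\left[R^{OS}_s < \text{median}(R^{OS})\right] = \frac{1}{S} \sum_{s=1}^{S} \mathbf{1}\left[R^{OS}_s < \text{median}(R^{OS})\right] \]
Values near 0 indicate that in-sample winners tend to stay in the top half out of sample; values near 50% indicate that in-sample selection is no better than picking a strategy at random; values near 100% indicate severe overfitting, where the in-sample winner usually underperforms even the median strategy out of sample.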
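A minimal sketch of the PBO calculation, assuming a matrix of per-period returns with one column per candidate strategy and a symmetric split of blocks into in-sample and out-of-sample halves. The function names, block count, and simulated data are illustrative; the published procedure reports the full distribution of relative ranks rather than only this frequency.

```python
from itertools import combinations

import numpy as np


def sharpe(returns: np.ndarray) -> np.ndarray:
    """Per-strategy (per-column) Sharpe ratio, not annualised."""
    return returns.mean(axis=0) / returns.std(axis=0, ddof=1)


def probability_of_backtest_overfitting(returns: np.ndarray, n_blocks: int = 8) -> float:
    """Share of splits where the in-sample winner is below the OOS median."""
    blocks = np.array_split(np.arange(returns.shape[0]), n_blocks)
    below_median = []
    for is_ids in combinations(range(n_blocks), n_blocks // 2):
        oos_ids = [b for b in range(n_blocks) if b not in is_ids]
        is_rows = np.concatenate([blocks[b] for b in is_ids])
        oos_rows = np.concatenate([blocks[b] for b in oos_ids])
        # Strategy that looks best in-sample on this combination.
        best = int(np.argmax(sharpe(returns[is_rows])))
        oos_sharpe = sharpe(returns[oos_rows])
        # Does the in-sample winner fall below the median out of sample?
        below_median.append(bool(oos_sharpe[best] < np.median(oos_sharpe)))
    return float(np.mean(below_median))


rng = np.random.default_rng(7)
# 50 strategies that are pure noise (simulated).
noise = rng.normal(0.0, 0.01, size=(1000, 50))
# Same strategies, but strategy 0 is given a large genuine edge.
skilled = noise.copy()
skilled[:, 0] += 0.004

pbo_noise = probability_of_backtest_overfitting(noise)
pbo_skilled = probability_of_backtest_overfitting(skilled)
print(f"PBO, noise only:        {pbo_noise:.2f}")
print(f"PBO, one real strategy: {pbo_skilled:.2f}")
```

When one strategy has a real edge, it wins in-sample and stays above the median out of sample, so the estimate is close to zero. With noise only, the in-sample winner carries no information and the estimate is expected to sit around one half, though a single simulated sample can deviate noticeably.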
5 Model Monitoring and Drift Detection
Production models degrade over time as data distributions shift: market regimes change, consumer behaviour evolves, fraud tactics adapt. Without monitoring, degradation proceeds silently until catastrophic failures occur. Effective monitoring tracks multiple indicators: model performance (accuracy, precision, recall, calibration), data drift (input distribution changes), concept drift (relationship between inputs and outputs changes), and system health (latency, error rates, resource usage) (Breck et al. 2017).
5.1 Types of Drift
Data drift occurs when input feature distributions change whilst the underlying relationship between features and target remains stable. For example, average transaction amounts might increase with inflation, or customer demographics might shift as product appeal changes. While predictions remain valid given the features, monitoring distribution changes helps anticipate when retraining becomes necessary.
Statistical tests detect data drift by comparing recent production data to the training distribution. The Kolmogorov-Smirnov test measures the maximum difference between cumulative distribution functions: large differences suggest drift. The Population Stability Index (PSI) quantifies distribution changes for categorical and discretised features: \(\text{PSI} = \sum_i (p_i - q_i) \ln(p_i / q_i)\), where \(p_i\) and \(q_i\) are the proportions of bin \(i\) in the reference and production populations. As a common rule of thumb, values above 0.1 indicate drift worth investigating and values above 0.25 indicate significant drift.
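The PSI formula translates directly into code. The sketch below discretises a continuous feature into bins defined by reference quantiles; the bin count, the small `eps` floor that avoids taking the logarithm of zero, and the function name are assumptions made for illustration.

```python
import numpy as np


def population_stability_index(
    reference: np.ndarray,
    production: np.ndarray,
    n_bins: int = 10,
    eps: float = 1e-6,
) -> float:
    """PSI = sum_i (p_i - q_i) * ln(p_i / q_i) over quantile bins."""
    # Interior bin edges taken from the reference distribution's quantiles.
    interior = np.quantile(reference, np.linspace(0, 1, n_bins + 1)[1:-1])
    ref_bins = np.searchsorted(interior, reference, side="right")
    prod_bins = np.searchsorted(interior, production, side="right")
    # p_i: reference proportions, q_i: production proportions.
    p = np.bincount(ref_bins, minlength=n_bins) / len(reference)
    q = np.bincount(prod_bins, minlength=n_bins) / len(production)
    # Floor the proportions so empty bins do not produce log(0).
    p = np.clip(p, eps, None)
    q = np.clip(q, eps, None)
    return float(np.sum((p - q) * np.log(p / q)))


rng = np.random.default_rng(1)
reference = rng.normal(0.0, 1.0, 10_000)
same = rng.normal(0.0, 1.0, 10_000)      # no drift
shifted = rng.normal(0.5, 1.0, 10_000)   # mean shifted by half a standard deviation

psi_same = population_stability_index(reference, same)
psi_shifted = population_stability_index(reference, shifted)
print(f"PSI without drift: {psi_same:.3f}")
print(f"PSI with shift:    {psi_shifted:.3f}")
```

Using reference quantiles for the bins gives each bin roughly equal weight in the baseline, which keeps any single sparse bin from dominating the index.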
Concept drift occurs when the relationship between features and target changes. Credit scoring models experience this when economic conditions shift: historical relationships between income, credit history, and default probability vary between boom and recession. Detecting concept drift requires monitoring performance metrics directly since input distributions might remain stable whilst predictions become miscalibrated.
Label drift affects supervised learning when target distribution changes. Fraud detection encounters this as fraud prevalence varies: holiday seasons see different fraud patterns than normal periods. Classification models experience accuracy changes due solely to class imbalance shifts, requiring recalibration even when underlying decision boundaries remain appropriate.
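When only the class prevalence has changed, predicted probabilities can be recalibrated with a standard prior-shift adjustment rather than full retraining. The sketch below assumes the class-conditional feature distributions are unchanged and that the new prevalence is known or has been estimated separately; the function name and the rates are illustrative.

```python
import numpy as np


def adjust_for_prior_shift(p, train_rate: float, prod_rate: float) -> np.ndarray:
    """Rescale predicted positive-class probabilities to a new base rate."""
    p = np.asarray(p, dtype=float)
    # Reweight each class by the ratio of its new prior to its training prior.
    pos = p * (prod_rate / train_rate)
    neg = (1.0 - p) * ((1.0 - prod_rate) / (1.0 - train_rate))
    # Renormalise so the two class probabilities sum to one.
    return pos / (pos + neg)


# Scores from a fraud model trained when 1% of transactions were fraudulent.
scores = np.array([0.01, 0.10, 0.50])
# Suppose fraud prevalence rises to 3% during a holiday period (made-up figure).
adjusted = adjust_for_prior_shift(scores, train_rate=0.01, prod_rate=0.03)
print(np.round(adjusted, 3))
```

A score equal to the old base rate maps to the new base rate, and the ranking of transactions is unchanged. Only the calibration moves, which matches the observation that the decision boundary itself can remain appropriate.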
5.2 Monitoring Implementation
Production systems continuously track metrics, alert on anomalies, and maintain historical records enabling post-hoc analysis. Cloud platforms (AWS SageMaker Model Monitor, Azure ML Model Monitoring, Google Cloud AI Platform) provide managed monitoring services. Open-source tools (Evidently AI, Alibi Detect) enable custom monitoring logic integrated into existing infrastructure.
Effective monitoring requires baselines: knowing what “normal” looks like is what makes the abnormal detectable. Establishing baselines involves computing statistics on training data and on initial production data after deployment. As distributions evolve gradually, baselines should be updated periodically to avoid alert fatigue from slow continuous drift.
Alerting thresholds require careful calibration balancing sensitivity (catching problems early) against specificity (avoiding false alarms). Overly sensitive thresholds generate alerts for insignificant fluctuations, training teams to ignore alerts and missing genuine problems. Insensitive thresholds delay problem detection, allowing degraded models to serve poor predictions for extended periods. The optimal calibration depends on consequences of failures versus costs of investigation.
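One simple way to trade sensitivity against false alarms is to require several consecutive threshold breaches before alerting. The sketch below is an illustrative rule, not a recommendation for specific values: `breach_alerts`, the threshold, the `patience` parameter, and the daily metric values are all hypothetical.

```python
from typing import List, Sequence


def breach_alerts(values: Sequence[float], threshold: float, patience: int = 3) -> List[int]:
    """Return the indices at which an alert fires.

    An alert fires once, on the observation that completes `patience`
    consecutive breaches of `threshold`.
    """
    alerts: List[int] = []
    streak = 0
    for i, value in enumerate(values):
        streak = streak + 1 if value > threshold else 0
        if streak == patience:
            alerts.append(i)
    return alerts


# Made-up daily drift metric: one isolated spike, then a sustained rise.
daily_psi = [0.02, 0.12, 0.03, 0.04, 0.11, 0.13, 0.15, 0.16, 0.05]

naive = [i for i, v in enumerate(daily_psi) if v > 0.10]
persistent = breach_alerts(daily_psi, threshold=0.10, patience=3)

print("alert on every breach:", naive)
print("alert after 3 in a row:", persistent)
```

The persistence rule ignores the isolated spike and fires once during the sustained rise, at the cost of a two-day delay. That delay is the price paid for fewer false alarms, and whether it is acceptable depends on the cost of serving degraded predictions in the meantime.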
6 Regulatory Requirements: Model Risk Management
Financial machine learning operates under regulatory frameworks that don’t apply to most software systems. The Federal Reserve’s SR 11-7 guidance on Model Risk Management establishes expectations for model development, implementation, and use. Models must undergo independent validation before deployment, maintain comprehensive documentation enabling reproducibility, implement effective oversight identifying limitations, and establish governance ensuring ongoing monitoring (Board of Governors of the Federal Reserve System 2011).
6.1 SR 11-7 Framework
SR 11-7 defines a model broadly as a quantitative method, system, or approach that applies statistical, economic, financial, or mathematical theories, techniques, and assumptions to process input data into quantitative estimates. This encompasses credit scoring, fraud detection, portfolio optimisation, risk measurement, and pricing models. The guidance identifies two sources of model risk: the model might be fundamentally flawed or incorrectly implemented, and the model might be used inappropriately or its limitations not understood.
Development and implementation requires sound design aligned with product purpose and bank strategy, rigorous testing before deployment, evaluation of conceptual soundness (do assumptions and mathematics make sense?), ongoing monitoring and validation, and comprehensive documentation enabling independent reviewers to understand model logic, data sources, and limitations.
Independent validation provides critical assessment separate from development. Validators review conceptual soundness, examine data quality and relevance, replicate key development steps, conduct sensitivity analysis exploring model behaviour under alternative assumptions, and assess whether model limitations are understood and appropriately managed. Validation must occur before initial deployment and be repeated periodically or when material changes occur.
Governance and controls establish board and senior management oversight of model risk, clear policies defining model risk management, accountability for model development and validation, and protocols for model changes and exceptions. The framework emphasises that model users must understand limitations: models are simplifications that inevitably miss elements of the modelled phenomenon, and misuse or misinterpretation creates risk independent of model quality.
6.2 Explainability and Fair Lending
The Equal Credit Opportunity Act and Regulation B prohibit discrimination in lending based on protected characteristics (race, colour, religion, national origin, sex, marital status, age). Machine learning models must not create disparate impact (materially different outcomes across protected groups arising from a facially neutral practice), even if protected characteristics aren’t directly used as features (Barocas and Selbst 2016).
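A first-pass screen for disparate impact compares approval rates across groups. The sketch below computes the ratio of the lowest to the highest group approval rate on made-up data. The 0.8 cut-off is an assumption borrowed from the "four-fifths" rule of thumb in US employment-selection guidance; it is used here only as an illustrative screening heuristic, not as a legal test for lending.

```python
import pandas as pd

# Illustrative decisions (fabricated for this sketch): 1 = approved, 0 = denied.
decisions = pd.DataFrame(
    {
        "group": ["A"] * 10 + ["B"] * 10,
        "approved": [1] * 8 + [0] * 2 + [1] * 5 + [0] * 5,
    }
)

# Approval rate per group.
rates = decisions.groupby("group")["approved"].mean()

# Adverse impact ratio: lowest group approval rate over highest.
adverse_impact_ratio = float(rates.min() / rates.max())

# Screening heuristic only (assumed threshold, see the note above).
flag_for_review = adverse_impact_ratio < 0.8

print(rates.to_dict())
print(f"adverse impact ratio: {adverse_impact_ratio:.3f}")
print("flag for fairness review:", flag_for_review)
```

A flagged ratio is a prompt for deeper analysis (for example, controlling for legitimate credit factors), not evidence of discrimination on its own.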
Explainability helps demonstrate compliance by showing which factors influenced decisions. LIME (Local Interpretable Model-agnostic Explanations) and SHAP (SHapley Additive exPlanations) provide post-hoc explanations for individual predictions from black-box models. LIME fits an interpretable surrogate model (such as a sparse linear regression) that locally approximates the complex model around a specific prediction; SHAP attributes the prediction to individual features using Shapley values from cooperative game theory (Lundberg and Lee 2017). Both can support the “adverse action notices” that explain credit denials.
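To make the idea of Shapley-value attribution concrete, the sketch below computes exact Shapley values for a tiny model by enumerating every feature coalition. It is a brute-force illustration, not the algorithm used by the `shap` library: absent features are replaced by values from a background sample, the cost grows exponentially with the number of features, and all names and data are hypothetical.

```python
from itertools import combinations
from math import factorial

import numpy as np


def shapley_values(predict, x: np.ndarray, background: np.ndarray) -> np.ndarray:
    """Exact Shapley values for one instance by enumerating coalitions."""
    n = len(x)

    def value(coalition) -> float:
        # Features in the coalition take the instance's values; the rest
        # keep their background values, and predictions are averaged.
        data = background.copy()
        idx = list(coalition)
        if idx:
            data[:, idx] = x[idx]
        return float(predict(data).mean())

    phi = np.zeros(n)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                phi[i] += weight * (value(subset + (i,)) - value(subset))
    return phi


# Toy linear "credit score" model (made-up coefficients).
weights = np.array([2.0, -1.0, 0.5])
intercept = 0.3


def linear_model(data: np.ndarray) -> np.ndarray:
    return data @ weights + intercept


rng = np.random.default_rng(0)
background = rng.normal(size=(200, 3))
x = np.array([1.0, 2.0, -1.0])

phi = shapley_values(linear_model, x, background)
# For a linear model the attribution has a closed form to compare against.
expected = weights * (x - background.mean(axis=0))
print("Shapley values:", np.round(phi, 3))
print("closed form:   ", np.round(expected, 3))
```

The attributions sum to the difference between the prediction for `x` and the average prediction over the background data. This additivity is what makes them usable as per-feature reasons for an individual decision.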
However, explainability introduces tensions. Most interpretability methods provide approximations rather than true model behaviour: explanations might be misleading if the local approximation is poor. Moreover, providing detailed explanations could enable gaming: applicants might learn precisely how to manipulate features for approval. Balancing the transparency that supports fairness assessment against gaming risk and intellectual property concerns remains an active area of regulatory evolution.
7 Conclusion: Engineering Discipline for Financial ML
Moving machine learning from research to production requires substantially more than algorithmic innovation. Data pipelines must reliably ingest heterogeneous sources, transform data consistently between training and serving, and scale to production volumes whilst maintaining low latency. Feature engineering must maintain temporal correctness preventing look-ahead bias that inflates backtest performance. Validation must account for multiple testing through appropriate corrections and use rigorous cross-validation acknowledging temporal dependencies. Production systems must monitor for drift, version models enabling rollback, and maintain audit trails satisfying regulatory requirements.
The statistical challenges are particularly acute in finance given non-stationarity and the prevalence of overfitting. Research demonstrating impressive backtest performance often reflects data mining rather than genuine predictive power, with probability of overfitting exceeding 50% for many published strategies. Rigorous validation using combinatorial purged cross-validation, multiple testing corrections, and truly independent holdout sets provides more realistic performance estimates, although these can still overestimate live performance because no procedure accounts perfectly for every bias (Bailey and Prado 2014).
Regulatory requirements add substantial complexity beyond typical software engineering. Model risk management frameworks demand comprehensive documentation, independent validation, and ongoing monitoring. Explainability requirements for fair lending necessitate interpretable models or post-hoc explanation techniques. These requirements create engineering overhead but serve important functions: ensuring that consequential automated decisions are auditable, reproducible, and demonstrably non-discriminatory.
The production machine learning gap (70% of projects failing to deploy) reflects not impossibility but underestimation of the engineering effort required. Organisations succeeding in production ML invest heavily in infrastructure (feature stores, model serving platforms, monitoring systems), cultivate talent combining software engineering with statistical expertise, establish clear ownership and accountability, and maintain realistic expectations about model performance and maintenance requirements (Paleyes, Urma, and Lawrence 2022).
For students developing factor models or fraud detection systems, the production perspective matters even when actual deployment isn’t required. Implementing temporal correctness in features, applying multiple testing corrections, using CPCV for validation, and documenting methodology comprehensively demonstrates understanding that research prototypes must satisfy additional constraints to become reliable systems. These practices distinguish sophisticated analysis from naive curve-fitting, and increasingly, employers expect this production-aware mindset from quantitative finance candidates.
8 Further Reading
8.1 Core Academic Papers
Bailey and Prado (2014) introduces Probability of Backtest Overfitting, demonstrating that many published investment strategies likely exhibit spurious performance.
Harvey, Liu, and Zhu (2016) analyzes the factor zoo, finding that multiple testing substantially reduces statistical significance of many published factors.
Sculley et al. (2015) describes “technical debt” in machine learning systems, identifying hidden costs that accumulate in production.
Breck et al. (2017) provides practical guidance on data validation for machine learning systems, addressing common production failure modes.
8.2 Practical Resources
Polyzotis et al. (2017) discusses training-serving skew and strategies for maintaining consistency between development and production.
Google’s MLOps course and papers on ML system architecture provide industry perspective on production best practices.
Board of Governors of the Federal Reserve System (2011) is the Federal Reserve’s SR 11-7 guidance on Model Risk Management: essential reading for understanding regulatory requirements.
Benjamini and Hochberg (1995) introduces the False Discovery Rate approach to multiple testing control.
8.3 Books and Extended Treatments
Advances in Financial Machine Learning by Marcos López de Prado covers CPCV, PBO, and other production ML techniques for finance.
Machine Learning Systems: Design and Implementation provides comprehensive coverage of production ML infrastructure.
Interpretable Machine Learning by Christoph Molnar offers detailed treatment of explanation techniques.
The field evolves rapidly: supplement academic foundations with current industry practices through conference presentations (MLOps World, PyData), blog posts from major technology companies, and open-source projects demonstrating production patterns.