Machine Learning for Financial Applications
Welcome to Machine Learning in Finance - approached with both enthusiasm and appropriate caution. Machine learning offers valuable tools for financial analysis, but it’s important to understand both what these methods can accomplish and where they may fall short. This chapter explores ML techniques through the lens of statistical thinking, emphasizing the importance of understanding assumptions, limitations, and the crucial distinction between prediction and causation. We’ll integrate insights from both traditional ML approaches and modern causal reasoning to develop a more complete understanding of when and how these methods can be most valuable.
0.1 A Statistical Foundation for Machine Learning
Before diving into algorithms, let’s establish the statistical foundations that underpin all meaningful machine learning applications in finance. At its core, machine learning is applied statistics - we’re trying to learn patterns from data while accounting for uncertainty and avoiding overfitting.
0.2 Statistical Foundations: What We’re Really Doing
When we apply machine learning in finance, we’re essentially trying to:
- Learn patterns from historical data while recognizing that financial markets evolve
- Make predictions under uncertainty while acknowledging our confidence intervals
- Distinguish signal from noise while avoiding the trap of finding patterns that don’t exist
- Generalize to new situations while understanding when our models might break down
A crucial insight from integrating causal reasoning: prediction and causation are different goals requiring different approaches.
- Predictive models ask: “Given what we’ve seen, what’s likely to happen next?”
- Causal models ask: “If we change something, what will happen?”
Both are valuable, but they serve different purposes and require different validation approaches.
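To make the distinction concrete, here is a toy simulation (entirely synthetic numbers) in which an unobserved "sentiment" factor drives both a trading signal and returns. The signal predicts returns well, yet intervening on it would change nothing, because it has no direct effect:

import numpy as np

rng = np.random.default_rng(0)
sentiment = rng.normal(size=10_000)                        # unobserved common cause
signal = 0.9 * sentiment + 0.5 * rng.normal(size=10_000)
returns = 0.9 * sentiment + 0.5 * rng.normal(size=10_000)  # no direct link to signal

print("Correlation(signal, returns):",
      round(float(np.corrcoef(signal, returns)[0, 1]), 2))
# A predictive model happily exploits this correlation; a causal model asks
# what happens to returns if we manipulate the signal -- here, nothing.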
0.3 Supervised Learning: Prediction with Labeled Data
Supervised learning develops models using historical examples where we know both the inputs and the desired outputs. This approach can be valuable for financial applications, though we must be careful about several assumptions:
Potential Applications in Finance:
- Estimating volatility and risk (with appropriate uncertainty quantification)
- Exploring relationships between market variables (while distinguishing correlation from causation)
- Developing credit scoring models (with careful attention to bias and fairness)
- Detecting potentially fraudulent transactions (while minimizing false positives)
Common Algorithms and Their Trade-offs:
- Linear regression: Interpretable but assumes linear relationships
- Random forests: Flexible but can overfit and are less interpretable
- Support vector machines: Good for high-dimensional data but computationally intensive
- Neural networks: Very flexible but require large datasets and careful regularization
Common Pitfalls to Avoid:
- Survivorship bias: Using only data from companies/assets that still exist
- Look-ahead bias: Accidentally using future information to predict the past (illustrated in the sketch after this list)
- Overfitting: Creating models that memorize noise rather than learning patterns
- Assuming stationarity: Financial relationships change over time
- Confusing correlation with causation: High predictive accuracy doesn’t imply causal understanding
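To make the look-ahead pitfall concrete, here is a minimal sketch on a synthetic price series; the centered rolling window is the (deliberate) mistake, because it averages over days that have not happened yet:

import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
prices = pd.Series(100 * np.exp(np.cumsum(rng.normal(0, 0.01, 500))))
returns = prices.pct_change()

# WRONG: a centered 5-day average of returns uses future days as inputs
leaky_feature = returns.rolling(5, center=True).mean()

# RIGHT: only information available at the close of day t is used,
# and the target is the next day's return, aligned after the features
safe_feature = returns.rolling(5).mean()
target = returns.shift(-1)

Backtests built on the leaky feature look impressive and fail in production, which is why temporal alignment deserves explicit checks.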
0.4 Unsupervised Learning
Unsupervised learning operates on unlabeled data, focusing on discovering the hidden structures and patterns within it. The primary unsupervised learning tasks are clustering, dimensionality reduction, and anomaly detection. In finance, unsupervised learning can serve several objectives, including:
- Segmenting customers or investors
- Identifying undervalued or overvalued assets
- Recognizing emerging trends and breaking news
- Monitoring systemic risk
- Flagging suspicious activity
Prominent unsupervised learning algorithms include k-means clustering, hierarchical clustering, and principal component analysis (PCA); a brief PCA sketch follows.
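As a minimal illustration of what PCA recovers, consider simulated returns for five hypothetical assets driven by a single common factor; the first principal component should absorb most of the shared variation:

import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# One common "market" factor drives all five synthetic assets
market = rng.normal(0, 0.01, 500)
returns = pd.DataFrame(
    {f"asset_{i}": 0.8 * market + rng.normal(0, 0.005, 500) for i in range(5)}
)

pca = PCA(n_components=2)
pca.fit(returns)
# Expect the first component to explain the bulk of the variance
print("Explained variance ratios:", pca.explained_variance_ratio_.round(3))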
0.5 Reinforcement Learning
Reinforcement learning (RL) sits somewhere between supervised and unsupervised learning, drawing inspiration from trial-and-error learning and decision theory. Rather than merely receiving labeled data, RL agents interact with their environment, gather experience, and adjust their behavior to maximize cumulative reward. In finance, RL can be applied to intricate problems such as:
- Algorithmic trading
- Optimal execution
- Portfolio optimization
- Robo-advisory
Notable RL algorithms include Q-learning, Deep Q-Networks (DQN), actor-critic methods, and temporal-difference (TD) learning.
0.5.1 Misconceptions Surrounding Reinforcement Learning
Although reinforcement learning shares some machinery with supervised learning, it would be a mistake to equate the two. RL has distinctive attributes that make it well suited to specific challenges in financial decision-making. Several distinguishing traits include:
- Online learning: RL generally proceeds incrementally, assimilating novel experiences alongside existing knowledge.
- Delayed feedback: Outcomes in RL usually manifest with a delay, prompting agents to learn delayed gratification and patience.
- Sequential decision-making: RL grapples with sequences of related decisions, accounting for dependencies amongst successive choices.
Recognizing the divergent qualities of supervised and reinforcement learning allows practitioners to choose appropriate methods for specific financial applications, improving both performance and the quality of the resulting insights.
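The delayed-feedback and sequential-decision points above can be made precise with the temporal-difference update at the heart of Q-learning (a fuller trading example appears later in this chapter). A one-step sketch with illustrative values for the learning rate and discount factor:

import numpy as np

q = np.zeros((4, 3))                    # 4 states x 3 actions
alpha, gamma = 0.1, 0.95                # learning rate, discount factor
state, action, reward, next_state = 0, 1, 0.02, 2

# Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]
q[state, action] += alpha * (reward + gamma * q[next_state].max() - q[state, action])
print(q[state, action])                 # 0.002: today's reward nudges the estimate

The discount factor gamma is what lets the agent value rewards that arrive only after a sequence of decisions.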
1 Applications to Financial Data
Machine learning techniques help uncover underlying patterns in financial data and, used carefully, can meaningfully inform investment strategies, risk management, fraud detection, and portfolio optimization.
1.1 Industry Applications
Supervised and unsupervised learning techniques hold great potential in the world of finance. They can assist investors, researchers, and practitioners in making informed decisions, deriving insights from vast amounts of data, and automating repetitive processes.
1.1.1 Supervised Learning in Finance
Financial markets constantly evolve, driven by factors such as news events, investor sentiment, and shifting monetary policy. Consequently, accurate forecasting remains a challenge, despite decades of advancement in mathematical modeling and computer algorithms. Nevertheless, supervised learning plays a crucial role in finance because of its ability to establish links between variables and extrapolate patterns found in historical data. Some areas where supervised learning thrives in finance include:
Price and Volume Forecasting: Leveraging historical asset prices and volumes, supervised learning models anticipate future security movements. Accurate predictions can inform investment strategies, minimize risks, and optimize portfolios.
Sentiment Analysis: Applying natural language processing and machine learning, financial analysts gauge public opinion about companies or investments from social media posts, online articles, and press releases. Positive sentiment can boost demand and push prices up, whereas negative sentiment can deter investors and weigh on prices.
Credit Scoring: Evaluating creditworthiness becomes crucial in consumer lending, insurance, and corporate financing. Supervised learning algorithms determine clients’ default probabilities based on payment histories, debt levels, income, employment status, and personal characteristics.
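As a minimal sketch of this idea, the following fits a logistic regression to synthetic applicant data (every relationship here is invented for illustration) and reads off estimated default probabilities:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
n = 5000
income = rng.lognormal(10.5, 0.5, n)
debt_ratio = rng.beta(2, 5, n)
late_payments = rng.poisson(1.0, n)

# Synthetic ground truth: default risk rises with debt and late payments
logit = -2.0 + 3.0 * debt_ratio + 0.5 * late_payments - 0.3 * (np.log(income) - 10.5)
default = rng.binomial(1, 1 / (1 + np.exp(-logit)))

X_credit = np.column_stack([np.log(income), debt_ratio, late_payments])
X_tr, X_te, y_tr, y_te = train_test_split(X_credit, default, random_state=42)
clf = LogisticRegression().fit(X_tr, y_tr)
print("Estimated default probabilities (first 3 applicants):",
      clf.predict_proba(X_te)[:3, 1].round(3))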
Algorithmic Trading: Automated trading relies heavily on supervised learning models to react swiftly to market developments, capitalize on opportunities, and mitigate losses. Traders also employ reinforcement learning, a distinct paradigm discussed earlier in this chapter, to refine trading tactics continuously.
Fraud Detection: Detecting irregular transactions early safeguards banks and consumers from substantial losses. Supervised learning models flag potentially fraudulent behavior for investigation, helping protect finances and reputations.
1.1.2 Unsupervised Learning in Finance
Financial institutions house enormous quantities of structured and semi-structured data whose patterns often go untapped. Unsupervised learning techniques expose hidden structures, associations, and aberrations inherent in financial datasets, complementing conventional supervised learning approaches. Areas where unsupervised learning contributes significantly in finance include:
Portfolio Optimization: Clustering techniques partition securities into homogeneous groups, facilitating diversification and risk management. Investors can allocate assets intelligently, balancing exposure to various sectors or industries, hedging bets, and amplifying rewards.
Network Analysis: Graph theoretical concepts illuminate invisible webs connecting organizations, people, and entities via ownership, transactional, or contractual ties. Social network analysis discovers communities, influential nodes, and central figures in financial ecosystems.
Event Studies: Unsupervised learning pinpoints inflection points in financial series, such as mergers, acquisitions, or regulatory shifts, revealing the timing, magnitude, and duration of their impact (establishing genuine causality, as emphasized above, requires more than pattern detection). Such studies inform strategic choices, tactical maneuvers, and operational tweaks.
Text Analytics: Topic modeling and document embedding find usage in parsing contracts, legal agreements, and disclosure statements. Dimensionality reduction highlights salient themes, phrases, and keywords, streamlining compliance reviews and expediting audits.
Robo-Advisory: Personalized wealth management services recommend products that align customers' preferences, constraints, and expectations with available options, boosting satisfaction and loyalty. Customizable robo-advice engines lower client acquisition, engagement, and servicing costs.
1.2 Practical Integration: Traditional ML + Causal Reasoning
Let’s see how we can combine traditional machine learning with causal thinking for more robust financial analysis:
# Comprehensive example: Stock return prediction with causal awareness
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import yfinance as yf

# For causal analysis
try:
    import dowhy
    from dowhy import CausalModel
    CAUSAL_AVAILABLE = True
except ImportError:
    print("DoWhy not available. Install with: pip install dowhy")
    CAUSAL_AVAILABLE = False


def comprehensive_analysis(ticker, start_date='2020-01-01', end_date='2023-01-01'):
    """
    Demonstrate both traditional ML and causal reasoning approaches
    """
    # Step 1: Data preparation with statistical rigor
    print(f"Analyzing {ticker} from {start_date} to {end_date}")

    # Get stock data
    stock_data = yf.download(ticker, start=start_date, end=end_date)

    # Create features (being careful about look-ahead bias)
    features_df = pd.DataFrame()
    features_df['returns'] = stock_data['Adj Close'].pct_change()
    features_df['volume'] = stock_data['Volume']
    features_df['volatility'] = features_df['returns'].rolling(20).std()
    features_df['ma_5'] = stock_data['Adj Close'].rolling(5).mean()
    features_df['ma_20'] = stock_data['Adj Close'].rolling(20).mean()
    features_df['price_momentum'] = (features_df['ma_5'] / features_df['ma_20']) - 1

    # Create target variable (next day return)
    features_df['next_day_return'] = features_df['returns'].shift(-1)

    # Add market context (simulated - in practice use real economic indicators)
    np.random.seed(42)  # For reproducibility
    features_df['market_sentiment'] = np.random.normal(0, 1, len(features_df))
    features_df['economic_conditions'] = np.random.normal(0, 1, len(features_df))

    # Clean data
    features_df = features_df.dropna()

    if len(features_df) < 50:
        print("Insufficient data for analysis")
        return None

    print(f"Dataset size: {len(features_df)} observations")

    # Step 2: Traditional Machine Learning Approach
    print("\n=== TRADITIONAL MACHINE LEARNING APPROACH ===")

    # Prepare features and target
    feature_cols = ['volatility', 'price_momentum', 'volume', 'market_sentiment', 'economic_conditions']
    X = features_df[feature_cols].fillna(0)
    y = features_df['next_day_return'].fillna(0)

    # Train-test split (respecting time order for financial data)
    split_point = int(0.8 * len(X))
    X_train, X_test = X.iloc[:split_point], X.iloc[split_point:]
    y_train, y_test = y.iloc[:split_point], y.iloc[split_point:]

    # Train multiple models
    models = {
        'Linear Regression': LinearRegression(),
        'Random Forest': RandomForestRegressor(n_estimators=100, random_state=42)
    }

    ml_results = {}
    for name, model in models.items():
        # Train model
        model.fit(X_train, y_train)

        # Make predictions
        y_pred = model.predict(X_test)

        # Evaluate
        mse = mean_squared_error(y_test, y_pred)
        r2 = r2_score(y_test, y_pred)

        ml_results[name] = {
            'mse': mse,
            'r2': r2,
            'model': model
        }

        print(f"{name}:")
        print(f"  MSE: {mse:.6f}")
        print(f"  R²: {r2:.4f}")

        # Feature importance (if available)
        if hasattr(model, 'feature_importances_'):
            importance = pd.DataFrame({
                'feature': feature_cols,
                'importance': model.feature_importances_
            }).sort_values('importance', ascending=False)
            print(f"  Top features: {importance.iloc[0]['feature']} ({importance.iloc[0]['importance']:.3f})")

    # Step 3: Causal Reasoning Approach
    print("\n=== CAUSAL REASONING APPROACH ===")

    if CAUSAL_AVAILABLE and len(features_df) > 100:
        try:
            # Define causal graph based on domain knowledge
            causal_graph = """
            digraph {
                "economic_conditions" -> "market_sentiment";
                "economic_conditions" -> "volatility";
                "market_sentiment" -> "price_momentum";
                "market_sentiment" -> "next_day_return";
                "volatility" -> "next_day_return";
                "price_momentum" -> "next_day_return";
            }
            """

            # Build causal model
            causal_model = CausalModel(
                data=features_df[['volatility', 'price_momentum', 'market_sentiment',
                                  'economic_conditions', 'next_day_return']].dropna(),
                treatment='price_momentum',
                outcome='next_day_return',
                graph=causal_graph
            )

            # Identify causal effect
            identified_estimand = causal_model.identify_effect()

            # Estimate causal effect
            causal_estimate = causal_model.estimate_effect(
                identified_estimand,
                method_name="backdoor.linear_regression"
            )

            print(f"Causal Effect of Price Momentum on Returns: {causal_estimate.value:.6f}")

            # Compare with correlation
            correlation = features_df['price_momentum'].corr(features_df['next_day_return'])
            print(f"Traditional Correlation: {correlation:.6f}")
            print(f"Difference (Causal - Correlation): {causal_estimate.value - correlation:.6f}")

            # Refutation test
            refutation = causal_model.refute_estimate(
                identified_estimand,
                causal_estimate,
                method_name="random_common_cause"
            )
            print(f"Refutation test result: {refutation.new_effect:.6f} (should be close to original)")

        except Exception as e:
            print(f"Causal analysis encountered challenges: {e}")
            print("This is common with financial data - causal inference requires careful setup.")
    else:
        print("Causal analysis not available or insufficient data.")
        print("Conceptually: We would ask whether price momentum actually *causes* returns")
        print("or whether both are driven by common factors like market sentiment.")

    # Step 4: Critical Interpretation
    print("\n=== CRITICAL INTERPRETATION ===")
    print("Key Questions to Ask:")
    print("1. Do our models generalize to new market conditions?")
    print("2. Are we predicting returns or just fitting noise?")
    print("3. What assumptions are we making about market efficiency?")
    print("4. How would our conclusions change with different time periods?")
    print("5. Are we confusing statistical association with economic causation?")

    return {
        'data': features_df,
        'ml_results': ml_results,
        'causal_available': CAUSAL_AVAILABLE
    }

# Example usage
results = comprehensive_analysis('AAPL', '2020-01-01', '2023-01-01')
This comprehensive example demonstrates several crucial principles:
- Statistical Rigor: We carefully avoid look-ahead bias and use appropriate train-test splits
- Multiple Approaches: We compare different ML models and understand their trade-offs
- Causal Thinking: We ask not just “what predicts returns?” but “what causes returns?”
- Intellectual Humility: We acknowledge limitations and ask critical questions about our results
- Domain Knowledge: We incorporate financial concepts like momentum and volatility
The goal isn’t to find the “best” model, but to develop a deeper understanding of the relationships in our data and the assumptions underlying our analysis.
1.3 Best Practices for ML in Finance
Based on both traditional ML wisdom and insights from causal reasoning:
1.3.1 1. Start with Domain Knowledge
- Understand the financial phenomena you’re modeling
- Use economic theory to inform feature selection
- Be skeptical of purely data-driven discoveries
1.3.2 2. Validate Rigorously
- Use out-of-sample testing with temporal splits
- Test models across different market regimes
- Quantify uncertainty, not just point predictions (see the sketch after this list)
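One simple way to honor this practice is to report an empirical interval built from out-of-sample residuals rather than a bare point forecast. A minimal sketch on synthetic, fat-tailed residuals:

import numpy as np

rng = np.random.default_rng(7)
residuals = rng.standard_t(df=4, size=500) * 0.01   # fat-tailed, like daily returns
point_forecast = 0.0005                             # hypothetical model output

lo, hi = np.percentile(residuals, [5, 95])
print(f"Point forecast: {point_forecast:.4f}")
print(f"90% empirical interval: [{point_forecast + lo:.4f}, {point_forecast + hi:.4f}]")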
1.3.3 3. Think Causally
- Ask whether relationships will persist under intervention
- Consider confounding factors and selection biases
- Distinguish between prediction and explanation goals
1.3.4 4. Maintain Intellectual Humility
- Acknowledge model limitations explicitly
- Test robustness to assumptions
- Update beliefs when evidence contradicts expectations
By integrating these approaches, we develop more robust and insightful financial analysis capabilities.
1.4 Key Topics in Financial Machine Learning
Feature Selection: Identify essential features for building robust and parsimonious models. Filter, wrapper, and embedded feature selection techniques are typically used.
Regularization: Reduce overfitting by shrinking coefficients toward zero. Ridge, Lasso, and Elastic Net regressions are common types of regularization techniques.
Cross-Validation: Estimate performance measures for supervised learning models by splitting the data into training and validation sets repeatedly. K-fold cross-validation is one of the most popular methods.
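The following minimal sketch ties these ideas together on synthetic data: k-fold cross-validation scores for Ridge and Lasso, with Lasso's zeroed coefficients acting as embedded feature selection (the alpha values are illustrative, not tuned):

import numpy as np
from sklearn.linear_model import Ridge, Lasso
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 10))         # 10 candidate features
beta = np.array([1.5, -2.0] + [0.0] * 8)    # only the first 2 truly matter
y_demo = X_demo @ beta + rng.normal(0, 0.5, 200)

cv = KFold(n_splits=5, shuffle=False)       # no shuffling, as with ordered data
for model in (Ridge(alpha=1.0), Lasso(alpha=0.1)):
    scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="r2")
    print(type(model).__name__, "mean CV R²:", scores.mean().round(3))

# Lasso drives irrelevant coefficients to exactly zero
lasso = Lasso(alpha=0.1).fit(X_demo, y_demo)
print("Nonzero Lasso coefficients:", np.flatnonzero(lasso.coef_))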
Machine Learning Models:
Regression: Predict a continuous target variable. Linear regression, polynomial regression, splines, Random Forests, Gradient Boosting Machines, Support Vector Machines, Neural Networks, etc., are common techniques.
Classification: Assign discrete categories to data points. Logistic regression, Decision Trees, Naïve Bayes, Random Forests, Gradient Boosting Machines, Support Vector Machines, Neural Networks, etc., are widely used techniques.
Clustering: Group similar observations into clusters. K-Means, DBSCAN, Hierarchical Clustering, Gaussian Mixture Models, etc., are typical techniques.
1.4.1 Real-World Applications of ML in Finance
- Portfolio Optimization: Construct optimal portfolios using machine learning algorithms to maximize returns and minimize risk.
- Algorithmic Trading: Automate trading strategies based on market indicators, sentiment analysis, news feeds, and technical analysis.
- Fraud Detection: Detect anomalous transactions and prevent money laundering activities using unsupervised learning techniques (see the sketch after this list).
- Credit Scoring: Evaluate creditworthiness and default risk for loan applicants using supervised learning algorithms.
- Risk Management: Quantify and manage market, liquidity, and operational risks using advanced machine learning techniques.
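As a minimal sketch of unsupervised fraud flagging, the following fits an IsolationForest to synthetic transactions described by amount and hour of day; the injected anomalies and the contamination level are assumptions for illustration:

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(5)
normal_tx = rng.normal(loc=[50, 12], scale=[20, 4], size=(1000, 2))  # amount, hour
fraud_tx = rng.normal(loc=[900, 3], scale=[100, 1], size=(10, 2))    # large, late-night
X_tx = np.vstack([normal_tx, fraud_tx])

iso = IsolationForest(contamination=0.01, random_state=42).fit(X_tx)
flags = iso.predict(X_tx)            # -1 marks suspected anomalies
print("Flagged transaction indices:", np.where(flags == -1)[0])

In practice, flagged cases go to human review; the cost of false positives (blocked legitimate payments) must be weighed against the cost of missed fraud.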
# Essential imports for ML in finance
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import yfinance as yf
from datetime import datetime, timedelta
import warnings
warnings.filterwarnings('ignore')

# Machine Learning libraries
from sklearn.model_selection import train_test_split, TimeSeriesSplit, GridSearchCV
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.linear_model import LinearRegression, LogisticRegression, Ridge, Lasso
from sklearn.svm import SVR, SVC
from sklearn.metrics import mean_squared_error, accuracy_score, classification_report
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score
import xgboost as xgb
import lightgbm as lgb

# Deep learning
try:
    import tensorflow as tf
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Dense, LSTM, Dropout
    print("TensorFlow available for deep learning")
except ImportError:
    print("TensorFlow not available - install for deep learning capabilities")

# Set random seeds for reproducibility
np.random.seed(42)
if 'tf' in globals():
    tf.random.set_seed(42)

print("Machine Learning environment configured!")
1.5 Supervised Learning in Finance
Supervised learning algorithms learn from labeled training data to make predictions on new, unseen data. In finance, this includes predicting stock prices, credit defaults, or market directions.
1.5.1 1. Stock Price Prediction
def prepare_stock_data_for_ml(ticker='AAPL', period='2y', prediction_days=5):
    """
    Prepare stock data for machine learning prediction
    """
    # Fetch stock data
    data = yf.download(ticker, period=period)

    # Calculate technical indicators
    data['SMA_5'] = data['Close'].rolling(window=5).mean()
    data['SMA_20'] = data['Close'].rolling(window=20).mean()
    data['SMA_50'] = data['Close'].rolling(window=50).mean()

    # Price-based features
    data['Price_Change'] = data['Close'].pct_change()
    data['High_Low_Pct'] = (data['High'] - data['Low']) / data['Close']
    data['Price_Volume'] = data['Close'] * data['Volume']

    # Volatility features
    data['Volatility'] = data['Price_Change'].rolling(window=20).std()

    # RSI
    delta = data['Close'].diff()
    gain = (delta.where(delta > 0, 0)).rolling(window=14).mean()
    loss = (-delta.where(delta < 0, 0)).rolling(window=14).mean()
    rs = gain / loss
    data['RSI'] = 100 - (100 / (1 + rs))

    # MACD
    exp1 = data['Close'].ewm(span=12).mean()
    exp2 = data['Close'].ewm(span=26).mean()
    data['MACD'] = exp1 - exp2
    data['MACD_Signal'] = data['MACD'].ewm(span=9).mean()

    # Target variable - future price movement
    data['Target'] = data['Close'].shift(-prediction_days)
    data['Target_Direction'] = (data['Target'] > data['Close']).astype(int)

    # Select features
    feature_columns = [
        'Open', 'High', 'Low', 'Volume', 'SMA_5', 'SMA_20', 'SMA_50',
        'Price_Change', 'High_Low_Pct', 'Price_Volume', 'Volatility',
        'RSI', 'MACD', 'MACD_Signal'
    ]

    # Clean data
    data = data.dropna()

    X = data[feature_columns]
    y_regression = data['Target']
    y_classification = data['Target_Direction']

    return X, y_regression, y_classification, data

# Prepare data
X, y_reg, y_class, stock_data = prepare_stock_data_for_ml('AAPL', '3y', 5)
print(f"Features shape: {X.shape}")
print(f"Target samples: {len(y_reg)}")
print(f"Feature columns: {list(X.columns)}")
1.5.2 2. Regression Models for Price Prediction
def compare_regression_models(X, y, test_size=0.2):
    """
    Compare different regression models for stock price prediction
    """
    # Time series split (important for financial data)
    tscv = TimeSeriesSplit(n_splits=5)

    # Split data
    split_idx = int(len(X) * (1 - test_size))
    X_train, X_test = X.iloc[:split_idx], X.iloc[split_idx:]
    y_train, y_test = y.iloc[:split_idx], y.iloc[split_idx:]

    # Scale features
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)

    # Define models
    models = {
        'Linear Regression': LinearRegression(),
        'Ridge Regression': Ridge(alpha=1.0),
        'Lasso Regression': Lasso(alpha=0.1),
        'Random Forest': RandomForestRegressor(n_estimators=100, random_state=42),
        'XGBoost': xgb.XGBRegressor(n_estimators=100, random_state=42),
        'Support Vector Regression': SVR(kernel='rbf', C=100, gamma=0.1)
    }

    results = {}

    print("Regression Model Comparison:")
    print("=" * 50)

    for name, model in models.items():
        # Use scaled data for linear models, original for tree-based
        if name in ['Linear Regression', 'Ridge Regression', 'Lasso Regression', 'Support Vector Regression']:
            X_train_model = X_train_scaled
            X_test_model = X_test_scaled
        else:
            X_train_model = X_train
            X_test_model = X_test

        # Fit model
        model.fit(X_train_model, y_train)

        # Predictions
        train_pred = model.predict(X_train_model)
        test_pred = model.predict(X_test_model)

        # Calculate metrics
        train_rmse = np.sqrt(mean_squared_error(y_train, train_pred))
        test_rmse = np.sqrt(mean_squared_error(y_test, test_pred))

        # R-squared
        train_r2 = model.score(X_train_model, y_train)
        test_r2 = model.score(X_test_model, y_test)

        results[name] = {
            'Train RMSE': train_rmse,
            'Test RMSE': test_rmse,
            'Train R²': train_r2,
            'Test R²': test_r2,
            'Model': model,
            'Predictions': test_pred
        }

        print(f"{name}:")
        print(f"  Train RMSE: {train_rmse:.4f}")
        print(f"  Test RMSE: {test_rmse:.4f}")
        print(f"  Test R²: {test_r2:.4f}")
        print()

    # Visualization
    fig, axes = plt.subplots(2, 2, figsize=(15, 12))

    # Model performance comparison
    model_names = list(results.keys())
    test_rmse_values = [results[name]['Test RMSE'] for name in model_names]
    test_r2_values = [results[name]['Test R²'] for name in model_names]

    axes[0, 0].bar(range(len(model_names)), test_rmse_values, alpha=0.7)
    axes[0, 0].set_title('Test RMSE Comparison')
    axes[0, 0].set_ylabel('RMSE')
    axes[0, 0].set_xticks(range(len(model_names)))
    axes[0, 0].set_xticklabels(model_names, rotation=45, ha='right')
    axes[0, 0].grid(True, alpha=0.3)

    axes[0, 1].bar(range(len(model_names)), test_r2_values, alpha=0.7, color='orange')
    axes[0, 1].set_title('Test R² Comparison')
    axes[0, 1].set_ylabel('R²')
    axes[0, 1].set_xticks(range(len(model_names)))
    axes[0, 1].set_xticklabels(model_names, rotation=45, ha='right')
    axes[0, 1].grid(True, alpha=0.3)

    # Best model predictions vs actual
    best_model_name = min(results.keys(), key=lambda x: results[x]['Test RMSE'])
    best_predictions = results[best_model_name]['Predictions']

    axes[1, 0].scatter(y_test, best_predictions, alpha=0.6)
    axes[1, 0].plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2)
    axes[1, 0].set_title(f'Predictions vs Actual ({best_model_name})')
    axes[1, 0].set_xlabel('Actual Price')
    axes[1, 0].set_ylabel('Predicted Price')
    axes[1, 0].grid(True, alpha=0.3)

    # Residuals plot
    residuals = y_test - best_predictions
    axes[1, 1].scatter(best_predictions, residuals, alpha=0.6)
    axes[1, 1].axhline(y=0, color='r', linestyle='--')
    axes[1, 1].set_title(f'Residuals Plot ({best_model_name})')
    axes[1, 1].set_xlabel('Predicted Price')
    axes[1, 1].set_ylabel('Residuals')
    axes[1, 1].grid(True, alpha=0.3)

    plt.tight_layout()
    plt.show()

    return results, scaler, X_test, y_test

# Compare regression models
reg_results, scaler, X_test_reg, y_test_reg = compare_regression_models(X, y_reg)
1.5.3 3. Classification Models for Direction Prediction
def compare_classification_models(X, y, test_size=0.2):
    """
    Compare different classification models for predicting price direction
    """
    # Split data (time series aware)
    split_idx = int(len(X) * (1 - test_size))
    X_train, X_test = X.iloc[:split_idx], X.iloc[split_idx:]
    y_train, y_test = y.iloc[:split_idx], y.iloc[split_idx:]

    # Scale features
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)

    # Define models
    models = {
        'Logistic Regression': LogisticRegression(random_state=42),
        'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
        'XGBoost': xgb.XGBClassifier(n_estimators=100, random_state=42),
        'Support Vector Classifier': SVC(kernel='rbf', probability=True, random_state=42)
    }

    results = {}

    print("Classification Model Comparison:")
    print("=" * 50)

    for name, model in models.items():
        # Use scaled data for linear models, original for tree-based
        if name in ['Logistic Regression', 'Support Vector Classifier']:
            X_train_model = X_train_scaled
            X_test_model = X_test_scaled
        else:
            X_train_model = X_train
            X_test_model = X_test

        # Fit model
        model.fit(X_train_model, y_train)

        # Predictions
        train_pred = model.predict(X_train_model)
        test_pred = model.predict(X_test_model)
        test_pred_proba = model.predict_proba(X_test_model)[:, 1]

        # Calculate metrics
        train_acc = accuracy_score(y_train, train_pred)
        test_acc = accuracy_score(y_test, test_pred)
        precision = precision_score(y_test, test_pred)
        recall = recall_score(y_test, test_pred)
        f1 = f1_score(y_test, test_pred)
        auc = roc_auc_score(y_test, test_pred_proba)

        results[name] = {
            'Train Accuracy': train_acc,
            'Test Accuracy': test_acc,
            'Precision': precision,
            'Recall': recall,
            'F1-Score': f1,
            'AUC': auc,
            'Model': model,
            'Predictions': test_pred,
            'Probabilities': test_pred_proba
        }

        print(f"{name}:")
        print(f"  Test Accuracy: {test_acc:.4f}")
        print(f"  Precision: {precision:.4f}")
        print(f"  Recall: {recall:.4f}")
        print(f"  F1-Score: {f1:.4f}")
        print(f"  AUC: {auc:.4f}")
        print()

    # Visualization
    fig, axes = plt.subplots(2, 2, figsize=(15, 12))

    # Accuracy comparison
    model_names = list(results.keys())
    accuracies = [results[name]['Test Accuracy'] for name in model_names]
    f1_scores = [results[name]['F1-Score'] for name in model_names]

    axes[0, 0].bar(range(len(model_names)), accuracies, alpha=0.7)
    axes[0, 0].set_title('Test Accuracy Comparison')
    axes[0, 0].set_ylabel('Accuracy')
    axes[0, 0].set_xticks(range(len(model_names)))
    axes[0, 0].set_xticklabels(model_names, rotation=45, ha='right')
    axes[0, 0].grid(True, alpha=0.3)

    axes[0, 1].bar(range(len(model_names)), f1_scores, alpha=0.7, color='orange')
    axes[0, 1].set_title('F1-Score Comparison')
    axes[0, 1].set_ylabel('F1-Score')
    axes[0, 1].set_xticks(range(len(model_names)))
    axes[0, 1].set_xticklabels(model_names, rotation=45, ha='right')
    axes[0, 1].grid(True, alpha=0.3)

    # ROC curves
    from sklearn.metrics import roc_curve
    for name in model_names:
        fpr, tpr, _ = roc_curve(y_test, results[name]['Probabilities'])
        axes[1, 0].plot(fpr, tpr, label=f"{name} (AUC = {results[name]['AUC']:.3f})")

    axes[1, 0].plot([0, 1], [0, 1], 'k--', alpha=0.5)
    axes[1, 0].set_title('ROC Curves')
    axes[1, 0].set_xlabel('False Positive Rate')
    axes[1, 0].set_ylabel('True Positive Rate')
    axes[1, 0].legend()
    axes[1, 0].grid(True, alpha=0.3)

    # Feature importance (best model)
    best_model_name = max(results.keys(), key=lambda x: results[x]['AUC'])
    best_model = results[best_model_name]['Model']

    if hasattr(best_model, 'feature_importances_'):
        feature_importance = pd.DataFrame({
            'feature': X.columns,
            'importance': best_model.feature_importances_
        }).sort_values('importance', ascending=True)

        axes[1, 1].barh(range(len(feature_importance)), feature_importance['importance'])
        axes[1, 1].set_title(f'Feature Importance ({best_model_name})')
        axes[1, 1].set_xlabel('Importance')
        axes[1, 1].set_yticks(range(len(feature_importance)))
        axes[1, 1].set_yticklabels(feature_importance['feature'])
        axes[1, 1].grid(True, alpha=0.3)
    else:
        axes[1, 1].text(0.5, 0.5, 'Feature importance\nnot available\nfor this model',
                        ha='center', va='center', transform=axes[1, 1].transAxes)

    plt.tight_layout()
    plt.show()

    return results

# Compare classification models
class_results = compare_classification_models(X, y_class)
1.6 Deep Learning for Finance
Deep learning models can capture complex non-linear patterns in financial data that traditional models might miss.
1.6.1 LSTM for Time Series Prediction
def create_lstm_model(X, y, sequence_length=60, test_size=0.2):
    """
    Create and train LSTM model for financial time series prediction
    """
    if 'tf' not in globals():
        print("TensorFlow not available. Please install tensorflow for deep learning.")
        return None, None, None

    # Prepare data for LSTM
    scaler = MinMaxScaler()
    X_scaled = scaler.fit_transform(X)

    # Create sequences
    def create_sequences(data, target, seq_length):
        X_seq, y_seq = [], []
        for i in range(seq_length, len(data)):
            X_seq.append(data[i - seq_length:i])
            y_seq.append(target.iloc[i])
        return np.array(X_seq), np.array(y_seq)

    X_sequences, y_sequences = create_sequences(X_scaled, y, sequence_length)

    # Split data
    split_idx = int(len(X_sequences) * (1 - test_size))
    X_train = X_sequences[:split_idx]
    X_test = X_sequences[split_idx:]
    y_train = y_sequences[:split_idx]
    y_test = y_sequences[split_idx:]

    print("LSTM Data shapes:")
    print(f"X_train: {X_train.shape}, y_train: {y_train.shape}")
    print(f"X_test: {X_test.shape}, y_test: {y_test.shape}")

    # Build LSTM model
    model = Sequential([
        LSTM(50, return_sequences=True, input_shape=(sequence_length, X.shape[1])),
        Dropout(0.2),
        LSTM(50, return_sequences=False),
        Dropout(0.2),
        Dense(25),
        Dense(1)
    ])

    model.compile(optimizer='adam', loss='mse', metrics=['mae'])

    print("LSTM Model Architecture:")
    model.summary()

    # Train model
    history = model.fit(
        X_train, y_train,
        epochs=50,
        batch_size=32,
        validation_data=(X_test, y_test),
        verbose=0
    )

    # Make predictions
    train_pred = model.predict(X_train)
    test_pred = model.predict(X_test)

    # Calculate metrics
    train_rmse = np.sqrt(mean_squared_error(y_train, train_pred))
    test_rmse = np.sqrt(mean_squared_error(y_test, test_pred))

    print("\nLSTM Performance:")
    print(f"Train RMSE: {train_rmse:.4f}")
    print(f"Test RMSE: {test_rmse:.4f}")

    # Visualization
    fig, axes = plt.subplots(2, 2, figsize=(15, 12))

    # Training history
    axes[0, 0].plot(history.history['loss'], label='Training Loss')
    axes[0, 0].plot(history.history['val_loss'], label='Validation Loss')
    axes[0, 0].set_title('Model Training History')
    axes[0, 0].set_xlabel('Epoch')
    axes[0, 0].set_ylabel('Loss')
    axes[0, 0].legend()
    axes[0, 0].grid(True, alpha=0.3)

    # Predictions vs actual
    axes[0, 1].scatter(y_test, test_pred, alpha=0.6)
    axes[0, 1].plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2)
    axes[0, 1].set_title('LSTM: Predictions vs Actual')
    axes[0, 1].set_xlabel('Actual Price')
    axes[0, 1].set_ylabel('Predicted Price')
    axes[0, 1].grid(True, alpha=0.3)

    # Time series of predictions
    test_dates = X.index[split_idx + sequence_length:]
    axes[1, 0].plot(test_dates, y_test, label='Actual', alpha=0.7)
    axes[1, 0].plot(test_dates, test_pred.flatten(), label='Predicted', alpha=0.7)
    axes[1, 0].set_title('LSTM: Time Series Predictions')
    axes[1, 0].set_xlabel('Date')
    axes[1, 0].set_ylabel('Price')
    axes[1, 0].legend()
    axes[1, 0].grid(True, alpha=0.3)

    # Residuals
    residuals = y_test - test_pred.flatten()
    axes[1, 1].plot(test_dates, residuals, alpha=0.7)
    axes[1, 1].axhline(y=0, color='r', linestyle='--')
    axes[1, 1].set_title('LSTM: Prediction Residuals')
    axes[1, 1].set_xlabel('Date')
    axes[1, 1].set_ylabel('Residual')
    axes[1, 1].grid(True, alpha=0.3)

    plt.tight_layout()
    plt.show()

    return model, scaler, history

# Create LSTM model
lstm_model, lstm_scaler, lstm_history = create_lstm_model(X, y_reg)
1.7 Unsupervised Learning in Finance
Unsupervised learning techniques help discover hidden patterns in financial data without labeled targets.
1.7.1 1. Portfolio Clustering
def portfolio_clustering_analysis():
    """
    Perform clustering analysis on a portfolio of stocks
    """
    from sklearn.cluster import KMeans
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    # Fetch data for multiple stocks
    tickers = ['AAPL', 'GOOGL', 'MSFT', 'TSLA', 'NVDA', 'AMZN', 'META', 'NFLX', 'JPM', 'GS']

    print("Fetching portfolio data for clustering analysis...")
    portfolio_data = yf.download(tickers, start='2020-01-01', end='2024-01-01')['Adj Close']

    # Calculate returns
    returns = portfolio_data.pct_change().dropna()

    # Calculate features for clustering
    features = pd.DataFrame(index=tickers)
    features['Mean_Return'] = returns.mean() * 252            # Annualized
    features['Volatility'] = returns.std() * np.sqrt(252)     # Annualized
    features['Sharpe_Ratio'] = features['Mean_Return'] / features['Volatility']
    features['Skewness'] = returns.skew()
    features['Kurtosis'] = returns.kurtosis()
    features['Max_Drawdown'] = returns.apply(
        lambda x: ((1 + x).cumprod() / (1 + x).cumprod().expanding().max() - 1).min()
    )

    print("Portfolio Features for Clustering:")
    print(features.round(4))

    # Standardize features
    scaler = StandardScaler()
    features_scaled = scaler.fit_transform(features)

    # Determine optimal number of clusters
    inertias = []
    K_range = range(2, 8)

    for k in K_range:
        kmeans = KMeans(n_clusters=k, random_state=42)
        kmeans.fit(features_scaled)
        inertias.append(kmeans.inertia_)

    # Perform clustering with optimal k
    optimal_k = 3  # Based on elbow method
    kmeans = KMeans(n_clusters=optimal_k, random_state=42)
    cluster_labels = kmeans.fit_predict(features_scaled)

    # Add cluster labels to features
    features['Cluster'] = cluster_labels

    # PCA for visualization
    pca = PCA(n_components=2)
    features_pca = pca.fit_transform(features_scaled)

    # Visualization
    fig, axes = plt.subplots(2, 2, figsize=(15, 12))

    # Elbow method
    axes[0, 0].plot(K_range, inertias, 'bo-')
    axes[0, 0].set_title('Elbow Method for Optimal k')
    axes[0, 0].set_xlabel('Number of Clusters')
    axes[0, 0].set_ylabel('Inertia')
    axes[0, 0].grid(True, alpha=0.3)

    # PCA visualization
    colors = ['red', 'blue', 'green', 'purple', 'orange']
    for i in range(optimal_k):
        mask = cluster_labels == i
        axes[0, 1].scatter(features_pca[mask, 0], features_pca[mask, 1],
                           c=colors[i], label=f'Cluster {i}', alpha=0.7, s=100)

    # Add stock labels
    for i, ticker in enumerate(tickers):
        axes[0, 1].annotate(ticker, (features_pca[i, 0], features_pca[i, 1]),
                            xytext=(5, 5), textcoords='offset points', fontsize=8)

    axes[0, 1].set_title('Stock Clustering (PCA Visualization)')
    axes[0, 1].set_xlabel(f'PC1 ({pca.explained_variance_ratio_[0]:.2%} variance)')
    axes[0, 1].set_ylabel(f'PC2 ({pca.explained_variance_ratio_[1]:.2%} variance)')
    axes[0, 1].legend()
    axes[0, 1].grid(True, alpha=0.3)

    # Risk-Return scatter
    for i in range(optimal_k):
        mask = features['Cluster'] == i
        cluster_data = features[mask]
        axes[1, 0].scatter(cluster_data['Volatility'], cluster_data['Mean_Return'],
                           c=colors[i], label=f'Cluster {i}', alpha=0.7, s=100)

    # Add stock labels
    for ticker in tickers:
        row = features.loc[ticker]
        axes[1, 0].annotate(ticker, (row['Volatility'], row['Mean_Return']),
                            xytext=(5, 5), textcoords='offset points', fontsize=8)

    axes[1, 0].set_title('Risk-Return Clustering')
    axes[1, 0].set_xlabel('Volatility (Risk)')
    axes[1, 0].set_ylabel('Mean Return')
    axes[1, 0].legend()
    axes[1, 0].grid(True, alpha=0.3)

    # Cluster characteristics
    cluster_summary = features.groupby('Cluster').mean()
    cluster_summary.plot(kind='bar', ax=axes[1, 1], alpha=0.7)
    axes[1, 1].set_title('Cluster Characteristics')
    axes[1, 1].set_xlabel('Cluster')
    axes[1, 1].set_ylabel('Mean Value')
    axes[1, 1].legend(bbox_to_anchor=(1.05, 1), loc='upper left')
    axes[1, 1].grid(True, alpha=0.3)

    plt.tight_layout()
    plt.show()

    # Print cluster analysis
    print("\nCluster Analysis:")
    print("=" * 50)
    for i in range(optimal_k):
        cluster_stocks = features[features['Cluster'] == i].index.tolist()
        print(f"Cluster {i}: {', '.join(cluster_stocks)}")
        print(f"  Characteristics: {cluster_summary.loc[i].round(4).to_dict()}")
        print()

    return features, cluster_labels, kmeans

# Perform clustering analysis
cluster_features, cluster_labels, kmeans_model = portfolio_clustering_analysis()
1.8 Reinforcement Learning for Trading
Reinforcement Learning (RL) is particularly well-suited for financial decision making as it can learn optimal trading strategies through trial and error.
1.8.1 Simple Q-Learning Trading Agent
def simple_trading_rl_example():
    """
    Implement a simple Q-learning trading agent
    """
    # Generate synthetic price data for demonstration
    np.random.seed(42)
    n_days = 1000
    price_data = pd.DataFrame({
        'price': 100 * np.exp(np.cumsum(np.random.normal(0.0005, 0.02, n_days)))
    })

    # Calculate returns and features
    price_data['return'] = price_data['price'].pct_change()
    price_data['sma_5'] = price_data['price'].rolling(5).mean()
    price_data['sma_20'] = price_data['price'].rolling(20).mean()
    price_data['signal'] = np.where(price_data['sma_5'] > price_data['sma_20'], 1, -1)

    # Clean data
    price_data = price_data.dropna()

    print("Simple Q-Learning Trading Agent")
    print("=" * 40)
    print(f"Data shape: {price_data.shape}")

    # Define states based on recent returns
    def get_state(returns, window=5):
        """Convert recent returns to discrete state"""
        recent_returns = returns[-window:]
        avg_return = np.mean(recent_returns)

        if avg_return > 0.01:
            return 2    # Strong uptrend
        elif avg_return > 0:
            return 1    # Weak uptrend
        elif avg_return > -0.01:
            return 0    # Sideways
        else:
            return -1   # Downtrend

    # Q-learning parameters
    n_states = 4            # -1, 0, 1, 2
    n_actions = 3           # 0: hold, 1: buy, 2: sell
    learning_rate = 0.1
    discount_factor = 0.95
    epsilon = 0.1           # exploration rate

    # Initialize Q-table
    q_table = np.zeros((n_states + 2, n_actions))  # +2 to handle shifted indices

    # Trading simulation
    position = 0            # -1: short, 0: neutral, 1: long
    portfolio_value = 10000
    cash = portfolio_value
    shares_bought = 0
    trades = []
    portfolio_values = [portfolio_value]

    for i in range(20, len(price_data) - 1):
        # Get current state
        returns_window = price_data['return'].iloc[i-19:i+1].values
        state = get_state(returns_window) + 1  # Shift to make indices positive

        # Choose action (epsilon-greedy)
        if np.random.random() < epsilon:
            action = np.random.randint(n_actions)  # Explore
        else:
            action = np.argmax(q_table[state])     # Exploit

        # Execute action
        current_price = price_data['price'].iloc[i]
        next_price = price_data['price'].iloc[i + 1]

        if action == 1 and position <= 0:  # Buy
            if position == -1:  # Cover short
                cash += (current_price - next_price) * 100  # Profit from short
            position = 1
            shares_bought = cash // current_price
            cash -= shares_bought * current_price
            trades.append(('BUY', current_price, shares_bought))

        elif action == 2 and position >= 0:  # Sell
            if position == 1:  # Close long
                cash += shares_bought * current_price
            position = -1
            trades.append(('SELL', current_price, 100))

        # Calculate reward
        if position == 1:      # Long position
            reward = (next_price - current_price) / current_price
        elif position == -1:   # Short position
            reward = (current_price - next_price) / current_price
        else:                  # No position
            reward = 0

        # Update Q-table
        next_returns = price_data['return'].iloc[i-18:i+2].values
        next_state = get_state(next_returns) + 1

        q_table[state, action] += learning_rate * (
            reward + discount_factor * np.max(q_table[next_state]) - q_table[state, action]
        )

        # Update portfolio value
        if position == 1:
            portfolio_value = cash + shares_bought * current_price
        else:
            portfolio_value = cash

        portfolio_values.append(portfolio_value)

    # Calculate performance
    total_return = (portfolio_values[-1] - portfolio_values[0]) / portfolio_values[0]
    buy_hold_return = (price_data['price'].iloc[-1] - price_data['price'].iloc[20]) / price_data['price'].iloc[20]

    print("RL Trading Results:")
    print(f"Total Return: {total_return:.4f}")
    print(f"Buy & Hold Return: {buy_hold_return:.4f}")
    print(f"Number of trades: {len(trades)}")

    # Visualization
    fig, axes = plt.subplots(2, 2, figsize=(15, 12))

    # Price and portfolio value
    dates = range(len(portfolio_values))
    price_dates = range(20, 20 + len(portfolio_values))

    axes[0, 0].plot(price_dates, price_data['price'].iloc[20:20+len(portfolio_values)],
                    label='Stock Price', alpha=0.7)
    ax_twin = axes[0, 0].twinx()
    ax_twin.plot(dates, portfolio_values, 'r-', label='Portfolio Value', alpha=0.7)
    axes[0, 0].set_title('Stock Price vs Portfolio Performance')
    axes[0, 0].set_xlabel('Time')
    axes[0, 0].set_ylabel('Stock Price', color='blue')
    ax_twin.set_ylabel('Portfolio Value', color='red')
    axes[0, 0].grid(True, alpha=0.3)

    # Cumulative returns comparison
    rl_returns = np.array(portfolio_values) / portfolio_values[0]
    bh_returns = price_data['price'].iloc[20:20+len(portfolio_values)] / price_data['price'].iloc[20]

    axes[0, 1].plot(dates, rl_returns, label='RL Strategy', linewidth=2)
    axes[0, 1].plot(dates, bh_returns, label='Buy & Hold', linewidth=2)
    axes[0, 1].set_title('Cumulative Returns Comparison')
    axes[0, 1].set_xlabel('Time')
    axes[0, 1].set_ylabel('Cumulative Return')
    axes[0, 1].legend()
    axes[0, 1].grid(True, alpha=0.3)

    # Q-table heatmap
    im = axes[1, 0].imshow(q_table, cmap='coolwarm', aspect='auto')
    axes[1, 0].set_title('Q-Table Heatmap')
    axes[1, 0].set_xlabel('Actions (0:Hold, 1:Buy, 2:Sell)')
    axes[1, 0].set_ylabel('States')
    plt.colorbar(im, ax=axes[1, 0])

    # Trade distribution
    if trades:
        trade_types = [trade[0] for trade in trades]
        trade_counts = pd.Series(trade_types).value_counts()
        axes[1, 1].bar(trade_counts.index, trade_counts.values, alpha=0.7)
        axes[1, 1].set_title('Trade Distribution')
        axes[1, 1].set_ylabel('Number of Trades')
        axes[1, 1].grid(True, alpha=0.3)
    else:
        axes[1, 1].text(0.5, 0.5, 'No trades executed', ha='center', va='center',
                        transform=axes[1, 1].transAxes)

    plt.tight_layout()
    plt.show()

    return q_table, portfolio_values, trades

# Run RL trading example
q_table, portfolio_values, trades = simple_trading_rl_example()
1.9 Model Evaluation and Validation
Proper model evaluation is crucial in financial ML to avoid overfitting and ensure robust performance.
1.9.1 Cross-Validation for Financial Data
def financial_cross_validation(X, y, model, n_splits=5):
    """
    Perform time series cross-validation for financial models
    """
    from sklearn.model_selection import TimeSeriesSplit
    from sklearn.metrics import mean_squared_error, r2_score

    tscv = TimeSeriesSplit(n_splits=n_splits)

    cv_scores = []
    cv_r2_scores = []

    print("Time Series Cross-Validation Results:")
    print("=" * 50)

    for fold, (train_idx, val_idx) in enumerate(tscv.split(X)):
        # Split data
        X_train_fold = X.iloc[train_idx]
        X_val_fold = X.iloc[val_idx]
        y_train_fold = y.iloc[train_idx]
        y_val_fold = y.iloc[val_idx]

        # Fit model
        model.fit(X_train_fold, y_train_fold)

        # Predict
        y_pred_fold = model.predict(X_val_fold)

        # Calculate metrics
        rmse = np.sqrt(mean_squared_error(y_val_fold, y_pred_fold))
        r2 = r2_score(y_val_fold, y_pred_fold)

        cv_scores.append(rmse)
        cv_r2_scores.append(r2)

        print(f"Fold {fold + 1}: RMSE = {rmse:.4f}, R² = {r2:.4f}")

    print("\nCross-Validation Summary:")
    print(f"Mean RMSE: {np.mean(cv_scores):.4f} ± {np.std(cv_scores):.4f}")
    print(f"Mean R²: {np.mean(cv_r2_scores):.4f} ± {np.std(cv_r2_scores):.4f}")

    return cv_scores, cv_r2_scores

# Example: Cross-validate Random Forest model
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
cv_rmse, cv_r2 = financial_cross_validation(X, y_reg, rf_model)
1.10 Feature Engineering for Financial ML
def advanced_feature_engineering(data, ticker='AAPL'):
    """
    Create advanced features for financial machine learning
    """
    df = data.copy()

    # Price-based features
    df['Price_MA_Ratio_5'] = df['Close'] / df['Close'].rolling(5).mean()
    df['Price_MA_Ratio_20'] = df['Close'] / df['Close'].rolling(20).mean()
    df['Price_MA_Ratio_50'] = df['Close'] / df['Close'].rolling(50).mean()

    # Volatility features
    df['Volatility_5'] = df['Close'].pct_change().rolling(5).std()
    df['Volatility_20'] = df['Close'].pct_change().rolling(20).std()
    df['Volatility_Ratio'] = df['Volatility_5'] / df['Volatility_20']

    # Volume features
    df['Volume_MA'] = df['Volume'].rolling(20).mean()
    df['Volume_Ratio'] = df['Volume'] / df['Volume_MA']
    df['Price_Volume_Trend'] = (df['Close'] - df['Close'].shift(1)) * df['Volume']

    # Momentum features
    df['Momentum_5'] = df['Close'] / df['Close'].shift(5) - 1
    df['Momentum_10'] = df['Close'] / df['Close'].shift(10) - 1
    df['Momentum_20'] = df['Close'] / df['Close'].shift(20) - 1

    # Bollinger Bands
    bb_period = 20
    bb_std = 2
    df['BB_Middle'] = df['Close'].rolling(bb_period).mean()
    bb_std_dev = df['Close'].rolling(bb_period).std()
    df['BB_Upper'] = df['BB_Middle'] + (bb_std_dev * bb_std)
    df['BB_Lower'] = df['BB_Middle'] - (bb_std_dev * bb_std)
    df['BB_Width'] = (df['BB_Upper'] - df['BB_Lower']) / df['BB_Middle']
    df['BB_Position'] = (df['Close'] - df['BB_Lower']) / (df['BB_Upper'] - df['BB_Lower'])

    # Support and Resistance levels
    df['High_20'] = df['High'].rolling(20).max()
    df['Low_20'] = df['Low'].rolling(20).min()
    df['Resistance_Distance'] = (df['High_20'] - df['Close']) / df['Close']
    df['Support_Distance'] = (df['Close'] - df['Low_20']) / df['Close']

    # Gap features
    df['Gap'] = (df['Open'] - df['Close'].shift(1)) / df['Close'].shift(1)
    df['Gap_Filled'] = np.where(
        (df['Gap'] > 0) & (df['Low'] <= df['Close'].shift(1)), 1,
        np.where((df['Gap'] < 0) & (df['High'] >= df['Close'].shift(1)), 1, 0)
    )

    # Seasonal features
    df['Day_of_Week'] = df.index.dayofweek
    df['Month'] = df.index.month
    df['Quarter'] = df.index.quarter

    # Lagged features
    for lag in [1, 2, 3, 5, 10]:
        df[f'Return_Lag_{lag}'] = df['Close'].pct_change().shift(lag)
        df[f'Volume_Lag_{lag}'] = df['Volume'].shift(lag)

    print("Advanced Feature Engineering Complete:")
    print(f"Original features: {len(data.columns)}")
    print(f"New features: {len(df.columns)}")
    print(f"Added features: {len(df.columns) - len(data.columns)}")

    return df

# Apply advanced feature engineering
if 'stock_data' in locals():
    enhanced_data = advanced_feature_engineering(stock_data)
    print("\nSample of new features:")
    feature_cols = [col for col in enhanced_data.columns if col not in stock_data.columns]
    print(enhanced_data[feature_cols].head())
1.11 Practical Exercises
1.11.1 Exercise 1: Complete ML Pipeline
def ml_pipeline_exercise():
    """
    Complete machine learning pipeline exercise for students

    Tasks:
    1. Data preparation and feature engineering
    2. Model comparison
    3. Hyperparameter tuning
    4. Performance evaluation
    5. Feature importance analysis
    """
    print("Machine Learning Pipeline Exercise")
    print("=" * 50)

    # Step 1: Prepare data
    ticker = 'MSFT'
    data = yf.download(ticker, start='2020-01-01', end='2024-01-01')

    # Basic feature engineering
    data['Returns'] = data['Close'].pct_change()
    data['SMA_10'] = data['Close'].rolling(10).mean()
    data['SMA_30'] = data['Close'].rolling(30).mean()
    data['RSI'] = calculate_rsi(data['Close'])
    data['Volatility'] = data['Returns'].rolling(20).std()

    # Target: next day's return direction
    data['Target'] = (data['Close'].shift(-1) > data['Close']).astype(int)

    # Features
    feature_cols = ['Open', 'High', 'Low', 'Volume', 'SMA_10', 'SMA_30', 'RSI', 'Volatility']
    X = data[feature_cols].dropna()
    y = data['Target'].loc[X.index]

    # Step 2: Train-test split
    split_date = '2023-01-01'
    train_mask = X.index < split_date

    X_train, X_test = X[train_mask], X[~train_mask]
    y_train, y_test = y[train_mask], y[~train_mask]

    print(f"Training samples: {len(X_train)}")
    print(f"Testing samples: {len(X_test)}")

    # Step 3: Model comparison
    models = {
        'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
        'XGBoost': xgb.XGBClassifier(n_estimators=100, random_state=42),
        'Logistic Regression': LogisticRegression(random_state=42)
    }

    results = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        accuracy = accuracy_score(y_test, y_pred)
        results[name] = accuracy
        print(f"{name} Accuracy: {accuracy:.4f}")

    # Step 4: Best model analysis
    best_model_name = max(results, key=results.get)
    best_model = models[best_model_name]

    print(f"\nBest Model: {best_model_name}")

    # Feature importance
    if hasattr(best_model, 'feature_importances_'):
        importance_df = pd.DataFrame({
            'Feature': feature_cols,
            'Importance': best_model.feature_importances_
        }).sort_values('Importance', ascending=False)

        print("\nFeature Importance:")
        print(importance_df)

        # Visualization
        plt.figure(figsize=(10, 6))
        plt.barh(importance_df['Feature'], importance_df['Importance'])
        plt.title(f'Feature Importance - {best_model_name}')
        plt.xlabel('Importance')
        plt.tight_layout()
        plt.show()

    return X_train, X_test, y_train, y_test, best_model

def calculate_rsi(prices, period=14):
    """Calculate RSI indicator"""
    delta = prices.diff()
    gain = (delta.where(delta > 0, 0)).rolling(window=period).mean()
    loss = (-delta.where(delta < 0, 0)).rolling(window=period).mean()
    rs = gain / loss
    rsi = 100 - (100 / (1 + rs))
    return rsi

# Run the exercise
X_train_ex, X_test_ex, y_train_ex, y_test_ex, best_model_ex = ml_pipeline_exercise()
1.12 Summary and Best Practices
This chapter has covered comprehensive machine learning applications in finance:
1.12.1 Key Techniques Covered:
- Supervised Learning: Regression and classification for price prediction
- Deep Learning: LSTM networks for time series modeling
- Unsupervised Learning: Clustering for portfolio analysis
- Reinforcement Learning: Q-learning for trading strategies
1.12.2 Best Practices for Financial ML:
- Time Series Awareness: Use proper train/validation splits
- Feature Engineering: Create domain-specific financial features
- Model Validation: Implement robust cross-validation
- Overfitting Prevention: Use regularization and out-of-sample testing
- Risk Management: Consider transaction costs and market impact
- Interpretability: Understand model decisions for regulatory compliance (see the sketch after this list)
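As one concrete interpretability tool, permutation importance measures how much shuffling each feature degrades model performance. A minimal sketch, assuming the rf_model, X, and y_reg objects defined earlier in this chapter are still in scope:

from sklearn.inspection import permutation_importance

# Refit, then repeatedly shuffle each feature and measure the score drop
rf_model.fit(X, y_reg)
perm = permutation_importance(rf_model, X, y_reg, n_repeats=10, random_state=42)
for idx in perm.importances_mean.argsort()[::-1][:5]:
    print(f"{X.columns[idx]}: {perm.importances_mean[idx]:.4f} "
          f"± {perm.importances_std[idx]:.4f}")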
1.12.3 Python Libraries for Financial ML:
- scikit-learn: Traditional ML algorithms
- XGBoost/LightGBM: Gradient boosting models
- TensorFlow/Keras: Deep learning
- pandas/numpy: Data manipulation
- yfinance: Financial data acquisition
This foundation provides the essential skills for applying machine learning to real-world financial problems while maintaining awareness of the unique challenges in financial data science.