Python for Finance: From Correlation to Causation

Author

Professor Barry Quinn

Welcome to the intersection of Python programming, financial analysis, and causal reasoning. This chapter introduces you to a revolutionary approach to financial data science that goes beyond traditional correlation-based analysis to understand true cause-and-effect relationships in financial markets. As final year finance students, you’ll learn to combine the technical power of Python with the analytical rigor of causal inference - skills that are increasingly vital in today’s AI-driven financial industry.

1 Why This Combination Matters

The financial industry is experiencing a paradigm shift. Traditional approaches that rely solely on correlation and statistical associations are being challenged by more sophisticated methods that can distinguish between mere statistical relationships and true causal effects. This distinction is crucial for:

  • Investment Decision Making: Understanding what actually drives returns vs. what’s merely correlated
  • Risk Management: Identifying true risk factors rather than spurious correlations
  • Regulatory Compliance: Meeting increasing demands for explainable AI in finance
  • Competitive Advantage: Developing insights that go beyond what traditional methods can provide

The Correlation vs. Causation Challenge

Consider this example: Ice cream sales and drowning incidents are highly correlated. Does this mean ice cream causes drowning? Of course not - both are caused by hot weather and summer activities. In finance, similar spurious correlations abound, and distinguishing them from true causal relationships is essential for sound decision-making.
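
To see this in code, here is a minimal simulation sketch (with made-up coefficients) in which hot weather drives both ice cream sales and drowning incidents; the two series end up clearly correlated even though neither causes the other:

# Minimal sketch of a spurious correlation driven by a confounder (hypothetical numbers)
import numpy as np

np.random.seed(0)
n_days = 1000
temperature = np.random.normal(20, 8, n_days)                        # confounder: daily temperature
ice_cream_sales = 50 + 3.0 * temperature + np.random.normal(0, 10, n_days)
drownings = 2 + 0.1 * temperature + np.random.normal(0, 1, n_days)

# The unconditional correlation is clearly positive, despite no causal link
print(f"Correlation: {np.corrcoef(ice_cream_sales, drownings)[0, 1]:.2f}")

# Holding the confounder (roughly) fixed removes most of the association
mild_days = (temperature > 24) & (temperature < 28)
print(f"Correlation within a narrow temperature band: "
      f"{np.corrcoef(ice_cream_sales[mild_days], drownings[mild_days])[0, 1]:.2f}")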

2 Learning Approach: Integration of Two Worlds

This course uniquely integrates materials from two cutting-edge textbooks:

2.1 Technical Foundation: “Python for Finance” by Yves Hilpisch

  • Master Python programming for financial applications
  • Learn industry-standard libraries (pandas, NumPy, scikit-learn)
  • Implement production-ready financial systems
  • Access real trading platforms and market data

2.2 Analytical Rigor: “Causal AI” by Robert Osazuwa Ness

  • Understand causal reasoning and inference
  • Learn to build and test causal models
  • Apply modern AI with causal awareness
  • Distinguish correlation from causation in financial contexts

3 Getting Started: Python Environment Setup

Before diving into financial analysis, let’s set up your Python environment with the libraries from both textbooks:

# Core Python libraries for finance (from Hilpisch)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import yfinance as yf

# Performance and advanced computing
import numba
from numba import jit

# Machine learning libraries
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Causal inference libraries (from Ness)
import dowhy
from dowhy import CausalModel
import pgmpy
from pgmpy.models import BayesianNetwork
from pgmpy.factors.discrete import TabularCPD

# Advanced causal libraries (install separately if needed)
# pip install git+https://github.com/y0-causal-inference/y0.git@v0.2.0

# Suppress warnings for cleaner notebook output
import warnings
warnings.filterwarnings('ignore')

print("Environment ready for Python Finance + Causal AI!")

4 Practical Example: Traditional vs. Causal Analysis

Let’s demonstrate the difference between traditional correlation-based analysis and causal reasoning using a financial example:

4.1 Traditional Approach: Correlation Analysis

# Traditional correlation analysis
# Download stock data
tickers = ['AAPL', 'MSFT', 'SPY', '^VIX']  # '^VIX' is the Yahoo Finance symbol for the volatility index
data = yf.download(tickers, start='2020-01-01', end='2024-01-01',
                   auto_adjust=False)['Adj Close']

# Calculate returns
returns = data.pct_change().dropna()

# Traditional correlation matrix
correlation_matrix = returns.corr()
print("Traditional Correlation Matrix:")
print(correlation_matrix)

# Visualization
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0)
plt.title('Traditional Correlation Analysis')
plt.show()

4.2 Enhanced Approach: Causal Reasoning

# Causal approach: Building a causal model
# This goes beyond correlation to understand cause-effect relationships

# Example: Does VIX (volatility) cause stock price movements, 
# or do stock movements cause VIX changes?

# Step 1: Define causal graph (DAG)
causal_graph = """
digraph {
    "Market_Sentiment" -> "VIX";
    "Market_Sentiment" -> "Stock_Returns";
    "Economic_News" -> "Market_Sentiment";
    "VIX" -> "Stock_Returns";
}
"""

# Step 2: Create dataset for causal analysis
causal_data = pd.DataFrame({
    'VIX': returns['^VIX'],  # column name matches the '^VIX' ticker downloaded above
    'Stock_Returns': returns['AAPL'],
    'Market_Sentiment': np.random.normal(0, 1, len(returns)),  # Simulated
    'Economic_News': np.random.normal(0, 1, len(returns))      # Simulated
}).dropna()

# Step 3: Build causal model using DoWhy
model = CausalModel(
    data=causal_data,
    treatment='VIX',
    outcome='Stock_Returns',
    graph=causal_graph
)

# Step 4: Identify causal effect
identified_estimand = model.identify_effect()
print("Causal Identification:")
print(identified_estimand)

# Step 5: Estimate causal effect
causal_estimate = model.estimate_effect(identified_estimand,
                                       method_name="backdoor.linear_regression")
print(f"Causal Effect: {causal_estimate.value}")
print(f"Traditional Correlation: {causal_data['VIX'].corr(causal_data['Stock_Returns'])}")

Key Insight: Correlation ≠ Causation

The correlation coefficient tells us about statistical association, while the causal effect tells us about the actual impact of changing one variable on another. In finance, this distinction is crucial for:

  • Portfolio Construction: Understanding which factors actually drive returns
  • Risk Management: Identifying true risk sources vs. correlated indicators
  • Policy Analysis: Predicting the effect of interventions (e.g., interest rate changes)

5 Course Resources and GitHub Integration

Throughout this course, you’ll have access to professional-grade resources:

5.1 From “Python for Finance”

  • Quant Platform: py4fi.pqp.io - Free access to all notebooks
  • Real Trading APIs: FXCM integration for live market data
  • Performance Computing: Numba and Cython for high-speed calculations

5.2 From “Causal AI”

  • Course Repository: github.com/altdeep/causalML - notebooks and code examples accompanying the causal modeling material (cloned in the commands below)

5.3 Installation Commands

# Install Python for Finance libraries
pip install pandas numpy matplotlib seaborn yfinance numba cython

# Install Causal AI libraries  
pip install dowhy pgmpy pyro-ppl
pip install git+https://github.com/y0-causal-inference/y0.git@v0.2.0

# Clone course resources
git clone https://github.com/altdeep/causalML.git

6 Statistical Modelling as an Iterative Process

Statisticians, like artists, have the bad habit of falling in love with their models.

George Box emphasized the importance of viewing statistical modeling as an iterative process, where models are continually improved, scrutinized, and reassessed against new data to reach increasingly reliable inferences and decisions. This chapter delves into the iterative nature of statistics, inspired by George Box’s visionary perspective, and its relevance to financial modeling and decision-making.

At the heart of Box’s philosophy lies the acknowledgment that any statistical model is an approximation of reality. Due to measurement errors, sampling biases, misspecifications, or mere random fluctuations, even seemingly adequate models can fail. Accepting this imperfection calls for humility and constant vigilance, pushing statisticians to question their models and strive for improvement.

Box envisioned statistical modeling as an ongoing cycle, composed of consecutive stages of speculation, exploration, verification, and modification. During each iteration, new findings inspire adjusted mental models, eventually translating into altered analyses.

Figure 1: Iterative Statistical Modeling: Induction, Deduction, and Model Refinement

Figure 1 illustrates an iterative process in statistical modeling, particularly in the context of financial analysis. Here’s how we can relate it to George Box’s ideas:

  1. Data Collection and Signal:
    • At the top right, we have a cloud labeled “True State of Financial World.” This represents the underlying reality we aim to understand.
    • The blue arrow labeled “Signal” connects this reality to a rectangle labeled “Data Signal + Noise.” The data we collect contains both useful information (signal) and irrelevant noise.
  2. Inductive Reasoning (Model Creation):
    • Observation and Pattern Recognition:
      • We engage in inductive reasoning by observing the data. We look for patterns, regularities, and relationships.
    • Preliminary Theory (Model M1):
      • Based on observed patterns, we formulate a preliminary theory or model (let’s call it M1).
      • M1 captures the relationships between variables, aiming to explain the observed data.
  3. Deductive Reasoning (Model Testing):
    • Temporary Pretense:
      • Assume that M1 is true (even though it may not be perfect).
    • Exact Estimation Calculations:
      • Apply M1 to analyze the data, make predictions, and estimate outcomes.
    • Selective Worry:
      • Be critical about the limitations of M1. Where does it fall short?
    • Consequence of M1:
      • Predictions made by M1 are compared with the actual outcomes (consequences).
      • Discrepancies between predictions and reality highlight areas for improvement.
  4. Model Refinement and Iteration:
    • If there are discrepancies:
      • Adjust or refine M1 based on empirical evidence.
      • Create an updated model, which we’ll call M2.
    • The arrow labeled “Analysis with M1 (M1*, M1**, …?)” indicates that multiple iterations or versions of M1 are analyzed.
    • The process continues iteratively, improving the model with each cycle.
  5. Flexibility and Parsimony:
    • Flexibility:
      • Rapid progress requires flexibility to adapt to new information and confrontations between theory and practice.
    • Parsimonious Models:
      • Effective models are both simple and powerful. Focus on what matters most.

Insights from Academic Sources:
  1. Bayesian Visualization and Workflow:
    • The article “Visualization in Bayesian Workflow” emphasizes that Bayesian data analysis involves more than just computing a posterior distribution.
    • Visualization plays a crucial role throughout the entire statistical workflow, including model building, inference, model checking, evaluation, and expansion.
    • Modern, high-dimensional models used by applied researchers benefit significantly from effective visualization tools.
  2. Andrew Gelman’s Perspective:
    • Andrew Gelman, a renowned statistician, emphasizes the importance of iterative modeling.
    • His work advocates for continuous refinement of models based on empirical evidence.
    • Gelman’s approach aligns with George Box’s famous dictum that all models are wrong, but some are useful. We should embrace imperfection and keep iterating.

6.1 Implications for Financial Modeling and Decision-Making

Financial markets are inherently complex, dictated by intricate relationships and driven by manifold forces. Capturing this complexity requires an iterative approach, where models are consistently tested against emerging data and evolving circumstances.

Emphasizing the iterative aspect of financial modeling brings about several benefits:

  1. Improved responsiveness: Models can quickly adapt to changing market conditions
  2. Reduced hubris: Acknowledging model limitations prevents overconfidence
  3. More effective communication: Clear understanding of model assumptions and limitations

6.2 Practical Strategies for Implementing Iterative Approaches

Implementing an iterative strategy in financial modeling calls for conscious efforts to instill a culture of continuous improvement. The following practices can help embed iterative thinking into organizational norms:

  1. Cross-functional collaboration: Involve domain experts, data scientists, and business stakeholders
  2. Open feedback mechanisms: Create channels for model critique and improvement suggestions
  3. Periodic audits: Regular review of model performance and assumptions
  4. Version control: Track model changes and maintain historical versions
  5. Empowerment of junior staff: Encourage questioning and alternative approaches

George Box’s vision of statistics as an iterative process carries far-reaching implications for financial modeling and decision-making. His doctrine urges practitioners to abandon complacent acceptance of mediocre models in favor of persistent self-evaluation, reflection, and revision. Organizations that embrace this discipline are better placed to sustain success and to adapt when market conditions turn against their models.

7 The Importance of Probability Theory in Statistics

Probability theory is the mathematical foundation of statistics, providing the framework for quantifying uncertainty and making inferences from data. In the context of financial analytics, probability theory is indispensable for several reasons.

First, probability theory enables the formulation of statistical models that can describe and predict complex financial phenomena. These models allow analysts to make sense of seemingly random market movements and identify underlying patterns that can inform investment decisions.

Second, probability theory provides the tools for hypothesis testing and statistical inference. In financial research, this means being able to test theories about market behavior, evaluate the significance of observed patterns, and make data-driven conclusions about investment strategies.

Furthermore, probability theory is vital in the assessment of risk and uncertainty. In fields such as finance, insurance, and economics, the ability to quantify risk using probabilistic models is crucial for making informed decisions. This includes evaluating the likelihood of financial losses, determining insurance premiums, and forecasting market trends under uncertainty.

In addition, probability theory lays the groundwork for advanced statistical techniques such as Bayesian inference, which incorporates prior knowledge into the statistical analysis, and stochastic modeling, used extensively in areas like financial modeling and risk assessment.

The role of probability in statistics is not just theoretical; it has practical implications in everyday data analysis. Whether it is estimating the probability that a stock’s return exceeds a certain threshold or assessing the risk of a new investment, probability theory is the tool that converts raw data into actionable insights.

As we delve deeper into this chapter, we will explore the fundamental principles of probability theory, its applications in various statistical methods, and its crucial role in making sense of uncertainty and variability in data. By gaining a solid understanding of probability theory, readers will be well-equipped to tackle complex data analysis tasks with confidence and precision.

8 Basic Principles and Tools of Probability Theory

8.1 Sample Space and Events

A sample space \(\Omega\) is a set containing all conceivable outcomes of a random phenomenon. An event \(A\) is a subset of the sample space \(\Omega\); thus, \(A \subseteq \Omega\). The notation \(P(\cdot)\) indicates probability.

8.2 Union, Intersection, and Complement of Events

Given two events \(A\) and \(B\), the union operation \((A \cup B)\) corresponds to the set of outcomes contained in either \(A\) or \(B\) or both. The intersection operation \((A \cap B)\) is the set of outcomes that lie in both \(A\) and \(B\). The complement of an event \(A'\) refers to the set of outcomes in the sample space that are not in \(A\):

\[\Omega = A \cup A'\quad,\quad A \cap A' = \emptyset\]
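
As a quick illustration, the following sketch uses Python sets and a hypothetical five-outcome sample space for a day's market move to compute unions, intersections, and complements:

# Sample space and events as Python sets (hypothetical market outcomes)
omega = {'strong_up', 'mild_up', 'flat', 'mild_down', 'strong_down'}

A = {'strong_up', 'mild_up'}             # event A: the market rises
B = {'strong_up', 'strong_down'}         # event B: a large move in either direction

print("A union B:", A | B)               # outcomes in A or B (or both)
print("A intersect B:", A & B)           # outcomes in both A and B
print("Complement of A:", omega - A)     # outcomes not in A

# Check the identities above: Omega = A union A', and A intersect A' is empty
print((A | (omega - A)) == omega, (A & (omega - A)) == set())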

8.3 Conditional Probability

Conditional probability is the probability of an event \(A\) given that another event \(B\) occurs:

\[P(A \mid B) = \frac{P(A \cap B)}{P(B)} \qquad (\text{assuming}\;\; P(B)>0)\]

8.4 Multiplicative Property of Conditional Probability

For any two events \(A\) and \(B\), the joint probability satisfies the identity:

\[P(A \cap B) = P(A)\times P(B \mid A) = P(B) \times P(A \mid B)\]
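
A small numerical sketch (with hypothetical probabilities) shows the definition of conditional probability and the multiplicative identity above in action:

# Hypothetical probabilities for two market events
# A: the market rises tomorrow, B: trading volume is unusually high today
p_A = 0.55          # P(A)
p_B = 0.30          # P(B)
p_A_and_B = 0.21    # P(A and B), assumed

# Conditional probabilities from the definition
p_A_given_B = p_A_and_B / p_B
p_B_given_A = p_A_and_B / p_A
print(f"P(A|B) = {p_A_given_B:.3f}")
print(f"P(B|A) = {p_B_given_A:.3f}")

# Multiplicative property: both factorisations recover the joint probability
print(f"P(A)*P(B|A) = {p_A * p_B_given_A:.3f}")
print(f"P(B)*P(A|B) = {p_B * p_A_given_B:.3f}")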

8.5 Chain Rule for Conditional Probability

Given three events \(A\), \(B\), and \(C\), the chain rule decomposes the joint probability as follows:

\[P(A \cap B \cap C) = P(A) \times P(B \mid A) \times P(C \mid A \cap B)\]
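
The chain rule is how joint distributions are assembled from sequences of conditional probabilities, which is also how the Bayesian networks used later in the course are specified. A tiny sketch with hypothetical probabilities:

# Hypothetical sequential market events
# A: positive macro news, B: market opens higher, C: market closes higher
p_A = 0.40                 # P(A)
p_B_given_A = 0.70         # P(B | A)
p_C_given_A_and_B = 0.60   # P(C | A and B)

# Chain rule: P(A and B and C) = P(A) * P(B|A) * P(C|A and B)
p_joint = p_A * p_B_given_A * p_C_given_A_and_B
print(f"P(A and B and C) = {p_joint:.3f}")   # 0.168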

8.6 Bayes’ Formula

Bayes’ formula relates the conditional probabilities of two events, say \(A\) and \(B\), as follows:

\[P(A \mid B) = \frac{P(B \mid A) \times P(A)}{P(B)}\]
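
For example, the sketch below (with hypothetical probabilities) uses Bayes’ formula to update the probability of a large market drawdown after observing a warning signal such as a volatility spike:

# Hypothetical inputs
p_drawdown = 0.10                  # P(A): prior probability of a large drawdown
p_signal_given_drawdown = 0.80     # P(B|A): signal fires when a drawdown is coming
p_signal_given_no_drawdown = 0.20  # P(B|A'): false-positive rate

# Total probability: P(B) = P(B|A)P(A) + P(B|A')P(A')
p_signal = (p_signal_given_drawdown * p_drawdown
            + p_signal_given_no_drawdown * (1 - p_drawdown))

# Bayes' formula: P(A|B) = P(B|A)P(A) / P(B)
p_drawdown_given_signal = p_signal_given_drawdown * p_drawdown / p_signal
print(f"P(drawdown | signal) = {p_drawdown_given_signal:.3f}")   # approx. 0.308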

8.7 Independence of Events

Two events \(A\) and \(B\) are independent if and only if

\[P(A \cap B) = P(A) \times P(B)\]

Independent events satisfy the following equality:

\[P(A \mid B) = P(A) \qquad \text{and} \qquad P(B \mid A) = P(B)\]

8.8 Practical Example: Financial Events

# Example: Analyzing independence of market events
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Simulate market data for demonstration
np.random.seed(42)
n_days = 1000

# Event A: Market goes up
market_up = np.random.choice([0, 1], size=n_days, p=[0.45, 0.55])  # Slight upward bias

# Event B: High volume day
high_volume = np.random.choice([0, 1], size=n_days, p=[0.7, 0.3])

# Create contingency table
contingency_table = pd.crosstab(market_up, high_volume, margins=True)
print("Contingency Table: Market Direction vs Volume")
print("Rows: Market Up (0=Down, 1=Up), Columns: High Volume (0=Normal, 1=High)")
print(contingency_table)

# Test for independence
chi2, p_value, dof, expected = chi2_contingency(contingency_table.iloc[:-1, :-1])

print(f"\nChi-square test for independence:")
print(f"Chi-square statistic: {chi2:.4f}")
print(f"p-value: {p_value:.4f}")
print(f"Degrees of freedom: {dof}")

if p_value < 0.05:
    print("✓ Events are NOT independent (reject null hypothesis)")
else:
    print("✗ Events appear to be independent (fail to reject null hypothesis)")

# Calculate conditional probabilities
prob_up_given_high_vol = contingency_table.loc[1, 1] / contingency_table.loc['All', 1]
prob_up = contingency_table.loc[1, 'All'] / contingency_table.loc['All', 'All']

print(f"\nConditional Probability Analysis:")
print(f"P(Market Up | High Volume) = {prob_up_given_high_vol:.4f}")
print(f"P(Market Up) = {prob_up:.4f}")
print(f"Difference: {abs(prob_up_given_high_vol - prob_up):.4f}")

8.9 Random Variables

A random variable is a rule associating numerical values with outcomes in a sample space. There are two types of random variables: discrete and continuous.

8.9.1 Discrete Random Variables Example

# Example: Portfolio return outcomes
import matplotlib.pyplot as plt
from scipy.stats import binom

# Discrete random variable: Number of profitable trades out of 10
n_trades = 10
prob_profit = 0.6  # 60% chance each trade is profitable

# Calculate probability mass function
outcomes = range(0, n_trades + 1)
probabilities = [binom.pmf(k, n_trades, prob_profit) for k in outcomes]

# Visualization
plt.figure(figsize=(10, 6))
plt.bar(outcomes, probabilities, alpha=0.7, edgecolor='black')
plt.title('Probability Mass Function: Profitable Trades out of 10')
plt.xlabel('Number of Profitable Trades')
plt.ylabel('Probability')
plt.grid(True, alpha=0.3)

# Add expected value line
expected_value = n_trades * prob_profit
plt.axvline(expected_value, color='red', linestyle='--', linewidth=2, 
           label=f'Expected Value: {expected_value}')
plt.legend()
plt.show()

print(f"Expected number of profitable trades: {expected_value}")
print(f"Standard deviation: {np.sqrt(n_trades * prob_profit * (1 - prob_profit)):.2f}")

8.9.2 Continuous Random Variables Example

# Example: Stock return distribution
from scipy.stats import norm, lognorm

# Generate random stock returns (normal distribution)
np.random.seed(42)
returns = np.random.normal(0.001, 0.02, 1000)  # Daily returns

# Fit distributions
mu, sigma = norm.fit(returns)
print(f"Fitted Normal Distribution: μ = {mu:.6f}, σ = {sigma:.6f}")

# Probability density function
x = np.linspace(returns.min(), returns.max(), 100)
pdf_fitted = norm.pdf(x, mu, sigma)

# Visualization
plt.figure(figsize=(12, 8))

# Histogram of actual data
plt.hist(returns, bins=50, density=True, alpha=0.7, color='skyblue', 
         edgecolor='black', label='Actual Returns')

# Fitted PDF
plt.plot(x, pdf_fitted, 'r-', linewidth=2, label=f'Fitted Normal PDF')

plt.title('Probability Density Function: Daily Stock Returns')
plt.xlabel('Daily Return')
plt.ylabel('Density')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

# Calculate probabilities
prob_positive = 1 - norm.cdf(0, mu, sigma)
prob_large_loss = norm.cdf(-0.05, mu, sigma)  # Probability of losing more than 5%

print(f"Probability of positive return: {prob_positive:.4f}")
print(f"Probability of losing more than 5%: {prob_large_loss:.6f}")

9 Probability Schools of Thought

Probability theory offers a systematic approach to studying uncertain events and measuring uncertainty. Its foundational role in statistical analysis cannot be overstated, as it underpins the methods and techniques used to make sense of random phenomena and data. Understanding probability theory is essential not only for mastering statistical concepts but also for conducting robust and insightful data analysis in various fields.

Unlike many other branches of mathematics, probability theory is characterized by its lack of a single, unifying theory. This unique aspect stems from its historical development and the diverse applications it has found across different domains. Probability has evolved through contributions from mathematicians, philosophers, statisticians, and scientists, each bringing their perspective and influencing its theoretical foundations. As a result, probability theory encompasses a rich tapestry of approaches and interpretations.

There are two major schools of thought in probability theory: the frequentist and the Bayesian perspectives. The frequentist approach, which is the traditional form of probability, interprets probability as the long-run frequency of events occurring in repeated trials. It is grounded in the concept of an objective, empirical observation of frequencies. On the other hand, the Bayesian approach views probability as a measure of belief or certainty about the occurrence of an event, incorporating prior knowledge and subjective judgment into its framework.

9.1 Frequentism

Frequentism posits that probabilities correspond to the long-run frequencies of events in repeated trials. It concentrates on estimating the parameters of the probability distributions assumed to generate the data, rather than assigning probabilities to hypotheses themselves. Many commonly used statistical tests, such as t-tests and chi-square tests, stem from the frequentist perspective.

In financial time series econometrics, frequentism dominates academic publication and discourse. This approach, which emphasises the analysis and interpretation of data through frequency-based probability, is central in scholarly research within this field. Frequentist methods, which revolve around estimating parameters based on observed frequencies, such as mean or variance, are extensively applied and featured in academic literature.

Important

Frequentism takes a long-run frequency perspective, asserting that probabilities are the relative frequencies of events obtained through repeated observations. This perspective became widely accepted in the nineteenth century thanks to British polymath John Venn, and was later formalised by Austrian mathematician Richard von Mises, among others. Sir Ronald Fisher, a renowned geneticist and statistician, championed frequentism in the twentieth century, arguing that probability should deal solely with random variation in observations.

9.2 Bayesian Methods

Bayesian methods treat probabilities as degrees of belief concerning the truthfulness of propositions, conditioned on prior evidence. Bayesian inference combines prior knowledge with current evidence to update beliefs. This paradigm excels at capturing uncertainty in model parameters and accounts for complex interactions between variables.

In financial time series econometrics, Bayesian inference offers a different perspective from the frequentist mainstream, one that incorporates prior knowledge and beliefs into the analysis. It involves updating the probability of a hypothesis as more evidence or information becomes available. In the context of financial markets, Bayesian approaches are particularly valuable for their adaptability and their ability to handle uncertainty.

Note

Bayesian methods trace their roots to English cleric and mathematician Thomas Bayes, whose revolutionary work, “An Essay Towards Solving a Problem in the Doctrine of Chances” laid the groundwork for Bayesian inference. Bayesian methods were subsequently promoted by French scholar Pierre-Simon Laplace in the late eighteenth century and garnered renewed interest in the mid-twentieth century, largely owing to British statistician Harold Jeffreys and American statistician Leonard Savage.

9.4 Connection between Classical Probability and Bayesian Methods

  • Prior Distributions from Classical Principles: In Bayesian analysis, the choice of a prior distribution is crucial. Classical probability, with its focus on equally likely outcomes, can provide a natural starting point for these priors, especially in situations where little is known a priori (e.g., using a uniform distribution as a non-informative prior; see the sketch after this list).
  • Incorporating Symmetry and Equilibrium: Classical principles often embody symmetry and equilibrium concepts, which can be useful in formulating prior beliefs in a Bayesian context, particularly in financial markets where assumptions of equilibrium are common.
  • Educational Foundation: Classical probability often serves as an introductory framework for students and practitioners, creating a foundational understanding that can be built upon with Bayesian methods.
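
As a sketch of the first point, the example below (simulated data, assumed Beta-Binomial model) starts from a uniform Beta(1, 1) prior on the probability of an up day, the classical "equally likely" baseline, and updates it with observed data:

import numpy as np
from scipy import stats

# Simulated record of up (1) / down (0) trading days
np.random.seed(7)
up_days = np.random.binomial(1, 0.55, size=60)
k, n = int(up_days.sum()), len(up_days)

# Uniform prior Beta(1, 1) on P(up day); the conjugate update gives a Beta posterior
posterior = stats.beta(1 + k, 1 + n - k)

print(f"Observed up days: {k} of {n}")
print(f"Posterior mean: {posterior.mean():.3f}")
lower, upper = posterior.interval(0.95)
print(f"95% credible interval: ({lower:.3f}, {upper:.3f})")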

10 Scalar Quantities in Python

Scalar quantities are numerical values that don’t depend on direction, such as temperature, mass, or height. In finance, scalars often appear in the form of returns, exchange rates, or prices. As a real-world finance application, suppose you want to compute the annualized return of a stock.

10.1 Example: Annualized Return Computation

# Python implementation of annualized return calculation
import numpy as np

# Define the stock prices and holding period
current_price = 100
initial_price = 80
holding_period = 180  # Days

# Calculate annualized return
annualized_return = (current_price / initial_price)**(365 / holding_period) - 1

print(f"Annualized return: {annualized_return:.4f} or {annualized_return*100:.2f}%")

This calculation shows how to compound returns over different time periods, a fundamental concept in financial analysis.

11 Vectors and Arrays with NumPy

Vectors are arrays of numbers, and matrices are rectangular arrays. Both play a crucial role in expressing relationships between variables and performing computations efficiently. Consider a hypothetical scenario where you compare monthly returns across three different assets.

11.1 Example: Monthly Returns Analysis

import pandas as pd
import numpy as np

# Create monthly returns data
monthly_returns = np.array([0.02, -0.01, 0.03])
asset_names = ['Asset A', 'Asset B', 'Asset C']

# Create a pandas DataFrame for better data handling
returns_df = pd.DataFrame({
    'Asset': asset_names,
    'Monthly_Return': monthly_returns,
    'Annualized_Return': monthly_returns * 12  # Simple annualization
})

print("Monthly Returns Analysis:")
print(returns_df)
print(f"\nMean monthly return: {monthly_returns.mean():.4f}")
print(f"Standard deviation: {monthly_returns.std():.4f}")

11.2 Advanced Array Operations

# More advanced operations with financial data
import matplotlib.pyplot as plt

# Generate sample portfolio returns
np.random.seed(42)
portfolio_returns = np.random.normal(0.001, 0.02, 252)  # Daily returns for 1 year

# Calculate cumulative returns
cumulative_returns = np.cumprod(1 + portfolio_returns)

# Calculate key statistics
annual_return = (cumulative_returns[-1] - 1)
volatility = portfolio_returns.std() * np.sqrt(252)  # Annualized volatility
sharpe_ratio = annual_return / volatility

print(f"Annual Return: {annual_return:.4f}")
print(f"Volatility: {volatility:.4f}")
print(f"Sharpe Ratio: {sharpe_ratio:.4f}")

# Simple visualization
plt.figure(figsize=(10, 6))
plt.plot(cumulative_returns)
plt.title('Portfolio Cumulative Returns')
plt.xlabel('Trading Days')
plt.ylabel('Cumulative Return')
plt.grid(True)
plt.show()

12 Functions in Python for Finance

Functions map inputs to outputs and are ubiquitous in mathematics, statistics, and finance. Let’s create a compound interest function and expand it for financial applications.

12.1 Example: Compound Interest Function

def compound_interest(principal, rate, periods):
    """
    Calculate compound interest
    
    Parameters:
    principal (float): Initial investment amount
    rate (float): Interest rate per period (as decimal)
    periods (int): Number of compounding periods
    
    Returns:
    float: Final amount after compound interest
    """
    return_amount = principal * (1 + rate)**periods
    return return_amount

# Example usage
initial_investment = 10000
annual_rate = 0.07  # 7% annual return
years = 10

final_amount = compound_interest(initial_investment, annual_rate, years)
total_return = final_amount - initial_investment

print(f"Initial Investment: ${initial_investment:,.2f}")
print(f"Final Amount: ${final_amount:,.2f}")
print(f"Total Return: ${total_return:,.2f}")
print(f"Total Return %: {(total_return/initial_investment)*100:.2f}%")

12.2 Advanced Financial Functions

def calculate_portfolio_metrics(returns):
    """
    Calculate comprehensive portfolio metrics
    
    Parameters:
    returns (array-like): Array of portfolio returns
    
    Returns:
    dict: Dictionary containing various portfolio metrics
    """
    returns = np.array(returns)
    
    # Basic statistics
    mean_return = np.mean(returns)
    std_return = np.std(returns, ddof=1)
    
    # Annualized metrics (assuming daily returns)
    annual_return = mean_return * 252
    annual_volatility = std_return * np.sqrt(252)
    
    # Risk-adjusted metrics
    sharpe_ratio = annual_return / annual_volatility if annual_volatility != 0 else 0
    
    # Downside metrics
    negative_returns = returns[returns < 0]
    downside_deviation = np.std(negative_returns, ddof=1) * np.sqrt(252) if len(negative_returns) > 0 else 0
    sortino_ratio = annual_return / downside_deviation if downside_deviation != 0 else 0
    
    # Maximum drawdown
    cumulative_returns = np.cumprod(1 + returns)
    running_max = np.maximum.accumulate(cumulative_returns)
    drawdowns = (cumulative_returns - running_max) / running_max
    max_drawdown = np.min(drawdowns)
    
    return {
        'Annual Return': annual_return,
        'Annual Volatility': annual_volatility,
        'Sharpe Ratio': sharpe_ratio,
        'Sortino Ratio': sortino_ratio,
        'Maximum Drawdown': max_drawdown,
        'Total Observations': len(returns)
    }

# Example usage with simulated data
np.random.seed(123)
sample_returns = np.random.normal(0.0008, 0.015, 252)  # Daily returns for 1 year

metrics = calculate_portfolio_metrics(sample_returns)
print("Portfolio Performance Metrics:")
print("-" * 35)
for metric, value in metrics.items():
    if isinstance(value, float):
        print(f"{metric}: {value:.4f}")
    else:
        print(f"{metric}: {value}")

13 Probability Distributions in Finance

Understanding probability distributions is crucial for financial modeling and risk management.

13.1 Common Financial Distributions

from scipy import stats
import matplotlib.pyplot as plt

# Generate sample data for different distributions
np.random.seed(42)
n_samples = 1000

# Normal distribution (common assumption for returns)
normal_returns = np.random.normal(0.001, 0.02, n_samples)

# Student's t-distribution (better for modeling fat tails)
t_returns = stats.t.rvs(df=5, loc=0.001, scale=0.02, size=n_samples)

# Log-normal distribution (for stock prices)
lognormal_prices = np.random.lognormal(mean=4.6, sigma=0.2, size=n_samples)

# Create visualization
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# Plot normal distribution
axes[0,0].hist(normal_returns, bins=50, alpha=0.7, density=True)
axes[0,0].set_title('Normal Distribution - Daily Returns')
axes[0,0].set_xlabel('Return')
axes[0,0].set_ylabel('Density')

# Plot t-distribution
axes[0,1].hist(t_returns, bins=50, alpha=0.7, density=True, color='orange')
axes[0,1].set_title('Student\'s t-Distribution - Daily Returns')
axes[0,1].set_xlabel('Return')
axes[0,1].set_ylabel('Density')

# Plot log-normal distribution
axes[1,0].hist(lognormal_prices, bins=50, alpha=0.7, density=True, color='green')
axes[1,0].set_title('Log-Normal Distribution - Stock Prices')
axes[1,0].set_xlabel('Price')
axes[1,0].set_ylabel('Density')

# Q-Q plot comparing normal vs t-distribution
stats.probplot(normal_returns, dist="norm", plot=axes[1,1])
axes[1,1].set_title('Q-Q Plot: Normal Distribution')

plt.tight_layout()
plt.show()

# Statistical tests
print("Distribution Comparison:")
print(f"Normal - Mean: {np.mean(normal_returns):.6f}, Std: {np.std(normal_returns):.6f}")
print(f"t-dist - Mean: {np.mean(t_returns):.6f}, Std: {np.std(t_returns):.6f}")
print(f"Skewness - Normal: {stats.skew(normal_returns):.4f}, t-dist: {stats.skew(t_returns):.4f}")
print(f"Kurtosis - Normal: {stats.kurtosis(normal_returns):.4f}, t-dist: {stats.kurtosis(t_returns):.4f}")

14 Hypothesis Testing in Finance

Statistical hypothesis testing is fundamental to financial research and decision-making.

14.1 Example: Testing Market Efficiency

from scipy.stats import ttest_1samp, jarque_bera, normaltest

# Generate sample stock returns
np.random.seed(123)
stock_returns = np.random.normal(0.0005, 0.02, 252)  # Daily returns

# Test 1: Is the mean return significantly different from zero?
t_stat, p_value = ttest_1samp(stock_returns, 0)

print("Hypothesis Testing Results:")
print("=" * 40)
print(f"Test 1: Mean return = 0?")
print(f"t-statistic: {t_stat:.4f}")
print(f"p-value: {p_value:.6f}")
print(f"Conclusion: {'Reject' if p_value < 0.05 else 'Fail to reject'} null hypothesis at 5% level")

# Test 2: Are returns normally distributed?
jb_stat, jb_pvalue = jarque_bera(stock_returns)

print(f"\nTest 2: Are returns normally distributed?")
print(f"Jarque-Bera statistic: {jb_stat:.4f}")
print(f"p-value: {jb_pvalue:.6f}")
print(f"Conclusion: {'Reject' if jb_pvalue < 0.05 else 'Fail to reject'} normality at 5% level")

# Test 3: Serial correlation test (market efficiency)
from statsmodels.stats.diagnostic import acorr_ljungbox

lb_stat = acorr_ljungbox(stock_returns, lags=10, return_df=True)
print(f"\nTest 3: Serial correlation (Ljung-Box test)")
print(f"First lag p-value: {lb_stat.iloc[0]['lb_pvalue']:.6f}")
print(f"Conclusion: {'Evidence of' if lb_stat.iloc[0]['lb_pvalue'] < 0.05 else 'No evidence of'} serial correlation")

15 Bayesian vs Frequentist Approaches

15.1 Frequentist Approach

In frequentist statistics, probability represents long-run frequencies. Parameters are fixed but unknown, and we use sample data to make inferences.
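
As a minimal sketch of the frequentist workflow, reusing the simulated daily returns from the hypothesis-testing example above, the mean return is treated as a fixed but unknown parameter, and the sample yields a point estimate with a 95% confidence interval rather than a posterior distribution:

import numpy as np
from scipy import stats

# Simulated daily returns (same parameters and seed as the hypothesis-testing example)
np.random.seed(123)
stock_returns = np.random.normal(0.0005, 0.02, 252)

mean_est = stock_returns.mean()
sem = stats.sem(stock_returns)                       # standard error of the mean
ci_low, ci_high = stats.t.interval(0.95, df=len(stock_returns) - 1,
                                   loc=mean_est, scale=sem)

print(f"Estimated mean daily return: {mean_est:.6f}")
print(f"95% confidence interval: ({ci_low:.6f}, {ci_high:.6f})")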

15.2 Bayesian Approach

In Bayesian statistics, probability represents degrees of belief. We start with prior beliefs and update them with data to get posterior beliefs.

# Simple Bayesian updating example for stock return estimation
import scipy.stats as stats

# Prior belief: stock has mean return of 0.1% daily with uncertainty
prior_mean = 0.001
prior_std = 0.005

# Observed data
observed_returns = np.array([0.002, -0.001, 0.003, 0.001, 0.000])
n_obs = len(observed_returns)
sample_mean = np.mean(observed_returns)
sample_std = 0.02  # Assumed known

# Bayesian updating (conjugate normal-normal case)
posterior_precision = 1/prior_std**2 + n_obs/sample_std**2
posterior_std = np.sqrt(1/posterior_precision)
posterior_mean = (prior_mean/prior_std**2 + n_obs*sample_mean/sample_std**2) / posterior_precision

print("Bayesian Updating Example:")
print("-" * 30)
print(f"Prior: μ = {prior_mean:.4f}, σ = {prior_std:.4f}")
print(f"Sample: mean = {sample_mean:.4f}, n = {n_obs}")
print(f"Posterior: μ = {posterior_mean:.4f}, σ = {posterior_std:.4f}")

# Visualization
x = np.linspace(-0.01, 0.01, 1000)
prior_pdf = stats.norm.pdf(x, prior_mean, prior_std)
posterior_pdf = stats.norm.pdf(x, posterior_mean, posterior_std)

plt.figure(figsize=(10, 6))
plt.plot(x, prior_pdf, label='Prior', linestyle='--')
plt.plot(x, posterior_pdf, label='Posterior', linewidth=2)
plt.axvline(sample_mean, color='red', linestyle=':', label='Sample Mean')
plt.xlabel('Daily Return')
plt.ylabel('Density')
plt.title('Bayesian Updating of Stock Return Beliefs')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

16 Practical Questions and Exercises

16.1 Easier Exercises

16.1.1 1. Calculating Stock Returns

# Calculate the annualized return of a stock
initial_price = 100
final_price = 150
years = 3

# Calculate annualized return
annualized_return = (final_price / initial_price)**(1/years) - 1

print(f"Initial Price: ${initial_price}")
print(f"Final Price: ${final_price}")
print(f"Investment Period: {years} years")
print(f"Annualized Return: {annualized_return:.4f} or {annualized_return*100:.2f}%")

# Alternative: using numpy for more complex calculations
returns_array = np.array([0.14, 0.08, 0.12])  # Annual returns for 3 years
geometric_mean = np.prod(1 + returns_array)**(1/len(returns_array)) - 1
print(f"Geometric mean return: {geometric_mean:.4f} or {geometric_mean*100:.2f}%")

16.1.2 2. Descriptive Statistics of Financial Data

# Analyze a dataset of stock prices
stock_prices = np.array([120, 125, 130, 128, 135])

# Calculate comprehensive statistics
statistics = {
    'Mean': np.mean(stock_prices),
    'Median': np.median(stock_prices),
    'Standard Deviation': np.std(stock_prices, ddof=1),
    'Minimum': np.min(stock_prices),
    'Maximum': np.max(stock_prices),
    'Range': np.ptp(stock_prices)  # Peak-to-peak (range)
}

print("Stock Price Statistics:")
print("-" * 25)
for stat, value in statistics.items():
    print(f"{stat}: {value:.2f}")

# Calculate returns from prices
returns = np.diff(stock_prices) / stock_prices[:-1]
print(f"\nDaily Returns: {returns}")
print(f"Mean Daily Return: {np.mean(returns):.4f}")
print(f"Return Volatility: {np.std(returns, ddof=1):.4f}")

16.1.3 3. Basic Risk Assessment

# Calculate standard deviation of stock returns
stock_returns = np.array([0.05, 0.02, -0.03, 0.04, 0.01])

# Risk metrics
volatility = np.std(stock_returns, ddof=1)
annualized_volatility = volatility * np.sqrt(252)  # Assuming daily returns

# Value at Risk (95% confidence)
var_95 = np.percentile(stock_returns, 5)

# Maximum drawdown simulation
cumulative_returns = np.cumprod(1 + stock_returns)
running_max = np.maximum.accumulate(cumulative_returns)
drawdowns = (cumulative_returns - running_max) / running_max
max_drawdown = np.min(drawdowns)

print("Risk Assessment:")
print("-" * 20)
print(f"Volatility (daily): {volatility:.4f}")
print(f"Annualized Volatility: {annualized_volatility:.4f}")
print(f"95% VaR: {var_95:.4f}")
print(f"Maximum Drawdown: {max_drawdown:.4f}")

# Risk-return visualization
mean_return = np.mean(stock_returns)
plt.figure(figsize=(8, 6))
plt.scatter(volatility, mean_return, s=100, alpha=0.7)
plt.xlabel('Risk (Standard Deviation)')
plt.ylabel('Expected Return')
plt.title('Risk-Return Profile')
plt.grid(True, alpha=0.3)
plt.axhline(y=0, color='r', linestyle='--', alpha=0.5)
plt.axvline(x=0, color='r', linestyle='--', alpha=0.5)
plt.show()

16.2 Advanced Exercises

16.2.1 1. Monte Carlo Simulation for Portfolio Risk

# Monte Carlo simulation for portfolio risk assessment
def monte_carlo_portfolio_simulation(returns, weights, n_simulations=10000, time_horizon=252):
    """
    Perform Monte Carlo simulation for portfolio returns
    """
    n_assets = len(weights)
    portfolio_returns = []
    
    for _ in range(n_simulations):
        # Generate random returns for each asset
        random_returns = np.random.multivariate_normal(
            mean=np.mean(returns, axis=0),
            cov=np.cov(returns.T),
            size=time_horizon
        )
        
        # Calculate portfolio returns
        portfolio_return = np.sum(random_returns * weights, axis=1)
        portfolio_returns.append(np.prod(1 + portfolio_return) - 1)
    
    return np.array(portfolio_returns)

# Example with 3-asset portfolio
np.random.seed(42)
asset_returns = np.random.multivariate_normal(
    mean=[0.001, 0.0008, 0.0012],
    cov=[[0.0004, 0.0001, 0.0002],
         [0.0001, 0.0006, 0.0001],
         [0.0002, 0.0001, 0.0005]],
    size=252
)

portfolio_weights = np.array([0.4, 0.3, 0.3])
simulated_returns = monte_carlo_portfolio_simulation(asset_returns, portfolio_weights)

# Analyze results
print("Monte Carlo Portfolio Simulation Results:")
print("-" * 45)
print(f"Expected Annual Return: {np.mean(simulated_returns):.4f}")
print(f"Annual Volatility: {np.std(simulated_returns):.4f}")
print(f"95% VaR: {np.percentile(simulated_returns, 5):.4f}")
print(f"99% VaR: {np.percentile(simulated_returns, 1):.4f}")

# Visualization
plt.figure(figsize=(10, 6))
plt.hist(simulated_returns, bins=100, alpha=0.7, density=True)
plt.axvline(np.mean(simulated_returns), color='red', linestyle='--', label='Mean')
plt.axvline(np.percentile(simulated_returns, 5), color='orange', linestyle='--', label='95% VaR')
plt.xlabel('Annual Portfolio Return')
plt.ylabel('Density')
plt.title('Monte Carlo Simulation: Portfolio Return Distribution')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

This comprehensive primer provides the statistical foundation necessary for advanced financial data science, converted entirely to Python while maintaining the educational rigor of the original content.