Datasets

Reliable sources and quick-start code

Financial Datasets: Quick Start

Use these beginner-friendly data sources. Start small (a few tickers, a short date range) to keep downloads fast and results easy to inspect.

Yahoo Finance (Free)

Suitable for equities, indices, and ETFs.

import yfinance as yf

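# Minimal example. auto_adjust=False keeps the "Adj Close" column;
# recent yfinance releases adjust prices by default and omit it.
df = yf.download(["AAPL", "MSFT"], start="2022-01-01", end="2023-01-01", auto_adjust=False)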
df["Adj Close"].head()

# Simple return calculation
returns = df["Adj Close"].pct_change().dropna()
returns.tail()

Note: adjusted prices account for splits and dividends, and historical values are occasionally revised. Verify the data before using it in assessed work.
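One way to act on this note is to run a few basic checks before using the data. The sketch below works on a small hand-made price table. The values and the helper name basic_checks are illustrative assumptions, not part of yfinance.

```python
import pandas as pd


def basic_checks(prices: pd.DataFrame) -> dict:
    """Return simple data-quality indicators for a date-indexed price table."""
    return {
        "sorted_dates": bool(prices.index.is_monotonic_increasing),
        "duplicate_dates": int(prices.index.duplicated().sum()),
        "missing_values": int(prices.isna().sum().sum()),
        "non_positive_prices": int((prices <= 0).sum().sum()),
    }


# Hypothetical prices, standing in for df["Adj Close"]
sample = pd.DataFrame(
    {"AAPL": [100.0, 101.0, None], "MSFT": [200.0, 198.0, 202.0]},
    index=pd.to_datetime(["2022-01-03", "2022-01-04", "2022-01-05"]),
)
report = basic_checks(sample)
print(report)
```

Here the report flags one missing value. Whether to drop, fill, or investigate it depends on the task.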

FRED (Macro/Economic)

Federal Reserve economic series (GDP, CPI, rates).

import pandas_datareader.data as web

gdp = web.DataReader("GDP", "fred", start="2018-01-01")
gdp.tail()
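The FRED GDP series is quarterly, so growth rates are usually computed over one or four observations. A minimal sketch follows, using made-up levels in place of the downloaded data so it runs offline. The numbers are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical quarterly levels standing in for the FRED "GDP" series
gdp_sample = pd.DataFrame(
    {"GDP": [100.0, 101.0, 102.0, 103.0, 105.0, 106.05]},
    index=pd.date_range("2018-01-01", periods=6, freq="QS"),
)

# Quarter-on-quarter growth: compare with the previous observation
qoq = gdp_sample["GDP"].pct_change()

# Year-on-year growth: compare with the same quarter one year earlier
yoy = gdp_sample["GDP"].pct_change(periods=4)

print(yoy.dropna().round(4))
```

The first four year-on-year values are missing because there is no observation one year earlier to compare against.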

CSV Fallback (Offline-friendly)

If network access is blocked, work from local CSVs.

import pandas as pd

prices = pd.read_csv("prices_sample.csv", parse_dates=["Date"], index_col="Date")
prices.head()

How to create a CSV quickly:

# After pulling with yfinance, save for later offline use
df["Adj Close"].to_csv("prices_sample.csv")
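Before relying on a saved file, it can help to confirm that it reloads to the same table. The sketch below does a save-and-reload round trip in a temporary folder, with made-up prices standing in for the downloaded data.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical prices standing in for df["Adj Close"]
original = pd.DataFrame(
    {"AAPL": [100.0, 101.5], "MSFT": [200.0, 198.25]},
    index=pd.to_datetime(["2022-01-03", "2022-01-04"]),
)
original.index.name = "Date"

with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "prices_sample.csv"
    original.to_csv(csv_path)
    loaded = pd.read_csv(csv_path, parse_dates=["Date"], index_col="Date")

# The reloaded table should have the same dates, columns, and values
same_dates = list(loaded.index) == list(original.index)
same_columns = list(loaded.columns) == list(original.columns)
same_values = loaded.to_numpy().tolist() == original.to_numpy().tolist()
print(same_dates, same_columns, same_values)
```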

Good practices

  • Start with one or two tickers and short date windows
  • Check .info() for column types and missing values, and the source documentation for series definitions
  • Keep a small “data” folder with versioned CSV snapshots
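For the versioned snapshots suggested above, one simple convention is to put the snapshot date in the file name. The helper snapshot_path and the naming scheme below are assumptions for illustration; any consistent scheme would do.

```python
from datetime import date
from pathlib import Path


def snapshot_path(data_dir: str, name: str, as_of: date) -> Path:
    """Build a dated file name such as data/prices_2023-01-01.csv."""
    return Path(data_dir) / f"{name}_{as_of.isoformat()}.csv"


path = snapshot_path("data", "prices", date(2023, 1, 1))
print(path.as_posix())
```

Create the folder first (for example with path.parent.mkdir(exist_ok=True)), then pass the path to to_csv. ISO dates in file names sort chronologically, which makes the latest snapshot easy to find.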

JKP Global Factor Data (Replication resources)

The JKP initiative (Jensen–Kelly–Pedersen) provides a curated, global factor dataset, documentation, and analysis tools. The full dataset (~170 MB) is stored in the OneDrive folder teaching-data/Global-Factor-Data to avoid GitHub size limits. Scripts (create_jkp_master_global.py, etc.) resolve this path automatically; to override it, set jkp_data_path in config/data_root.yml.
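The course scripts' actual resolution logic is not shown here. As a rough sketch of how such an override could work, the code below reads jkp_data_path from a flat "key: value" file and otherwise falls back to a default. The default location, the function names, and the simple line parser (used in place of a YAML library) are all assumptions.

```python
import tempfile
from pathlib import Path
from typing import Optional

# Hypothetical default; the real location depends on your OneDrive setup
DEFAULT_JKP_PATH = Path.home() / "OneDrive" / "teaching-data" / "Global-Factor-Data"


def read_override(config_file: Path, key: str = "jkp_data_path") -> Optional[str]:
    """Return the value of `key` from a flat `key: value` file, if present."""
    if not config_file.exists():
        return None
    for line in config_file.read_text(encoding="utf-8").splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if line.startswith(f"{key}:"):
            value = line.split(":", 1)[1].strip().strip("'\"")
            return value or None
    return None


def resolve_jkp_path(config_file: Path = Path("config/data_root.yml")) -> Path:
    """Prefer the configured path, otherwise fall back to the default."""
    override = read_override(config_file)
    return Path(override).expanduser() if override else DEFAULT_JKP_PATH


# Demo with a temporary config file
with tempfile.TemporaryDirectory() as tmp:
    cfg = Path(tmp) / "data_root.yml"
    cfg.write_text("jkp_data_path: /tmp/jkp-data\n", encoding="utf-8")
    configured = resolve_jkp_path(cfg)
    fallback = resolve_jkp_path(Path(tmp) / "missing.yml")

print(configured, fallback)
```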

  • Portal: https://jkpfactors.com
  • Documentation (factor definitions, availability): https://jkpfactors.s3.amazonaws.com/documents/Documentation.pdf
  • JKP/WRDS Guide: https://jkpfactors.com/jkp-wrds-guide
  • GitHub (related research replication): https://github.com/bkelly-lab/ReplicationCrisis

Notes and usage

  • Access may require registration and/or institutional subscriptions (e.g., WRDS). Follow the portal’s terms and documentation.
  • For coursework, prefer small, well‑documented slices (few factors, limited horizon) and record exactly which series/versions you used.
  • Context papers: Jensen, Kelly, and Pedersen (2024); methodology links to Kelly, Malamud, and Zhou (2024) and Gu, Kelly, and Xiu (2020) for model design and evaluation.
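To illustrate the advice on small, documented slices, the sketch below filters a long-format table to a few factors and a limited horizon, then records what was used. The column names (date, name, ret), the factor labels, and the values are assumptions; check the JKP documentation for the actual file layout.

```python
import json

import pandas as pd

# Hypothetical long-format factor returns; column names are assumptions
factors = pd.DataFrame(
    {
        "date": pd.to_datetime(["2019-12-31", "2020-01-31", "2020-02-29"] * 3),
        "name": ["factor_a"] * 3 + ["factor_b"] * 3 + ["factor_c"] * 3,
        "ret": [0.01, -0.02, 0.03, 0.00, 0.01, -0.01, 0.02, 0.02, -0.03],
    }
)

keep = ["factor_a", "factor_b"]
start, end = "2020-01-01", "2020-12-31"

# Keep a few factors over a limited horizon
in_window = factors["date"].between(pd.Timestamp(start), pd.Timestamp(end))
subset = factors[factors["name"].isin(keep) & in_window]

# Record exactly what was used so the analysis can be reproduced
manifest = {
    "factors": keep,
    "start": start,
    "end": end,
    "rows": int(len(subset)),
    "source_version": "fill in the download date or file version",
}
print(json.dumps(manifest, indent=2))
```

Saving the manifest next to the CSV slice keeps the record of series and versions together with the data.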

References

Gu, Shihao, Bryan Kelly, and Dacheng Xiu. 2020. “Empirical Asset Pricing via Machine Learning.” Review of Financial Studies. https://doi.org/10.1093/rfs/hhaa009.
Jensen, Theis I., Bryan T. Kelly, and Lasse Heje Pedersen. 2024. “Is There a Replication Crisis in Finance?” Journal of Finance. https://doi.org/10.1111/jofi.13249.
Kelly, Bryan T., Semyon Malamud, and Kangying Zhou. 2024. “The Virtue of Complexity in Return Prediction.” Journal of Finance 79 (1): 459–503. https://doi.org/10.1111/jofi.13298.