Lab 8: Fraud Detection as Rare-Event Classification
From synthetic transactions to real Bitcoin data (Elliptic dataset)
- Part A (synthetic data): ~60 minutes
- Part B (Elliptic Bitcoin): ~45 minutes
- Extension (graph analysis): +30 minutes
1 Learning Objectives
By the end of this lab you will be able to:
- Demonstrate why accuracy is meaningless for rare-event classification
- Build a supervised fraud detection pipeline with stratified CV and class weighting
- Select a cost-sensitive decision threshold that minimises expected cost
- Compare Isolation Forest (unsupervised) with supervised and hybrid approaches
- Run walk-forward temporal validation on real Bitcoin transaction data
- Quantify the look-ahead bias gap between shuffled and temporal CV
2 Connection to the slides
This lab directly implements the exercises from the Week 8 slides. Part A uses synthetic transaction data to teach the statistical principles (accuracy trap, cost-sensitive thresholds, temporal CV, hybrid methods). Part B applies the same pipeline to the Elliptic Bitcoin dataset (Weber et al. 2019), which exhibits genuine temporal drift and a +0.065 AUC look-ahead bias gap between shuffled CV and walk-forward validation.
3 Data
Part A generates synthetic data inline (no external files needed).
Part B uses two parquet files in data/elliptic/:
- elliptic_labelled.parquet (~25 MB): 46,564 labelled Bitcoin transactions with 166 anonymised features and 49 time steps
- elliptic_edges_labelled.parquet (~0.4 MB): 36,624 directed edges between labelled transactions
See data/elliptic/README.md for the full dataset description and download instructions.
4 Part A: Synthetic Transaction Data
4.1 Exercise 1: Generate the dataset
Generate 50,000 synthetic card transactions with a ~1% fraud rate and temporal drift. Features mimic what a real fraud team works with: amount, hour, velocity, foreign merchant flag, account age, and spend ratio.
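One possible generator is sketched below. It is not the lab's reference solution: the function name `make_transactions`, the column names, the distributions, and every coefficient are assumptions chosen for illustration. Drift is simulated by letting the weight on the foreign-merchant flag shrink and the weight on velocity grow as time advances.

```python
import numpy as np
import pandas as pd

def make_transactions(n=50_000, fraud_rate=0.01, seed=42):
    """Synthetic card transactions in time order (all coefficients are illustrative)."""
    rng = np.random.default_rng(seed)
    t = np.linspace(0.0, 1.0, n)  # 0 = oldest transaction, 1 = newest
    df = pd.DataFrame({
        "amount": rng.lognormal(3.5, 1.0, n),
        "hour": rng.integers(0, 24, n),
        "velocity": rng.poisson(2.0, n),            # recent transaction count
        "foreign": rng.binomial(1, 0.05, n),        # foreign merchant flag
        "account_age": rng.exponential(3.0, n),     # years
        "spend_ratio": rng.lognormal(0.0, 0.5, n),  # amount relative to usual spend
    })
    night = (df["hour"] < 6).to_numpy(dtype=float)
    # Temporal drift: the foreign-flag weight decays and the velocity weight grows with t.
    score = (0.4 * np.log(df["amount"]) + 0.8 * night
             + (1.5 - 1.0 * t) * df["foreign"]
             + (0.2 + 0.4 * t) * df["velocity"]
             - 0.2 * df["account_age"]
             + 0.8 * np.log(df["spend_ratio"])).to_numpy()
    # Bisection on the intercept so the mean fraud probability matches fraud_rate.
    lo, hi = -30.0, 30.0
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if np.mean(1.0 / (1.0 + np.exp(-(score + mid)))) > fraud_rate:
            hi = mid
        else:
            lo = mid
    p = 1.0 / (1.0 + np.exp(-(score + lo)))
    df["is_fraud"] = rng.binomial(1, p)
    df["t"] = t
    return df

df = make_transactions()
print(len(df), round(df["is_fraud"].mean(), 4))
```

Keeping the rows in time order makes it possible to split the same frame either randomly or chronologically later on.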
4.2 Exercise 2: The accuracy trap
Build the simplest possible model (predict every transaction as legitimate). Calculate accuracy and recall. This demonstrates the base rate fallacy from Week 1 and Chapter 05.
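A minimal sketch of the baseline, using stand-in labels drawn at a 1% fraud rate rather than the lab's generated dataset:

```python
import numpy as np

rng = np.random.default_rng(0)
y = (rng.random(50_000) < 0.01).astype(int)  # stand-in labels, roughly 1% fraud
y_pred = np.zeros_like(y)                    # predict "legitimate" for every transaction

accuracy = np.mean(y_pred == y)
recall = np.sum((y_pred == 1) & (y == 1)) / np.sum(y == 1)
print(f"accuracy={accuracy:.3f}  recall={recall:.3f}")
```

Accuracy here simply equals the share of legitimate transactions, while recall is zero because no transaction is ever flagged.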
4.3 Exercise 3: Supervised pipeline
Train logistic regression with class_weight='balanced' using 5-fold stratified CV. Report both AUC and Average Precision.
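The pipeline could look roughly like this. The data comes from scikit-learn's `make_classification` as a stand-in for the synthetic transactions, and the scaling step is an added assumption, so the scores will not match the figures in the summary table.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data: 1% positive class, no label noise.
X, y = make_classification(n_samples=20_000, n_features=6, n_informative=4,
                           weights=[0.99], flip_y=0.0, random_state=0)

model = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_validate(model, X, y, cv=cv,
                        scoring=["roc_auc", "average_precision"])

auc = scores["test_roc_auc"].mean()
ap = scores["test_average_precision"].mean()
print(f"AUC={auc:.3f}  AP={ap:.3f}")
```

Putting the scaler inside the pipeline means it is refitted on each training fold, so no test-fold statistics leak into training.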
4.4 Exercise 4: Default threshold disaster
Fit unweighted logistic regression. Predict using the default 0.5 threshold. Observe that the model catches zero fraud.
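As a sketch on stand-in data (again from `make_classification`, with a deliberately weak class separation that is an assumption of this example), the confusion matrix at the 0.5 threshold can be read off as follows. Whether the true-positive count is exactly zero depends on the data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=20_000, n_features=6, n_informative=4,
                           weights=[0.99], flip_y=0.0, class_sep=0.5,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)  # no class weighting
p_fraud = clf.predict_proba(X_te)[:, 1]
y_hat = (p_fraud >= 0.5).astype(int)                     # default threshold

tn, fp, fn, tp = confusion_matrix(y_te, y_hat, labels=[0, 1]).ravel()
print(f"TP={tp}  FN={fn}  FP={fp}  TN={tn}")
```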
4.5 Exercise 5: Cost-sensitive threshold
Sweep thresholds from 0.005 to 0.50, plot the expected cost curve (£20 per false alarm, £1,000 per missed fraud), and find the optimal threshold.
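The sweep can be sketched with plain NumPy. The helper name `cost_curve` and the stand-in probabilities are assumptions; the costs (£20 and £1,000) and the threshold range come from the exercise. Plotting is left out so the snippet runs headless.

```python
import numpy as np

def cost_curve(y_true, p_fraud, thresholds, cost_fp=20.0, cost_fn=1000.0):
    """Total cost on the evaluation set at each threshold."""
    y_true = np.asarray(y_true).astype(bool)
    p_fraud = np.asarray(p_fraud)
    costs = []
    for thr in thresholds:
        flagged = p_fraud >= thr
        fp = np.sum(flagged & ~y_true)   # false alarms
        fn = np.sum(~flagged & y_true)   # missed frauds
        costs.append(cost_fp * fp + cost_fn * fn)
    return np.array(costs)

# Stand-in predicted probabilities and labels drawn from them.
rng = np.random.default_rng(0)
p = rng.beta(0.5, 30.0, size=20_000)
y = (rng.random(p.size) < p).astype(int)

thresholds = np.linspace(0.005, 0.50, 100)
costs = cost_curve(y, p, thresholds)
best = thresholds[np.argmin(costs)]
print(f"best threshold={best:.3f}  cost=£{costs.min():,.0f}")
```

For perfectly calibrated probabilities, and assuming correct decisions cost nothing, the cost-minimising threshold would be 20 / (20 + 1000) ≈ 0.0196. An empirical optimum found by sweeping can differ from that value.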
4.6 Exercise 6: Isolation Forest
Fit Isolation Forest with 2% contamination. Plot anomaly score distributions for fraud vs legitimate. Report precision and recall.
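A minimal sketch with scikit-learn's `IsolationForest`, on stand-in data where the fraud rows are shifted away from the bulk (an assumption of this example). The negated `score_samples` output serves as the anomaly score, so higher means more unusual.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import precision_score, recall_score

rng = np.random.default_rng(0)
n = 10_000
y = (rng.random(n) < 0.01).astype(int)  # stand-in labels, roughly 1% fraud
X = rng.normal(size=(n, 4))
X[y == 1] += 1.5                        # stand-in frauds sit away from the bulk

iso = IsolationForest(contamination=0.02, random_state=0).fit(X)  # labels unused
flagged = (iso.predict(X) == -1).astype(int)  # -1 marks predicted anomalies
anomaly = -iso.score_samples(X)               # higher = more anomalous

precision = precision_score(y, flagged, zero_division=0)
recall = recall_score(y, flagged, zero_division=0)
print(f"flagged={flagged.mean():.3f}  precision={precision:.3f}  recall={recall:.3f}")
```

The two score distributions can then be compared by plotting histograms of `anomaly[y == 1]` and `anomaly[y == 0]`.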
4.7 Exercise 7: Hybrid model
Add the Isolation Forest anomaly score as a feature to the supervised model. Measure the AUC lift.
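One way to measure the lift is sketched below on stand-in data. Fitting the Isolation Forest inside each training fold is a design choice of this sketch, not something the exercise prescribes, and the helper `fold_auc` is hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import IsolationForest
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=10_000, n_features=6, n_informative=4,
                           weights=[0.99], flip_y=0.0, random_state=0)

def fold_auc(X_tr, y_tr, X_te, y_te):
    """Fit weighted logistic regression on one fold and return its test AUC."""
    model = make_pipeline(StandardScaler(),
                          LogisticRegression(class_weight="balanced", max_iter=1000))
    model.fit(X_tr, y_tr)
    return roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])

base_aucs, hybrid_aucs = [], []
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for tr, te in cv.split(X, y):
    iso = IsolationForest(random_state=0).fit(X[tr])   # fitted on the training fold only
    s_tr = (-iso.score_samples(X[tr])).reshape(-1, 1)  # anomaly score as extra column
    s_te = (-iso.score_samples(X[te])).reshape(-1, 1)
    base_aucs.append(fold_auc(X[tr], y[tr], X[te], y[te]))
    hybrid_aucs.append(fold_auc(np.hstack([X[tr], s_tr]), y[tr],
                                np.hstack([X[te], s_te]), y[te]))

lift = np.mean(hybrid_aucs) - np.mean(base_aucs)
print(f"base={np.mean(base_aucs):.3f}  hybrid={np.mean(hybrid_aucs):.3f}  lift={lift:+.3f}")
```

On stand-in data the lift can be small or even negative; the point of the sketch is the mechanics of stacking the score, not the size of the gain.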
5 Part B: Elliptic Bitcoin Data
5.1 Exercise 8: Load and explore
Load the labelled parquet. Plot the illicit rate by time step to visualise the temporal drift.
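The per-step illicit rate is a one-line group-by. The column names `time_step` and `label` (1 = illicit) are assumptions, since the file's schema is described in data/elliptic/README.md rather than here; the tiny frame stands in for the real file so the snippet runs without it.

```python
import pandas as pd

def illicit_rate_by_step(df, time_col="time_step", label_col="label"):
    """Share of illicit transactions in each time step (column names are assumed)."""
    return df.groupby(time_col)[label_col].mean().sort_index()

# With the real data (adjust column names to the actual schema):
# df = pd.read_parquet("data/elliptic/elliptic_labelled.parquet")

# Tiny stand-in frame so the sketch is runnable on its own.
toy = pd.DataFrame({"time_step": [1, 1, 2, 2, 3, 3],
                    "label":     [1, 0, 0, 0, 1, 1]})
rates = illicit_rate_by_step(toy)
print(rates)
```

Calling `rates.plot(kind="bar")` on the result gives the drift plot the exercise asks for.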
5.2 Exercise 9: Shuffled CV vs walk-forward validation
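Run 5-fold shuffled stratified CV and walk-forward validation (train on past time steps, test on future ones). Compare the AUCs: shuffled CV scores about 0.065 higher because it lets the model train on transactions from time steps later than those it is tested on. This gap is the strongest evidence for temporal CV in the entire course.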
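The comparison can be sketched as follows. The data is a drifting stand-in with 49 time steps, not the Elliptic file; the helper `walk_forward_auc`, the expanding-window scheme, the 30-step warm-up, and the choice of logistic regression are all assumptions of this sketch.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def walk_forward_auc(model, X, y, steps, min_train_steps=30):
    """Expanding window: train on all steps before s, test on step s."""
    aucs = []
    for s in np.sort(np.unique(steps))[min_train_steps:]:
        train, test = steps < s, steps == s
        if len(np.unique(y[train])) < 2 or len(np.unique(y[test])) < 2:
            continue  # AUC is undefined when only one class is present
        fitted = clone(model).fit(X[train], y[train])
        aucs.append(roc_auc_score(y[test], fitted.predict_proba(X[test])[:, 1]))
    return np.array(aucs)

# Stand-in data: the effect of feature 0 reverses over time.
rng = np.random.default_rng(0)
steps = np.repeat(np.arange(1, 50), 200)
X = rng.normal(size=(steps.size, 5))
drift = (steps - 1) / 48.0
logit = -2.5 + (2.0 - 3.0 * drift) * X[:, 0] + 1.0 * X[:, 1]
y = (rng.random(steps.size) < 1.0 / (1.0 + np.exp(-logit))).astype(int)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
shuffled = cross_val_score(model, X, y, scoring="roc_auc",
                           cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0))
temporal = walk_forward_auc(model, X, y, steps)
gap = shuffled.mean() - temporal.mean()
print(f"shuffled={shuffled.mean():.3f}  walk-forward={temporal.mean():.3f}  gap={gap:+.3f}")
```

The gap on this stand-in data need not resemble the value reported for Elliptic; it depends entirely on how the drift was simulated.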
5.3 Exercise 10 (Extension): Graph exploration
Load the edge list. Build a directed graph with NetworkX. Compute degree centrality. Test whether high-centrality nodes are more likely to be illicit.
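A descriptive version of that comparison is sketched below on a toy graph. The helper `illicit_rate_by_centrality`, the top-fraction cut-off, and the toy nodes are assumptions; the sketch also assumes every edge endpoint has a label, which matches the description of the labelled edge file. It compares rates only and is not a formal significance test.

```python
import networkx as nx

def illicit_rate_by_centrality(edges, illicit, top_frac=0.1):
    """Illicit rate among the top-centrality nodes versus the remaining nodes.

    edges:   iterable of (source, target) pairs
    illicit: dict mapping node -> 1 (illicit) or 0 (licit)
    """
    G = nx.DiGraph()
    G.add_nodes_from(illicit)  # include labelled nodes that have no edges
    G.add_edges_from(edges)
    cent = nx.degree_centrality(G)  # (in-degree + out-degree) / (n - 1)
    ranked = sorted(cent, key=cent.get, reverse=True)
    k = max(1, int(len(ranked) * top_frac))
    top, rest = ranked[:k], ranked[k:]
    rate = lambda nodes: sum(illicit[v] for v in nodes) / len(nodes)
    return rate(top), rate(rest), cent

# Toy graph: node "a" is the hub and the only illicit node.
toy_edges = [("a", "b"), ("a", "c"), ("d", "a")]
toy_labels = {"a": 1, "b": 0, "c": 0, "d": 0}
top_rate, rest_rate, cent = illicit_rate_by_centrality(toy_edges, toy_labels, top_frac=0.25)
print(top_rate, rest_rate)
```

On the real data, a permutation test or a rank-based test on the centrality values of illicit versus licit nodes would turn this comparison into a proper hypothesis test.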
6 Summary
| Exercise | Key result | Lesson |
|---|---|---|
| 2. Accuracy trap | 99% accurate, 0 fraud caught | Accuracy is useless for rare events |
| 3. Supervised pipeline | AUC ~0.78, AP ~0.04 | AUC flatters; AP tells the truth |
| 4. Default threshold | TP = 0 at threshold 0.5 | Default threshold catches nothing |
| 5. Cost-sensitive | Optimal ~0.015, catches ~50% | Threshold is a business decision |
| 6. Isolation Forest | Low precision, low recall | Unusual ≠ fraudulent |
| 7. Hybrid model | +0.02 AUC lift | Stack unsupervised into supervised |
| 9. Elliptic walk-forward | +0.065 look-ahead bias gap | Temporal CV matters on real data |