Lab 6: Alternative Finance & Credit Risk Scoring: UCI German Credit

Real-world credit scoring: UCI German Credit dataset

Note

Time: ≈ 60 min core · +20 min extensions
Sample answers are hidden; attempt each question before opening.

0. Setup & Data

We use the UCI Statlog German Credit dataset: 1,000 real loan applications from a German bank, with 20 features and a binary outcome (0 = repaid, 1 = defaulted). This is the same data structure a marketplace lending platform uses to build a scoring model.

Show code
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
import os
warnings.filterwarnings('ignore')

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.calibration import calibration_curve

# Load German Credit data (works locally and in Colab).
# Try multiple local paths first (Quarto may run from project root or labs/).
# Fall back to UCI source if no local file is found.
_CANDIDATE_PATHS = [
    "../data/alt_finance/german_credit.csv",   # from labs/ directory
    "data/alt_finance/german_credit.csv",       # from project root
    "german_credit.csv",                        # if downloaded in Colab
]
_PUBLIC_MIRROR_URLS = [
    "https://raw.githubusercontent.com/quinfer/fin510-colab-notebooks/main/labs/german_credit.csv",
]
_UCI_URL = (
    "https://archive.ics.uci.edu/ml/machine-learning-databases"
    "/statlog/german/german.data"
)
_UCI_COLS = [
    "checking_status", "duration", "credit_history", "purpose", "credit_amount",
    "savings_status", "employment", "installment_rate", "personal_status",
    "other_parties", "residence_since", "property_magnitude", "age", "other_plans",
    "housing", "existing_credits", "job", "num_dependents", "own_telephone",
    "foreign_worker", "target"
]

df = None
for _path in _CANDIDATE_PATHS:
    if os.path.exists(_path):
        df = pd.read_csv(_path)
        DATA_SOURCE = f"local file ({_path})"
        break

_IS_COLAB = ('COLAB_GPU' in os.environ or 'COLAB_BACKEND_VERSION' in os.environ)
if df is None and _IS_COLAB:
    for _url in _PUBLIC_MIRROR_URLS:
        try:
            df = pd.read_csv(_url)
            DATA_SOURCE = f"public mirror ({_url})"
            break
        except Exception:
            df = None

if df is None:
    _raw = pd.read_csv(_UCI_URL, sep=" ", header=None, names=_UCI_COLS)
    _raw["defaulted"] = (_raw["target"] == 2).astype(int)
    df = _raw.drop(columns="target")
    DATA_SOURCE = "UCI archive (downloaded)"

print(f"Source : {DATA_SOURCE}")
print(f"Shape  : {df.shape[0]} loans × {df.shape[1]} columns")
print(f"Default rate: {df['defaulted'].mean():.1%}  ({df['defaulted'].sum()} defaults)")
Source : local file (data/alt_finance/german_credit.csv)
Shape  : 1000 loans × 21 columns
Default rate: 30.0%  (300 defaults)

1. Explore the Data

Before modelling, understand what you’re working with.

Show code
# Numeric features available
num_cols = [c for c in ['duration', 'credit_amount', 'installment_rate',
                         'age', 'existing_credits', 'num_dependents']
            if c in df.columns]
print(df[num_cols + ['defaulted']].describe().round(1))
       duration  credit_amount  installment_rate     age  existing_credits  \
count    1000.0         1000.0            1000.0  1000.0            1000.0   
mean       20.9         3271.3               3.0    35.5               1.4   
std        12.1         2822.7               1.1    11.4               0.6   
min         4.0          250.0               1.0    19.0               1.0   
25%        12.0         1365.5               2.0    27.0               1.0   
50%        18.0         2319.5               3.0    33.0               1.0   
75%        24.0         3972.2               4.0    42.0               2.0   
max        72.0        18424.0               4.0    75.0               4.0   

       num_dependents  defaulted  
count          1000.0     1000.0  
mean              1.2        0.3  
std               0.4        0.5  
min               1.0        0.0  
25%               1.0        0.0  
50%               1.0        0.0  
75%               1.0        1.0  
max               2.0        1.0  
Show code
# Do defaults vary with duration and credit amount?
if 'duration' in df.columns and 'credit_amount' in df.columns:
    fig, axes = plt.subplots(1, 2, figsize=(10, 4))
    for ax, col, label in zip(axes,
                               ['duration', 'credit_amount'],
                               ['Loan Duration (months)', 'Credit Amount (DM)']):
        for grp, lbl in [(0, 'Repaid'), (1, 'Defaulted')]:
            df.loc[df['defaulted'] == grp, col].plot.hist(
                bins=20, alpha=0.6, ax=ax, label=lbl)
        ax.set_xlabel(label)
        ax.legend()
    plt.tight_layout(); plt.show()

Q1 (think): Longer loans appear riskier. Is this a causal relationship, or might duration be a proxy for something else?

Longer-duration loans have higher default rates for several reasons. First, a longer repayment window increases exposure to life shocks (job loss, illness). Second, larger credit amounts are often granted as longer-term loans, so duration may proxy for amount. Third, lenders sometimes offer longer terms as a concession to higher-risk borrowers, which is adverse selection embedded in the product design.

Crucially, this is correlation, not causation. Shortening all loans would not automatically reduce defaults. The prediction-vs-causation distinction from Week 1 applies directly: using duration in a predictive model is legitimate, but it does not imply a policy prescription.


2. Prepare Features

Show code
# Identify numeric and categorical columns
cat_cols = [c for c in df.columns if df[c].dtype == 'object']

print(f"Numeric  ({len(num_cols)}): {num_cols}")
print(f"Categorical ({len(cat_cols)}): {cat_cols}")
Numeric  (6): ['duration', 'credit_amount', 'installment_rate', 'age', 'existing_credits', 'num_dependents']
Categorical (13): ['checking_status', 'credit_history', 'purpose', 'savings_status', 'employment', 'personal_status', 'other_parties', 'property_magnitude', 'other_plans', 'housing', 'job', 'own_telephone', 'foreign_worker']
Show code
# Scale numerics; one-hot encode categoricals
scaler = StandardScaler()
X_num = pd.DataFrame(scaler.fit_transform(df[num_cols]), columns=num_cols)

if cat_cols:
    X_cat = pd.get_dummies(df[cat_cols], drop_first=True)
    X_all = pd.concat([X_num, X_cat], axis=1)
else:
    X_all = X_num

y = df['defaulted']
print(f"Feature matrix: {X_all.shape[0]} rows × {X_all.shape[1]} columns")
print(f"  Numeric: {len(num_cols)}  |  One-hot encoded: {X_all.shape[1] - len(num_cols)}")
Feature matrix: 1000 rows × 47 columns
  Numeric: 6  |  One-hot encoded: 41

Q2 (think): Why can’t we feed the string “A11” directly into logistic regression? What’s wrong with converting A11→1, A12→2, A13→3?

Logistic regression requires numeric inputs with meaningful magnitudes. The string “A11” cannot be interpreted mathematically.

Integer label encoding (A11=1, A12=2, A13=3) imposes an ordering and a spacing that the codes do not carry. The model would treat category 3 as “three times as large” as category 1, and would force the step from A11 to A12 to shift the log-odds by exactly as much as the step from A12 to A13, which is meaningless for nominal categories like account type. One-hot encoding creates a separate 0/1 column for each category, allowing the model to learn an independent coefficient per level without imposing any artificial ordering. drop_first=True removes one column per feature to avoid perfect multicollinearity (the dummy variable trap).
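A caveat on the scaling cell above: StandardScaler is fit on all 1,000 rows before any train/test split, so the test rows contribute to the means and standard deviations used in training. With six numeric features the effect is probably small, but the cleaner pattern is to put the scaler inside a pipeline so that it is re-fit on the training rows of each split. The sketch below illustrates this on synthetic data from make_classification (a stand-in, not the German Credit data); the names X_demo, y_demo, pipe, cv_demo and scores_demo are introduced here for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for six numeric features with ~30% positives
# (illustrative only; NOT the German Credit data)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=6, n_informative=4,
    weights=[0.7, 0.3], random_state=42)

# The scaler sits inside the pipeline, so each CV fold fits it
# on that fold's training rows only and then transforms the held-out rows
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores_demo = cross_val_score(pipe, X_demo, y_demo, cv=cv_demo, scoring='roc_auc')
print(f"Pipeline AUC (5-fold): {scores_demo.mean():.3f} ± {scores_demo.std():.3f}")
```

In the lab itself, the same idea would mean passing the unscaled df[num_cols] to a pipeline like pipe instead of the pre-scaled X_num; the reported AUCs may shift slightly as a result.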


3. Baseline: Numeric Features Only

Start simple: fit a model on numeric features alone. This is what a traditional bank might do without credit history records.

Show code
X_tr, X_te, y_tr, y_te = train_test_split(
    X_num, y, test_size=0.3, random_state=42, stratify=y)

model_num = LogisticRegression(max_iter=1000, random_state=42)
model_num.fit(X_tr, y_tr)

proba_num = model_num.predict_proba(X_te)[:, 1]
auc_num = roc_auc_score(y_te, proba_num)

print(f"Numeric-only AUC: {auc_num:.3f}")
print(f"Baseline (random classifier): 0.500")
Numeric-only AUC: 0.684
Baseline (random classifier): 0.500

Q3 (think): AUC ≈ 0.68. In plain English, what does this tell you about the model? Is it good enough for a real lender?

An AUC of about 0.68 means the model ranks a randomly chosen defaulter above a randomly chosen non-defaulter roughly 68% of the time. It is clearly better than random (0.50), but well below what a production credit model needs. Commercial bureau-based models typically achieve 0.75–0.85 using rich payment history. Our numeric-only model uses just six features, with no credit history, savings balance, or employment status, so the gap is expected. It is a useful starting point, not a deployable product.
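The ranking interpretation of AUC can be checked by brute force. The sketch below compares every (defaulter, non-defaulter) pair and counts how often the defaulter gets the higher score, with ties counted as half. The function name pairwise_auc and the toy labels and scores are made up for illustration; they are not taken from the lab's data.

```python
from itertools import product

def pairwise_auc(y_true, scores):
    """Share of (positive, negative) pairs where the positive scores higher."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    # A win counts 1, a tie counts 0.5, a loss counts 0
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p, n in product(pos, neg))
    return wins / (len(pos) * len(neg))

# Toy example: 3 defaulters (1) and 4 repaid loans (0)
y_toy = [1, 1, 1, 0, 0, 0, 0]
s_toy = [0.80, 0.45, 0.30, 0.60, 0.40, 0.20, 0.10]

print(pairwise_auc(y_toy, s_toy))  # 9 of 12 pairs ranked correctly -> 0.75
```

Called as pairwise_auc(y_te, proba_num) on the lab's objects, this should agree with roc_auc_score, only more slowly, since it loops over all pairs.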


4. Richer Model: Add Categorical Features

Now add the coded categorical variables (checking account status, credit history, savings, employment, purpose, etc.) and measure the AUC improvement.

Show code
X_tr_all, X_te_all, y_tr_all, y_te_all = train_test_split(
    X_all, y, test_size=0.3, random_state=42, stratify=y)

model_all = LogisticRegression(max_iter=2000, random_state=42, C=0.3)
model_all.fit(X_tr_all, y_tr_all)

proba_all = model_all.predict_proba(X_te_all)[:, 1]
auc_all = roc_auc_score(y_te_all, proba_all)

print(f"Numeric-only AUC : {auc_num:.3f}")
print(f"All features AUC : {auc_all:.3f}  (+{auc_all - auc_num:.3f})")
Numeric-only AUC : 0.684
All features AUC : 0.802  (+0.118)
Show code
# Which features push predicted risk up or down most?
coefs = pd.Series(model_all.coef_[0], index=X_all.columns)
print("Top 5: INCREASE default risk")
print(coefs.nlargest(5).round(3).to_string())
print("\nTop 5: DECREASE default risk")
print(coefs.nsmallest(5).round(3).to_string())
Top 5: INCREASE default risk
purpose_A46                0.578
property_magnitude_A124    0.549
duration                   0.346
installment_rate           0.300
credit_history_A31         0.293

Top 5: DECREASE default risk
checking_status_A14   -1.317
savings_status_A65    -0.846
credit_history_A34    -0.758
savings_status_A64    -0.684
other_plans_A143      -0.637

Q4 (think): Do the top features make economic sense? Are any potentially unfair under UK equality law?

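Mostly. Because drop_first=True removes the first level of each feature, every categorical coefficient is measured relative to that dropped baseline. For checking status the baseline is checking_status_A11 (overdrawn account), so the large negative coefficient on checking_status_A14 says those applicants look much safer than overdrawn ones, which makes economic sense. duration and installment_rate increasing risk is also intuitive: longer loans and heavier repayment burdens expose lenders to more uncertainty. Codes such as purpose_A46 or credit_history_A31 should be looked up in the UCI codebook before reading a story into their signs.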

age typically carries a negative coefficient (older → lower predicted default), which is actuarially plausible but raises a fairness concern. Under the UK Equality Act 2010, age is a protected characteristic in financial services. Using age is legally permissible if there is an objective justification and the difference is proportionate, but a lender must document this and cannot simply argue "it predicts well." The practical concern is that age may be proxying for wealth and stable employment rather than genuine repayment ability, meaning a young applicant with a stable income and good cash flow might be unfairly penalised.
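Logistic regression coefficients are on the log-odds scale, which makes their size hard to judge directly. Exponentiating a coefficient gives an odds ratio, which is often easier to discuss. The sketch below does this for a few coefficients copied from the printed output above; the dictionary names coef_demo and odds_ratios are introduced here for illustration.

```python
import math

# Coefficients copied from the printed output above (log-odds scale)
coef_demo = {
    "checking_status_A14": -1.317,
    "savings_status_A65": -0.846,
    "purpose_A46": 0.578,
    "duration": 0.346,  # scaled numeric feature: effect per one standard deviation
}

# exp(coefficient) = multiplicative change in the odds of default
odds_ratios = {name: math.exp(b) for name, b in coef_demo.items()}

for name, ratio in odds_ratios.items():
    print(f"{name:<22} odds ratio = {ratio:.2f}")
```

An odds ratio of roughly 0.27 for checking_status_A14 means the odds of default are about a quarter of those for the dropped baseline level, holding the other features fixed. Because the model is regularised (C=0.3), these are shrunken estimates and should be read as indicative rather than exact.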


5. Proper Validation: Cross-Validation

A single train/test split depends on which 300 loans happened to land in the test set, so results can be optimistic or pessimistic by chance. 5-fold stratified cross-validation gives a stable estimate with honest uncertainty.

Show code
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

scores_num = cross_val_score(
    LogisticRegression(max_iter=1000), X_num, y, cv=cv, scoring='roc_auc')
scores_all = cross_val_score(
    LogisticRegression(max_iter=2000, C=0.3), X_all, y, cv=cv, scoring='roc_auc')

print(f"Numeric-only (5-fold): {scores_num.mean():.3f} ± {scores_num.std():.3f}")
print(f"All features (5-fold): {scores_all.mean():.3f} ± {scores_all.std():.3f}")
print(f"\nFold-by-fold (all features): {scores_all.round(3)}")
Numeric-only (5-fold): 0.634 ± 0.028
All features (5-fold): 0.788 ± 0.019

Fold-by-fold (all features): [0.786 0.758 0.781 0.817 0.796]

Q5 (think): Compare the CV mean to your single-split result. Is the improvement from adding categorical features consistent across all five folds, or does it disappear in some?

The CV mean is the more reliable estimate because it averages over five different train/test partitions (not fully independent, since the training sets overlap), which reduces the influence of a lucky or unlucky split. Here the single split was optimistic for both models: 0.684 against a CV mean of 0.634 for numeric-only, and 0.802 against 0.788 for all features.

The standard deviation across folds (e.g. ±0.02–0.03 is typical) tells you about model stability. If the improvement from adding categorical features holds in four out of five folds, it is a genuine signal. If it varies wildly, positive in some folds, negative in others, it suggests the features are noisy rather than consistently informative. In a real deployment decision, you would want the improvement to be stable and larger than the fold-to-fold standard deviation before incurring the engineering cost of collecting and storing categorical data.
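Because both models are scored on the same five folds (the same cv object with a fixed random_state), the fold scores can be compared pairwise. The helper below is a minimal sketch of that comparison; paired_fold_gain is a hypothetical name. In the demo call, the all-features scores are copied from the fold-by-fold output above, while the numeric-only scores are invented placeholders, since the lab does not print them.

```python
from statistics import mean, stdev

def paired_fold_gain(scores_base, scores_rich):
    """Per-fold AUC gain of the richer model over the baseline."""
    gains = [float(r) - float(b) for b, r in zip(scores_base, scores_rich)]
    return {
        "gains": [round(g, 3) for g in gains],
        "mean_gain": mean(gains),
        "sd_gain": stdev(gains),  # sample SD (n-1); np.std defaults to population SD
        "folds_improved": sum(g > 0 for g in gains),
    }

# HYPOTHETICAL numeric-only fold scores (placeholders, not lab output)
base_demo = [0.60, 0.65, 0.62, 0.66, 0.64]
# All-features fold scores copied from the printed output above
rich_demo = [0.786, 0.758, 0.781, 0.817, 0.796]

result = paired_fold_gain(base_demo, rich_demo)
print(result["gains"], f"improved in {result['folds_improved']}/5 folds")
```

With the lab's own arrays the call would be paired_fold_gain(scores_num, scores_all). A mean gain several times larger than its fold-to-fold standard deviation, positive in every fold, is the stable pattern the answer above asks for.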


6. Calibration Check

Good AUC means the model ranks risk correctly. Calibration checks whether the predicted numbers are trustworthy. If the model says “30% default probability”, do 30% of those borrowers actually default?

Show code
prob_true, prob_pred = calibration_curve(y_te_all, proba_all, n_bins=8, strategy='quantile')

fig, ax = plt.subplots(figsize=(7, 5))
ax.plot([0, 1], [0, 1], 'k--', alpha=0.5, label='Perfectly calibrated')
ax.plot(prob_pred, prob_true, 'o-', linewidth=2, markersize=7, label='Our model')
ax.set_xlabel('Predicted probability'); ax.set_ylabel('Observed default rate')
ax.set_title('Calibration Plot'); ax.legend(); ax.grid(alpha=0.3)
plt.tight_layout(); plt.show()

mae = np.mean(np.abs(prob_true - prob_pred))
print(f"Mean calibration error: {mae:.3f}  ({mae*100:.1f} pp on average)")

Mean calibration error: 0.068  (6.8 pp on average)

Q6 (think): Is your model over- or under-confident? What goes wrong for a lending platform if the model systematically overestimates default risk?

The plot has predicted probability on the x-axis and observed default rate on the y-axis. A model overestimates default risk (the "over-confident" case in the question) if points lie below the diagonal (e.g. it predicts 40% default but only 25% actually default). It underestimates risk if points lie above the diagonal (it predicts 15% but 30% default).

For a platform, overestimating risk means loans are priced too expensively. Borrowers who are genuinely creditworthy are assigned high-risk grades, charged excessive interest rates, or rejected outright. The platform loses market share to competitors who price more accurately and excludes creditworthy borrowers, exactly the opposite of the inclusion narrative.

Underestimating risk is the more dangerous failure: risk is underpriced, investors earn less than expected, and capital flees the platform. Calibration is therefore as critical as AUC in production. A model that discriminates perfectly but is 10 percentage points miscalibrated will systematically misjudge loan profitability.
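The geometry of the calibration plot is easy to get backwards, so a small helper can make the direction explicit. With predicted probability on the x-axis and observed default rate on the y-axis, the sign of observed minus predicted says which side of the diagonal a point falls on. The function name calibration_direction and the tolerance of one percentage point are assumptions made for this sketch.

```python
def calibration_direction(predicted, observed, tol=0.01):
    """Classify one calibration-plot point (x = predicted, y = observed)."""
    gap = observed - predicted
    if gap > tol:
        return "above diagonal: risk underestimated"
    if gap < -tol:
        return "below diagonal: risk overestimated"
    return "on diagonal: well calibrated"

# The two numeric examples used in the answer above
print(calibration_direction(0.40, 0.25))  # below diagonal: risk overestimated
print(calibration_direction(0.15, 0.30))  # above diagonal: risk underestimated
```

On the lab's arrays this could be applied bin by bin, for example [calibration_direction(p, t) for p, t in zip(prob_pred, prob_true)]. Note that the mean calibration error printed in the lab uses absolute values, so it measures the size of the miscalibration but not its direction.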


7. Risk Grades & Investor Returns

Credit platforms translate predicted probabilities into risk grades (A–D) and set interest rates accordingly. Let’s calculate simulated investor returns by grade on the German Credit test set, using an assumed rate schedule and loss rule.

Show code
df_te = df.loc[y_te_all.index].copy()
df_te['default_prob'] = proba_all

# Assign grades based on predicted probability
df_te['grade'] = pd.cut(
    df_te['default_prob'],
    bins=[0, 0.20, 0.35, 0.50, 1.01],
    labels=['A', 'B', 'C', 'D'])

# Risk-based interest rates
rate_map = {'A': 0.06, 'B': 0.10, 'C': 0.15, 'D': 0.22}
df_te['rate'] = df_te['grade'].map(rate_map).astype(float)

# Net return: interest minus 1% servicing (if repaid) or annualised principal loss (if defaulted)
df_te['investor_return'] = np.where(
    df_te['defaulted'] == 0,
    df_te['rate'] - 0.01,
    -1.0 / 3)   # lose principal over 3-year loan term

summary = df_te.groupby('grade', observed=True).agg(
    n=('defaulted', 'count'),
    default_rate=('defaulted', 'mean'),
    interest_rate=('rate', 'mean'),
    net_return=('investor_return', 'mean')
).round(3)

print(summary.to_string())
         n  default_rate  interest_rate  net_return
grade                                              
A      123         0.098           0.06       0.013
B       69         0.275           0.10      -0.027
C       47         0.340           0.15      -0.021
D       61         0.705           0.22      -0.173
Show code
fig, ax = plt.subplots(figsize=(7, 4))
colors = ['#2ca02c' if r > 0 else '#d62728' for r in summary['net_return']]
ax.bar(summary.index, summary['net_return'] * 100, color=colors, alpha=0.85, edgecolor='k')
ax.axhline(0, color='black', linewidth=1, linestyle='--')
ax.set_xlabel('Risk Grade'); ax.set_ylabel('Net Annual Return (%)')
ax.set_title('Investor Returns After Defaults & Fees (German Credit test set)')
ax.grid(axis='y', alpha=0.3)
plt.tight_layout(); plt.show()

Q7 (think): Grade D loans carry 22% interest. Should investors lend to them? What does your result suggest about high-interest lending?

High interest rates do not guarantee high returns. Grade D loans carry high predicted default probabilities; when a large fraction of borrowers default and investors lose roughly 33% of principal annually on those loans (spread over a 3-year term), the expected net return turns negative even at 22% interest.

This is the "high-risk trap" documented in marketplace lending literature: platforms advertise attractive headline rates on risky loans, but actual investor returns depend on whether pricing adequately compensates for realised default losses. In this test set only Grade A earns a positive net return (about 1.3%); Grades B and C lose roughly 2–3% a year and Grade D about 17%, so under the assumed rate schedule every grade except A is underpriced. A rational investor should focus on expected net return after defaults and fees, not the headline interest rate, which is precisely the calculation you have just performed. The LendingClub data tells the same story: investors in lower-grade loans often earned less than those in Grade B–C loans, because defaults overwhelmed the extra interest.
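The return rule used in the code above (repaid loans earn the rate minus a 1% fee, defaulted loans lose one third of principal per year) implies a break-even interest rate for any default rate. If p is the default rate, r the interest rate, f the fee and L the annual loss on default, expected net return is (1 - p)(r - f) - pL, which is zero when r = f + pL / (1 - p). The sketch below applies this to the grade table; the function names are hypothetical, and the default rates are the rounded values from the printed summary.

```python
def expected_net_return(rate, default_rate, fee=0.01, annual_loss=1/3):
    """Expected annual return per unit lent under the lab's simple rule."""
    return (1 - default_rate) * (rate - fee) - default_rate * annual_loss

def break_even_rate(default_rate, fee=0.01, annual_loss=1/3):
    """Interest rate at which expected net return is zero (needs default_rate < 1)."""
    return fee + default_rate * annual_loss / (1 - default_rate)

# (observed default rate, interest rate charged), copied from the summary table
grades = {"A": (0.098, 0.06), "B": (0.275, 0.10),
          "C": (0.340, 0.15), "D": (0.705, 0.22)}

for g, (p, r) in grades.items():
    print(f"{g}: charged {r:.0%}, break-even {break_even_rate(p):.1%}, "
          f"expected net {expected_net_return(r, p):+.3f}")
```

Under these assumptions a Grade D loan would need a rate of roughly 81% just to break even, far above the 22% charged, while Grade A breaks even below its 6% rate. The expected-net figures may differ from the summary table in the third decimal because the default rates used here are rounded.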


8. Extension: Fairness

The model uses age. Older applicants tend to receive better predicted scores.

Task: Create an age-group column (e.g. under-30, 30–50, over-50) and compare mean predicted default probability by group. Does the model give systematically better rates to older borrowers?

Show code
if 'age' in df_te.columns:
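    # pd.cut bins are right-inclusive: (0, 30], (30, 50], (50, 100],
    # so a 30-year-old lands in '<30' and a 50-year-old in '30-50'.
    df_te['age_group'] = pd.cut(df_te['age'], bins=[0, 30, 50, 100],
                                  labels=['<30', '30-50', '50+'])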
    print(df_te.groupby('age_group', observed=True)[['default_prob', 'defaulted']].mean().round(3))
           default_prob  defaulted
age_group                         
<30               0.355      0.382
30-50             0.263      0.246
50+               0.244      0.205

Q8 (think): Is age-based credit scoring fair? When (if ever) is it legally permissible under UK law?

Under the UK Equality Act 2010, age is a protected characteristic. Using age in a credit model is not automatically unlawful: Section 13(2) allows a difference in treatment on grounds of age to be justified if it is a proportionate means of achieving a legitimate aim. For a lender, "predicting default accurately" is a legitimate aim, and if age genuinely carries independent predictive power beyond other features, its use may be proportionate.

However, the key question is whether age is proxying for wealth, employment stability, or credit history, characteristics that could be measured directly and more fairly. A younger applicant who has stable income, no existing debts, and consistent cash flow may be denied a loan or charged high rates not because of actual risk, but because they have not yet accumulated the credit history that older borrowers carry. This is the inclusion paradox: the very population alternative finance claims to serve (young adults with thin files) is often penalised by the same models meant to include them. Regulators increasingly expect lenders to audit for disparate impact even when the algorithm is facially neutral.
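One simple check that complements the age-group table above is to compare, group by group, what the model predicts with what actually happened. The sketch below uses the three rows copied from that printed table; the names age_table and gaps are introduced for illustration, and this is a rough diagnostic rather than a formal fairness audit.

```python
# (mean predicted default probability, observed default rate) per age group,
# copied from the printed table above
age_table = {
    "<30":   (0.355, 0.382),
    "30-50": (0.263, 0.246),
    "50+":   (0.244, 0.205),
}

# Positive gap: the model predicts more risk than was observed for that group
gaps = {g: round(pred - obs, 3) for g, (pred, obs) in age_table.items()}

for g, gap in gaps.items():
    direction = "over-predicts" if gap > 0 else "under-predicts"
    print(f"{g:<6} predicted - observed = {gap:+.3f}  ({direction} risk)")
```

On this test split the model under-predicts default for the under-30 group by about 2.7 points and over-predicts it for the 50+ group by about 3.9 points, so the gap in predicted risk between young and old is, if anything, smaller than the gap in observed default rates. With only 300 test loans split three ways these differences are noisy, and they say nothing about whether age is standing in for wealth or employment stability.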