Glossary of Key Terms

Financial Data Science — FIN306 / FIN510 / FIN720

This glossary collects key terms introduced across the course. Definitions are written for a non-specialist audience but include formal references for those who wish to pursue a topic further. The glossary grows week by week — terms are added as they are introduced in lectures.

A

Alpha: In investment management, the excess return of a strategy or fund over a benchmark (or over the return expected for the level of risk taken). Jensen’s formulation measures alpha as the intercept in a regression of fund returns on market returns — i.e. the return not explained by exposure to the market (Jensen 1968). “Genuine alpha” means the outperformance is attributable to skill or information, not luck or overfitting. A backtest may show impressive alpha in-sample, but if the strategy only worked in one market regime (e.g. bull markets), that alpha is regime-dependent and not robust. See backtest and regime-dependence.

ARIMA: Autoregressive Integrated Moving Average. A classical linear time series model combining three components: an autoregressive (AR) term modelling the current value as a function of $p$ past values; a moving average (MA) term using $q$ past forecast errors; and an integration (I) parameter $d$ that differences the series to achieve stationarity. Notation: ARIMA($p$, $d$, $q$). A random walk is ARIMA(0,1,0); a stationary AR(1) is ARIMA(1,0,0). See Box and Draper (1987) for the classic treatment; Campbell, Lo, and MacKinlay (1997) for financial applications. See GARCH for modelling time-varying volatility.

Attention mechanism: A computational mechanism that allows a neural network to weight the relevance of each element of an input sequence dynamically, rather than compressing the full sequence into a single fixed vector. Formally, given query $Q$, key $K$, and value $V$ matrices: $\text{Attention}(Q, K, V) = \text{softmax}(QK^\top/\sqrt{d_k})V$. The softmax of the scaled dot products acts as an adaptive kernel, assigning each token its own weighted average over all other tokens. Introduced in its modern form by Vaswani et al. (2017) as the sole mechanism of the transformer architecture, replacing RNN hidden states entirely. The matrix $QK^\top$ has an eigenvalue structure: each attention head can be interpreted as learning a different dominant eigenvector of the token-interaction matrix, capturing a distinct semantic relationship. Finance analogy: attention is a weighted conditional expectation where the weights are learned, not prescribed. See transformer, LSTM, and embeddings.
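
The formula above can be sketched in plain Python for a single query. This is a minimal illustration, not an efficient implementation: the function names `softmax` and `attention` and the toy matrices are assumptions made for the example, and real systems use batched tensor libraries.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    outputs, weights = [], []
    for q in Q:
        # Scaled dot product of this query with every key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)
        weights.append(w)
        # Weighted average of the value vectors, column by column.
        outputs.append([sum(wj * v[c] for wj, v in zip(w, V)) for c in range(len(V[0]))])
    return outputs, weights

# Toy example (hypothetical numbers): one query, two keys, scalar values.
Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0], [20.0]]
out, weights = attention(Q, K, V)
```

The query aligns with the first key, so the first value receives the larger weight and the output lies between the two values, closer to the first.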

Autocorrelation: The correlation of a time series with its own past values. If today’s return is positively autocorrelated, a positive day tends to be followed by another positive day. Formally, the autocorrelation at lag $k$ is $\rho(k) = \text{Corr}(r_t, r_{t-k})$. For daily financial returns, autocorrelation is close to zero — meaning tomorrow’s return is nearly unpredictable from today’s. By contrast, the autocorrelation of squared returns (a proxy for volatility) is large and persistent, reflecting volatility clustering. See Cont (2001).
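
As a minimal sketch of the lag-$k$ definition above, the snippet below estimates the sample autocorrelation in plain Python. The function name `autocorr` and the toy series are illustrative assumptions, not course data.

```python
from statistics import fmean

def autocorr(x, k):
    """Sample autocorrelation of series x at lag k (1 <= k < len(x))."""
    n = len(x)
    if not 1 <= k < n:
        raise ValueError("lag k must satisfy 1 <= k < len(x)")
    mean = fmean(x)
    # Denominator: total sum of squared deviations from the mean.
    denom = sum((v - mean) ** 2 for v in x)
    # Numerator: cross-products of deviations k steps apart.
    num = sum((x[t] - mean) * (x[t - k] - mean) for t in range(k, n))
    return num / denom

# Toy series that alternates sign: strongly negative at lag 1, positive at lag 2.
alternating = [1.0, -1.0] * 50
rho1 = autocorr(alternating, 1)
rho2 = autocorr(alternating, 2)
```

To look for volatility clustering, the same function can be applied to squared returns, for example `autocorr([r ** 2 for r in returns], 1)`, where `returns` is whatever return series is being studied.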

AUM (Assets Under Management): The total market value of assets that a fund or financial institution manages on behalf of clients. Used to measure the scale of asset managers and to calculate fee income (e.g., a 0.25% fee on £1bn AUM = £2.5m revenue).

B

Byzantine Fault Tolerant (BFT) consensus: A class of distributed consensus protocols that reach agreement correctly even when up to $f$ of $3f+1$ participants behave arbitrarily — including sending false information, colluding, or simply failing. The name comes from the Byzantine Generals Problem (Lamport, Shostak & Pease, 1982): imagine Byzantine army generals communicating by messenger, some of whom may be traitors sending conflicting orders. The problem asks whether loyal generals can still agree on a plan despite traitors. The answer is yes, provided traitors are fewer than one-third of total participants.

In blockchain contexts, BFT protocols are used by permissioned chains such as Hyperledger Fabric, where validators are known in advance and identified by legal identity rather than anonymous proof-of-work. Because the validator set is fixed and bounded, BFT achieves immediate finality: once a block is committed, it cannot be reversed. This contrasts with Bitcoin’s probabilistic finality, where a transaction is considered safe only after several subsequent blocks have confirmed it (~60 minutes for 6 confirmations).

For fraud detection, the finality model determines response time: a BFT-based enterprise ledger can flag and freeze a transaction within seconds; a PoW-based network requires an hour. See Proof-of-Work, Proof-of-Stake, and DeFi exploit.
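
The "$f$ of $3f+1$" bound can be turned into a small helper. This is only arithmetic on the bound stated above, under the assumption that all validators have equal weight; the function names are hypothetical.

```python
def max_byzantine_faults(n_validators):
    """Largest f such that n_validators >= 3f + 1."""
    if n_validators < 1:
        raise ValueError("need at least one validator")
    return (n_validators - 1) // 3

def min_validators(f):
    """Smallest validator set that tolerates f arbitrary (Byzantine) faults."""
    if f < 0:
        raise ValueError("f must be non-negative")
    return 3 * f + 1

# Tolerance for a few validator-set sizes.
tolerance = {n: max_byzantine_faults(n) for n in (3, 4, 7, 10)}
```

Three validators tolerate no Byzantine fault at all, while four tolerate one, which is why small permissioned networks are usually sized at $3f+1$.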

Backtest: A retrospective test of an investment strategy or model using historical data, as if the strategy had been applied in the past. A backtest can demonstrate that a strategy would have worked, but cannot prove it will work in future. The gap between in-sample backtest performance and live performance is systematically inflated by overfitting, look-ahead bias, and survivorship bias. Harvey and Liu (2015) provide a comprehensive treatment of backtest methodology and its limitations. See also walk-forward validation.

Bias-variance tradeoff: A fundamental property of statistical learning: any model must balance two sources of prediction error. Bias arises when the model is too constrained to capture the true relationship (underfitting); variance arises when the model is so flexible it fits noise as well as signal (overfitting). For a model $\hat{f}$, the expected mean squared error decomposes as $\text{Bias}^2[\hat{f}] + \text{Var}[\hat{f}] + \sigma^2_\varepsilon$, where $\sigma^2_\varepsilon$ is irreducible noise. In portfolio optimisation, naive MPT has high variance: small changes in estimated returns produce wildly different optimal weights. See Murphy (2012) for the formal treatment; Prado (2018) for financial ML applications. See overfitting and estimation error.

Black-Litterman model: A Bayesian framework for portfolio construction that combines the market equilibrium portfolio with investor views (Reher and Sokolinski 2024). The key insight: rather than estimating expected returns from historical data (which is noisy), start from market-implied returns — the returns that would make the current market portfolio optimal — and then adjust these according to specific views, weighted by the investor’s confidence in each view. The result is a much more stable and diversified set of inputs to MPT than raw historical estimates. Black-Litterman is the dominant approach in institutional portfolio management because it avoids the “garbage in, garbage out” problem of unconstrained mean-variance optimisation. It is a form of shrinkage in which the market portfolio serves as the prior. See Modern Portfolio Theory, estimation error, and shrinkage.

C

Cardinality constraint: A constraint on a portfolio optimisation problem that limits the number of assets held to some maximum $K$ out of a universe of $N$ assets. For example: “hold no more than 50 stocks from the S&P 500.” Adding cardinality constraints transforms the problem from a smooth quadratic programme (solvable exactly by classical methods) into a combinatorial search problem that is NP-hard: there are $\binom{N}{K}$ possible portfolios of exactly $K$ assets to evaluate, a number that grows exponentially. For $N = 500$ and $K = 50$, the number of combinations is on the order of $10^{69}$, far too many for any algorithm to enumerate one by one. This is a principal motivation for evolutionary algorithms in portfolio optimisation. See W. Liu et al. (2024).
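
The size of the search space for the $N = 500$, $K = 50$ example can be checked directly with Python's exact integer arithmetic. The helper name `n_portfolios` is an assumption for illustration, and it counts only the asset subsets of exactly $K$ names, ignoring the choice of weights.

```python
import math

def n_portfolios(n_assets, k_held):
    """Number of ways to choose exactly k_held assets out of n_assets."""
    return math.comb(n_assets, k_held)

count = n_portfolios(500, 50)
# Number of decimal digits in the exact count.
digits = len(str(count))
```

Here `digits` is 70, so the count is on the order of $10^{69}$ before a single portfolio weight has been chosen.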

Covariance matrix: A symmetric $N \times N$ matrix $\Sigma$ whose $(i,j)$ element is the covariance between the returns of assets $i$ and $j$. The diagonal entries are the variances of individual assets; the off-diagonal entries capture how assets move together. The covariance matrix is the central object in MPT: portfolio variance is $w^\top \Sigma w$ for weight vector $w$. For $N$ assets, the full matrix contains $N(N+1)/2$ unique parameters that must be estimated from historical data. For 1,000 assets, this is 500,500 parameters: a severe estimation error problem. The eigenvalue decomposition $\Sigma = V \Lambda V^\top$ reveals the matrix’s structure: large eigenvalues correspond to dominant risk factors; small ones are noise (see Marcenko-Pastur Law). See Modern Portfolio Theory, eigenvalue, and denoising.
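
A small sketch of the two calculations in this entry: the parameter count $N(N+1)/2$ and the portfolio variance $w^\top \Sigma w$. The two-asset covariance matrix and the equal weights are made-up numbers for illustration.

```python
def n_unique_params(n_assets):
    """Unique entries of a symmetric n x n covariance matrix."""
    return n_assets * (n_assets + 1) // 2

def portfolio_variance(w, cov):
    """Quadratic form w' * cov * w for a weight list w and nested-list cov."""
    n = len(w)
    return sum(w[i] * cov[i][j] * w[j] for i in range(n) for j in range(n))

# Hypothetical annualised covariance matrix for two assets.
cov = [[0.04, 0.006],
       [0.006, 0.01]]
w = [0.5, 0.5]
var_p = portfolio_variance(w, cov)
params_1000 = n_unique_params(1000)
```

The equal-weight portfolio has variance 0.0155, lower than the average of the two individual variances (0.025) because the covariance term is small.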

Cost-sensitive threshold: A classification decision boundary chosen to minimise expected cost rather than maximise accuracy. In rare-event problems (fraud detection, default prediction), the default 0.5 probability threshold is almost always wrong: it predicts the majority class for every observation, achieving high accuracy while catching zero events. A cost-sensitive threshold explicitly sets the relative penalties for false positives (e.g. investigation cost) and false negatives (e.g. regulatory fine), then selects the threshold that minimises total expected cost. The optimal threshold shifts towards zero as the cost ratio (false negative / false positive) increases. Different regulatory environments imply different cost ratios and therefore different thresholds, a fact students should discuss in their analysis. See walk-forward validation.
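
One way to make the cost argument concrete, assuming the classifier outputs calibrated probabilities: flagging an observation is worthwhile when $p \cdot C_{FN} > (1-p) \cdot C_{FP}$, which gives the threshold $p^* = C_{FP} / (C_{FP} + C_{FN})$. The costs, scores, and labels below are hypothetical.

```python
def optimal_threshold(cost_fp, cost_fn):
    """Cost-minimising threshold under calibrated probabilities."""
    return cost_fp / (cost_fp + cost_fn)

def realised_cost(probs, labels, threshold, cost_fp, cost_fn):
    """Total cost of false positives and false negatives at a threshold."""
    total = 0.0
    for p, y in zip(probs, labels):
        flagged = p >= threshold
        if flagged and y == 0:
            total += cost_fp   # investigated a legitimate case
        elif not flagged and y == 1:
            total += cost_fn   # missed a true event
    return total

# Hypothetical scores and true labels (1 = fraud).
probs = [0.02, 0.05, 0.10, 0.30, 0.60, 0.08]
labels = [0, 0, 1, 1, 1, 0]
cost_fp, cost_fn = 1.0, 20.0

t_star = optimal_threshold(cost_fp, cost_fn)
cost_default = realised_cost(probs, labels, 0.5, cost_fp, cost_fn)
cost_tuned = realised_cost(probs, labels, t_star, cost_fp, cost_fn)
```

With a 20:1 cost ratio the threshold falls to about 0.048; on this toy sample the total cost drops from 40 (two missed events) to 2 (two unnecessary investigations).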

Cross-validation (K-fold): A method for estimating out-of-sample prediction error using only the available data. The idea is to split the training sample into $K$ parts (called folds). You fit the model on $K-1$ folds and evaluate it on the held-out fold, repeating so that each fold is used once as the validation set. The average validation error approximates the error you should expect on new data.

Cross-validation is most often used to choose hyperparameters such as the regularisation strength ($\lambda$) in ridge regression or LASSO. It answers a prediction question: “Which setting minimises expected error on unseen data?” It does not establish causality or make coefficients interpretable.

In finance, naive random K-fold cross-validation can induce look-ahead bias because time series observations are not exchangeable. Use time-aware versions such as walk-forward validation, where training always uses past data and testing uses future data.

See out-of-sample testing, overfitting, and walk-forward validation.
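
A minimal sketch of a time-aware split, assuming an expanding training window and equal-sized test blocks. The function name `walk_forward_splits` and its parameters are hypothetical; libraries offer similar utilities, but this version uses only the standard library.

```python
def walk_forward_splits(n_obs, n_folds, min_train):
    """Yield (train_indices, test_indices) with training strictly before testing."""
    test_size = (n_obs - min_train) // n_folds
    if test_size < 1:
        raise ValueError("not enough observations for the requested folds")
    for i in range(n_folds):
        train_end = min_train + i * test_size
        test_end = train_end + test_size
        # Training uses everything up to train_end; testing uses the next block.
        yield range(0, train_end), range(train_end, test_end)

splits = list(walk_forward_splits(n_obs=10, n_folds=3, min_train=4))
```

For ten observations this gives training windows of length 4, 6, and 8, each followed by a two-observation test block, so no test observation ever precedes its training data.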

D

Deflated Sharpe Ratio (DSR): A multiple-testing-adjusted version of the Sharpe ratio that corrects for selection bias, backtest overfitting, non-normality of returns, and finite sample length, introduced by Bailey and Prado (2014). Practical implication: a Sharpe of 2.0 chosen as the best of 100 configurations means something fundamentally different from a Sharpe of 2.0 from a single pre-specified test. Reporting the number of configurations tested is therefore not optional: a Sharpe ratio without a trial count is uninterpretable as evidence of skill. See Prado (2018) for implementation.

Decision tree: A non-linear prediction model that approximates the conditional mean $E[Y \mid X]$ using a set of if-then rules that recursively partition the feature space. In regression settings, each split chooses a threshold on one predictor (for example, “momentum > 0.05?”) that reduces in-sample squared error the most among candidate splits. The terminal nodes (leaves) then predict a constant, typically the average outcome in that region of the training data.

Trees are attractive because they capture interactions automatically (the effect of value can differ depending on momentum) and produce rule-based explanations. Their main weakness is variance: a single deep tree is unstable, because small changes in the data can change the split structure. Ensemble methods such as random forests and gradient boosting stabilise trees by averaging or by sequential correction.

See overfitting, cross-validation, and walk-forward validation.

DeFi exploit (flash loan / oracle manipulation): A category of attack specific to Decentralised Finance (DeFi) protocols on programmable blockchains such as Ethereum. Unlike traditional financial fraud, DeFi exploits can complete entirely within a single transaction block (roughly 12 seconds), which renders conventional fraud monitoring useless.

Three principal attack patterns (Perez and Livshits 2021):

Flash loan attack: borrow an arbitrarily large sum with no collateral, provided the loan is repaid within the same transaction. The attacker uses the borrowed capital to manipulate a price (e.g. artificially inflate a token’s value in a thinly-traded pool), extract profit from a protocol that reads the manipulated price, then repays the loan. If any step fails, the entire transaction reverts — so the attacker risks nothing. Single exploits have exceeded $100M.
Oracle manipulation: DeFi protocols rely on on-chain “oracles” that report asset prices. An attacker with access to a flash loan can temporarily distort a thinly-traded token’s price, causing the protocol to misliquidate positions or allow under-collateralised borrowing at the artificial price.
Re-entrancy: a smart contract calls an external contract mid-execution; the external contract calls back into the original before its internal state is updated, allowing repeated fund withdrawals in a loop. The DAO hack (2016, ~$60M) is the canonical example (Atzei, Bartoletti, and Cimoli 2017).

Detection is fundamentally different from card fraud: there is no temporal drift across months, no customer profile, and the adversary can iterate within blocks. Defence relies on on-chain simulation before transaction inclusion, formal verification of contract logic, and anomaly detection in mempool data. See Isolation Forest and hybrid model.

Decentralised Finance (DeFi): Financial services implemented as self-executing smart contracts on a programmable blockchain (most commonly Ethereum), operating without a central intermediary. Core DeFi primitives include decentralised exchanges (constant-product automated market makers such as Uniswap), lending protocols (Aave, Compound), and stablecoins (DAI). Because the contract logic is public and execution is deterministic, DeFi is auditable but also fully attackable: any logical flaw in a contract can be exploited by anyone with sufficient capital, often atomically within a single block. See DeFi exploit.

Denoising (covariance matrix): The process of removing statistical noise from an estimated covariance matrix before using it in portfolio optimisation. When a covariance matrix is estimated from limited data, many of its eigenvalues reflect sampling noise rather than genuine asset relationships — the Marcenko-Pastur Law provides a theoretical boundary between signal and noise eigenvalues. Denoising proceeds in three steps: (1) eigendecompose the sample covariance matrix $\Sigma = V \Lambda V^\top$; (2) identify noise eigenvalues (those below the Marcenko-Pastur upper bound $\lambda_+$) and replace them with their mean; (3) reconstruct the denoised matrix $\hat{\Sigma} = V \hat{\Lambda} V^\top$. The result is a more stable, better-conditioned matrix that produces more robust portfolio weights out of sample. Denoising is a principled form of shrinkage grounded in Random Matrix Theory. See Random Matrix Theory, Marcenko-Pastur Law, and W. Liu et al. (2024).

E

Efficient frontier: The set of portfolios that offer the highest expected return for a given level of risk (variance), or equivalently, the lowest risk for a given expected return. Introduced by Markowitz (1952) as the central concept of Modern Portfolio Theory. Portfolios on the efficient frontier are called mean-variance efficient; those below it are dominated — there exists another portfolio with the same risk and higher return, or the same return and lower risk. The shape of the efficient frontier depends entirely on the expected returns and covariance matrix of the assets, which must be estimated from data. Because these estimates are noisy (estimation error), the empirical efficient frontier can differ dramatically from the true frontier. Adding more objectives (e.g. skewness, transaction costs) transforms the efficient frontier into a Pareto frontier in a higher-dimensional objective space.

Eigenvalue / Eigenvector: For a square matrix $A$, a scalar $\lambda$ and non-zero vector $v$ satisfying $Av = \lambda v$ are called an eigenvalue and its corresponding eigenvector. The eigenvector is a direction that the matrix does not rotate — it is simply scaled by $\lambda$. In finance, the eigendecomposition of the covariance matrix, $\Sigma = V \Lambda V^\top$, is fundamental: the eigenvectors define the principal directions of risk, and the eigenvalues measure how much variance lies in each direction. The largest eigenvalue typically corresponds to the market factor (all assets tend to move together); the next few capture sector or style effects. Most eigenvalues are small and reflect estimation error rather than genuine risk structure — the Marcenko-Pastur Law formalises exactly which ones. The same concept generalises to rectangular matrices as the Singular Value Decomposition (SVD), which underlies word embeddings and LoRA fine-tuning of transformers — the identical mathematical insight applied to language models. See covariance matrix, Marcenko-Pastur Law, SVD, and denoising.

Embeddings: Dense vector representations of discrete objects (tokens, documents, entities) in a continuous $\mathbb{R}^d$ space, learned during model training. The key property: semantically similar objects are geometrically close (high cosine similarity). Static embeddings (Word2Vec, GloVe) assign one fixed vector per word type via matrix factorisation, the rectangular equivalent of eigendecomposition; contextual embeddings (transformer-based) assign different vectors depending on surrounding context. In the factorisation view, the embedding dimension $d$ is the number of singular vectors (analogous to eigenvectors) retained, that is, the rank of the low-rank approximation of the co-occurrence or representation matrix. For financial applications, embeddings trained on general web text may not accurately represent domain-specific regulatory vocabulary. See Vaswani et al. (2017).

Estimation error: In portfolio optimisation, the uncertainty in the estimated inputs — expected returns and the covariance matrix — due to limited historical data. The critical problem: mean-variance optimisation is highly sensitive to input estimates; small errors in expected returns lead to extreme, unstable portfolio weights. This is sometimes called the “error maximisation” problem — the optimiser treats estimation error as genuine information and doubles down on it. For daily data over 10 years (~2,500 observations) and 100 assets, the covariance matrix has 5,050 free parameters: the system is statistically underdetermined, and the empirical matrix is far from the true matrix. Estimation error is the principal reason that naive MPT frequently underperforms a simple equal-weight portfolio out of sample (Prado 2018). The solutions — shrinkage, Black-Litterman, denoising — all reduce the influence of noisy estimates. See Marcenko-Pastur Law for the theoretical characterisation.

ETF (Exchange-Traded Fund): A fund that tracks an index (e.g., S&P 500) and trades on a stock exchange like a share. ETFs typically have much lower expense ratios than actively managed mutual funds because they require no active stock-picking. The Bloomberg database used in this course contains eight ETFs: SPY, TLT, QQQ, GLD, EFA, BND, IWM, and VNQ, spanning equities, bonds, gold, real estate, and international exposure.

Evolutionary algorithm (MOEA): A class of population-based metaheuristic optimisation methods inspired by biological evolution — natural selection, mutation, and recombination. A population of candidate solutions (portfolios) evolves over generations: better solutions are more likely to survive and produce offspring; random mutations and crossover operations explore new combinations. Multi-Objective Evolutionary Algorithms (MOEAs) simultaneously optimise several conflicting objectives (e.g. maximise return, minimise variance, maximise skewness), producing a Pareto frontier of non-dominated solutions. Evolutionary algorithms are particularly suited to portfolio problems because they do not require the problem to be convex, differentiable, or tractable — they handle NP-hard problems with cardinality constraints, minimum transaction lots, and other real-world restrictions that defeat classical solvers. W. Liu et al. (2024) propose a MOEA framework for large-scale portfolio selection that handles both random and uncertain returns. See NP-hard, Pareto frontier, and cardinality constraint.

F

False discovery problem: In research and investment strategy, the problem that arises when many hypotheses are tested and only the successful results are reported. Under the null hypothesis of no true effect, each test at the 5% threshold produces a false positive with probability 5%; conducting $k$ independent tests yields on average $0.05k$ false positives. Harvey, Liu, and Zhu (2016) reviewed 316 published equity factors and showed that, after adjusting for multiple testing, the majority of claimed factor premia cannot be statistically distinguished from noise. See p-hacking and backtest.
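
The arithmetic behind this entry can be sketched as follows, assuming $k$ independent tests of true null hypotheses at significance level $\alpha$. The function names are illustrative.

```python
def expected_false_positives(k, alpha=0.05):
    """Average number of false positives across k independent null tests."""
    return alpha * k

def prob_at_least_one(k, alpha=0.05):
    """Probability that at least one of k independent null tests is 'significant'."""
    return 1.0 - (1.0 - alpha) ** k

efp_100 = expected_false_positives(100)
p_any_100 = prob_at_least_one(100)
```

With 100 tested configurations, about five spurious "discoveries" are expected, and the chance of finding at least one exceeds 99%, even though no strategy has any true effect.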

Fat tails: The property of a distribution in which extreme outcomes occur far more frequently than a normal distribution predicts. Measured by excess kurtosis — a normal distribution has kurtosis = 3 (excess kurtosis = 0); financial return series typically have excess kurtosis exceeding 5. The S&P 500’s daily return series exhibits excess kurtosis exceeding 12, meaning moves of 3 or more standard deviations occur roughly 10 times as often as normality implies. Risk models that assume normality (e.g. standard Value at Risk) systematically understate the probability of catastrophic losses. Classified as one of the canonical stylised facts by Cont (2001). See skewness and volatility clustering.

FinTech: Financial Technology. Broadly, technological change in financial services. Industry usage focuses on product verticals (payments, lending, wealth management, insurance). Academic usage focuses on changes to financial functions driven by lower information costs and shifts in market structure (Philippon 2016).

G

GARCH (Generalised Autoregressive Conditional Heteroskedasticity): A time series model for time-varying volatility, introduced by Bollerslev (1986) as an extension of Engle’s ARCH model (Engle 1982). The standard GARCH(1,1) specifies: $\sigma^2_t = \omega + \alpha \varepsilon^2_{t-1} + \beta \sigma^2_{t-1}$, where $\sigma^2_t$ is today’s conditional variance, $\varepsilon_{t-1}$ is the previous period’s return shock, and the constraint $\alpha + \beta < 1$ ensures variance is stationary. The model directly captures volatility clustering: large shocks increase tomorrow’s variance (through $\alpha$), which then persists (through $\beta$). Typical daily equity estimates: $\alpha \approx 0.08$, $\beta \approx 0.90$, confirming high persistence. Note that the GARCH recursion is structurally equivalent to an RNN hidden state constrained to a specific scalar parametric form. See stylised facts.
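
A minimal sketch of the GARCH(1,1) recursion above. The values $\alpha = 0.08$ and $\beta = 0.90$ follow the typical estimates quoted in the entry; $\omega = 0.02$ and the shock sequence are assumptions chosen so that the long-run variance $\omega / (1 - \alpha - \beta)$ equals 1.

```python
def garch_variance_path(shocks, omega, alpha, beta):
    """Conditional variances from sigma2_t = omega + alpha*eps_{t-1}^2 + beta*sigma2_{t-1}."""
    if alpha + beta >= 1:
        raise ValueError("alpha + beta must be below 1 for a stationary variance")
    # Start the recursion at the unconditional (long-run) variance.
    var = omega / (1.0 - alpha - beta)
    path = [var]
    for eps in shocks:
        var = omega + alpha * eps ** 2 + beta * var
        path.append(var)
    return path

# Hypothetical shocks: calm, calm, one 4-sigma move, then calm again.
shocks = [1.0, 1.0, 4.0, 1.0, 1.0]
path = garch_variance_path(shocks, omega=0.02, alpha=0.08, beta=0.90)
```

The variance sits at 1.0 until the large shock, jumps to 2.2, and then decays slowly (2.08, 1.972), which is the persistence that $\beta$ controls.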

Gradient boosting (boosted trees): An ensemble method that builds a strong predictor by combining many weak predictors (usually shallow decision trees) fitted sequentially. In regression, each new tree is trained to predict the residuals of the current ensemble, so the model focuses on what it is still getting wrong. The learning rate (shrinkage) controls how much each new tree contributes.

Statistical intuition: boosting primarily targets bias reduction by increasing functional flexibility, but it can overfit without careful regularisation (learning rate, tree depth, number of trees, and time-aware validation).

See decision tree, random forest, cross-validation, and walk-forward validation.

I

Isolation Forest: An unsupervised anomaly detection algorithm introduced by F. T. Liu, Ting, and Zhou (2008). The core intuition is that anomalies are easy to isolate: because they occupy sparse regions of the feature space, a recursive random partitioning (equivalent to building a random decision tree) needs very few splits to separate an anomalous observation from the rest. Each tree records the path length needed to isolate each point; the ensemble average of these path lengths is the anomaly score (shorter = more anomalous). Normal observations, embedded in dense clusters, require many more splits and have longer average paths.

Unlike supervised classifiers, Isolation Forest needs no labels. In fraud detection this is valuable because (a) labels are expensive to obtain, and (b) novel fraud tactics are not yet represented in labelled data. The model learns what normal looks like and flags deviations. A key limitation: unusual is not the same as fraudulent. Large legitimate transactions (foreign holidays, one-off purchases) score anomalous too, producing false positives. In practice, the unsupervised anomaly score is best used as an additional feature fed into a supervised model rather than as a standalone classifier — see hybrid model. The contamination hyperparameter sets the expected proportion of anomalies, shifting the decision threshold but not the underlying scores. See class imbalance and cost-sensitive threshold.

Hybrid model (unsupervised + supervised): In fraud detection, a two-stage architecture in which unsupervised anomaly scores (e.g. from Isolation Forest, an autoencoder, or a graph centrality measure) are computed first and then added as features to a supervised classifier. The supervised model then learns when anomalous-looking transactions are actually fraudulent, calibrating against the label distribution. This pattern addresses the complementary weaknesses of each approach: unsupervised methods surface novel patterns without requiring labels; supervised methods use labels to separate genuine fraud from benign anomalies. The AUC improvement from adding one unsupervised feature is typically modest (+0.01–0.02) but consistent, and stacking multiple unsupervised signals compounds the gain.

L

Log returns: The continuously compounded return on an asset: $g_t = \ln(P_t/P_{t-1})$, where $P_t$ is the price at time $t$. Log returns are preferred in statistical analysis for two reasons: (1) they are additive over time, so the multi-period log return is the exact sum of single-period log returns; (2) they rule out negative prices by construction, since $P_t = P_{t-1} e^{g_t} > 0$. Writing $r_t = P_t/P_{t-1} - 1$ for the simple return, $g_t = \ln(1 + r_t) \approx r_t$ for small returns ($|r_t| < 5\%$); the difference becomes material for large moves. The stylised facts of asset returns apply to both log returns and simple returns. See Campbell, Lo, and MacKinlay (1997).
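
The additivity property can be checked numerically. The price path below is made up for illustration.

```python
import math

# Hypothetical daily closing prices.
prices = [100.0, 102.0, 99.0, 105.0]

pairs = list(zip(prices, prices[1:]))
log_returns = [math.log(p1 / p0) for p0, p1 in pairs]
simple_returns = [p1 / p0 - 1.0 for p0, p1 in pairs]

# Multi-period returns computed directly from the first and last price.
total_log = math.log(prices[-1] / prices[0])
total_simple = prices[-1] / prices[0] - 1.0

sum_log = sum(log_returns)        # matches total_log up to rounding
sum_simple = sum(simple_returns)  # does not match total_simple
```

Summing the log returns recovers the three-day log return, whereas summing the simple returns gives roughly 5.12% against a true three-day simple return of 5%.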

LoRA (Low-Rank Adaptation): A parameter-efficient fine-tuning method for large language models that represents weight updates as the product of two low-rank matrices: $\Delta W = AB$, where $A \in \mathbb{R}^{d \times r}$, $B \in \mathbb{R}^{r \times k}$, and $r \ll \min(d, k)$. The motivation is identical to denoising in portfolio construction: most of the task-relevant signal in a weight matrix lives in a low-rank subspace — the top-$r$ singular vectors (see SVD) — and the rest is noise. By restricting updates to this subspace, LoRA achieves performance competitive with full fine-tuning at a fraction of the parameter cost. The connection to the Marcenko-Pastur Law is direct: if a weight matrix has most eigenvalues in the noise range, the effective rank of meaningful information is small, justifying a low-rank update. See SVD, eigenvalue, and embeddings.
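
The parameter saving follows from counting entries: a full update $\Delta W$ has $d \times k$ parameters, while the factors $A$ and $B$ together have $r(d + k)$. The layer size and rank below are assumptions chosen for illustration, not figures from the course.

```python
def lora_param_counts(d, k, r):
    """Return (full update parameters, low-rank update parameters)."""
    full = d * k              # entries of Delta W
    low_rank = r * (d + k)    # entries of A (d x r) plus B (r x k)
    return full, low_rank

# Hypothetical 4096 x 4096 weight matrix with rank-8 adaptation.
full, low_rank = lora_param_counts(d=4096, k=4096, r=8)
reduction = full // low_rank
```

In this example the low-rank update trains 65,536 parameters instead of 16,777,216, a 256-fold reduction for that one matrix.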

Look-ahead bias: The use in a model or backtest of information that would not have been available to a real decision-maker at the time. Examples: normalising data using a full-sample mean computed over the test period; constructing a historical index portfolio using current membership, including companies that were not in the index in earlier years. The test: “At time $t$, could a real decision-maker have legally and practically known this?” Look-ahead bias is among the most common sources of spuriously good backtest performance. See Harvey and Liu (2015) and Prado (2018). See backtest.

LSTM (Long Short-Term Memory): A recurrent neural network architecture designed to learn long-range dependencies in sequential data, introduced by Hochreiter and Schmidhuber (1997). LSTMs address the vanishing gradient problem of standard RNNs by maintaining a separate cell state $c_t$ that passes through time via additive operations. Three learned gates, forget ($f_t$), input ($i_t$), and output ($o_t$), selectively control what information is erased, written, and exposed. Finance analogy: the forget gate acts as a regime-switching detector, discarding stale information from the cell state when the market transitions to a new regime. LSTMs substantially outperform vanilla RNNs on tasks with dependencies exceeding ~20 timesteps, but have in turn been superseded by transformer architectures for most modern large-scale applications. See transformer and RNN.

M

Mixing service (cryptocurrency tumbler): A service or protocol that obscures the transaction trail on a public blockchain by pooling funds from multiple senders and returning equivalent amounts to intended recipients via a series of intermediate addresses. The goal is to break the linkage between input and output addresses that blockchain analytics firms (Chainalysis, Elliptic) exploit to trace illicit flows. Centralised tumblers (custodial services) carry counterparty risk and can be seized by law enforcement. Decentralised mixing protocols (e.g. Tornado Cash on Ethereum) use smart contracts and cryptographic proofs (zero-knowledge proofs) to provide non-custodial mixing. Tornado Cash was sanctioned by the US Treasury’s OFAC in 2022. See privacy coin and DeFi.

Marcenko-Pastur Law: A result from Random Matrix Theory that characterises the distribution of eigenvalues of a large random covariance matrix. For a matrix estimated from $T$ observations of $M$ assets, if the returns are purely random (no genuine correlations), the eigenvalues $\lambda$ are distributed within the bounds: \[\lambda_{\pm} = \sigma^2 \left(1 \pm \sqrt{\tfrac{M}{T}}\right)^{2} = \sigma^2 \left(1 \pm \sqrt{\tfrac{1}{Q}}\right)^{2}\] where $Q = T/M$ is the observation-to-asset ratio and $\sigma^2$ is the mean eigenvalue. Eigenvalues inside $[\lambda_-, \lambda_+]$ are statistically indistinguishable from noise. Eigenvalues above $\lambda_+$ carry genuine information about asset correlations. For a typical fund with 100 assets and 250 observations ($Q = 2.5$), the noise band for a correlation matrix ($\sigma^2 = 1$) runs from about 0.14 to about 2.66, and most empirical eigenvalues fall inside it, meaning most of the estimated covariance structure is statistically artefactual. The law gives a theoretically grounded rule for denoising the covariance matrix: replace noise eigenvalues with their mean, preserve signal eigenvalues. The same insight applies to LoRA fine-tuning and SVD-based compression of neural network weight matrices. See eigenvalue, denoising, Random Matrix Theory, and W. Liu et al. (2024).
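
The bounds can be computed for the 100-asset, 250-observation example. This sketch assumes the matrix is a correlation matrix, so that $\sigma^2 = 1$; the function name `mp_bounds` is illustrative.

```python
import math

def mp_bounds(n_assets, n_obs, sigma2=1.0):
    """Marcenko-Pastur lower and upper eigenvalue bounds for M assets, T observations."""
    root = math.sqrt(n_assets / n_obs)  # sqrt(M / T), i.e. sqrt(1 / Q)
    lam_minus = sigma2 * (1.0 - root) ** 2
    lam_plus = sigma2 * (1.0 + root) ** 2
    return lam_minus, lam_plus

lam_minus, lam_plus = mp_bounds(n_assets=100, n_obs=250)
```

Under these assumptions the noise band is roughly $[0.135, 2.665]$, so only eigenvalues above about 2.66 would be treated as signal.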

Modern Portfolio Theory (MPT): The mathematical framework for constructing portfolios that maximise expected return for a given level of risk (variance), introduced by Markowitz (1952). Given $N$ assets with expected return vector $\mu$ and covariance matrix $\Sigma$, the optimal portfolio $w^*$ solves: \[\max_w \; w^\top \mu - \frac{\gamma}{2} w^\top \Sigma w \quad \text{subject to} \; \sum_i w_i = 1\] where $\gamma$ is the investor’s risk aversion. The solution traces the efficient frontier — the set of mean-variance efficient portfolios. MPT is theoretically elegant but faces severe practical challenges: estimation error in $\mu$ and $\Sigma$ causes extreme, unstable weights; the normality assumption ignores fat tails and skewness; and classical quadratic programming solvers do not scale to large portfolios with cardinality constraints. See estimation error, efficient frontier, Black-Litterman, and Prado (2018).

Multi-objective optimisation / Pareto frontier: An optimisation problem with two or more conflicting objectives, for example maximising return, minimising variance, and maximising skewness. Because no single portfolio can simultaneously be best on all objectives, the solution is a Pareto frontier (also called a Pareto front or Pareto set): the set of portfolios that are not Pareto-dominated. A portfolio is Pareto-dominated if there exists another portfolio that is at least as good on all objectives and strictly better on at least one. Classical MPT collapses the multi-objective problem into a single scalar (a mean-variance utility or the Sharpe ratio); in doing so, it discards information about investor preferences for skewness and other moments. Evolutionary algorithms are well suited to generating the full Pareto frontier because they can maintain a diverse population of solutions simultaneously. See W. Liu et al. (2024).
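
The dominance test translates directly into code. The sketch below is a brute-force filter for small candidate sets; the function names and the five candidate portfolios are hypothetical, and every objective is written so that larger is better (variance is negated).

```python
def dominates(a, b):
    """True if a is at least as good as b on every objective and strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(points):
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Each tuple is (expected return, -variance, skewness); values are made up.
candidates = [
    (0.08, -0.04, 0.10),
    (0.06, -0.02, 0.00),
    (0.05, -0.03, -0.20),  # dominated by the second portfolio
    (0.10, -0.09, -0.50),
    (0.08, -0.04, 0.05),   # dominated by the first portfolio
]
front = pareto_front(candidates)
print(front)
```

The pairwise comparison costs a number of checks proportional to the square of the candidate count, which is acceptable for a sketch; evolutionary algorithms use faster non-dominated sorting on large populations.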

N

NP-hard: A class of combinatorial optimisation problems for which no known algorithm can find the exact solution in polynomial time as the problem size grows. In portfolio optimisation, adding cardinality constraints (limit to $K$ assets from $N$) or minimum transaction lots (invest in whole units) converts the problem from a smooth quadratic programme to an NP-hard combinatorial search. For $N = 1{,}000$ and $K = 50$, the number of candidate portfolios exceeds $10^{62}$ — exhaustive search is computationally infeasible. Evolutionary algorithms provide good approximate solutions for NP-hard problems without guaranteeing the exact optimum. The practical implication: the portfolio problem that robo-advisers actually face — with hundreds of assets and real-world constraints — cannot be solved exactly, only approximated. See evolutionary algorithm and cardinality constraint.
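
The size of the search space can be checked with exact integer arithmetic. This short sketch uses `math.comb` from the Python standard library (available from Python 3.8) to count the ways of choosing which $K$ assets to hold, before any weights are chosen.

```python
import math

n_assets, max_holdings = 1000, 50

# Number of distinct 50-asset subsets of a 1,000-asset universe.
n_subsets = math.comb(n_assets, max_holdings)
n_digits = len(str(n_subsets))

print(f"C({n_assets}, {max_holdings}) has {n_digits} digits")
print("exceeds 10^62:", n_subsets > 10**62)
```

The count is of order $10^{84}$, comfortably above the $10^{62}$ bound quoted in the entry, and each subset still requires its own weight optimisation.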

Null hypothesis: The default assumption in a statistical test: that there is no effect, no difference, or no relationship. In financial strategy evaluation, the null hypothesis is typically “this strategy has zero alpha.” The purpose of a statistical test is to assess whether the observed data are surprising enough, under this assumption, to reject it. The conventional threshold (p < 0.05) means: if the null hypothesis were true, there would be less than a 5% chance of seeing results at least this extreme. This threshold is a convention, not a law, and may be far too lenient when many tests are conducted simultaneously (see p-hacking and false discovery problem).

O

Out-of-sample testing: Evaluating a model or strategy on data that was not used in its development or selection. The essential evidence standard in empirical finance and machine learning: a model that works well in-sample but fails out-of-sample is almost certainly overfitted. See walk-forward validation for the rigorous implementation and backtest for the common but weaker alternative.

Overfitting: When a model is fitted so closely to the training data that it captures idiosyncratic noise rather than the underlying pattern, causing poor performance on new data. In quantitative finance, where the signal-to-noise ratio is very low, overfitting is the default failure mode: strategies with many free parameters almost always achieve impressive in-sample results that do not survive out-of-sample testing. See bias-variance tradeoff and Prado (2018).

Overparameterised regime (P ≥ T): A regression setting in which the number of predictors ($P$) is at least as large as the number of observations ($T$). With a feature matrix $X \in \mathbb{R}^{T \times P}$, this is the “wide matrix” case: there are at least as many columns as rows.

This matters because the standard OLS formula $\hat{\beta} = (X^\top X)^{-1} X^\top y$ relies on $X^\top X$ being invertible. When $P > T$, $X^\top X$ is singular (its rank is at most $T$) and OLS is not uniquely defined without additional constraints. Intuitively, there are many different coefficient vectors that can fit the training data equally well, which makes estimates unstable and encourages memorisation of noise.

Regularisation (for example, ridge regression) restores well-posedness by adding a penalty that effectively replaces $X^\top X$ with $X^\top X + \lambda I$, which is invertible for any $\lambda > 0$. The special case $P = T$ is often called the interpolation boundary: the point at which perfect in-sample fit becomes algebraically easy but statistically misleading.
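
The rank argument can be verified numerically. The sketch below assumes NumPy and uses simulated data with $T = 20$ and $P = 50$; the target `y` is pure noise, so any perfect in-sample fit is memorisation by construction. The penalty value `lam = 1.0` is an arbitrary illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(0)
T, P = 20, 50                       # fewer observations than predictors
X = rng.standard_normal((T, P))
y = rng.standard_normal(T)          # pure noise target

gram = X.T @ X                      # P x P matrix, rank at most T
rank = np.linalg.matrix_rank(gram)

# Minimum-norm least-squares solution: fits the noise exactly in-sample.
beta_min_norm = np.linalg.pinv(X) @ y
in_sample_error = np.max(np.abs(X @ beta_min_norm - y))

# Ridge: X'X + lam * I is invertible for any lam > 0.
lam = 1.0
beta_ridge = np.linalg.solve(gram + lam * np.eye(P), X.T @ y)

print("rank of X'X:", rank, "out of", P)
print("max in-sample error of interpolating fit:", in_sample_error)
print("ridge coefficient norm:", np.linalg.norm(beta_ridge))
```

In practice the penalty strength would be chosen by out-of-sample validation instead of being fixed in advance.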

See shrinkage, bias-variance tradeoff, and out-of-sample testing.

P

Proof-of-Work (PoW): The consensus mechanism used by Bitcoin and early Ethereum. Validators (“miners”) compete to find a nonce value such that the hash of the new block header falls below a target threshold — a computationally expensive search with no shortcut. The first miner to find a valid nonce broadcasts the block and earns the block reward. The security guarantee is economic: attacking the chain (e.g. reversing a confirmed transaction) requires the attacker to redo all the computational work from the target block forward, at a cost exceeding the honest chain’s accumulated work. At Bitcoin’s scale, this requires ~$10 billion in mining hardware and ongoing electricity costs.
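
The nonce search can be illustrated with a toy example. This sketch uses Python's standard `hashlib`; it is a simplification, not Bitcoin's actual format (Bitcoin applies SHA-256 twice to an 80-byte header with a 4-byte nonce and encodes the target differently). The names `mine`, `verify`, and `difficulty_bits` are assumptions introduced for illustration.

```python
import hashlib


def block_hash(header: bytes, nonce: int) -> bytes:
    """Single SHA-256 of header plus an 8-byte big-endian nonce (toy format)."""
    return hashlib.sha256(header + nonce.to_bytes(8, "big")).digest()


def mine(header: bytes, difficulty_bits: int, max_nonce: int = 2**32):
    """Try nonces in order until the hash, read as an integer, is below the target."""
    target = 1 << (256 - difficulty_bits)
    for nonce in range(max_nonce):
        digest = block_hash(header, nonce)
        if int.from_bytes(digest, "big") < target:
            return nonce, digest.hex()
    raise RuntimeError("no valid nonce found below max_nonce")


def verify(header: bytes, nonce: int, difficulty_bits: int) -> bool:
    """Checking a claimed nonce takes one hash, however long the search took."""
    target = 1 << (256 - difficulty_bits)
    return int.from_bytes(block_hash(header, nonce), "big") < target


nonce, digest_hex = mine(b"toy block header", difficulty_bits=16)
print(nonce, digest_hex[:16], verify(b"toy block header", nonce, 16))
```

Each extra difficulty bit doubles the expected number of hashes needed to find a nonce, while verification stays at a single hash; that asymmetry is what makes rewriting history expensive.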

The weakness is probabilistic finality: because any miner could, in principle, produce a longer chain from an earlier block, a transaction is only considered safe after several subsequent blocks confirm it (the standard is 6 blocks, ~60 minutes). This “confirmation window” is a direct constraint on fraud detection response time. See Byzantine Fault Tolerant consensus and Proof-of-Stake.

Proof-of-Stake (PoS): A consensus mechanism in which validators are selected to propose and attest to new blocks in proportion to the amount of cryptocurrency they have locked up (“staked”) as collateral. Adopted by Ethereum in 2022 (“The Merge”), PoS replaces the energy-intensive mining of Proof-of-Work with an economic stake. Dishonest or contradictory behaviour (e.g. signing two conflicting blocks for the same slot) triggers slashing: part or all of the offending validator’s stake is automatically and irreversibly destroyed by the protocol, with the penalty scaling with the severity of the offence.

PoS achieves stronger finality than PoW (~15 minutes on Ethereum with the Casper FFG finality gadget) and consumes roughly 0.01% of Bitcoin’s energy. The trade-off is more complex validator-set management and exposure to “nothing-at-stake” attacks (mitigated by slashing). For fraud detection, the faster finality window means suspicious transactions can be identified more quickly than on Bitcoin. See Proof-of-Work and Byzantine Fault Tolerant consensus.

Privacy coin: A cryptocurrency designed to provide stronger transaction anonymity than Bitcoin or Ethereum, whose public ledgers allow transaction graph analysis. The two most widely used are:

Monero (XMR): uses ring signatures (transactions are signed by one member of a group; outsiders cannot determine which), stealth addresses (one-time recipient addresses), and RingCT (hides transaction amounts). The transaction graph is effectively opaque to chain analytics.
Zcash (ZEC): uses zk-SNARKs (a form of zero-knowledge proof) to allow “shielded” transactions where sender, recipient, and amount are hidden while the blockchain still validates correctness without revealing underlying data. Transparent Zcash transactions have full Bitcoin-like visibility.

For fraud detection, privacy coins present a fundamental challenge: the graph and amount features that support models such as those on the Elliptic dataset do not exist for shielded transactions. Detection relies instead on on/off-ramp monitoring (exchanges with KYC obligations) and behavioural patterns at the point of conversion. Several exchanges have delisted privacy coins under regulatory pressure (FCA, 2021). See mixing service and DeFi exploit.

Pareto frontier: See Multi-objective optimisation.

P-hacking: The practice of testing many statistical specifications, hypotheses, or model configurations and reporting only those that produce a result passing a significance threshold (p < 0.05). At a 5% threshold, on average 1 in 20 independent tests of a true null hypothesis will reject it by chance. Harvey, Liu, and Zhu (2016) show that most of the published equity factor literature is affected. The solutions are pre-registration, out-of-sample validation, and adjusted significance thresholds. See false discovery problem.
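
The arithmetic behind the multiple-testing problem is short enough to compute directly. The sketch below assumes independent tests of true null hypotheses and uses the Bonferroni correction as one simple example of an adjusted threshold; it is not the specific adjustment proposed by Harvey, Liu, and Zhu (2016), and the function names are illustrative.

```python
def prob_any_false_positive(n_tests: int, alpha: float = 0.05) -> float:
    """Chance that at least one of n independent true-null tests rejects at level alpha."""
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_threshold(n_tests: int, alpha: float = 0.05) -> float:
    """Per-test threshold that keeps the family-wise error rate at or below alpha."""
    return alpha / n_tests


for n_tests in (1, 20, 100):
    print(
        n_tests,
        round(prob_any_false_positive(n_tests), 4),
        bonferroni_threshold(n_tests),
    )
```

With 20 tests the chance of at least one spurious "discovery" is about 64%, which is why a single reported p < 0.05 says little unless the number of specifications tried is also disclosed.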

Q

Q-ratio (observation-to-asset ratio): The ratio $Q = T/M$ of the number of observations $T$ to the number of assets $M$ in a portfolio estimation problem. The Q-ratio governs the severity of estimation error in the covariance matrix: when $Q \gg 1$, estimates are well-determined; when $Q < 1$, the covariance matrix is rank-deficient and cannot be inverted without regularisation. In the Marcenko-Pastur Law, $Q$ controls the width of the noise zone: small $Q$ (underdetermined) produces wide bounds, meaning almost all eigenvalues are noise. For a typical institutional manager with 1,000 assets and one year of daily data (about 250 observations), $Q = 0.25$: the problem is severely underdetermined. See Marcenko-Pastur Law and estimation error.

R

Random Matrix Theory (RMT): A branch of mathematics that studies the statistical properties of matrices with random entries, with applications in physics, statistics, and finance. In portfolio optimisation, RMT provides a rigorous framework for understanding the eigenvalue spectrum of large sample covariance matrices. The central result for finance is the Marcenko-Pastur Law: it defines precisely which eigenvalues of an empirical covariance matrix are consistent with pure random noise, and which carry genuine information. RMT-based denoising exploits this result to construct more stable covariance matrices by replacing noise eigenvalues with their mean while preserving the signal eigenvalues. The same mathematical framework applies to the weight matrices of neural networks, where researchers use RMT to analyse training dynamics and guide model compression. See Marcenko-Pastur Law, denoising, and eigenvalue.
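
A minimal sketch of eigenvalue-based denoising follows, assuming NumPy. The data are simulated from a hypothetical one-factor model (a common "market" shock plus idiosyncratic noise), so exactly one large signal eigenvalue is expected. The final rescaling to a unit diagonal is one common convention and is an assumption here, not a step stated in the glossary.

```python
import numpy as np


def denoise_correlation(corr, n_obs):
    """Replace eigenvalues at or below the Marcenko-Pastur upper bound with their mean."""
    n_assets = corr.shape[0]
    lam_plus = (1.0 + np.sqrt(n_assets / n_obs)) ** 2  # sigma^2 = 1 for correlations
    eigvals, eigvecs = np.linalg.eigh(corr)            # ascending eigenvalues
    noise = eigvals <= lam_plus
    cleaned = eigvals.copy()
    if noise.any():
        cleaned[noise] = eigvals[noise].mean()         # preserves the trace
    denoised = eigvecs @ np.diag(cleaned) @ eigvecs.T
    scale = np.sqrt(np.diag(denoised))
    return denoised / np.outer(scale, scale), int((~noise).sum())


rng = np.random.default_rng(42)
T, M = 250, 100
market = rng.standard_normal((T, 1))
returns = 0.5 * market + rng.standard_normal((T, M))   # simulated, not real data
corr = np.corrcoef(returns, rowvar=False)

clean_corr, n_signal = denoise_correlation(corr, n_obs=T)
print("signal eigenvalues kept:", n_signal)
```

Flattening the noise eigenvalues raises the smallest ones, which improves the conditioning of the matrix that a mean-variance optimiser has to invert.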

Random walk: A time series in which changes from one observation to the next are unpredictable given past history: $P_t = P_{t-1} + \varepsilon_t$, with $\varepsilon_t$ independent and identically distributed. Asset prices closely approximate a random walk — the empirical content of the weak form of the Efficient Market Hypothesis (Fama 1970). A random walk is non-stationary; the solution is to work with log returns (first differences of log prices). See stationarity.

Random forest: A tree-based ensemble method that reduces the variance of a single decision tree by averaging many trees. Each tree is trained on a bootstrap resample of the data (bagging), and at each split it considers only a random subset of predictors. This feature subsampling decorrelates the trees, making the average prediction more stable and improving out-of-sample performance.
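
Why decorrelation matters follows from a standard result for averages of correlated estimators. As a sketch, assume $B$ trees whose predictions $\hat{f}_b(x)$ each have variance $\sigma^2$ and a common pairwise correlation $\rho$ (symbols introduced here for illustration only): \[\text{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B} \hat{f}_b(x)\right) = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2\] Adding trees shrinks the second term towards zero but leaves the floor $\rho\,\sigma^2$ untouched, so lowering $\rho$ through feature subsampling is what allows the ensemble variance to keep falling.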

Statistical intuition: random forests primarily target variance reduction. They are often a strong baseline for tabular prediction problems, but their non-linear structure means they are not interpreted via regression coefficients.

See decision tree, gradient boosting, overfitting, and cross-validation.

Regime-dependence: The property of a strategy or model that performs well only in certain market conditions (e.g. bull markets, low volatility) and fails in others. A canonical example from the Bloomberg data: the SPY–TLT correlation was negative (diversifying) from 2018 to 2021 but turned positive during the 2022 rate-hiking cycle — any portfolio “optimised” on pre-2022 data suffered a regime break. Testing across multiple regimes (bull, bear, high/low volatility, crisis) is a necessary part of evidence discipline. The correlation flip also violates MPT’s assumption of a stationary covariance matrix. See backtest and Harvey and Liu (2015).

RNN (Recurrent Neural Network): A neural network architecture for sequential data that maintains a hidden state $h_t$ summarising all preceding inputs, updated as $h_t = f(W_{hh} h_{t-1} + W_{xh} x_t + b_h)$. Standard RNNs suffer from vanishing gradients: gradient signals from distant timesteps decay exponentially, limiting useful memory to ~20 timesteps. Finance analogy: the GARCH equation is a one-dimensional RNN with a fixed parametric form. The LSTM was designed to solve the vanishing gradient problem. See Hochreiter and Schmidhuber (1997).
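
The exponential decay of gradients can be seen in a one-dimensional example. The sketch below assumes $f = \tanh$ and scalar weights (the entry leaves $f$ unspecified), and tracks $\partial h_t / \partial h_0$ by applying the chain rule at each step. All parameter values are illustrative.

```python
import math


def gradient_through_time(inputs, w_hh=0.9, w_xh=0.5, b_h=0.0):
    """Return d h_t / d h_0 for t = 1..T in a scalar tanh RNN."""
    h, grad, grads = 0.0, 1.0, []
    for x in inputs:
        h = math.tanh(w_hh * h + w_xh * x + b_h)
        # Chain rule: d h_t / d h_{t-1} = w_hh * (1 - h_t^2), at most |w_hh| in size.
        grad *= w_hh * (1.0 - h * h)
        grads.append(grad)
    return grads


inputs = [math.sin(0.3 * t) for t in range(50)]
grads = gradient_through_time(inputs)
print("after 5 steps: ", grads[4])
print("after 50 steps:", grads[-1])
```

Because every factor in the product is below 0.9 in absolute value, the influence of $h_0$ on $h_t$ shrinks at least geometrically with $t$, which is the vanishing-gradient problem that the LSTM's gating was designed to avoid.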

Robo-adviser: An automated investment service that constructs and manages portfolios using algorithms, with minimal human intervention. Typically charges 0.25% of AUM versus ~1% for a traditional adviser. Automates prediction (portfolio optimisation, rebalancing) but relies on the client’s stated risk tolerance for judgement. The economics of robo-advisers are studied in Reher and Sokolinski (2024), who document substantial improvements in access and welfare, particularly for lower-wealth investors. See Modern Portfolio Theory and Black-Litterman.

S

Sharpe ratio: A measure of risk-adjusted return: $\text{SR} = (\bar{r} - r_f) / \sigma$, where $\bar{r}$ is the mean portfolio return, $r_f$ the risk-free rate, and $\sigma$ the standard deviation of returns, all measured over the same period. The canonical formulation is Sharpe (1994). An annualised Sharpe ratio of 1.0 is usually considered good; 2.0+ exceptional in a live, out-of-sample context. The Sharpe ratio is susceptible to inflation via multiple testing: selecting the best of $k$ strategies by Sharpe inflates the expected maximum Sharpe even when all true Sharpe ratios are zero. The Deflated Sharpe Ratio corrects for this. See Prado (2018).
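
Both the formula and the selection effect can be sketched with the Python standard library. The simulation below is an illustration under stated assumptions: 200 hypothetical strategies, each with one year of independent daily returns whose true mean is zero, and annualisation by $\sqrt{252}$. It does not implement the Deflated Sharpe Ratio.

```python
import math
import random
import statistics


def sharpe_ratio(returns, risk_free=0.0, periods_per_year=1):
    """Mean excess return over its sample standard deviation, optionally annualised."""
    excess = [r - risk_free for r in returns]
    return statistics.mean(excess) / statistics.stdev(excess) * math.sqrt(periods_per_year)


# Small deterministic example (per-period Sharpe, no annualisation).
example_sr = sharpe_ratio([0.01, 0.03, -0.01, 0.02])

# Selection effect: every simulated strategy has a true Sharpe ratio of zero.
random.seed(1)
n_strategies, n_days = 200, 252
sharpes = [
    sharpe_ratio([random.gauss(0.0, 0.01) for _ in range(n_days)], periods_per_year=252)
    for _ in range(n_strategies)
]
best_sharpe = max(sharpes)
average_sharpe = statistics.mean(sharpes)
print("average Sharpe:", round(average_sharpe, 2), "best Sharpe:", round(best_sharpe, 2))
```

The average across strategies stays near zero, but the best of the 200 looks attractive purely through selection, which is why a reported Sharpe ratio should be read alongside the number of strategies tried.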

SHAP values (Shapley Additive Explanations): A post-hoc explanation method for interpreting predictions from complex models. For a fitted model $\hat{f}$ and an observation $i$, SHAP expresses the prediction as an additive decomposition: \[\hat{y}_i = \phi_0 + \sum_{j=1}^{P} \phi_j^{(i)}\] where $\phi_0$ is a baseline prediction (defined relative to a background dataset) and $\phi_j^{(i)}$ is the contribution of feature $j$ to observation $i$’s prediction. The contributions are based on Shapley values from cooperative game theory, which allocate credit across features under a set of axioms.
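
For a model with only a few features, Shapley values can be computed exactly by enumerating feature subsets. The sketch below is a simplified illustration and not the `shap` library's algorithm: "absent" features are replaced by a single background point instead of being averaged over a background dataset, and the three-feature `toy_model` is hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, background):
    """Exact Shapley values; features outside the coalition take the background value."""
    n = len(x)

    def value(coalition):
        z = [x[j] if j in coalition else background[j] for j in range(n)]
        return model(z)

    contributions = []
    for j in range(n):
        others = [k for k in range(n) if k != j]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                coalition = set(subset)
                total += weight * (value(coalition | {j}) - value(coalition))
        contributions.append(total)
    return value(set()), contributions


def toy_model(z):
    """Hypothetical fitted model: one additive term and one interaction."""
    return 2.0 * z[0] + z[1] * z[2]


x = [1.0, 2.0, 3.0]
background = [0.0, 0.0, 0.0]
phi_0, phi = shapley_values(toy_model, x, background)
print(phi_0, phi, phi_0 + sum(phi), toy_model(x))
```

The baseline plus the contributions reproduces the prediction exactly, and the interaction term is split equally between the two features involved; changing `background` changes every attribution, which illustrates the dependence on the background distribution.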

The central caution is statistical, not computational. SHAP explains the model you fitted, not the true data generating process. With correlated predictors, attributions are not uniquely identified and can shift depending on the background distribution and modelling choices. SHAP is therefore best used as a tool for describing how a model behaves, not as evidence that a feature has a causal effect.

See decision tree, random forest, gradient boosting, overfitting, and out-of-sample testing.

Shrinkage (shrinkage estimator, James-Stein): A family of statistical estimators that deliberately pull noisy estimates towards a target value (often the grand mean, zero, or a Bayesian prior) in order to reduce variance. The principle is simple: a small increase in bias can be worth accepting if it produces a larger reduction in variance, so that mean squared error falls overall.

In regression and forecasting, penalised methods such as ridge regression and LASSO are shrinkage estimators. The penalty strength ($\lambda$) controls how aggressively coefficients are shrunk towards zero, and the practical goal is to minimise out-of-sample prediction error rather than to make bias exactly zero.

The term is often introduced via the James-Stein paradox: when three or more means are estimated jointly, the sample mean is inadmissible under total squared-error loss, because a shrinkage estimator achieves lower total mean squared error whatever the true means are. In portfolio optimisation, shrinkage is applied to expected returns and to the covariance matrix to reduce estimation error and produce more stable weights out of sample. The Marcenko-Pastur Law provides a principled shrinkage rule for the covariance matrix eigenvalues.
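
The James-Stein result can be checked by simulation. The sketch below assumes the textbook setting: one noisy observation per parameter with known unit variance, shrinkage towards zero, and ten true means all equal to 0.5 (an arbitrary illustrative choice). It uses only the Python standard library.

```python
import random


def james_stein(x, sigma2=1.0):
    """Shrink the observation vector x towards zero by the James-Stein factor."""
    p = len(x)
    squared_norm = sum(v * v for v in x)
    shrink = 1.0 - (p - 2) * sigma2 / squared_norm
    return [shrink * v for v in x]


random.seed(0)
true_means = [0.5] * 10
n_trials = 2000
sse_raw = sse_js = 0.0
for _ in range(n_trials):
    x = [random.gauss(m, 1.0) for m in true_means]   # raw (unshrunk) estimates
    shrunk = james_stein(x)
    sse_raw += sum((a - m) ** 2 for a, m in zip(x, true_means))
    sse_js += sum((a - m) ** 2 for a, m in zip(shrunk, true_means))

mse_raw = sse_raw / n_trials
mse_js = sse_js / n_trials
print("total MSE, raw estimates:", round(mse_raw, 2))
print("total MSE, James-Stein:  ", round(mse_js, 2))
```

The shrunk estimates are biased towards zero, yet their total squared error is lower; the gain is largest when the true means are close to the shrinkage target and the noise is large relative to the signal, which is the usual situation for expected returns.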

See estimation error, Black-Litterman, denoising, and bias-variance tradeoff.

Signal-to-noise ratio: In financial time series, the ratio of the expected return (signal) to its standard deviation (noise), equivalent to the Sharpe ratio for a zero-risk-free-rate asset. For monthly equity returns, the signal-to-noise ratio is typically 0.15–0.25. This has a direct consequence for statistical inference: because the t-statistic of a mean return grows only with the square root of the number of observations, detecting genuine positive returns requires many more observations than most evaluation periods provide. This is the fundamental statistical power problem in finance: most investment evaluation periods are far too short to validate or refute a strategy. See Harvey and Liu (2015).
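
A back-of-the-envelope calculation makes the power problem concrete. The sketch assumes independent returns, so that the t-statistic of the mean is approximately the per-period signal-to-noise ratio times $\sqrt{T}$; solving for $T$ gives the sample size needed to reach a chosen t-statistic. The function name is illustrative.

```python
import math


def observations_needed(snr_per_period: float, t_stat: float = 1.96) -> int:
    """Smallest T such that snr_per_period * sqrt(T) reaches t_stat."""
    return math.ceil((t_stat / snr_per_period) ** 2)


for snr in (0.15, 0.25):
    months = observations_needed(snr)
    print(f"monthly SNR {snr}: {months} months (about {months / 12:.1f} years)")
```

At a monthly signal-to-noise ratio of 0.15, roughly 171 months (over 14 years) of data are needed just to reach conventional significance, and 62 months at 0.25; stricter thresholds motivated by multiple testing push these figures higher.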

Skewness: A measure of the asymmetry of a probability distribution. A negatively skewed distribution has a longer left tail: large losses are more extreme than large gains. Daily equity returns are typically negatively skewed, meaning sharp falls are larger in magnitude than equivalent single-day gains. This is one of the canonical stylised facts. MPT’s variance measure is symmetric — it treats a 20% gain and a 20% loss identically — meaning portfolios optimised on variance alone under-weight the asymmetric risk of large losses. Incorporating skewness as an objective converts the efficient frontier into a Pareto frontier in mean-variance-skewness space. See Cont (2001).

Stationarity: A property of a time series in which the mean, variance, and autocovariance structure are constant over time. Standard statistical models require stationarity. Asset price levels are non-stationary: under a random walk, variance grows proportionally with time. Daily log returns are approximately stationary — while volatility clusters, the unconditional distribution of returns is stable over time. See Campbell, Lo, and MacKinlay (1997).
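
The claim that random-walk variance grows in proportion to time can be checked by simulation. The sketch below uses the Python standard library and assumes unit-variance Gaussian steps; it compares the cross-sectional variance of the level after 100 and 400 steps across 2,000 simulated paths.

```python
import random
import statistics

random.seed(0)
n_paths, n_steps = 2000, 400
levels_at_100, levels_at_400 = [], []

for _ in range(n_paths):
    level = 0.0
    for t in range(1, n_steps + 1):
        level += random.gauss(0.0, 1.0)   # the increment itself is stationary
        if t == 100:
            levels_at_100.append(level)
    levels_at_400.append(level)

var_100 = statistics.variance(levels_at_100)
var_400 = statistics.variance(levels_at_400)
print("variance of level at t=100:", round(var_100, 1))
print("variance of level at t=400:", round(var_400, 1))
print("ratio:", round(var_400 / var_100, 2))
```

Quadrupling the horizon roughly quadruples the variance of the level, while the variance of each increment stays at one, which is why modelling is done on returns instead of price levels.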

Stylised facts: Empirical regularities of financial return distributions that appear consistently across markets, asset classes, and time periods, catalogued by Cont (2001). The principal facts: (1) near-zero autocorrelation in returns; (2) fat tails (excess kurtosis); (3) negative skewness; (4) volatility clustering. These are empirical regularities, not theoretical consequences — any model that generates returns must reproduce them to be credible. They directly motivate ARIMA for conditional mean and GARCH for conditional variance.

Survivorship bias: The distortion arising when a sample contains only observations that “survived” a selection process, omitting those that failed. In financial research: mutual fund databases contain only funds still operating; historical stock data excludes delisted stocks. Survivors have on average better performance than the full population because failures are excluded. See backtest and false discovery problem.

SVD (Singular Value Decomposition): The generalisation of eigenvalue decomposition to rectangular matrices. Any matrix $W \in \mathbb{R}^{m \times n}$ can be written as $W = U \Sigma V^\top$, where $U$ and $V$ are orthogonal matrices of left and right singular vectors, and $\Sigma$ is a diagonal matrix of non-negative singular values ordered from largest to smallest. The singular values play the same role as eigenvalues: large values carry signal, small values carry noise. Truncating to the top $r$ singular values gives the best rank-$r$ approximation to $W$ (Eckart-Young theorem). SVD underlies: word embeddings (GloVe and Word2Vec implicitly factorise a co-occurrence matrix); LoRA (weight update matrices are constrained to rank $r$); and the Marcenko-Pastur Law applied to rectangular weight matrices in neural networks. In finance, denoising is a special case of SVD truncation applied to the symmetric covariance matrix. See eigenvalue, LoRA, and embeddings.
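
The Eckart-Young property can be verified numerically. The sketch below assumes NumPy and a small random matrix; it truncates the SVD to rank $r = 2$ and checks that the Frobenius-norm error equals the square root of the sum of the squared discarded singular values.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 5))            # illustrative rectangular matrix

# Thin SVD: U is 8x5, s holds 5 singular values (descending), Vt is 5x5.
U, s, Vt = np.linalg.svd(W, full_matrices=False)

r = 2
W_r = U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]   # best rank-r approximation

error = np.linalg.norm(W - W_r, "fro")
expected_error = np.sqrt(np.sum(s[r:] ** 2))
print("singular values:", np.round(s, 3))
print("approximation error:", round(float(error), 6), "expected:", round(float(expected_error), 6))
```

The same truncation applied to a symmetric covariance matrix keeps only its largest eigen-directions, which is the link to denoising made in the entry.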

T

Transformer: A neural network architecture based entirely on attention mechanisms, without recurrence or convolution, introduced by Vaswani et al. (2017). The core component is multi-head self-attention, which allows every token to attend directly to every other token, with attention weights learned from data. This replaces the sequential hidden-state computation of RNNs and LSTMs with a fully parallel computation, enabling training at internet scale. Nearly all current large language models are built on the transformer architecture. The connection to portfolio mathematics: the weight matrices of transformers have an eigenvalue spectrum that can be analysed using Random Matrix Theory, informing model compression and fine-tuning strategies such as LoRA. See attention mechanism, eigenvalue, LoRA, and LSTM.

V

VIX: The CBOE Volatility Index — a real-time measure of implied volatility derived from S&P 500 option prices, constructed to reflect the market’s expectation of 30-day annualised volatility. Widely used as a “fear gauge”: VIX peaked at ~80 during the March 2020 Covid crash versus a long-run average of ~20. VIX systematically exceeds subsequently realised volatility by a variance risk premium (Carr and Wu 2009; Bollerslev, Tauchen, and Zhou 2009) — the compensation investors demand for bearing volatility uncertainty. The MOVE index is the bond market equivalent of VIX. Both are available in the Bloomberg database used in this course. See volatility clustering.

Volatility clustering: The property of financial return series in which the magnitude of returns is positively autocorrelated: large absolute returns tend to be followed by large absolute returns, and small by small, regardless of sign. Classified as one of the canonical stylised facts by Cont (2001). The GARCH(1,1) model was designed specifically to capture this phenomenon. The practical implication: volatility can be forecast with useful accuracy (R² ≈ 15–40% for variance) even though return levels cannot (R² ≈ 1–2%).

W

Walk-forward validation: An out-of-sample testing methodology in which a model is trained on a rolling window of historical data and evaluated on the immediately following period, stepping forward through time. The cardinal rule: no information from any test window may influence model design or hyperparameter selection. Walk-forward validation is more rigorous than a single in-sample backtest because the model never “sees” future data during any training step. See Prado (2018); see backtest and look-ahead bias.
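
The rolling-window scheme can be expressed as a small index generator. The sketch below uses only the Python standard library; the function name `walk_forward_splits` and the choice to advance by one test window per step are assumptions made for illustration.

```python
def walk_forward_splits(n_obs: int, train_size: int, test_size: int):
    """Yield (train_indices, test_indices) pairs, stepping forward one test window at a time."""
    start = 0
    while start + train_size + test_size <= n_obs:
        train = range(start, start + train_size)
        test = range(start + train_size, start + train_size + test_size)
        yield train, test
        start += test_size


splits = list(walk_forward_splits(n_obs=10, train_size=4, test_size=2))
for train, test in splits:
    print("train:", list(train), "test:", list(test))
```

Every training index precedes every test index within a split, so any hyperparameter tuning has to happen inside the training window (for example on its final portion) to respect the cardinal rule stated above.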

References

Atzei, Nicola, Massimo Bartoletti, and Tiziana Cimoli. 2017. “A Survey of Attacks on Ethereum Smart Contracts (SoK).” In Principles of Security and Trust (POST), 10204:164–86. Lecture Notes in Computer Science. Springer. https://doi.org/10.1007/978-3-662-54455-6_8.

Bailey, David H., and Marcos López de Prado. 2014. “The Deflated Sharpe Ratio: Correcting for Selection Bias, Backtest Overfitting and Non-Normality.” Journal of Portfolio Management 40 (5): 94–107. https://doi.org/10.2139/ssrn.2460551.

Bollerslev, Tim. 1986. “Generalized Autoregressive Conditional Heteroskedasticity.” Journal of Econometrics 31 (3): 307–27. https://doi.org/10.1016/0304-4076(86)90063-1.

Bollerslev, Tim, George Tauchen, and Hao Zhou. 2009. “Expected Stock Returns and Variance Risk Premia.” Review of Financial Studies 22 (11): 4463–92. https://doi.org/10.1093/rfs/hhp008.

Box, George E. P., and Norman R. Draper. 1987. Empirical Model-Building and Response Surfaces. John Wiley & Sons.

Campbell, John Y., Andrew W. Lo, and A. Craig MacKinlay. 1997. The Econometrics of Financial Markets. Princeton University Press.

Carr, Peter, and Liuren Wu. 2009. “Variance Risk Premiums.” Review of Financial Studies 22 (3): 1311–41. https://doi.org/10.1093/rfs/hhn038.

Cont, Rama. 2001. “Empirical Properties of Asset Returns: Stylized Facts and Statistical Issues.” Quantitative Finance 1 (2): 223–36. https://doi.org/10.1080/713665670.

Engle, Robert F. 1982. “Autoregressive Conditional Heteroscedasticity with Estimates of the Variance of United Kingdom Inflation.” Econometrica 50 (4): 987–1007. https://doi.org/10.2307/1912773.

Fama, Eugene F. 1970. “Efficient Capital Markets: A Review of Theory and Empirical Work.” Journal of Finance 25 (2): 383–417. https://doi.org/10.2307/2325486.

Harvey, Campbell R., and Yan Liu. 2015. “Backtesting.” Journal of Portfolio Management 42 (1): 13–28. https://doi.org/10.3905/jpm.2015.42.1.013.

Harvey, Campbell R., Yan Liu, and Heqing Zhu. 2016. “... And the Cross-Section of Expected Returns.” Review of Financial Studies 29 (1): 5–68. https://doi.org/10.1093/rfs/hhv059.

Hochreiter, Sepp, and Jürgen Schmidhuber. 1997. “Long Short-Term Memory.” Neural Computation 9 (8): 1735–80. https://doi.org/10.1162/neco.1997.9.8.1735.

Jensen, Michael C. 1968. “The Performance of Mutual Funds in the Period 1945–1964.” Journal of Finance 23 (2): 389–416. https://doi.org/10.1111/j.1540-6261.1968.tb00815.x.

Liu, Fei Tony, Kai Ming Ting, and Zhi-Hua Zhou. 2008. “Isolation Forest.” In Proceedings of the 8th IEEE International Conference on Data Mining (ICDM), 413–22. https://doi.org/10.1109/ICDM.2008.17.

Liu, Weilong, Yong Zhang, Kailong Liu, Barry Quinn, Xingyu Yang, and Qiao Peng. 2024. “Evolutionary Multi-Objective Optimisation for Large-Scale Portfolio Selection with Both Random and Uncertain Returns.” IEEE Transactions on Evolutionary Computation.

Markowitz, Harry. 1952. “Portfolio Selection.” Journal of Finance 7 (1): 77–91. https://doi.org/10.1111/j.1540-6261.1952.tb01525.x.

Murphy, Kevin P. 2012. Machine Learning: A Probabilistic Perspective. Cambridge, MA: MIT Press.

Perez, Daniel, and Ben Livshits. 2021. “Smart Contract Vulnerabilities: Vulnerable Does Not Imply Exploited.” In 30th USENIX Security Symposium, 1325–41.

Philippon, Thomas. 2016. “The FinTech Opportunity.” Working Paper w22476. National Bureau of Economic Research. https://www.nber.org/system/files/working_papers/w22476/w22476.pdf.

Prado, Marcos López de. 2018. Advances in Financial Machine Learning. John Wiley & Sons.

Reher, Michael, and Stanislav Sokolinski. 2024. “Robo-Advisors and Access to Wealth Management.” Journal of Financial Economics 155: 103829. https://doi.org/10.1016/j.jfineco.2024.103829.

Sharpe, William F. 1994. “The Sharpe Ratio.” Journal of Portfolio Management 21 (1): 49–58. https://doi.org/10.3905/jpm.1994.409501.

Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. “Attention Is All You Need.” In Advances in Neural Information Processing Systems. Vol. 30. https://arxiv.org/abs/1706.03762.

--- title: "Glossary of Key Terms" subtitle: "Financial Data Science — FIN306 / FIN510 / FIN720" format: html: toc: true toc-depth: 2 number-sections: false bibliography: - reading.bib - reading_supp.bib --- This glossary collects key terms introduced across the course. Definitions are written for a non-specialist audience but include formal references for those who wish to pursue a topic further. The glossary grows week by week — terms are added as they are introduced in lectures. --- ## A ::: {#alpha} **Alpha** : In investment management, the excess return of a strategy or fund over a benchmark (or over the return expected for the level of risk taken). Jensen's formulation measures alpha as the intercept in a regression of fund returns on market returns — i.e. the return not explained by exposure to the market [@jensen1968performance]. "Genuine alpha" means the outperformance is attributable to skill or information, not luck or overfitting. A backtest may show impressive alpha in-sample, but if the strategy only worked in one market regime (e.g. bull markets), that alpha is *regime-dependent* and not robust. See [backtest](#backtest) and [regime-dependence](#regime-dependence). ::: ::: {#arima} **ARIMA** : Autoregressive Integrated Moving Average. A classical linear time series model combining three components: an *autoregressive* (AR) term modelling the current value as a function of $p$ past values; a *moving average* (MA) term using $q$ past forecast errors; and an *integration* (I) parameter $d$ that differences the series to achieve [stationarity](#stationarity). Notation: ARIMA($p$, $d$, $q$). A random walk is ARIMA(0,1,0); a stationary AR(1) is ARIMA(1,0,0). See @box1987empirical for the classic treatment; @campbell1997econometrics for financial applications. See [GARCH](#garch) for modelling time-varying volatility. 
::: ::: {#attention-mechanism} **Attention mechanism** : A computational mechanism that allows a neural network to weight the relevance of each element of an input sequence dynamically, rather than compressing the full sequence into a single fixed vector. Formally, given query $Q$, key $K$, and value $V$ matrices: $\text{Attention}(Q, K, V) = \text{softmax}(QK^\top/\sqrt{d_k})V$. The softmax of the scaled dot products acts as an adaptive kernel, assigning each token its own weighted average over all other tokens. Introduced in its modern form by @vaswani2017attention as the sole mechanism of the [transformer](#transformer) architecture, replacing RNN hidden states entirely. The matrix $QK^\top$ has an [eigenvalue](#eigenvalue) structure: each attention head can be interpreted as learning a different dominant eigenvector of the token-interaction matrix, capturing a distinct semantic relationship. Finance analogy: attention is a weighted conditional expectation where the weights are learned, not prescribed. See [transformer](#transformer), [LSTM](#lstm), and [embeddings](#embeddings). ::: ::: {#autocorrelation} **Autocorrelation** : The correlation of a time series with its own past values. If today's return is positively autocorrelated, a positive day tends to be followed by another positive day. Formally, the autocorrelation at lag $k$ is $\rho(k) = \text{Corr}(r_t, r_{t-k})$. For daily financial returns, autocorrelation is close to zero — meaning tomorrow's return is nearly unpredictable from today's. By contrast, the autocorrelation of *squared* returns (a proxy for volatility) is large and persistent, reflecting [volatility clustering](#volatility-clustering). See @cont2001empirical. ::: **AUM (Assets Under Management)** : The total market value of assets that a fund or financial institution manages on behalf of clients. Used to measure the scale of asset managers and to calculate fee income (e.g., a 0.25% fee on £1bn AUM = £2.5m revenue). 
--- ## B ::: {#bft} **Byzantine Fault Tolerant (BFT) consensus** : A class of distributed consensus protocols that reach agreement correctly even when up to $f$ of $3f+1$ participants behave arbitrarily — including sending false information, colluding, or simply failing. The name comes from the Byzantine Generals Problem (Lamport, Shostak & Pease, 1982): imagine Byzantine army generals communicating by messenger, some of whom may be traitors sending conflicting orders. The problem asks whether loyal generals can still agree on a plan despite traitors. The answer is yes, provided traitors are fewer than one-third of total participants. In blockchain contexts, BFT protocols are used by **permissioned chains** such as Hyperledger Fabric, where validators are known in advance and identified by legal identity rather than anonymous proof-of-work. Because the validator set is fixed and bounded, BFT achieves **immediate finality**: once a block is committed, it cannot be reversed. This contrasts with Bitcoin's probabilistic finality, where a transaction is considered safe only after several subsequent blocks have confirmed it (~60 minutes for 6 confirmations). For fraud detection, the finality model determines response time: a BFT-based enterprise ledger can flag and freeze a transaction within seconds; a PoW-based network requires an hour. See [Proof-of-Work](#proof-of-work), [Proof-of-Stake](#proof-of-stake), and [DeFi exploit](#defi-exploit). ::: ::: {#backtest} **Backtest** : A retrospective test of an investment strategy or model using historical data, as if the strategy had been applied in the past. A backtest can demonstrate that a strategy *would have worked*, but cannot prove it *will work* in future. The gap between in-sample backtest performance and live performance is systematically inflated by [overfitting](#overfitting), [look-ahead bias](#look-ahead-bias), and [survivorship bias](#survivorship-bias). 
@harvey2019backtesting provide a comprehensive treatment of backtest methodology and its limitations. See also [walk-forward validation](#walk-forward-validation). ::: ::: {#bias-variance} **Bias-variance tradeoff** : A fundamental property of statistical learning: any model must balance two sources of prediction error. **Bias** arises when the model is too constrained to capture the true relationship (underfitting); **variance** arises when the model is so flexible it fits noise as well as signal (overfitting). For a model $\hat{f}$, the expected mean squared error decomposes as $\text{Bias}^2[\hat{f}] + \text{Var}[\hat{f}] + \sigma^2_\varepsilon$, where $\sigma^2_\varepsilon$ is irreducible noise. In portfolio optimisation, naive [MPT](#mpt) has high variance: small changes in estimated returns produce wildly different optimal weights. See @murphy2012machine for the formal treatment; @deprado2018advances for financial ML applications. See [overfitting](#overfitting) and [estimation error](#estimation-error). ::: ::: {#black-litterman} **Black-Litterman model** : A Bayesian framework for portfolio construction that combines the market equilibrium portfolio with investor views [@reher2024robo]. The key insight: rather than estimating expected returns from historical data (which is noisy), start from *market-implied returns* — the returns that would make the current market portfolio optimal — and then adjust these according to specific views, weighted by the investor's confidence in each view. The result is a much more stable and diversified set of inputs to [MPT](#mpt) than raw historical estimates. Black-Litterman is the dominant approach in institutional portfolio management because it avoids the "[garbage in, garbage out](#estimation-error)" problem of unconstrained mean-variance optimisation. It is a form of [shrinkage](#shrinkage) in which the market portfolio serves as the prior. 
See [Modern Portfolio Theory](#mpt), [estimation error](#estimation-error), and [shrinkage](#shrinkage).
:::

---

## C

::: {#cardinality-constraint}
**Cardinality constraint**
: A constraint on a portfolio optimisation problem that limits the *number* of assets held to some maximum $K$ out of a universe of $N$ assets. For example: "hold no more than 50 stocks from the S&P 500." Adding cardinality constraints transforms the problem from a smooth quadratic programme (solvable exactly by classical methods) into a combinatorial search problem that is [NP-hard](#np-hard): there are $\binom{N}{K}$ possible asset subsets to evaluate, a number that grows combinatorially with $N$. For $N = 500$ and $K = 50$, the number of combinations is on the order of $10^{69}$, so exhaustive enumeration is infeasible. This is a principal motivation for [evolutionary algorithms](#evolutionary-algorithm) in portfolio optimisation. See @liu2024evolutionary.
:::

::: {#covariance-matrix}
**Covariance matrix**
: A symmetric $N \times N$ matrix $\Sigma$ whose $(i,j)$ element is the covariance between the returns of assets $i$ and $j$. The diagonal entries are the variances of individual assets; the off-diagonal entries capture how assets move together. The covariance matrix is the central object in [MPT](#mpt): portfolio variance is $w^\top \Sigma w$ for weight vector $w$. For $N$ assets, the full matrix contains $N(N+1)/2$ unique parameters that must be estimated from historical data. For 1,000 assets, this is 500,500 parameters: a severe [estimation error](#estimation-error) problem. The eigenvalue decomposition $\Sigma = V \Lambda V^\top$ reveals the matrix's structure: large [eigenvalues](#eigenvalue) correspond to dominant risk factors; small ones are largely noise (see [Marcenko-Pastur Law](#marcenko-pastur)). See [Modern Portfolio Theory](#mpt), [eigenvalue](#eigenvalue), and [denoising](#denoising).
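As a minimal sketch of the two quantities in the covariance matrix entry, the snippet below computes portfolio variance $w^\top \Sigma w$ and the parameter count $N(N+1)/2$ in plain Python. The two-asset matrix, the weights, and the function names are illustrative assumptions, not course data.

```python
def portfolio_variance(weights, cov):
    """Return w' Σ w for a weight vector and a covariance matrix (nested lists)."""
    n = len(weights)
    return sum(weights[i] * cov[i][j] * weights[j]
               for i in range(n) for j in range(n))


def unique_covariance_parameters(n_assets):
    """Number of distinct entries in a symmetric N x N matrix: N(N+1)/2."""
    return n_assets * (n_assets + 1) // 2


# Hypothetical covariance matrix: variances 0.04 and 0.09, covariance 0.012.
cov = [[0.04, 0.012],
       [0.012, 0.09]]
weights = [0.6, 0.4]

variance = portfolio_variance(weights, cov)    # 0.6^2*0.04 + 2*0.6*0.4*0.012 + 0.4^2*0.09
n_params = unique_covariance_parameters(1000)  # parameters to estimate for 1,000 assets
```

The double sum makes the role of the off-diagonal terms explicit: each covariance enters twice, once as $(i,j)$ and once as $(j,i)$.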
::: ::: {#cost-sensitive-threshold} **Cost-sensitive threshold** : A classification decision boundary chosen to minimise expected cost rather than maximise accuracy. In rare-event problems (fraud detection, default prediction), the default 0.5 probability threshold is almost always wrong: it predicts the majority class for every observation, achieving high accuracy while catching zero events. A cost-sensitive threshold explicitly sets the relative penalties for false positives (e.g. investigation cost) and false negatives (e.g. regulatory fine), then selects the threshold that minimises total expected cost. The optimal threshold shifts towards zero as the cost ratio (false negative / false positive) increases. Different regulatory environments imply different cost ratios and therefore different thresholds, a fact students should discuss in their analysis. See [walk-forward validation](#walk-forward-validation). ::: ::: {#cross-validation} **Cross-validation (K-fold)** : A method for estimating out-of-sample prediction error using only the available data. The idea is to split the *training* sample into $K$ parts (called **folds**). You fit the model on $K-1$ folds and evaluate it on the held-out fold, repeating so that each fold is used once as the validation set. The average validation error approximates the error you should expect on new data. Cross-validation is most often used to choose **hyperparameters** such as the regularisation strength $\lambda$ in ridge regression or LASSO. It answers a prediction question: "Which setting minimises expected error on unseen data?" It does not establish causality or make coefficients interpretable. In finance, naive random K-fold cross-validation can induce [look-ahead bias](#look-ahead-bias) because time series observations are not exchangeable. Use time-aware versions such as [walk-forward validation](#walk-forward-validation), where training always uses past data and testing uses future data. 
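The time-aware idea can be sketched as an expanding-window split generator. This is one possible scheme under stated assumptions (equal-sized test blocks, a hypothetical `min_train` warm-up length); the function name is illustrative rather than taken from any library.

```python
def walk_forward_splits(n_obs, n_folds, min_train):
    """Yield (train_indices, test_indices) where training always precedes testing."""
    test_size = (n_obs - min_train) // n_folds
    for k in range(n_folds):
        train_end = min_train + k * test_size   # training window expands each fold
        test_end = train_end + test_size
        yield list(range(0, train_end)), list(range(train_end, test_end))


# 12 time-ordered observations, 3 folds, at least 6 observations to train on.
splits = list(walk_forward_splits(n_obs=12, n_folds=3, min_train=6))
```

Every test index is strictly later than every training index in the same fold, which is the property random K-fold splitting does not guarantee.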
See [out-of-sample testing](#out-of-sample-testing), [overfitting](#overfitting), and [walk-forward validation](#walk-forward-validation). ::: --- ## D ::: {#deflated-sharpe-ratio} **Deflated Sharpe Ratio (DSR)** : A multiple-testing-adjusted version of the [Sharpe ratio](#sharpe-ratio) that corrects for selection bias, backtest overfitting, non-normality of returns, and finite sample length, introduced by @lopezdeprado2014dsr. Practical implication: a Sharpe of 2.0 chosen as the best of 100 configurations means something fundamentally different from a Sharpe of 2.0 from a single pre-specified test. Reporting the number of configurations tested is therefore not optional: a Sharpe ratio without a trial count is uninterpretable as evidence of skill. See @deprado2018advances for implementation. ::: ::: {#decision-tree} **Decision tree** : A non-linear prediction model that approximates the conditional mean $E[Y \mid X]$ using a set of if-then rules that recursively partition the feature space. In regression settings, each split chooses a threshold on one predictor (for example, “momentum > 0.05?”) that reduces in-sample squared error the most among candidate splits. The terminal nodes (leaves) then predict a constant, typically the average outcome in that region of the training data. Trees are attractive because they capture interactions automatically (the effect of value can differ depending on momentum) and produce rule-based explanations. Their main weakness is variance: a single deep tree is unstable, small changes in the data can change the split structure. Ensemble methods such as [random forests](#random-forest) and [gradient boosting](#gradient-boosting) stabilise trees by averaging or by sequential correction. See [overfitting](#overfitting), [cross-validation](#cross-validation), and [walk-forward validation](#walk-forward-validation). 
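The split rule described above can be sketched for a single predictor as a search over candidate thresholds for the one that minimises in-sample squared error. The data and the helper names `sse` and `best_split` are hypothetical; a full tree would apply the same search recursively within each resulting region.

```python
def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(x, y):
    """Return (threshold, total_sse) for the single split on x minimising in-sample SSE."""
    best = (None, sse(y))  # baseline: no split, predict the overall mean
    xs = sorted(set(x))
    for lo, hi in zip(xs, xs[1:]):
        threshold = (lo + hi) / 2  # candidate thresholds: midpoints between observed values
        left = [yi for xi, yi in zip(x, y) if xi <= threshold]
        right = [yi for xi, yi in zip(x, y) if xi > threshold]
        total = sse(left) + sse(right)
        if total < best[1]:
            best = (threshold, total)
    return best


# Hypothetical data: a momentum signal and the subsequent return.
momentum = [-0.10, -0.02, 0.01, 0.04, 0.08, 0.12]
returns = [-0.03, -0.02, -0.01, 0.00, 0.05, 0.06]

threshold, total_sse = best_split(momentum, returns)
```

Each leaf then predicts the mean outcome of the training points that fall on its side of the threshold.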
::: ::: {#defi-exploit} **DeFi exploit (flash loan / oracle manipulation)** : A category of attack specific to [Decentralised Finance (DeFi)](#defi) protocols on programmable blockchains such as Ethereum. Unlike traditional financial fraud, DeFi exploits can complete entirely within a *single transaction block* (roughly 12 seconds), which renders conventional fraud monitoring useless. Three principal attack patterns [@perez2021decentralizing]: - **Flash loan attack**: borrow an arbitrarily large sum with no collateral, provided the loan is repaid within the same transaction. The attacker uses the borrowed capital to manipulate a price (e.g. artificially inflate a token's value in a thinly-traded pool), extract profit from a protocol that reads the manipulated price, then repays the loan. If any step fails, the entire transaction reverts — so the attacker risks nothing. Single exploits have exceeded $100M. - **Oracle manipulation**: DeFi protocols rely on on-chain "oracles" that report asset prices. An attacker with access to a flash loan can temporarily distort a thinly-traded token's price, causing the protocol to misliquidate positions or allow under-collateralised borrowing at the artificial price. - **Re-entrancy**: a smart contract calls an external contract mid-execution; the external contract calls back into the original before its internal state is updated, allowing repeated fund withdrawals in a loop. The DAO hack (2016, ~$60M) is the canonical example [@atzei2017survey]. Detection is fundamentally different from card fraud: there is no temporal drift across months, no customer profile, and the adversary can iterate within blocks. Defence relies on on-chain simulation before transaction inclusion, formal verification of contract logic, and anomaly detection in mempool data. See [Isolation Forest](#isolation-forest) and [hybrid model](#hybrid-model). 
::: ::: {#defi} **Decentralised Finance (DeFi)** : Financial services implemented as self-executing smart contracts on a programmable blockchain (most commonly Ethereum), operating without a central intermediary. Core DeFi primitives include decentralised exchanges (constant-product automated market makers such as Uniswap), lending protocols (Aave, Compound), and stablecoins (DAI). Because the contract logic is public and execution is deterministic, DeFi is auditable but also fully attackable: any logical flaw in a contract can be exploited by anyone with sufficient capital, often atomically within a single block. See [DeFi exploit](#defi-exploit). ::: ::: {#denoising} **Denoising (covariance matrix)** : The process of removing statistical noise from an estimated [covariance matrix](#covariance-matrix) before using it in portfolio optimisation. When a covariance matrix is estimated from limited data, many of its [eigenvalues](#eigenvalue) reflect sampling noise rather than genuine asset relationships — the [Marcenko-Pastur Law](#marcenko-pastur) provides a theoretical boundary between signal and noise eigenvalues. Denoising proceeds in three steps: (1) eigendecompose the sample covariance matrix $\Sigma = V \Lambda V^\top$; (2) identify noise eigenvalues (those below the Marcenko-Pastur upper bound $\lambda_+$) and replace them with their mean; (3) reconstruct the denoised matrix $\hat{\Sigma} = V \hat{\Lambda} V^\top$. The result is a more stable, better-conditioned matrix that produces more robust portfolio weights out of sample. Denoising is a principled form of [shrinkage](#shrinkage) grounded in [Random Matrix Theory](#rmt). See [Random Matrix Theory](#rmt), [Marcenko-Pastur Law](#marcenko-pastur), and @liu2024evolutionary. ::: --- ## E ::: {#efficient-frontier} **Efficient frontier** : The set of portfolios that offer the highest expected return for a given level of risk (variance), or equivalently, the lowest risk for a given expected return. 
Introduced by @markowitz1952portfolio as the central concept of [Modern Portfolio Theory](#mpt). Portfolios on the efficient frontier are called *mean-variance efficient*; those below it are dominated — there exists another portfolio with the same risk and higher return, or the same return and lower risk. The shape of the efficient frontier depends entirely on the expected returns and [covariance matrix](#covariance-matrix) of the assets, which must be estimated from data. Because these estimates are noisy ([estimation error](#estimation-error)), the *empirical* efficient frontier can differ dramatically from the *true* frontier. Adding more objectives (e.g. skewness, transaction costs) transforms the efficient frontier into a [Pareto frontier](#pareto-frontier) in a higher-dimensional objective space. ::: ::: {#eigenvalue} **Eigenvalue / Eigenvector** : For a square matrix $A$, a scalar $\lambda$ and non-zero vector $v$ satisfying $Av = \lambda v$ are called an *eigenvalue* and its corresponding *eigenvector*. The eigenvector is a direction that the matrix does not rotate — it is simply scaled by $\lambda$. In finance, the eigendecomposition of the [covariance matrix](#covariance-matrix), $\Sigma = V \Lambda V^\top$, is fundamental: the eigenvectors define the principal directions of risk, and the eigenvalues measure how much variance lies in each direction. The largest eigenvalue typically corresponds to the market factor (all assets tend to move together); the next few capture sector or style effects. Most eigenvalues are small and reflect [estimation error](#estimation-error) rather than genuine risk structure — the [Marcenko-Pastur Law](#marcenko-pastur) formalises exactly which ones. The same concept generalises to rectangular matrices as the Singular Value Decomposition ([SVD](#svd)), which underlies word [embeddings](#embeddings) and [LoRA](#lora) fine-tuning of transformers — the identical mathematical insight applied to language models. 
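To make $Av = \lambda v$ concrete, the sketch below uses power iteration to approximate the largest eigenvalue and its eigenvector for a hypothetical two-asset covariance matrix. Power iteration is used here only because it fits in a few lines of plain Python; in practice a library routine such as `numpy.linalg.eigh` would be the usual choice.

```python
import math


def mat_vec(a, v):
    """Multiply a square matrix (nested lists) by a vector."""
    return [sum(a_ij * v_j for a_ij, v_j in zip(row, v)) for row in a]


def dominant_eigenpair(a, iterations=200):
    """Power iteration: approximate the largest eigenvalue and its unit eigenvector."""
    v = [1.0] + [0.0] * (len(a) - 1)
    for _ in range(iterations):
        w = mat_vec(a, v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    av = mat_vec(a, v)
    eigenvalue = sum(x * y for x, y in zip(v, av))  # Rayleigh quotient v'Av
    return eigenvalue, v


# Hypothetical covariance matrix: equal variances (0.04), correlation 0.5.
cov = [[0.04, 0.02],
       [0.02, 0.04]]
eigenvalue, eigenvector = dominant_eigenpair(cov)
```

For this matrix the dominant eigenvector has equal loadings on both assets, a two-asset analogue of the market factor, and its eigenvalue (0.06) accounts for three quarters of the total variance (0.08).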
See [covariance matrix](#covariance-matrix), [Marcenko-Pastur Law](#marcenko-pastur), [SVD](#svd), and [denoising](#denoising).
:::

::: {#embeddings}
**Embeddings**
: Dense vector representations of discrete objects (tokens, documents, entities) in a continuous $\mathbb{R}^d$ space, learned during model training. The key property: semantically similar objects are geometrically close (high cosine similarity). **Static embeddings** (Word2Vec, GloVe) assign one fixed vector per word type via explicit or implicit matrix factorisation, the rectangular equivalent of [eigendecomposition](#eigenvalue); **contextual embeddings** ([transformer](#transformer)-based) assign different vectors depending on surrounding context. For factorisation-based embeddings, the dimension $d$ is the number of singular vectors (analogous to eigenvectors) retained, i.e. the rank of the low-rank approximation of the co-occurrence or representation matrix. For financial applications, embeddings trained on general web text may not accurately represent domain-specific regulatory vocabulary. See @vaswani2017attention.
:::

::: {#estimation-error}
**Estimation error**
: In portfolio optimisation, the uncertainty in the estimated inputs (expected returns and the [covariance matrix](#covariance-matrix)) due to limited historical data. The critical problem: mean-variance optimisation is highly sensitive to input estimates; small errors in expected returns lead to extreme, unstable portfolio weights. This is sometimes called the "error maximisation" problem: the optimiser treats estimation error as genuine information and doubles down on it. For daily data over 10 years (~2,500 observations) and 100 assets, the [covariance matrix](#covariance-matrix) has 5,050 free parameters. Even with 25 observations per asset, the smaller eigenvalues of the sample matrix are materially distorted by sampling noise, so the empirical matrix can be far from the true matrix in the directions an optimiser weights most heavily. Estimation error is the principal reason that naive [MPT](#mpt) frequently underperforms a simple equal-weight portfolio out of sample [@deprado2018advances].
The solutions — [shrinkage](#shrinkage), [Black-Litterman](#black-litterman), [denoising](#denoising) — all reduce the influence of noisy estimates. See [Marcenko-Pastur Law](#marcenko-pastur) for the theoretical characterisation. ::: **ETF (Exchange-Traded Fund)** : A fund that tracks an index (e.g., S&P 500) and trades on a stock exchange like a share. ETFs typically have much lower expense ratios than actively managed mutual funds because they require no active stock-picking. The Bloomberg database used in this course contains eight ETFs: SPY, TLT, QQQ, GLD, EFA, BND, IWM, and VNQ, spanning equities, bonds, gold, real estate, and international exposure. ::: {#evolutionary-algorithm} **Evolutionary algorithm (MOEA)** : A class of population-based metaheuristic optimisation methods inspired by biological evolution — natural selection, mutation, and recombination. A population of candidate solutions (portfolios) evolves over generations: better solutions are more likely to survive and produce offspring; random mutations and crossover operations explore new combinations. Multi-Objective Evolutionary Algorithms (MOEAs) simultaneously optimise several conflicting objectives (e.g. maximise return, minimise variance, maximise skewness), producing a [Pareto frontier](#pareto-frontier) of non-dominated solutions. Evolutionary algorithms are particularly suited to portfolio problems because they do not require the problem to be convex, differentiable, or tractable — they handle [NP-hard](#np-hard) problems with [cardinality constraints](#cardinality-constraint), minimum transaction lots, and other real-world restrictions that defeat classical solvers. @liu2024evolutionary propose a MOEA framework for large-scale portfolio selection that handles both random and uncertain returns. See [NP-hard](#np-hard), [Pareto frontier](#pareto-frontier), and [cardinality constraint](#cardinality-constraint). 
::: --- ## F ::: {#false-discovery} **False discovery problem** : In research and investment strategy, the problem that arises when many hypotheses are tested and only the successful results are reported. Under the null hypothesis of no true effect, each test at the 5% threshold produces a false positive with probability 5%; conducting $k$ independent tests yields on average $0.05k$ false positives. @harvey2016and reviewed 316 published equity factors and showed that, after adjusting for multiple testing, the majority of claimed factor premia cannot be statistically distinguished from noise. See [p-hacking](#p-hacking) and [backtest](#backtest). ::: ::: {#fat-tails} **Fat tails** : The property of a distribution in which extreme outcomes occur far more frequently than a normal distribution predicts. Measured by *excess kurtosis* — a normal distribution has kurtosis = 3 (excess kurtosis = 0); financial return series typically have excess kurtosis exceeding 5. The S&P 500's daily return series exhibits excess kurtosis exceeding 12, meaning moves of 3 or more standard deviations occur roughly 10 times as often as normality implies. Risk models that assume normality (e.g. standard Value at Risk) systematically understate the probability of catastrophic losses. Classified as one of the canonical [stylised facts](#stylised-facts) by @cont2001empirical. See [skewness](#skewness) and [volatility clustering](#volatility-clustering). ::: **FinTech** : Financial Technology. Broadly, technological change in financial services. Industry usage focuses on product verticals (payments, lending, wealth management, insurance). Academic usage focuses on changes to financial *functions* driven by lower information costs and shifts in market structure [@philippon2016fintech]. 
--- ## G ::: {#garch} **GARCH (Generalised Autoregressive Conditional Heteroskedasticity)** : A time series model for *time-varying volatility*, introduced by @bollerslev1986generalized as an extension of Engle's ARCH model [@engle1982autoregressive]. The standard GARCH(1,1) specifies: $\sigma^2_t = \omega + \alpha \varepsilon^2_{t-1} + \beta \sigma^2_{t-1}$, where $\sigma^2_t$ is today's conditional variance, $\varepsilon_{t-1}$ is the previous period's return shock, and the constraint $\alpha + \beta < 1$ ensures variance is stationary. The model directly captures [volatility clustering](#volatility-clustering): large shocks increase tomorrow's variance (through $\alpha$), which then persists (through $\beta$). Typical daily equity estimates: $\alpha \approx 0.08$, $\beta \approx 0.90$, confirming high persistence. Note that the GARCH recursion is structurally equivalent to an [RNN](#rnn) hidden state constrained to a specific scalar parametric form. See [stylised facts](#stylised-facts). ::: ::: {#gradient-boosting} **Gradient boosting (boosted trees)** : An ensemble method that builds a strong predictor by combining many weak predictors (usually shallow [decision trees](#decision-tree)) fitted sequentially. In regression, each new tree is trained to predict the residuals of the current ensemble, so the model focuses on what it is still getting wrong. The learning rate (shrinkage) controls how much each new tree contributes. Statistical intuition: boosting primarily targets bias reduction by increasing functional flexibility, but it can overfit without careful regularisation (learning rate, tree depth, number of trees, and time-aware validation). See [decision tree](#decision-tree), [random forest](#random-forest), [cross-validation](#cross-validation), and [walk-forward validation](#walk-forward-validation). ::: --- ## I ::: {#isolation-forest} **Isolation Forest** : An unsupervised anomaly detection algorithm introduced by @liu2008isolation. 
The core intuition is that anomalies are *easy to isolate*: because they occupy sparse regions of the feature space, a recursive random partitioning (equivalent to building a random decision tree) needs very few splits to separate an anomalous observation from the rest. Each tree records the *path length* needed to isolate each point; the ensemble average of these path lengths is the anomaly score (shorter = more anomalous). Normal observations, embedded in dense clusters, require many more splits and have longer average paths. Unlike supervised classifiers, Isolation Forest needs no labels. In fraud detection this is valuable because (a) labels are expensive to obtain, and (b) novel fraud tactics are not yet represented in labelled data. The model learns *what normal looks like* and flags deviations. A key limitation: unusual is not the same as fraudulent. Large legitimate transactions (foreign holidays, one-off purchases) score anomalous too, producing false positives. In practice, the unsupervised anomaly score is best used as an *additional feature* fed into a supervised model rather than as a standalone classifier — see [hybrid model](#hybrid-model). The `contamination` hyperparameter sets the expected proportion of anomalies, shifting the decision threshold but not the underlying scores. See [class imbalance](#class-imbalance) and [cost-sensitive threshold](#cost-sensitive-threshold). ::: ::: {#hybrid-model} **Hybrid model (unsupervised + supervised)** : In fraud detection, a two-stage architecture in which unsupervised anomaly scores (e.g. from [Isolation Forest](#isolation-forest), an autoencoder, or a graph centrality measure) are computed first and then added as features to a supervised classifier. The supervised model then learns *when* anomalous-looking transactions are actually fraudulent, calibrating against the label distribution. 
This pattern addresses the complementary weaknesses of each approach: unsupervised methods surface novel patterns without requiring labels; supervised methods use labels to separate genuine fraud from benign anomalies. The AUC improvement from adding one unsupervised feature is typically modest (+0.01–0.02) but consistent, and stacking multiple unsupervised signals can add further gains.
:::

---

## L

::: {#log-returns}
**Log returns**
: The continuously compounded return on an asset: $g_t = \ln(P_t/P_{t-1})$, where $P_t$ is the price at time $t$. Log returns are preferred in statistical analysis for two reasons: (1) they are *additive over time*, so the multi-period log return is the exact sum of single-period log returns; (2) any sequence of log returns implies strictly positive prices, so negative prices are ruled out by construction. Writing the simple return as $r_t = P_t/P_{t-1} - 1$, we have $g_t = \ln(1 + r_t) \approx r_t$ for small returns ($|r_t| < 5\%$); the difference becomes material for large moves. The [stylised facts](#stylised-facts) of asset returns apply to both log returns and simple returns. See @campbell1997econometrics.
:::

::: {#lora}
**LoRA (Low-Rank Adaptation)**
: A parameter-efficient fine-tuning method for large language models that represents weight updates as the product of two low-rank matrices: $\Delta W = AB$, where $A \in \mathbb{R}^{d \times r}$, $B \in \mathbb{R}^{r \times k}$, and $r \ll \min(d, k)$. The motivation parallels [denoising](#denoising) in portfolio construction: most of the task-relevant signal in a weight update lives in a low-rank subspace, the span of the top-$r$ singular vectors (see [SVD](#svd)), and the rest is largely noise. By restricting updates to this subspace, LoRA achieves performance competitive with full fine-tuning at a fraction of the parameter cost. The connection to the [Marcenko-Pastur Law](#marcenko-pastur) is direct: if a weight matrix has most singular values in the noise range, the effective rank of meaningful information is small, justifying a low-rank update. See [SVD](#svd), [eigenvalue](#eigenvalue), and [embeddings](#embeddings).
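The parameter saving follows directly from the shapes of $A$ and $B$. The sketch below counts trainable parameters for a dense update versus a rank-$r$ update; the layer size and rank are hypothetical values chosen only for illustration.

```python
def lora_parameter_counts(d, k, r):
    """Return (full, low_rank) trainable-parameter counts for a d x k weight update."""
    full = d * k            # dense update ΔW has d*k entries
    low_rank = r * (d + k)  # A is d x r and B is r x k
    return full, low_rank


# Hypothetical layer: a 4096 x 4096 weight matrix adapted at rank 8.
full, low_rank = lora_parameter_counts(d=4096, k=4096, r=8)
ratio = low_rank / full
```

Under these assumed sizes the low-rank update trains 65,536 parameters instead of 16,777,216, about 0.4% of the dense count.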
::: ::: {#look-ahead-bias} **Look-ahead bias** : The use in a model or backtest of information that would not have been available to a real decision-maker at the time. Examples: normalising data using a full-sample mean computed over the test period; constructing a historical index portfolio using *current* membership, including companies that were not in the index in earlier years. The test: *"At time $t$, could a real decision-maker have legally and practically known this?"* Look-ahead bias is among the most common sources of spuriously good backtest performance. See @harvey2019backtesting and @deprado2018advances. See [backtest](#backtest). ::: ::: {#lstm} **LSTM (Long Short-Term Memory)** : A recurrent neural network architecture designed to learn long-range dependencies in sequential data, introduced by @hochreiter1997long. LSTMs address the *vanishing gradient problem* of standard [RNNs](#rnn) by maintaining a separate *cell state* $c_t$ that passes through time via additive operations. Three learned gates — forget ($f_t$), input ($i_t$), and output ($o_t$) — selectively control what information is erased, written, and exposed. Finance analogy: the forget gate acts as a regime-switching detector, discarding stale model parameters when the market transitions to a new regime. LSTMs substantially outperform vanilla RNNs on tasks with dependencies exceeding ~20 timesteps, but are in turn superseded by [transformer](#transformer) architectures for most modern large-scale applications. See [transformer](#transformer) and [RNN](#rnn). ::: --- ## M ::: {#mixing-service} **Mixing service (cryptocurrency tumbler)** : A service or protocol that obscures the transaction trail on a public blockchain by pooling funds from multiple senders and returning equivalent amounts to intended recipients via a series of intermediate addresses. 
The goal is to break the linkage between input and output addresses that blockchain analytics firms (Chainalysis, Elliptic) exploit to trace illicit flows. Centralised tumblers (custodial services) carry counterparty risk and can be seized by law enforcement. Decentralised mixing protocols (e.g. Tornado Cash on Ethereum) use smart contracts and cryptographic proofs (zero-knowledge proofs) to provide non-custodial mixing. Tornado Cash was sanctioned by the US Treasury's OFAC in 2022. See [privacy coin](#privacy-coin) and [DeFi](#defi).
:::

::: {#marcenko-pastur}
**Marcenko-Pastur Law**
: A result from Random Matrix Theory that characterises the distribution of eigenvalues of a large random covariance matrix. For a matrix estimated from $T$ observations of $M$ assets, if the returns are purely random (no genuine correlations), then in the large-sample limit the eigenvalues $\lambda$ are distributed within the bounds:

$$\lambda_{\pm} = \sigma^2 \left(1 \pm \sqrt{\tfrac{M}{T}}\right)^{2} = \sigma^2 \left(1 \pm \tfrac{1}{\sqrt{Q}}\right)^{2}$$

where $Q = T/M$ is the observation-to-asset ratio and $\sigma^2$ is the mean eigenvalue. Eigenvalues *inside* $[\lambda_-, \lambda_+]$ are statistically indistinguishable from noise. Eigenvalues *above* $\lambda_+$ carry genuine information about asset correlations. For a typical fund with 100 assets and 250 observations ($Q = 2.5$), most empirical eigenvalues fall inside the noise zone, meaning most of the estimated covariance structure is statistically artefactual. The law gives a theoretically grounded rule for [denoising](#denoising) the covariance matrix: replace noise eigenvalues with their mean, preserve signal eigenvalues. The same insight applies to [LoRA](#lora) fine-tuning and [SVD](#svd)-based compression of neural network weight matrices. See [eigenvalue](#eigenvalue), [denoising](#denoising), [Random Matrix Theory](#rmt), and @liu2024evolutionary.
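A minimal sketch of the bounds and of the denoising rule, for the 100-asset, 250-observation case mentioned above. It assumes $\sigma^2 = 1$ (as for a correlation matrix); the eigenvalue list and the function names are hypothetical and serve only to show the mechanics.

```python
import math


def marcenko_pastur_bounds(n_assets, n_obs, sigma2=1.0):
    """Return (lambda_minus, lambda_plus) for M assets and T observations."""
    ratio = n_assets / n_obs  # M / T, the reciprocal of Q
    lower = sigma2 * (1 - math.sqrt(ratio)) ** 2
    upper = sigma2 * (1 + math.sqrt(ratio)) ** 2
    return lower, upper


def denoise_eigenvalues(eigenvalues, upper):
    """Replace eigenvalues at or below the upper bound with their mean; keep the rest."""
    noise = [ev for ev in eigenvalues if ev <= upper]
    if not noise:
        return list(eigenvalues)
    noise_mean = sum(noise) / len(noise)
    return [ev if ev > upper else noise_mean for ev in eigenvalues]


lower, upper = marcenko_pastur_bounds(n_assets=100, n_obs=250)

# Hypothetical eigenvalues, for illustration only.
sample_eigenvalues = [12.0, 3.1, 1.9, 0.8, 0.3]
cleaned = denoise_eigenvalues(sample_eigenvalues, upper)
```

With $Q = 2.5$ the noise band is roughly $[0.135, 2.665]$. Replacing the noise eigenvalues by their mean leaves the trace, and hence total variance, unchanged.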
:::

::: {#mpt}
**Modern Portfolio Theory (MPT)**
: The mathematical framework for constructing portfolios that maximise expected return for a given level of risk (variance), introduced by @markowitz1952portfolio. Given $N$ assets with expected return vector $\mu$ and [covariance matrix](#covariance-matrix) $\Sigma$, the optimal portfolio $w^*$ solves:

$$\max_w \; w^\top \mu - \frac{\gamma}{2} w^\top \Sigma w \quad \text{subject to} \; \sum_i w_i = 1$$

where $\gamma$ is the investor's risk aversion. Varying $\gamma$ traces the [efficient frontier](#efficient-frontier), the set of mean-variance efficient portfolios. MPT is theoretically elegant but faces severe practical challenges: [estimation error](#estimation-error) in $\mu$ and $\Sigma$ causes extreme, unstable weights; the normality assumption ignores [fat tails](#fat-tails) and [skewness](#skewness); and classical quadratic programming solvers do not scale to large portfolios with [cardinality constraints](#cardinality-constraint). See [estimation error](#estimation-error), [efficient frontier](#efficient-frontier), [Black-Litterman](#black-litterman), and @deprado2018advances.
:::

::: {#multi-objective}
**Multi-objective optimisation / Pareto frontier**
: An optimisation problem with two or more conflicting objectives, for example: maximise return *and* minimise variance *and* maximise [skewness](#skewness). Because no single portfolio can simultaneously be best on all objectives, the solution is a *Pareto frontier* (also called a Pareto front or Pareto set): the set of portfolios that are not Pareto-dominated by any other feasible portfolio. A portfolio is *Pareto-dominated* if there exists another portfolio that is at least as good on all objectives and strictly better on at least one. Classical [MPT](#mpt) collapses the multi-objective problem into a single scalar (a risk-adjusted utility such as $w^\top \mu - \frac{\gamma}{2} w^\top \Sigma w$, or the Sharpe ratio); in doing so, it discards information about investor preferences for skewness and other moments.
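The dominance definition translates almost word for word into code. The sketch below filters a hypothetical list of candidate portfolios down to the non-dominated set; objectives are expressed so that larger is better (variance enters with a negative sign). This is a brute-force filter for illustration, not an evolutionary algorithm.

```python
def dominates(a, b):
    """True if a is at least as good as b on every objective and strictly better on one."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))


def pareto_front(points):
    """Return the points that no other point dominates."""
    return [p for p in points
            if not any(dominates(q, p) for q in points if q != p)]


# Hypothetical candidates: (expected return, negative variance).
candidates = [(0.08, -0.04), (0.06, -0.02), (0.05, -0.03),
              (0.10, -0.09), (0.07, -0.04)]
front = pareto_front(candidates)
```

Two candidates are discarded because another portfolio offers at least the same return with no more variance; the remaining three represent genuine trade-offs.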
[Evolutionary algorithms](#evolutionary-algorithm) are well suited to generating the full Pareto frontier because they can maintain a diverse population of solutions simultaneously. See @liu2024evolutionary.
:::

---

## N

::: {#np-hard}
**NP-hard**
: A class of problems, including many combinatorial optimisation problems, for which no algorithm is known that finds the exact solution in time polynomial in the problem size. In portfolio optimisation, adding [cardinality constraints](#cardinality-constraint) (limit to $K$ assets from $N$) or minimum transaction lots (invest in whole units) converts the problem from a smooth quadratic programme to an NP-hard combinatorial search. For $N = 1{,}000$ and $K = 50$, the number of candidate asset subsets is on the order of $10^{85}$, so exhaustive search is computationally infeasible. [Evolutionary algorithms](#evolutionary-algorithm) provide *good approximate* solutions for NP-hard problems without guaranteeing the exact optimum. The practical implication: the portfolio problem that robo-advisers actually face, with hundreds of assets and real-world constraints, generally cannot be solved exactly at scale, only approximated. See [evolutionary algorithm](#evolutionary-algorithm) and [cardinality constraint](#cardinality-constraint).
:::

::: {#null-hypothesis}
**Null hypothesis**
: The default assumption in a statistical test: that there is no effect, no difference, or no relationship. In financial strategy evaluation, the null hypothesis is typically "this strategy has zero [alpha](#alpha)." The purpose of a statistical test is to assess whether the observed data are surprising enough, under this assumption, to reject it. The conventional threshold (p < 0.05) means: if the null hypothesis were true, there is less than a 5% chance of seeing results at least this extreme. This threshold is a convention, not a law, and may be far too lenient when many tests are conducted simultaneously (see [p-hacking](#p-hacking) and [false discovery problem](#false-discovery)).
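Why the 5% threshold becomes lenient under multiple testing can be shown with a short calculation. The sketch assumes the tests are independent and that the null hypothesis is true for every one of them; both are simplifying assumptions, and the function names are illustrative.

```python
def familywise_error_rate(n_tests, alpha=0.05):
    """Probability of at least one false positive across independent tests of true nulls."""
    return 1 - (1 - alpha) ** n_tests


def expected_false_positives(n_tests, alpha=0.05):
    """Expected number of false positives across tests of true nulls."""
    return alpha * n_tests


fwer_20 = familywise_error_rate(20)            # roughly 0.64
expected_100 = expected_false_positives(100)   # 5 spurious "discoveries"
```

Under these assumptions, testing 20 strategies with no real alpha gives about a 64% chance that at least one clears the p < 0.05 bar.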
:::

---

## O

::: {#out-of-sample-testing}
**Out-of-sample testing**
: Evaluating a model or strategy on data that was *not* used in its development or selection. The essential evidence standard in empirical finance and machine learning: a model that works well in-sample but fails out-of-sample is almost certainly [overfitted](#overfitting). See [walk-forward validation](#walk-forward-validation) for the rigorous implementation and [backtest](#backtest) for the common but weaker alternative.
:::

::: {#overfitting}
**Overfitting**
: When a model is fitted so closely to the training data that it captures idiosyncratic noise rather than the underlying pattern, causing poor performance on new data. In quantitative finance, where the [signal-to-noise ratio](#signal-to-noise) is very low, overfitting is the default failure mode: strategies with many free parameters almost always achieve impressive in-sample results that do not survive out-of-sample testing. See [bias-variance tradeoff](#bias-variance) and @deprado2018advances.
:::

::: {#overparameterised}
**Overparameterised regime (P ≥ T)**
: A regression setting in which the number of predictors $P$ is at least as large as the number of observations $T$. With a feature matrix $X \in \mathbb{R}^{T \times P}$, this is the “wide matrix” case: there are at least as many columns as rows. This matters because the standard OLS formula $\hat{\beta} = (X^\top X)^{-1}X^\top y$ relies on $X^\top X$ being invertible. When $P > T$, the $P \times P$ matrix $X^\top X$ has rank at most $T$ and is therefore singular, so OLS is not uniquely defined without additional constraints. Intuitively, there are many different coefficient vectors that can fit the training data equally well, which makes estimates unstable and encourages memorisation of noise. Regularisation (for example, ridge regression) restores well-posedness by adding a penalty that effectively replaces $X^\top X$ with $X^\top X + \lambda I$, which is invertible for any $\lambda > 0$.
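The singularity, and the way a ridge penalty removes it, can be checked on the smallest possible wide matrix: one observation and two predictors. The numbers and helper names below are hypothetical; the sketch uses a 2 x 2 determinant only because it keeps the example free of dependencies.

```python
def gram_matrix(x):
    """Return X'X for a T x P matrix given as a list of rows."""
    p = len(x[0])
    return [[sum(row[i] * row[j] for row in x) for j in range(p)]
            for i in range(p)]


def det2(m):
    """Determinant of a 2 x 2 matrix."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


# Hypothetical design matrix with T = 1 observation and P = 2 predictors.
x = [[1.0, 2.0]]
xtx = gram_matrix(x)  # [[1, 2], [2, 4]] has rank 1

lam = 0.1             # ridge penalty λ (illustrative value)
ridge = [[xtx[i][j] + (lam if i == j else 0.0) for j in range(2)]
         for i in range(2)]

det_ols = det2(xtx)      # zero: X'X cannot be inverted
det_ridge = det2(ridge)  # strictly positive: X'X + λI can be inverted
```

Adding $\lambda$ to the diagonal shifts every eigenvalue of $X^\top X$ up by $\lambda$, so none of them can remain at zero.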
The special case $P = T$ is often called the interpolation boundary: the point at which perfect in-sample fit becomes algebraically easy but statistically misleading. See [shrinkage](#shrinkage), [bias-variance tradeoff](#bias-variance), and [out-of-sample testing](#out-of-sample-testing).
:::

---

## P

::: {#proof-of-work}
**Proof-of-Work (PoW)**
: The consensus mechanism used by Bitcoin and early Ethereum. Validators ("miners") compete to find a nonce value such that the hash of the new block header falls below a target threshold, a computationally expensive search with no shortcut. The first miner to find a valid nonce broadcasts the block and earns the block reward. The security guarantee is economic: attacking the chain (e.g. reversing a confirmed transaction) requires the attacker to redo all the computational work from the target block forward, at a cost exceeding the honest chain's accumulated work. At Bitcoin's scale, this requires ~\$10 billion in mining hardware and ongoing electricity costs. The weakness is probabilistic finality: because any miner could, in principle, produce a longer chain from an earlier block, a transaction is only considered safe after several subsequent blocks confirm it (the standard is 6 blocks, ~60 minutes). This "confirmation window" is a direct constraint on fraud detection response time. See [Byzantine Fault Tolerant consensus](#bft) and [Proof-of-Stake](#proof-of-stake).
:::

::: {#proof-of-stake}
**Proof-of-Stake (PoS)**
: A consensus mechanism in which validators are selected to propose and attest to new blocks in proportion to the amount of cryptocurrency they have locked up ("staked") as collateral. Adopted by Ethereum in 2022 ("The Merge"), PoS replaces the energy-intensive mining of Proof-of-Work with an economic stake. Dishonest or contradictory behaviour (e.g.
signing two conflicting blocks for the same slot) triggers **slashing**: some or all of the offending validator's stake is automatically and irreversibly destroyed by the protocol, imposing a direct financial penalty proportional to the severity of the offence. PoS achieves stronger finality than PoW (~15 minutes on Ethereum with the Casper FFG finality gadget) and consumes roughly 0.01% of Bitcoin's energy. The trade-offs are more complex validator-set management and the risk of "nothing-at-stake" attacks (mitigated by slashing). For fraud detection, the faster finality window means suspicious transactions can be identified more quickly than on Bitcoin. See [Proof-of-Work](#proof-of-work) and [Byzantine Fault Tolerant consensus](#bft).
:::

::: {#privacy-coin}
**Privacy coin**
: A cryptocurrency designed to provide stronger transaction anonymity than Bitcoin or Ethereum, whose public ledgers allow transaction graph analysis. The two most widely used are:

    - **Monero (XMR)**: uses ring signatures (transactions are signed by one member of a group; outsiders cannot determine which), stealth addresses (one-time recipient addresses), and RingCT (hides transaction amounts). The transaction graph is effectively opaque to chain analytics.
    - **Zcash (ZEC)**: uses zk-SNARKs (a form of zero-knowledge proof) to allow "shielded" transactions in which sender, recipient, and amount are hidden while the blockchain can still verify that the transaction is valid. Transparent Zcash transactions have full Bitcoin-like visibility.

    For fraud detection, privacy coins present a fundamental challenge: the graph and amount features that support models such as those on the [Elliptic dataset](#elliptic-dataset) do not exist for shielded transactions. Detection relies instead on on/off-ramp monitoring (exchanges with KYC obligations) and behavioural patterns at the point of conversion. Several exchanges have delisted privacy coins under regulatory pressure (FCA, 2021).
    See [mixing service](#mixing-service) and [DeFi exploit](#defi-exploit).
:::

::: {#pareto-frontier}
**Pareto frontier**
: See [Multi-objective optimisation](#multi-objective).
:::

::: {#p-hacking}
**P-hacking**
: The practice of testing many statistical specifications, hypotheses, or model configurations and reporting only those that produce a result passing a significance threshold (p < 0.05). At a 5% threshold, on average 1 in 20 independent tests of a true null hypothesis will reject it by chance. @harvey2016and show that most of the published equity factor literature is affected. The solutions are pre-registration, out-of-sample validation, and adjusted significance thresholds. See [false discovery problem](#false-discovery).
:::

---

## Q

::: {#q-ratio}
**Q-ratio (observation-to-asset ratio)**
: The ratio $Q = T/M$ of the number of observations $T$ to the number of assets $M$ in a portfolio estimation problem. The Q-ratio governs the severity of [estimation error](#estimation-error) in the [covariance matrix](#covariance-matrix): when $Q \gg 1$, estimates are well-determined; when $Q < 1$, the covariance matrix is rank-deficient and cannot be inverted without regularisation. In the [Marcenko-Pastur Law](#marcenko-pastur), $Q$ controls the width of the noise zone: small $Q$ (underdetermined) produces wide bounds, meaning almost all eigenvalues fall inside the noise band. For a typical institutional manager with 1,000 assets and 250 daily observations (roughly one trading year), $Q = 0.25$, so the problem is severely underdetermined. See [Marcenko-Pastur Law](#marcenko-pastur) and [estimation error](#estimation-error).
:::

---

## R

::: {#rmt}
**Random Matrix Theory (RMT)**
: A branch of mathematics that studies the statistical properties of matrices with random entries, with applications in physics, statistics, and finance. In portfolio optimisation, RMT provides a rigorous framework for understanding the [eigenvalue](#eigenvalue) spectrum of large sample covariance matrices.
The central result for finance is the [Marcenko-Pastur Law](#marcenko-pastur): it defines precisely which eigenvalues of an empirical covariance matrix are consistent with pure random noise, and which carry genuine information. RMT-based [denoising](#denoising) exploits this result to construct more stable covariance matrices by replacing noise eigenvalues with their theoretical expectation. The same mathematical framework applies to the weight matrices of neural networks: researchers use RMT to analyse training dynamics and guide model compression. See [Marcenko-Pastur Law](#marcenko-pastur), [denoising](#denoising), and [eigenvalue](#eigenvalue).
:::

::: {#random-walk}
**Random walk**
: A time series in which changes from one observation to the next are unpredictable given past history: $P_t = P_{t-1} + \varepsilon_t$, with $\varepsilon_t$ independent and identically distributed. Asset prices closely approximate a random walk, which is the empirical content of the weak form of the Efficient Market Hypothesis [@fama1970efficient]. A random walk is non-stationary; the solution is to work with log [returns](#log-returns) (first differences of log prices). See [stationarity](#stationarity).
:::

::: {#random-forest}
**Random forest**
: A tree-based ensemble method that reduces the variance of a single [decision tree](#decision-tree) by averaging many trees. Each tree is trained on a bootstrap resample of the data (bagging), and at each split it considers only a random subset of predictors. This feature subsampling decorrelates the trees, making the average prediction more stable and improving out-of-sample performance. Statistical intuition: random forests primarily target variance reduction. They are often a strong baseline for tabular prediction problems, but their non-linear structure means they are not interpreted via regression coefficients.
See [decision tree](#decision-tree), [gradient boosting](#gradient-boosting), [overfitting](#overfitting), and [cross-validation](#cross-validation).
:::

::: {#regime-dependence}
**Regime-dependence**
: The property of a strategy or model that performs well only in certain market conditions (e.g. bull markets, low volatility) and fails in others. A canonical example from the Bloomberg data: the SPY–TLT correlation was negative (diversifying) from 2018 to 2021 but turned positive during the 2022 rate-hiking cycle, so any portfolio "optimised" on pre-2022 data suffered a regime break. Testing across multiple regimes (bull, bear, high/low volatility, crisis) is a necessary part of evidence discipline. The correlation flip also violates [MPT](#mpt)'s assumption of a stationary [covariance matrix](#covariance-matrix). See [backtest](#backtest) and @harvey2019backtesting.
:::

::: {#rnn}
**RNN (Recurrent Neural Network)**
: A neural network architecture for sequential data that maintains a hidden state $h_t$ summarising all preceding inputs, updated as $h_t = f(W_{hh} h_{t-1} + W_{xh} x_t + b_h)$. Standard RNNs suffer from *vanishing gradients*: gradient signals from distant timesteps decay exponentially, limiting useful memory to ~20 timesteps. Finance analogy: the [GARCH](#garch) equation is a one-dimensional RNN with a fixed parametric form. The [LSTM](#lstm) was designed to solve the vanishing gradient problem. See @hochreiter1997long.
:::

::: {#robo-adviser}
**Robo-adviser**
: An automated investment service that constructs and manages portfolios using algorithms, with minimal human intervention. Typically charges 0.25% of AUM versus ~1% for a traditional adviser. Automates *prediction* (portfolio optimisation, rebalancing) but relies on the client's stated risk tolerance for *judgement*. The economics of robo-advisers are studied in @reher2024robo, who document substantial improvements in access and welfare, particularly for lower-wealth investors.
See [Modern Portfolio Theory](#mpt) and [Black-Litterman](#black-litterman).
:::

---

## S

::: {#sharpe-ratio}
**Sharpe ratio**
: A measure of risk-adjusted return: $\text{SR} = (\bar{r} - r_f) / \sigma$, where $\bar{r}$ is the mean portfolio return, $r_f$ the risk-free rate, and $\sigma$ the standard deviation of returns, all measured over the same period. The canonical formulation is @sharpe1994ratio. A Sharpe ratio of 1.0 is considered good; 2.0+ exceptional in a live, out-of-sample context. The Sharpe ratio is susceptible to inflation via multiple testing: selecting the best of $k$ strategies by Sharpe inflates the expected maximum Sharpe even when all true Sharpe ratios are zero. The [Deflated Sharpe Ratio](#deflated-sharpe-ratio) corrects for this. See @deprado2018advances.
:::

::: {#shap-values}
**SHAP values (Shapley Additive Explanations)**
: A post-hoc explanation method for interpreting predictions from complex models. For a fitted model $\hat{f}$ and an observation $i$, SHAP expresses the prediction as an additive decomposition:

    $$\hat{y}_i = \phi_0 + \sum_{j=1}^{P} \phi_j^{(i)}$$

    where $\phi_0$ is a baseline prediction (defined relative to a background dataset) and $\phi_j^{(i)}$ is the contribution of feature $j$ to observation $i$'s prediction. The contributions are based on Shapley values from cooperative game theory, which allocate credit across features under a set of axioms. The central caution is statistical, not computational. SHAP explains the model you fitted, not the true data generating process. With correlated predictors, attributions are not uniquely identified and can shift depending on the background distribution and modelling choices. SHAP is therefore best used as a tool for describing how a model behaves, not as evidence that a feature has a causal effect. See [decision tree](#decision-tree), [random forest](#random-forest), [gradient boosting](#gradient-boosting), [overfitting](#overfitting), and [out-of-sample testing](#out-of-sample-testing).
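The additive decomposition above can be verified by brute force on a tiny model. This is a minimal sketch that enumerates every feature coalition directly; it does not use the `shap` library, and it makes one specific modelling choice, labelled as an assumption: features outside a coalition are replaced by rows of a simulated background dataset. The functions `coalition_value`, `exact_shapley`, and `toy_model` are hypothetical names introduced for illustration.

```python
from itertools import combinations
from math import factorial

import numpy as np


def coalition_value(model, x, background, subset):
    """Mean prediction when features in `subset` are set to x and the rest come from the background."""
    data = background.copy()
    idx = list(subset)
    if idx:
        data[:, idx] = x[idx]
    return float(model(data).mean())


def exact_shapley(model, x, background):
    """Exact Shapley values by enumerating all coalitions (feasible only for a few features)."""
    n_features = x.shape[0]
    phi = np.zeros(n_features)
    for j in range(n_features):
        others = [k for k in range(n_features) if k != j]
        for size in range(n_features):
            # Shapley weight for a coalition of this size that excludes feature j.
            weight = factorial(size) * factorial(n_features - size - 1) / factorial(n_features)
            for subset in combinations(others, size):
                with_j = coalition_value(model, x, background, subset + (j,))
                without_j = coalition_value(model, x, background, subset)
                phi[j] += weight * (with_j - without_j)
    phi_0 = coalition_value(model, x, background, ())  # baseline: mean background prediction
    return phi_0, phi


def toy_model(data):
    """Assumed toy model: one additive feature plus an interaction between the other two."""
    return 2.0 * data[:, 0] + data[:, 1] * data[:, 2]


rng = np.random.default_rng(0)
background = rng.normal(size=(200, 3))   # simulated background dataset
x = np.array([1.0, 2.0, -1.0])           # observation to explain

phi_0, phi = exact_shapley(toy_model, x, background)
prediction = toy_model(x[None, :])[0]

print("baseline:", phi_0)
print("contributions:", phi)
print("baseline + contributions:", phi_0 + phi.sum(), "prediction:", prediction)
```

The baseline plus the contributions reproduces the prediction exactly, and changing the background rows changes both the baseline and the attributions, which is the dependence on the background distribution noted above.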
:::

::: {#shrinkage}
**Shrinkage (shrinkage estimator, James-Stein)**
: A family of statistical estimators that deliberately pull noisy estimates towards a target value (often the grand mean, zero, or a Bayesian prior) in order to reduce variance. The principle is simple: a small increase in bias can be worth accepting if it produces a larger reduction in variance, so that mean squared error falls overall. In regression and forecasting, penalised methods such as ridge regression and LASSO are shrinkage estimators. The penalty strength $\lambda$ controls how aggressively coefficients are shrunk towards zero, and the practical goal is to minimise out-of-sample prediction error rather than to make bias exactly zero. The term is often introduced via the James-Stein paradox: when three or more means are estimated jointly, the vector of sample means is inadmissible under squared-error loss, and a shrinkage estimator achieves lower total mean squared error. In portfolio optimisation, shrinkage is applied to expected returns and to the [covariance matrix](#covariance-matrix) to reduce [estimation error](#estimation-error) and produce more stable weights out of sample. The [Marcenko-Pastur Law](#marcenko-pastur) provides a principled shrinkage rule for the covariance matrix eigenvalues. See [estimation error](#estimation-error), [Black-Litterman](#black-litterman), [denoising](#denoising), and [bias-variance tradeoff](#bias-variance).
:::

::: {#signal-to-noise}
**Signal-to-noise ratio**
: In financial time series, the ratio of the expected return (signal) to its standard deviation (noise), equivalent to the [Sharpe ratio](#sharpe-ratio) for a zero-risk-free-rate asset. For monthly equity returns, the signal-to-noise ratio is typically 0.15–0.25. This has a direct consequence for statistical inference: detecting genuine positive returns requires many more observations than most evaluation periods provide.
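As a rough worked example of the sample-size point, the sketch below assumes independent, identically distributed monthly returns, so that the t-statistic of the mean return is approximately the signal-to-noise ratio times $\sqrt{T}$, and it uses the two-sided 5% critical value of 1.96. The function name `months_needed` is a hypothetical helper.

```python
import math


def months_needed(monthly_snr: float, z_crit: float = 1.96) -> int:
    """Smallest number of months T for which monthly_snr * sqrt(T) reaches z_crit."""
    return math.ceil((z_crit / monthly_snr) ** 2)


for snr in (0.15, 0.25):
    months = months_needed(snr)
    print(f"SNR {snr:.2f}: {months} months (about {months / 12:.1f} years)")
```

Under these assumptions, the range of ratios quoted above implies roughly 5 to 14 years of monthly data just to reach conventional significance, before any adjustment for multiple testing.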
The fundamental statistical power problem in finance: most investment evaluation periods are far too short to validate or refute a strategy. See @harvey2019backtesting.
:::

::: {#skewness}
**Skewness**
: A measure of the asymmetry of a probability distribution. A *negatively skewed* distribution has a longer left tail: large losses are more extreme than large gains. Daily equity returns are typically negatively skewed, meaning sharp falls are larger in magnitude than equivalent single-day gains. This is one of the canonical [stylised facts](#stylised-facts). [MPT](#mpt)'s variance measure is symmetric (it treats a 20% gain and a 20% loss identically), so portfolios optimised on variance alone under-weight the asymmetric risk of large losses. Incorporating skewness as an objective converts the [efficient frontier](#efficient-frontier) into a [Pareto frontier](#pareto-frontier) in mean-variance-skewness space. See @cont2001empirical.
:::

::: {#stationarity}
**Stationarity**
: A property of a time series in which the mean, variance, and autocovariance structure are constant over time. Standard statistical models require stationarity. Asset price *levels* are non-stationary: under a [random walk](#random-walk), variance grows proportionally with time. Daily log [returns](#log-returns) are approximately stationary: although volatility clusters, the *unconditional* distribution of returns is stable over time. See @campbell1997econometrics.
:::

::: {#stylised-facts}
**Stylised facts**
: Empirical regularities of financial return distributions that appear consistently across markets, asset classes, and time periods, catalogued by @cont2001empirical. The principal facts: (1) near-zero autocorrelation in returns; (2) [fat tails](#fat-tails) (excess kurtosis); (3) negative [skewness](#skewness); (4) [volatility clustering](#volatility-clustering).
These are *empirical regularities*, not theoretical consequences; any model that generates returns must reproduce them to be credible. They directly motivate [ARIMA](#arima) for conditional mean and [GARCH](#garch) for conditional variance.
:::

::: {#survivorship-bias}
**Survivorship bias**
: The distortion arising when a sample contains only observations that "survived" a selection process, omitting those that failed. In financial research: mutual fund databases contain only funds still operating; historical stock data excludes delisted stocks. Survivors have on average better performance than the full population because failures are excluded. See [backtest](#backtest) and [false discovery problem](#false-discovery).
:::

::: {#svd}
**SVD (Singular Value Decomposition)**
: The generalisation of [eigenvalue](#eigenvalue) decomposition to rectangular matrices. Any matrix $W \in \mathbb{R}^{m \times n}$ can be written as $W = U \Sigma V^\top$, where $U$ and $V$ are orthogonal matrices of left and right singular vectors, and $\Sigma$ is a diagonal matrix of non-negative singular values ordered from largest to smallest. The singular values play the same role as eigenvalues: large values carry signal, small values carry noise. Truncating to the top $r$ singular values gives the best rank-$r$ approximation to $W$ (Eckart-Young theorem). SVD underlies: word [embeddings](#embeddings) (GloVe and Word2Vec implicitly factorise a co-occurrence matrix); [LoRA](#lora) (weight update matrices are constrained to rank $r$); and the [Marcenko-Pastur Law](#marcenko-pastur) applied to rectangular weight matrices in neural networks. In finance, [denoising](#denoising) is a special case of SVD truncation applied to the symmetric [covariance matrix](#covariance-matrix). See [eigenvalue](#eigenvalue), [LoRA](#lora), and [embeddings](#embeddings).
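The truncation step can be sketched with NumPy on a small random matrix. The matrix size, the seed, and the rank `r = 2` are arbitrary choices for illustration. The check at the end uses the Eckart-Young result in the Frobenius norm: the approximation error equals the square root of the sum of the squared discarded singular values.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 5))                      # simulated rectangular matrix

# Thin SVD: U is 8x5, s holds 5 singular values (largest first), Vt is 5x5.
U, s, Vt = np.linalg.svd(W, full_matrices=False)

r = 2                                            # number of singular values to keep
W_r = (U[:, :r] * s[:r]) @ Vt[:r, :]             # rank-r reconstruction

frob_error = np.linalg.norm(W - W_r)             # Frobenius norm of the residual
expected_error = np.sqrt(np.sum(s[r:] ** 2))     # energy in the discarded singular values

print("singular values:", np.round(s, 3))
print("approximation error:", frob_error, "expected:", expected_error)
```

Keeping more singular values lowers the error at the cost of a higher-rank approximation, which is the same trade-off that governs the choice of $r$ in the applications listed above.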
:::

---

## T

::: {#transformer}
**Transformer**
: A neural network architecture based entirely on [attention mechanisms](#attention-mechanism), without recurrence or convolution, introduced by @vaswani2017attention. The core component is *multi-head self-attention*, which allows every token to attend directly to every other token, with attention weights learned from data. This replaces the sequential hidden-state computation of [RNNs](#rnn) and [LSTMs](#lstm) with a fully parallel computation, enabling training at internet scale. Almost all current large language models are transformers. The connection to portfolio mathematics: the weight matrices of transformers have an [eigenvalue](#eigenvalue) spectrum that can be analysed using [Random Matrix Theory](#rmt), informing model compression and fine-tuning strategies such as [LoRA](#lora). See [attention mechanism](#attention-mechanism), [eigenvalue](#eigenvalue), [LoRA](#lora), and [LSTM](#lstm).
:::

---

## V

::: {#vix}
**VIX**
: The CBOE Volatility Index, a real-time measure of *implied volatility* derived from S&P 500 option prices, constructed to reflect the market's expectation of 30-day annualised volatility. Widely used as a "fear gauge": VIX peaked at ~80 during the March 2020 Covid crash versus a long-run average of ~20. VIX systematically exceeds subsequently realised volatility by a *variance risk premium* [@carr2009variance; @bollerslev2009expected], the compensation investors demand for bearing volatility uncertainty. The MOVE index is the bond market equivalent of VIX. Both are available in the Bloomberg database used in this course. See [volatility clustering](#volatility-clustering).
:::

::: {#volatility-clustering}
**Volatility clustering**
: The property of financial return series in which the *magnitude* of returns is positively autocorrelated: large absolute returns tend to be followed by large absolute returns, and small by small, regardless of sign.
Classified as one of the canonical [stylised facts](#stylised-facts) by @cont2001empirical. The [GARCH](#garch)(1,1) model was designed specifically to capture this phenomenon. The practical implication: volatility can be forecast with useful accuracy (R² ≈ 15–40% for variance) even though return levels cannot (R² ≈ 1–2%).
:::

---

## W

::: {#walk-forward-validation}
**Walk-forward validation**
: An [out-of-sample testing](#out-of-sample-testing) methodology in which a model is trained on a rolling window of historical data and evaluated on the immediately following period, stepping forward through time. The cardinal rule: no information from any test window may influence model design or hyperparameter selection. Walk-forward validation is more rigorous than a single in-sample backtest because the model never "sees" future data during any training step. See @deprado2018advances; see [backtest](#backtest) and [look-ahead bias](#look-ahead-bias).
:::

---

## References