Methodology - Mispriced

1. Overview

What the Model Does

The model predicts fair market capitalization from financial statement data using machine learning. It learns historical relationships between company fundamentals (revenue, profits, debt, cash flows) and market valuations across thousands of stocks.

What the Model Does NOT Do

It does not predict future stock prices or returns
It does not account for growth expectations or momentum
It does not provide buy/sell recommendations
It is not a fundamental DCF or comparable company analysis

Key Insight

Mispricing signals are relative, not absolute. A stock showing 20% mispricing has a model-predicted fair value 20% below its current market cap, based on fundamentals alone. This indicates overvaluation relative to other companies in the same quarter: investors are paying more than fundamentals suggest, which could reflect growth expectations, brand value, or other intangibles not captured by financial statements.

Cross-Sectional, Not Time-Series

A separate model is trained for each quarter, so companies are only ever compared with others in the same quarter. This means:

No future leakage: The model cannot learn from future quarters
Market regime adaptation: Valuation multiples change over time (e.g., tech was valued higher in 2021), and each quarter's model learns the multiples prevailing in that quarter
Fair comparison: Companies are valued against contemporaries, not historical norms
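
The per-quarter split can be sketched as follows. This is a minimal illustration, not the project's code: the `quarter`, `revenue`, and `mcap` field names are hypothetical, and a median revenue multiple stands in for the real model.

```python
from collections import defaultdict
from statistics import median

def group_by_quarter(rows):
    """Bucket snapshot rows by their quarter label."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[row["quarter"]].append(row)
    return dict(buckets)

def fit_per_quarter(rows, fit):
    """Fit one independent model per quarter; no model sees another quarter's rows."""
    return {q: fit(qrows) for q, qrows in group_by_quarter(rows).items()}

# Toy stand-in model: the quarter's median market-cap-to-revenue multiple.
def fit_median_multiple(qrows):
    return median(r["mcap"] / r["revenue"] for r in qrows)

# Invented rows, for illustration only.
rows = [
    {"quarter": "2021Q4", "revenue": 100.0, "mcap": 1000.0},
    {"quarter": "2021Q4", "revenue": 50.0, "mcap": 400.0},
    {"quarter": "2022Q4", "revenue": 100.0, "mcap": 500.0},
    {"quarter": "2022Q4", "revenue": 50.0, "mcap": 300.0},
]
models = fit_per_quarter(rows, fit_median_multiple)
```

In this toy data the same revenue earns a different multiple in each quarter, which is why a company is judged only against the model fitted on its own quarter.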

2. Model Architecture

Algorithm

XGBoost

Gradient Boosted Decision Trees

Target Variable

log(market_cap)

Log-transformed for numerical stability, since market caps span several orders of magnitude

Why Tree-Based Models?

Non-linear relationships: Financial ratios have complex, non-linear effects on valuation
Feature interactions: Trees naturally capture interactions (e.g., high debt is worse for low-margin companies)
Robustness: Less sensitive to outliers and missing data than linear models
No scaling required: Tree splits are invariant to monotonic feature transformations

Fixed Hyperparameters


n_estimators: 200

max_depth: 5

learning_rate: 0.1

subsample: 0.8

colsample_bytree: 0.8

objective: reg:absoluteerror

Fixed parameters ensure consistency across quarters. No per-quarter hyperparameter tuning is performed.
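
As a configuration sketch, the listed values map directly onto keyword arguments of xgboost's scikit-learn wrapper. The `make_model` helper and the `random_state` argument are assumptions added for illustration. The `reg:absoluteerror` objective needs a reasonably recent xgboost release (1.7 or later, to the best of my knowledge).

```python
# Hyperparameters as listed above, held constant for every quarter.
XGB_PARAMS = {
    "n_estimators": 200,
    "max_depth": 5,
    "learning_rate": 0.1,
    "subsample": 0.8,
    "colsample_bytree": 0.8,
    "objective": "reg:absoluteerror",
}

def make_model(seed=0):
    """Build a fresh regressor; returns None if xgboost is not installed."""
    try:
        from xgboost import XGBRegressor
    except ImportError:
        return None
    # random_state is an assumption for reproducibility; it is not listed in the text.
    return XGBRegressor(random_state=seed, **XGB_PARAMS)
```

Because the objective is absolute error on log(market_cap), the fitted model approximates a conditional median of log market cap rather than a mean.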

3. Cross-Validation Methodology

The model uses repeated K-fold cross-validation to generate prediction distributions. This approach prevents data leakage and provides uncertainty estimates.

CV Repeats: 10

Folds per Repeat: 5

Model Fits per Quarter: 50

Held-Out Predictions per Stock: 10 (one per repeat)

K-Fold Cross-Validation Diagram

[Diagram: five rows, one per fold (Fold 1 to Fold 5). In each row a different fifth of the data is held out as the test set, and the remaining four fifths are training data.]

Why Repeated CV?

Uncertainty quantification: The standard deviation across a stock's held-out predictions (one per repeat) measures model confidence
Robustness: Averaging reduces sensitivity to specific train/test splits
No data leakage: Each prediction is made on held-out data the model has never seen
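
The procedure can be sketched in plain Python as below. This is a simplified illustration under stated assumptions: the toy "model" predicts the training-set mean, and the function names are hypothetical. In practice a library splitter such as scikit-learn's RepeatedKFold would normally be used.

```python
import random
from collections import defaultdict
from statistics import mean

def repeated_kfold(n_samples, n_splits=5, n_repeats=10, seed=0):
    """Yield (train_idx, test_idx) for every fold of every repeat."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        order = list(range(n_samples))
        rng.shuffle(order)                    # fresh shuffle for each repeat
        for k in range(n_splits):
            test = order[k::n_splits]         # disjoint folds that cover all samples
            test_set = set(test)
            train = [i for i in order if i not in test_set]
            yield train, test

def out_of_fold_predictions(y, fit, predict, **cv_kwargs):
    """Collect held-out predictions per sample across all repeats."""
    preds = defaultdict(list)
    for train, test in repeated_kfold(len(y), **cv_kwargs):
        model = fit([y[i] for i in train])    # the model never sees the test rows
        for i in test:
            preds[i].append(predict(model, i))
    return preds

# Toy model: predict the training-set mean of log market cap for every stock.
log_mcap = [20.1, 21.3, 22.8, 19.5, 23.0, 21.9, 20.7, 22.2, 24.1, 18.9]
preds = out_of_fold_predictions(log_mcap, fit=mean, predict=lambda model, i: model)
n_fits = sum(1 for _ in repeated_kfold(len(log_mcap)))
```

With 5 folds and 10 repeats this scheme fits 50 models. Because each stock falls in exactly one test fold per repeat, it accumulates 10 held-out predictions, and their spread is what the uncertainty estimate is built from.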

4. Feature Engineering

Features are extracted from quarterly financial statements. The model uses a combination of raw fundamentals and financial ratios.

Core Features

Feature	Category	Transform	Fill Strategy
Total Revenue	Fundamentals	log1p	Required
Gross Profit	Fundamentals	log1p	Zero
EBITDA	Fundamentals	log1p	Median
Net Income	Fundamentals	-	Zero
Total Debt	Balance Sheet	log1p	Zero
Total Cash	Balance Sheet	log1p	Zero
Free Cash Flow	Cash Flow	-	Zero
Profit Margin	Ratio	-	Median
Debt-to-Equity	Ratio	log	Median
ROE / ROA	Ratio	-	Median

Current Data Coverage

Feature availability across ~32,000 quarterly snapshots:

Revenue: 91%

Net Income: 72%

Total Debt: 72%

Total Cash: 72%

EBITDA: 64%

Free Cash Flow: 69%

ROA/ROE: 65-71%

Gross Profit: 43%

Transforms Explained

log1p: Applies log(1 + x) to handle large scale differences and zeros
log: Standard log transform for ratio features (excludes zeros)
Median fill: Replaces missing values with sector/industry median
Zero fill: Assumes missing financial data indicates zero (conservative)
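
A small sketch of these transforms and fills, using only the standard library. Missing values are represented as `None`, and the sector labels and numbers are invented for illustration.

```python
import math
from statistics import median

def log1p_transform(x):
    """log(1 + x); only defined for x > -1, so negative inputs need separate handling."""
    return math.log1p(x)

def fill_zero(values):
    """Replace missing entries (None) with 0."""
    return [0.0 if v is None else v for v in values]

def fill_group_median(values, groups):
    """Replace missing entries with the median of observed values in the same group.

    Raises KeyError if a group has no observed value at all.
    """
    observed = {}
    for v, g in zip(values, groups):
        if v is not None:
            observed.setdefault(g, []).append(v)
    medians = {g: median(vs) for g, vs in observed.items()}
    return [medians[g] if v is None else v for v, g in zip(values, groups)]

total_debt = fill_zero([5e8, None, 2e9])
margins = fill_group_median(
    [0.10, None, 0.30, 0.05, None],
    ["Tech", "Tech", "Tech", "Energy", "Energy"],
)
log_debt = [log1p_transform(x) for x in total_debt]
```

Note that a zero-filled value maps to log1p(0) = 0, so after the transform a missing debt figure is indistinguishable from a reported debt of zero.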

5. Mispricing Calculation

Raw Mispricing


                        mispricing = (actual_mcap - predicted_mcap) / actual_mcap

Positive Mispricing

Current market cap exceeds model's predicted fair value. Suggests potential overvaluation — investors are paying beyond fundamentals.

Negative Mispricing

Current market cap is below model's predicted fair value. Suggests potential undervaluation based on fundamentals.
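
A worked example of the formula, with invented market caps. The `predicted_mcap_from_log` helper assumes the model output is log(market_cap) and is converted back with an exponential.

```python
import math

def predicted_mcap_from_log(pred_log_mcap):
    """Map a prediction of log(market_cap) back to currency units."""
    return math.exp(pred_log_mcap)

def raw_mispricing(actual_mcap, predicted_mcap):
    """(actual - predicted) / actual; positive means the market cap is above the model value."""
    return (actual_mcap - predicted_mcap) / actual_mcap

over = raw_mispricing(actual_mcap=10e9, predicted_mcap=8e9)    # market above model
under = raw_mispricing(actual_mcap=10e9, predicted_mcap=12e9)  # market below model
```

Here `over` is 0.20 and `under` is -0.20. Because the denominator is the actual market cap, the measure equals 1 - predicted/actual: it can never exceed 1 but is unbounded below, so large negative values are possible for stocks the model values far above the market.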

Size Premium Correction

Raw mispricing exhibits a systematic size effect: smaller companies tend to show positive mispricing while larger companies show negative mispricing. This reflects the historical "size premium" where smaller companies trade at higher multiples.


                        size_neutral_mispricing = raw_mispricing - size_premium(market_cap)

The size premium is estimated by fitting a smooth curve (spline or polynomial) to the mispricing vs. market cap relationship. This correction isolates stock-specific mispricing from the systematic size effect.
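
The correction can be sketched as below. To stay dependency-free, this sketch fits a straight line (a degree-1 polynomial) in log10(market cap) instead of a spline. The use of log10 as the x-axis, the function names, and the numbers are all assumptions for illustration.

```python
import math

def fit_size_premium(market_caps, raw_mispricings):
    """Least-squares line of raw mispricing on log10(market cap)."""
    xs = [math.log10(m) for m in market_caps]
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(raw_mispricings) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, raw_mispricings))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return lambda mcap: intercept + slope * math.log10(mcap)

def size_neutral(market_caps, raw_mispricings):
    """Subtract the fitted size premium from each raw mispricing."""
    premium = fit_size_premium(market_caps, raw_mispricings)
    return [r - premium(m) for m, r in zip(market_caps, raw_mispricings)]

# Invented numbers following the pattern described above (small caps positive).
mcaps = [1e8, 1e9, 1e10, 1e11]
raw = [0.30, 0.12, -0.08, -0.34]
neutral = size_neutral(mcaps, raw)
```

The size-neutral values are the residuals of the fit, so they average to zero across the universe and no longer trend with company size.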

When to Use Each Mode

Raw: Compare stocks within similar market cap ranges
Size-Neutral: Compare stocks across different market caps (recommended)

Uncertainty Measure


                        relative_std = prediction_std / actual_mcap

Higher relative standard deviation indicates less confident predictions. Stocks with unusual financial profiles or sparse comparable data will have higher uncertainty.
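
A short sketch of the uncertainty measure. Two points are assumptions here: that the standard deviation is taken after converting log predictions back to currency units, and that the sample standard deviation is used. The prediction values are invented.

```python
import math
from statistics import stdev

def relative_std(pred_log_mcaps, actual_mcap):
    """Std. dev. of the repeated predictions (in currency units) over actual market cap."""
    preds = [math.exp(p) for p in pred_log_mcaps]   # undo the log target
    return stdev(preds) / actual_mcap

# Hypothetical held-out predictions of log(market_cap) for two stocks.
stable = [math.log(v) for v in (9.8e9, 10.0e9, 10.2e9)]
noisy = [math.log(v) for v in (6e9, 10e9, 14e9)]
r_stable = relative_std(stable, actual_mcap=10e9)
r_noisy = relative_std(noisy, actual_mcap=10e9)
```

Both stocks have the same average prediction, but the second one's predictions swing widely between train/test splits, so its relative standard deviation is much larger.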

6. Signal Quality & Backtesting

Backtest results measure whether historical mispricing signals predicted future price movements.

Information Coefficient (IC)


                        IC = correlation(mispricing_signal, future_return)

Interpreting IC

IC > 0: Signal worked — overvalued stocks underperformed, undervalued outperformed
IC ~ 0: No predictive signal
IC < 0: Signal inverted — overvalued stocks outperformed (momentum dominated)

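Because positive mispricing denotes overvaluation, a working signal shows up as a negative raw correlation between mispricing and future return. On the dashboard, IC is sign-adjusted so that positive = good signal (mispricing predicted subsequent returns correctly); the interpretation above uses this sign-adjusted convention.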
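
A minimal sketch of the sign convention. It assumes a Pearson correlation (the text does not say whether Pearson or rank correlation is used) and assumes the dashboard adjustment is a plain negation. The data are invented.

```python
from statistics import mean

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def information_coefficient(mispricing, future_return):
    """Sign-adjusted IC: positive when overvalued stocks went on to underperform."""
    return -pearson(mispricing, future_return)

mispricing = [0.30, 0.10, -0.05, -0.25]      # positive = overvalued
future_return = [-0.08, -0.01, 0.02, 0.06]   # return over the chosen horizon
ic = information_coefficient(mispricing, future_return)
```

In this toy sample the overvalued stocks fall and the undervalued ones rise, so the raw correlation is strongly negative and the sign-adjusted IC is strongly positive.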

Hit Rate


                        hit_rate = % of stocks whose return direction matched the signal (overvalued fell, undervalued rose)

A hit rate above 50% indicates the signal has some directional predictive power. However, magnitude of returns matters more than hit rate for portfolio construction.
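
A sketch of the hit-rate calculation under one reading of "matched": an overvalued stock counts as a hit if its subsequent return is negative, and an undervalued stock if its return is positive. Skipping exact zeros is an assumption of this sketch.

```python
def hit_rate(mispricing, future_return):
    """Share of stocks whose return moved the way the signal implied."""
    hits = total = 0
    for m, r in zip(mispricing, future_return):
        if m == 0 or r == 0:
            continue                      # no direction to compare
        total += 1
        hits += (m > 0) != (r > 0)        # overvalued fell, or undervalued rose
    return hits / total if total else float("nan")

rate = hit_rate([0.30, 0.10, -0.05, -0.25], [-0.08, 0.03, 0.02, -0.01])
```

Two of the four invented stocks move as the signal implies, giving a hit rate of 0.5. The measure ignores how large each return was, which is why it is a weaker guide than return magnitude.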

Statistical Significance

P-values are corrected using the Benjamini-Hochberg procedure to control the false discovery rate when testing multiple hypotheses (horizons × sectors/indices).

Significance Stars

★ p < 0.05 (significant)
★★ p < 5e-4 (highly significant)
★★★ p < 5e-8 (extremely significant)
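
The Benjamini-Hochberg adjustment and the star mapping can be sketched as follows. Applying the stars to the adjusted (rather than raw) p-values is an assumption of this sketch, and the p-values are invented.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

def stars(p_adj):
    """Map an adjusted p-value to the star thresholds listed above."""
    if p_adj < 5e-8:
        return "★★★"
    if p_adj < 5e-4:
        return "★★"
    if p_adj < 0.05:
        return "★"
    return ""

raw_p = [0.001, 0.008, 0.039, 0.041, 0.60]
adj_p = benjamini_hochberg(raw_p)
labels = [stars(p) for p in adj_p]
```

Four of the five raw p-values are below 0.05, but only the two smallest stay below 0.05 after adjustment. This is the intended effect when many horizon and sector combinations are tested at once.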

Horizon Analysis

Backtests are run across multiple forward-looking horizons (e.g., 5, 10, 21, 63, 126 trading days) to understand signal persistence and decay. Shorter horizons capture momentum effects while longer horizons reflect fundamental mean reversion.
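
Forward returns over several horizons can be computed as in the sketch below. It assumes simple (not log) returns on a list of daily closing prices. The price path and function names are invented for illustration.

```python
HORIZONS = (5, 10, 21, 63, 126)   # trading days, taken from the examples in the text

def forward_return(prices, start, horizon):
    """Simple return from prices[start] to prices[start + horizon], or None if out of range."""
    end = start + horizon
    if end >= len(prices):
        return None
    return prices[end] / prices[start] - 1.0

def forward_returns(prices, start, horizons=HORIZONS):
    """Forward return for each horizon, keyed by horizon length."""
    return {h: forward_return(prices, start, h) for h in horizons}

# Hypothetical daily closes: a steady 0.1% gain per day for 100 days.
prices = [100.0 * 1.001 ** d for d in range(100)]
rets = forward_returns(prices, start=0)
```

The 126-day entry is `None` because only 100 days of prices exist. Longer horizons can only be evaluated for signal dates with enough subsequent price history, so they rest on fewer observations.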

7. Limitations & Caveats

Not Financial Advice

This tool is for research and educational purposes only. The mispricing signals should not be used as the sole basis for investment decisions. Always consult with a qualified financial advisor and conduct your own due diligence.

Model Limitations

Backward-looking fundamentals: Financial statements are historical. The model cannot capture future growth expectations, pending acquisitions, or unreleased products.
No intangibles: Brand value, intellectual property, network effects, and other intangible assets are not directly measured in financial statements.
Cross-sectional only: The model compares companies at a single point in time. It does not model time-series dynamics or macroeconomic factors.
Sector mixing: The model trains on all sectors together. Industry-specific valuation multiples may not be fully captured.
Survivorship bias: The dataset includes only currently traded stocks. Delisted companies are not included in backtests.

Data Limitations

Variable coverage: Revenue has ~91% coverage, but some features like gross profit have lower availability (~43%). Missing values are filled with sector medians or zeros.
Point-in-time accuracy: Quarterly snapshots may not perfectly align with earnings release dates.
Market cap timing: Historical market caps are reconstructed from price × shares outstanding.

Backtest Caveats

Look-ahead bias: The fixed hyperparameters were selected using the full dataset, including periods later used in backtests. True out-of-sample performance may differ.
Transaction costs: Backtests do not include trading costs, slippage, or market impact.
Past performance: Historical signal quality does not guarantee future results.