
MSc Machine Learning · Thesis

Latent Dynamics of Prediction-Market Price Paths

Repricing-intensity modelling of high-frequency Polymarket order-book data, with terminal absorption at resolution.

Research Motivations

Three structural peculiarities make prediction-market price paths a distinct and still-open modelling object; one data gap kept the question closed.

  • Bounded & tradable as probability. Prediction markets price real-world events as continuously traded probabilities on (0,1); how those prices form at high frequency is poorly understood.
  • Terminal absorption at a known deadline. Every market is forced to an absorbing outcome ($0 or $1) at a roughly known deadline — a structure standard financial-dynamics models do not capture.
  • Observed terminal outcome. The realised result of every path is recorded — a per-path ground truth that can discipline the model.
  • The data gap. No public high-frequency order-book dataset of a prediction venue existed; I built one, removing the binding constraint on studying these dynamics.

Industry Background & Positioning

Industry background

  • Polymarket is the largest prediction-market venue: central limit order book, on-chain settlement, hundreds of markets resolved weekly.
  • Six months of prior work established that the venue clears near-efficiently gross of fees (1,348 markets, ~$445M), with no exploitable directional edge.
  • Self-built 24/7 collection system: top-250 markets, millisecond order-book deltas, audited continuity (~10 min total loss), capture through resolution.
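
A continuity audit of the kind mentioned above can be sketched as follows. This is a minimal illustration, not the thesis pipeline: the function name `audit_gaps` and the 5-second silence threshold are hypothetical, and it assumes the feed carries heartbeats, so that a long silence means lost data rather than a quiet book.

```python
from typing import List, Tuple


def audit_gaps(
    timestamps_ms: List[int], max_silence_ms: int = 5_000
) -> Tuple[List[Tuple[int, int]], int]:
    """Flag silences longer than max_silence_ms and total the time they cover.

    timestamps_ms must be sorted receive times in milliseconds.
    The threshold is a hypothetical choice, not a value from the deck.
    """
    gaps = [
        (prev, cur)
        for prev, cur in zip(timestamps_ms, timestamps_ms[1:])
        if cur - prev > max_silence_ms
    ]
    # Total loss is the summed length of every flagged interval.
    total_loss_ms = sum(end - start for start, end in gaps)
    return gaps, total_loss_ms


if __name__ == "__main__":
    gaps, loss = audit_gaps([0, 1_000, 2_000, 12_000, 13_000])
    print(gaps, loss)
```

The flagged intervals can then be carried downstream as masks, so that later stages never treat a gap as a period with no activity.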

Research positioning

  • Positioned against three literatures: prediction-market efficiency & price discovery; point-process / regime-switching models of event-timed data (Hawkes, MMPP, hidden semi-Markov); and market microstructure.
  • The specific gap: intensity modelling of prediction-market repricing as a function of time-to-resolution — with no standard analogue for the known terminal boundary.
  • Methodological stance: pre-registered tests, market-clustered bootstrap CIs, documented retractions; an open dataset + pipeline as a reusable artifact.
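
The market-clustered bootstrap named above can be sketched as below. This is a minimal percentile-interval version under stated assumptions: the function name, the choice of the pooled mean as the statistic, and the default of 2,000 resamples are all hypothetical. The one essential point is that whole markets are resampled, not individual observations, so within-market dependence is preserved.

```python
import random
from collections import defaultdict


def clustered_bootstrap_ci(values, market_ids, n_boot=2000, alpha=0.05, seed=0):
    """Percentile CI for the pooled mean, resampling whole markets with replacement."""
    by_market = defaultdict(list)
    for value, market in zip(values, market_ids):
        by_market[market].append(value)
    clusters = list(by_market.values())

    rng = random.Random(seed)
    stats = []
    for _ in range(n_boot):
        # Draw as many markets as there are in the sample, with replacement.
        sample = [rng.choice(clusters) for _ in clusters]
        total = sum(sum(cluster) for cluster in sample)
        count = sum(len(cluster) for cluster in sample)
        stats.append(total / count)

    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    point = sum(values) / len(values)
    return point, lo, hi
```

Resampling individual trades instead would treat correlated observations from one market as independent and would typically understate the interval width.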

Research Objectives

Three deliverables — designed to hold whether the central signal is rich or null.

  • 1 · Dataset & pipeline. A reusable, gap-audited, millisecond-resolution L2 corpus and the 24/7 collection pipeline that produces it — a contribution in its own right.
  • 2 · Empirical characterisation. Characterise the latent dynamics of price paths at order-book resolution — does identifiable, persistent structure exist?
  • 3 · Absorption-aware intensity model. Model the repricing process generatively and test whether its dynamics change systematically as a market approaches terminal absorption — delivering a defensible thesis whether the structure is rich or null.

Research Methodology & Structure

A probe-and-gate arc: rule out the easy story, identify the real structure, then build the model that structure dictates, and validate it to the same standard that forced the retraction of earlier findings.

Exp 1 — Is the venue exploitable? (groundwork)
  • Establish venue efficiency and the absence of a directional edge — the negative result that motivates studying dynamics.
  • Five independent NO-EDGE findings; the market price beats every coarse prior tested.
Exp 2 — Identifying the dynamics
  • Probe the latent structure of price paths; identify the correct model class.
  • Two pre-registered probes — falsify the variance-regime model, establish repricing-intensity clustering.
Exp 3 — Absorption-aware model (the contribution)
  • Build the intensity model and test its dependence on time-to-resolution; validate on held-out markets and realised outcomes.
  • Markov-modulated Poisson process with time-to-resolution modulation; variational inference.

Exp 1: Is the venue exploitable? (Complete)

Experiment
  • Test whether any directional strategy or coarse prior beats the market.
  • Pre-registered, fee-aware backtests; market-clustered bootstrap CIs.
Data
  • 1,348 resolved markets, ~$445M notional.
  • 400 markets / 188,856 trades for the prior test.
  • On-chain maker reconstruction (100% matched).
Models
  • 4 directional strategies; coarse-prior vs market Brier.
  • Cross-market lead-lag; fee-aware PnL accounting.
  • Multiple-testing correction throughout.
Results
  • Five independent NO-EDGE results; near-efficient gross of fees.
  • Market Brier 0.063 beats every prior (0.18–0.25).
  • Conclusion: study price formation, not edge.
Figure: Market Brier 0.063 beats every coarse prior (0.18–0.25) in every category; the comparison sets the late-stage market price against the base-rate prior.
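
For reference, the Brier score used in this comparison is the mean squared error between a probability forecast and the realised 0/1 outcome, so lower is better. The sketch below uses made-up toy numbers, not thesis data. Note that a constant 0.5 forecast always scores exactly 0.25.

```python
def brier(probs, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    if len(probs) != len(outcomes) or not probs:
        raise ValueError("probs and outcomes must be non-empty and equal length")
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


if __name__ == "__main__":
    # Hypothetical toy data: four resolved markets.
    outcomes = [1, 0, 1, 0]
    market_prices = [0.9, 0.1, 0.8, 0.2]   # late-stage market prices (illustrative)
    coarse_prior = [0.5, 0.5, 0.5, 0.5]    # uninformative prior (illustrative)
    print(brier(market_prices, outcomes), brier(coarse_prior, outcomes))
```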

Exp 2: The price-path dynamics (Complete)

Experiment
  • Identify the correct latent-dynamics model class.
  • Two pre-registered probes (v0 falsify, v1 establish).
Data
  • 10 liquid event markets; ~12M order-book updates.
  • Millisecond deltas reconstructed to mid-price.
  • Gap-audited; masked, never interpolated.
Models
  • v0: Gaussian regime-switching HMM (variance).
  • v1: jump-clustering + sticky-HMM + intensity.
  • Grid-invariance checks (1s / 10s / 60s).
Results
  • Variance-regime model FALSIFIED — regimes flicker (0/10).
  • Repricing intensity clustered & non-Poisson (10/10), CV 2.2–5.3.
  • Persistent hot/cold episodes (ρ=0.74, ~50 min memory).
Figure: Repricing events cluster strongly; the intensity autocorrelation persists to ~50 minutes. Persistence lives in the event intensity, not the per-bin variance.
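
Two standard diagnostics for this kind of clustering are sketched below, as an illustration only. The coefficient of variation (CV) of inter-event times equals 1 for a homogeneous Poisson process, so values well above 1 indicate burstiness; the autocorrelation of binned event counts measures how long hot and cold episodes persist. That the deck's CV and ρ figures were computed exactly this way is an assumption, and all function names are hypothetical.

```python
import statistics


def interarrival_cv(event_times):
    """CV (std / mean) of gaps between consecutive events; 1.0 for Poisson."""
    gaps = [b - a for a, b in zip(event_times, event_times[1:])]
    return statistics.pstdev(gaps) / statistics.fmean(gaps)


def binned_counts(event_times, bin_width, t_end):
    """Count events in consecutive bins of width bin_width on [0, t_end)."""
    n_bins = int(t_end // bin_width)
    counts = [0] * n_bins
    for t in event_times:
        k = int(t // bin_width)
        if 0 <= k < n_bins:
            counts[k] += 1
    return counts


def autocorr(xs, lag):
    """Sample autocorrelation of xs at the given positive lag."""
    mean = statistics.fmean(xs)
    var = sum((x - mean) ** 2 for x in xs)
    cov = sum((xs[i] - mean) * (xs[i + lag] - mean) for i in range(len(xs) - lag))
    return cov / var
```

Repeating the binned diagnostic at several bin widths is one way to run the grid-invariance check listed under Models.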

Exp 3: Absorption-aware model (In progress)

Experiment
  • Model repricing intensity vs time-to-resolution τ — the core thesis contribution.
  • Endogenous self-excitation (Hawkes) vs exogenous regime (MMPP) as a real model-selection question.
Data
  • Clean server corpus + gold samples (resolve in-window).
  • ~10–15 complete lifecycles/day, accruing.
  • Event = repricing (>1 tick), price-level stratified; masked gaps as censored intervals.
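
One possible reading of the event definition above is sketched here, under loudly stated assumptions: a repricing event is recorded when the mid-price has moved by more than one tick from its level at the previous event, the tick size is 0.01, and masked samples are passed as None. The function name and all three choices are hypothetical. A masked sample resets the reference level, so no event is inferred across a gap.

```python
def repricing_events(times, mids, tick=0.01):
    """Return times at which the mid has moved by more than one tick.

    mids may contain None for masked (gap) samples. The move is measured
    against the mid at the last event (or the first valid sample after a gap).
    """
    eps = 1e-12  # tolerance so that an exact one-tick move is not counted
    events = []
    ref = None
    for t, mid in zip(times, mids):
        if mid is None:
            ref = None  # masked gap: forget the reference level
            continue
        if ref is None:
            ref = mid   # first valid sample after start or gap
            continue
        if abs(mid - ref) > tick + eps:
            events.append(t)
            ref = mid
    return events
```

Price-level stratification could then be applied by tagging each event with the reference level at which it occurred.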
Models
  • M0: homogeneous Poisson (null).
  • M1: MMPP, hot/cold rate — ESTABLISHED.
  • M2: + time-to-resolution τ (absorbing); Hawkes comparison; variational inference.
Results
  • H1 hot/cold states + dwell — ESTABLISHED (MMPP=M1).
  • H2 intensity collapses as τ→0 (the freeze) — borne out at scale.
  • H3 resolution as absorbing state (intensity→0) — M2b preferred.
Figure: Repricing intensity collapses as time-to-resolution τ→0 (the freeze).
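
The model ladder can be illustrated with a discrete-time approximation: a hidden Markov chain over hot/cold states with Poisson counts per bin, scored by the forward algorithm. This is a sketch, not the thesis model. The symmetric sticky transition matrix, the uniform initial state, and especially the modulation g(τ) = 1 − exp(−τ/τ0) are hypothetical choices, and exact likelihood evaluation here stands in for the variational inference the deck names.

```python
import math


def poisson_logpmf(k, mu):
    """Log P(K = k) for K ~ Poisson(mu); handles mu == 0."""
    if mu <= 0.0:
        return 0.0 if k == 0 else float("-inf")
    return k * math.log(mu) - mu - math.lgamma(k + 1)


def logsumexp(xs):
    m = max(xs)
    if m == float("-inf"):
        return m
    return m + math.log(sum(math.exp(x - m) for x in xs))


def freeze(tau, tau0):
    """Hypothetical modulation g(tau): tends to 0 as tau -> 0, to 1 as tau grows."""
    return 1.0 - math.exp(-tau / tau0)


def mmpp_loglik(counts, taus, rates, stay_prob, bin_width, tau0=None):
    """Forward-algorithm log-likelihood of binned counts.

    rates: per-state event rates (at least two states; equal rates recover M0).
    stay_prob: probability of remaining in the current state, in (0, 1).
    tau0: if given, each bin's rate is scaled by freeze(tau, tau0).
    """
    n = len(rates)
    log_stay = math.log(stay_prob)
    log_move = math.log((1.0 - stay_prob) / (n - 1))
    log_alpha = None
    for k, tau in zip(counts, taus):
        g = 1.0 if tau0 is None else freeze(tau, tau0)
        emit = [poisson_logpmf(k, r * g * bin_width) for r in rates]
        if log_alpha is None:
            log_alpha = [math.log(1.0 / n) + e for e in emit]
        else:
            log_alpha = [
                logsumexp([
                    log_alpha[i] + (log_stay if i == j else log_move)
                    for i in range(n)
                ]) + emit[j]
                for j in range(n)
            ]
    return logsumexp(log_alpha)
```

Setting both rates equal reproduces the homogeneous Poisson null, which gives a simple sanity check before comparing the models on held-out markets.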

Contributions to Science (Academic)

Empirical

  • First order-book-resolution characterisation of prediction-market price dynamics.
  • Identifies repricing-intensity clustering as the governing structure — not price-variance regimes.
  • On a novel, gap-audited dataset with complete path-to-resolution samples.
  • Falsifies the natural variance-regime model — a clean negative result.

Methodological

  • An absorption-aware intensity model for terminally-absorbed price processes.
  • Markov-modulated Poisson process with time-to-resolution modulation.
  • Variational inference; validated by posterior-predictive checks and outcome calibration.
  • A model class with no standard analogue (known terminal boundary).

Impact Statement (Industry)

A reusable research asset

  • Continuous, gap-audited L2 corpus + collection pipeline for a major prediction venue.
  • Rare data; valuable to microstructure and forecasting researchers.
  • Documented, reproducible, extensible to other venues and event classes.

Methods transferable beyond prediction markets

  • Intensity / point-process modelling applies to any event-timed financial data.
  • Absorption-aware framing is relevant to any deadline-resolved contract.
  • Near-efficiency motivates studying price formation, not prediction.
  • Demonstrates rigorous, retraction-honest empirical practice end to end.