MSc Machine Learning · Thesis Framework
Repricing-intensity modelling of high-frequency Polymarket order-book data, with terminal absorption at resolution.
Juan Mediavilla · University College London · MSc Machine Learning · Supervisor: Prof. Philip Treleaven
Thesis Abstract
Prediction markets price real-world events as continuously traded probabilities, yet the high-frequency process by which those prices form has not been studied at order-book resolution, because no public dataset of a prediction venue's live order book has existed. This thesis introduces such a dataset — a purpose-built, gap-audited, millisecond-resolution corpus of the 250 most active Polymarket markets, captured continuously and through each market's resolution — and uses it to characterise and model the latent dynamics of prediction-market price paths.
Two pre-registered probes establish the governing structure. The natural first candidate, a regime-switching model of price variance, is decisively falsified: apparent volatility regimes do not persist at any grid resolution tested. Instead, the price is sticky and reprices in clustered bursts, and the persistent structure lives in the intensity of repricing (how often the market moves), which is strongly clustered and non-Poisson on every market tested: hot episodes persist for tens of minutes, and rate autocorrelation extends to lags of ~50 minutes.
The thesis's contribution is a generative model of this repricing intensity as a latent hot/cold state process (a Markov-modulated Poisson process), extended so that the intensity is modulated by time-to-resolution — the defining feature of prediction markets, in which every price is absorbed at $0 or $1 by a bounded, approximately known deadline. The model is inferred by variational methods and validated on held-out markets and against realised outcomes via the complete path-to-resolution samples in the corpus.
The work is designed to be defensible whatever the empirical outcome: the central question (does repricing intensity sharpen as a market approaches absorption?) has a clean negative answer as well as a positive one. The dissertation therefore delivers three outputs: a reusable high-frequency prediction-market dataset and collection pipeline; an empirical characterisation of prediction-market price dynamics at a resolution not previously available; and an absorption-aware intensity model for terminally absorbed price processes, a model class with no standard analogue in financial time-series modelling. No deployment of trading capital is involved; the venue has been independently shown to be near-efficient gross of fees, which is precisely what motivates studying price formation rather than prediction.
Thesis Framework · Chapter Micro-abstracts
Context and data, then the empirical discovery, then the model that the discovery dictates, then validation. Each micro-abstract states what the chapter establishes and how.
Establishes the object of study and why it is open. Prediction-market prices are bounded, terminally absorbed at a known deadline, and have an observed terminal outcome — three properties that distinguish them from the equity processes most dynamics models target. Frames the research question (does identifiable, persistent latent structure exist in these price paths, and how is it shaped by approach to resolution?) and states the thesis's defining stance: it is a measurement-and-modelling thesis, not a trading thesis, designed to yield a defensible result whether the central signal is present or absent.
Positions the thesis against three literatures: market microstructure and high-frequency price formation; point-process and regime-switching models of event-timed financial data (Hawkes processes, Markov-modulated Poisson processes, hidden semi-Markov models); and the empirical literature on prediction-market efficiency and price discovery. Identifies the specific gap — intensity modelling of prediction-market repricing as a function of time-to-resolution — and surveys the absorption / bounded-martingale modelling the terminal-boundary contribution builds on.
Documents the dataset and collection system as a contribution in its own right. A purpose-built 24/7 pipeline captures full order-book deltas, periodic snapshots, the trade tape and resolutions for the top-250 Polymarket markets at millisecond resolution, with an audit trail that records every data gap explicitly rather than letting loss pass silently (~10 minutes lost in total over the validation run). Describes order-book reconstruction (delta replay anchored on snapshots), the gold samples (markets observed continuously through absorption), and honest limitations of window length, liquidity selection and single-venue scope.
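The reconstruction step named above (delta replay anchored on snapshots) can be sketched as follows. This is a minimal illustration, not the thesis pipeline: the event schema (`type`, `ts`, `side`, `price`, `size`), the use of integer price ticks as keys, and the convention that a delta carries the new absolute size at a level (with 0 removing it) are all assumptions made for the example.

```python
def replay_book(events):
    """Rebuild best bid/ask from a time-ordered stream of snapshots and deltas.

    Assumed (hypothetical) event shapes:
      {"type": "snapshot", "ts": int, "bids": {price: size}, "asks": {price: size}}
      {"type": "delta", "ts": int, "side": "bid" | "ask", "price": int, "size": float}
    Deltas that arrive before the first snapshot are skipped, because the
    book has no anchor yet and their effect cannot be interpreted.
    """
    bids, asks = {}, {}
    anchored = False
    for ev in events:
        if ev["type"] == "snapshot":
            # A snapshot replaces the whole book and (re-)anchors the replay.
            bids, asks = dict(ev["bids"]), dict(ev["asks"])
            anchored = True
        elif ev["type"] == "delta" and anchored:
            side = bids if ev["side"] == "bid" else asks
            if ev["size"] == 0:
                side.pop(ev["price"], None)  # level removed
            else:
                side[ev["price"]] = ev["size"]  # level set to new size
        else:
            continue
        best_bid = max(bids) if bids else None
        best_ask = min(asks) if asks else None
        yield ev["ts"], best_bid, best_ask


if __name__ == "__main__":
    stream = [
        {"type": "delta", "ts": 0, "side": "bid", "price": 40, "size": 5.0},
        {"type": "snapshot", "ts": 1, "bids": {48: 100.0, 47: 50.0}, "asks": {52: 80.0}},
        {"type": "delta", "ts": 2, "side": "bid", "price": 49, "size": 10.0},
        {"type": "delta", "ts": 3, "side": "bid", "price": 49, "size": 0},
    ]
    for row in replay_book(stream):
        print(row)
```

Re-anchoring on every snapshot means that an error introduced by a lost delta is bounded by the snapshot interval, which is one reason to pair periodic snapshots with the gap audit.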
Reports the groundwork that fixes the thesis's framing. Pre-registered, fee-aware tests across 1,348 markets establish that the venue clears near-efficiently gross of fees and that naive directional strategies have no out-of-sample edge (five independent negative results; the market price beats every coarse prior, with a Brier score of 0.063 against 0.18–0.25, lower being better). This negative result is load-bearing: prices are informative, so the productive question is how they form, not how to beat them, which redirects the thesis from prediction to dynamics.
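For orientation, the Brier score used in that comparison is the mean squared difference between a forecast probability and the realised 0/1 outcome. The sketch below uses made-up toy inputs, not thesis data; the function name is illustrative.

```python
def brier_score(probs, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes.

    Lower is better: 0.0 is a perfect forecast, and a constant forecast
    of 0.5 scores exactly 0.25 whatever the outcomes are.
    """
    if not probs or len(probs) != len(outcomes):
        raise ValueError("probs and outcomes must be non-empty and equal length")
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


if __name__ == "__main__":
    # Toy example only: two forecasts and their realised outcomes.
    print(brier_score([0.9, 0.2], [1, 0]))
    print(brier_score([0.5, 0.5], [1, 0]))
```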
The empirical heart of the thesis. Two pre-registered probes adjudicate between competing models of the price path. A Gaussian regime-switching model of per-bin variance is falsified: apparent regimes flicker, lasting a single bin at every grid resolution (1s/10s/60s). Direct tests then show that repricing events cluster strongly and are non-Poisson on all ten markets (overdispersion CV 2.2–5.3, rate autocorrelation extending to lags of ~50 minutes, hot-stays-hot correlation ρ=0.74), and that a sticky variance model still cannot recover the persistence. Conclusion: the price is sticky with clustered repricing, and persistence lives in the event intensity, not the per-bin variance, which dictates the Chapter 6 model class.
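Two of the diagnostics mentioned above can be illustrated on synthetic data. The sketch assumes one common reading of the overdispersion statistic, the coefficient of variation of inter-event times (which is 1 for a homogeneous Poisson process), and measures rate persistence as the autocorrelation of binned event counts. The simulator, its rates and its deterministic regime alternation are toy assumptions, not the thesis's tests or data.

```python
import random
import statistics


def simulate_switching(rates, dwell, horizon, rng):
    """Event times from a piecewise-constant rate that cycles through `rates`,
    holding each for `dwell` time units. A single rate gives a Poisson process."""
    t, k, events = 0.0, 0, []
    while t < horizon:
        boundary = (k + 1) * dwell
        gap = rng.expovariate(rates[k % len(rates)])
        if t + gap >= boundary:
            # Memorylessness: restart the clock at the regime boundary.
            t, k = boundary, k + 1
        else:
            t += gap
            events.append(t)
    return [e for e in events if e < horizon]


def interarrival_cv(event_times):
    """Coefficient of variation of gaps between consecutive events."""
    gaps = [b - a for a, b in zip(event_times, event_times[1:])]
    return statistics.pstdev(gaps) / statistics.fmean(gaps)


def bin_counts(event_times, bin_width, horizon):
    """Number of events in each consecutive bin of width `bin_width`."""
    counts = [0] * int(horizon // bin_width)
    for t in event_times:
        i = int(t // bin_width)
        if 0 <= i < len(counts):
            counts[i] += 1
    return counts


def autocorr(xs, lag):
    """Sample autocorrelation of a sequence at a positive lag."""
    mean = statistics.fmean(xs)
    var = sum((x - mean) ** 2 for x in xs)
    cov = sum((xs[i] - mean) * (xs[i + lag] - mean) for i in range(len(xs) - lag))
    return cov / var


if __name__ == "__main__":
    rng = random.Random(0)
    poisson = simulate_switching([1.0], 100.0, 20000.0, rng)
    bursty = simulate_switching([10.0, 0.1], 100.0, 20000.0, rng)
    for name, ev in (("poisson", poisson), ("hot/cold", bursty)):
        counts = bin_counts(ev, 10.0, 20000.0)
        print(name, round(interarrival_cv(ev), 2), round(autocorr(counts, 1), 2))
```

On the toy hot/cold stream both statistics move well away from their Poisson reference values (CV near 1, count autocorrelation near 0), which is the qualitative pattern the chapter reports for real repricing events.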
Builds the model the data dictates. A baseline homogeneous Poisson process (M0) is rejected by Chapter 5's clustering result; a Markov-modulated Poisson process (M1) introduces a latent hot/cold intensity state with genuine dwell; and the contribution model (M2) makes the intensity — and the state transitions — functions of time-to-resolution τ, capturing the hypothesised sharpening of repricing as the market approaches its absorbing boundary. Inference is variational; events are defined in tick units and stratified by price level; masked gaps enter the likelihood as censored intervals. A self-exciting (Hawkes) formulation is developed as a comparison model, making endogenous-excitation vs exogenous-regime a substantive model-selection question.
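To make the M2 idea concrete, the sketch below scores binned event counts under a two-state hot/cold model whose intensity depends on time-to-resolution τ, and lets masked gaps contribute no emission term while the latent state still propagates. Several things here are assumptions rather than the thesis's specification: the discrete-time (binned) approximation of the continuous-time process, the functional form in `rate`, all parameter values, the constant transition probabilities (M2 as described also makes these depend on τ), and the use of the exact forward recursion in place of variational inference.

```python
import math


def poisson_logpmf(k, mu):
    """Log-probability of k events under a Poisson distribution with mean mu."""
    return k * math.log(mu) - mu - math.lgamma(k + 1)


def logsumexp(xs):
    """Numerically stable log of a sum of exponentials."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def rate(state, tau, base=(0.2, 5.0), beta=1.0, tau0=60.0):
    """Hypothetical intensity: a cold (0) or hot (1) base rate, scaled up
    smoothly as time-to-resolution tau approaches zero."""
    return base[state] * (1.0 + beta * math.exp(-tau / tau0))


def mmpp_loglik(counts, taus, masked, bin_width, stay=(0.95, 0.90), **rate_kw):
    """Forward-recursion log-likelihood of binned counts under a 2-state model.

    counts[t] : events observed in bin t
    taus[t]   : time-to-resolution at bin t
    masked[t] : True if bin t falls in an audited data gap
    stay      : per-bin probability of remaining in the (cold, hot) state
    """
    log_trans = [
        [math.log(stay[0]), math.log(1.0 - stay[0])],
        [math.log(1.0 - stay[1]), math.log(stay[1])],
    ]
    log_alpha = [math.log(0.5), math.log(0.5)]  # uniform initial state
    for t, (n, tau, is_masked) in enumerate(zip(counts, taus, masked)):
        if t > 0:
            # Propagate the state distribution one bin forward.
            log_alpha = [
                logsumexp([log_alpha[i] + log_trans[i][j] for i in range(2)])
                for j in range(2)
            ]
        if not is_masked:
            # Observed bin: weight each state by its Poisson emission.
            log_alpha = [
                log_alpha[s] + poisson_logpmf(n, rate(s, tau, **rate_kw) * bin_width)
                for s in range(2)
            ]
    return logsumexp(log_alpha)


if __name__ == "__main__":
    counts = [0, 1, 0, 7, 9, 0, 12]
    taus = [600.0, 500.0, 400.0, 300.0, 200.0, 100.0, 0.0]
    masked = [False, False, True, False, False, True, False]
    print(mmpp_loglik(counts, taus, masked, bin_width=1.0))
    print(mmpp_loglik(counts, taus, masked, bin_width=1.0, beta=0.0))
```

Comparing the log-likelihood with `beta` free against `beta=0.0` is the binned analogue of asking whether the τ-modulation earns its place over a plain hot/cold model.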
Subjects the model to the same standard under which earlier candidate findings were retracted. Validation covers held-out predictive likelihood on out-of-sample markets; posterior-predictive checks on inter-event distributions and episode dwell times; calibration of the model's implied probabilities against realised resolutions on the gold samples; simulation-based recovery checks (fit to data from a known model before any empirical claim); and stability across disjoint market subsets and time windows. The central absorption hypothesis — intensity rises or sharpens as τ→0 — is tested with bootstrap confidence intervals and a pre-registered direction, with “no effect” an admissible outcome.
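The bootstrap test of the absorption hypothesis can be sketched as a percentile interval over markets. The per-market statistic assumed here (a log ratio of late to early event rate, positive when intensity rises towards resolution), the resampling unit (whole markets) and the numbers are illustrative assumptions only, not thesis results.

```python
import random
import statistics


def bootstrap_ci(values, stat=statistics.fmean, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for `stat` over `values`.

    Resamples the values with replacement n_boot times and returns the
    (alpha/2, 1 - alpha/2) percentiles of the resampled statistic.
    """
    rng = random.Random(seed)
    n = len(values)
    reps = sorted(
        stat([values[rng.randrange(n)] for _ in range(n)]) for _ in range(n_boot)
    )
    lo = reps[int((alpha / 2) * n_boot)]
    hi = reps[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


if __name__ == "__main__":
    # Made-up per-market log(late rate / early rate) values, one per market.
    log_ratios = [0.3, 0.5, 0.1, 0.4, 0.6, 0.2, 0.7, 0.35]
    lo, hi = bootstrap_ci(log_ratios)
    # With a pre-registered positive direction, the effect is supported only
    # if the whole interval lies above zero; otherwise "no effect" stands.
    print("supported" if lo > 0 else "no effect", (lo, hi))
```

Resampling whole markets rather than individual events keeps within-market dependence intact, which matters when events are as strongly clustered as Chapter 5 reports.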
Consolidates the three deliverables — dataset and pipeline, empirical characterisation, and the absorption-aware intensity model — and states what is learned whichever way the central hypothesis falls. Discusses external validity (single venue, liquidity-selected, a specific event-mix window), transferability of the methods to any deadline-resolved contract or event-timed financial series, and concrete extensions (multi-venue, marked processes, longer accrual). Closes on the methodological stance the thesis demonstrates end-to-end: cheap falsifiable probes, pre-registration, and honest retraction.
Datasets Gathered
All data required to begin modelling is already collected and under version control; the corpus grows daily, which strengthens the absorption analysis over time. No external data acquisition is outstanding.
| Corpus | Contents | Status |
|---|---|---|
| Forward L2 corpus (primary) | Full order-book deltas + snapshots + trade tape, top-250 event markets, millisecond resolution | Live; ~2.4 GB/day, gap-audited |
| Resolution archive | Per-market outcomes + timing; complete path-to-absorption samples | Growing ~10–15/day |
| Historical corpus | Resolved-market snapshots, tape, prior backtest corpora | On hand (1,348 markets) |
Risks & Mitigations