Goal:
I want an exploratory research notebook/script for the 50 Round 5 products. Focus only on data analysis and visualization:
mid price
bid/ask spread
order book structure
group-level normalized prices
group common factor
residuals
residual ACF
price return ACF
pairwise correlations
Timo/Frankfurt-Hedgehogs-style visualizations
Do NOT implement trading rules.
Do NOT generate buy/sell signals.
Do NOT optimize thresholds.
Do NOT use trader_id, buyer, seller, or any identity fields.
Context:
I have 3 days of order book and trade data stored in the folder ROUND_5.
There are 50 products divided into 10 groups of 5.
Each product has a position limit of 10, but ignore trading for now.
The goal is to identify which groups/products are worth further research.
Product groups:
GROUPS = {
"GALAXY_SOUNDS": [
"GALAXY_SOUNDS_DARK_MATTER",
"GALAXY_SOUNDS_BLACK_HOLES",
"GALAXY_SOUNDS_PLANETARY_RINGS",
"GALAXY_SOUNDS_SOLAR_WINDS",
"GALAXY_SOUNDS_SOLAR_FLAMES",
],
"SLEEP_POD": [
"SLEEP_POD_SUEDE",
"SLEEP_POD_LAMB_WOOL",
"SLEEP_POD_POLYESTER",
"SLEEP_POD_NYLON",
"SLEEP_POD_COTTON",
],
"MICROCHIP": [
"MICROCHIP_CIRCLE",
"MICROCHIP_OVAL",
"MICROCHIP_SQUARE",
"MICROCHIP_RECTANGLE",
"MICROCHIP_TRIANGLE",
],
"PEBBLES": [
"PEBBLES_XS",
"PEBBLES_S",
"PEBBLES_M",
"PEBBLES_L",
"PEBBLES_XL",
],
"ROBOT": [
"ROBOT_VACUUMING",
"ROBOT_MOPPING",
"ROBOT_DISHES",
"ROBOT_LAUNDRY",
"ROBOT_IRONING",
],
"UV_VISOR": [
"UV_VISOR_YELLOW",
"UV_VISOR_AMBER",
"UV_VISOR_ORANGE",
"UV_VISOR_RED",
"UV_VISOR_MAGENTA",
],
"TRANSLATOR": [
"TRANSLATOR_SPACE_GRAY",
"TRANSLATOR_ASTRO_BLACK",
"TRANSLATOR_ECLIPSE_CHARCOAL",
"TRANSLATOR_GRAPHITE_MIST",
"TRANSLATOR_VOID_BLUE",
],
"PANEL": [
"PANEL_1X2",
"PANEL_2X2",
"PANEL_1X4",
"PANEL_2X4",
"PANEL_4X4",
],
"OXYGEN_SHAKE": [
"OXYGEN_SHAKE_MORNING_BREATH",
"OXYGEN_SHAKE_EVENING_BREATH",
"OXYGEN_SHAKE_MINT",
"OXYGEN_SHAKE_CHOCOLATE",
"OXYGEN_SHAKE_GARLIC",
],
"SNACKPACK": [
"SNACKPACK_CHOCOLATE",
"SNACKPACK_VANILLA",
"SNACKPACK_PISTACHIO",
"SNACKPACK_STRAWBERRY",
"SNACKPACK_RASPBERRY",
],
}
Step 1: Load and clean data
Inspect the local directory and find all Round 5 price/orderbook CSVs and trade CSVs.
Standard IMC orderbook files may contain columns like:
day
timestamp
product
bid_price_1, bid_volume_1
bid_price_2, bid_volume_2
bid_price_3, bid_volume_3
ask_price_1, ask_volume_1
ask_price_2, ask_volume_2
ask_price_3, ask_volume_3
mid_price
profit_and_loss
Standard trade files may contain:
timestamp
buyer
seller
symbol or product
currency
price
quantity
Normalize column names:
symbol -> product
product names should match GROUPS
Completely ignore buyer, seller, trader_id, and any identity-related columns.
Create clean DataFrames:
prices_df
trades_df
Add a global time index so that the three days can be plotted continuously.
Save cleaned versions to:
outputs/clean_prices.csv
outputs/clean_trades.csv
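The cleaning rules in Step 1 could be sketched as follows. This is a minimal illustration, not a prescribed implementation: the helper names `clean_frame` and `add_global_time`, the `global_time` column name, and the constant `TICKS_PER_DAY` are all assumptions, as is the idea that timestamps restart at 0 each day and stay below 1,000,000. If the trade files have no `day` column, the day would have to be attached from the file name before calling `add_global_time`.

```python
import pandas as pd

# Identity-related columns that must never reach the cleaned data.
IDENTITY_COLS = {"buyer", "seller", "trader_id"}

# Assumption (not stated in the spec): timestamps restart at 0 every day
# and stay below this value, so days can be laid end to end.
TICKS_PER_DAY = 1_000_000


def clean_frame(df: pd.DataFrame) -> pd.DataFrame:
    """Rename symbol -> product and drop identity-related columns."""
    out = df.rename(columns={"symbol": "product"})
    keep = [c for c in out.columns if c not in IDENTITY_COLS]
    return out[keep].copy()


def add_global_time(df: pd.DataFrame, day_col: str = "day",
                    ts_col: str = "timestamp") -> pd.DataFrame:
    """Add a global_time column that increases continuously across days."""
    out = df.copy()
    # Dense rank maps the observed day labels to 0, 1, 2, ... in sorted order.
    day_rank = out[day_col].rank(method="dense").astype(int) - 1
    out["global_time"] = day_rank * TICKS_PER_DAY + out[ts_col]
    return out


# Tiny synthetic example (made-up values, not real Round 5 data).
raw = pd.DataFrame({
    "day": [2, 3, 3],
    "timestamp": [0, 0, 100],
    "symbol": ["PEBBLES_XS"] * 3,
    "buyer": ["a", "b", "c"],
    "price": [1.0, 2.0, 3.0],
})
clean = add_global_time(clean_frame(raw))
print(clean)
```

Ranking the day labels rather than using them directly keeps the index starting at 0 regardless of how the days are numbered in the files.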
Step 2: Compute price, spread, and order book features
For every product and timestamp, compute:
best_bid:
bid_price_1
best_ask:
ask_price_1
mid:
use mid_price if available
otherwise use (best_bid + best_ask) / 2
spread:
best_ask - best_bid
relative_spread:
spread / mid
top_level_imbalance:
bid_volume_1 / (bid_volume_1 + abs(ask_volume_1))
handle missing or zero denominator safely
microprice:
(best_ask * bid_volume_1 + best_bid * abs(ask_volume_1)) / (bid_volume_1 + abs(ask_volume_1))
handle missing values safely
wall_mid:
among bid levels 1-3, find the bid level with largest volume
among ask levels 1-3, find the ask level with largest absolute volume
wall_mid = (bid_wall_price + ask_wall_price) / 2
if levels are missing, return NaN
log_mid:
log(mid)
returns:
mid_return = mid.diff()
log_return = log_mid.diff()
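The mid fallback, top_level_imbalance, microprice, and wall_mid definitions above might be computed along these lines. This is a sketch under stated assumptions: the helper names `_wall_price` and `add_book_features` are hypothetical, a zero or missing volume denominator is mapped to NaN, and wall_mid is NaN only when one side has no quoted level at all (the spec's "if levels are missing" could also be read more strictly).

```python
import numpy as np
import pandas as pd

LEVELS = (1, 2, 3)


def _wall_price(prices: np.ndarray, volumes: np.ndarray) -> np.ndarray:
    """Price of the level with the largest absolute volume (NaN if none quoted)."""
    # Unquoted levels get -inf so they can never win the argmax.
    vol = np.where(np.isnan(prices) | np.isnan(volumes), -np.inf, np.abs(volumes))
    idx = vol.argmax(axis=1)
    rows = np.arange(len(vol))
    return np.where(np.isinf(vol[rows, idx]), np.nan, prices[rows, idx])


def add_book_features(prices_df: pd.DataFrame) -> pd.DataFrame:
    df = prices_df.copy()
    df["best_bid"] = df["bid_price_1"]
    df["best_ask"] = df["ask_price_1"]

    # Use mid_price when present, otherwise the best bid/ask midpoint.
    fallback_mid = (df["best_bid"] + df["best_ask"]) / 2
    df["mid"] = df["mid_price"].fillna(fallback_mid) if "mid_price" in df else fallback_mid

    bid_v = df["bid_volume_1"]
    ask_v = df["ask_volume_1"].abs()
    total = bid_v + ask_v
    denom = total.where(total > 0)  # zero or missing denominator becomes NaN
    df["top_level_imbalance"] = bid_v / denom
    df["microprice"] = (df["best_ask"] * bid_v + df["best_bid"] * ask_v) / denom

    bid_wall = _wall_price(
        df[[f"bid_price_{k}" for k in LEVELS]].to_numpy(float),
        df[[f"bid_volume_{k}" for k in LEVELS]].to_numpy(float),
    )
    ask_wall = _wall_price(
        df[[f"ask_price_{k}" for k in LEVELS]].to_numpy(float),
        df[[f"ask_volume_{k}" for k in LEVELS]].to_numpy(float),
    )
    df["wall_mid"] = (bid_wall + ask_wall) / 2
    return df


# Synthetic two-row book: row 0 is fully quoted, row 1 has no bids at all.
nan = np.nan
demo = pd.DataFrame({
    "bid_price_1": [99, nan], "bid_volume_1": [5, nan],
    "bid_price_2": [98, nan], "bid_volume_2": [20, nan],
    "bid_price_3": [97, nan], "bid_volume_3": [10, nan],
    "ask_price_1": [101, 101], "ask_volume_1": [-15, -15],
    "ask_price_2": [102, 102], "ask_volume_2": [-5, -5],
    "ask_price_3": [103, 103], "ask_volume_3": [-30, -30],
    "mid_price": [nan, nan],
})
features = add_book_features(demo)
print(features[["mid", "top_level_imbalance", "microprice", "wall_mid"]])
```

The ask volumes are negative in the demo only to exercise the abs() calls; whether the real files use signed volumes is not stated here.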
Save product-level feature data to:
Create summary tables:
A. outputs/product_summary.csv
For each product:
mean_mid
std_mid
mean_spread
median_spread
mean_relative_spread
return_std
number_of_timestamps
missing_mid_count
missing_spread_count
B. outputs/group_summary.csv
For each group:
number of products
average spread
average relative spread
average return volatility
average pairwise return correlation
average pairwise level correlation
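For the two "average pairwise correlation" entries in the group summary, one possible reading is the mean of the off-diagonal entries of the group's correlation matrix. The sketch below assumes a wide DataFrame (rows = timestamps, columns = the group's products) holding either log returns or log price levels; the function name `average_pairwise_corr` is hypothetical.

```python
import numpy as np
import pandas as pd


def average_pairwise_corr(wide: pd.DataFrame) -> float:
    """Mean of the upper-triangle (i < j) Pearson correlations between columns."""
    corr = wide.corr().to_numpy()
    upper = corr[np.triu_indices_from(corr, k=1)]
    return float(np.nanmean(upper))


# Synthetic example: b moves with a, c moves against both.
a = np.array([1.0, 2.0, 3.0, 4.0])
wide_demo = pd.DataFrame({"a": a, "b": 2 * a, "c": -a})
avg_corr = average_pairwise_corr(wide_demo)
print(avg_corr)
```

Each pair is counted once, so a group of 5 products averages over 10 correlations.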
Step 3: Group factor and residual research
For each group of 5 products:
Create a timestamp-aligned matrix of mid prices:
rows = timestamps
columns = products
Create normalized price plots:
A. normalized by first value:
indexed_price_i,t = 100 * mid_i,t / mid_i,0
B. log-normalized:
log_mid_i,t - log_mid_i,0
Compute group common factors:
A. mean_log_factor:
average of the five log_mid series
B. median_log_factor:
median of the five log_mid series
C. mean_indexed_factor:
average of indexed prices
D. optional PCA factor:
if sklearn is available, compute first principal component of standardized log prices
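The four factor definitions above could be assembled roughly as below. This is a sketch, assuming a wide mid-price matrix `mid_wide` for one group whose first row has no missing values; the function name `group_factors` and the column name `pca_factor` are illustrative choices. The PCA factor is only added when sklearn imports successfully, and its sign is arbitrary.

```python
import numpy as np
import pandas as pd


def group_factors(mid_wide: pd.DataFrame) -> pd.DataFrame:
    """Common-factor candidates for one group (rows = timestamps, cols = products)."""
    log_mid = np.log(mid_wide)
    indexed = 100 * mid_wide / mid_wide.iloc[0]  # assumes a complete first row

    factors = pd.DataFrame(index=mid_wide.index)
    factors["mean_log_factor"] = log_mid.mean(axis=1)
    factors["median_log_factor"] = log_mid.median(axis=1)
    factors["mean_indexed_factor"] = indexed.mean(axis=1)

    try:
        from sklearn.decomposition import PCA
    except ImportError:
        return factors  # optional factor is skipped when sklearn is absent

    # Standardize each product's log price, then keep rows with no gaps.
    z = ((log_mid - log_mid.mean()) / log_mid.std()).dropna()
    pc1 = PCA(n_components=1).fit_transform(z)[:, 0]
    factors["pca_factor"] = pd.Series(pc1, index=z.index)
    return factors


# Synthetic two-product group (made-up prices).
mid_demo = pd.DataFrame({
    "PRODUCT_A": [100.0, 110.0, 121.0],
    "PRODUCT_B": [200.0, 210.0, 190.0],
})
factors_demo = group_factors(mid_demo)
print(factors_demo)
```

Rows dropped before the PCA fit come back as NaN in `pca_factor`, so the factor table stays aligned with the original timestamps.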
For each product in each group, fit a simple OLS residual model:
log_mid_i,t = alpha_i + beta_i * group_factor_t + residual_i,t
Do this separately for:
mean_log_factor
median_log_factor
PCA factor if available
Store for each product:
alpha
beta
fitted_fair_value
residual
residual_zscore = (residual - residual_mean) / residual_std
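The per-product OLS fit and the stored quantities could look like the following sketch. It uses `numpy.polyfit` with degree 1 as the OLS solver, which is one choice among several; the function name `fit_residual_model` is hypothetical. Note that `fitted_fair_value` here is in log space, so it would need `np.exp` to be compared with prices directly.

```python
import numpy as np
import pandas as pd


def fit_residual_model(log_mid: pd.Series, factor: pd.Series) -> dict:
    """Fit log_mid = alpha + beta * factor + residual by ordinary least squares."""
    data = pd.concat([log_mid.rename("y"), factor.rename("x")], axis=1).dropna()
    # polyfit returns coefficients from highest degree down: [slope, intercept].
    beta, alpha = np.polyfit(data["x"], data["y"], deg=1)

    fitted = alpha + beta * factor          # fair value in log space
    residual = log_mid - fitted
    zscore = (residual - residual.mean()) / residual.std()
    return {
        "alpha": float(alpha),
        "beta": float(beta),
        "fitted_fair_value": fitted,
        "residual": residual,
        "residual_zscore": zscore,
    }


# Synthetic check: the noise is orthogonal to both the constant and the factor,
# so OLS should recover alpha = 0.5 and beta = 2 and return the noise as residual.
factor_demo = pd.Series([0.0, 1.0, 2.0, 3.0])
noise = pd.Series([0.1, -0.1, -0.1, 0.1])
log_mid_demo = 0.5 + 2.0 * factor_demo + noise
fit = fit_residual_model(log_mid_demo, factor_demo)
print(fit["alpha"], fit["beta"])
```

The same function can be called once per factor (mean_log_factor, median_log_factor, and the PCA factor if present) to produce the three residual variants.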
Compute residual diagnostics:
For each product residual:
residual_mean
residual_std
residual_median
residual_iqr
residual_min
residual_max
zero_crossing_count
zero_crossing_count_per_day
ACF of residual for lags:
[1, 2, 5, 10, 20, 50, 100]
ACF of residual changes for same lags
ACF of mid returns for same lags
regression:
residual_{t+h} - residual_t = a + gamma_h * residual_t + error
for h in:
[1, 5, 10, 20, 50, 100]
AR(1):
residual_{t+1} = c + phi * residual_t + error
if 0 < phi < 1:
half_life = -log(2) / log(phi)
else:
half_life = NaN
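The residual diagnostics above can be sketched with plain numpy and pandas. Several choices here are assumptions rather than part of the spec: the ACF uses the common biased estimator (full-sample variance in the denominator), a zero crossing is counted as a sign change between consecutive non-zero residuals, and the helper names are hypothetical.

```python
import numpy as np
import pandas as pd

LAGS = [1, 2, 5, 10, 20, 50, 100]
HORIZONS = [1, 5, 10, 20, 50, 100]


def acf_at_lags(x, lags=LAGS) -> dict:
    """Sample autocorrelation at selected lags (NaN when the lag is too long)."""
    v = pd.Series(x).dropna().to_numpy(float)
    v = v - v.mean()
    denom = float(np.dot(v, v))
    return {
        k: float(np.dot(v[:-k], v[k:]) / denom) if 0 < k < len(v) and denom > 0 else np.nan
        for k in lags
    }


def future_change_slope(resid, h: int) -> float:
    """gamma_h from: residual_{t+h} - residual_t = a + gamma_h * residual_t + error."""
    r = pd.Series(resid, dtype=float)
    data = pd.concat([r.rename("x"), (r.shift(-h) - r).rename("dy")], axis=1).dropna()
    if len(data) < 3:
        return np.nan
    gamma, _intercept = np.polyfit(data["x"], data["dy"], deg=1)
    return float(gamma)


def ar1_half_life(resid) -> tuple:
    """Return (phi, half_life) from residual_{t+1} = c + phi * residual_t + error."""
    r = pd.Series(resid, dtype=float)
    data = pd.concat([r.shift(1).rename("x"), r.rename("y")], axis=1).dropna()
    phi, _c = np.polyfit(data["x"], data["y"], deg=1)
    half_life = -np.log(2) / np.log(phi) if 0 < phi < 1 else np.nan
    return float(phi), float(half_life)


def zero_crossing_count(resid) -> int:
    """Number of sign changes, ignoring exact zeros."""
    s = np.sign(pd.Series(resid, dtype=float).dropna().to_numpy())
    s = s[s != 0]
    return int(np.sum(s[1:] != s[:-1]))


# Synthetic residual that halves every step: phi = 0.5, so half-life = 1 step.
decay = [8.0, 4.0, 2.0, 1.0, 0.5]
phi_demo, half_life_demo = ar1_half_life(decay)
gamma_demo = future_change_slope(decay, h=1)
acf_demo = acf_at_lags([1.0, -1.0, 1.0, -1.0], lags=[1, 100])
crossings_demo = zero_crossing_count([1.0, -1.0, 0.0, 2.0, -3.0])
print(phi_demo, half_life_demo, gamma_demo, acf_demo, crossings_demo)
```

A negative gamma_h and a phi between 0 and 1 both describe a residual that tends to shrink toward zero; the half-life expresses the same decay in time steps. Applying `acf_at_lags` to `residual.diff()` or to mid returns covers the other two ACF items.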
Compute pairwise correlations:
For each group:
correlation matrix of normalized log price levels
correlation matrix of log returns
correlation matrix of residuals
Save outputs:
outputs/group_factor_series.csv
outputs/product_residuals.csv
outputs/residual_diagnostics.csv
outputs/group_correlation_summary.csv
Visualizations:
Create a figures/ directory and save the following.
For each group:
figures/normalized_prices_{group}.png
plot all 5 products normalized to 100 at the start
one continuous plot across all 3 days
mark day boundaries
figures/log_normalized_prices_{group}.png
plot log_mid_i,t - log_mid_i,0 for all 5 products
mark day boundaries
figures/spreads_{group}.png
plot bid-ask spread over time for all 5 products
include median spread line if useful
figures/relative_spreads_{group}.png
figures/return_correlation_heatmap_{group}.png
figures/level_correlation_heatmap_{group}.png
figures/residuals_{group}.png
plot residual time series for all 5 products
one subplot per product
include horizontal line at 0
include ±1 residual standard deviation bands
figures/residual_zscores_{group}.png
plot residual z-score for all 5 products
include horizontal lines at -2, -1, 0, 1, 2
figures/residual_acf_{group}.png
ACF bar plots for residuals of each product
lags: [1, 2, 5, 10, 20, 50, 100]
figures/return_acf_{group}.png
ACF bar plots for mid/log returns of each product
lags: [1, 2, 5, 10, 20, 50, 100]
figures/residual_future_change_scatter_{group}.png
scatter plot of residual_t vs residual_{t+h} - residual_t
use h = 10 and h = 50
one subplot per product
show fitted regression slope gamma_h
figures/orderbook_dashboard_{group}.html
Plotly interactive dashboard
dropdown to select product
top panel:
bid_price_1, ask_price_1, mid, fitted fair value
second panel:
mid - fitted fair value or log residual
third panel:
bid-ask spread
fourth panel:
trades overlaid as markers using price and quantity
do not show trader identity
make it easy to visually inspect whether the product mean-reverts around the fitted fair value
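As one illustration of the static figures listed above, the normalized-price plot with day boundaries might be drawn as follows. This is a sketch, assuming a wide mid-price matrix indexed by a continuous global time and a list of x positions where each new day starts; `plot_normalized_prices` and `day_starts` are hypothetical names.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for scripts; drop this line inside Jupyter
import matplotlib.pyplot as plt
import pandas as pd


def plot_normalized_prices(mid_wide: pd.DataFrame, day_starts, title: str):
    """Plot every product indexed to 100 at the start, with day-boundary markers."""
    indexed = 100 * mid_wide / mid_wide.iloc[0]
    fig, ax = plt.subplots(figsize=(12, 5))
    for product in indexed.columns:
        ax.plot(indexed.index, indexed[product], label=product, linewidth=0.8)
    for x in day_starts:
        ax.axvline(x, color="grey", linestyle="--", linewidth=0.8)
    ax.set_xlabel("global time")
    ax.set_ylabel("indexed mid (start = 100)")
    ax.set_title(title)
    ax.legend(loc="best", fontsize=8)
    fig.tight_layout()
    return fig, ax


# Synthetic two-product, three-day example (made-up prices).
plot_demo = pd.DataFrame(
    {"PRODUCT_A": [100.0, 101.0, 103.0], "PRODUCT_B": [50.0, 49.0, 51.0]},
    index=[0, 1_000_000, 2_000_000],
)
fig, ax = plot_normalized_prices(plot_demo, day_starts=[1_000_000, 2_000_000],
                                 title="Normalized prices (demo)")
```

Returning the figure lets the caller decide where to save it, for example with `fig.savefig(...)` into the figures/ directory once that directory exists.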
Final written summary:
At the end of the notebook, print a concise research summary with:
Which groups have the strongest common movement?
Which groups have the highest average pairwise return correlation?
Which groups have the cleanest residual structure?
Which products have the most negative residual mean-reversion gamma?
Which products have the strongest negative return ACF?
Which products have residuals that frequently cross zero?
Which products have spreads too wide to be useful?
Which groups/products should be prioritized for the next step of strategy research?
Important:
Do not implement a trading strategy.
Do not generate buy/sell signals.
Do not do threshold optimization.
Focus only on exploratory analysis and visualization.
Use pandas, numpy, matplotlib, and plotly.
Use sklearn only if available.
Save all outputs under outputs/ and figures/.
Create a Jupyter notebook called round5_relative_value_eda.ipynb.