round 5 | Notion

In this folder, read round5_wiki.md carefully to understand the algorithmic trading task, and read Trader_class.md and overall.py to understand the format for writing the strategy. Write an EDA (exploratory data analysis) of the tickers Instant Translators and Construction Panels to understand and identify their price patterns. After this, propose practical, profitable trading strategies (taker, maker, statistical arbitrage, signal, momentum, etc.), visualize wherever you can, and return a Jupyter Notebook file named round5_translators_panel_EDA.ipynb.
Read the dashboard.html file. Currently, it cannot parse multiple days of log data. We want to add tabs at the top for switching between days whenever we upload a logbook that contains multiple days. Each page should still look and function as it currently does, but the dashboard needs to be able to parse all 3 days. Provide me with a new dashboard as an .html file.
You are helping me research IMC Prosperity 4 Round 5 algorithmic trading data.

Goal:

I want an exploratory research notebook/script for the 50 Round 5 products. Focus only on data analysis and visualization:

mid price
bid/ask spread
order book structure
group-level normalized prices
group common factor
residuals
residual ACF
price return ACF
pairwise correlations
Timo/Frankfurt-Hedgehogs-style visualizations

Do NOT implement trading rules.

Do NOT generate buy/sell signals.

Do NOT optimize thresholds.

Do NOT use trader_id, buyer, seller, or any identity fields.

Context:

I have 3 days of order book and trade data stored in the ROUND_5 folder.
There are 50 products divided into 10 groups of 5.
Each product has position limit 10, but ignore trading for now.
The goal is to identify which groups/products are worth further research.

Product groups:

GROUPS = {
    "GALAXY_SOUNDS": [
        "GALAXY_SOUNDS_DARK_MATTER",
        "GALAXY_SOUNDS_BLACK_HOLES",
        "GALAXY_SOUNDS_PLANETARY_RINGS",
        "GALAXY_SOUNDS_SOLAR_WINDS",
        "GALAXY_SOUNDS_SOLAR_FLAMES",
    ],
    "SLEEP_POD": [
        "SLEEP_POD_SUEDE",
        "SLEEP_POD_LAMB_WOOL",
        "SLEEP_POD_POLYESTER",
        "SLEEP_POD_NYLON",
        "SLEEP_POD_COTTON",
    ],
    "MICROCHIP": [
        "MICROCHIP_CIRCLE",
        "MICROCHIP_OVAL",
        "MICROCHIP_SQUARE",
        "MICROCHIP_RECTANGLE",
        "MICROCHIP_TRIANGLE",
    ],
    "PEBBLES": [
        "PEBBLES_XS",
        "PEBBLES_S",
        "PEBBLES_M",
        "PEBBLES_L",
        "PEBBLES_XL",
    ],
    "ROBOT": [
        "ROBOT_VACUUMING",
        "ROBOT_MOPPING",
        "ROBOT_DISHES",
        "ROBOT_LAUNDRY",
        "ROBOT_IRONING",
    ],
    "UV_VISOR": [
        "UV_VISOR_YELLOW",
        "UV_VISOR_AMBER",
        "UV_VISOR_ORANGE",
        "UV_VISOR_RED",
        "UV_VISOR_MAGENTA",
    ],
    "TRANSLATOR": [
        "TRANSLATOR_SPACE_GRAY",
        "TRANSLATOR_ASTRO_BLACK",
        "TRANSLATOR_ECLIPSE_CHARCOAL",
        "TRANSLATOR_GRAPHITE_MIST",
        "TRANSLATOR_VOID_BLUE",
    ],
    "PANEL": [
        "PANEL_1X2",
        "PANEL_2X2",
        "PANEL_1X4",
        "PANEL_2X4",
        "PANEL_4X4",
    ],
    "OXYGEN_SHAKE": [
        "OXYGEN_SHAKE_MORNING_BREATH",
        "OXYGEN_SHAKE_EVENING_BREATH",
        "OXYGEN_SHAKE_MINT",
        "OXYGEN_SHAKE_CHOCOLATE",
        "OXYGEN_SHAKE_GARLIC",
    ],
    "SNACKPACK": [
        "SNACKPACK_CHOCOLATE",
        "SNACKPACK_VANILLA",
        "SNACKPACK_PISTACHIO",
        "SNACKPACK_STRAWBERRY",
        "SNACKPACK_RASPBERRY",
    ],
}

Step 1: Load and clean data

Inspect the local directory and find all Round 5 price/orderbook CSVs and trade CSVs.
Standard IMC orderbook files may contain columns like:
- day
- timestamp
- product
- bid_price_1, bid_volume_1
- bid_price_2, bid_volume_2
- bid_price_3, bid_volume_3
- ask_price_1, ask_volume_1
- ask_price_2, ask_volume_2
- ask_price_3, ask_volume_3
- mid_price
- profit_and_loss
Standard trade files may contain:
- timestamp
- buyer
- seller
- symbol or product
- currency
- price
- quantity
Normalize column names:
- symbol -> product
- product names should match GROUPS
Completely ignore buyer, seller, trader_id, and any identity-related columns.
Create clean DataFrames:
- prices_df
- trades_df
Add a global time index:
- global_time = day * 1_000_000 + timestamp
so that the three days can be plotted continuously.
Save cleaned versions to:
- outputs/clean_prices.csv
- outputs/clean_trades.csv
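
For illustration, the cleaning rules in Step 1 could be sketched roughly as below. The helper name `clean_frame`, the `day` argument, and the toy trade rows are assumptions made for this example; they are not part of the original data or any IMC tooling. Reading the real files (for example with `pd.read_csv`) is left out because the exact file names and delimiter have to be confirmed by inspecting the folder.

```python
import pandas as pd

# Identity fields named in the brief; they are dropped before any analysis.
IDENTITY_COLS = {"buyer", "seller", "trader_id"}

def clean_frame(df, day=None):
    """Hypothetical helper: rename symbol -> product, drop identity
    columns, and add global_time = day * 1_000_000 + timestamp."""
    out = df.rename(columns={"symbol": "product"})
    out = out.drop(columns=[c for c in out.columns if c in IDENTITY_COLS])
    if "day" not in out.columns:
        # Assumption: a file without a day column gets its day from the caller.
        if day is None:
            raise ValueError("day is required when the file has no day column")
        out["day"] = day
    out["global_time"] = out["day"] * 1_000_000 + out["timestamp"]
    return out.sort_values(["product", "global_time"]).reset_index(drop=True)

# Toy trade rows (made up) standing in for one day of a trade file.
trades_raw = pd.DataFrame({
    "timestamp": [100, 0],
    "buyer": ["A", "B"],
    "seller": ["C", "D"],
    "symbol": ["PEBBLES_XS", "PEBBLES_XS"],
    "price": [10.0, 11.0],
    "quantity": [1, 2],
})
trades_df = clean_frame(trades_raw, day=2)
```

The same helper could be applied to each price file and the results concatenated before writing outputs/clean_prices.csv and outputs/clean_trades.csv.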

Step 2: Compute price, spread, and order book features

For every product and timestamp, compute:

best_bid:
- bid_price_1
best_ask:
- ask_price_1
mid:
- use mid_price if available
- otherwise use (best_bid + best_ask) / 2
spread:
- best_ask - best_bid
relative_spread:
- spread / mid
top_level_imbalance:
- bid_volume_1 / (bid_volume_1 + abs(ask_volume_1))
- handle missing or zero denominator safely
microprice:
- (best_ask * bid_volume_1 + best_bid * abs(ask_volume_1)) / (bid_volume_1 + abs(ask_volume_1))
- handle missing values safely
wall_mid:
- among bid levels 1-3, find the bid level with largest volume
- among ask levels 1-3, find the ask level with largest absolute volume
- wall_mid = (bid_wall_price + ask_wall_price) / 2
- if levels are missing, return NaN
log_mid:
- log(mid), only if mid > 0
returns (computed within each product, so a diff never spans two products):
- mid_return = mid.diff()
- log_return = log_mid.diff()

Save product-level feature data to:

outputs/product_price_features.csv
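
A minimal sketch of the Step 2 feature definitions, assuming a price frame that already has the standard `bid_*`/`ask_*` columns plus `product` and `global_time`. The function names `add_book_features` and `wall_mid` and the two-row toy book are assumptions for this example only.

```python
import numpy as np
import pandas as pd

def add_book_features(df):
    """Hypothetical helper adding spread, imbalance, microprice and returns."""
    out = df.copy()
    bid_v = out["bid_volume_1"]
    ask_v = out["ask_volume_1"].abs()
    total = bid_v + ask_v
    denom = total.where(total > 0)  # NaN when the denominator is zero or missing
    fallback_mid = (out["bid_price_1"] + out["ask_price_1"]) / 2
    out["mid"] = (out["mid_price"].fillna(fallback_mid)
                  if "mid_price" in out.columns else fallback_mid)
    out["spread"] = out["ask_price_1"] - out["bid_price_1"]
    out["relative_spread"] = out["spread"] / out["mid"]
    out["top_level_imbalance"] = bid_v / denom
    out["microprice"] = (out["ask_price_1"] * bid_v + out["bid_price_1"] * ask_v) / denom
    out["log_mid"] = np.log(out["mid"].where(out["mid"] > 0))
    grouped = out.groupby("product", sort=False)
    out["mid_return"] = grouped["mid"].diff()      # diffs stay inside one product
    out["log_return"] = grouped["log_mid"].diff()
    return out

def wall_mid(df, levels=(1, 2, 3)):
    """Midpoint of the largest-volume bid level and largest-volume ask level."""
    def wall_price(side):
        prices = df[[f"{side}_price_{k}" for k in levels]].to_numpy(dtype=float)
        vols = np.abs(df[[f"{side}_volume_{k}" for k in levels]].to_numpy(dtype=float))
        vols = np.where(np.isnan(vols), -np.inf, vols)  # missing levels never win
        return prices[np.arange(len(df)), vols.argmax(axis=1)]
    return pd.Series((wall_price("bid") + wall_price("ask")) / 2, index=df.index)

nan = np.nan
book = pd.DataFrame({  # toy rows, not real Round 5 data
    "product": ["PANEL_1X2", "PANEL_1X2"], "global_time": [0, 100],
    "bid_price_1": [99.0, 100.0], "bid_volume_1": [10, 30],
    "bid_price_2": [98.0, nan], "bid_volume_2": [30, nan],
    "bid_price_3": [nan, nan], "bid_volume_3": [nan, nan],
    "ask_price_1": [101.0, 102.0], "ask_volume_1": [10, 10],
    "ask_price_2": [103.0, nan], "ask_volume_2": [25, nan],
    "ask_price_3": [nan, nan], "ask_volume_3": [nan, nan],
    "mid_price": [100.0, 101.0],
})
features = add_book_features(book)
features["wall_mid"] = wall_mid(book)
```

If a whole side of the book is missing, the selected wall price is NaN and so is `wall_mid`, which matches the "return NaN" rule above.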

Create summary tables:

A. outputs/product_summary.csv

For each product:

mean_mid
std_mid
mean_spread
median_spread
mean_relative_spread
return_std
number_of_timestamps
missing_mid_count
missing_spread_count

B. outputs/group_summary.csv

For each group:

number of products
average spread
average relative spread
average return volatility
average pairwise return correlation
average pairwise level correlation

Step 3: Group factor and residual research

For each group of 5 products:

Create a timestamp-aligned matrix of mid prices:

rows = timestamps

columns = products
Create normalized price plots:

A. normalized by first value:

indexed_price_i,t = 100 * mid_i,t / mid_i,0

B. log-normalized:

log_mid_i,t - log_mid_i,0
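
The two normalizations above could be computed from a wide price matrix roughly as follows. This is a sketch under assumptions: the input is a long frame with `product`, `global_time`, and `mid` columns, and `normalize_group` is a hypothetical helper name. The first available value of each product is used as `mid_i,0`.

```python
import numpy as np
import pandas as pd

def normalize_group(features, products):
    """Pivot to rows = global_time, columns = products, then normalize."""
    subset = features[features["product"].isin(products)]
    wide = subset.pivot_table(index="global_time", columns="product", values="mid")
    first = wide.bfill().iloc[0]           # first non-missing mid per product
    indexed = 100 * wide / first           # indexed_price_i,t
    log_norm = np.log(wide) - np.log(first)  # log_mid_i,t - log_mid_i,0
    return wide, indexed, log_norm

# Toy input (made-up prices) for two PEBBLES products.
toy = pd.DataFrame({
    "product": ["PEBBLES_S", "PEBBLES_S", "PEBBLES_XS", "PEBBLES_XS"],
    "global_time": [0, 100, 0, 100],
    "mid": [200.0, 210.0, 50.0, 55.0],
})
wide, indexed, log_norm = normalize_group(toy, ["PEBBLES_S", "PEBBLES_XS"])
```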
Compute group common factors:

A. mean_log_factor:

average of the five log_mid series

B. median_log_factor:

median of the five log_mid series

C. mean_indexed_factor:

average of indexed prices

D. optional PCA factor:

if sklearn is available, compute first principal component of standardized log prices
For each product in each group, fit a simple OLS residual model:

log_mid_i,t = alpha_i + beta_i * group_factor_t + residual_i,t

Do this separately for:
- mean_log_factor
- median_log_factor
- PCA factor if available
Store for each product:
- alpha
- beta
- fitted_fair_value
- residual
- residual_zscore = (residual - residual_mean) / residual_std
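
As a rough sketch of the residual model above, using the mean_log_factor only: the function name `factor_residuals` and the synthetic prices are assumptions for this example, and `np.polyfit` with degree 1 stands in for a full OLS routine.

```python
import numpy as np
import pandas as pd

def factor_residuals(log_mid):
    """log_mid: rows = global_time, columns = the products of one group."""
    # Row-wise mean; pandas skips NaN, so gaps in one product do not blank the factor.
    factor = log_mid.mean(axis=1).rename("mean_log_factor")
    frames = []
    for product in log_mid.columns:
        pair = pd.concat([factor, log_mid[product].rename("log_mid")], axis=1).dropna()
        # polyfit returns the highest degree first: slope (beta), then intercept (alpha).
        beta, alpha = np.polyfit(pair["mean_log_factor"], pair["log_mid"], deg=1)
        fitted = alpha + beta * pair["mean_log_factor"]
        resid = pair["log_mid"] - fitted
        frames.append(pd.DataFrame({
            "product": product, "alpha": alpha, "beta": beta,
            "fitted_fair_value": fitted, "residual": resid,
            "residual_zscore": (resid - resid.mean()) / resid.std(),
        }))
    return factor, pd.concat(frames).reset_index()

# Synthetic group: five series sharing one random-walk component (not real data).
rng = np.random.default_rng(0)
n = 500
common = np.cumsum(rng.normal(0, 1e-3, n))
index = pd.RangeIndex(0, n * 100, 100, name="global_time")
log_mid = pd.DataFrame(
    {f"P{i}": np.log(100 * (i + 1)) + (0.5 + 0.25 * i) * common
              + rng.normal(0, 1e-4, n) for i in range(5)},
    index=index,
)
factor, residuals = factor_residuals(log_mid)
per_product = residuals.groupby("product")
```

Because the factor here is the mean of the five series being regressed on it, each product's own price is part of its regressor; that is worth keeping in mind when reading betas and residual sizes.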
Compute residual diagnostics:

For each product residual:
- residual_mean
- residual_std
- residual_median
- residual_iqr
- residual_min
- residual_max
- zero_crossing_count
- zero_crossing_count_per_day
- ACF of residual for lags:
  
  [1, 2, 5, 10, 20, 50, 100]
- ACF of residual changes for same lags
- ACF of mid returns for same lags
- regression:
  
  residual_{t+h} - residual_t = a + gamma_h * residual_t + error
  
  for h in:
  
  [1, 5, 10, 20, 50, 100]
- AR(1):
  
  residual_{t+1} = c + phi * residual_t + error
  
  if 0 < phi < 1:
  
  half_life = -log(2) / log(phi)
  
  else:
  
  half_life = NaN
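
The diagnostics above could be sketched with plain numpy as follows. The helper names are hypothetical, missing values are simply dropped (which ignores gaps in time), and the AR(1) series is simulated only to exercise the functions.

```python
import numpy as np

LAGS = [1, 2, 5, 10, 20, 50, 100]

def _clean(x):
    x = np.asarray(x, dtype=float)
    return x[~np.isnan(x)]  # assumption: gaps are dropped, not interpolated

def acf(x, lags=LAGS):
    """Sample autocorrelation at the requested lags."""
    x = _clean(x)
    x = x - x.mean()
    denom = float(np.dot(x, x))
    return {k: float(np.dot(x[:-k], x[k:]) / denom) for k in lags if k < len(x)}

def reversion_gamma(x, h):
    """Slope gamma_h in: residual_{t+h} - residual_t = a + gamma_h * residual_t."""
    x = _clean(x)
    gamma, _intercept = np.polyfit(x[:-h], x[h:] - x[:-h], deg=1)
    return float(gamma)

def ar1_half_life(x):
    """AR(1) coefficient phi and half_life = -log(2) / log(phi) when 0 < phi < 1."""
    x = _clean(x)
    phi, _c = np.polyfit(x[:-1], x[1:], deg=1)
    half_life = -np.log(2) / np.log(phi) if 0 < phi < 1 else np.nan
    return float(phi), float(half_life)

def zero_crossings(x):
    """Number of sign changes, ignoring exact zeros."""
    s = np.sign(_clean(x))
    s = s[s != 0]
    return int(np.sum(s[:-1] != s[1:]))

# Simulated mean-reverting residual with true phi = 0.9 (illustration only).
rng = np.random.default_rng(1)
eps = rng.normal(size=20_000)
resid = np.zeros_like(eps)
for t in range(1, len(eps)):
    resid[t] = 0.9 * resid[t - 1] + eps[t]

phi, half_life = ar1_half_life(resid)
gamma_1 = reversion_gamma(resid, 1)
resid_acf = acf(resid)
change_acf = acf(np.diff(resid))  # ACF of residual changes
```

For h = 1 the regression slope equals phi - 1, so a more negative gamma and a shorter half-life describe the same faster reversion.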
Compute pairwise correlations:

For each group:
- correlation matrix of normalized log price levels
- correlation matrix of log returns
- correlation matrix of residuals
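
Each of the three matrices can come from `DataFrame.corr()` on the matching wide frame (levels, returns, or residuals). A small sketch, with `avg_pairwise_corr` as a hypothetical helper for the "average pairwise" numbers in the group summary:

```python
import numpy as np
import pandas as pd

def avg_pairwise_corr(wide):
    """Correlation matrix plus the mean of its off-diagonal entries."""
    corr = wide.corr()
    off_diagonal = ~np.eye(len(corr), dtype=bool)
    return corr, float(np.nanmean(corr.to_numpy()[off_diagonal]))

# Toy columns: b moves exactly with a, c moves exactly against both.
x = np.arange(10, dtype=float)
toy_returns = pd.DataFrame({"a": x, "b": 2 * x + 1, "c": -x})
corr_matrix, avg_corr = avg_pairwise_corr(toy_returns)
```

The same call works unchanged on a wide frame of log returns or residuals with one column per product.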
Save outputs:
- outputs/group_factor_series.csv
- outputs/product_residuals.csv
- outputs/residual_diagnostics.csv
- outputs/group_correlation_summary.csv

Visualizations:

Create a figures/ directory and save the following.

For each group:

figures/normalized_prices_{group}.png
- plot all 5 products normalized to 100 at the start
- one continuous plot across all 3 days
- mark day boundaries
figures/log_normalized_prices_{group}.png
- plot log_mid_i,t - log_mid_i,0 for all 5 products
- mark day boundaries
figures/spreads_{group}.png
- plot bid-ask spread over time for all 5 products
- include median spread line if useful
figures/relative_spreads_{group}.png
- plot spread / mid over time for all 5 products
figures/return_correlation_heatmap_{group}.png
- heatmap of log return correlations
figures/level_correlation_heatmap_{group}.png
- heatmap of normalized log price level correlations
figures/residuals_{group}.png
- plot residual time series for all 5 products
- one subplot per product
- include horizontal line at 0
- include ±1 residual standard deviation bands
figures/residual_zscores_{group}.png
- plot residual z-score for all 5 products
- include horizontal lines at -2, -1, 0, 1, 2
figures/residual_acf_{group}.png
- ACF bar plots for residuals of each product
- lags: [1, 2, 5, 10, 20, 50, 100]
figures/return_acf_{group}.png
- ACF bar plots for mid/log returns of each product
- lags: [1, 2, 5, 10, 20, 50, 100]
figures/residual_future_change_scatter_{group}.png
- scatter plot of residual_t vs residual_{t+h} - residual_t
- use h = 10 and h = 50
- one subplot per product
- show fitted regression slope gamma_h
figures/orderbook_dashboard_{group}.html
- Plotly interactive dashboard
- dropdown to select product
- top panel:
  
  bid_price_1, ask_price_1, mid, fitted fair value
- second panel:
  
  mid - fitted fair value or log residual
- third panel:
  
  bid-ask spread
- fourth panel:
  
  trades overlaid as markers using price and quantity
- do not show trader identity
- make it easy to visually inspect whether the product mean-reverts around the fitted fair value
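
As an illustration of the first figure in the list above, a normalized-price plot with day boundaries might look like the sketch below. It assumes a wide frame indexed by `global_time` as defined in Step 1; `plot_normalized` and the toy values are made up for this example, and the non-interactive `Agg` backend is chosen only so the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

def plot_normalized(indexed, group):
    """Plot indexed prices for one group and mark where each new day starts."""
    fig, ax = plt.subplots(figsize=(12, 4))
    for product in indexed.columns:
        ax.plot(indexed.index, indexed[product], label=product, linewidth=0.8)
    # global_time = day * 1_000_000 + timestamp, so integer division recovers the day.
    days = sorted({int(t) // 1_000_000 for t in indexed.index})
    boundaries = [d * 1_000_000 for d in days[1:]]
    for b in boundaries:
        ax.axvline(b, color="grey", linestyle="--", linewidth=0.8)
    ax.set_title(f"Normalized prices: {group}")
    ax.set_xlabel("global_time")
    ax.set_ylabel("indexed price (start = 100)")
    ax.legend(fontsize=7)
    return fig, boundaries

# Toy indexed prices over three days (made-up values).
toy_indexed = pd.DataFrame(
    {"PANEL_1X2": [100.0, 101.0, 100.5, 102.0, 101.0],
     "PANEL_2X2": [100.0, 99.0, 99.5, 98.0, 99.0]},
    index=[2_000_000, 2_500_000, 3_000_000, 3_500_000, 4_000_000],
)
fig, boundaries = plot_normalized(toy_indexed, "PANEL")
labels = fig.axes[0].get_legend_handles_labels()[1]
```

Calling `fig.savefig("figures/normalized_prices_PANEL.png")` afterwards would write the file, provided the figures/ directory already exists.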

Final written summary:

At the end of the notebook, print a concise research summary with:

Which groups have the strongest common movement?
Which groups have the highest average pairwise return correlation?
Which groups have the cleanest residual structure?
Which products have the most negative residual mean-reversion gamma?
Which products have the strongest negative return ACF?
Which products have residuals that frequently cross zero?
Which products have spreads too wide to be useful?
Which groups/products should be prioritized for the next step of strategy research?

Important:
- Do not implement a trading strategy.
- Do not generate buy/sell signals.
- Do not do threshold optimization.
- Focus only on exploratory analysis and visualization.
- Use pandas, numpy, matplotlib, and plotly.
- Use sklearn only if available.
- Save all outputs under outputs/ and figures/.
- Create a Jupyter notebook called round5_relative_value_eda.ipynb.

Result

Products