Built and deployed a production PPO-based execution engine operating on live order book data across KRX, HKEX, and TWSE. In a 4-week institutional proof-of-concept it achieved a +6.87bp improvement over the VWAP baseline, and zero-shot cross-market transfer outperformed single-market models trained on a year of data by 1.5×.
A deep reinforcement learning system that learns intraday trade execution policies directly from raw L2/L3 order book data. The system spans the full lifecycle: a Rust-based market simulator reconstructing L3-level order flow, a Python RL agent with a structured 4-head observation encoder and CNN+LSTM policy, parallel PPO training across 16–128 workers, and TorchScript deployment into a Rust inference service for production use.
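As a minimal PyTorch sketch of what such a policy could look like: the head contents, feature dimensions, and class names (`ObsEncoder`, `Policy`) are illustrative assumptions, not the production architecture.

```python
import torch
import torch.nn as nn

class ObsEncoder(nn.Module):
    """Hypothetical 4-head encoder: book ladder (CNN), recent trades,
    agent state (inventory / time remaining), and session features."""
    def __init__(self, book_levels=10, hidden=128):
        super().__init__()
        # Head 1: price/size ladder as a 2-channel signal over book levels
        self.book_cnn = nn.Sequential(
            nn.Conv1d(2, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Flatten(), nn.Linear(32 * book_levels, hidden), nn.ReLU(),
        )
        # Heads 2-4: MLPs over flat feature vectors (dims are placeholders)
        self.trades = nn.Sequential(nn.Linear(16, 64), nn.ReLU())
        self.agent = nn.Sequential(nn.Linear(4, 32), nn.ReLU())
        self.session = nn.Sequential(nn.Linear(8, 32), nn.ReLU())
        self.out_dim = hidden + 64 + 32 + 32

    def forward(self, book, trades, agent, session):
        z = [self.book_cnn(book), self.trades(trades),
             self.agent(agent), self.session(session)]
        return torch.cat(z, dim=-1)

class Policy(nn.Module):
    """CNN encoder feeding an LSTM; actor-critic heads for PPO."""
    def __init__(self, n_actions, hidden=256):
        super().__init__()
        self.enc = ObsEncoder()
        self.lstm = nn.LSTM(self.enc.out_dim, hidden, batch_first=True)
        self.pi = nn.Linear(hidden, n_actions)   # action logits
        self.v = nn.Linear(hidden, 1)            # state-value estimate

    def forward(self, obs_seq, hc=None):
        # obs_seq: (book, trades, agent, session), each shaped (B, T, ...)
        B, T = obs_seq[1].shape[:2]
        book = obs_seq[0].flatten(0, 1)               # (B*T, 2, levels)
        flat = [x.flatten(0, 1) for x in obs_seq[1:]]  # (B*T, d) each
        z = self.enc(book, *flat).view(B, T, -1)
        h, hc = self.lstm(z, hc)
        return self.pi(h), self.v(h), hc

# Example shapes: Policy(n_actions=11)((torch.randn(2, 5, 2, 10),
#     torch.randn(2, 5, 16), torch.randn(2, 5, 4), torch.randn(2, 5, 8)))
```

Flattening the batch and time dimensions before the encoder, then restoring them for the LSTM, keeps the convolutional heads stateless while the recurrent core carries intraday memory across steps.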
The core technical challenge: learning to execute large orders optimally when feedback is delayed (you only know if your order was good after it fills), rewards are sparse, and market microstructure differs fundamentally across exchanges.
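To make the sparse, delayed-reward point concrete, here is a sketch of one common choice: a terminal implementation-shortfall reward, known only once the episode's fills are complete. The project's actual reward function is not described here, so treat `episode_reward` purely as illustration.

```python
def episode_reward(fills, arrival_price, side="buy"):
    """Illustrative sparse reward: implementation shortfall in basis points,
    computed only at episode end. `fills` is a list of (price, qty) tuples.
    This is an assumed design, not the project's documented reward."""
    qty = sum(q for _, q in fills)
    if qty == 0:
        return 0.0
    avg_px = sum(p * q for p, q in fills) / qty
    sign = 1.0 if side == "buy" else -1.0
    # Positive when we beat the arrival price (bought cheaper / sold dearer).
    return sign * (arrival_price - avg_px) / arrival_price * 1e4
```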
Tech stack: Python · PyTorch · Rust · PPO · CNN+LSTM · TorchScript · TCP/UDP market feeds
https://sh1319.github.io/diagrams/ppo_architecture_diagram.html
Raw L2 order book data is collected via TCP/UDP feeds and reconstructed to approximate L3 granularity by a Rust-based simulator, which infers individual orders (synthetic order IDs, queue positions, fills) from aggregate book snapshots. This enables realistic simulation of order placement, queue priority, and partial fills.
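A rough Python sketch of the queue-inference idea (the production simulator is in Rust, and the exact attribution rules are not documented here; `QueueTracker` and its pro-rata cancel rule are illustrative assumptions):

```python
class QueueTracker:
    """Estimate how much displayed size sits ahead of a synthetic order at
    one price level, using L2 size deltas and trade prints. A sketch only;
    real reconstruction must also handle level churn and feed gaps."""
    def __init__(self, displayed_size):
        self.ahead = float(displayed_size)  # size queued ahead when we join the back
        self.filled = 0.0

    def on_trade(self, traded_qty, our_remaining):
        # Trades consume the queue from the front: first the size ahead of
        # us, then our own order.
        eats_ahead = min(traded_qty, self.ahead)
        self.ahead -= eats_ahead
        self.filled += min(traded_qty - eats_ahead, our_remaining)

    def on_cancel(self, canceled_qty, level_size):
        # L2 data does not say *which* order canceled; splitting the cancel
        # pro-rata between size ahead of and behind us is one common
        # assumption (not any exchange's documented behavior).
        if level_size > 0:
            self.ahead -= canceled_qty * (self.ahead / level_size)
```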
Why this matters: Most academic RL-for-trading work operates on simplified or synthetic data. Working with reconstructed L3 data means the agent learns from realistic queue dynamics and fill probabilities, which is critical for policies that transfer to production.
The execution environment is implemented in Rust for performance, with a Python interface for the RL agent. At each step (5-second intervals):