We are seeking a highly skilled NPU Runtime Engineer specializing in LLM Serving to design, optimize, and deploy large language models (LLMs) for efficient inference in production environments. This role involves working with cutting-edge AI serving frameworks, optimizing NPU-based inference performance, and integrating LLMs with scalable distributed systems.
Responsibilities and Opportunities
- Design, implement, and optimize LLM inference pipelines for low-latency, high-throughput serving
- Develop and extend vLLM to enhance inference performance on NPUs, including support for continuous batching and PagedAttention (a minimal vLLM serving sketch follows this list)
- Implement custom vLLM extensions to improve memory management, parallelism, and dynamic batching strategies
- Work with torch.compile and RBLN compiler toolchains to accelerate model execution on NPUs (see the custom-backend sketch after this list)
- Optimize graph transformations, operator fusion, and execution efficiency for LLM inference workloads
- Collaborate with ML engineers and infrastructure teams to deploy and scale LLM services
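
To give candidates a concrete reference point for the vLLM items above, here is a minimal offline-inference sketch using vLLM's public LLM and SamplingParams API. The engine applies continuous batching and PagedAttention internally; the model name and parameters are placeholders, not a prescribed setup.

```python
from vllm import LLM, SamplingParams

# Placeholder checkpoint; any model supported by the target backend works.
llm = LLM(model="facebook/opt-125m")
sampling = SamplingParams(temperature=0.8, max_tokens=64)

# vLLM schedules these requests with continuous batching and stores their
# KV caches in fixed-size blocks managed by PagedAttention.
prompts = [
    "Explain continuous batching in one sentence.",
    "Why does paging improve KV-cache memory utilization?",
]
for output in llm.generate(prompts, sampling):
    print(output.outputs[0].text)
```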
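
For the torch.compile responsibility, the generic extension point is a custom backend: a callable that receives the captured torch.fx.GraphModule plus example inputs and returns a compiled callable. The sketch below only prints the graph and falls back to eager execution; where an NPU toolchain such as RBLN would plug in its lowering and codegen is an assumption here, since the vendor-specific API is not shown.

```python
import torch

def npu_backend(gm: torch.fx.GraphModule, example_inputs):
    # Inspection / rewrite point: operator fusion, layout transforms, and
    # NPU code generation would run here in a real compiler backend.
    gm.graph.print_tabular()
    return gm.forward  # fall back to eager execution of the captured graph

@torch.compile(backend=npu_backend)
def layer(x, w):
    return torch.relu(x @ w)

layer(torch.randn(4, 8), torch.randn(8, 8))
```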
Key Qualifications
- Strong proficiency in Python and deep learning frameworks (PyTorch, TensorFlow)
- Deep understanding of LLM architectures, including Transformer-based models and inference optimization techniques
- Hands-on experience with LLM serving frameworks (e.g., vLLM, TensorRT-LLM)
- Solid understanding of model optimization techniques (tensor parallelism, KV cache optimizations, and memory-efficient execution); a toy paged KV-cache sketch follows this list
- Familiarity with hardware acceleration (GPUs, NPUs, TPUs) and efficient memory management techniques
- Strong debugging and performance profiling skills for high-throughput inference environments
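
To make the KV-cache bullet concrete, the toy sketch below illustrates the block-table idea behind paged KV caches: logical token positions map to fixed-size physical blocks allocated on demand, so cache memory grows block by block instead of being reserved up front for the maximum sequence length. All names and sizes are illustrative, not taken from any particular runtime.

```python
import torch

BLOCK_SIZE = 16   # tokens per physical KV block (illustrative)
NUM_BLOCKS = 256  # size of the shared physical block pool
HEAD_DIM = 64

# Physical pool: one tensor of KV blocks shared by all sequences.
kv_pool = torch.zeros(NUM_BLOCKS, BLOCK_SIZE, HEAD_DIM)
free_blocks = list(range(NUM_BLOCKS))

class SequenceCache:
    """Maps a sequence's logical token positions to physical blocks."""

    def __init__(self):
        self.block_table = []  # logical block index -> physical block id
        self.length = 0

    def append(self, kv_vector):
        if self.length % BLOCK_SIZE == 0:
            # Allocate a new physical block only when the last one is full.
            self.block_table.append(free_blocks.pop())
        block = self.block_table[self.length // BLOCK_SIZE]
        kv_pool[block, self.length % BLOCK_SIZE] = kv_vector
        self.length += 1

seq = SequenceCache()
for _ in range(20):  # 20 tokens occupy 2 blocks, not a max-length slab
    seq.append(torch.randn(HEAD_DIM))
print(seq.block_table)
```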
Ideal Qualifications
- Experience with compilers and runtime optimizations
- C++ experience, especially in performance-critical runtime code
- Understanding of torch.compile and graph optimizations (see the graph-rewrite sketch below)
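
As an example of the graph-level work referenced above, here is a small torch.fx rewrite that fuses an add followed by a relu into one call. fused_add_relu is a stand-in defined in the snippet (a real backend would map it to a single NPU kernel); nothing here reflects a specific toolchain.

```python
import operator
import torch
import torch.fx as fx

def fused_add_relu(a, b):
    # Stand-in for a fused kernel; a real backend would lower this to one op.
    return torch.relu(a + b)

def fuse_add_relu(gm: fx.GraphModule) -> fx.GraphModule:
    # Collect add->relu pairs first so the graph is not mutated mid-iteration.
    matches = []
    for node in gm.graph.nodes:
        if node.op == "call_function" and node.target in (operator.add, torch.add):
            users = list(node.users)
            if (len(users) == 1 and users[0].op == "call_function"
                    and users[0].target is torch.relu):
                matches.append((node, users[0]))
    for add_node, relu_node in matches:
        with gm.graph.inserting_after(relu_node):
            fused = gm.graph.call_function(fused_add_relu, add_node.args)
        relu_node.replace_all_uses_with(fused)
        gm.graph.erase_node(relu_node)
        gm.graph.erase_node(add_node)
    gm.graph.lint()
    gm.recompile()
    return gm

gm = fx.symbolic_trace(lambda a, b: torch.relu(a + b))
fuse_add_relu(gm)
print(gm.code)  # now calls fused_add_relu instead of separate add and relu
```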