We are seeking a highly skilled NPU Runtime Engineer specializing in LLM Serving to design, optimize, and deploy large language models (LLMs) for efficient inference in production environments. This role involves working with cutting-edge AI serving frameworks, optimizing NPU-based inference performance, and integrating LLMs with scalable distributed systems.
Responsibilities and Opportunities
- Design, implement, and optimize LLM inference pipelines for low-latency, high-throughput serving
- Develop and extend vLLM to enhance inference performance on NPUs, including support for continuous batching and PagedAttention (a minimal vLLM serving sketch follows this list)
- Implement custom vLLM extensions to improve memory management, parallelism, and dynamic batching strategies
- Work with torch.compile and RBLN compiler toolchains to accelerate model execution on NPUs (see the custom-backend sketch after this list)
- Optimize graph transformations, operator fusion, and execution efficiency for LLM inference workloads
- Collaborate with ML engineers and infrastructure teams to deploy and scale LLM services
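
To give candidates a concrete reference point for the vLLM items above, here is a minimal offline-inference sketch using vLLM's public LLM and SamplingParams API. The engine applies continuous batching and PagedAttention internally; the model name and parameters are placeholders, not a prescribed setup.

```python
from vllm import LLM, SamplingParams

# Placeholder checkpoint; any model supported by the target backend works.
llm = LLM(model="facebook/opt-125m")
sampling = SamplingParams(temperature=0.8, max_tokens=64)

# vLLM schedules these requests with continuous batching and stores their
# KV caches in fixed-size blocks managed by PagedAttention.
prompts = [
    "Explain continuous batching in one sentence.",
    "Why does paging improve KV-cache memory utilization?",
]
for output in llm.generate(prompts, sampling):
    print(output.outputs[0].text)
```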
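
For the torch.compile responsibility, the generic extension point is a custom backend: a callable that receives the captured torch.fx.GraphModule plus example inputs and returns a compiled callable. The sketch below only prints the graph and falls back to eager execution; where an NPU toolchain such as RBLN would plug in its lowering and codegen is an assumption here, since the vendor-specific API is not shown.

```python
import torch

def npu_backend(gm: torch.fx.GraphModule, example_inputs):
    # Inspection / rewrite point: operator fusion, layout transforms, and
    # NPU code generation would run here in a real compiler backend.
    gm.graph.print_tabular()
    return gm.forward  # fall back to eager execution of the captured graph

@torch.compile(backend=npu_backend)
def layer(x, w):
    return torch.relu(x @ w)

layer(torch.randn(4, 8), torch.randn(8, 8))
```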
Key Qualifications
- Strong proficiency in Python and deep learning frameworks (PyTorch, TensorFlow)
- Deep understanding of LLM architectures, including Transformer-based models and inference optimization techniques
- Hands-on experience with LLM serving frameworks (e.g., vLLM, TensorRT-LLM)
- Solid understanding of model optimization techniques (tensor parallelism, KV cache optimizations, and memory-efficient execution); a toy paged KV-cache sketch follows this list
- Familiarity with hardware acceleration (GPUs, NPUs, TPUs) and efficient memory management techniques
- Strong debugging and performance profiling skills for high-throughput inference environments
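
To make the KV-cache bullet concrete, the toy sketch below illustrates the block-table idea behind paged KV caches: logical token positions map to fixed-size physical blocks allocated on demand, so cache memory grows block by block instead of being reserved up front for the maximum sequence length. All names and sizes are illustrative, not taken from any particular runtime.

```python
import torch

BLOCK_SIZE = 16   # tokens per physical KV block (illustrative)
NUM_BLOCKS = 256  # size of the shared physical block pool
HEAD_DIM = 64

# Physical pool: one tensor of KV blocks shared by all sequences.
kv_pool = torch.zeros(NUM_BLOCKS, BLOCK_SIZE, HEAD_DIM)
free_blocks = list(range(NUM_BLOCKS))

class SequenceCache:
    """Maps a sequence's logical token positions to physical blocks."""

    def __init__(self):
        self.block_table = []  # logical block index -> physical block id
        self.length = 0

    def append(self, kv_vector):
        if self.length % BLOCK_SIZE == 0:
            # Allocate a new physical block only when the last one is full.
            self.block_table.append(free_blocks.pop())
        block = self.block_table[self.length // BLOCK_SIZE]
        kv_pool[block, self.length % BLOCK_SIZE] = kv_vector
        self.length += 1

seq = SequenceCache()
for _ in range(20):  # 20 tokens occupy 2 blocks, not a max-length slab
    seq.append(torch.randn(HEAD_DIM))
print(seq.block_table)
```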
Ideal Qualifications
- Experience with compilers and runtime optimizations
- C++ experience, especially in performance-critical runtime code
- Understanding of torch.compile and graph optimizations (see the graph-rewrite sketch below)
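
As an example of the graph-level work referenced above, here is a small torch.fx rewrite that fuses an add followed by a relu into one call. fused_add_relu is a stand-in defined in the snippet (a real backend would map it to a single NPU kernel); nothing here reflects a specific toolchain.

```python
import operator
import torch
import torch.fx as fx

def fused_add_relu(a, b):
    # Stand-in for a fused kernel; a real backend would lower this to one op.
    return torch.relu(a + b)

def fuse_add_relu(gm: fx.GraphModule) -> fx.GraphModule:
    # Collect add->relu pairs first so the graph is not mutated mid-iteration.
    matches = []
    for node in gm.graph.nodes:
        if node.op == "call_function" and node.target in (operator.add, torch.add):
            users = list(node.users)
            if (len(users) == 1 and users[0].op == "call_function"
                    and users[0].target is torch.relu):
                matches.append((node, users[0]))
    for add_node, relu_node in matches:
        with gm.graph.inserting_after(relu_node):
            fused = gm.graph.call_function(fused_add_relu, add_node.args)
        relu_node.replace_all_uses_with(fused)
        gm.graph.erase_node(relu_node)
        gm.graph.erase_node(add_node)
    gm.graph.lint()
    gm.recompile()
    return gm

gm = fx.symbolic_trace(lambda a, b: torch.relu(a + b))
fuse_add_relu(gm)
print(gm.code)  # now calls fused_add_relu instead of separate add and relu
```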