Research Lab: UCLA Trustworthy AI Labs

Researcher: @HOCHAN SON (UCLA-CLASS)

Date: 2025-08-25

Abstract

This study was prepared for members of the Trustworthy AI Lab evaluating AI hardware choices under a defined budget limit. It presents a comprehensive infrastructure analysis for deploying and training 120-billion-parameter large language models (LLMs) within a $150-200K budget constraint. We examine hardware configurations, ownership versus cloud rental economics, and implementation timelines suitable for academic research environments. Our analysis reveals that memory capacity, rather than computational throughput, represents the primary bottleneck for 120B model deployment. We identify unified memory architectures as cost-effective solutions for inference, while distributed GPU configurations with high-speed interconnects remain essential for training. Single enterprise GPU configurations (H100/H200/B200) emerge as practical intermediate solutions for academic institutions requiring capabilities between consumer hardware and full enterprise clusters. The study provides quantitative ROI analysis demonstrating break-even thresholds and offers practical recommendations for academic institutions pursuing large-scale LLM research.

Keywords: Large Language Models, GPU Infrastructure, Academic Computing, Cost-Benefit Analysis, High-Performance Computing

1. Introduction

Large language models with 120 billion parameters represent a significant computational challenge for academic institutions. Unlike smaller models that can operate on consumer hardware, these systems require specialized infrastructure that traditionally existed only in industrial research laboratories. The GPT-OSS-120B model exemplifies this challenge, requiring approximately 240GB of memory for FP16 inference, with a practical minimum of 80GB for aggressive quantization, and substantial computational resources for training.

Academic research groups face unique constraints compared to industry: limited budgets, irregular workloads, and the need for both experimentation flexibility and potential commercial application. This study addresses these constraints by analyzing infrastructure options that balance cost, performance, and scalability within typical academic budget ranges.

The research question guiding this analysis is: What infrastructure configurations provide optimal cost-performance for 120B parameter LLM deployment in academic research environments? We approach this through systematic evaluation of hardware options, economic modeling of ownership versus rental scenarios, and practical implementation planning.
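The ownership-versus-rental comparison can be reduced to a simple break-even calculation. The sketch below illustrates the form of that model; all dollar figures are hypothetical placeholders chosen for illustration, not vendor quotes or results from this study.

```python
# Break-even analysis: purchasing a GPU server vs. renting cloud GPUs.
# All dollar figures below are illustrative assumptions, not quotes.

def break_even_hours(purchase_cost: float,
                     cloud_rate_per_hour: float,
                     owned_opex_per_hour: float = 0.0) -> float:
    """Hours of utilization at which total ownership cost equals
    cumulative cloud rental cost."""
    return purchase_cost / (cloud_rate_per_hour - owned_opex_per_hour)

# Hypothetical: $180K server vs. $10/hr cloud rate, $1/hr power/maintenance.
hours = break_even_hours(180_000, 10.0, 1.0)
print(f"Break-even after ~{hours:.0f} GPU-hours "
      f"(~{hours / 24 / 30:.0f} months of continuous use)")
```

Because academic workloads are irregular, the relevant comparison is break-even hours against expected utilization over the hardware's useful life, which is why utilization assumptions dominate the ROI analysis presented later.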

2. Methodology and Infrastructure Requirements

2.1 Model Requirements Analysis

The 120B parameter GPT-OSS model presents specific memory and computational demands that drive infrastructure decisions. At FP16 (16-bit floating point) precision, the model requires approximately 240GB of memory, with a practical minimum of 80GB for heavily quantized versions (Q4), while advanced MXFP4 quantization achieves ~60-80GB with superior quality retention compared to traditional methods. These requirements immediately eliminate many consumer-grade options and necessitate either specialized hardware or multi-GPU configurations.
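The memory figures above follow directly from parameter count times bytes per parameter. A minimal sketch of that estimate is below; the bytes-per-parameter values are nominal, and real deployments need additional headroom for the KV cache, activations, and framework buffers (which is why the practical quantized minimum quoted above is 80GB rather than the raw 60GB).

```python
# Rough weight-memory estimate for an N-parameter model at a given precision.
# Bytes-per-parameter values are nominal; quantized formats also carry
# per-block scale metadata that adds a small overhead.

BYTES_PER_PARAM = {
    "fp16": 2.0,   # 16-bit floating point
    "q4": 0.5,     # 4-bit integer quantization
    "mxfp4": 0.5,  # 4-bit microscaling floating point
}

def model_memory_gb(n_params: float, precision: str) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[precision] / 1e9

for p in ("fp16", "q4"):
    print(f"{p}: ~{model_memory_gb(120e9, p):.0f} GB")
```

Running this for 120B parameters yields ~240 GB at FP16 and ~60 GB at 4-bit, matching the requirements stated above before runtime overhead is added.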

2.2 Hardware Evaluation Framework

We evaluated hardware options across four dimensions: