When Should I Use Fits on Chips?
This article describes when to use the Fits on Chips toolkit, illustrated with specific use cases.
Mar 10, 2025
Introduction
Large Language Models (LLMs) have seen remarkable advancements in recent years. While proprietary models such as OpenAI's ChatGPT and Google's Gemini have dominated the landscape as closed-source API services, the emergence of open-weight large reasoning models like DeepSeek-R1 has significantly shifted the academic and industrial paradigm. Because open-weight reasoning models now perform comparably to their proprietary counterparts, LLM service providers have good reason to consider self-hosting models rather than relying solely on API services. Consequently, self-hosting large language models has regained substantial interest, particularly among organizations aiming to optimize performance, reduce operational costs, and maintain greater control over their AI deployments.
However, self-hosting large language models presents its own challenges, including the need for efficient inference, model optimization, and hardware selection. Without careful tuning, LLM inference can be computationally intensive, leading to high latency and increased operational costs. One key solution is model compression, through techniques such as quantization, pruning, and distillation. Quantization, in particular, reduces model size and computational overhead, enabling the deployment of large-scale models on edge devices or lower-cost hardware. Yet compression inevitably impacts output quality, so models must be thoroughly benchmarked and evaluated after optimization to ensure they maintain acceptable performance for real-world applications.
In a previous blog post (Fits on Chips: Saving LLM Costs Became Easier Than Ever), we introduced Fits on Chips, a powerful LLMOps research tool that helps users easily identify optimal parameter configurations for vLLM and TensorRT-LLM serving scenarios. In this article, we explore use cases demonstrating how Fits on Chips enables efficient LLM inference: tuning models in constrained environments, selecting the best GPU for different workloads, and benchmarking quantized models to assess the trade-offs between compression and output quality.
Case 1: Which framework should I use in a constrained environment?
In this experiment, we aim to identify the better-performing framework, vLLM or TensorRT-LLM, in terms of throughput under a constrained setting with a target Time-Per-Output-Token (TPOT) of ≤ 20 ms. To benchmark this scenario, we sampled 256 sentences from Dynamic-Sonnet-1K, with an average input length of 512 tokens, and fixed the number of output tokens to 128.
Framework Versions, Model, and Hardware
- Frameworks: vLLM (v0.7.2), TensorRT-LLM (v0.16.0)
- Model: Llama-3.1-8B-Instruct (BF16)
- H/W: 1 x NVIDIA A100 (80G), AMD EPYC 7713 64-Core processor, 1007.67 GB RAM

Using Fits on Chips, we systematically adjusted the maximum batch size (max_num_seqs in vLLM and max_batch_size in TensorRT-LLM), as shown in Figure 1, while keeping all other parameters at their default values, except for ignore_eos=True and use_v2_block_manager=True. A detailed explanation of how to use Fits on Chips is provided in the previous blog post and the user guide.
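For readers who want to approximate this setting outside of Fits on Chips, the sketch below shows how the same knobs map onto vLLM's offline Python API. This is not the Fits on Chips interface itself; parameter names follow vLLM v0.7.x and may differ in other versions, and the prompt is a placeholder.

```python
# Minimal sketch, not the Fits on Chips workflow: reproducing the swept
# setting with vLLM's offline Python API (names follow vLLM v0.7.x).
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    dtype="bfloat16",
    max_num_seqs=32,             # the swept "max batch size" parameter
    enable_prefix_caching=True,  # automatic prefix caching, as in the experiment
)

sampling = SamplingParams(
    max_tokens=128,   # fixed output length used in this benchmark
    ignore_eos=True,  # force generation to the full 128 tokens
)

# Placeholder prompt; the actual benchmark samples 256 sentences
# from Dynamic-Sonnet-1K.
outputs = llm.generate(["<prompt sampled from Dynamic-Sonnet-1K>"], sampling)
```

In TensorRT-LLM, the equivalent limit is the max_batch_size setting mentioned above.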

Figure 2 illustrates the throughput and TPOT results for both frameworks. Regardless of the framework and whether automatic prefix caching was enabled, the maximum batch size that satisfied the TPOT ≤ 20 ms constraint was 32. Automatic prefix caching further improved serving performance in both frameworks thanks to common prefix tokens shared across sentences, particularly the system prompts shared within the dataset. With a batch size of 32 and automatic prefix caching enabled, vLLM achieved a peak throughput of 1511.5 tokens/sec, while TensorRT-LLM reached 1723.1 tokens/sec, making TensorRT-LLM the more efficient choice in this constrained scenario.

Fits on Chips made it easy to modify parameters and compare results, allowing us to efficiently explore various configurations and find the optimal setting for LLM serving. As shown in Figure 3, Fits on Chips can also draw graphs with various parameters on the x-axis and benchmark results on the y-axis. This feature allows users to easily visualize experiment results and identify the best combination of parameters.
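For readers who prefer to export the raw benchmark numbers and chart them outside the tool, a rough matplotlib equivalent of this kind of plot might look like the following. The zero values below are empty placeholders to be filled with your own exported results, not measurements.

```python
# Rough matplotlib sketch of a Figure 3 style chart: one swept parameter on
# the x-axis, benchmark metrics on the y-axes.
import matplotlib.pyplot as plt

max_batch_sizes = [8, 16, 32, 64]   # swept parameter (x-axis)
throughput_tok_s = [0, 0, 0, 0]     # tokens/sec from your benchmark runs (placeholders)
tpot_ms = [0, 0, 0, 0]              # TPOT in ms from your benchmark runs (placeholders)

fig, ax1 = plt.subplots()
ax1.plot(max_batch_sizes, throughput_tok_s, marker="o", color="tab:blue")
ax1.set_xlabel("Max batch size")
ax1.set_ylabel("Throughput (tokens/sec)", color="tab:blue")

ax2 = ax1.twinx()
ax2.plot(max_batch_sizes, tpot_ms, marker="s", color="tab:red")
ax2.axhline(20, linestyle="--", color="gray")  # TPOT <= 20 ms constraint
ax2.set_ylabel("TPOT (ms)", color="tab:red")

plt.tight_layout()
plt.show()
```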
Case 2: Which device should I use, A100 or H100?
In this experiment, we leverage Fits on Chips' built-in cost estimation feature (see Figure 4), which lets users input a device cost ($/hour) and obtain an estimated cost per 1 million tokens during benchmarking. By applying this to two dataset types, prefill-heavy (e.g., summarization tasks) and decode-heavy (e.g., text generation tasks), we determine the more economical GPU choice for each scenario.
For the prefill-heavy case, we used 256 sentences sampled from the Dynamic-Sonnet-4K dataset, with an average input length of 2048 tokens and a fixed output length of 256 tokens. For the decode-heavy case, we used the same 256 sampled sentences as in Case 1, but increased the output token length to 4096.
Framework Version, Model, and Hardware
- Frameworks: TensorRT-LLM (v0.16.0)
- Model: Llama-3.1-8B-Instruct (BF16)
- H/W
- 1 x NVIDIA A100 SXM (80G), AMD EPYC 7713 64-Core processor, 1007.67 GB RAM, 1.89 $/hour
- 1 x NVIDIA H100 SXM (80G), Intel Xeon(R) Platinum 8470, 1007.67 GB RAM, 2.89 $/hour



Figure 6 shows the estimated cost per 1 million tokens for the two devices. In the decode-heavy scenario, the costs were similar, with the H100 slightly more cost-efficient than the A100. Although the H100's hourly rate is about 1.5x higher than the A100's, its performance advantage offsets the difference, leading to comparable estimated serving costs. In the prefill-heavy case, however, the performance gap between the devices widened, making the A100 about 1.59x more expensive than the H100.
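As a sanity check on figures like these, the underlying arithmetic is straightforward. The sketch below is a back-of-the-envelope approximation, not the exact formula Fits on Chips uses internally, and the throughput values are placeholders to be replaced with measured results.

```python
# Back-of-the-envelope cost estimate: USD per 1 million generated tokens,
# given an hourly device price and a sustained throughput. This approximates
# the idea behind the built-in estimator; the tool's exact formula may differ.

def cost_per_million_tokens(price_per_hour_usd: float,
                            throughput_tokens_per_sec: float) -> float:
    tokens_per_hour = throughput_tokens_per_sec * 3600
    return price_per_hour_usd / tokens_per_hour * 1_000_000

# Hourly prices from the setup above; replace the throughput placeholders
# with the values measured in your own benchmark runs.
a100_tok_s = 1.0  # placeholder, not a measured value
h100_tok_s = 1.0  # placeholder, not a measured value
print(cost_per_million_tokens(1.89, a100_tok_s))
print(cost_per_million_tokens(2.89, h100_tok_s))
```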
This experiment demonstrates the importance of choosing the right GPU and of accurately estimating token generation costs based on workload characteristics. Fits on Chips simplifies this decision-making process by automating benchmarking and cost estimation, enabling users to make informed GPU choices that optimize both performance and budget.
Case 3: Benchmark & Evaluation of quantized models
Quantization is a widely used technique for reducing the computational cost of LLM inference, but it comes with potential trade-offs in model accuracy and output quality. To systematically evaluate these effects, we leveraged Fits on Chips, which also provides built-in evaluation capabilities using popular datasets (see Figure 7). By using this feature, we tested Mistral-7B-Instruct-v0.3 and its quantized variants to assess both accuracy and throughput improvements. Except for ignore_eos=True and use_v2_block_manager=True, all other parameters were set to the default configurations provided by Fits on Chips.


Experiment Setup
- Model: Mistral-7B-Instruct-v0.3 and its quantized variants (GPTQ_W4A16, AWQ_W4A16, W8A8)
- Hardware: 1 × NVIDIA A100 (80G), AMD EPYC 7713 64-Core Processor, 1007.67 GB RAM
- Framework: vLLM (v0.7.2)
- Evaluation Datasets: ARC-Challenge (10-shot), TruthfulQA (MC2, 0-shot), PubMedQA (0-shot)
- Benchmark Dataset: Dynamic-Sonnet-1K with fixed 1K output token length
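Outside of Fits on Chips, such quantized variants can be served with the same vLLM entry point. The sketch below shows one way to do this for an AWQ checkpoint; the repository name is a hypothetical placeholder, and parameter names follow vLLM v0.7.x.

```python
# Minimal sketch of loading a quantized checkpoint with vLLM v0.7.x.
# The model id below is a hypothetical placeholder; vLLM typically infers the
# quantization scheme from the checkpoint config, and it can also be set
# explicitly (e.g., quantization="awq" or "gptq").
from vllm import LLM, SamplingParams

llm = LLM(
    model="your-org/Mistral-7B-Instruct-v0.3-AWQ-W4A16",  # hypothetical repo id
    quantization="awq",
)

sampling = SamplingParams(
    max_tokens=1024,  # fixed 1K output length used in the benchmark
    ignore_eos=True,
)

outputs = llm.generate(["<prompt sampled from Dynamic-Sonnet-1K>"], sampling)
```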

Figure 9 presents the benchmark and evaluation results for Mistral-7B-Instruct-v0.3 and its quantized variants in Fits on Chips. The evaluation showed that the quantized models exhibit only marginal variation in accuracy across all datasets. Note that all of these quality evaluations were performed automatically within the Fits on Chips platform.

Beyond automated evaluation, Fits on Chips also provides a playground feature that lets users manually input example text and inspect the generated responses. Figure 10 illustrates an output quality assessment for the AWQ-quantized model, confirming that it generates coherent and high-quality responses. Together, the evaluation and playground results validated the usability of the quantized models.
From a performance perspective, quantization yielded mixed results in the benchmark tests. The AWQ- and W8A8-quantized models achieved throughput improvements of approximately 1.3× and 1.28×, respectively, over the full-precision model. However, the GPTQ-quantized model did not outperform the baseline, highlighting that not all quantization methods lead to inference speedups under the same conditions. While a deeper analysis of these results is beyond the scope of this post, the findings underscore the need for case-by-case benchmarking when selecting a quantization method.
Ultimately, deploying quantized or otherwise compressed LLMs requires balancing evaluation accuracy, output quality, and actual performance gains on the target hardware. Fits on Chips significantly reduces the time and effort this process requires, providing an automated and structured approach to evaluating and benchmarking quantized models.
Conclusion
As LLM deployment continues to expand across various environments, optimizing inference performance has become a crucial challenge. Every decision, from the serving framework and hardware to the model and parameter configuration, directly impacts both cost-effectiveness and user experience. Given the complexity of these choices, data-driven benchmarking and evaluation are essential to ensure that models perform optimally under real-world constraints.
However, manually experimenting with different configurations, frameworks, and hardware can be time-consuming and resource-intensive. Fits on Chips simplifies this process, providing an automated and structured approach to LLM optimization. With built-in benchmarking, evaluation tools, and cost estimation features, it enables users to quickly test different setups, compare performance, and make informed decisions without the need for extensive manual tuning.
If you're looking to optimize LLM inference, reduce deployment costs, or efficiently evaluate models, Fits on Chips is the tool you need. Try it today and streamline your LLM performance tuning process! 🚀