When Should I Use Fits on Chips?
This article describes when to use the Fits on Chips toolkit, illustrated with specific use cases.
Mar 10, 2025
Introduction
Large Language Models (LLMs) have seen remarkable advancements in recent years. While proprietary models such as OpenAI's ChatGPT and Google's Gemini have dominated the landscape as closed-source API services, the emergence of open-weight large reasoning models like DeepSeek-R1 has significantly shifted the academic and industrial paradigm. Because open-weight reasoning models now perform comparably to their proprietary counterparts, LLM service providers have good reason to consider self-hosting models rather than relying solely on API services. Consequently, self-hosting large language models has regained substantial interest, particularly among organizations aiming to optimize performance, reduce operational costs, and maintain greater control over their AI deployments.
However, self-hosting large language models presents its own challenges, including the need for efficient inference, model optimization, and hardware selection. Without careful tuning, LLM inference can be computationally intensive, leading to high latency and increased operational costs. One key solution is model compression, through techniques such as quantization, pruning, and distillation. Quantization, in particular, reduces model size and computational overhead, enabling the deployment of large-scale models on edge devices or lower-cost hardware. Yet compression inevitably impacts output quality, so models must be thoroughly benchmarked and evaluated after optimization to ensure they maintain acceptable performance for real-world applications.
In a previous blog post (Fits on Chips: Saving LLM Costs Became Easier Than Ever), we introduced Fits on Chips, a powerful LLMOps research tool that helps users easily identify optimal parameter configurations for vLLM and TensorRT-LLM serving scenarios. In this article, we explore use cases demonstrating how Fits on Chips enables efficient LLM inference: tuning models in constrained environments, selecting the best GPU for different workloads, and benchmarking quantized models to assess the trade-offs between compression and output quality.
Case 1: Which framework should I use in a constrained environment?
In this experiment, we aim to identify the better-performing framework, vLLM or TensorRT-LLM, in terms of throughput under a constrained setting with a target Time-Per-Output-Token (TPOT) of ≤ 20 ms. To benchmark this scenario, we sampled 256 sentences from Dynamic-Sonnet-1K, with an average input length of 512 tokens, and fixed the number of output tokens to 128.
Framework Versions, Model, and Hardware
- Frameworks: vLLM (v0.7.2), TensorRT-LLM (v0.16.0)
- Model: Llama-3.1-8B-Instruct (BF16)
- H/W: 1 x NVIDIA A100 (80G), AMD EPYC 7713 64-Core processor, 1007.67 GB RAM

Using Fits on Chips, we systematically adjusted the maximum batch size (max_num_seqs in vLLM and max_batch_size in TensorRT-LLM), as shown in Figure 1, while keeping all other parameters at their default values, except for ignore_eos=True and use_v2_block_manager=True. A detailed explanation of how to use Fits on Chips is provided in the previous blog post and the user guide.
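For readers who want to approximate this setting outside of Fits on Chips, the sketch below shows how the same knobs map onto vLLM's offline Python API. This is not the Fits on Chips interface itself; parameter names follow vLLM v0.7.x and may differ in other versions, and the prompt is a placeholder.

```python
# Minimal sketch, not the Fits on Chips workflow: reproducing the swept
# setting with vLLM's offline Python API (names follow vLLM v0.7.x).
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    dtype="bfloat16",
    max_num_seqs=32,             # the swept "max batch size" parameter
    enable_prefix_caching=True,  # automatic prefix caching, as in the experiment
)

sampling = SamplingParams(
    max_tokens=128,   # fixed output length used in this benchmark
    ignore_eos=True,  # force generation to the full 128 tokens
)

# Placeholder prompt; the actual benchmark samples 256 sentences
# from Dynamic-Sonnet-1K.
outputs = llm.generate(["<prompt sampled from Dynamic-Sonnet-1K>"], sampling)
```

In TensorRT-LLM, the equivalent limit is the max_batch_size setting mentioned above.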

Figure 2 illustrates the throughput and TPOT results for both frameworks. Regardless of the framework and whether automatic prefix caching was enabled, the maximum batch size that satisfied the TPOT ≤ 20 ms constraint was 32. Automatic prefix caching further improved serving performance in both frameworks thanks to common prefix tokens shared across sentences, particularly the system prompts shared within the dataset. With a batch size of 32 and automatic prefix caching enabled, vLLM achieved a peak throughput of 1511.5 tokens/sec, while TensorRT-LLM reached 1723.1 tokens/sec, making TensorRT-LLM the more efficient choice in this constrained scenario.

Fits on Chips made it easy to modify parameters and compare results, allowing us to efficiently explore various configurations and find the optimal setting for LLM serving. As shown in Figure 3, Fits on Chips can also draw graphs with various parameters on the x-axis and benchmark results on the y-axis. This feature allows users to easily visualize experiment results and identify the best combination of parameters.
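For readers who prefer to export the raw benchmark numbers and chart them outside the tool, a rough matplotlib equivalent of this kind of plot might look like the following. The zero values below are empty placeholders to be filled with your own exported results, not measurements.

```python
# Rough matplotlib sketch of a Figure 3 style chart: one swept parameter on
# the x-axis, benchmark metrics on the y-axes.
import matplotlib.pyplot as plt

max_batch_sizes = [8, 16, 32, 64]   # swept parameter (x-axis)
throughput_tok_s = [0, 0, 0, 0]     # tokens/sec from your benchmark runs (placeholders)
tpot_ms = [0, 0, 0, 0]              # TPOT in ms from your benchmark runs (placeholders)

fig, ax1 = plt.subplots()
ax1.plot(max_batch_sizes, throughput_tok_s, marker="o", color="tab:blue")
ax1.set_xlabel("Max batch size")
ax1.set_ylabel("Throughput (tokens/sec)", color="tab:blue")

ax2 = ax1.twinx()
ax2.plot(max_batch_sizes, tpot_ms, marker="s", color="tab:red")
ax2.axhline(20, linestyle="--", color="gray")  # TPOT <= 20 ms constraint
ax2.set_ylabel("TPOT (ms)", color="tab:red")

plt.tight_layout()
plt.show()
```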
Case 2: Which device should I use, A100 or H100?
In this experiment, we leverage Fits on Chips' built-in cost estimation feature (see Figure 4), which lets users input a device cost ($/hour) and obtain an estimated cost per 1 million tokens during benchmarking. By applying this to two dataset types, prefill-heavy (e.g., summarization tasks) and decode-heavy (e.g., text generation tasks), we determine the more economical GPU choice for each scenario.
For the prefill-heavy case, we used 256 sentences sampled from the Dynamic-Sonnet-4K dataset, with an average input length of 2048 tokens and a fixed output length of 256 tokens. For the decode-heavy case, we used the same 256 sampled sentences as in Case 1, but increased the output token length to 4096.
Framework Version, Model, and Hardware
- Frameworks: TensorRT-LLM (v0.16.0)
- Model: Llama-3.1-8B-Instruct (BF16)
- H/W
- 1 x NVIDIA A100 SXM (80G), AMD EPYC 7713 64-Core processor, 1007.67 GB RAM, 1.89 $/hour
- 1 x NVIDIA H100 SXM (80G), Intel Xeon(R) Platinum 8470, 1007.67 GB RAM, 2.89 $/hour



Figure 6 shows the estimated cost per 1 million tokens for the two devices. In the decode-heavy scenario, the costs were similar, with the H100 slightly more cost-efficient than the A100. Although the H100's hourly rate is about 1.5x higher than the A100's, its performance advantage offsets the difference, leading to comparable estimated serving costs. In the prefill-heavy case, however, the performance gap between the devices widened, making the A100 about 1.59x more expensive than the H100.
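As a sanity check on figures like these, the underlying arithmetic is straightforward. The sketch below is a back-of-the-envelope approximation, not the exact formula Fits on Chips uses internally, and the throughput values are placeholders to be replaced with measured results.

```python
# Back-of-the-envelope cost estimate: USD per 1 million generated tokens,
# given an hourly device price and a sustained throughput. This approximates
# the idea behind the built-in estimator; the tool's exact formula may differ.

def cost_per_million_tokens(price_per_hour_usd: float,
                            throughput_tokens_per_sec: float) -> float:
    tokens_per_hour = throughput_tokens_per_sec * 3600
    return price_per_hour_usd / tokens_per_hour * 1_000_000

# Hourly prices from the setup above; replace the throughput placeholders
# with the values measured in your own benchmark runs.
a100_tok_s = 1.0  # placeholder, not a measured value
h100_tok_s = 1.0  # placeholder, not a measured value
print(cost_per_million_tokens(1.89, a100_tok_s))
print(cost_per_million_tokens(2.89, h100_tok_s))
```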
This experiment demonstrates the importance of choosing the right GPU and of accurately estimating token generation costs based on workload characteristics. Fits on Chips simplifies this decision-making process by automating benchmarking and cost estimation, enabling users to make informed GPU choices that optimize both performance and budget.
Case 3: Benchmark & Evaluation of quantized models
Quantization is a widely used technique for reducing the computational cost of LLM inference, but it comes with potential trade-offs in model accuracy and output quality. To systematically evaluate these effects, we leveraged Fits on Chips, which also provides built-in evaluation capabilities using popular datasets (see Figure 7). By using this feature, we tested Mistral-7B-Instruct-v0.3 and its quantized variants to assess both accuracy and throughput improvements. Except for ignore_eos=True and use_v2_block_manager=True, all other parameters were set to the default configurations provided by Fits on Chips.


Experiment Setup
- Model: Mistral-7B-Instruct-v0.3 and its quantized variants (GPTQ_W4A16, AWQ_W4A16, W8A8)
- Hardware: 1 × NVIDIA A100 (80G), AMD EPYC 7713 64-Core Processor, 1007.67 GB RAM
- Framework: vLLM (v0.7.2)
- Evaluation Datasets: ARC-Challenge (10-shot), TruthfulQA (MC2, 0-shot), PubMedQA (0-shot)
- Benchmark Dataset: Dynamic-Sonnet-1K with fixed 1K output token length
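Outside of Fits on Chips, such quantized variants can be served with the same vLLM entry point. The sketch below shows one way to do this for an AWQ checkpoint; the repository name is a hypothetical placeholder, and parameter names follow vLLM v0.7.x.

```python
# Minimal sketch of loading a quantized checkpoint with vLLM v0.7.x.
# The model id below is a hypothetical placeholder; vLLM typically infers the
# quantization scheme from the checkpoint config, and it can also be set
# explicitly (e.g., quantization="awq" or "gptq").
from vllm import LLM, SamplingParams

llm = LLM(
    model="your-org/Mistral-7B-Instruct-v0.3-AWQ-W4A16",  # hypothetical repo id
    quantization="awq",
)

sampling = SamplingParams(
    max_tokens=1024,  # fixed 1K output length used in the benchmark
    ignore_eos=True,
)

outputs = llm.generate(["<prompt sampled from Dynamic-Sonnet-1K>"], sampling)
```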

Figure 9 presents the benchmark and evaluation results for Mistral-7B-Instruct-v0.3 and its quantized variants in Fits on Chips. The evaluation showed that the quantized models exhibit only marginal variation in accuracy across all datasets. Note that all of these quality evaluations were performed automatically within the Fits on Chips platform.

Beyond automated evaluation, Fits on Chips also provides a playground feature that lets users manually input example text and inspect the generated responses. Figure 10 illustrates an output quality assessment for the AWQ-quantized model, confirming that it generates coherent and high-quality responses. Together, the evaluation and playground results validated the usability of the quantized models.
From a performance perspective, quantization yielded mixed results in the benchmark tests. The AWQ- and W8A8-quantized models achieved throughput improvements of approximately 1.3× and 1.28×, respectively, over the full-precision model. However, the GPTQ-quantized model did not outperform the baseline, highlighting that not all quantization methods lead to inference speedups under the same conditions. While a deeper analysis of these results is beyond the scope of this post, the findings underscore the need for case-by-case benchmarking when selecting a quantization method.
Ultimately, deploying quantized or otherwise compressed LLMs requires balancing evaluation accuracy, output quality, and actual performance gains on the target hardware. Fits on Chips significantly reduces the time and effort this process requires, providing an automated and structured approach to evaluating and benchmarking quantized models.
Conclusion
As LLM deployment continues to expand across various environments, optimizing inference performance has become a crucial challenge. Every decision, from the serving framework and hardware to the model and parameter configuration, directly impacts both cost-effectiveness and user experience. Given the complexity of these choices, data-driven benchmarking and evaluation are essential to ensure that models perform optimally under real-world constraints.
However, manually experimenting with different configurations, frameworks, and hardware can be time-consuming and resource-intensive. Fits on Chips simplifies this process, providing an automated and structured approach to LLM optimization. With built-in benchmarking, evaluation tools, and cost estimation features, it enables users to quickly test different setups, compare performance, and make informed decisions without the need for extensive manual tuning.
If you're looking to optimize LLM inference, reduce deployment costs, or efficiently evaluate models, Fits on Chips is the tool you need. Try it today and streamline your LLM performance tuning process! 🚀