Fits on Chips: Saving LLM Costs Is Now Easier Than Ever
This article introduces Fits on Chips, an LLMOps toolkit for performance evaluation.
Feb 26, 2025
1. Introduction

Large Language Models (LLMs) have been at the forefront of recent technological advancements, driven by their sophisticated natural language processing and enhanced multi-modality. The development of reasoning models, such as o3-mini and DeepSeek-R1, has further underscored the practical applications of LLMs, spurring innovative solutions across various industries.
Historically, many breakthroughs in this field have been kept proprietary, with access to the most advanced models restricted to dedicated APIs. The substantial scale of these models and the significant financial investment required for their development have presented challenges to open-sourcing these technologies.
However, recent advancements like DeepSeek-R1 have demonstrated that open-source models can achieve high-level performance, showcasing comparable capabilities to the proprietary models. This progress is particularly appealing to organizations that need to fine-tune models to meet specialized customer requirements or adhere to strict security and compliance standards. For these companies, open-source LLMs provide a viable means to maintain data control while continuing to explore the full potential of the technology.
One practical route to harness these benefits is to serve LLMs in-house. By hosting models internally, businesses can iterate on and adapt them as necessary while retaining ownership of the data and processes involved. As open-source solutions continue to improve, more enterprises are exploring the option of running their own LLMs, reducing dependency on external services and unlocking greater flexibility in how they innovate.
As more teams look for self-hosted LLM solutions, the need for easy setup and optimization has grown. Running models locally involves a lot of trial and error with different settings and system configurations, which can quickly become overwhelming without the right tools. That’s where Fits on Chips comes in. Fits on Chips focuses on simplifying how teams design experiments, set up benchmarking environments, and iterate on performance metrics. In the following sections, we’ll briefly explore the key factors that affect LLM serving. We’ll also examine how Fits on Chips centralizes and automates parameter exploration, ultimately lowering the barriers to running high-quality LLMs in an organization’s own environment.
2. LLM Serving Frameworks & Key Performance Metrics

Choosing the right framework and configuration for serving your LLM is essential for achieving consistent performance. In practice, many teams find themselves comparing vLLM and TensorRT-LLM, each of which brings distinct optimizations to the table. As discussed in our previous posts, vLLM's open-source design makes it easy to extend, which is especially appealing for those looking to integrate cutting-edge models and serving techniques. In contrast, TensorRT-LLM leverages NVIDIA's TensorRT library for lower-level hardware optimizations, often resulting in superior serving performance compared to vLLM. Which one is best for you depends on factors like your hardware setup, model size, and how much control you need over scheduling and memory management.
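To make these batching parameters concrete, here is a minimal sketch of configuring vLLM's offline engine with explicit limits on concurrent sequences and scheduled tokens. The model ID and the numeric values are placeholders chosen for illustration, and TensorRT-LLM exposes analogous knobs (such as maximum batch size and maximum token counts) when an engine is built and served.

```python
# A minimal sketch (not an official recommendation): constructing a vLLM engine
# with explicit batching limits. Parameter names follow vLLM's engine arguments;
# the model ID and values below are placeholders for illustration.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",  # placeholder Hugging Face repository path
    max_num_seqs=64,                   # upper bound on concurrently batched sequences
    max_num_batched_tokens=8192,       # cap on tokens scheduled per engine step
    gpu_memory_utilization=0.90,       # fraction of GPU memory for weights + KV cache
)

params = SamplingParams(max_tokens=256, temperature=0.0)
outputs = llm.generate(["Summarize the benefits of self-hosted LLM serving."], params)
print(outputs[0].outputs[0].text)
```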
Beyond framework choice, configuration parameters such as maximum batch size and maximum number of tokens can dramatically influence your results. Our experiments showed that while increasing batch size generally boosts throughput, it can also degrade Time-To-First-Token (TTFT) or Time-Per-Output-Token (TPOT). Tracking key metrics like TTFT, TPOT, and throughput under different conditions is therefore crucial. A setup optimized for one metric may introduce trade-offs in another, so it's important to clarify whether your priority is minimal latency, maximum throughput, or a balanced approach.
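For reference, the sketch below shows how these metrics are commonly computed from per-request timestamps collected during a streaming benchmark. The record format is an illustrative assumption, not Fits on Chips' internal schema.

```python
# Illustrative definitions of the latency metrics discussed above, computed from
# per-request timestamps. The record format is a made-up example for clarity.
from dataclasses import dataclass

@dataclass
class RequestTrace:
    sent_at: float          # time the request was issued (seconds)
    first_token_at: float   # time the first output token arrived
    finished_at: float      # time the last output token arrived
    output_tokens: int      # number of tokens generated

def ttft(t: RequestTrace) -> float:
    # Time-To-First-Token: how long the user waits before anything appears
    return t.first_token_at - t.sent_at

def tpot(t: RequestTrace) -> float:
    # Time-Per-Output-Token: average gap between tokens after the first one
    return (t.finished_at - t.first_token_at) / max(t.output_tokens - 1, 1)

def throughput(traces: list[RequestTrace]) -> float:
    # total generated tokens divided by the wall-clock span of the run
    start = min(t.sent_at for t in traces)
    end = max(t.finished_at for t in traces)
    return sum(t.output_tokens for t in traces) / (end - start)
```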
As we’ve seen in both prefill-heavy and decode-heavy scenarios, the best serving configurations rarely come from relying on default settings or making single-parameter adjustments. Instead, a process of iterative experimentation—observing how each parameter change affects your metrics under realistic request rates and pairing those insights with knowledge of each framework’s unique strengths—often leads to more efficient setups.
It’s worth noting, however, that maximum batch size is merely one of many parameters available in both vLLM and TensorRT-LLM. Advanced scheduling strategies, memory optimizations, and other tuning options can significantly impact performance and may warrant their own thorough experimentation. Whether you choose vLLM or TensorRT-LLM, systematically exploring this broader range of parameters is central to unlocking truly high-performance LLM serving.
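As an illustration of such a systematic sweep, the sketch below iterates over a small grid of parameter combinations and logs the resulting metrics for later comparison. The run_benchmark function is a hypothetical stand-in for whatever harness you use (or for Fits on Chips itself); the parameter values are placeholders.

```python
# A sketch of the iterative exploration described above: sweep a small grid of
# serving parameters and record the resulting metrics in a CSV for comparison.
import csv
import itertools

def run_benchmark(max_batch_size: int, request_rate: float) -> dict:
    # Hypothetical placeholder: a real implementation would relaunch the server
    # with `max_batch_size`, replay a dataset at `request_rate`, and measure results.
    return {"ttft": 0.0, "tpot": 0.0, "throughput": 0.0}

max_batch_sizes = [16, 32, 64, 128]
request_rates = [1.0, 4.0, 8.0]  # requests per second

with open("sweep_results.csv", "w", newline="") as f:
    writer = csv.DictWriter(
        f, fieldnames=["max_batch_size", "request_rate", "ttft", "tpot", "throughput"]
    )
    writer.writeheader()
    for batch_size, rate in itertools.product(max_batch_sizes, request_rates):
        metrics = run_benchmark(max_batch_size=batch_size, request_rate=rate)
        writer.writerow({"max_batch_size": batch_size, "request_rate": rate, **metrics})
```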
3. The Challenge of Parameter Tuning

Parameter tuning can often feel like trying to solve a moving puzzle. Even with a well-defined set of metrics, there is no single “best” configuration that applies universally. Each service scenario has its own nuances. Hardware constraints, dataset characteristics, and the specific version of a serving framework all play a part in shaping optimal settings. A parameter setup that works perfectly for one environment might underperform in another. This variability can make the search for an ideal configuration a never-ending process.
Another challenge lies in conducting structured experiments. It’s one thing to hypothesize about how an increased batch size or an adjusted request rate might improve throughput, but it’s quite another to systematically test every combination of parameters in a real-world environment. Experimentation can become time-consuming and resource-intensive, particularly if the process isn’t well-coordinated among team members. Without a unified framework to keep everyone on the same page, these groups can end up working in silos, leading to inconsistencies in how parameters are tested and interpreted.
Despite these obstacles, fine-tuning parameters remains crucial for delivering a smooth user experience and staying cost-efficient. Adjustments in request rate or batch size can significantly reduce response times, but they might also shift resource utilization in unexpected ways. Similarly, over-optimizing for throughput could degrade latency for certain users who require faster initial responses. Balancing these trade-offs is a delicate act, and it only becomes more challenging when models evolve or workloads scale. Small tweaks, like changing the queue length or toggling a specific optimization flag, may unlock better performance but also require careful validation to ensure that these benefits hold over time. Ultimately, it’s the willingness to iterate, gather empirical data, and collaborate across roles that helps teams discover parameter configurations aligned with their performance goals. Recognizing the complexity of this process is the first step toward managing it successfully.
4. The Fits on Chips Solution

At its core, Fits on Chips simplifies how you configure hardware, models, and datasets, and then guides you through the process of tuning parameters, running experiments, and interpreting results. Previously, multiple ML engineers would create spreadsheets or other planning documents to outline their experiments, then each person would set up their own environment or share a pre-configured one with colleagues. They also had to ensure that necessary datasets and models were allocated sufficient storage or network resources, often juggling various constraints just to make the experiments run smoothly. Finally, after executing the tests, they would review the results and manually update a shared document, hoping everyone remained on the same page.
With Fits on Chips, all these steps—planning, environment configuration, resource allocation, and result tracking—are handled within a unified interface. Teams that might otherwise operate in silos can now collaborate in real time, which not only streamlines the entire workflow but also makes any gains in performance or cost efficiency readily visible to everyone involved. This unified approach ensures that the most up-to-date information is always accessible, reducing the risk of miscommunication and helping teams iterate more effectively.
1. Configure Testing Materials
Fits on Chips provides a shared workspace where you can register your nodes (hardware), models, and datasets. After installing the provided Docker image on your hardware, you can add the device as a “Node” on the platform. Registering a model is as simple as pasting its Hugging Face repository path into the Model registration form, and datasets can be added the same way by providing the relevant repository path. Once registered, these resources become available across your team, which reduces the need for ad hoc sharing or repetitive setups and helps everyone work under consistent conditions.
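As an optional aside, and independent of Fits on Chips itself, a repository path can be sanity-checked with the Hugging Face Hub client before it is pasted into the registration form. The repo IDs below are placeholders.

```python
# Optional sanity check (not part of Fits on Chips): confirm that a Hugging Face
# repository path resolves before registering it. The repo IDs are placeholders.
from huggingface_hub import HfApi

api = HfApi()
model_repo = "facebook/opt-125m"   # placeholder model repository path
dataset_repo = "tatsu-lab/alpaca"  # placeholder dataset repository path

print(api.model_info(model_repo).id)     # raises an error if the repo does not exist
print(api.dataset_info(dataset_repo).id)
```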
2. Set Up Experiment Parameters
In Fits on Chips, experiments are organized into “projects,” each corresponding to a specific hardware/software combination, such as GPU + TensorRT-LLM or GAUDI + vLLM. Within each project, you can define multiple experiments and select from the nodes, models, and datasets you registered earlier. This stage is where you decide which parameters, such as request rate or batch size, to test. The interface includes options for choosing which variables to keep fixed and which to vary. Although Fits on Chips suggests starting with default configurations, it also encourages you to explore custom settings that may better match your performance goals. This structured approach aims to reduce the trial-and-error nature of parameter tuning, helping you discover which levers actually move your metrics in the desired direction.
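For illustration only, the snippet below mocks up the fixed-versus-varied split this step describes. It is an invented structure with placeholder names, not Fits on Chips' actual project or experiment schema.

```python
# An invented illustration (not Fits on Chips' actual schema) of one project per
# hardware/framework pairing, with some parameters pinned and others swept.
experiment_spec = {
    "project": "A100-trtllm",             # placeholder hardware/framework pairing
    "node": "gpu-node-01",                # a node registered in step 1 (placeholder name)
    "model": "Qwen/Qwen2.5-7B-Instruct",  # placeholder registered model
    "dataset": "tatsu-lab/alpaca",        # placeholder registered dataset
    "fixed": {
        "max_num_tokens": 8192,
        "dtype": "float16",
    },
    "varied": {
        "max_batch_size": [16, 32, 64, 128],
        "request_rate": [1.0, 4.0, 8.0],  # requests per second
    },
}
```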
3. Launch Experiments
Once you’ve settled on a set of parameter combinations, launching the experiment is straightforward. Selecting the desired configurations and clicking the “Benchmark” button begins the process of running each test sequence. A status indicator will show you the progress of active benchmarks, offering real-time feedback on whether they are running smoothly or encountering any issues. This minimizes the guesswork around whether your experiments are being executed as intended. Fits on Chips consolidates this information into a single view, eliminating the need to hunt through logs or manually track each run.
In addition, the “Playground” and “Evaluation” features let you assess not only how efficiently your model can be served but also how well it actually performs. The Playground lets you interact with the LLM running in a Docker container configured with the experiment’s parameter settings, while the Evaluation feature measures and compares the LLM’s quality using widely adopted LLM evaluation metrics. Both are seamlessly integrated into Fits on Chips.
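For teams that prefer scripting over the UI, a similar interaction can be approximated programmatically, assuming the experiment's container exposes an OpenAI-compatible endpoint (as vLLM's server does, and as TensorRT-LLM deployments commonly do). The base URL and model name below are placeholders for your own deployment.

```python
# A hedged sketch of querying a served model, assuming the benchmark container
# exposes an OpenAI-compatible endpoint. URL and model name are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[{"role": "user", "content": "Give me one tip for reducing TTFT."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```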
4. Get and Share Insights
After the experiments are complete, the platform provides visualization tools for comparing results across multiple runs. You can use line plots to see how metrics like TTFT, TPOT, or throughput change in response to different parameter settings. Parallel coordinates are also available for a multi-dimensional view of how various parameters interact. This makes it easier to identify correlations and trade-offs—such as improved throughput at the cost of higher latency—and to discuss them openly with the team. Since all data is saved within the platform, sharing insights becomes as simple as granting access or exporting the visualizations. Product managers can quickly see how a particular setting impacts user-facing latency, while engineers can dive into the logs to troubleshoot performance bottlenecks. In this way, Fits on Chips not only centralizes data but also encourages collaborative decision-making about next steps.
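If you export results (for example, as a CSV), a comparable parallel-coordinates view can also be reproduced outside the platform with standard tooling. The snippet below is a sketch using plotly, with a placeholder file path and column names matching the earlier sweep example.

```python
# A sketch of reproducing a parallel-coordinates view from exported results,
# assuming a CSV with the columns used in the sweep example above.
import pandas as pd
import plotly.express as px

df = pd.read_csv("sweep_results.csv")  # placeholder export path
fig = px.parallel_coordinates(
    df,
    dimensions=["max_batch_size", "request_rate", "ttft", "tpot", "throughput"],
    color="throughput",
)
fig.show()
```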
By going through these four stages—configuring testing materials, setting parameters, running experiments, and analyzing results—teams can shift from fragmented, manual processes to a more integrated and repeatable workflow. While it may not eliminate every complexity of LLM deployment, Fits on Chips offers a significant step forward for groups looking to systematically optimize their models and share insights with minimal friction.
5. Conclusion & Future Outlook
The journey towards optimized and efficient LLM serving can be complex, but it also holds immense potential for organizations trying to deploy and manage their own language models in-house.
Throughout the preceding sections, we have explored the challenges of parameter tuning, the significance of choosing the right serving framework, and how a cohesive platform like Fits on Chips can simplify this fragmented and tedious process. While no single solution can resolve all the intricacies of LLM deployment, structured experimentation and improved collaboration can go a long way toward achieving reliable, high-performance outcomes.
Looking ahead, the field is likely to evolve alongside new hardware innovations (such as cutting-edge GPU or TPU architectures) and emerging LLM frameworks that promise even faster and more scalable serving capabilities. We can also expect further emphasis on collaborative features—tools that enable multiple roles within a team to manage experiments and interpret results together. As data volumes grow and use cases diversify, strong data analytics and visualization capabilities will become increasingly vital for extracting actionable insights.
For those who are just beginning to explore LLM serving or find themselves bogged down in trial-and-error testing, now is the right time to experiment with Fits on Chips. By systematically refining parameters and sharing results more seamlessly, teams can not only save time but also discover configurations that genuinely align with their performance and cost objectives. Continuous learning and adaptation will remain key, so we encourage you to dive in, iterate, and keep pushing the boundaries of what your LLMs can do.
Try Fits on Chips now!
You can start exploring these features for free! Click the link below to get started.