
vLLM Hands-on Workshop with Rebellions & SqueezeBits: A Recap

Rebellions and SqueezeBits Co-Host a vLLM Hands-on Workshop: Workshop Highlights, PyTorch Best Practices, Performance Optimization, and First-Hand Developer Tips
Goeun Kang
Dec 10, 2025
Contents
  • What Made This Workshop So Special?
  • Why vLLM Matters
  • Expanding vLLM to NPU Environments
  • Practicing on Production-Grade Infrastructure
  • Verifying Performance Optimization First-Hand
  • Growing with the vLLM Community

In 2025, SqueezeBits and Rebellions co-hosted a hands-on workshop that stood out as a highlight for both our team and the attendees. In October and November, strong interest and participation helped us successfully wrap up both sessions of the vLLM Hands-on Workshop with Rebellions & SqueezeBits.

Attendees gained hands-on experience serving large language models (LLMs) with vLLM on Rebellions' NPU hardware, equipment that most developers rarely get to work with.

vLLM Hands-On Workshop with Rebellions & SqueezeBits

What Made This Workshop So Special?

Building on the momentum of the first vLLM Korea Meetup in August 2025, this workshop was a direct response to the community's growing enthusiasm. As AI adoption accelerates across Korea, vLLM is rapidly becoming the go-to inference engine, driving strong demand for practical, hands-on training.

So SqueezeBits and Rebellions designed a practice-led training program built on authentic NPU development environments. The objective was straightforward: give engineers direct, hands-on experience deploying LLMs with vLLM on specialized hardware usually reserved for deep-tech labs.

vLLM: The De Facto Open GenAI Inference Platform

Why vLLM Matters

Scaling LLM services in production requires more than just speed. It demands highly efficient inference. As user traffic grows, the cost and computational complexity of managing those requests can quickly become a bottleneck.

vLLM solves this problem. It is a high-throughput serving engine that maximizes GPU utilization, reduces latency, handles more concurrent requests, and lowers operating costs.

In short, vLLM makes it practical to serve larger, more capable AI models at a reasonable cost.
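One of the techniques behind that efficiency is continuous batching. The toy sketch below (plain Python, not vLLM itself, with hypothetical decode-step counts) contrasts it with static batching: a static batch stalls until its longest request finishes, while continuous batching refills freed slots immediately.

```python
# Toy illustration of continuous batching (not vLLM's actual scheduler).
# Each request length is a hypothetical number of decode steps.

def static_batching_steps(lengths, batch_size):
    """Total decode steps when a batch only finishes as a whole."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])  # stragglers stall the batch
    return steps

def continuous_batching_steps(lengths, batch_size):
    """Total decode steps when finished requests free their slot at once."""
    pending = sorted(lengths)  # longest-first admission (a simplification)
    active = [pending.pop() for _ in range(min(batch_size, len(pending)))]
    steps = 0
    while active:
        steps += 1
        active = [l - 1 for l in active if l > 1]  # finished slots open up
        while pending and len(active) < batch_size:
            active.append(pending.pop())
    return steps

lengths = [3, 3, 3, 20]  # one long request among short ones
print(static_batching_steps(lengths, batch_size=2))      # 23
print(continuous_batching_steps(lengths, batch_size=2))  # 20
```

With the long request pinned to one slot, the other slot keeps cycling through short requests instead of sitting idle, which is exactly the effect that raises throughput on real serving engines.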

Rebellions' chip roadmap

Expanding vLLM to NPU Environments

This workshop stood out because attendees ran vLLM on an NPU (Neural Processing Unit) — hardware most developers have never worked with directly.

vLLM's architecture prioritizes extensibility. The vLLM-RBLN Plugin for Rebellions' ATOM™ chip unlocked vLLM's full serving capabilities without complex system-level development.
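Mechanically, vLLM discovers out-of-tree hardware backends through Python entry points, so installing a plugin package is enough to make a new device available. The packaging sketch below is illustrative only; the names are hypothetical, not the actual vllm-rbln metadata.

```toml
# Hypothetical pyproject.toml fragment for an out-of-tree vLLM backend.
# vLLM scans the "vllm.platform_plugins" entry-point group at startup;
# "my_npu" and "my_vllm_plugin" are placeholder names for illustration.
[project.entry-points."vllm.platform_plugins"]
my_npu = "my_vllm_plugin:register"
```

Because discovery happens at import time, user-facing serving code stays unchanged: the same `vllm serve` or `LLM(...)` entry points work once the plugin is installed.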

The curriculum was intuitive enough that anyone with a basic grasp of Python and PyTorch could jump right in. Even after a long day at work, the energy in the room stayed high, and attendees remained fully engaged until the very end.

Rebellions' ATOM™-MAX

Practicing on Production-Grade Infrastructure

All exercises ran on Rebellions' ATOM™-MAX NPU servers. Participants worked in conditions that matched a real production environment.

Kubernetes provided a stable infrastructure layer, so attendees could start working immediately without manual setup. PyTorch-based workflows — tensor operations, model inference — worked the same way they do on GPUs, keeping the learning curve low.
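The "same as on GPUs" claim comes down to PyTorch's device-agnostic programming model. A minimal sketch, assuming a hypothetical `"rbln"` device string for illustration (the real torch-rbln device probe may differ); on a plain install it simply falls back to CPU:

```python
# Device-agnostic PyTorch flow: the same calls target GPU, CPU, or NPU.
import torch

def pick_device() -> str:
    # Hypothetical probe for an RBLN-enabled install (illustrative only).
    if getattr(torch, "rbln", None) is not None:
        return "rbln"
    if torch.cuda.is_available():
        return "cuda"
    return "cpu"

device = pick_device()
model = torch.nn.Linear(4, 2).to(device)  # same .to() call on any backend
x = torch.randn(8, 4, device=device)      # tensors allocate the same way
with torch.no_grad():
    y = model(x)                          # inference is backend-agnostic
print(y.shape)  # torch.Size([8, 2])
```

Since only the device string changes, existing GPU-oriented notebooks carry over with essentially no edits, which is what kept the learning curve low in the sessions.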

Attendees shared feedback like: "I expected new hardware to be complicated, but it turned out to be much easier and more practical than I thought."

Introduction to torch-rbln

Verifying Performance Optimization First-Hand

The workshop went beyond running models. It covered performance optimization techniques critical for enterprise deployments.

Attendees started with basic Hugging Face Transformers inference, then used the RBLN profiler to identify bottlenecks. They progressed through Optimum and vLLM inference with Flash Attention, KV caching, and continuous batching applied. Each step showed measurable improvements in memory usage and response speed.
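To see why KV caching alone makes such a measurable difference, consider the amount of key/value work per decode step. The counting sketch below is a simplification in plain Python, not the RBLN stack: without a cache, step t re-encodes all t previous tokens; with a cache, each step only projects the newest token.

```python
# Toy cost model for KV caching during autoregressive decoding.

def kv_ops_without_cache(num_steps):
    """K/V projections over a full decode with no cache: step t redoes t tokens."""
    return sum(t for t in range(1, num_steps + 1))

def kv_ops_with_cache(num_steps):
    """K/V projections with a cache: exactly one new token per step."""
    return num_steps

print(kv_ops_without_cache(100))  # 5050 projections (quadratic growth)
print(kv_ops_with_cache(100))     # 100 projections (linear growth)
```

The cache trades memory for compute, which is why the follow-on steps in the curriculum (continuous batching, attention kernels) focus on managing that memory efficiently.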

Introduction to RBLN Profiler

The session also demonstrated Mixture of Experts (MoE) architectures running on NPUs. Seeing a large-scale MoE model run on NPU hardware, attendees came away convinced that enterprise-grade AI services can scale effectively on NPU-based infrastructure.
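The reason MoE models scale well on accelerators is sparse activation: a router sends each token to only its top-k experts, so a very large model activates only a fraction of its parameters per token. A minimal routing sketch (illustrative, not the workshop's model):

```python
# Minimal Mixture-of-Experts routing sketch with top-k gating.

def top_k_route(scores, k):
    """Return indices of the k highest-scoring experts for one token."""
    return sorted(range(len(scores)), key=lambda i: -scores[i])[:k]

def moe_forward(x, expert_fns, gate_scores, k=2):
    """Weighted combination of the selected experts' outputs."""
    chosen = top_k_route(gate_scores, k)
    total = sum(gate_scores[i] for i in chosen)  # renormalize chosen gates
    return sum(gate_scores[i] / total * expert_fns[i](x) for i in chosen)

# Four tiny stand-in "experts"; a real layer uses learned feed-forward blocks.
experts = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 1, lambda x: x * 3]
scores = [0.1, 0.5, 0.1, 0.3]       # hypothetical gate outputs for one token
print(top_k_route(scores, k=2))     # [1, 3]: only two of four experts run
print(moe_forward(10.0, experts, scores, k=2))  # 23.75
```

With k fixed, per-token compute stays roughly constant as expert count grows, which is the property that makes enterprise-scale MoE serving tractable on fixed-memory accelerators.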

The seamless experience was made possible by the vLLM-RBLN Plugin, which preserved the existing GPU-based code flow with minimal changes.

Growing with the vLLM Community

This workshop was not a demo or a lecture. It was a hands-on session running on a Kubernetes-based production infrastructure, giving attendees real operational experience they can apply directly.

SqueezeBits and Rebellions are planning more joint vLLM sessions for 2026 — more workshops, more meetups, and expanded technical content. Follow SqueezeBits on LinkedIn to catch up on upcoming event announcements!


The official SqueezeBits Tech blog
