
vLLM Hands-on Workshop with Rebellions & SqueezeBits: A Recap

Rebellions and SqueezeBits Co-Host a vLLM Hands-on Workshop: Workshop Highlights, PyTorch Best Practices, Performance Optimization, and First-Hand Developer Tips
Goeun Kang
Dec 10, 2025
Contents
  • What Made This Workshop So Special?
  • Why vLLM Matters
  • Expanding vLLM to NPU Environments
  • Practicing on Production-Grade Infrastructure
  • Verifying Performance Optimization First-Hand
  • Growing with the vLLM Community

In 2025, SqueezeBits and Rebellions co-hosted a hands-on workshop that stood out as a highlight for both our team and the attendees. In October and November, strong interest and participation helped us successfully wrap up both sessions of the vLLM Hands-on Workshop with Rebellions & SqueezeBits.

Attendees gained hands-on experience serving large language models (LLMs) with vLLM on Rebellions' NPU hardware, equipment that most developers rarely get to work with.

vLLM Hands-On Workshop with Rebellions & SqueezeBits

What Made This Workshop So Special?

Building on the momentum of the first vLLM Korea Meetup in August 2025, this workshop was a direct response to the community's growing enthusiasm. As AI adoption accelerates across Korea, vLLM is rapidly becoming the go-to inference engine, driving strong demand for practical, hands-on training.

So SqueezeBits and Rebellions designed a practice-led training program built on authentic NPU development environments. The objective was straightforward: give engineers direct, hands-on experience deploying LLMs with vLLM on specialized hardware usually reserved for deep-tech labs.

vLLM: The De Facto Open GenAI Inference Platform

Why vLLM Matters

Scaling LLM services in production requires more than just speed. It demands highly efficient inference. As user traffic grows, the cost and computational complexity of managing those requests can quickly become a bottleneck.

vLLM solves this problem. It is a high-throughput serving engine that maximizes GPU utilization, reduces latency, handles more concurrent requests, and lowers operating costs.

In short, vLLM makes it practical to serve larger, more capable AI models at a reasonable cost.
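One of the techniques behind that efficiency is continuous batching. The toy sketch below (plain Python, not vLLM itself, with hypothetical decode-step counts) contrasts it with static batching: a static batch stalls until its longest request finishes, while continuous batching refills freed slots immediately.

```python
# Toy illustration of continuous batching (not vLLM's actual scheduler).
# Each request length is a hypothetical number of decode steps.

def static_batching_steps(lengths, batch_size):
    """Total decode steps when a batch only finishes as a whole."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])  # stragglers stall the batch
    return steps

def continuous_batching_steps(lengths, batch_size):
    """Total decode steps when finished requests free their slot at once."""
    pending = sorted(lengths)  # longest-first admission (a simplification)
    active = [pending.pop() for _ in range(min(batch_size, len(pending)))]
    steps = 0
    while active:
        steps += 1
        active = [l - 1 for l in active if l > 1]  # finished slots open up
        while pending and len(active) < batch_size:
            active.append(pending.pop())
    return steps

lengths = [3, 3, 3, 20]  # one long request among short ones
print(static_batching_steps(lengths, batch_size=2))      # 23
print(continuous_batching_steps(lengths, batch_size=2))  # 20
```

With the long request pinned to one slot, the other slot keeps cycling through short requests instead of sitting idle, which is exactly the effect that raises throughput on real serving engines.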

Rebellions' chip roadmap

Expanding vLLM to NPU Environments

This workshop stood out because attendees ran vLLM on an NPU (Neural Processing Unit) — hardware most developers have never worked with directly.

vLLM's architecture prioritizes extensibility. The vLLM-RBLN Plugin for Rebellions' ATOM™ chip unlocked vLLM's full serving capabilities without complex system-level development.
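Mechanically, vLLM discovers out-of-tree hardware backends through Python entry points, so installing a plugin package is enough to make a new device available. The packaging sketch below is illustrative only; the names are hypothetical, not the actual vllm-rbln metadata.

```toml
# Hypothetical pyproject.toml fragment for an out-of-tree vLLM backend.
# vLLM scans the "vllm.platform_plugins" entry-point group at startup;
# "my_npu" and "my_vllm_plugin" are placeholder names for illustration.
[project.entry-points."vllm.platform_plugins"]
my_npu = "my_vllm_plugin:register"
```

Because discovery happens at import time, user-facing serving code stays unchanged: the same `vllm serve` or `LLM(...)` entry points work once the plugin is installed.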

The curriculum was intuitive enough that anyone with a basic grasp of Python and PyTorch could jump right in. Even after a long day at work, the energy in the room stayed high, and attendees remained fully engaged until the very end.

Rebellions' ATOM™-MAX

Practicing on Production-Grade Infrastructure

All exercises ran on Rebellions' ATOM™-MAX NPU servers. Participants worked in conditions that matched a real production environment.

Kubernetes provided a stable infrastructure layer, so attendees could start working immediately without manual setup. PyTorch-based workflows — tensor operations, model inference — worked the same way they do on GPUs, keeping the learning curve low.
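The "same as on GPUs" claim comes down to PyTorch's device-agnostic programming model. A minimal sketch, assuming a hypothetical `"rbln"` device string for illustration (the real torch-rbln device probe may differ); on a plain install it simply falls back to CPU:

```python
# Device-agnostic PyTorch flow: the same calls target GPU, CPU, or NPU.
import torch

def pick_device() -> str:
    # Hypothetical probe for an RBLN-enabled install (illustrative only).
    if getattr(torch, "rbln", None) is not None:
        return "rbln"
    if torch.cuda.is_available():
        return "cuda"
    return "cpu"

device = pick_device()
model = torch.nn.Linear(4, 2).to(device)  # same .to() call on any backend
x = torch.randn(8, 4, device=device)      # tensors allocate the same way
with torch.no_grad():
    y = model(x)                          # inference is backend-agnostic
print(y.shape)  # torch.Size([8, 2])
```

Since only the device string changes, existing GPU-oriented notebooks carry over with essentially no edits, which is what kept the learning curve low in the sessions.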

Attendees shared feedback like: "I expected new hardware to be complicated, but it turned out to be much easier and more practical than I thought."

Introduction to torch-rbln

Verifying Performance Optimization First-Hand

The workshop went beyond running models. It covered performance optimization techniques critical for enterprise deployments.

Attendees started with basic Hugging Face Transformers inference, then used the RBLN profiler to identify bottlenecks. They progressed through Optimum and vLLM inference with Flash Attention, KV caching, and continuous batching applied. Each step showed measurable improvements in memory usage and response speed.
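To see why KV caching alone makes such a measurable difference, consider the amount of key/value work per decode step. The counting sketch below is a simplification in plain Python, not the RBLN stack: without a cache, step t re-encodes all t previous tokens; with a cache, each step only projects the newest token.

```python
# Toy cost model for KV caching during autoregressive decoding.

def kv_ops_without_cache(num_steps):
    """K/V projections over a full decode with no cache: step t redoes t tokens."""
    return sum(t for t in range(1, num_steps + 1))

def kv_ops_with_cache(num_steps):
    """K/V projections with a cache: exactly one new token per step."""
    return num_steps

print(kv_ops_without_cache(100))  # 5050 projections (quadratic growth)
print(kv_ops_with_cache(100))     # 100 projections (linear growth)
```

The cache trades memory for compute, which is why the follow-on steps in the curriculum (continuous batching, attention kernels) focus on managing that memory efficiently.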

Introduction to RBLN Profiler

The session also demonstrated Mixture of Experts (MoE) architectures running on NPUs. Seeing a large-scale MoE model run on NPU hardware, attendees came away convinced that enterprise-grade AI services can scale effectively on NPU-based infrastructure.
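The reason MoE models scale well on accelerators is sparse activation: a router sends each token to only its top-k experts, so a very large model activates only a fraction of its parameters per token. A minimal routing sketch (illustrative, not the workshop's model):

```python
# Minimal Mixture-of-Experts routing sketch with top-k gating.

def top_k_route(scores, k):
    """Return indices of the k highest-scoring experts for one token."""
    return sorted(range(len(scores)), key=lambda i: -scores[i])[:k]

def moe_forward(x, expert_fns, gate_scores, k=2):
    """Weighted combination of the selected experts' outputs."""
    chosen = top_k_route(gate_scores, k)
    total = sum(gate_scores[i] for i in chosen)  # renormalize chosen gates
    return sum(gate_scores[i] / total * expert_fns[i](x) for i in chosen)

# Four tiny stand-in "experts"; a real layer uses learned feed-forward blocks.
experts = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 1, lambda x: x * 3]
scores = [0.1, 0.5, 0.1, 0.3]       # hypothetical gate outputs for one token
print(top_k_route(scores, k=2))     # [1, 3]: only two of four experts run
print(moe_forward(10.0, experts, scores, k=2))  # 23.75
```

With k fixed, per-token compute stays roughly constant as expert count grows, which is the property that makes enterprise-scale MoE serving tractable on fixed-memory accelerators.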

The seamless experience was made possible by the vLLM-RBLN Plugin, which preserved the existing GPU-based code flow with minimal changes.

Growing with the vLLM Community

This workshop was not a demo or a lecture. It was a hands-on session running on a Kubernetes-based production infrastructure, giving attendees real operational experience they can apply directly.

SqueezeBits and Rebellions are planning more joint vLLM sessions for 2026 — more workshops, more meetups, and expanded technical content. Follow SqueezeBits on LinkedIn to catch up on upcoming event announcements!


The official SqueezeBits Tech blog
