Fits on Chips
SqueezeBits
Jiwoong Choi
Disaggregated Inference on Apple Silicon: NPU prefill and GPU decode
This article introduces how to run LLMs efficiently on Apple Silicon using the disaggregated inference technique.
Aug 26, 2025
Tech
The Missing Piece of TensorRT-LLM
This article introduces an open-source library for directly converting PyTorch models to TensorRT-LLM.
Feb 10, 2025
Tech