Winning on both speed and quality: How Yetter handles diffusion models
Explore how the Yetter Inference Engine overcomes the limitations of step caching and model distillation for diffusion models. We analyze latency, diversity, quality, and negative-prompt handling to reveal what truly matters for scalable, real-time image generation.
Yetter, the GenAI API service: AI Optimization, Out of the Box
Meet 'Yetter': the generative AI API service built for speed, efficiency, and scalability. Powered by our optimized inference engine, it delivers reliable image and video services today, with LLM services to follow, at a fraction of the cost.
Guided Decoding Performance on vLLM and SGLang
Your guide to LLM guided decoding! This deep-dive benchmark compares XGrammar and LLGuidance on vLLM and SGLang to help you find the optimal setup for generating structured output for your use case.
Disaggregated Inference on Apple Silicon: NPU prefill and GPU decode
In this article, we show how to run LLMs efficiently on Apple Silicon using a disaggregated inference technique: prefill on the NPU and decode on the GPU.