Small Language Models in Production: Optimizing inference, reducing costs, and delivering enterprise-ready AI with quantization and distillation metho (Paperback) (ISBN-13: 9798268181524)

Brand: Talia Graham
SKU: 9798268181524
Price: 29.99 USD
Availability: InStock

Product type: Books

Format: Paperback


Description

Ship enterprise-ready AI that is fast, affordable, and controllable, using small language models engineered through quantization and distillation.

Many teams want the benefits of language models, but cost, latency, and compliance block real progress. This book focuses on making production systems work on real infrastructure, with methods that lower memory use, raise tokens per second, and keep behavior auditable. You will see where small models beat larger ones, how to size fleets for peak demand, and how to align performance targets with budgets. The material is grounded in examples from healthcare, finance, retail, and manufacturing, so the guidance maps cleanly to day-to-day decisions.

You will learn practical approaches that move beyond proofs of concept. The book explains how to compress and serve models without losing essential quality, how to benchmark instruction following and safety, and how to meet obligations under current governance standards. Each topic connects to production tasks such as rollout planning, model monitoring, and incident response. The goal is clear: to help you deploy reliable systems that meet service levels and cost controls.

Apply weight-only quantization with INT8 or INT4 using GPTQ and AWQ
Use activation quantization, including SmoothQuant and FP8
Reduce long-context costs with KV cache quantization and eviction
Serve at scale with vLLM, PagedAttention, and continuous batching
Tune TensorRT-LLM schedulers for throughput and tail latency
Deploy Hugging Face TGI on Gaudi and Inferentia2
Use speculative decoding and in-flight batching in production
Plan hardware across H100, H200, and B200, and evaluate Gaudi 3
Model tokens per second, TTFT, and end-to-end throughput
Run on the edge and on device with llama.cpp, GGUF, MLC, WebGPU, and Apple MLX
Convert pipelines to GGUF, ONNX, DirectML, OpenVINO IR, and NNCF
Evaluate with MT-Bench and IFEval, plus safety, multilingual, math, and code evaluations
Map risks with the OWASP LLM Top 10 and set enterprise controls
Operate under EU AI Act timelines and the NIST AI RMF profile
Build logging, monitoring, canaries, autoscaling, and rollback plans

Code-heavy guide: includes working examples, configs, and commands that you can adapt to real services, from serving stacks to evaluation pipelines.

Get the playbook for small language models in production, and start building systems that are fast, cost-aware, and ready for enterprise use. Grab your copy today.

Author: Talia Graham
ISBN-13: 9798268181524
Publisher: Independently Published
Language: English
Published: 10/02/2025
Pages: 278
Format: Paperback
Weight: 1.07 lbs
Size: 10.00h x 7.00w x 0.58d
