{"product_id":"small-language-models-in-production-talia-graham-9798268181524","title":"Small Language Models in Production: Optimizing inference, reducing costs, and delivering enterprise-ready AI with quantization and distillation metho","description":"\u003cp\u003e\u003cb\u003eShip enterprise-ready AI that is fast, affordable, and controllable with small language models engineered through quantization and distillation.\u003c\/b\u003e\u003c\/p\u003e\u003cp\u003eMany teams want the benefits of language models, but costs, latency, and compliance block real progress. This book focuses on making production systems work on real infrastructure, with methods that lower memory use, improve tokens per second, and keep behavior auditable. You will see where small models beat larger ones, how to size fleets for peak demand, and how to align performance targets with budgets. The material is grounded in healthcare, finance, retail, and manufacturing examples, so the guidance maps cleanly to day-to-day decisions.\u003c\/p\u003e\u003cp\u003eYou will learn practical approaches that move beyond proofs of concept. The book explains how to compress and serve models without losing essential quality, how to benchmark instruction following and safety, and how to meet obligations under current governance standards. Each topic connects to production tasks, such as rollout planning, model monitoring, and incident response. 
The goal is clear: help you deploy reliable systems that meet service levels and cost controls.\u003c\/p\u003e\u003cul\u003e\n\u003cli\u003eapply weight only quantization with int8 or int4 using gptq and awq\u003c\/li\u003e\n\u003cli\u003euse activation quantization including smoothquant and fp8\u003c\/li\u003e\n\u003cli\u003ereduce long context costs with kv cache quantization and eviction\u003c\/li\u003e\n\u003cli\u003eserve at scale with vllm paged attention and continuous batching\u003c\/li\u003e\n\u003cli\u003etune tensorrt llm schedulers for throughput and tail latency\u003c\/li\u003e\n\u003cli\u003edeploy hugging face tgi on gaudi and inferentia2\u003c\/li\u003e\n\u003cli\u003euse speculative decoding and inflight batching in production\u003c\/li\u003e\n\u003cli\u003eplan hardware across h100 h200 b200 and evaluate gaudi 3\u003c\/li\u003e\n\u003cli\u003emodel tokens per second ttft and end to end throughput\u003c\/li\u003e\n\u003cli\u003erun edge and on device with llamacpp gguf mlc webgpu and apple mlx\u003c\/li\u003e\n\u003cli\u003econvert pipelines to gguf onnx directml openvino ir and nncf\u003c\/li\u003e\n\u003cli\u003eevaluate with mt bench and ifeval plus safety multilingual math and code\u003c\/li\u003e\n\u003cli\u003emap risks with owasp llm top 10 and set enterprise controls\u003c\/li\u003e\n\u003cli\u003eoperate under eu ai act timelines and the nist ai rmf profile\u003c\/li\u003e\n\u003cli\u003ebuild logging monitoring canaries autoscaling and rollback plans\u003c\/li\u003e\n\u003c\/ul\u003e\u003cp\u003e\u003ci\u003eCode-heavy guide\u003c\/i\u003e: includes working examples, configs, and commands that you can adapt to real services, from serving stacks to evaluation pipelines.\u003c\/p\u003e\u003cp\u003e\u003cb\u003eGet the playbook for small language models in production, and start building systems that are fast, cost-aware, and ready for enterprise use. Grab your copy 
today.\u003c\/b\u003e\u003c\/p\u003e\u003cbr\u003e\u003cbr\u003e\u003cb\u003eAuthor:\u003c\/b\u003e Talia Graham\u003cbr\u003e\u003cb\u003eISBN-13:\u003c\/b\u003e 9798268181524\u003cbr\u003e\u003cb\u003ePublisher:\u003c\/b\u003e Independently Published\u003cbr\u003e\u003cb\u003eLanguage:\u003c\/b\u003e English\u003cbr\u003e\u003cb\u003ePublished:\u003c\/b\u003e 10\/02\/2025\u003cbr\u003e\u003cb\u003ePages:\u003c\/b\u003e 278\u003cbr\u003e\u003cb\u003eFormat:\u003c\/b\u003e Paperback\u003cbr\u003e\u003cb\u003eWeight:\u003c\/b\u003e 1.07lbs\u003cbr\u003e\u003cb\u003eSize:\u003c\/b\u003e 10.00h x 7.00w x 0.58d","brand":"Talia Graham","offers":[{"title":"Paperback","offer_id":47637493416191,"sku":"9798268181524","price":29.99,"currency_code":"USD","in_stock":true}],"thumbnail_url":"\/\/cdn.shopify.com\/s\/files\/1\/0662\/2982\/9887\/files\/img_28d47c55-2eb4-4dab-aa72-32087ee5f575.jpg?v=1765116548","url":"https:\/\/www.whiterainbookhouse.com\/products\/small-language-models-in-production-talia-graham-9798268181524","provider":"WR Book House","version":"1.0","type":"link"}