{"product_id":"speech-ai-and-multimodal-models-ansel-corbyn-9798273025103","title":"Speech AI and Multimodal Models with Nvidia Nemo: Build automatic speech recognition, text-to-speech, and vision-language systems with production-grad","description":"\u003cp\u003e\u003cb\u003eBuild dependable speech and multimodal systems from data to deployment with NeMo, Riva, Triton, and NIM.\u003c\/b\u003e\u003c\/p\u003e\u003cp\u003eShipping ASR, TTS, and vision-language features is hard because real traffic, latency budgets, and safety rules punish vague guidance. Teams need a concrete stack, tested workflows, and playbooks that hold up under load.\u003c\/p\u003e\u003cp\u003eThis book gives practitioners a practical path. Train with NeMo, serve with Triton and Riva, package stable APIs with NIM, and wire observability, safety, and rollout controls so your services stay reliable after launch.\u003c\/p\u003e\u003cul\u003e\n\u003cli\u003eMap the NVIDIA stack in production: NeMo for training, Riva for runtime, NIM for standard APIs, Triton for serving and metrics\u003c\/li\u003e\n\u003cli\u003eSet up containers, GPU drivers, CUDA, and validation checks for a clean starting environment\u003c\/li\u003e\n\u003cli\u003eBuild NeMo manifests, create tarred WebDataset shards, and manage data versions for repeatable training\u003c\/li\u003e\n\u003cli\u003eApply text processing that works in products: PnC models for punctuation and case, grammar-based ITN with Sparrowhawk\u003c\/li\u003e\n\u003cli\u003eChoose and justify architectures: CTC and RNNT tradeoffs, FastConformer for short and long speech, Parakeet for multilingual, Canary for translation and timestamps\u003c\/li\u003e\n\u003cli\u003eDesign streaming with intent: lookahead, chunk size, and padding choices that balance latency and accuracy\u003c\/li\u003e\n\u003cli\u003eRun NeMo 2 configs and NeMo Run cleanly, migrate experiments, track ablations, and keep results comparable\u003c\/li\u003e\n\u003cli\u003eEvaluate with WER, CER, 
MER, and slice by accent, SNR, and channel so quality numbers reflect reality\u003c\/li\u003e\n\u003cli\u003eAdd diarization that operators can trust: VAD with MarbleNet, embeddings with TitaNet, and MSDD integration\u003c\/li\u003e\n\u003cli\u003eExport for serving the right way: ONNX or TorchScript paths, TensorRT where appropriate, and Triton model repos that scale\u003c\/li\u003e\n\u003cli\u003eTune Riva streaming ASR: chunk and padding settings, punctuation and ITN options, diarization flags and limits\u003c\/li\u003e\n\u003cli\u003eStand up NIM ASR endpoints with an OpenAI-compatible surface and autoscale them with Helm on Kubernetes\u003c\/li\u003e\n\u003cli\u003eBuild TTS that sounds right and runs fast: FastPitch with HiFi-GAN or BigVGAN, voice cloning data, lexicons, SSML controls\u003c\/li\u003e\n\u003cli\u003eManage prosody and latency for streaming audio, and set clause sizes and playback buffers that feel responsive\u003c\/li\u003e\n\u003cli\u003eProtect your product: content safeguards in TTS, consent gates for data and cloning, redaction and retention policies\u003c\/li\u003e\n\u003cli\u003eMeasure what matters: Triton metrics in Prometheus and Grafana, practical alert rules that catch real issues\u003c\/li\u003e\n\u003cli\u003eLoad test with perf_analyzer sweeps, batch and concurrency tuning, sequence batching for conversational traffic\u003c\/li\u003e\n\u003cli\u003eEngineer reliability: fault injection and backpressure, graceful degradation under spikes and partial failures\u003c\/li\u003e\n\u003cli\u003eWire NeMo Guardrails around ASR, TTS, and VLM flows so outputs stay on policy\u003c\/li\u003e\n\u003cli\u003eWatermark and detect audio with AudioSeal and formalize a detection pipeline\u003c\/li\u003e\n\u003cli\u003eUnderstand licenses and terms: NVIDIA AI Enterprise scope, Riva EULA, and NGC usage expectations\u003c\/li\u003e\n\u003cli\u003eUse production playbooks with SLOs, cost caps, and rollback guards that turn operations into repeatable 
steps\u003c\/li\u003e\n\u003c\/ul\u003e\u003cp\u003eThis is a code-heavy guide with working Python, YAML, JSON, and shell examples that you can adapt directly into real services.\u003c\/p\u003e\u003cp\u003e\u003cb\u003eGet the guide and build systems your users can rely on.\u003c\/b\u003e\u003c\/p\u003e\u003cbr\u003e\u003cbr\u003e\u003cb\u003eAuthor:\u003c\/b\u003e Ansel Corbyn\u003cbr\u003e\u003cb\u003eISBN-13:\u003c\/b\u003e 9798273025103\u003cbr\u003e\u003cb\u003ePublisher:\u003c\/b\u003e Independently Published\u003cbr\u003e\u003cb\u003eLanguage:\u003c\/b\u003e English\u003cbr\u003e\u003cb\u003ePublished:\u003c\/b\u003e 11\/04\/2025\u003cbr\u003e\u003cb\u003ePages:\u003c\/b\u003e 308\u003cbr\u003e\u003cb\u003eFormat:\u003c\/b\u003e Paperback\u003cbr\u003e\u003cb\u003eWeight:\u003c\/b\u003e 1.18lbs\u003cbr\u003e\u003cb\u003eSize:\u003c\/b\u003e 10.00h x 7.00w x 0.65d","brand":"Ansel Corbyn","offers":[{"title":"Paperback","offer_id":48014327808255,"sku":"9798273025103","price":39.99,"currency_code":"USD","in_stock":true}],"thumbnail_url":"\/\/cdn.shopify.com\/s\/files\/1\/0662\/2982\/9887\/files\/img_d567915c-7b9c-465b-819c-2dd204208168.jpg?v=1767747920","url":"https:\/\/www.whiterainbookhouse.com\/products\/speech-ai-and-multimodal-models-ansel-corbyn-9798273025103","provider":"WR Book House","version":"1.0","type":"link"}