Adolf Hohl
Efficient deployment and inference of GPU-accelerated LLMs
#1 · about 2 minutes
The evolution of generative AI from experimentation to production
Generative AI has rapidly moved from experimentation with models like Llama and Mistral to production-ready applications in 2024.
#2 · about 3 minutes
Comparing managed AI services with the DIY approach
Managed services offer ease of use but limited control, while a do-it-yourself approach provides full control but introduces significant complexity.
#3 · about 4 minutes
Introducing NVIDIA NIM for simplified LLM deployment
NVIDIA Inference Microservices (NIM) provide a containerized, OpenAI-compatible solution for deploying models anywhere with enterprise support.
#4 · about 2 minutes
Boosting inference throughput with lower precision quantization
Using lower precision formats like FP8 dramatically increases model inference throughput, providing more performance for the same hardware investment.
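As a rough back-of-the-envelope illustration of why lower precision pays off (the numbers below are assumptions, not figures from the talk): autoregressive decoding is typically memory-bound, since every generated token streams all model weights through the GPU's memory system. Halving bytes per weight therefore roughly halves memory traffic per token. A minimal sketch, assuming an 8B-parameter model and H100-class memory bandwidth:

```python
# Back-of-the-envelope estimate of FP8's effect on memory-bound decoding.
# All numbers are illustrative assumptions, not measurements from the talk.

def weight_bytes(num_params: float, bytes_per_param: float) -> float:
    """Total bytes of model weights at a given precision."""
    return num_params * bytes_per_param

params = 8e9                      # assumed 8B-parameter model
fp16 = weight_bytes(params, 2)    # FP16/BF16: 2 bytes per weight
fp8 = weight_bytes(params, 1)     # FP8: 1 byte per weight

# Tokens/sec for memory-bound decoding scales roughly with
# memory bandwidth / weight bytes streamed per token.
bandwidth = 3.35e12               # assumed ~3.35 TB/s (H100 SXM HBM3)

print(f"FP16 weights: {fp16 / 1e9:.0f} GB -> ~{bandwidth / fp16:.0f} tok/s ceiling")
print(f"FP8  weights: {fp8 / 1e9:.0f} GB -> ~{bandwidth / fp8:.0f} tok/s ceiling")
```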
#5 · about 2 minutes
Overview of the NVIDIA AI Enterprise software platform
The NVIDIA AI Enterprise platform is a cloud-native software stack that abstracts away low-level details such as CUDA dependency management to streamline AI pipeline development.
#6 · about 2 minutes
A look inside the NIM container architecture
NIM containers bundle optimized inference tools like TensorRT-LLM and Triton Inference Server to accelerate models on specific GPU hardware.
#7 · about 3 minutes
How to run and interact with a NIM container
A NIM container can be launched with a simple Docker command, automatically discovering hardware and exposing OpenAI-compatible API endpoints for interaction.
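As a concrete sketch of the interaction side: once a NIM container has been started (typically with a single docker run against an image from nvcr.io, as the talk demonstrates) and is serving on localhost:8000, any OpenAI-compatible client can talk to it. The port, base URL, and model name below are assumptions based on NVIDIA's published examples; adjust them to match your deployment:

```python
# Minimal sketch: querying a running NIM container through its
# OpenAI-compatible API. Assumes the container is already serving on
# localhost:8000 and exposing a Llama 3 8B Instruct model.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # NIM's OpenAI-compatible endpoint
    api_key="not-used",                   # local deployments ignore the key
)

response = client.chat.completions.create(
    model="meta/llama3-8b-instruct",      # name reported by GET /v1/models
    messages=[{"role": "user", "content": "Summarize what NVIDIA NIM does."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```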
#8 · about 2 minutes
Efficiently serving custom models with LoRA adapters
NIM enables serving multiple customized LoRA adapters on a single base model simultaneously, saving memory while providing distinct model endpoints.
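To make the multi-LoRA setup concrete: each deployed adapter appears as its own entry in the model list, and a client selects one simply by passing its name in the `model` field, while the shared base weights stay resident in GPU memory once. A hedged sketch; the adapter names below are hypothetical placeholders, not names from the talk:

```python
# Sketch: one NIM container, one base model in GPU memory, multiple LoRA
# adapters selected per request via the `model` field. The adapter names
# ("llama3-8b-sql-lora", "llama3-8b-support-lora") are hypothetical.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")

# Deployed LoRA adapters are listed alongside the base model.
for m in client.models.list().data:
    print("available:", m.id)

for adapter in ["llama3-8b-sql-lora", "llama3-8b-support-lora"]:
    r = client.chat.completions.create(
        model=adapter,  # routes the request through that adapter's weights
        messages=[{"role": "user", "content": "What task are you tuned for?"}],
        max_tokens=64,
    )
    print(adapter, "->", r.choices[0].message.content)
```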
#9 · about 3 minutes
How NIM automatically handles hardware and model optimization
NIM simplifies deployment by automatically selecting the best pre-compiled model based on the detected GPU architecture and user preference for latency or throughput.
Matching moments
23:24 · Deploying models with TensorRT and Triton Inference Server (Trends, Challenges and Best Practices for AI at the Edge)
15:54 · Deploying enterprise AI applications with NVIDIA NIM (WWC24 - Ankit Patel - Unlocking the Future Breakthrough Application Performance and Capabilities with NVIDIA)
16:17 · Deploying and scaling models with NVIDIA NIM on Kubernetes (LLMOps-driven fine-tuning, evaluation, and inference with NVIDIA NIM & NeMo Microservices)
00:05 · Introduction to large-scale AI infrastructure challenges (Your Next AI Needs 10,000 GPUs. Now What?)
09:43 · The technical challenges of running LLMs in browsers (From ML to LLM: On-device AI in the Browser)
05:47 · Using NVIDIA NIMs and blueprints to deploy models (Your Next AI Needs 10,000 GPUs. Now What?)
15:25 · Deploying custom inference workloads with NVIDIA NIMs (From foundation model to hosted AI solution in minutes)
01:09 · Running large language models locally with Web LLM (Generative AI power on the web: making web apps smarter with WebGPU and WebNN)
Related Videos
Your Next AI Needs 10,000 GPUs. Now What?
Anshul Jindal & Martin Piercy
Self-Hosted LLMs: From Zero to Inference
Roberto Carratalá & Cedric Clyburn
LLMOps-driven fine-tuning, evaluation, and inference with NVIDIA NIM & NeMo Microservices
Anshul Jindal
WWC24 - Ankit Patel - Unlocking the Future Breakthrough Application Performance and Capabilities with NVIDIA
Ankit Patel
Unveiling the Magic: Scaling Large Language Models to Serve Millions
Patrick Koss
Exploring LLMs across clouds
Tomislav Tipurić
Unlocking the Power of AI: Accessible Language Model Tuning for All
Cedric Clyburn & Legare Kerrison
Adding knowledge to open-source LLMs
Sergio Perez & Harshita Seth
From learning to earning
Jobs that call for the skills explored in this talk.
Senior Machine Learning Engineer (LLM & GPU Architecture) · European Tech Recruit · Intermediate
Skills: Docker, PyTorch, Kubernetes, Computer Vision, Machine Learning

AI/ML Team Lead - Generative AI (LLMs, AWS) · Provectus · Remote · €96K · Senior
Skills: PyTorch, TensorFlow, Computer Vision, +2