A focused 6-month plan for transitioning from Cloud Architecture into AI Infrastructure and LLM Serving — leveraging existing strengths rather than starting from scratch.
Study nanoGPT's model.py generate loop and KV cache specifically; skip full training. github.com/karpathy/nanoGPT

| # | Project | Goal | Tech | Target Metric |
|---|---|---|---|---|
| 1 | Production Chat API: FastAPI + vLLM with streaming, auth, rate limiting, Prometheus + Grafana monitoring | Reliable, observable serving layer | FastAPI, vLLM, Prometheus, Docker | <500ms TTFT @ 50 concurrent users |
| 2 | Quantization Benchmark Suite: compare AWQ / GPTQ / FP8 on the same model; measure quality vs speed vs VRAM tradeoffs | Demonstrate optimisation depth | vLLM, AutoAWQ, lm-eval-harness, Streamlit | 2–3× throughput gain with <2% quality drop |
| 3 | Load Test + Autoscaling PoC: deploy vLLM, run bursty load tests, implement autoscaling via KEDA + custom metrics | Scaling under real-world traffic | Kubernetes, KEDA, Locust, Prometheus | 99% of requests <1.5s TTFT across a 5× traffic spike |
| 4 | Multi-Model Cost Router: route requests between an 8B fast path and a 70B quality fallback based on complexity, user tier, or cost budget | Distributed systems + economics thinking | FastAPI, vLLM multi-instance, Redis, Ray | 40% avg cost reduction while maintaining the P95 latency SLO |
| 5 | Offline Batch Pipeline: distributed batch processor for 10k+ documents using Ray + vLLM, with fault tolerance and checkpointing | Throughput at scale beyond online serving | Ray, vLLM offline mode, pyarrow | 300+ tok/s effective throughput across 8 GPUs |
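Project 4's dispatch rule is the conceptual core of the list: send cheap traffic to the small model and reserve the large one for requests that earn it. A minimal sketch of that logic, where the model names, tier rules, budget threshold, and the crude length-based complexity heuristic are all illustrative assumptions; a real router would use a trained classifier or logprob-based difficulty signals:

```python
from dataclasses import dataclass

# Illustrative model identifiers for the fast path and the quality fallback.
FAST_MODEL = "llama-3.1-8b"
QUALITY_MODEL = "llama-3.1-70b"

@dataclass
class Request:
    prompt: str
    user_tier: str        # "free" or "pro" (assumed tiers)
    budget_cents: float   # remaining per-user spend budget

def estimate_complexity(prompt: str) -> float:
    """Crude proxy in [0, 1]: longer, question-dense prompts score higher."""
    length_score = min(len(prompt) / 2000, 1.0)
    question_score = min(prompt.count("?") / 3, 1.0)
    return 0.7 * length_score + 0.3 * question_score

def route(req: Request) -> str:
    """Pick the 8B fast path unless the request justifies the 70B model."""
    if req.budget_cents < 1.0:        # budget exhausted: always fast path
        return FAST_MODEL
    if req.user_tier == "pro" and estimate_complexity(req.prompt) > 0.5:
        return QUALITY_MODEL
    return FAST_MODEL
```

The interesting engineering is not the `if` statement but measuring whether the fallback actually improves answers enough to pay for itself, which is where the eval tooling from Project 2 gets reused.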
| Area | ML Theory Needed | Backend Leverage | Ramp-Up (Your BG) | 2026 Demand | Fit |
|---|---|---|---|---|---|
| 🎯 Inference Engineering | Low–Medium | Extremely High | 2–4 months | ★★★★★ | Primary |
| AI Backend / Platform | Very Low | Extremely High | 1–3 months | ★★★★★ | Strong |
| MLOps / LLMOps | Low | High | 2–4 months | ★★★★★ | Strong |
| Agent Systems Backend | Low–Medium | High | 2–4 months | ★★★★★ | Good |
| RAG / Knowledge Pipelines | Low | High | 1–3 months | ★★★★★ | Good |
| Production Evals | Low | Medium–High | 1–2 months | ★★★★★ | Good |
| AI Guardrails / Security | Low | High | 2–3 months | ★★★★★ | Good |
| ML Research / Training | Very High | Low | 12–24 months | ★★★★★ | Poor Fit |
Your Microsoft-to-cloud-architect story is clean and sellable. Getting one strong AI infra role here is the unlock for everything else — including frontier labs 12–18 months later.
The steepest learning curve, but also the fastest route to the most relevant experience. Avoid companies that are just OpenAI API wrappers — look for ones that build and run their own inference stack.
Highly competitive. These roles typically want people who've already shipped at production AI scale. Your cloud architecture background is relevant, but you'd be competing with 2–3 year inference veterans. Build the track record first, then target these. The path is: one strong AI infra role → frontier lab, not direct entry.
⚡ Month 3 gut-check: Have you had real interviews, not just applications? Getting to technical screens means you're competitive. If not, you still have 3 months to close specific gaps — and you'll know exactly what they are.
Orchestration, multi-model routing, cost-aware dispatching, agent composition infrastructure — all benefit from deep inference understanding
Edge AI, robotics, gaming real-time inference, scientific computing — same skills, different constraints
Bursty, tree-like inference with branch caching, speculative prefetch, multi-step routing — direct extension of serving expertise
Cheaper inference → people use it for harder problems (long-context agents, video generation, embodied AI) → optimisation remains scarce and valuable
| Resource | What You Learn | Format | Link |
|---|---|---|---|
| vLLM Quickstart: official starting point — install, serve a model, hit it with the OpenAI-compatible API | Setup, offline batch, online serving, API keys | Docs | docs.vllm.ai/quickstart |
| vLLM GitHub + Examples: walk the examples/ directory — offline inference, benchmarking, quantization, multi-GPU | Hands-on with real engine internals, sampling params, engine args | Code | github.com/vllm-project/vllm |
| HuggingFace LLM Course — Optimized Inference Deployment: chapter covering the TGI vs vLLM vs llama.cpp comparison, a PagedAttention deep-dive, production deployment patterns | Framework comparison, KV cache mechanics, when to use which engine | Course | huggingface.co/learn/llm-course |
| mlabonne LLM Course (41k ⭐): end-to-end LLM engineer roadmap with a strong inference section covering Flash Attention, KV cache, speculative decoding, quantization, serving at scale | Broadest coverage of inference topics, with curated links to the best resources | Course | github.com/mlabonne/llm-course |
| HuggingFace Inference Optimization Docs: KV cache, static KV cache + torch.compile, speculative decoding, FlashAttention-2 — all with code | Production-grade optimisation techniques directly usable in HF Transformers | Docs | huggingface.co/docs/transformers/llm_optims |
| NVIDIA — Mastering LLM Techniques: Inference Optimization: deep technical blog covering memory distribution, quantization types, model parallelism, batching — written by NVIDIA engineers | GPU hardware perspective on every key optimization; explains why things work | Blog | developer.nvidia.com/blog |
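The KV cache mechanics these resources keep returning to reduce to arithmetic worth internalizing: two tensors (K and V) per layer, per token. A sketch using a Llama-3-8B-style shape; the 32 layers, 8 KV heads (GQA), head dim of 128, and fp16 assumption are taken as illustrative config values here, so check a model's actual config.json before relying on the numbers:

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int = 1, dtype_bytes: int = 2) -> int:
    """KV cache footprint: K and V tensors for every layer, for every token."""
    return (2 * num_layers * num_kv_heads * head_dim
            * dtype_bytes * seq_len * batch_size)

# Llama-3-8B-style shape: 32 layers, 8 KV heads (GQA), head_dim 128, fp16.
per_token = kv_cache_bytes(32, 8, 128, seq_len=1)      # 131072 B = 128 KiB/token
per_8k_seq = kv_cache_bytes(32, 8, 128, seq_len=8192)  # exactly 1 GiB
print(per_token // 1024, "KiB/token;", per_8k_seq / 2**30, "GiB per 8k sequence")
```

This is also why PagedAttention matters: at roughly 1 GiB per 8k-token sequence, naive per-request pre-allocation exhausts even an 80 GB GPU after a few dozen concurrent requests.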
| Resource | What You Learn | Format | Link |
|---|---|---|---|
| Stanford CS229S — Systems for ML: graduate-level course on efficient training and inference — covers tensor/pipeline/expert parallelism, LLM serving systems, hardware-software co-design | Academic foundation for understanding distributed inference in depth; the closest thing to frontier-lab thinking | Course | stanford.edu/courses/cs229s |
| vLLM Distributed Inference Docs: tensor parallel, pipeline parallel, and data parallel in vLLM — configuration, when to use each, multi-node setup | Hands-on multi-GPU serving; understanding --tensor-parallel-size and its tradeoffs | Docs | docs.vllm.ai/distributed_serving |
| llm-d — Kubernetes-Native Distributed Inference: open-source framework (Red Hat + Google) for disaggregated serving on Kubernetes — prefix-aware routing, KV cache across nodes, disaggregated prefill/decode | Production-grade architecture for serving at cluster scale; 40% latency reduction on DeepSeek V3.1 | Project | github.com/llm-d/llm-d |
| Ray Serve — LLM Serving Docs: Ray's native LLM serving layer — autoscaling, multi-model routing, vLLM integration, OpenAI-compatible API, prefix-aware routing | Production multi-model orchestration with Ray; widely used at OpenAI, Uber, Instacart | Docs | docs.ray.io/serve/llm |
| KEDA + vLLM Autoscaling Guide (Red Hat): step-by-step walkthrough of KEDA-based autoscaling for vLLM using Prometheus custom metrics — queue depth, TTFT, GPU utilization as scaling signals | Practical autoscaling that goes beyond CPU/memory to inference-specific signals | Blog | developers.redhat.com/kserve-keda |
| Kubernetes LLM Autoscaling Complete Guide: HPA + KEDA + VPA patterns for LLM workloads, with ScaledObject YAML examples for multi-model routing | Kubernetes-native scaling patterns, directly applicable to your infra background | Guide | collabnix.com/k8s-llm-autoscaling |
| Meta Engineering — Scaling LLM Inference: deep technical post from Meta on tensor, context, and expert parallelism in production — how they achieved <1 min for a 1M-token prefill on 32 H100 hosts | Real-world production architecture from a frontier lab; benchmark targets and approach | Blog | engineering.fb.com/scaling-llm |
| NVIDIA TensorRT-LLM: NVIDIA's production-grade inference library — multi-GPU, FP8 quantization, speculative decoding, disaggregated serving; a complement to vLLM for NVIDIA-heavy deployments | Maximum perf on NVIDIA hardware; world-record numbers on DeepSeek R1 on Blackwell | Lib | github.com/NVIDIA/TensorRT-LLM · docs |
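The KEDA guides above scale on inference signals like queue depth rather than CPU. Under the hood, the replica math KEDA feeds into the HPA is simple: desired replicas is the metric total divided by the per-pod target, rounded up and clamped to configured bounds. A stdlib sketch of that rule, where the 25-requests-per-pod target and the replica bounds are illustrative numbers, not recommendations:

```python
import math

def desired_replicas(metric_total: float, per_pod_target: float,
                     min_replicas: int = 1, max_replicas: int = 8) -> int:
    """HPA/KEDA-style scaling rule: ceil(metric / per-pod target), clamped."""
    raw = math.ceil(metric_total / per_pod_target)
    return max(min_replicas, min(max_replicas, raw))

# 130 requests queued across the deployment, target of 25 queued per vLLM pod:
print(desired_replicas(130, 25))  # scales out to 6 replicas
print(desired_replicas(0, 25))    # idle: the floor of 1 keeps a warm pod
```

For GPU pods the clamp and the idle floor matter more than usual: scale-up is slow because each new replica must pull tens of GB of weights before serving, so stabilization windows and a warm minimum are doing real work here.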
| Resource | What You Learn | Format | Link |
|---|---|---|---|
| AutoAWQ — 4-bit Quantization: production-ready AWQ quantization — 3× speedup and 3× memory reduction vs FP16; start here for hands-on quantization experiments | GEMM vs GEMV tradeoffs, fused modules, batch-size implications — the practical side of quant | Lib | github.com/casper-hansen/AutoAWQ |
| AutoGPTQ: GPTQ-based quantization — a layer-wise approach, complementary to AWQ; use both to understand the tradeoffs in your benchmark suite | GPTQ vs AWQ differences; when each wins; Marlin kernel integration | Lib | github.com/AutoGPTQ/AutoGPTQ |
| vLLM FP8 Quantization Docs: native FP8 support in vLLM — how to run FP8 models, tradeoffs vs INT4, hardware requirements (H100 / A100) | The third quantization format to benchmark; hardware-aware precision selection | Docs | docs.vllm.ai/quantization |
| lm-evaluation-harness (EleutherAI): the standard framework for measuring model quality after quantization — perplexity, benchmarks (ARC, HellaSwag, MMLU); essential for your benchmark dashboard | Rigorous quality measurement across quantization methods — what interviewers want to see | Lib | github.com/EleutherAI/lm-evaluation-harness |
| Speculative Decoding Tutorial (arXiv 2503.00491): comprehensive academic tutorial covering draft model architectures, verification strategies, EAGLE-3 — 2–4× speedup while maintaining the original output distribution | Deep understanding of how speculative decoding works and why it's powerful | Paper | arxiv.org/abs/2503.00491 |
| SpeculativeDecodingPapers: curated reading list of all the major speculative decoding papers — EAGLE, EAGLE-2/3, self-speculative, quantized draft models, LongSpec | Stay current on the most active research area in inference efficiency | Reading List | github.com/hemingkx/SpeculativeDecodingPapers |
| vLLM Speculators Library: official vLLM library for building, evaluating, and storing speculative decoding algorithms — EAGLE-3 support built in | Hands-on implementation of speculative decoding inside the vLLM serving stack | Lib | vllm speculators |
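The accept/reject rule at the heart of the speculative decoding resources above fits in a few lines for the greedy case: the draft proposes k tokens, the target accepts the longest prefix matching its own greedy choices, then commits one token of its own. The toy sketch below is mine, not any library's API; in particular, a real implementation verifies all k positions in a single batched target forward pass (which is where the speedup comes from), while this version calls the target per token only to make the rule visible:

```python
def speculative_step(prefix, draft_propose, target_next, k=4):
    """One greedy speculative-decoding step.

    draft_propose(prefix, k) -> k tokens guessed by the cheap draft model
    target_next(prefix)      -> the large model's greedy next token
    Accepted draft tokens are 'free'; the first mismatch is replaced by the
    target's choice, so output is identical to running the target alone.
    """
    committed = list(prefix)
    for tok in draft_propose(prefix, k):
        expected = target_next(committed)
        if tok == expected:
            committed.append(tok)        # draft agreed: accepted for free
        else:
            committed.append(expected)   # mismatch: take the target's token
            break
    else:
        committed.append(target_next(committed))  # bonus token, all k accepted
    return committed

# Toy character-level models: the target greedily spells "abcdef", the draft
# guesses "abcx" -- three free accepts plus one correction in a single step.
target = lambda p: "abcdef"[len(p)]
draft = lambda p, k: list("abcx")[:k]
print("".join(speculative_step([], draft, target)))
```

Four tokens committed per step instead of one is exactly the 2–4× speedup the tutorial quotes, and since every committed token is the target's own greedy choice, quality is unchanged.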
| Resource | What You Learn | Format | Link |
|---|---|---|---|
| vLLM Metrics & Monitoring Docs: built-in Prometheus metrics — TTFT, TPOT, queue depth, GPU utilisation, cache hit rates; start here for an observability baseline | What inference-specific metrics look like vs generic infra metrics; what to alert on | Docs | docs.vllm.ai/observability |
| Locust Load Testing: Python-based load testing — simulate bursty LLM traffic, measure P50/P95/P99 TTFT under concurrency, find breaking points | Realistic load simulation; understand queueing behaviour under different traffic patterns | Lib | locust.io |
| Prometheus + Grafana for LLM Infra: the standard observability stack — prometheus-fastapi-instrumentator for FastAPI, the NVIDIA DCGM exporter for GPU metrics, pre-built Grafana dashboards | Production-grade dashboards for GPU util, TTFT histograms, request rates, OOM events | Guide | prometheus.io/docs |
| Databricks — LLM Inference Best Practices: production guide covering SLO design, graceful degradation, OOM handling, canary deployments, model versioning in serving | Practical reliability thinking from a company running LLMs at scale in production | Guide | databricks.com/llm-inference-best-practices |
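Locust and the Grafana dashboards above all report latency percentiles, and it pays to know exactly what those numbers mean before quoting a P95 TTFT SLO in an interview. A nearest-rank sketch over hypothetical TTFT samples (the millisecond values are made up) which also shows why P99 on a small sample is simply the worst observation:

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest value such that at least
    p% of the samples are less than or equal to it."""
    xs = sorted(samples)
    rank = math.ceil(p / 100 * len(xs))
    return xs[max(rank - 1, 0)]

# Hypothetical TTFT measurements (ms) from a short load-test run:
ttft_ms = [120, 95, 400, 130, 110, 105, 1500, 125, 98, 115]
for p in (50, 95, 99):
    print(f"P{p} TTFT: {percentile(ttft_ms, p)} ms")
```

With only 10 samples, P95 and P99 both land on the 1500 ms outlier: a reminder that tail-latency SLOs need thousands of requests per window before the numbers mean anything.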