A focused 6-month plan for transitioning from Cloud Architecture into AI Infrastructure and LLM Serving — leveraging existing strengths rather than starting from scratch.
Study nanoGPT's model.py generate loop and KV cache specifically; skip full training. github.com/karpathy/nanoGPT

| # | Project | Goal | Tech | Target Metric |
|---|---|---|---|---|
| 1 | Production Chat API: FastAPI + vLLM with streaming, auth, rate limiting, Prometheus + Grafana monitoring | Reliable, observable serving layer | FastAPI, vLLM, Prometheus, Docker | <500ms TTFT @ 50 concurrent users |
| 2 | Quantization Benchmark Suite: compare AWQ / GPTQ / FP8 on the same model; measure quality vs speed vs VRAM tradeoffs | Demonstrate optimisation depth | vLLM, AutoAWQ, lm-eval-harness, Streamlit | 2–3× throughput gain with <2% quality drop |
| 3 | Load Test + Autoscaling PoC: deploy vLLM, run bursty load tests, implement autoscaling via KEDA + custom metrics | Scaling under real-world traffic | Kubernetes, KEDA, Locust, Prometheus | 99% of requests <1.5s TTFT across a 5× traffic spike |
| 4 | Multi-Model Cost Router: route requests between an 8B fast path and a 70B quality fallback based on complexity, user tier, or cost budget | Distributed systems + economics thinking | FastAPI, vLLM multi-instance, Redis, Ray | 40% avg cost reduction while maintaining the P95 latency SLO |
| 5 | Offline Batch Pipeline: distributed batch processor for 10k+ documents using Ray + vLLM, with fault tolerance and checkpointing | Throughput at scale beyond online serving | Ray, vLLM offline mode, pyarrow | 300+ tok/s effective throughput across 8 GPUs |
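Project 4's dispatch rule is the conceptual core of the list: send cheap traffic to the small model and reserve the large one for requests that earn it. A minimal sketch of that logic, where the model names, tier rules, budget threshold, and the crude length-based complexity heuristic are all illustrative assumptions; a real router would use a trained classifier or logprob-based difficulty signals:

```python
from dataclasses import dataclass

# Illustrative model identifiers for the fast path and the quality fallback.
FAST_MODEL = "llama-3.1-8b"
QUALITY_MODEL = "llama-3.1-70b"

@dataclass
class Request:
    prompt: str
    user_tier: str        # "free" or "pro" (assumed tiers)
    budget_cents: float   # remaining per-user spend budget

def estimate_complexity(prompt: str) -> float:
    """Crude proxy in [0, 1]: longer, question-dense prompts score higher."""
    length_score = min(len(prompt) / 2000, 1.0)
    question_score = min(prompt.count("?") / 3, 1.0)
    return 0.7 * length_score + 0.3 * question_score

def route(req: Request) -> str:
    """Pick the 8B fast path unless the request justifies the 70B model."""
    if req.budget_cents < 1.0:        # budget exhausted: always fast path
        return FAST_MODEL
    if req.user_tier == "pro" and estimate_complexity(req.prompt) > 0.5:
        return QUALITY_MODEL
    return FAST_MODEL
```

The interesting engineering is not the `if` statement but measuring whether the fallback actually improves answers enough to pay for itself, which is where the eval tooling from Project 2 gets reused.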
| Area | ML Theory Needed | Backend Leverage | Ramp-Up (Your BG) | 2026 Demand | Fit |
|---|---|---|---|---|---|
| 🎯 Inference Engineering | Low–Medium | Extremely High | 2–4 months | ★★★★★ | Primary |
| AI Backend / Platform | Very Low | Extremely High | 1–3 months | ★★★★★ | Strong |
| MLOps / LLMOps | Low | High | 2–4 months | ★★★★★ | Strong |
| Agent Systems Backend | Low–Medium | High | 2–4 months | ★★★★★ | Good |
| RAG / Knowledge Pipelines | Low | High | 1–3 months | ★★★★★ | Good |
| Production Evals | Low | Medium–High | 1–2 months | ★★★★★ | Good |
| AI Guardrails / Security | Low | High | 2–3 months | ★★★★★ | Good |
| ML Research / Training | Very High | Low | 12–24 months | ★★★★★ | Poor Fit |
Your Microsoft-to-cloud-architect story is clean and sellable. Getting one strong AI infra role here is the unlock for everything else — including frontier labs 12–18 months later.
The steepest learning curve, but also the fastest route to the most relevant experience. Avoid companies that are just OpenAI API wrappers — look for ones that build and run their own inference stack.
Highly competitive. These roles typically want people who've already shipped at production AI scale. Your cloud architecture background is relevant, but you'd be competing with 2–3 year inference veterans. Build the track record first, then target these. The path is: one strong AI infra role → frontier lab, not direct entry.
⚡ Month 3 gut-check: Have you had real interviews, not just applications? Getting to technical screens means you're competitive. If not, you still have 3 months to close specific gaps — and you'll know exactly what they are.
Orchestration, multi-model routing, cost-aware dispatching, agent composition infrastructure — all benefit from deep inference understanding
Edge AI, robotics, gaming real-time inference, scientific computing — same skills, different constraints
Bursty, tree-like inference with branch caching, speculative prefetch, multi-step routing — direct extension of serving expertise
Cheaper inference → people use it for harder problems (long-context agents, video generation, embodied AI) → optimisation remains scarce and valuable
| Resource | What You Learn | Format | Link |
|---|---|---|---|
| vLLM Quickstart: official starting point — install, serve a model, hit it with the OpenAI-compatible API | Setup, offline batch, online serving, API keys | Docs | docs.vllm.ai/quickstart |
| vLLM GitHub + Examples: walk the examples/ directory — offline inference, benchmarking, quantization, multi-GPU | Hands-on with real engine internals, sampling params, engine args | Code | github.com/vllm-project/vllm |
| HuggingFace LLM Course — Optimized Inference Deployment: chapter covering the TGI vs vLLM vs llama.cpp comparison, a PagedAttention deep-dive, production deployment patterns | Framework comparison, KV cache mechanics, when to use which engine | Course | huggingface.co/learn/llm-course |
| mlabonne LLM Course (41k ⭐): end-to-end LLM engineer roadmap with a strong inference section covering Flash Attention, KV cache, speculative decoding, quantization, serving at scale | Broadest coverage of inference topics, with curated links to the best resources | Course | github.com/mlabonne/llm-course |
| HuggingFace Inference Optimization Docs: KV cache, static KV cache + torch.compile, speculative decoding, FlashAttention-2 — all with code | Production-grade optimisation techniques directly usable in HF Transformers | Docs | huggingface.co/docs/transformers/llm_optims |
| NVIDIA — Mastering LLM Techniques: Inference Optimization: deep technical blog covering memory distribution, quantization types, model parallelism, batching — written by NVIDIA engineers | GPU hardware perspective on every key optimization; explains why things work | Blog | developer.nvidia.com/blog |
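The KV cache mechanics these resources keep returning to reduce to arithmetic worth internalizing: two tensors (K and V) per layer, per token. A sketch using a Llama-3-8B-style shape; the 32 layers, 8 KV heads (GQA), head dim of 128, and fp16 assumption are taken as illustrative config values here, so check a model's actual config.json before relying on the numbers:

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int = 1, dtype_bytes: int = 2) -> int:
    """KV cache footprint: K and V tensors for every layer, for every token."""
    return (2 * num_layers * num_kv_heads * head_dim
            * dtype_bytes * seq_len * batch_size)

# Llama-3-8B-style shape: 32 layers, 8 KV heads (GQA), head_dim 128, fp16.
per_token = kv_cache_bytes(32, 8, 128, seq_len=1)      # 131072 B = 128 KiB/token
per_8k_seq = kv_cache_bytes(32, 8, 128, seq_len=8192)  # exactly 1 GiB
print(per_token // 1024, "KiB/token;", per_8k_seq / 2**30, "GiB per 8k sequence")
```

This is also why PagedAttention matters: at roughly 1 GiB per 8k-token sequence, naive per-request pre-allocation exhausts even an 80 GB GPU after a few dozen concurrent requests.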
| Resource | What You Learn | Format | Link |
|---|---|---|---|
| Stanford CS229S — Systems for ML: graduate-level course on efficient training and inference — covers tensor/pipeline/expert parallelism, LLM serving systems, hardware-software co-design | Academic foundation for understanding distributed inference in depth; the closest thing to frontier-lab thinking | Course | stanford.edu/courses/cs229s |
| vLLM Distributed Inference Docs: tensor parallel, pipeline parallel, and data parallel in vLLM — configuration, when to use each, multi-node setup | Hands-on multi-GPU serving; understanding --tensor-parallel-size and its tradeoffs | Docs | docs.vllm.ai/distributed_serving |
| llm-d — Kubernetes-Native Distributed Inference: open-source framework (Red Hat + Google) for disaggregated serving on Kubernetes — prefix-aware routing, KV cache across nodes, disaggregated prefill/decode | Production-grade architecture for serving at cluster scale; 40% latency reduction on DeepSeek V3.1 | Project | github.com/llm-d/llm-d |
| Ray Serve — LLM Serving Docs: Ray's native LLM serving layer — autoscaling, multi-model routing, vLLM integration, OpenAI-compatible API, prefix-aware routing | Production multi-model orchestration with Ray; widely used at OpenAI, Uber, Instacart | Docs | docs.ray.io/serve/llm |
| KEDA + vLLM Autoscaling Guide (Red Hat): step-by-step walkthrough of KEDA-based autoscaling for vLLM using Prometheus custom metrics — queue depth, TTFT, GPU utilization as scaling signals | Practical autoscaling that goes beyond CPU/memory to inference-specific signals | Blog | developers.redhat.com/kserve-keda |
| Kubernetes LLM Autoscaling Complete Guide: HPA + KEDA + VPA patterns for LLM workloads, with ScaledObject YAML examples for multi-model routing | Kubernetes-native scaling patterns, directly applicable to your infra background | Guide | collabnix.com/k8s-llm-autoscaling |
| Meta Engineering — Scaling LLM Inference: deep technical post from Meta on tensor, context, and expert parallelism in production — how they achieved <1 min for a 1M-token prefill on 32 H100 hosts | Real-world production architecture from a frontier lab; benchmark targets and approach | Blog | engineering.fb.com/scaling-llm |
| NVIDIA TensorRT-LLM: NVIDIA's production-grade inference library — multi-GPU, FP8 quantization, speculative decoding, disaggregated serving; a complement to vLLM for NVIDIA-heavy deployments | Maximum perf on NVIDIA hardware; world-record numbers on DeepSeek R1 on Blackwell | Lib | github.com/NVIDIA/TensorRT-LLM · docs |
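The KEDA guides above scale on inference signals like queue depth rather than CPU. Under the hood, the replica math KEDA feeds into the HPA is simple: desired replicas is the metric total divided by the per-pod target, rounded up and clamped to configured bounds. A stdlib sketch of that rule, where the 25-requests-per-pod target and the replica bounds are illustrative numbers, not recommendations:

```python
import math

def desired_replicas(metric_total: float, per_pod_target: float,
                     min_replicas: int = 1, max_replicas: int = 8) -> int:
    """HPA/KEDA-style scaling rule: ceil(metric / per-pod target), clamped."""
    raw = math.ceil(metric_total / per_pod_target)
    return max(min_replicas, min(max_replicas, raw))

# 130 requests queued across the deployment, target of 25 queued per vLLM pod:
print(desired_replicas(130, 25))  # scales out to 6 replicas
print(desired_replicas(0, 25))    # idle: the floor of 1 keeps a warm pod
```

For GPU pods the clamp and the idle floor matter more than usual: scale-up is slow because each new replica must pull tens of GB of weights before serving, so stabilization windows and a warm minimum are doing real work here.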
| Resource | What You Learn | Format | Link |
|---|---|---|---|
| AutoAWQ — 4-bit Quantization: production-ready AWQ quantization — 3× speedup and 3× memory reduction vs FP16; start here for hands-on quantization experiments | GEMM vs GEMV tradeoffs, fused modules, batch-size implications — the practical side of quant | Lib | github.com/casper-hansen/AutoAWQ |
| AutoGPTQ: GPTQ-based quantization — a layer-wise approach, complementary to AWQ; use both to understand the tradeoffs in your benchmark suite | GPTQ vs AWQ differences; when each wins; Marlin kernel integration | Lib | github.com/AutoGPTQ/AutoGPTQ |
| vLLM FP8 Quantization Docs: native FP8 support in vLLM — how to run FP8 models, tradeoffs vs INT4, hardware requirements (H100 / A100) | The third quantization format to benchmark; hardware-aware precision selection | Docs | docs.vllm.ai/quantization |
| lm-evaluation-harness (EleutherAI): the standard framework for measuring model quality after quantization — perplexity, benchmarks (ARC, HellaSwag, MMLU); essential for your benchmark dashboard | Rigorous quality measurement across quantization methods — what interviewers want to see | Lib | github.com/EleutherAI/lm-evaluation-harness |
| Speculative Decoding Tutorial (arXiv 2503.00491): comprehensive academic tutorial covering draft model architectures, verification strategies, EAGLE-3 — 2–4× speedup while maintaining the original output distribution | Deep understanding of how speculative decoding works and why it's powerful | Paper | arxiv.org/abs/2503.00491 |
| SpeculativeDecodingPapers: curated reading list of all the major speculative decoding papers — EAGLE, EAGLE-2/3, self-speculative, quantized draft models, LongSpec | Stay current on the most active research area in inference efficiency | Reading List | github.com/hemingkx/SpeculativeDecodingPapers |
| vLLM Speculators Library: official vLLM library for building, evaluating, and storing speculative decoding algorithms — EAGLE-3 support built in | Hands-on implementation of speculative decoding inside the vLLM serving stack | Lib | vllm speculators |
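The accept/reject rule at the heart of the speculative decoding resources above fits in a few lines for the greedy case: the draft proposes k tokens, the target accepts the longest prefix matching its own greedy choices, then commits one token of its own. The toy sketch below is mine, not any library's API; in particular, a real implementation verifies all k positions in a single batched target forward pass (which is where the speedup comes from), while this version calls the target per token only to make the rule visible:

```python
def speculative_step(prefix, draft_propose, target_next, k=4):
    """One greedy speculative-decoding step.

    draft_propose(prefix, k) -> k tokens guessed by the cheap draft model
    target_next(prefix)      -> the large model's greedy next token
    Accepted draft tokens are 'free'; the first mismatch is replaced by the
    target's choice, so output is identical to running the target alone.
    """
    committed = list(prefix)
    for tok in draft_propose(prefix, k):
        expected = target_next(committed)
        if tok == expected:
            committed.append(tok)        # draft agreed: accepted for free
        else:
            committed.append(expected)   # mismatch: take the target's token
            break
    else:
        committed.append(target_next(committed))  # bonus token, all k accepted
    return committed

# Toy character-level models: the target greedily spells "abcdef", the draft
# guesses "abcx" -- three free accepts plus one correction in a single step.
target = lambda p: "abcdef"[len(p)]
draft = lambda p, k: list("abcx")[:k]
print("".join(speculative_step([], draft, target)))
```

Four tokens committed per step instead of one is exactly the 2–4× speedup the tutorial quotes, and since every committed token is the target's own greedy choice, quality is unchanged.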
| Resource | What You Learn | Format | Link |
|---|---|---|---|
| vLLM Metrics & Monitoring Docs: built-in Prometheus metrics — TTFT, TPOT, queue depth, GPU utilisation, cache hit rates; start here for an observability baseline | What inference-specific metrics look like vs generic infra metrics; what to alert on | Docs | docs.vllm.ai/observability |
| Locust Load Testing: Python-based load testing — simulate bursty LLM traffic, measure P50/P95/P99 TTFT under concurrency, find breaking points | Realistic load simulation; understand queueing behaviour under different traffic patterns | Lib | locust.io |
| Prometheus + Grafana for LLM Infra: the standard observability stack — prometheus-fastapi-instrumentator for FastAPI, the NVIDIA DCGM exporter for GPU metrics, pre-built Grafana dashboards | Production-grade dashboards for GPU util, TTFT histograms, request rates, OOM events | Guide | prometheus.io/docs |
| Databricks — LLM Inference Best Practices: production guide covering SLO design, graceful degradation, OOM handling, canary deployments, model versioning in serving | Practical reliability thinking from a company running LLMs at scale in production | Guide | databricks.com/llm-inference-best-practices |
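Locust and the Grafana dashboards above all report latency percentiles, and it pays to know exactly what those numbers mean before quoting a P95 TTFT SLO in an interview. A nearest-rank sketch over hypothetical TTFT samples (the millisecond values are made up) which also shows why P99 on a small sample is simply the worst observation:

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest value such that at least
    p% of the samples are less than or equal to it."""
    xs = sorted(samples)
    rank = math.ceil(p / 100 * len(xs))
    return xs[max(rank - 1, 0)]

# Hypothetical TTFT measurements (ms) from a short load-test run:
ttft_ms = [120, 95, 400, 130, 110, 105, 1500, 125, 98, 115]
for p in (50, 95, 99):
    print(f"P{p} TTFT: {percentile(ttft_ms, p)} ms")
```

With only 10 samples, P95 and P99 both land on the 1500 ms outlier: a reminder that tail-latency SLOs need thousands of requests per window before the numbers mean anything.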