Career Transition Plan

AI Inference Engineering

A focused 6-month plan for transitioning from Cloud Architecture into AI Infrastructure and LLM Serving — leveraging existing strengths rather than starting from scratch.

Timeline: 3–6 months
Target: 3 months
Base: Cloud Architect → AI Infra
Location: London

01 Existing Strengths (Don't Re-Learn)

⚙️
Cloud Architecture
Multi-cloud design, cost optimisation, CDN — directly applicable to inference cost-per-token economics
🔗
Backend Systems
APIs, concurrency, async patterns, Docker/Kubernetes — the serving layer for inference is pure backend
🎮
Gaming Scale
Latency-sensitive, high-concurrency workloads at Tencent scale — rare context most candidates lack
💰
Infra Economics
Group procurement, 26–60% cost savings — maps directly to GPU cost optimisation thinking
🏢
Enterprise Context
Microsoft + Tencent background; comfortable with CTO-level stakeholders and enterprise infrastructure
🐍
Python Proficiency
Strong working knowledge — inference engineering is predominantly Python at the application layer

02 6-Month Learning Path

Month 1
Week 1–2

Close the PyTorch Gap — Fast

  • Sebastian Raschka — PyTorch in One Hour — dense, inference-focused, zero fluff. Perfect bridge from backend to ML. sebastianraschka.com/teaching/pytorch-1h
  • Karpathy's nanoGPT — read and run it. Focus on model.py generate loop and KV cache specifically. Skip full training. github.com/karpathy/nanoGPT
  • Load Llama-3.1-8B via HuggingFace Transformers — run generation, measure VRAM and tok/s. This is the end goal of week 2.
  • Understand the vocabulary: TTFT (time to first token), TPOT (time per output token), the KV cache, the autoregressive decode loop, and why decoding is memory-bandwidth-bound rather than compute-bound
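The week-2 end goal above can be sketched in a few lines. This is a minimal, illustrative script, not a benchmark harness: the model id assumes you have access to the gated Llama 3.1 weights (any HuggingFace causal LM id works), and the throughput helper is just tokens divided by wall time.

```python
import time

def decode_tok_s(n_new_tokens: int, elapsed_s: float) -> float:
    """Decode throughput: generated tokens per second of wall time."""
    if elapsed_s <= 0:
        raise ValueError("elapsed time must be positive")
    return n_new_tokens / elapsed_s

if __name__ == "__main__":
    # Heavy part — needs a GPU, `pip install transformers accelerate`,
    # and access to the (gated) Llama 3.1 weights on HuggingFace.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "meta-llama/Llama-3.1-8B-Instruct"
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, device_map="auto"
    )

    inputs = tok("Explain the KV cache in one paragraph.",
                 return_tensors="pt").to(model.device)
    start = time.perf_counter()
    out = model.generate(**inputs, max_new_tokens=128)
    elapsed = time.perf_counter() - start

    # Count only newly generated tokens, not the prompt.
    n_new = out.shape[1] - inputs["input_ids"].shape[1]
    print(f"{decode_tok_s(n_new, elapsed):.1f} tok/s")
    print(f"peak VRAM: {torch.cuda.max_memory_allocated() / 1e9:.1f} GB")
```

Run it twice with different `max_new_tokens` values and you'll see the memory-bound behaviour directly: tok/s barely moves while VRAM stays dominated by the weights.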
Month 1–2
Core investment

Get Hands-On with vLLM (Primary Focus)

  • Install vLLM locally, serve a model, hit it with real requests from a script — github.com/vllm-project/vllm · docs.vllm.ai
  • Build the FastAPI wrapper — streaming responses, JWT/API key auth, per-user rate limiting, Prometheus metrics (TTFT, TPOT, GPU util, queue depth), Grafana dashboard. This is your portfolio project.
  • Run locust load tests against it — understand where latency comes from under concurrency — locust.io
  • Experiment with quantization (AWQ/FP8) — benchmark quality vs speed. AutoAWQ · AutoGPTQ
  • Understand prefix caching, continuous batching, PagedAttention — the core innovations that make vLLM fast
  • Frame everything in cost-per-token economics — this is a genuine differentiator most candidates miss
Month 2–3
Go one level deeper

Deepen in One Direction (Pick One)

  • If scaling excites you → Multi-GPU tensor parallelism, Kubernetes autoscaling with KEDA, multi-model routing with cost-aware dispatching
  • If efficiency excites you → Quantization benchmark suite (AWQ vs GPTQ vs FP8), speculative decoding experiments, offline batch pipeline with Ray
  • If reliability excites you → Full observability stack, graceful OOM handling, SLO design, circuit breakers under load
⚠️ Pick one and finish it properly. One done well beats three half-finished.
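If the efficiency direction tempts you, the core idea of speculative decoding fits in one toy function. This is a deliberately simplified greedy-verification sketch — real systems (EAGLE, vLLM's speculators) verify against the full sampling distribution, not just token equality:

```python
def accept_draft(draft: list[int], target: list[int]) -> list[int]:
    """Greedy-verification toy: keep the longest prefix of the cheap draft
    model's tokens that the big target model agrees with, then take the
    target's first correction. Every accepted token is a decode step saved."""
    out: list[int] = []
    for d, t in zip(draft, target):
        if d == t:
            out.append(d)   # draft token verified — accepted for free
        else:
            out.append(t)   # first disagreement: emit the target's token, stop
            break
    return out
```

The speedup comes from the fact that verifying k draft tokens is one batched forward pass of the target model, while generating them normally would take k sequential passes.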
Month 1–6
Parallel

Running in Parallel Throughout

  • Job market conversations from Week 1 — not month 4. Even exploratory conversations give you real signal on where you're competitive
  • LinkedIn updated ✓ — headline and about section done, skills added (LLM Inference, vLLM, GPU Infrastructure, AI Platform Engineering)
  • GitHub documentation — every project public with README, benchmark tables, before/after metrics, real numbers
  • One blog post — Medium or personal site. E.g. "How I 3×'d Throughput on Llama-70B with vLLM PagedAttention". Links impress interviewers.

03 Portfolio Project Ladder

1. Production Chat API
   FastAPI + vLLM with streaming, auth, rate limiting, Prometheus + Grafana monitoring
   Goal: Reliable, observable serving layer
   Tech: FastAPI, vLLM, Prometheus, Docker
   Target: <500ms TTFT @ 50 concurrent users

2. Quantization Benchmark Suite
   Compare AWQ / GPTQ / FP8 on the same model; measure quality vs speed vs VRAM tradeoffs
   Goal: Demonstrate optimisation depth
   Tech: vLLM, AutoAWQ, lm-eval-harness, Streamlit
   Target: 2–3× throughput gain with <2% quality drop

3. Load Test + Autoscaling PoC
   Deploy vLLM, run bursty load tests, implement autoscaling via KEDA + custom metrics
   Goal: Scaling under real-world traffic
   Tech: Kubernetes, KEDA, Locust, Prometheus
   Target: 99% of requests <1.5s TTFT across a 5× traffic spike

4. Multi-Model Cost Router
   Route requests between an 8B fast path and a 70B quality fallback based on complexity, user tier, or cost budget
   Goal: Distributed systems + economics thinking
   Tech: FastAPI, vLLM multi-instance, Redis, Ray
   Target: 40% average cost reduction while maintaining the P95 latency SLO

5. Offline Batch Pipeline
   Distributed batch processor for 10k+ documents using Ray + vLLM, with fault tolerance and checkpointing
   Goal: Throughput at scale beyond online serving
   Tech: Ray, vLLM offline mode, PyArrow
   Target: 300+ tok/s effective throughput across 8 GPUs
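The heart of project 4 is a dispatch rule. A toy sketch of the idea — the model names, tiers, and thresholds are all illustrative placeholders, and a real router would score complexity with a classifier and track spend in Redis:

```python
def route_request(prompt_tokens: int, user_tier: str,
                  complexity_score: float) -> str:
    """Toy cost-aware dispatch: cheap 8B fast path by default, 70B quality
    fallback only when the request looks hard and the tier allows it.
    Thresholds are illustrative, not tuned."""
    if user_tier == "free":
        return "llama-8b"          # free tier never pays for the big model
    if complexity_score > 0.7 or prompt_tokens > 4000:
        return "llama-70b"         # hard or long requests get the fallback
    return "llama-8b"
```

The portfolio value is in measuring the consequence: log which model served each request, multiply by per-token cost, and show the 40% saving alongside an unchanged P95.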

04 AI Career Path Comparison (2026)

Area | ML Theory Needed | Backend Leverage | Ramp-Up (Your BG) | 2026 Demand | Fit
🎯 Inference Engineering | Low–Medium | Extremely High | 2–4 months | ★★★★★ | Primary
AI Backend / Platform | Very Low | Extremely High | 1–3 months | ★★★★★ | Strong
MLOps / LLMOps | Low | High | 2–4 months | ★★★★★ | Strong
Agent Systems Backend | Low–Medium | High | 2–4 months | ★★★★ | Good
RAG / Knowledge Pipelines | Low | High | 1–3 months | ★★★★ | Good
Production Evals | Low | Medium–High | 1–2 months | ★★★★ | Good
AI Guardrails / Security | Low | High | 2–3 months | ★★★★ | Good
ML Research / Training | Very High | Low | 12–24 months | ★★★★★ | Poor Fit

05 London Job Market Strategy

Primary Target

Larger Tech with AI Infra

  • Google Cloud / DeepMind adjacent teams
  • Microsoft Azure AI (strong leverage from your past)
  • AWS AI/ML infrastructure
  • Meta London AI infra

Your Microsoft-to-cloud-architect story is clean and sellable. Getting one strong AI infra role here is the unlock for everything else — including frontier labs 12–18 months later.

Parallel Bets

AI-Native Startups

  • Mistral (Paris HQ, London presence)
  • Wayve — autonomous AI
  • Stability AI
  • US AI companies with London engineering hubs

Startups offer the steepest learning curve and the most relevant experience, fastest. Avoid companies that are just thin wrappers over the OpenAI API — look for ones that build and run their own inference stack.

Step-After-Next (12–18 months)

Frontier Labs — DeepMind etc.

Highly competitive. These roles typically want people who've already shipped at production AI scale. Your cloud architecture background is relevant, but you'd be competing with 2–3 year inference veterans. Build the track record first, then target these. The path is: one strong AI infra role → frontier lab, not direct entry.

06 Two Parallel Tracks (Not Sequential)

Track 1 — Build (Month 1–3)
  • PyTorch crash course, nanoGPT, HuggingFace basics
  • vLLM local setup + first benchmarks
  • FastAPI + vLLM production API (portfolio project)
  • Quantization experiments with documented results
  • GitHub public with real numbers and README
  • One blog post about what you built
Track 2 — Market (Month 1–6)
  • LinkedIn updated ✓ — start now, not month 4
  • Exploratory recruiter conversations from Week 1
  • Target: real interviews by Month 3 checkpoint
  • If getting through to technical screens → on track
  • If not → adjust gaps with 3 months still remaining
  • Use interview questions to guide what to build next

Month 3 gut-check: Have you had real interviews, not just applications? Getting to technical screens means you're competitive. If not, you still have 3 months to close specific gaps — and you'll know exactly what they are.

07 Why This Bet Is Future-Proof

If standardised runtimes dominate

Move up the stack

Orchestration, multi-model routing, cost-aware dispatching, agent composition infrastructure — all benefit from deep inference understanding

If hardware improves 10×

Domain-specific optimisation

Edge AI, robotics, gaming real-time inference, scientific computing — same skills, different constraints

If agents dominate

Reasoning infrastructure

Bursty, tree-like inference with branch caching, speculative prefetch, multi-step routing — direct extension of serving expertise

Jevons Paradox effect

Demand keeps growing

Cheaper inference → people use it for harder problems (long-context agents, video generation, embodied AI) → optimisation remains scarce and valuable

08 Resources & Courses by Track

◈ Core vLLM — Start Here (Months 1–2)
vLLM Quickstart (Docs) — docs.vllm.ai/quickstart
  Official starting point — install, serve a model, hit it with the OpenAI-compatible API
  Learn: setup, offline batch, online serving, API keys

vLLM GitHub + Examples (Code) — github.com/vllm-project/vllm
  Walk the examples/ directory — offline inference, benchmarking, quantization, multi-GPU
  Learn: hands-on with real engine internals, sampling params, engine args

HuggingFace LLM Course — Optimized Inference Deployment (Course) — huggingface.co/learn/llm-course
  Chapter comparing TGI vs vLLM vs llama.cpp, with a PagedAttention deep-dive and production deployment patterns
  Learn: framework comparison, KV cache mechanics, when to use which engine

mlabonne LLM Course, 41k ⭐ (Course) — github.com/mlabonne/llm-course
  End-to-end LLM engineer roadmap — strong inference section covering Flash Attention, KV cache, speculative decoding, quantization, serving at scale
  Learn: broadest coverage of inference topics, with curated links to the best resources

HuggingFace Inference Optimization Docs (Docs) — huggingface.co/docs/transformers/llm_optims
  KV cache, static KV cache + torch.compile, speculative decoding, FlashAttention-2 — all with code
  Learn: production-grade optimisation techniques directly usable in HF Transformers

NVIDIA — Mastering LLM Techniques: Inference Optimization (Blog) — developer.nvidia.com/blog
  Deep technical blog covering memory distribution, quantization types, model parallelism, batching — written by NVIDIA engineers
  Learn: the GPU hardware perspective on every key optimisation; explains why things work
◈ Deep Track 1: Scaling & Distributed Inference ← Your primary focus
Tensor parallelism, multi-GPU routing, autoscaling, disaggregated serving, cost-aware dispatching
Stanford CS229S — Systems for ML (Course) — stanford.edu/courses/cs229s
  Graduate-level course on efficient training and inference — tensor/pipeline/expert parallelism, LLM serving systems, hardware-software co-design
  Learn: the academic foundation for distributed inference at depth; closest thing to frontier-lab thinking

vLLM Distributed Inference Docs (Docs) — docs.vllm.ai/distributed_serving
  Tensor parallel, pipeline parallel, data parallel in vLLM — configuration, when to use each, multi-node setup
  Learn: hands-on multi-GPU serving; understanding --tensor-parallel-size and its tradeoffs

llm-d — Kubernetes-Native Distributed Inference (Project) — github.com/llm-d/llm-d
  Open-source framework (Red Hat + Google) for disaggregated serving on Kubernetes — prefix-aware routing, KV cache across nodes, disaggregated prefill/decode
  Learn: production-grade architecture for serving at cluster scale; reports a 40% latency reduction on DeepSeek V3.1

Ray Serve — LLM Serving Docs (Docs) — docs.ray.io/serve/llm
  Ray's native LLM serving layer — autoscaling, multi-model routing, vLLM integration, OpenAI-compatible API, prefix-aware routing
  Learn: production multi-model orchestration with Ray; used at OpenAI, Uber, and Instacart

KEDA + vLLM Autoscaling Guide, Red Hat (Blog) — developers.redhat.com/kserve-keda
  Step-by-step walkthrough of KEDA-based autoscaling for vLLM using Prometheus custom metrics — queue depth, TTFT, GPU utilisation as scaling signals
  Learn: practical autoscaling that goes beyond CPU/memory to inference-specific signals

Kubernetes LLM Autoscaling Complete Guide (Guide) — collabnix.com/k8s-llm-autoscaling
  HPA + KEDA + VPA patterns for LLM workloads, with ScaledObject YAML examples for multi-model routing
  Learn: Kubernetes-native scaling patterns — directly applicable to your infra background

Meta Engineering — Scaling LLM Inference (Blog) — engineering.fb.com/scaling-llm
  Deep technical post from Meta on tensor, context, and expert parallelism in production — how they achieved <1 min prefill for 1M tokens on 32 H100 hosts
  Learn: real-world production architecture from a frontier lab — benchmark targets and approach

NVIDIA TensorRT-LLM (Lib) — github.com/NVIDIA/TensorRT-LLM · docs
  NVIDIA's production-grade inference library — multi-GPU, FP8 quantization, speculative decoding, disaggregated serving; a complement to vLLM for NVIDIA-heavy deployments
  Learn: maximum performance on NVIDIA hardware; world-record numbers on DeepSeek R1 on Blackwell
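The KEDA and HPA resources above all boil down to one formula worth internalising. A sketch of the HPA scaling rule applied to an inference-specific signal (queue depth here; TTFT or GPU utilisation work the same way — the parameter names and clamp bounds are illustrative):

```python
import math

def desired_replicas(current: int, avg_queue_depth: float,
                     target_queue_depth: float,
                     min_replicas: int = 1, max_replicas: int = 8) -> int:
    """Kubernetes HPA rule: desired = ceil(current * metric / target),
    clamped to [min_replicas, max_replicas]. `avg_queue_depth` is the
    per-replica average of the scaling metric."""
    if target_queue_depth <= 0:
        raise ValueError("target must be positive")
    desired = math.ceil(current * avg_queue_depth / target_queue_depth)
    return max(min_replicas, min(max_replicas, desired))
```

For example, 2 replicas averaging 30 queued requests against a target of 10 per replica scale to 6. The reason queue depth beats CPU as a signal: a saturated GPU serving vLLM can sit at modest CPU while TTFT degrades badly.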
◈ Deep Track 2: Efficiency — Quantization & Speculative Decoding ← Also strong interest
AWQ, GPTQ, FP8, speculative decoding, cost-per-token optimisation benchmarking
AutoAWQ — 4-bit Quantization (Lib) — github.com/casper-hansen/AutoAWQ
  Production-ready AWQ quantization — claims ~3× speedup and 3× memory reduction vs FP16. Start here for hands-on quantization experiments
  Learn: GEMM vs GEMV tradeoffs, fused modules, batch-size implications — the practical side of quantization

AutoGPTQ (Lib) — github.com/AutoGPTQ/AutoGPTQ
  GPTQ-based quantization — a layer-wise approach, complementary to AWQ. Use both to understand the tradeoffs in your benchmark suite
  Learn: GPTQ vs AWQ differences; when each wins; Marlin kernel integration

vLLM FP8 Quantization Docs (Docs) — docs.vllm.ai/quantization
  Native FP8 support in vLLM — how to run FP8 models, tradeoffs vs INT4, hardware requirements (H100 / A100)
  Learn: the third quantization format to benchmark; hardware-aware precision selection

lm-evaluation-harness, EleutherAI (Lib) — github.com/EleutherAI/lm-evaluation-harness
  Standard framework for measuring model quality after quantization — perplexity plus benchmarks (ARC, HellaSwag, MMLU). Essential for your benchmark dashboard
  Learn: rigorous quality measurement across quantization methods — exactly what interviewers want to see

Speculative Decoding Tutorial (Paper) — arxiv.org/abs/2503.00491
  Comprehensive academic tutorial covering draft-model architectures, verification strategies, and EAGLE-3 — 2–4× speedup while preserving the original output distribution
  Learn: a deep understanding of how speculative decoding works and why it's powerful

SpeculativeDecodingPapers (Reading list) — github.com/hemingkx/SpeculativeDecodingPapers
  Curated reading list of all major speculative decoding papers — EAGLE, EAGLE-2/3, self-speculative, quantized draft models, LongSpec
  Learn: how to stay current on the most active research area in inference efficiency

vLLM Speculators Library (Lib) — vllm speculators
  Official vLLM library for building, evaluating, and storing speculative decoding algorithms — EAGLE-3 support built in
  Learn: hands-on implementation of speculative decoding inside the vLLM serving stack
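To make the AutoAWQ entry concrete, here is a minimal quantization sketch following AutoAWQ's documented API, plus a back-of-envelope VRAM helper for your benchmark tables. The model id assumes gated Llama 3.1 access; the calibration defaults are the library's, not tuned values.

```python
def weight_vram_gb(params_billions: float, bits_per_weight: float) -> float:
    """Back-of-envelope weight memory: params × bits / 8, in GB.
    Ignores the KV cache and activations, which dominate at long context."""
    return params_billions * bits_per_weight / 8

if __name__ == "__main__":
    # Heavy part — needs a GPU and `pip install autoawq`.
    from awq import AutoAWQForCausalLM
    from transformers import AutoTokenizer

    model_path = "meta-llama/Llama-3.1-8B-Instruct"
    quant_config = {"zero_point": True, "q_group_size": 128,
                    "w_bit": 4, "version": "GEMM"}

    model = AutoAWQForCausalLM.from_pretrained(model_path)
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model.quantize(tokenizer, quant_config=quant_config)  # runs calibration
    model.save_quantized("llama-3.1-8b-awq")

    print(f"fp16 weights ≈ {weight_vram_gb(8, 16):.0f} GB, "
          f"awq-int4 ≈ {weight_vram_gb(8, 4):.0f} GB")
```

The saved directory then loads straight into vLLM for the throughput half of the benchmark, and lm-eval-harness supplies the quality half.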
◈ Deep Track 3: Reliability & Observability ← Leverage infra background, lighter touch
Your cloud infra background makes this the fastest track to competence — focus on AI-specific failure modes rather than generic observability
vLLM Metrics & Monitoring Docs (Docs) — docs.vllm.ai/observability
  Built-in Prometheus metrics — TTFT, TPOT, queue depth, GPU utilisation, cache hit rates. Start here for an observability baseline
  Learn: what inference-specific metrics look like vs generic infra metrics, and what to alert on

Locust Load Testing (Lib) — locust.io
  Python-based load testing — simulate bursty LLM traffic, measure P50/P95/P99 TTFT under concurrency, find breaking points
  Learn: realistic load simulation; queueing behaviour under different traffic patterns

Prometheus + Grafana for LLM Infra (Guide) — prometheus.io/docs
  The standard observability stack — prometheus-fastapi-instrumentator for FastAPI, NVIDIA DCGM exporter for GPU metrics, pre-built Grafana dashboards
  Learn: production-grade dashboards for GPU utilisation, TTFT histograms, request rates, OOM events

Databricks — LLM Inference Best Practices (Guide) — databricks.com/llm-inference-best-practices
  Production guide covering SLO design, graceful degradation, OOM handling, canary deployments, and model versioning in serving
  Learn: practical reliability thinking from a company running LLMs at scale in production
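When you summarise load-test runs, be explicit about how your percentiles are computed. A small nearest-rank sketch — fine for a results table, though note Prometheus histogram_quantile interpolates between buckets instead:

```python
import math

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile for p in (0, 100]: sort the samples and take
    the value at rank ceil(p/100 * n). No interpolation."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]
```

Report P50, P95, and P99 TTFT together: under bursty traffic the median often looks healthy while the tail, which is what an SLO guards, blows up first.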
Recommended Order
  Week 1–2: Sebastian Raschka PyTorch → nanoGPT → HuggingFace LLM Course inference chapter
  Week 3–4: vLLM Quickstart + GitHub examples → NVIDIA Inference Optimization blog → mlabonne LLM Course inference section
  Month 2: AutoAWQ + lm-eval-harness → vLLM Distributed Serving docs → Ray Serve LLM docs
  Month 3: Speculative decoding tutorial → KEDA autoscaling guide → llm-d architecture → Meta engineering blog
  Ongoing: Stanford CS229S (audit freely) → TensorRT-LLM once vLLM feels solid → SpeculativeDecodingPapers to stay current