AI Systems Software Engineer

Vamshi Nagireddy

Performance engineer at Intel working on LLM inference, low-level runtime optimization, and AI compiler development for XPUs. I write about what I learn.

4+ · Years of work experience
MS · Computer Science
vLLM · Open source contributor in progress
§01 — About
Education
MS · Computer Science
CSU Sacramento
BS · Computer Science
JNTUH College of Engineering

I'm a systems software engineer at Intel focused on making AI workloads run faster: telemetry-driven runtime optimization frameworks, cache/memory tuning, LLVM-based PGO, and AI compiler development targeting XPUs. I care about what happens at the hardware/software boundary, and I'm drawn to problems that live in the gap between ML research and production systems.

Before Intel, I built NLP systems and job recommendation engines at Phenom, and did AI/multimodal research at CSU Sacramento. My projects span autonomous agents (PEARL), RAG and GraphRAG pipelines (DrugGuard), local speech intelligence (VaultASR), and LLM inference benchmarking.

I'm currently working toward contributing to the vLLM open source project and writing about the systems I build and study along the way.

LLM Inference · Performance Engineering · Low-Level Development · OS / Kernel / Drivers · AI Compiler Development · XPU Targets · LLVM · Open Source
ML / Inference
PyTorch · CUDA · Triton · vLLM · TensorRT-LLM · ONNX · TensorFlow · OpenVINO
Compilers & Runtime
LLVM/PGO · MLIR · Clang · oneAPI · SYCL · Intel APO · CMake
Languages
Python · C++ · C · Go · Java · Bash/Shell
Infra & DevOps
Docker · Kubernetes · Kafka · Jenkins · MLFlow · Git
Data & Backends
SQL · MongoDB · Neo4j · DynamoDB · Cassandra · Elasticsearch
Hardware Targets
NVIDIA GPU · Apple Silicon · Intel NPU · CoreML · DirectML · ROCm
§02 — Experience
Jul 2023 – Present
Software Engineer II
Intel Corporation · San Jose, CA
AI and traditional workload performance optimization. Telemetry-driven runtime frameworks, Intel APO software, cache/memory tuning, LLVM-based PGO, and Python state machines. Active contributor to the BAPCo benchmarking consortium.
LLVM/PGO · Intel APO · BAPCo · Telemetry · C++ · Python
2022 – 2023
ML Engineer
Phenom
Built NLP systems for resume parsing and job ranking. Developed job recommendation systems using NoSQL databases. Published research on job recommendation systems.
NLP · PyTorch · NoSQL · Recommendation Systems
2021 – 2022
AI Research Assistant
CSU Sacramento
AI/ML research focused on NLP and autonomous agents. Graduated MS CS.
Research · NLP · Autonomous Agents
§03 — Projects
ASR
VaultASR
High-Performance Local Speech Intelligence
GitHub ↗

Local, private speech-to-text pipeline with multi-speaker diarization, Silero VAD v5, and hardware-accelerated inference. Transcribes hours of audio in minutes with zero data leaving the device.

Features: Multi-speaker diarization · Silero VAD v5 · CoreML/Metal GPU · XLSX/JSON/Docx export
Impact: 100% offline · Apple Silicon optimized · hours of audio in minutes
C++ · whisper.cpp · ONNX Runtime · CoreML · Metal · FFmpeg
AI
PEARL
Proactive Execution and Adaptive Reasoning Loop
GitHub ↗

Autonomous AI agent with a cognitive architecture for reliable task execution. Features dynamic task decomposition, constrained decoding, and experience-based learning via persistent vector memory.

Features: PEARL cognitive loop · dynamic task decomposition · ChromaDB long-term memory · web search tools
Impact: Novel approach to reliable LLM-based code generation and task planning
Python · Gemini API · ChromaDB · LangChain · FastAPI
Rx
DrugGuard
LLM-Powered Pharmacovigilance System
GitHub ↗

Comparative study of RAG vs GraphRAG for safety-critical medical information retrieval. Graph-based knowledge representation significantly improves retrieval accuracy over flat vector search for drug interaction queries.

Dataset: 19,520 drug-side effect associations · 976 drugs · 3,851 side effect terms
Innovation: GraphRAG vs RAG evaluation · graph-based knowledge representation
Python · Neo4j · LangChain · RAG · GraphRAG · LLM
§04 — Latest Writing
LLM Inference · 2025
Why Running an LLM Is Harder Than It Looks
Training gets the papers. In between sits inference, a complete systems engineering discipline almost entirely distinct from ML research.
LLM Inference · Coming soon
Memory Mapping and How a 140GB Model Actually Loads
How safetensors and GGUF make zero-copy model loading possible, and why .pt files cannot.
In progress
Performance · Coming soon
Prefill vs Decode: Why Transformers Have Two Compute Personalities
The two phases of inference have opposite hardware characteristics, and understanding this changes how you think about serving systems.
In progress

Resume

Full resume covering Intel, Phenom, CSU Sacramento, projects (PEARL, DrugGuard, ARIA), and publications. Updated April 2025.

Download PDF ↓ · View LinkedIn ↗
§05 — Contact

Open to interesting conversations about performance optimization, LLM inference, open source, or new opportunities.