AI Systems Software Engineer

Vamshi Nagireddy

Performance engineer at Intel working on LLM inference, low-level runtime optimization, and AI compiler development for XPUs. I write about what I learn.

vLLM Open Source · Contributor in progress

§01 — About

Location: San Jose, CA
Company: Intel Corporation
Medium: @vamshire
GitHub: vamshinr
LinkedIn: vamshinr
Email: vamshi.knagireddy@gmail.com

Education

MS · Computer Science

CSU Sacramento

BS · Computer Science

JNTUH College of Engineering

I'm a systems software engineer at Intel focused on making AI workloads run faster — telemetry-driven runtime optimization frameworks, cache/memory tuning, LLVM-based PGO, and AI compiler development targeting XPUs. I care about what happens at the hardware/software boundary, and I'm drawn to problems that live in the gap between ML research and production systems.

Before Intel, I built NLP systems and job recommendation engines at Phenom, and did AI/multimodal research at CSU Sacramento. My projects span autonomous agents (PEARL), RAG and GraphRAG pipelines (DrugGuard), local speech intelligence (VaultASR), and LLM inference benchmarking.

I'm currently working toward contributing to the open-source vLLM project and writing about the systems I build and study along the way.

LLM Inference · Performance Engineering · Low-Level Development · OS / Kernel / Drivers · AI Compiler Development · XPU Targets · LLVM · Open Source

ML / Inference

PyTorch · CUDA · Triton · vLLM · TensorRT-LLM · ONNX · TensorFlow · OpenVINO

Compilers & Runtime

LLVM/PGO · MLIR · Clang · oneAPI · SYCL · Intel APO · CMake

Languages

Python · C++ · C · Go · Java · Bash/Shell

Infra & DevOps

Docker · Kubernetes · Kafka · Jenkins · MLflow · Git

Data & Backends

SQL · MongoDB · Neo4j · DynamoDB · Cassandra · Elasticsearch

Hardware Targets

NVIDIA GPU · Apple Silicon · Intel NPU · CoreML · DirectML · ROCm

§02 — Experience

Jul 2023 – Present

Software Engineer II

Intel Corporation · San Jose, CA

AI and traditional workload performance optimization. Telemetry-driven runtime frameworks, Intel APO software, cache/memory tuning, LLVM-based PGO, and Python state machines. Active contributor to the BAPCo benchmarking consortium.

LLVM/PGO · Intel APO · BAPCo · Telemetry · C++ · Python

2022 — 2023

ML Engineer

Phenom

Built NLP systems for resume parsing and job ranking. Developed job recommendation systems using NoSQL databases. Published research on job recommendation systems.

NLP · PyTorch · NoSQL · Recommendation Systems

2021 — 2022

AI Research Assistant

CSU Sacramento

AI/ML research focused on NLP and autonomous agents. Graduated MS CS.

Research · NLP · Autonomous Agents

§03 — Projects

ASR

VaultASR

High-Performance Local Speech Intelligence

Local, private speech-to-text pipeline with multi-speaker diarization, Silero VAD v5, and hardware-accelerated inference. Transcribes hours of audio in minutes with zero data leaving the device.

Features: Multi-speaker diarization · Silero VAD v5 · CoreML/Metal GPU · XLSX/JSON/DOCX export

Impact: 100% offline · Apple Silicon optimized · hours of audio in minutes

C++ · whisper.cpp · ONNX Runtime · CoreML · Metal · FFmpeg

PEARL

Proactive Execution and Adaptive Reasoning Loop

Autonomous AI agent with a cognitive architecture for reliable task execution. Features dynamic task decomposition, constrained decoding, and experience-based learning via persistent vector memory.

Features: PEARL cognitive loop · dynamic task decomposition · ChromaDB long-term memory · web search tools

Impact: Novel approach to reliable LLM-based code generation and task planning

Python · Gemini API · ChromaDB · LangChain · FastAPI

DrugGuard

LLM-Powered Pharmacovigilance System

Comparative study of RAG vs GraphRAG for safety-critical medical information retrieval. Graph-based knowledge representation significantly improves retrieval accuracy over flat vector search for drug interaction queries.

Dataset: 19,520 drug-side effect associations · 976 drugs · 3,851 side effect terms

Innovation: GraphRAG vs RAG evaluation · graph-based knowledge representation

Python · Neo4j · LangChain · RAG · GraphRAG · LLM

§04 — Latest Writing

LLM Inference · 2025

Why Running an LLM Is Harder Than It Looks

Training gets the papers. Between research and production sits inference, a complete systems engineering discipline almost entirely distinct from ML research.

LLM Inference · Coming soon

Memory Mapping and How a 140GB Model Actually Loads

How safetensors and GGUF make zero-copy model loading possible, and why .pt files cannot.

In progress