
For a long time, I thought the hard part of AI was the model.
Then I thought it was the prompt.
Then I thought it was the platform.
I was wrong.
The hardest part of working with Large Language Models (LLMs) is not what you deploy —
it’s how you understand what the system is telling you when it’s under pressure.
That realization didn’t come from a whitepaper.
It didn’t come from a demo.
And it definitely didn’t come from a “Hello World” notebook.
It came from watching latency curves collapse, throughput flatten, and p95 explode — while everything looked fine on the surface.
That’s when I stopped approaching AI as an application problem.
And started approaching it as an engineering system.
The Mistake Many of Us Make When Entering AI
Most professionals entering the AI space come from one of three backgrounds:
• Software development
• Data science
• Strategy / leadership
Each of those brings value.
But each also brings a blind spot.
Developers focus on features.
Data scientists focus on accuracy.
Leaders focus on outcomes.
But LLM platforms don’t fail first in features, accuracy, or outcomes.
They fail first in capacity, latency, and saturation.
And those are engineering problems.
Why I Chose to Start With the Engineer’s Lens
After years of building and operating large-scale infrastructure — networks, fabrics, data centers, distributed systems — I recognized something familiar when I started benchmarking LLMs.
The behavior felt… known.
Queues backing up.
Latency spikes under load.
Throughput plateaus.
Non-linear collapse points.
This wasn’t magic.
This was systems engineering.
LLMs behave far more like:
• Firewalls under burst traffic
• WAN links at peak utilization
• CPU schedulers under contention
• Storage arrays hitting IOPS limits
…than like traditional applications.
Once I accepted that, everything changed.
The Engineer’s Question Is Always the Same
Not:
“Why is the model slow?”
But:
“What is the system telling me under load?”
That single question reframes everything.
Because performance is not a single number.
It’s a story over time.
Understanding LLMs as Systems, Not Products
An LLM platform is not just a model.
It is a pipeline:
Input → Tokenization → Context handling → Inference → Decoding → Output
Wrapped in:
• GPU memory limits
• KV cache behavior
• Context window constraints
• Concurrency scheduling
• Queue backpressure
When something degrades, it is never random.
The system is communicating.
Your job, as an engineer, is to listen correctly.
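The pipeline above can be sketched as code. This is a toy model with hypothetical stub stages (not a real serving stack): the point is that timing each stage separately is what lets you attribute degradation instead of guessing.

```python
import time

# Hypothetical stand-ins for the real pipeline stages.
def tokenize(text):
    return text.split()                       # stand-in for real tokenization

def build_context(tokens):
    return tokens[-2048:]                     # stand-in for context-window handling

def infer(context):
    return ["token"] * min(len(context), 64)  # stand-in for model inference

def decode(output_tokens):
    return " ".join(output_tokens)            # stand-in for detokenization

PIPELINE = [tokenize, build_context, infer, decode]

def run_pipeline(text):
    """Run input through every stage, recording per-stage wall time."""
    timings, value = {}, text
    for stage in PIPELINE:
        start = time.perf_counter()
        value = stage(value)
        timings[stage.__name__] = time.perf_counter() - start
    return value, timings

result, timings = run_pipeline("hello world " * 100)
```

With per-stage timings in hand, "the system is slow" becomes "this stage is slow," which is a very different conversation.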
Latency Is Not the Enemy — Blindness Is
I see many teams panic when p95 latency increases.
But latency is not the problem.
Unexplained latency is the problem.
Latency increases for reasons:
• Context length grows
• KV cache pressure increases
• Concurrency exceeds sustainable limits
• GPU memory fragmentation rises
• RAG adds retrieval overhead
If you don’t know which one, you don’t have a performance issue —
you have an observability issue.
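One way to close that observability gap is to label every latency sample with the factors that could explain it. A minimal sketch, with made-up field names and sample values:

```python
import statistics
from dataclasses import dataclass

# Every measurement carries the factors that could explain it.
# Fields and numbers are illustrative, not from a real system.
@dataclass
class Sample:
    latency_s: float
    context_tokens: int
    output_tokens: int
    concurrency: int

samples = [
    Sample(0.8, 512, 128, 10),
    Sample(0.9, 600, 130, 10),
    Sample(2.4, 4096, 128, 10),   # long context, same load
    Sample(2.6, 512, 128, 60),    # short context, heavy load
]

def mean_latency(rows, predicate):
    """Mean latency over the samples matching a suspected cause."""
    matched = [s.latency_s for s in rows if predicate(s)]
    return statistics.mean(matched) if matched else None

# Slice by suspected cause instead of staring at one aggregate number.
long_ctx = mean_latency(samples, lambda s: s.context_tokens > 2000)
high_cc = mean_latency(samples, lambda s: s.concurrency > 40)
```

Once samples are labeled this way, a p95 spike stops being a mystery and becomes a query.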
Throughput Tells a Different Story Than Latency
One of the first engineering lessons in LLM systems:
Latency and throughput are not enemies.
They are dance partners.
You can:
• Optimize for low latency and kill throughput
• Maximize throughput and destroy user experience
The engineer’s job is to find the stable operating zone — not the maximum theoretical number.
Just like networking:
• Line-rate is not always usable-rate
• Peak bandwidth is not sustained bandwidth
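The trade-off can be made concrete with a toy batching model (the numbers are assumptions, not measurements): a GPU step has a fixed cost plus a small per-request cost, so batching raises throughput while every request in the batch pays the whole step's latency.

```python
FIXED_S = 0.050    # hypothetical fixed cost per forward step
PER_REQ_S = 0.005  # hypothetical marginal cost per batched request

def step_time(batch):
    return FIXED_S + PER_REQ_S * batch

def throughput(batch):
    return batch / step_time(batch)  # requests per second

def latency(batch):
    return step_time(batch)          # every request waits the full step

for b in (1, 8, 32):
    print(f"batch={b:2d}  {throughput(b):6.1f} req/s  {latency(b) * 1000:.0f} ms")
```

Even in this crude model, throughput and latency both rise with batch size. The stable operating zone is wherever that trade stops being worth it for your users.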
Concurrency Is Where Reality Appears
Everything looks great at concurrency 1.
At concurrency 5, still fine.
At concurrency 20… interesting.
At concurrency 40… revealing.
At concurrency 60… truth.
This is where many AI experiments die in production.
Not because the model is bad.
But because capacity planning was never done.
Concurrency exposes:
• Queue depth issues
• Scheduling inefficiencies
• Memory pressure
• Non-linear collapse points
These are not “AI problems”.
These are engineering problems.
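The concurrency sweep above can be sketched against a simulated backend. The contention model and every number here are illustrative assumptions, not benchmark results: a server with fixed parallel capacity, beyond which requests queue and latency grows.

```python
BASE_LATENCY_S = 0.5  # assumed service time with no contention
CAPACITY = 20         # assumed requests the backend can serve in parallel

def latency_at(concurrency):
    """Simple contention model: beyond capacity, requests queue up."""
    queue_factor = max(1.0, concurrency / CAPACITY)
    return BASE_LATENCY_S * queue_factor

for c in (1, 5, 20, 40, 60):
    print(f"concurrency={c:3d}  latency~{latency_at(c):.2f}s")
```

A real sweep replaces `latency_at` with measured requests, but the shape is the same: flat until the knee, then linear-or-worse growth. Finding that knee before your users do is what capacity planning means here.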
The Day I Realized Prompt Engineering Was Not Step One
Prompt engineering matters.
But only after you understand:
• Where your bottleneck actually is
• Whether output length dominates latency
• Whether context dominates memory
• Whether RAG dominates compute
Optimizing prompts without this knowledge is like tuning MTU sizes while ignoring packet loss.
You may feel productive.
But you’re not fixing the real issue.
Why RAG Changes Everything (and Nothing)
RAG is powerful.
It also:
• Adds latency
• Reduces throughput
• Increases variability
• Introduces new failure modes
From an engineer’s view, RAG is not “AI magic”.
It is:
• Retrieval latency
• Serialization cost
• Token expansion
• Memory pressure
Once you model it that way, you stop arguing emotionally about RAG —
and start placing it where it belongs architecturally.
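Modeling it that way can be as simple as a per-stage budget. This sketch uses hypothetical rates and sizes throughout (retrieval time, prefill rate, decode rate are all assumed):

```python
def rag_budget(retrieval_s, chunks, tokens_per_chunk, decode_tok_s, output_tokens):
    """Return a per-stage latency breakdown for one RAG request."""
    context_tokens = chunks * tokens_per_chunk  # token expansion from retrieval
    prefill_s = context_tokens / 10_000         # assumed prefill rate (tok/s)
    decode_s = output_tokens / decode_tok_s
    return {
        "retrieval_s": retrieval_s,
        "prefill_s": prefill_s,
        "decode_s": decode_s,
        "total_s": retrieval_s + prefill_s + decode_s,
        "context_tokens": context_tokens,
    }

budget = rag_budget(retrieval_s=0.12, chunks=8, tokens_per_chunk=300,
                    decode_tok_s=40, output_tokens=200)
```

A breakdown like this turns "is RAG worth it?" into "which stage dominates, and can we afford it?" — a question architecture can actually answer.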
Engineering Is About Predictability, Not Perfection
I’m not chasing:
• The fastest response
• The smartest model
• The biggest benchmark score
I’m chasing:
• Predictable latency
• Explainable degradation
• Stable capacity
• Known failure modes
Because leadership doesn’t ask:
“Is the model state of the art?”
They ask:
“Can we run this reliably for thousands of users?”
Why This Lens Changes How You Lead
Once you truly understand system behavior, leadership conversations change.
You stop saying:
• “The model is slow.”
You start saying:
• “At concurrency 50, p95 exceeds our SLO due to KV cache pressure; reducing output length by 30% restores stability.”
That is not technical jargon.
That is decision-grade clarity.
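Getting to that sentence starts with measuring p95 against an SLO. A minimal sketch using a simplified nearest-rank percentile, with made-up latency samples:

```python
def p95(latencies):
    """Simplified nearest-rank 95th percentile."""
    ordered = sorted(latencies)
    rank = max(0, int(0.95 * len(ordered)) - 1)
    return ordered[rank]

SLO_S = 2.0                          # assumed latency objective
latencies = [0.8] * 90 + [3.5] * 10  # made-up samples with a 10% slow tail

breach = p95(latencies) > SLO_S
print("p95:", p95(latencies), "SLO breach:", breach)
```

The mean of those samples is well under the SLO; only the p95 exposes the tail. That is exactly why leadership conversations should be anchored to percentiles, not averages.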
The Silent Advantage of the Engineer’s Lens
Here’s the part nobody talks about.
When you start with the engineer’s lens:
• Developer decisions become obvious
• Optimization becomes targeted
• Strategy becomes grounded
• Budget conversations become factual
You stop guessing.
You stop overpromising.
You stop reacting.
You start designing deliberately.
This Is Not About Being the Deepest Expert
This is important.
I am not trying to be:
• The best prompt engineer
• The best ML researcher
• The best Python developer
I’m building something far more valuable:
👉 The ability to switch lenses
Engineer → Developer → Leader
Truth → Optimization → Strategy
But the first lens must be engineering.
Because without truth, everything else is storytelling.
What I’m Building Next (And Why It Matters)
This journey is not theoretical.
It’s grounded in:
• Real benchmarks
• Real saturation curves
• Real trade-offs
• Real constraints
I’m deliberately training myself to:
• Read systems, not dashboards
• Explain performance, not defend it
• Design capacity, not chase peaks
Because AI is not replacing infrastructure thinking.
It is demanding better infrastructure thinking.
A Final Reflection
If you’re working with LLMs today and something feels “off”:
• Latency feels unpredictable
• Scaling feels fragile
• Costs feel unclear
• Decisions feel emotional
You don’t need a better model.
You need a better lens.
Start with the engineer’s lens.
Everything else will follow.
If this resonates, the next posts will dive deeper into:
• Reading latency curves like traffic graphs
• Understanding saturation before users feel it
• Designing LLM platforms that fail gracefully
• Translating tokens/sec into business decisions
Because AI doesn’t need more hype.
It needs engineers who listen to systems.
—
Mohammad Iqbal