Before Prompts, Before Pipelines, Before Strategy — I Chose the Engineer’s Lens

Infographic: the engineer's lens on AI, showing a microchip, latency-spike and saturation-point graphs, and the role of capacity and concurrency in large language models.

For a long time, I thought the hard part of AI was the model.

Then I thought it was the prompt.

Then I thought it was the platform.

I was wrong.

The hardest part of working with Large Language Models (LLMs) is not what you deploy —
it’s how you understand what the system is telling you when it’s under pressure.

That realization didn’t come from a whitepaper.
It didn’t come from a demo.
And it definitely didn’t come from a “Hello World” notebook.

It came from watching latency curves collapse, throughput flatten, and p95 explode — while everything looked fine on the surface.

That’s when I stopped approaching AI as an application problem.

And started approaching it as an engineering system.


The Mistake Many of Us Make When Entering AI

Most professionals entering the AI space come from one of three backgrounds:

• Software development
• Data science
• Strategy / leadership

Each of those brings value.
But each also brings a blind spot.

Developers focus on features.
Data scientists focus on accuracy.
Leaders focus on outcomes.

But LLM platforms don’t fail first in features, accuracy, or outcomes.

They fail first in capacity, latency, and saturation.

And those are engineering problems.


Why I Chose to Start With the Engineer’s Lens

After years of building and operating large-scale infrastructure — networks, fabrics, data centers, distributed systems — I recognized something familiar when I started benchmarking LLMs.

The behavior felt… known.

Queues backing up.
Latency spikes under load.
Throughput plateaus.
Non-linear collapse points.

This wasn’t magic.

This was systems engineering.

LLMs behave far more like:

• Firewalls under burst traffic
• WAN links at peak utilization
• CPU schedulers under contention
• Storage arrays hitting IOPS limits

…than they behave like traditional applications.

Once I accepted that, everything changed.


The Engineer’s Question Is Always the Same

Not:

“Why is the model slow?”

But:

“What is the system telling me under load?”

That single question reframes everything.

Because performance is not a single number.
It’s a story over time.
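To make that concrete, here is a minimal sketch (Python, with invented sample numbers) of why a single average hides the tail that p95 reveals:

```python
import statistics

def percentile(samples, p):
    """Nearest-rank percentile of a list of latency samples (seconds)."""
    xs = sorted(samples)
    k = max(0, min(len(xs) - 1, round(p / 100 * (len(xs) - 1))))
    return xs[k]

# Hypothetical one-minute window of request latencies: the mean looks
# healthy, the tail does not.
window = [0.8, 0.9, 0.85, 0.9, 0.95, 0.9, 0.88, 0.92, 4.5, 5.2]
print(f"mean={statistics.mean(window):.2f}s  "
      f"p50={percentile(window, 50):.2f}s  "
      f"p95={percentile(window, 95):.2f}s")
# mean=1.68s  p50=0.90s  p95=5.20s
```

The mean reports a system that looks fine; the p95 reports the story two of your users actually lived.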


Understanding LLMs as Systems, Not Products

An LLM platform is not just a model.

It is a pipeline:

Input → Tokenization → Context handling → Inference → Decoding → Output

…wrapped in:

• GPU memory limits
• KV cache behavior
• Context window constraints
• Concurrency scheduling
• Queue backpressure

When something degrades, it is never random.

The system is communicating.

Your job, as an engineer, is to listen correctly.


Latency Is Not the Enemy — Blindness Is

I see many teams panic when p95 latency increases.

But latency is not the problem.

Unexplained latency is the problem.

Latency increases for identifiable reasons:
• Context length grows
• KV cache pressure increases
• Concurrency exceeds sustainable limits
• GPU memory fragmentation rises
• RAG adds retrieval overhead

If you don’t know which one, you don’t have a performance issue —
you have an observability issue.
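Fixing that starts with logging the candidate causes alongside each latency sample. A sketch (all field names and numbers here are invented for illustration) of turning a request log into an attribution:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical request log: each entry records latency plus the factors
# that could explain it.
requests = [
    {"latency_s": 0.9, "context_tokens": 500,  "concurrency": 8},
    {"latency_s": 1.1, "context_tokens": 800,  "concurrency": 10},
    {"latency_s": 3.8, "context_tokens": 6000, "concurrency": 12},
    {"latency_s": 4.2, "context_tokens": 7000, "concurrency": 11},
]

def latency_by_bucket(reqs, key, bucket_size):
    """Group mean latency by a bucketed attribute, so a latency shift can
    be attributed to a cause instead of reported as a mystery."""
    buckets = defaultdict(list)
    for r in reqs:
        buckets[r[key] // bucket_size * bucket_size].append(r["latency_s"])
    return {b: round(mean(v), 2) for b, v in sorted(buckets.items())}

print(latency_by_bucket(requests, "context_tokens", 2000))
# In this toy data, long-context requests own the slow tail, not concurrency.
```

Once every sample carries its context, "the model is slow" becomes "requests over 6,000 context tokens are slow", which is a fixable statement.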


Throughput Tells a Different Story Than Latency

One of the first engineering lessons in LLM systems:

Latency and throughput are not enemies.
They are dance partners.

You can:
• Optimize for low latency and kill throughput
• Maximize throughput and destroy user experience

The engineer’s job is to find the stable operating zone — not the maximum theoretical number.

Just like networking:
• Line-rate is not always usable-rate
• Peak bandwidth is not sustained bandwidth
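The coupling between the two has a name: Little's law. A two-line sketch (numbers invented) of why pushing load past saturation stops buying throughput:

```python
def implied_throughput(in_flight, avg_latency_s):
    """Little's law: steady-state throughput (req/s) equals the number of
    in-flight requests divided by average latency."""
    return in_flight / avg_latency_s

# Below saturation, concurrency buys throughput. Past it, latency stretches
# in proportion and the delivered rate stalls.
print(implied_throughput(10, 1.0))  # 10.0 req/s
print(implied_throughput(40, 4.0))  # 10.0 req/s: 4x the load, same rate
```

The stable operating zone is the region where adding in-flight requests still raises the delivered rate faster than it raises the tail.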


Concurrency Is Where Reality Appears

Everything looks great at concurrency 1.

At concurrency 5, still fine.

At concurrency 20… interesting.

At concurrency 40… revealing.

At concurrency 60… truth.

This is where many AI experiments die in production.

Not because the model is bad.
But because capacity planning was never done.

Concurrency exposes:
• Queue depth issues
• Scheduling inefficiencies
• Memory pressure
• Non-linear collapse points

These are not “AI problems”.

These are engineering problems.
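The shape of that progression can be sketched with a toy saturating-server model. The capacity and latency numbers below are assumptions for illustration, not measurements:

```python
def sweep(concurrencies, capacity=20, base_s=1.0):
    """Toy model of a server that saturates at `capacity` in-flight requests:
    below capacity, latency is flat; above it, queueing stretches latency
    linearly and throughput (by Little's law) stops growing."""
    rows = []
    for c in concurrencies:
        latency = base_s * max(1.0, c / capacity)  # queueing stretches latency
        throughput = c / latency                    # Little's law
        rows.append((c, round(latency, 2), round(throughput, 1)))
    return rows

for c, lat, thr in sweep([1, 5, 20, 40, 60]):
    print(f"concurrency={c:>2}  latency~{lat}s  throughput~{thr} req/s")
```

Running a sweep like this against a real endpoint, rather than a toy model, is the capacity-planning step most experiments skip.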


The Day I Realized Prompt Engineering Was Not Step One

Prompt engineering matters.

But only after you understand:
• Where your bottleneck actually is
• Whether output length dominates latency
• Whether context dominates memory
• Whether RAG dominates compute

Optimizing prompts without this knowledge is like tuning MTU sizes while ignoring packet loss.

You may feel productive.
But you’re not fixing the real issue.


Why RAG Changes Everything (and Nothing)

RAG is powerful.

It also:
• Adds latency
• Reduces throughput
• Increases variability
• Introduces new failure modes

From an engineer’s view, RAG is not “AI magic”.

It is:
• Retrieval latency
• Serialization cost
• Token expansion
• Memory pressure

Once you model it that way, you stop arguing emotionally about RAG —
and start placing it where it belongs architecturally.
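Modeled that way, a RAG request is just a sum of measurable stages. A sketch with invented rates (every parameter name and number is illustrative):

```python
def rag_latency(retrieval_s, context_tokens, prefill_tok_s,
                output_tokens, decode_tok_s):
    """End-to-end RAG latency = retrieval + prefill of the expanded
    context + decoding of the output."""
    prefill = context_tokens / prefill_tok_s
    decode = output_tokens / decode_tok_s
    return retrieval_s + prefill + decode

# Retrieval itself is rarely the dominant cost; the tokens it injects
# into the context often are.
total = rag_latency(retrieval_s=0.3, context_tokens=4000, prefill_tok_s=8000,
                    output_tokens=400, decode_tok_s=50)
print(f"{total:.1f}s")  # 0.3 retrieval + 0.5 prefill + 8.0 decode = 8.8s
```

With a decomposition like this, "should we add RAG?" becomes "can we afford 0.3s of retrieval plus the prefill cost of 4,000 extra tokens?", which is an architecture question, not an argument.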


Engineering Is About Predictability, Not Perfection

I’m not chasing:
• The fastest response
• The smartest model
• The biggest benchmark score

I’m chasing:
• Predictable latency
• Explainable degradation
• Stable capacity
• Known failure modes

Because leadership doesn’t ask:

“Is the model state of the art?”

They ask:

“Can we run this reliably for thousands of users?”


Why This Lens Changes How You Lead

Once you truly understand system behavior, leadership conversations change.

You stop saying:
• “The model is slow.”

You start saying:
• “At concurrency 50, p95 exceeds our SLO due to KV cache pressure; reducing output length by 30% restores stability.”

That is not technical jargon.

That is decision-grade clarity.


The Silent Advantage of the Engineer’s Lens

Here’s the part nobody talks about.

When you start with the engineer’s lens:

• Developer decisions become obvious
• Optimization becomes targeted
• Strategy becomes grounded
• Budget conversations become factual

You stop guessing.

You stop overpromising.

You stop reacting.

You start designing deliberately.


This Is Not About Being the Deepest Expert

This is important.

I am not trying to be:
• The best prompt engineer
• The best ML researcher
• The best Python developer

I’m building something far more valuable:

👉 The ability to switch lenses

Engineer → Developer → Leader
Truth → Optimization → Strategy

But the first lens must be engineering.

Because without truth, everything else is storytelling.


What I’m Building Next (And Why It Matters)

This journey is not theoretical.

It’s grounded in:
• Real benchmarks
• Real saturation curves
• Real trade-offs
• Real constraints

I’m deliberately training myself to:
• Read systems, not dashboards
• Explain performance, not defend it
• Design capacity, not chase peaks

Because AI is not replacing infrastructure thinking.

It is demanding better infrastructure thinking.


A Final Reflection

If you’re working with LLMs today and something feels “off”:

• Latency feels unpredictable
• Scaling feels fragile
• Costs feel unclear
• Decisions feel emotional

You don’t need a better model.

You need a better lens.

Start with the engineer’s lens.

Everything else will follow.


If this resonates, the next posts will dive deeper into:
• Reading latency curves like traffic graphs
• Understanding saturation before users feel it
• Designing LLM platforms that fail gracefully
• Translating tokens/sec into business decisions

Because AI doesn’t need more hype.

It needs engineers who listen to systems.


Mohammad Iqbal
