The Architecture Behind a 6,000% Latency Improvement at Hertz

Hertz was a nearly $10 billion company running on technology that its own CEO would publicly call "30 to 40 years old." Underneath it: 1,800 IT systems, six database vendors, 30 rental processing systems, and a core built on IBM AS/400 mainframes running COBOL. Adding a single new product required 18 separate system changes. Meanwhile, Uber and Lyft had captured over 70% of corporate ground transportation spending on expense reports — up from near zero just a few years earlier. The legacy platform wasn't just slow. It was an existential liability.
Hertz had already spent $32 million with Accenture on the digital transformation. The result was a website that never went live and code so riddled with defects that every line of frontend work had to be scrapped. Accenture's code couldn't even extend to the other brands — it was built specifically for Hertz when the whole point was a unified platform across Hertz, Dollar, Thrifty, and Firefly. After Accenture was fired, IBM came in through the Cloud Garage with business partners to pick up the pieces. I was there from day zero as a developer on the rate engine, and when the system needed to go further — when it needed to actually scale — I was the one who took it over. The rate engine became the piece I owned: one system serving all four brands, handling every pricing query across 10,000+ locations worldwide.
The Problem: Death by a Thousand Queries
We released the first version of the new rate engine and were told it was doing about 300 requests per second with a p90 of over a minute and a worst case often around 3 minutes — roughly the same as the legacy system we were replacing. That was the moment. Same throughput, cleaner code, but no actual improvement in capacity. So we went after it. The actual scale tells you why we had to — a global fleet of nearly 700,000 vehicles across 10,000+ locations, four brands, millions of pricing queries per day, with localized rates that change constantly based on market conditions, inventory, promotional windows, and regional demand curves. Hertz would ultimately commit more than $400 million to a multi-year technology transformation — and this was after the failed Accenture engagement.
At that RPS, the math doesn't work. It never worked. The system had been held together by caching patches and operational workarounds long enough that the seams were showing everywhere. A single worst-case query could bog down the entire system when multiple instances hit it repeatedly, backlogging fetches behind it.
The architecture was synchronous throughout. Every rate query hit the database directly. There was no meaningful tiered caching strategy — requests that came in a millisecond apart would both go all the way to storage rather than the first populating a cache and the second hitting it. During normal load, this was survivable. During holiday weekends or promotional events, it was catastrophic. The database would saturate, latency would spike, and the whole thing would cascade — a queue of requests backing up behind a storage layer that couldn't drain fast enough.
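To make the missing tier concrete, here's a minimal cache-aside sketch (all names hypothetical, an illustration rather than the production code): the first request populates the cache, and a request arriving a millisecond later hits memory instead of going all the way to storage.

```python
import time

class CacheAsideStore:
    """Minimal cache-aside layer: first read populates, later reads hit memory."""

    def __init__(self, fetch_from_db, ttl_seconds=5.0):
        self._fetch = fetch_from_db          # fallback to the storage layer
        self._ttl = ttl_seconds
        self._cache = {}                     # key -> (value, expires_at)
        self.db_calls = 0

    def get(self, key):
        entry = self._cache.get(key)
        if entry and entry[1] > time.monotonic():
            return entry[0]                  # served from cache, no storage hit
        value = self._fetch(key)             # cache miss: go to storage once
        self.db_calls += 1
        self._cache[key] = (value, time.monotonic() + self._ttl)
        return value

# Two requests moments apart: only the first reaches storage.
store = CacheAsideStore(fetch_from_db=lambda k: f"rate-for-{k}")
store.get("LAX~CCAR")
store.get("LAX~CCAR")
print(store.db_calls)  # → 1
```

Without even this much, every request in the old system paid the full storage round trip, which is why saturation cascaded instead of being absorbed.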
The business impact was direct. Abandoned bookings during peak periods aren't recoverable revenue — a customer who can't complete a reservation on a Friday afternoon before a long weekend books with a competitor. There's no email drip campaign that fixes that. The window closes.
That's what we were solving for. Not a bug fix. Not a performance tuning pass. A ground-up rearchitecture of how rates were stored, served, and kept current — replacing COBOL-era assumptions with patterns that could handle the actual demand.
The Architecture We Built
The core insight that unlocked everything — the realization that changed the entire architecture — was this: you don't need strong consistency for rate shopping.
When a customer is comparing rental prices, they're not executing a financial transaction. They're doing reconnaissance. The rate they see on the results page doesn't need to be the exact rate stored in the primary database at that precise millisecond. It needs to be accurate within a few seconds, reflect the correct pricing tier, and load fast enough that they don't leave. Eventual consistency with sub-second propagation across regions is indistinguishable from strong consistency to a human being browsing rental options.
Once we accepted that — once we stopped treating the read path like it needed ACID guarantees — the constraints changed entirely. We could cache aggressively. We could separate the read path from the write path. We could build for the actual SLA the use case required rather than the theoretical SLA we'd been designing to by default.
The architecture we landed on had three main components working together:
Cloudant (IBM's distributed CouchDB) as the document store, with rate-related data sharded into documents by location, date, and discount code. Redis sat in front for the read path, with a CDC stream from Cloudant that pushed changes to Redis as they happened — no cache misses on rule data, no manual invalidation.
An event streaming backbone via Kinesis for pricing propagation from the Rate Management System (RMS) into the write path.
A clean separation between the read path and the write path — two distinct microservices (HRE and HRE-Update) that could be scaled, deployed, and optimized independently.
Read path sustained >3,000 reads/sec at p95 around 30ms. Write path handled >2,500 pricing writes/sec with sub-second cross-region propagation. The 6,000% throughput improvement wasn't from one optimization — it was from attacking the architectural constraints that had artificially capped everything.
Read Path: 3,000+ Operations Per Second
Redis sat in front of everything on the read path, but it wasn't a passive cache waiting to be populated by requests. We used Cloudant's Change Data Capture (CDC) stream — a built-in feature of CouchDB-based databases — to push updates into Redis proactively. When the data around a rate changed in Cloudant — a discount code, a corporate agreement, a promotion timeframe, a sell rule — the CDC stream fired, and a cache manager process picked up the change and updated the corresponding Redis keys. The read path never had to wait for a cache miss to discover new data.
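The cache-manager logic reduces to something like the following sketch. In production the events came from Cloudant's _changes feed and the target was Redis; here a list of change events and a dict stand in so the push-on-change pattern is visible on its own.

```python
# Simplified CDC cache manager: document changes push updates into the cache
# proactively, so the read path never waits on a cache miss for rule data.
# (Event shape and key scheme are illustrative, not the production format.)

cache = {}  # stands in for Redis

def apply_change(event):
    """Mirror a document change into the cache, keyed by location + discount code."""
    key = f"{event['location']}~{event['discount_code']}"
    if event.get("deleted"):
        cache.pop(key, None)                 # rule removed: drop the cached group
    else:
        cache[key] = event["doc"]            # rule added or updated: overwrite in place

changes = [
    {"location": "LAX", "discount_code": "GMC", "doc": {"promo": "WKND10"}},
    {"location": "LAX", "discount_code": "GMC", "doc": {"promo": "WKND15"}},
    {"location": "DUB", "discount_code": "GMC", "deleted": True},
]
for event in changes:
    apply_change(event)

print(cache)  # → {'LAX~GMC': {'promo': 'WKND15'}}
```

The key property: by the time a customer request arrives, the cache already reflects the latest rule state, so there is no invalidation logic on the read path at all.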
The key design that made this work was how we structured the cache keys — and what we chose to cache versus what we didn't. Rates changed too frequently to cache. But rates were useless without knowing which of the tens of thousands of rules applied to a given request. That filtering — figuring out which rules, discounts, promotions, and eligibility criteria applied to a specific location, account, date, and car type — was the expensive part. Not the rate lookup itself.
So we did that filtering work upfront. We grouped all applicable rules, benefits, promotions, and eligibility criteria by location and corporate discount code into hashed key structures in Redis. Everything for LAX went together. Everything for a corporate code like GMC or SLAYER went together. When a rate shop request came in, we pulled the pre-filtered rule set from Redis in one fetch, then made a single targeted database call for the actual rate. Instead of making hundreds of individual rule lookups per request, we'd already done that work when the underlying data changed — not when the customer was waiting.
Actual Redis key format: LOC~RLOC~Date:RC~CarType~DiscCode (e.g., LAX~LAT~2020-01:RC001~CCAR~D). The diagram shows a simplified grouping — the real keys are compound strings encoding six dimensions.
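A hypothetical helper makes the composition explicit — six dimensions collapsing into one string key, so one fetch retrieves everything scoped to that combination:

```python
def rate_key(loc, rloc, date, rate_code, car_type, disc_code):
    """Build the compound cache key encoding all six dimensions.
    Format from the article: LOC~RLOC~Date:RC~CarType~DiscCode."""
    return f"{loc}~{rloc}~{date}:{rate_code}~{car_type}~{disc_code}"

key = rate_key("LAX", "LAT", "2020-01", "RC001", "CCAR", "D")
print(key)  # → LAX~LAT~2020-01:RC001~CCAR~D
```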
This grouping strategy was rooted in a frequency asymmetry that most caching designs miss. Corporate discount codes change maybe once a year, when the contract comes up for renewal. Rate codes change thousands of times per second across the fleet. By caching the slow-moving data (account rules, location config, benefits, sell rules) aggressively with longer TTLs, and treating the fast-moving data (individual rate prices) as the thing that came from the database at request time, we dramatically reduced the volume of work per request without sacrificing freshness where it mattered.
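In code, that asymmetry looks like a TTL policy per data kind, with rates deliberately excluded from caching entirely. The specific durations below are hypothetical; the point is the shape of the policy, not the numbers.

```python
# Match TTLs to how often the data actually moves (durations hypothetical).
TTL_SECONDS = {
    "corporate_rules": 24 * 3600,   # contracts change roughly annually
    "location_config": 6 * 3600,    # location setup moves slowly
    "promotions":      15 * 60,     # promotional windows move faster
    # individual rates are NOT listed: they come from the source of truth per request
}

def cache_set(cache, kind, key, value):
    """Store a value with the TTL class for its data kind; rates never enter."""
    if kind == "rates":
        raise ValueError("rates are read through to the database, never cached")
    cache[key] = (value, TTL_SECONDS[kind])

store = {}
cache_set(store, "corporate_rules", "GMC", {"tier": "corp"})
print(store["GMC"][1])  # → 86400
```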
The decision about what to cache and what not to cache matters as much as the caching infrastructure itself. We cached everything except the rates — the rules, discounts, promotions, corporate eligibility, and location-specific criteria that determined which rate applied. The rates themselves changed too frequently and needed to come from the source of truth. But by pre-computing and caching everything that surrounded the rate, we turned what had been hundreds of database calls per request into one Redis fetch plus one targeted DB lookup. That's where the 6,000% came from.
At 10,000+ locations with localized pricing variations, the cache hit rate on rule data during steady-state operation was the key metric. When the cache is absorbing the filtering load, the underlying storage layer only handles the targeted rate lookups it was designed for. When the cache miss rate climbs, you're back to the old problem. We monitored that ratio carefully.
Write Path: 2,500+ Operations Per Second
The write path had its own evolution story, and understanding where it started makes the final architecture more meaningful.
In the initial version, the Rate Management System (RMS) — the internal tool where Hertz's revenue team configured rates, bundled discounts, and promotional pricing — pushed updates to our write service (HRE-Update) via direct REST calls. Thousands of them. RMS would compute a new rate structure and fire off HTTP requests to HRE-Update, which had to accept them, validate them, and persist them to Cloudant. Under normal load this was manageable. During a rate restructuring event — say, adjusting pricing across all West Coast locations for a holiday weekend — the volume would spike to the point where HRE-Update couldn't keep up.

I built a custom queuing system in Cloudant to buffer the backlog — two queue types (account-promo and generic), with the ability to spin up separate queue workers per document prefix. We deployed over 20 of them, partitioned by letter, location, or doc type, so they could process in parallel without parent/child conflicts. Under steady-state load, they kept up.

The problem was burst scenarios — when a major corporate discount code changed, or during initial spin-up, locations like LAX and NYC that represent a disproportionate share of rate rules and rental volume would backlog badly. A "parked docs" mechanism handled failed inserts for retry, but the fundamental issue was that the queuing system was only as fast as the REST pipeline feeding it. Every update was a full HTTP request-response cycle — connection establishment, headers, serialization, acknowledgment, teardown. At thousands of updates per second, that overhead alone was eating a significant chunk of throughput.
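The partitioning idea is worth a sketch. This is a simplified in-memory version (the real queues lived in Cloudant documents): each worker owns a disjoint prefix slice so workers run in parallel without conflicting writes, and failed inserts are parked for retry instead of blocking the queue.

```python
from collections import defaultdict

def partition(doc_id, prefixes):
    """Route a queued doc to the worker owning its prefix; everything else is generic."""
    for p in prefixes:
        if doc_id.startswith(p):
            return p
    return "generic"

def drain(queue, prefixes, insert):
    """Fan queued docs out to per-prefix workers; park any failed inserts for retry."""
    workers = defaultdict(list)
    for doc_id in queue:
        workers[partition(doc_id, prefixes)].append(doc_id)
    parked = []
    for prefix, docs in workers.items():   # each worker drains its slice independently
        for doc_id in docs:
            try:
                insert(doc_id)
            except Exception:
                parked.append(doc_id)      # "parked docs" get retried later
    return workers, parked

queue = ["LAX-rate-1", "NYC-rate-7", "promo-GMC", "LAX-rate-2"]
workers, parked = drain(queue, ["LAX", "NYC"], insert=lambda d: None)
print(sorted(workers))  # → ['LAX', 'NYC', 'generic']
```

The disjoint-ownership property is what let 20+ workers run concurrently without parent/child document conflicts — but as the article notes, no amount of parallel draining fixes a slow pipe feeding the queue.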
The breakthrough was moving to event streaming — and honestly, just dropping the full HTTP stack was a massive speed improvement on its own. No more connection establishment, header negotiation, serialization overhead, and acknowledgment round-trips on every single update. We'd originally designed around Kafka, but AWS cut a deal that made Kinesis the practical choice. The architecture shifted: RMS published rate changes to Kinesis streams, and HRE-Update consumers pulled from those streams at their own pace. This decoupled the write path from the source system entirely — RMS didn't need to care whether HRE-Update was keeping up, and HRE-Update didn't need to handle burst REST traffic anymore. The burst scenarios that had backlogged the custom queues — corporate discount code changes hitting LAX and NYC simultaneously — were now absorbed by the stream buffer instead of hammering application-level queues.
With Kinesis in place, we could do something the REST-based approach never allowed: geo-routed updates. Kafka would have given us partition-by-key routing and consumer groups natively — with Kinesis we had to build that ourselves, encoding the routing metadata in the messages and writing custom consumer logic to select regional streams. More work, but the economics made it the right call.
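The custom routing layer amounts to selecting a regional stream per message and carrying the routing metadata inside it. A sketch, with an entirely hypothetical region map and stream names — the real consumer logic was more involved, but the selection step looked like this:

```python
# Geo-routing layered on top of Kinesis (region map and stream names hypothetical).
# Kafka would have given us partition-by-key routing natively; with Kinesis we
# encoded routing metadata in each message and picked the regional stream ourselves
# so the cluster nearest the rental location saw the update first.
REGION_STREAMS = {
    "us-west":    "rates-us-west",
    "us-central": "rates-us-central",
    "us-east":    "rates-us-east",
    "eu":         "rates-eu",
}
LOCATION_REGION = {"LAX": "us-west", "ORD": "us-central", "NYC": "us-east", "DUB": "eu"}

def route_update(location, payload):
    """Attach routing metadata and select the stream for the nearest region."""
    region = LOCATION_REGION.get(location, "us-central")   # fallback zone
    message = {"location": location, "region": region, "payload": payload}
    return REGION_STREAMS[region], message                 # target for the stream write

stream, msg = route_update("DUB", {"rate_code": "RC001", "price": 54.99})
print(stream)  # → rates-eu
```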
The geo-replication was actually a two-layer strategy. Kinesis streams wrote to each regional cluster, targeting the nearest zone to the rental locations. But Cloudant also had its own built-in replication — CouchDB's multi-master replication protocol, which we used to sync data across regions as a second propagation path. We could control the replication direction and shard it, so EU and Asia data replicated independently from US data. The US was split into East, Central, and West zones. Ireland was one of the first international rollouts — we didn't do a full global deployment, focusing on US and EU.
A rate update for LAX hit the West Coast cluster first via Kinesis, while Cloudant replication propagated it outward to other regions. A rate update for a Dublin location hit the Ireland cluster first. This meant the region most likely to serve that rate got the update fastest through two independent channels — stream routing for speed, database replication for durability.
The eventual consistency model meant that a pricing update written to the stream would appear in the read-path cache within sub-second latency under normal conditions. Not immediately — but close enough that the gap was invisible to customers and acceptable to the business. When we framed it that way, the objection to eventual consistency disappeared. The alternative — synchronous writes propagating to every cache layer before acknowledging the update — would have strangled the write path at the throughput we needed.
The consistency vs availability trade-off was explicit and documented. We chose availability for the read path and eventual consistency for writes. The system would serve slightly stale pricing data for a brief window after a price change rather than block reads while writes propagated. For rate shopping, that's the right call. For a final booking transaction, you validate against fresh pricing data before completing the reservation — different code path, different consistency requirements.
p95 Under 30ms: The Latency Story
Averages lie. This is not new information, but it bears repeating because teams that optimize to average latency and ignore the tail will discover their mistake during peak traffic.
At 3,000+ reads per second, a p95 of 30ms means 95% of requests complete in under 30ms. The remaining 5% — 150 requests per second at steady state — are the ones you need to understand. What causes them? Where's the time going? What does that tail distribution look like under load?
For us, the tail latency was dominated by two things: cache misses that fell through to Cloudant, and connection establishment overhead during burst traffic. Both were solvable.
Connection pooling eliminated the burst overhead. Instead of establishing new connections to Redis and Cloudant under load, we maintained warm connection pools sized for peak concurrency. The connection establishment latency — which is small in isolation but adds up when you're handling thousands of requests per second — stopped contributing to the tail.
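The mechanics are generic enough to sketch without any particular client library — connections are established once at startup and checked out and in, so burst traffic never pays connection-establishment latency on the request path:

```python
import queue

class WarmPool:
    """Minimal warm connection pool: pre-establish at startup, reuse under load."""

    def __init__(self, connect, size):
        self._pool = queue.Queue()
        self.created = 0
        for _ in range(size):                    # dial out at startup, not per request
            self._pool.put(connect())
            self.created += 1

    def acquire(self, timeout=1.0):
        return self._pool.get(timeout=timeout)   # brief wait instead of a new handshake

    def release(self, conn):
        self._pool.put(conn)                     # return the connection for reuse

pool = WarmPool(connect=lambda: object(), size=8)
conn = pool.acquire()
pool.release(conn)
print(pool.created)  # → 8
```

Production clients generally ship this built in — redis-py, for instance, exposes a ConnectionPool — the important decision is sizing it for peak concurrency rather than steady state.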
Strategic denormalization did the most work. By storing precomputed rule and eligibility summaries at the cache layer — grouped by that location/account key structure I described earlier — we eliminated the assembly cost at query time. A request for LAX pricing data retrieved a single pre-built document containing all applicable rules and eligibility criteria rather than joining dozens of individual lookups under load. The p95 improvement from this alone was significant.
Worst-case latency stayed well within 500ms even under extreme load — graceful degradation rather than cascading failure. The system had explicit shed-load behavior: under sustained overload, it would deprioritize less-time-sensitive work rather than blocking the entire request queue. That's the difference between a system that bends and one that breaks.
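The admission decision at the heart of shed-load behavior can be sketched in a few lines (threshold and request shape hypothetical): under sustained overload, less-time-sensitive work is shed while customer-facing requests stay first-class.

```python
MAX_QUEUE_DEPTH = 1000  # hypothetical overload threshold

def admit(request, queue_depth):
    """Decide whether to serve or shed a request under load."""
    if queue_depth < MAX_QUEUE_DEPTH:
        return "serve"                 # normal operation: everything is served
    if request.get("time_sensitive"):
        return "serve"                 # customer-facing rate shops stay first-class
    return "shed"                      # background/bulk work is deferred or dropped

print(admit({"time_sensitive": False}, queue_depth=1500))  # → shed
print(admit({"time_sensitive": True}, queue_depth=1500))   # → serve
```

The point isn't the three-line policy — it's that the policy exists and was chosen deliberately, before the first overload incident rather than after it.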
The Team Behind It
I need to say this clearly: I didn't do this alone. Not even close.
The hashing strategy that made the entire cache key grouping work — the shared hash values that let CDC correctly update, delete, and maintain the grouped object lists underneath — that was Jerry (Gerardo Leon). He figured out how to structure the hashes so that related items could be fetched together and updated together consistently. That idea is the foundation the whole read path is built on. IBM's performance infrastructure engineers assisted as we built out the Kubernetes clusters and scaled the test harness itself to keep pace with the infrastructure it was testing.
We had two Aarons on the team (yes, two). Taffy was a beast of a software engineer and the metalhead who added SLAYER as a test discount code — it stuck. Aiden set up some of the toughest testing infrastructure I've worked with, building end-to-end performance verification on our event streams that let us prove the system was solid, not just fast. Don Matthews helped clean up the Kinesis streaming layer. Andrew, Dominika, and Layne were in the trenches through the hardest phases. The original build had another Matt, Ravi, Steve, and others who laid the groundwork before we took it to the next level. There are many I'm forgetting who deserve a callout. Feel free to add a comment and add more!
The testing deserves its own callout. We didn't just have unit tests — we had integration tests, end-to-end tests, and performance tests with full reporting at every step. I rewrote our entire e2e suite from JMeter — which was painfully slow, spinning up and tearing down JVMs for every run — into Taurus with BlazeMeter. Optimized, parallelized, maintained in YAML files instead of brittle XML. I did that rewrite on a flight to France because it was bothering me that much: tests that take too long to run waste time, and tests that waste time never get maintained.
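For readers who haven't seen Taurus, a minimal scenario gives the flavor of the change — declarative YAML in place of JMeter's XML trees. This fragment is illustrative only; the endpoint, concurrency, and durations are invented, not the real suite.

```yaml
# Minimal Taurus scenario (all values hypothetical)
execution:
- concurrency: 100        # virtual users
  ramp-up: 1m
  hold-for: 10m
  scenario: rate-shop

scenarios:
  rate-shop:
    requests:
    - url: https://rates.example.internal/shop?loc=LAX&car=CCAR
```

A file like this diffs cleanly in code review and runs without a JVM cold start per iteration — which is most of why the rewrite paid for itself.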
We had complexity checkers throughout the codebase and refactored often. We built a way to print full decision reports when someone shopped a rate — every rate that went into the calculation, what came out, and why. You don't take a system this complex live across four brands without being able to verify every decision it makes. The old system followed its own logic built over decades, and ours didn't replicate it exactly — it was a new architecture with new patterns. The only way to prove it was correct was to test it at every layer and make the reasoning visible.
This was one of the best teams I've ever worked with. If I've gotten any of the details wrong here — it's been a while — feel free to call me out in the comments.
Lessons for Your Next Performance Overhaul
There are a few generalizable things from this work that I've found useful across every system I've touched since.
Measure before you optimize. The instinct when a system is slow is to start tuning the code. The actual first step is instrumenting the request path well enough to know where the time is going. At Hertz, the problem wasn't slow code — it was a synchronous architecture making too many round trips to storage. No amount of code optimization would have moved that number by 6,000%.
Caching solves most read problems. Async solves most write problems. This is the 80/20 of performance work. Before you reach for sharding, horizontal scaling, or re-platforming, understand whether your read path has a caching strategy and whether your write path is blocking on things it doesn't need to block on. Most systems I've seen haven't exhausted either of those levers when they start talking about infrastructure investment.
Look for frequency asymmetry. Not all data changes at the same rate. Corporate discount rules changed annually. Individual rates changed thousands of times per second. Caching everything with the same TTL wastes either freshness or compute. Match your invalidation strategy to how often the data actually moves.
Know what consistency level your use case actually requires. Strong consistency is expensive. It's the right choice for financial transactions, inventory commits, and anything where two systems acting on stale data produces a real-world problem. It's not the right choice for read-heavy use cases where the cost of eventual consistency is a user seeing data that's two seconds old. Be explicit about which category your system falls into. Default assumptions here are what cost the Hertz legacy system its throughput ceiling.
Graceful degradation is an architecture decision, not a fallback. Systems that fail catastrophically under load are systems where nobody made explicit decisions about what happens when limits are hit. The decision to shed load rather than cascade failures was made in the design phase, not after an incident.
Let the architecture evolve. We didn't start with Kinesis and geo-routed updates. We started with REST calls and a custom queue. Each iteration solved the most pressing bottleneck and revealed the next one. The final architecture was the product of multiple phases — stabilizing the legacy system, building a parallel read path, migrating traffic gradually, and decommissioning the synchronous path once the new one had earned trust under production load. That sequence matters. You don't pull the old system before the new one has proven itself.
If you're sitting on a system that's hitting its throughput ceiling — legacy rate engines, pricing systems, high-read APIs with inadequate caching — or if you're making an architectural bet right now that could use a second opinion, I do 30-minute architecture calls at cal.com/mdostal/meet. No pitch. Just a real conversation about the problem.