
The Architecture Behind a 6,000% Throughput Improvement at Hertz

March 24, 2026 · 11 min read · By Mathew Dostal

Hertz was a $9.8 billion company running on a 30-year-old technology platform. The CEO said that publicly. Underneath it: 1,800 IT systems, six database vendors, 30 rental processing systems, and a core built on IBM AS/400 mainframes running COBOL. Adding a single new product required 18 separate system changes. Meanwhile, Uber and Lyft had captured 72% of corporate ground transportation spending — up from near zero in 2014. The legacy platform wasn't just slow. It was an existential liability.

Hertz had already spent $32 million with Accenture on the digital transformation. The result was a website that never went live and code so riddled with defects that every line of frontend work had to be scrapped. Accenture's code couldn't even extend to the other brands — it was built specifically for Hertz when the whole point was a unified platform across Hertz, Dollar, Thrifty, and Firefly. When Accenture was fired, IBM came in through the Cloud Garage with business partners to pick up the pieces. I was there from day zero as a developer on the rate engine. When the system needed to go further — when it needed to actually scale — they realized I was the one to take it over. The rate engine became the piece I owned: one system serving all four brands, handling every pricing query across 10,000+ locations worldwide.

The Problem: Death by a Thousand Queries

We released the first version of the new rate engine and were told it was doing about 60 requests per second — roughly the same as the legacy system we were replacing. That was the moment. Same throughput, cleaner code, but no actual improvement in capacity. So we went after it. The actual scale tells you why we had to — a global fleet of nearly 700,000 vehicles across 10,000+ locations, four brands, millions of pricing queries per day, with localized rates that change constantly based on market conditions, inventory, promotional windows, and regional demand curves. The company was spending approximately $400 million a year on IT just to keep the lights on.

At ~60 RPS, the math doesn't work. It never worked. The system had been held together by caching patches and operational workarounds long enough that the seams were showing everywhere.

The architecture was synchronous throughout. Every rate query hit the database directly. There was no meaningful tiered caching strategy — requests that came in a millisecond apart would both go all the way to storage rather than the first populating a cache and the second hitting it. During normal load, this was survivable. During holiday weekends or promotional events, it was catastrophic. The database would saturate, latency would spike, and the whole thing would cascade — a queue of requests backing up behind a storage layer that couldn't drain fast enough.

The business impact was direct. Abandoned bookings during peak periods aren't recoverable revenue — a customer who can't complete a reservation on a Friday afternoon before a long weekend books with a competitor. There's no email drip campaign that fixes that. The window closes.

That's what we were solving for. Not a bug fix. Not a performance tuning pass. A ground-up rearchitecture of how rates were stored, served, and kept current — replacing COBOL-era assumptions with patterns that could handle the actual demand.

The Architecture We Built

The core insight that unlocked everything was this: you don't need strong consistency for rate shopping.

When a customer is comparing rental prices, they're not executing a financial transaction. They're doing reconnaissance. The rate they see on the results page doesn't need to be the exact rate stored in the primary database at that precise millisecond. It needs to be accurate within a few seconds, reflect the correct pricing tier, and load fast enough that they don't leave. Eventual consistency with sub-second propagation across regions is indistinguishable from strong consistency to a human being browsing rental options.

Once we accepted that — once we stopped treating the read path like it needed ACID guarantees — the constraints changed entirely. We could cache aggressively. We could separate the read path from the write path. We could build for the actual SLA the use case required rather than the theoretical SLA we'd been designing to by default.

The architecture we landed on had three main components working together:

  • A multi-tier caching layer with Redis for hot paths and Cloudant (IBM's distributed CouchDB) for persistent storage, connected by a CDC stream that kept Redis warm without manual intervention
  • An event streaming backbone via Kinesis for pricing propagation from the Rate Management System (RMS) into the write path
  • A clean separation between the read path and the write path — two distinct microservices (HRE and HRE-Update) that could be scaled, deployed, and optimized independently

The read path sustained more than 3,000 reads per second at a p95 around 30ms. The write path handled more than 2,500 pricing writes per second with sub-second cross-region propagation. The 6,000% throughput improvement wasn't from one optimization — it was from attacking the architectural constraints that had artificially capped everything.

Read Path: 3,000+ Operations Per Second

Redis sat in front of everything on the read path, but it wasn't a passive cache waiting to be populated by requests. We used Cloudant's Change Data Capture (CDC) stream — a built-in feature of CouchDB-based databases — to push updates into Redis proactively. When a rate document changed in Cloudant, the CDC stream fired, and a cache manager process picked up the change and updated the corresponding Redis keys. The read path never had to wait for a cache miss to discover new data.
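
To make the shape of that flow concrete, here's a minimal sketch of a cache warmer following Cloudant's continuous _changes feed. The database URL, credentials handling, key layout, and document fields are illustrative assumptions, not the production schema.

```python
# Minimal sketch: follow Cloudant's continuous _changes feed and push
# updates into Redis. Database URL and key layout are illustrative.
import json
import requests
import redis

CLOUDANT_URL = "https://example-account.cloudant.com/rates"  # hypothetical database
r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def refresh_cache(doc: dict) -> None:
    """Update the Redis keys that correspond to a changed rate document."""
    key = f"rates:{doc['location']}:{doc.get('cdp', 'PUBLIC')}"  # illustrative key shape
    r.hset(key, doc["rate_code"], json.dumps(doc))

def follow_changes(since: str = "now") -> None:
    """Long-running consumer of the CDC stream; the read path never waits on this."""
    params = {"feed": "continuous", "include_docs": "true", "since": since}
    with requests.get(f"{CLOUDANT_URL}/_changes", params=params,
                      stream=True, timeout=90) as resp:
        for line in resp.iter_lines():
            if not line:  # heartbeat newlines keep the connection alive
                continue
            change = json.loads(line)
            if "doc" in change:
                refresh_cache(change["doc"])

if __name__ == "__main__":
    follow_changes()
```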

The key design that made this work was how we structured the cache keys. Instead of caching individual rate lookups one at a time, we grouped related data by location and corporate discount code. Everything for LAX went into a hashed key structure together — all applicable rules, benefits, promotions, rate codes. Same for a corporate discount code like GMC or SLAYER. When a rate shop request came in for a specific location and account, we could pull everything we needed in one fetch instead of making dozens of individual lookups.
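
A minimal sketch of what that grouped read looks like, assuming an illustrative Redis hash layout keyed by location and discount code:

```python
# Sketch of the grouped-key read: everything for one location/account pair
# lives in a single Redis hash, so a rate shop is one round trip.
# Key layout and field names are illustrative, not the production schema.
import json
import redis

r = redis.Redis(decode_responses=True)

def rate_shop(location: str, cdp_code: str = "PUBLIC") -> dict:
    """Pull all rules, benefits, promotions, and rate codes for one location/account."""
    fields = r.hgetall(f"rates:{location}:{cdp_code}")  # one fetch, not dozens
    return {name: json.loads(value) for name, value in fields.items()}

# Example: everything applicable to LAX under a corporate discount code.
lax_offer = rate_shop("LAX", "GMC")
```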

This grouping strategy was rooted in a frequency asymmetry that most caching designs miss. Corporate discount codes change maybe once a year, when the contract comes up for renewal. Rate codes change thousands of times per second across the fleet. By caching the slow-moving data (account rules, location config, benefits) aggressively with longer TTLs, and treating the fast-moving data (individual rate prices) as the only thing that needed frequent invalidation, we dramatically reduced the volume of cache operations without sacrificing freshness where it mattered.
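
In code, that asymmetry reduces to assigning TTLs by volatility. The durations and key prefixes below are assumptions for illustration rather than the values we ran in production:

```python
# Sketch of matching TTLs to how often the data actually moves.
# Durations and key prefixes are illustrative assumptions.
import redis

r = redis.Redis(decode_responses=True)

SLOW_MOVING_TTL = 24 * 60 * 60  # account rules, location config: change roughly yearly
FAST_MOVING_TTL = 5             # individual rate prices: kept fresh by the CDC stream

def cache_account_rules(cdp_code: str, payload: str) -> None:
    r.set(f"account:{cdp_code}", payload, ex=SLOW_MOVING_TTL)

def cache_rate_price(location: str, rate_code: str, payload: str) -> None:
    r.set(f"price:{location}:{rate_code}", payload, ex=FAST_MOVING_TTL)
```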

The decision about what to cache and what not to cache matters as much as the caching infrastructure itself. We cached hot rate lookups — the pricing data that customers actually browse when comparing options across locations and date ranges. We didn't cache inventory availability in the same tier, because inventory state is more volatile and the staleness cost is asymmetric. A stale price displayed for two seconds is fine. An overbooked vehicle is not.

At 10,000+ locations with localized pricing variations, the cache hit rate during steady-state operation was the key metric. When the cache is absorbing the load, the underlying storage layer stays healthy. When the cache miss rate climbs, you're back to the old problem. We monitored that ratio carefully.
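
Redis exposes its own keyspace counters, which makes the ratio cheap to watch. A sketch, with an illustrative alert threshold:

```python
# Sketch of the hit-ratio check: when misses climb, load is falling
# through to the storage layer again. The alert threshold is illustrative.
import redis

r = redis.Redis()

def cache_hit_ratio() -> float:
    stats = r.info("stats")
    hits = stats["keyspace_hits"]
    misses = stats["keyspace_misses"]
    total = hits + misses
    return hits / total if total else 1.0

if cache_hit_ratio() < 0.95:
    print("cache miss rate climbing: storage layer is absorbing load again")
```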

Write Path: 2,500+ Operations Per Second

The write path had its own evolution story, and understanding where it started makes the final architecture more meaningful.

In the initial version, the Rate Management System (RMS) — the internal tool where Hertz's revenue team configured rates, bundled discounts, and promotional pricing — pushed updates to our write service (HRE-Update) via direct REST calls. Thousands of them. RMS would compute a new rate structure and fire off HTTP requests to HRE-Update, which had to accept them, validate them, and persist them to Cloudant. Under normal load this was manageable. During a rate restructuring event — say, adjusting pricing across all West Coast locations for a holiday weekend — the volume would spike to the point where HRE-Update couldn't keep up. We built a custom queue in Cloudant to buffer the backlog, with a "parked docs" mechanism for failed inserts that could be retried later. It worked, but it was ugly. The queue itself became a bottleneck, and debugging failed updates meant spelunking through queue databases.

The breakthrough was moving to event streaming. We'd originally designed around Kafka, but AWS cut a deal that made Kinesis the practical choice. The architecture shifted: RMS published rate changes to Kinesis streams, and HRE-Update consumers pulled from those streams at their own pace. This decoupled the write path from the source system entirely — RMS didn't need to care whether HRE-Update was keeping up, and HRE-Update didn't need to handle burst REST traffic anymore.
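
A stripped-down sketch of that pattern with boto3 shows both sides of the decoupling. The stream name and event fields are hypothetical, and persist_to_cloudant is a placeholder for the validate-and-write step:

```python
# Sketch of the decoupled write path: RMS publishes rate changes to a
# Kinesis stream, HRE-Update consumes at its own pace. The stream name,
# event fields, and persist_to_cloudant helper are illustrative assumptions.
import json
import boto3

kinesis = boto3.client("kinesis", region_name="us-west-2")

def publish_rate_change(event: dict) -> None:
    """Producer side (RMS): fire-and-forget publish of a rate change."""
    kinesis.put_record(
        StreamName="rate-updates",           # hypothetical stream name
        Data=json.dumps(event).encode(),
        PartitionKey=event["location"],      # keeps a location's updates ordered
    )

def persist_to_cloudant(rate_doc: dict) -> None:
    """Placeholder for HRE-Update's validate-and-write step into Cloudant."""
    ...

def consume(shard_id: str) -> None:
    """Consumer side (HRE-Update): pull records from one shard at our own pace."""
    iterator = kinesis.get_shard_iterator(
        StreamName="rate-updates", ShardId=shard_id, ShardIteratorType="LATEST"
    )["ShardIterator"]
    while True:
        batch = kinesis.get_records(ShardIterator=iterator, Limit=500)
        for record in batch["Records"]:
            persist_to_cloudant(json.loads(record["Data"]))
        iterator = batch["NextShardIterator"]
```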

With Kinesis in place, we could do something the REST-based approach never allowed: geo-routed updates. Rate changes were tagged with their target region and routed to the nearest cluster. The US was split into East, Central, and West zones, with additional regional clusters for Europe, Asia, and other markets. A rate update for LAX hit the West Coast cluster first and propagated outward. A rate update for London hit Europe first. This meant the region most likely to serve that rate got the update fastest, and other regions caught up within sub-second latency.
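
Conceptually, the routing is a lookup from location to regional stream before publishing. The region map and stream names below are illustrative, not the actual topology:

```python
# Sketch of geo-routing: tag each update with its region and publish to the
# cluster most likely to serve it first. Region map and stream names are
# illustrative, not the actual topology.
import json
import boto3

kinesis = boto3.client("kinesis", region_name="us-west-2")

REGION_STREAMS = {
    "us-west":    "rate-updates-us-west",
    "us-central": "rate-updates-us-central",
    "us-east":    "rate-updates-us-east",
    "europe":     "rate-updates-eu",
    "asia":       "rate-updates-apac",
}

LOCATION_REGION = {"LAX": "us-west", "ORD": "us-central", "LHR": "europe"}  # sample entries

def route_update(event: dict) -> None:
    region = LOCATION_REGION.get(event["location"], "us-east")
    kinesis.put_record(
        StreamName=REGION_STREAMS[region],
        Data=json.dumps(event).encode(),
        PartitionKey=event["location"],
    )
```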

The eventual consistency model meant that a pricing update written to the stream would appear in the read-path cache within sub-second latency under normal conditions. Not immediately — but close enough that the gap was invisible to customers and acceptable to the business. When we framed it that way, the objection to eventual consistency disappeared. The alternative — synchronous writes propagating to every cache layer before acknowledging the update — would have strangled the write path at the throughput we needed.

The consistency-versus-availability trade-off was explicit and documented. We chose availability for the read path and eventual consistency for writes. The system would serve slightly stale pricing data for a brief window after a price change rather than block reads while writes propagated. For rate shopping, that's the right call. For a final booking transaction, you validate against fresh pricing data before completing the reservation — different code path, different consistency requirements.
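
A sketch of that split, with hypothetical key and document-id schemes: the shop call reads the cache, while the booking call goes back to the primary store.

```python
# Sketch of the two consistency tiers: rate shopping reads the cache,
# the final booking re-validates against the primary store. Key and
# document-id schemes are illustrative; auth is omitted for brevity.
import json
from typing import Optional

import requests
import redis

r = redis.Redis(decode_responses=True)
CLOUDANT_URL = "https://example-account.cloudant.com/rates"  # hypothetical database

def shop_rate(location: str, rate_code: str) -> Optional[dict]:
    """Read path: eventual consistency is fine while the customer is browsing."""
    cached = r.hget(f"rates:{location}:PUBLIC", rate_code)
    return json.loads(cached) if cached else None

def confirm_booking_price(location: str, rate_code: str) -> dict:
    """Booking path: validate against fresh pricing before committing the reservation."""
    doc_id = f"{location}:{rate_code}"  # illustrative document id scheme
    resp = requests.get(f"{CLOUDANT_URL}/{doc_id}", timeout=2)
    resp.raise_for_status()
    return resp.json()
```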

p95 Under 30ms: The Latency Story

Averages lie. This is not new information, but it bears repeating because teams that optimize to average latency and ignore the tail will discover their mistake during peak traffic.

At 3,000+ reads per second, a p95 of 30ms means 95% of requests complete in under 30ms. The remaining 5% — 150 requests per second at steady state — are the ones you need to understand. What causes them? Where's the time going? What does that tail distribution look like under load?

For us, the tail latency was dominated by two things: cache misses that fell through to Cloudant, and connection establishment overhead during burst traffic. Both were solvable.

Connection pooling eliminated the burst overhead. Instead of establishing new connections to Redis and Cloudant under load, we maintained warm connection pools sized for peak concurrency. The connection establishment latency — which is small in isolation but adds up when you're handling thousands of requests per second — stopped contributing to the tail.
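
A minimal sketch of the pooling setup, with illustrative pool sizes:

```python
# Sketch of warm connection pools sized for peak concurrency, so connection
# setup stops contributing to tail latency. Pool sizes are illustrative.
import redis
import requests
from requests.adapters import HTTPAdapter

# Redis: a shared pool instead of a new connection per request.
redis_pool = redis.ConnectionPool(host="localhost", port=6379, max_connections=200)
r = redis.Redis(connection_pool=redis_pool)

# Cloudant over HTTP: a session with a bounded pool of keep-alive connections.
cloudant = requests.Session()
cloudant.mount("https://", HTTPAdapter(pool_connections=10, pool_maxsize=200))
```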

Strategic denormalization did the most work. By storing precomputed rate summaries at the cache layer — grouped by that location/account key structure I described earlier — we eliminated the assembly cost at query time. A request for LAX pricing data retrieved a single pre-built document containing all applicable rules, rates, and benefits rather than joining dozens of individual lookups under load. The p95 improvement from this alone was significant.

Worst-case latency stayed well within 500ms even under extreme load — graceful degradation rather than cascading failure. The system had explicit shed-load behavior: under sustained overload, it would deprioritize less-time-sensitive work rather than blocking the entire request queue. That's the difference between a system that bends and one that breaks.
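
A simplified sketch of the idea, using an illustrative queue-depth threshold and priority scheme:

```python
# Sketch of explicit shed-load behavior: under sustained overload, reject the
# less time-sensitive work instead of letting the whole queue back up.
# The depth threshold and priority labels are illustrative assumptions.
import queue

MAX_QUEUE_DEPTH = 5_000
work_queue = queue.Queue()

def admit(request: dict) -> bool:
    """Admission control at the edge of the service."""
    if work_queue.qsize() < MAX_QUEUE_DEPTH:
        work_queue.put(request)
        return True
    if request.get("priority") == "rate_shop":  # keep the revenue-critical path alive
        work_queue.put(request)
        return True
    return False  # shed: caller returns a fast "try again" instead of queuing
```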

Lessons for Your Next Performance Overhaul

There are a few generalizable things from this work that I've found useful across every system I've touched since.

Measure before you optimize. The instinct when a system is slow is to start tuning the code. The actual first step is instrumenting the request path well enough to know where the time is going. At Hertz, the problem wasn't slow code — it was a synchronous architecture making too many round trips to storage. No amount of code optimization would have moved that number by 6,000%.

Caching solves most read problems. Async solves most write problems. This is the 80/20 of performance work. Before you reach for sharding, horizontal scaling, or re-platforming, understand whether your read path has a caching strategy and whether your write path is blocking on things it doesn't need to block on. Most systems I've seen haven't exhausted either of those levers when they start talking about infrastructure investment.

Look for frequency asymmetry. Not all data changes at the same rate. Corporate discount rules changed annually. Individual rates changed thousands of times per second. Caching everything with the same TTL wastes either freshness or compute. Match your invalidation strategy to how often the data actually moves.

Know what consistency level your use case actually requires. Strong consistency is expensive. It's the right choice for financial transactions, inventory commits, and anything where two systems acting on stale data produces a real-world problem. It's not the right choice for read-heavy use cases where the cost of eventual consistency is a user seeing data that's two seconds old. Be explicit about which category your system falls into. Default assumptions here are what cost the Hertz legacy system its throughput ceiling.

Graceful degradation is an architecture decision, not a fallback. Systems that fail catastrophically under load are systems where nobody made explicit decisions about what happens when limits are hit. The decision to shed load rather than cascade failures was made in the design phase, not after an incident.

Let the architecture evolve. We didn't start with Kinesis and geo-routed updates. We started with REST calls and a custom queue. Each iteration solved the most pressing bottleneck and revealed the next one. The final architecture was the product of multiple phases — stabilizing the legacy system, building a parallel read path, migrating traffic gradually, and decommissioning the synchronous path once the new one had earned trust under production load. That sequence matters. You don't pull the old system before the new one has proven itself.

If you're sitting on a system that's hitting its throughput ceiling — legacy rate engines, pricing systems, high-read APIs with inadequate caching — or if you're making an architectural bet right now that could use a second opinion, I do 30-minute architecture calls at cal.com/mdostal/meet. No pitch. Just a real conversation about the problem.


