Imagine looking for a flight on a travel website and waiting 10 seconds for the results to load. Feels like an eternity, right? Modern travel search platforms must return results almost instantly, even under heavy load. Yet, not long ago, our travel search engine’s API had a p95 latency hovering around 10 seconds, meaning 5% of user searches, often during peak traffic, took 10 seconds or more. The result: frustrated users, a high bounce rate, and, worse, lost sales. Latency reduction, then, was non-negotiable.
This article is a real-world case study of how we evolved our cloud infrastructure to slay the latency dragon. By leveraging Google Cloud Run for scalable compute and Redis for smart caching, we brought our search API’s p95 latency down from ~10 seconds to ~2 seconds. Here, we walk through the entire latency-reduction process: the performance bottlenecks, the optimizations, and the dramatic improvements they brought.
The sky-high latency was a serious problem. Delving deep into it, we found multiple culprits dragging down our response times. All of these had a common factor – they made our search API do a lot of heavy lifting on each request. Before we could achieve an overall latency reduction, we had to pin down exactly what that heavy lifting was: redundant calls to external fare providers and our database on every search, repeated lookups of reference data that rarely changes, and an infrastructure setup prone to cold starts under load.
All these factors formed a perfect storm for slow p95 latency. Under the hood, our architecture simply wasn’t optimized for speed. Queries were doing redundant work, and our infrastructure was not tuned for a latency-sensitive workload. The good news? Each bottleneck was an opportunity for latency reduction.
We targeted latency reduction on two major fronts: caching to avoid repetitive work, and Cloud Run optimizations to minimize cold-start and processing overhead. Here is how the backend evolved:
We deployed a Redis cache to short-circuit expensive operations on hot paths. The idea was pretty straightforward: store the results of frequent or recent queries, and serve those directly for subsequent requests. For example, when a user searched for flights from NYC to LON for certain dates, our API would fetch and compile the results once. It would then cache that “fare response” in Redis for a short period.
If another user (or the same user) made the same search shortly after, the backend could return the cached fare data in milliseconds, avoiding repeated calls to external APIs and database queries. By avoiding expensive upstream calls on cache hits, we dramatically reduced latency for hot queries.
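To make the pattern concrete, here is a minimal cache-aside sketch in Python using redis-py. The key format, TTL value, and the `fetch_fares_from_suppliers()` helper are illustrative assumptions, not our production code:

```python
# Minimal cache-aside sketch for fare searches, assuming a redis-py client and a
# hypothetical fetch_fares_from_suppliers() that performs the slow upstream calls.
import json
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

FARE_TTL_SECONDS = 300  # keep fare responses for a few minutes at most


def search_fares(origin: str, destination: str, date: str) -> dict:
    """Return fare results, serving from Redis when the same search was done recently."""
    cache_key = f"fares:{origin}:{destination}:{date}"

    cached = r.get(cache_key)
    if cached is not None:
        # Cache hit: skip the external APIs and database entirely.
        return json.loads(cached)

    # Cache miss: do the expensive work once, then store it with a short TTL.
    results = fetch_fares_from_suppliers(origin, destination, date)  # hypothetical slow path
    r.setex(cache_key, FARE_TTL_SECONDS, json.dumps(results))
    return results
```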
We applied caching to other data as well, such as static or slow-changing reference data: airport codes, city metadata, and currency exchange rates all moved into the cache. Rather than hitting our database for airport info on each request, the service now retrieves it from Redis (populated at startup or on first use). This cut out a lot of minor lookups that were adding milliseconds here and there, which add up under load.
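For reference data, one simple approach (sketched below with an illustrative key name and a hypothetical `load_airports_from_db()` helper) is to preload it into a Redis hash at startup and read individual entries from there instead of the database:

```python
# Sketch: preload slow-changing airport metadata into a Redis hash at startup.
# load_airports_from_db() is a hypothetical one-time database query.
import json
import redis

r = redis.Redis(decode_responses=True)


def warm_airport_cache() -> None:
    """Run once at startup (or on first use) so requests never hit the database for this."""
    airports = load_airports_from_db()  # e.g. {"JFK": {...}, "LHR": {...}}
    r.hset("ref:airports", mapping={code: json.dumps(info) for code, info in airports.items()})


def get_airport(code: str) -> dict:
    raw = r.hget("ref:airports", code)
    return json.loads(raw) if raw else {}
```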
As a rule of thumb, we decided to “cache what’s hot.” Popular routes, recently fetched prices, and static reference data like airport info were all kept readily available in memory. To keep cached data fresh (important where prices change), we set sensible TTL (time-to-live) expirations and invalidation rules. For instance, fare search results were cached for a few minutes at most.
After that, they would expire, so new searches would get up-to-date prices. For highly volatile data, we could even proactively invalidate cache entries when we detected changes. As the Redis docs note, flight prices often update only “every few hours,” so a short TTL combined with event-based invalidation balances freshness against speed.
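When a price change is detected for a route, the corresponding cache entries can be dropped immediately rather than waiting for the TTL. A hedged sketch, assuming the same key pattern as the cache-aside example above (the event shape and key layout are illustrative):

```python
# Sketch: event-based invalidation to complement short TTLs.
# When a fare-update event arrives for a route, delete its cached search results
# so the next search fetches fresh prices.
import redis

r = redis.Redis(decode_responses=True)


def invalidate_route(origin: str, destination: str) -> int:
    """Remove all cached fare responses for a route; returns how many keys were dropped."""
    deleted = 0
    # SCAN (not KEYS) so we don't block Redis while iterating over the keyspace.
    for key in r.scan_iter(match=f"fares:{origin}:{destination}:*"):
        deleted += r.delete(key)
    return deleted
```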
The outcome? On cache hits, the response time per query dropped from multiple seconds to a few hundred milliseconds or less, thanks to Redis serving data blazingly fast from memory. In fact, industry reports show that an in-memory “fare cache” can turn a multi-second flight query into a response in just tens of milliseconds. While our results weren’t quite that instant across the board, this caching layer delivered a huge boost: significant latency reduction, especially for repeat searches and popular queries.
Caching helped with repeated work, but we also needed to optimize performance for first-time queries and scale-ups. We therefore fine-tuned our Cloud Run service for low latency.
We enabled minimum instances = 1 for the Cloud Run service. This guaranteed that at least one container was up and ready to receive requests even during idle periods, so the first user request no longer incurred a cold-start penalty. Google’s engineers note that keeping a minimum instance can dramatically improve performance for latency-sensitive apps by eliminating the zero-to-one startup delay.
In our case, setting min instances to 1 (and even 2 or 3 during peak hours) meant users weren’t stuck waiting for containers to spin up. The p95 latency saw a significant drop from this one optimization alone.
We revisited our concurrency setting. After ensuring our code could handle parallel requests safely, we raised the Cloud Run concurrency from 1 to a higher number. We experimented with values like 5 and 10, and eventually settled on 5 for our workload. This meant each container could handle up to 5 simultaneous searches before a new instance needed to start.
Result – fewer new instances spawned during traffic spikes. This, in turn, meant fewer cold starts and less overhead. Essentially, we let each container do a bit more work in parallel, up to the point where CPU usage was still healthy. We monitored CPU and memory closely – our goal was to use each instance efficiently without overloading it.
This tuning helped smooth out latency during bursts: if 10 requests came in at once, instead of 10 cold starts (with concurrency = 1), we’d handle them with 2 warm instances handling 5 each, keeping things snappy.
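For reference, here is roughly how these two settings (minimum instances and per-instance concurrency) can be applied programmatically with the google-cloud-run Python client for the Cloud Run Admin API v2. Treat it as a sketch: the project, region, and service names are placeholders, and the field names should be verified against the current client documentation rather than taken as our deployment pipeline:

```python
# Sketch: applying min instances and per-instance concurrency to a Cloud Run service
# via the Cloud Run Admin API v2 (google-cloud-run client). Identifiers are placeholders.
from google.cloud import run_v2

client = run_v2.ServicesClient()
name = "projects/my-project/locations/us-central1/services/search-api"

service = client.get_service(name=name)

# Keep at least one warm instance so the first request never waits on a cold start.
service.template.scaling.min_instance_count = 1

# Let each instance serve up to 5 requests in parallel before new instances spin up.
service.template.max_instance_request_concurrency = 5

# update_service returns a long-running operation; wait for it to complete.
operation = client.update_service(service=service)
operation.result()
```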
We also made some app-level tweaks to start up quicker and run faster on Cloud Run. We enabled Cloud Run’s startup CPU boost feature, which gives new instances a burst of CPU during startup, used a slim base container image, and loaded only essential modules at startup.
Certain initialization steps (like loading large config files or warming certain caches) were also moved to the container startup phase instead of at request time. Thanks to min instances, this startup ran infrequently. In practice, by the time a request arrived, the instance was already bootstrapped (database connections open, config loaded, etc.), so it could start processing the query immediately.
We essentially paid the startup cost once and reused it across many requests, rather than paying a bit of that cost on each request.
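Conceptually, the change was moving work from the request handler to module import / container start. A simplified Python sketch, with an illustrative config path and key format (not our actual service code):

```python
# Sketch: pay one-time setup costs at container startup, not per request.
# With min instances keeping containers warm, this runs rarely and is reused by many requests.
import json
import redis

# Module-level code runs once when the container starts.
with open("config/settings.json") as f:          # illustrative config path
    CONFIG = json.load(f)

# A shared, thread-safe connection pool also keeps us safe at concurrency > 1.
REDIS_POOL = redis.ConnectionPool(
    host=CONFIG.get("redis_host", "localhost"), port=6379, decode_responses=True
)
cache = redis.Redis(connection_pool=REDIS_POOL)
cache.ping()  # open the connection during startup, not on the first user request


def handle_search(origin: str, destination: str, date: str) -> dict:
    # By the time a request arrives, config and connections are already in place,
    # so the handler goes straight to serving the query.
    cached = cache.get(f"fares:{origin}:{destination}:{date}")
    return json.loads(cached) if cached else {}  # fall through to the slow path on a miss
```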
The results were instantly visible with these optimizations in place. We monitored our API’s performance before vs. after: the p95 latency plummeted from roughly 10 seconds down to around 2 seconds, a full 5× faster experience for our users. Even the average latency improved; for cache-hitting queries it was often under 500 ms.
More importantly, the responses became consistent and reliable. Users no longer experienced the painful and much-dreaded 10-second waits. The system could handle traffic spikes gracefully: Cloud Run scaled out to additional instances when needed. With warm containers and higher concurrency, it did so without choking on cold starts.
Meanwhile, Redis caching absorbed repeated queries and reduced load on our downstream APIs and databases. This also indirectly improved latency by preventing those systems from becoming bottlenecked.
The net effect was a snappier, more scalable search API that kept up with our customers’ expectations of quick responses and a smooth experience.
From the entire set of optimizations we undertook for latency reduction, the key takeaways are simple: cache what’s hot (with sensible TTLs and invalidation), keep at least one warm instance to avoid cold starts, tune concurrency to your workload, and move one-time initialization out of the request path.
While these latency reduction strategies may seem like a lot to take on at once, working through them systematically is quite manageable in practice. The biggest win of the entire exercise is that we turned our travel search API from a sluggish experience into one that feels instant. In a world where users expect answers “yesterday,” cutting p95 latency from 10s to 2s made all the difference in delivering a smooth travel search experience.