Scaling Time Series Data Processing: Discovering TSDB Limits in Practice (Part 3)
Prologue: Testing TSDB Limits in the Real World
In Part 1, we introduced the M×N scaling strategy to test the performance limits of our Time Series Database (TSDB). In Part 2, we shared how our team’s feedback led us to drop a complex stream processing design in favor of a simpler, more practical approach using our existing data writer service.
Now comes the moment of truth: actually testing the TSDB’s limits and finding the real bottlenecks.
What we found changed how we look at system performance. The bottlenecks were not where we expected them to be. Fixing them required us to look far beyond our application code. This is the story of how we uncovered our database’s true limits and learned to scale beyond them.
Chapter 1: Preparing for the Test
Our Strategy
Our goal was simple: push the TSDB to its breaking point. Using our M×N strategy, we planned to turn up the data volume step by step until the system buckled.
Here is what we expected to see in theory:
- Base throughput: ~21.7M metrics/minute
- M=6 (10-second intervals): 21.7M × 6 = 130.2M expected
- M=12 (5-second intervals): 21.7M × 12 = 260.4M expected
The First Obstacle: Application Bottlenecks
When we ran the tests, the results were sobering:
| Test | Parallelism | M | Expected/min | Actual/min | Efficiency |
|---|---|---|---|---|---|
| 1 | 4 | 6 | 130.2M | ~73M | 56% |
| 2 | 6 | 6 | 130.2M | ~104M | 80% |
| 3 | 6 | 12 | 260.4M | ~130M | 50% |
| 4 | 6 | 20 | 434M | ~130M | 30% |
The hard truth: As we pushed more data (increased M), our efficiency dropped fast. Throughput plateaued at about 130 million metrics per minute (tests 3 and 4), no matter how we tweaked the application settings.
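The Expected and Efficiency columns come from two simple formulas: expected = base × M, and efficiency = actual ÷ expected. Here is a minimal sketch of that arithmetic, assuming a base rate of 21.7M metrics/minute (the value implied by the table's Expected column). The class and method names are illustrative, not taken from our codebase.

```java
public class EfficiencyTable {
    // Assumed base rate in millions of metrics per minute (130.2M / 6).
    static final double BASE_MILLIONS_PER_MIN = 21.7;

    /** Expected throughput in millions/minute when each minute is amplified M times. */
    public static double expected(int m) {
        return BASE_MILLIONS_PER_MIN * m;
    }

    /** Efficiency as a rounded percentage of the expected throughput. */
    public static long efficiencyPercent(double actualMillions, int m) {
        return Math.round(100.0 * actualMillions / expected(m));
    }

    public static void main(String[] args) {
        int[] multipliers = {6, 6, 12, 20};      // M for tests 1-4
        double[] actuals = {73, 104, 130, 130};  // measured millions/minute
        for (int i = 0; i < multipliers.length; i++) {
            System.out.printf("Test %d: M=%d expected=%.1fM actual=~%.0fM efficiency=%d%%%n",
                    i + 1, multipliers[i], expected(multipliers[i]), actuals[i],
                    efficiencyPercent(actuals[i], multipliers[i]));
        }
    }
}
```

Watching this ratio fall while M rises is what told us the generator, not the database, had stopped keeping up.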
What Was Going Wrong?
We checked the TSDB monitoring dashboard, and to our surprise, the database was barely breaking a sweat:
- CPU Utilization: 20-40%
- Peak Ingestion Rate: 18M datapoints/min
- Memory/Disk: All perfectly normal
The realization: The database wasn’t the problem. Our own application was too slow to generate enough load to stress the database!
Before we could test the TSDB’s limits, we had to fix our own code.
Chapter 2: Eliminating Application Bottlenecks
Problem #1: Blocking I/O
The Issue: When our application sent data to the database, it stopped and waited for a response. During this waiting time, the thread was blocked and couldn’t process any new messages from Kafka.
Before: Synchronous Architecture
<div class="mermaid">
graph TD
subgraph "Consumer Thread"
A[Kafka Poll] --> B[Data Amplification]
B --> C["timeSeriesService.write()"]
C -- "I/O Wait (seconds)" --> D[Write Complete]
D --> A
end
</div>
After: Asynchronous Architecture with Virtual Threads
<div class="mermaid">
graph TD
subgraph "Consumer Thread"
A[Kafka Poll] --> B[Data Amplification]
B --> Q[Submit Write Task]
Q --> A
end
subgraph "Virtual Thread Pool"
Q --> VT1[Virtual Thread 1: Write Batch 1]
Q --> VT2[Virtual Thread 2: Write Batch 2]
Q --> VTx[Virtual Thread N...]
end
</div>
The Fix: We used Java 21’s Virtual Threads to make the database writes asynchronous. Now, the main thread simply hands the write task to a virtual thread and immediately goes back to fetching more data.
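To make the hand-off pattern concrete, here is a small self-contained sketch. It assumes the executor comes from `Executors.newVirtualThreadPerTaskExecutor()` (Java 21). The `blockingWrite` method and its 100 ms delay are stand-ins for a real TSDB call, and every name in it is hypothetical.

```java
import java.time.Duration;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

public class AsyncHandoffSketch {
    /** Stand-in for a blocking TSDB write; the 100 ms delay is an arbitrary assumption. */
    static void blockingWrite(List<String> batch, AtomicInteger written) throws InterruptedException {
        Thread.sleep(Duration.ofMillis(100));
        written.addAndGet(batch.size());
    }

    /** Submits one write task per batch and returns the number of points written. */
    public static int runBatches(int batches) {
        AtomicInteger written = new AtomicInteger();
        // Each submitted task runs on its own virtual thread.
        try (ExecutorService writerExecutor = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < batches; i++) {
                List<String> batch = List.of("metric-" + i + "-a", "metric-" + i + "-b");
                writerExecutor.submit(() -> {
                    blockingWrite(batch, written);
                    return null; // Callable form, so the checked exception is allowed
                });
                // The loop continues immediately, like the consumer going back to poll().
            }
        } // close() waits for all submitted tasks to finish
        return written.get();
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        int written = runBatches(50);
        long elapsedMs = Duration.ofNanos(System.nanoTime() - start).toMillis();
        System.out.println("written=" + written + " elapsedMs=" + elapsedMs);
    }
}
```

One caveat with this pattern: nothing here bounds the number of in-flight writes. If the sink is slower than the producer, tasks pile up in memory, so a `Semaphore` or a similar backpressure mechanism may be worth adding.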
Code Example: Asynchronous Write Implementation
```java
// Before: Synchronous blocking write
try {
    if (!timeSeries.isEmpty()) {
        timeSeriesService.write(storageTier, timeSeries); // Blocks here
    }
} catch (Exception e) {
    log.error("Write failed", e);
    // Retry logic...
}

// After: Asynchronous non-blocking write
if (!timeSeries.isEmpty()) {
    writerExecutor.submit(() -> {
        int retries = writeRetryCount;
        while (retries >= 0) {
            try {
                timeSeriesService.write(storageTier, timeSeries);
                writtenCounter.addAndGet(timeSeries.size());
                break; // Success
            } catch (Exception e) {
                retries--;
                log.error("Write failed. Retries left: {}", retries, e);
                if (retries >= 0) {
                    try {
                        Thread.sleep(1000); // Retry delay
                    } catch (InterruptedException ie) {
                        // Thread.sleep throws a checked exception that a Runnable cannot propagate
                        Thread.currentThread().interrupt(); // Preserve the interrupt status
                        break; // Stop retrying
                    }
                }
            }
        }
    });
}
```
Problem #2: Memory Overload from Object Creation
The Issue: Even after fixing the I/O problem, our CPU usage was still too high. Why? Because we were creating millions of temporary HashMap objects every minute during the M×N data multiplication. This caused the Garbage Collector (GC) to work overtime, slowing down the whole system.
The Fix: Instead of creating a new HashMap in every loop, we pre-created the required maps once and reused them. This cut our object creation down massively.
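The effect on allocations can be checked with a small experiment. The sketch below uses a hypothetical, simplified `Point` record in place of our real `TimeSeries` type and counts distinct map instances by identity. With M=12 and N=5, the naive version creates 60 maps and the reusing version creates 5.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class DimensionReuseSketch {
    /** Hypothetical, simplified stand-in for a time series data point. */
    record Point(long timestamp, Map<String, String> dimensions, double value) {}

    /** Naive expansion: copies the dimension map for each of the M×N points. */
    public static List<Point> expandNaive(Map<String, String> base, int m, int n) {
        List<Point> out = new ArrayList<>(m * n);
        for (int i = 0; i < m; i++) {
            for (int j = 0; j < n; j++) {
                Map<String, String> dims = new HashMap<>(base);
                dims.put("instanceNo", "inst-" + j);
                out.add(new Point(i, dims, 1.0));
            }
        }
        return out;
    }

    /** Reusing expansion: builds N read-only maps once and shares them across M timestamps. */
    public static List<Point> expandReusing(Map<String, String> base, int m, int n) {
        List<Map<String, String>> variants = new ArrayList<>(n);
        for (int j = 0; j < n; j++) {
            Map<String, String> dims = new HashMap<>(base);
            dims.put("instanceNo", "inst-" + j);
            variants.add(Collections.unmodifiableMap(dims)); // shared, so keep it read-only
        }
        List<Point> out = new ArrayList<>(m * n);
        for (int i = 0; i < m; i++) {
            for (Map<String, String> dims : variants) {
                out.add(new Point(i, dims, 1.0));
            }
        }
        return out;
    }

    /** Counts distinct map instances by reference identity, not by equals(). */
    public static int distinctMaps(List<Point> points) {
        Set<Map<String, String>> seen = Collections.newSetFromMap(new IdentityHashMap<>());
        for (Point p : points) {
            seen.add(p.dimensions());
        }
        return seen.size();
    }

    public static void main(String[] args) {
        Map<String, String> base = Map.of("host", "web-1");
        System.out.println("naive maps:  " + distinctMaps(expandNaive(base, 12, 5)));
        System.out.println("reused maps: " + distinctMaps(expandReusing(base, 12, 5)));
    }
}
```

Sharing maps like this is only safe if nothing mutates them afterwards, which is why the sketch wraps each variant in an unmodifiable view.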
Code Example: Memory Optimization
```java
// Before: Creating new HashMap in every loop iteration
for (int i = 0; i < timeMultiplier; i++) {
    for (int instanceOffset = 0; instanceOffset < instanceMultiplier; instanceOffset++) {
        // New HashMap created M×N times
        String newInstanceNo = generateNewInstanceNo(originalInstanceNo, instanceOffset);
        TimeSeries expanded = original.copyWithTimestampAndInstanceNo(newTimestamp, newInstanceNo);
        result.add(expanded);
    }
}

// After: Pre-create dimension variants (N times) and reuse
List<Map<String, String>> dimensionVariants = new ArrayList<>(instanceMultiplier);
for (int instanceOffset = 0; instanceOffset < instanceMultiplier; instanceOffset++) {
    Map<String, String> variant = new HashMap<>(baseDimensions);
    variant.put("instanceNo", generateNewInstanceNo(originalInstanceNo, instanceOffset));
    dimensionVariants.add(variant);
}

// Main loop reuses pre-created Maps
for (int i = 0; i < timeMultiplier; i++) {
    for (Map<String, String> dims : dimensionVariants) {
        TimeSeries expanded = TimeSeries.builder()
                .timestamp(newTimestamp)
                .dimensions(dims) // Reuse, don't recreate
                .value(original.getValue())
                .build();
        result.add(expanded);
    }
}
```
The Result of Our Optimizations
- Efficiency shot up: M=6 tests went from 56% to ~90% efficiency.
- Stable processing: We comfortably hit ~120M metrics/minute.
- Application bottlenecks gone: Our generator was finally ready.
Mission Accomplished: With our application running smoothly, it was time to find the database’s real limits.
Chapter 3: Discovering the Real TSDB Bottleneck
The True Test
Now we could push harder. We cranked M up to 12, aiming for over 260M metrics per minute.
The Surprise: Performance didn’t go up. It actually dropped to around 100M/minute. We also started seeing “Slow write request” warnings for writes that took 7 to 9 seconds!
We had finally found the TSDB’s breaking point. But why was it breaking?
The Smoking Gun: Database Concurrency Limits
We dug back into the TSDB dashboards:
- CPU Usage: Still only ~30%. The database wasn’t working too hard.
- Concurrent Inserts: Here was the clue. Even though our application opened ~200 connections, the database was only processing 16 write operations at a time.
The Root Cause
This wasn’t a bug; it was a built-in safety feature. Our TSDB (VictoriaMetrics) uses a component called vminsert, which protects itself from overload by capping the number of concurrent inserts. The cap is controlled by the -maxConcurrentInserts flag and, in our setup, defaulted to CPU cores × 2.
The Math:
- Our Database Servers: 8 CPU cores
- Hard Limit: 8 × 2 = 16 concurrent operations
The Reality: Our hundreds of parallel write requests were piling up in the database’s waiting line. They had to wait for one of those 16 slots to open up. That waiting time is exactly what caused the 7-9 second delays.
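This queuing effect is easy to reproduce in isolation. The following sketch is a simplified model, not VictoriaMetrics code: a fair `Semaphore` with cores × 2 permits stands in for the insert slots, each simulated insert holds a slot for a fixed time, and we record the peak concurrency and the longest wait. The service time and client count are arbitrary assumptions.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

public class ConcurrencyLimitSketch {
    /** Outcome of one simulated run. */
    public record Result(int peakConcurrent, long maxWaitMs) {}

    public static Result simulate(int cores, int clients, long serviceMs) {
        int slots = cores * 2;                               // the limit described above
        Semaphore insertSlots = new Semaphore(slots, true);  // fair: first come, first served
        AtomicInteger inFlight = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        AtomicLong maxWaitNanos = new AtomicLong();

        ExecutorService pool = Executors.newFixedThreadPool(clients);
        for (int c = 0; c < clients; c++) {
            pool.submit(() -> {
                long queuedAt = System.nanoTime();
                insertSlots.acquire();                       // wait in line for a free slot
                long waited = System.nanoTime() - queuedAt;
                int now = inFlight.incrementAndGet();
                try {
                    maxWaitNanos.accumulateAndGet(waited, Math::max);
                    peak.accumulateAndGet(now, Math::max);
                    Thread.sleep(serviceMs);                 // simulated insert work
                } finally {
                    inFlight.decrementAndGet();
                    insertSlots.release();
                }
                return null;
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("simulation did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("simulation interrupted", e);
        }
        return new Result(peak.get(), maxWaitNanos.get() / 1_000_000);
    }

    public static void main(String[] args) {
        // 8 cores -> 16 slots; 200 clients; 50 ms per simulated insert.
        Result r = simulate(8, 200, 50);
        System.out.println("peak concurrent inserts = " + r.peakConcurrent());
        System.out.println("longest wait (ms)       = " + r.maxWaitMs());
    }
}
```

In this model the longest wait grows roughly with (clients ÷ slots) × service time. Adding clients only lengthens the queue, while adding slots shortens it.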
A Successful Discovery
This is exactly what we wanted to find. We discovered a hard limit, tied to the CPU core count, that would affect our production systems.
What we learned:
- The Limit: 16 simultaneous write operations.
- The Cause: Database self-protection, not bad application code.
- The Solution: Adding more application servers wouldn’t help. We had to upgrade the database hardware.
Chapter 4: Scaling Beyond the Limits
Breaking the Ceiling
Now we knew the problem: the 16-slot concurrency limit. Since this limit scales with CPU cores, the most direct way to get more slots was to get more cores.
The Pragmatic Move: A Massive Hardware Upgrade
Instead of trying to hack the database settings, we chose a straightforward infrastructure upgrade:
The Solution: High-Performance Physical Servers
- What we found: 8 idle Physical Machines (PMs) were available in our datacenter.
- The Specs: 40 cores and 256GB RAM each! (A huge jump from our 8-core VMs).
- The Win: This solved both our compute needs and our concurrency limits at the same time.
The Migration Impact:
# Before: Virtual Machines (VMs)
Servers: 8 × VMs (8 cores, 32GB)
Concurrency Limit: 8 × 2 = 16 slots per server
Capacity: ~120M metrics/min
# After: Physical Machines (PMs)
Servers: 8 × PMs (40 cores, 256GB)
Concurrency Limit: 40 × 2 = 80 slots per server
Expected Jump: 5x more concurrent insert slots per server (16 → 80)
Why This Was the Right Call
This decision highlights some great lessons about real-world engineering:
- Hardware over Hacking: Sometimes, using powerful, idle hardware is much smarter than spending weeks writing complex software workarounds.
- Whole-System Thinking: We didn’t just scale our app; we scaled the underlying compute and database concurrency limits together.
- Speed to Value: Reusing idle servers gave us an immediate performance boost without waiting for new budgets or orders.
This upgrade gave us 5x more concurrent insert slots per server and set us up perfectly for our final tests.
Chapter 5: Lessons Learned
Technical Takeaways
- Async I/O is Magic: Virtual Threads are an incredible tool for fixing I/O-heavy bottlenecks.
- Watch Your Objects: Reusing objects (like our HashMaps) can save your Garbage Collector and speed up your app.
- Know Your Limits: Systems have hard limits built in for protection. You need to find them.
- Dashboards Don’t Lie: We never would have found the 16-slot limit without good monitoring.
Architectural Rules of Thumb
- Fix Your Own House First: Optimize your application before blaming the database.
- Test With Real Data: Guesses are dangerous. Measure everything.
- Respect Boundaries: Understand how the systems you connect to actually work.
- Scale Boldly: When you hit a hard hardware limit, don’t be afraid to upgrade the hardware.
Why Simple is Better
Choosing the “Direct Writer” approach back in Part 2 proved to be a brilliant move:
- Speed: We spent our time optimizing code, not setting up Flink clusters.
- Familiarity: Because we used our existing code, debugging was fast and easy.
- Clear Progress: Every tweak we made showed immediate, measurable results.
Conclusion: 1-Second Granularity Achieved and TSDB Choice Validated
The journey from the M×N strategy concept to successfully testing our TSDB’s limits was challenging but incredibly rewarding. Through application-level optimizations and strategic infrastructure scaling, we achieved our ultimate goal.
Mission Accomplished: From 60 Seconds to 1 Second
We successfully completed the PoC, proving our system can handle the transition from a 60-second collection interval down to a 1-second interval (M=60). By upgrading our infrastructure to 40-core Physical Machines, we broke through the initial concurrency bottlenecks and absorbed an influx of over 400M metrics per minute.
Validating Our TSDB Choice (VictoriaMetrics)
Perhaps the most important outcome of this extreme load testing was the validation of our core architectural decision. Hitting the vminsert concurrency limit early on wasn’t a failure of the database; rather, it was a testament to its self-protective design. Once provided with the appropriate hardware, VictoriaMetrics demonstrated incredible vertical scalability and raw ingestion efficiency.
It proved beyond a doubt that it can handle the punishing load of 1-second granularity metrics across our entire cloud infrastructure without breaking a sweat. This PoC gave us the definitive answer we needed: our choice of VictoriaMetrics as our TSDB was absolutely the right one.
What’s Next? With our ingestion capabilities proven for 1-second intervals, we’re now ready to tackle the next challenge: optimizing for cardinality explosion and complex query patterns on the read side. But that’s a story for another day.
This concludes our three-part series on scaling time series data processing. From initial strategy through team collaboration to discovering real-world database constraints, we’ve covered the complete journey from theoretical concept to practical system success.