Scaling Time Series Data Processing: A Pivot to Pragmatism (Part 2)
Introduction
In Part 1, I explained the “M×N” strategy for scaling time series data. My original idea was to build a dedicated stream processing layer using Flink or Kafka Streams. It was a clean, theoretically perfect design that kept data transformation completely separate from data ingestion.
However, a “theoretically perfect” design in a whiteboard session often clashes with the gritty reality of a production environment.
Before diving into the Proof of Concept (PoC) development, we conducted a rigorous architectural review. This review process proved to be incredibly valuable, leading us to completely scrap the Flink idea in favor of a brutally simple, highly pragmatic alternative.
This post details that architectural pivot. It highlights how critically evaluating operational constraints guided us away from a complex new system and toward an efficient, battle-tested solution.
The Architectural Review: Idealism vs. Operational Reality
During our design review, we stress-tested the Flink proposal against the realities of our infrastructure. As we analyzed the operational requirements, we quickly identified two critical risks:
- Deployment and Operational Overhead: Building a new Flink job solely for a PoC meant provisioning new clusters, managing configurations, and maintaining a new deployment pipeline. This introduced significant delay before we could even begin our core task: testing the database.
- Cross-IDC Network Latency: This was the dealbreaker. Our existing data writer service uses a custom parallel consumer specifically built to handle the severe network delays between our isolated data centers (IDCs). Standard Kafka consumers struggle in this environment. If we deployed Flink in a separate IDC, it would immediately hit this exact network bottleneck, invalidating our performance tests.
It became clear that introducing a new stream processing engine would force us to solve infrastructure and network problems before we could even test the Time Series Database (TSDB).
We needed a new direction.
The Pragmatic Pivot: Scaling the Existing Writer
To solve these constraints, we pivoted to an elegant and highly practical alternative: Why not embed the M×N transformation logic directly into our existing data writer service and scale that horizontally?
The concept was simple: Have each instance of the data writer service read the same raw data from Kafka, but configure each instance to generate a specific, non-overlapping slice of the final M×N data.
How It Works
Each writer instance handles both the M (time division) and N (cardinality expansion) dimensions internally. For example, one instance handles timestamps at 00s, 10s, and 20s. Another handles 30s, 40s, and 50s. Both instances also multiply the data to increase the cardinality (N) before writing it to the database.
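// Inside the existing Data Writer Service.
// Metric and emitToWriterQueue belong to that service and are not shown here.
import java.time.LocalDateTime;
import java.time.temporal.ChronoUnit;
import java.util.List;

public class DataProcessor {
    private final List<Integer> assignedTimestampOffsets; // M: second offsets this instance owns
    private final int cardinalityMultiplier;              // N: data volume expansion

    public DataProcessor(List<Integer> assignedTimestampOffsets, int cardinalityMultiplier) {
        this.assignedTimestampOffsets = assignedTimestampOffsets;
        this.cardinalityMultiplier = cardinalityMultiplier;
    }

    public void process(Metric metric) {
        // Snap the incoming 1-minute point to the start of its minute
        LocalDateTime baseTime = metric.getTimestamp().truncatedTo(ChronoUnit.MINUTES);
        // M: Time division - emit one copy per assigned second offset
        for (int offset : assignedTimestampOffsets) {
            // N: Cardinality expansion - emit one copy per synthetic instance number
            for (int instanceOffset = 0; instanceOffset < cardinalityMultiplier; instanceOffset++) {
                Metric transformed = metric.copy();
                transformed.setTimestamp(baseTime.plusSeconds(offset));
                transformed.setInstanceNo(metric.getInstanceNo() + instanceOffset);
                emitToWriterQueue(transformed);
            }
        }
    }
}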
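One detail worth checking in the expansion above is how the synthetic instance numbers are derived. With `metric.getInstanceNo() + instanceOffset`, two real instances whose numbers are closer together than the multiplier produce overlapping synthetic numbers, which would lower the effective cardinality. Whether that matters depends on how instance numbers are assigned in the real system, which this post does not specify. The sketch below is only an illustration under the assumption of consecutive, non-negative instance numbers; the class `InstanceNoExpansion` and the "blocked" scheme are hypothetical and not part of the original service.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration, not part of the original writer service.
final class InstanceNoExpansion {

    /** Additive scheme, as in the snippet above: base + offset. */
    static int additive(int baseInstanceNo, int offset) {
        return baseInstanceNo + offset;
    }

    /** Assumed alternative: reserve a block of `multiplier` ids per base instance. */
    static int blocked(int baseInstanceNo, int offset, int multiplier) {
        return baseInstanceNo * multiplier + offset;
    }

    /** Counts the distinct synthetic instance numbers produced for the given base numbers. */
    static int distinctIds(int[] baseInstanceNos, int multiplier, boolean useBlocks) {
        Set<Integer> ids = new HashSet<>();
        for (int base : baseInstanceNos) {
            for (int offset = 0; offset < multiplier; offset++) {
                ids.add(useBlocks ? blocked(base, offset, multiplier) : additive(base, offset));
            }
        }
        return ids.size();
    }

    public static void main(String[] args) {
        int[] consecutiveBases = {1, 2, 3};
        System.out.println(distinctIds(consecutiveBases, 6, false)); // 8
        System.out.println(distinctIds(consecutiveBases, 6, true));  // 18
    }
}
```

With three consecutive base numbers and N = 6, the additive scheme yields only 8 distinct ids (1 through 8), while the blocked scheme yields the full 3 × 6 = 18. If the real instance numbers are already spaced at least N apart, the additive scheme is fine as written.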
Configuration Example
// Writer Instance A: handles 00s, 10s, 20s with 6x cardinality
DataProcessor writerA = new DataProcessor(
Arrays.asList(0, 10, 20), // M: 3 timestamps
6 // N: 6x data volume
);
// Writer Instance B: handles 30s, 40s, 50s with 6x cardinality
DataProcessor writerB = new DataProcessor(
Arrays.asList(30, 40, 50), // M: 3 timestamps
6 // N: 6x data volume
);
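The two hand-written configurations above could also be derived from an instance index, which keeps the slices non-overlapping by construction as more writers are added. The following is a minimal sketch under that assumption; the helper `SliceAssignment` and its parameters are hypothetical and do not come from the original service.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the original writer service.
final class SliceAssignment {

    /**
     * Returns the second offsets owned by one writer instance.
     * The offsets 0, step, 2*step, ... below 60 are split into contiguous,
     * equally sized blocks, one block per instance.
     */
    static List<Integer> offsetsFor(int instanceIndex, int instanceCount, int stepSeconds) {
        if (stepSeconds <= 0 || 60 % stepSeconds != 0) {
            throw new IllegalArgumentException("stepSeconds must be a positive divisor of 60");
        }
        int totalSlots = 60 / stepSeconds; // M, e.g. 6 for a 10-second step
        if (instanceCount <= 0 || totalSlots % instanceCount != 0) {
            throw new IllegalArgumentException("slots must divide evenly across instances");
        }
        if (instanceIndex < 0 || instanceIndex >= instanceCount) {
            throw new IllegalArgumentException("instanceIndex out of range");
        }
        int slotsPerInstance = totalSlots / instanceCount;
        List<Integer> offsets = new ArrayList<>();
        for (int i = 0; i < slotsPerInstance; i++) {
            offsets.add((instanceIndex * slotsPerInstance + i) * stepSeconds);
        }
        return offsets;
    }

    public static void main(String[] args) {
        System.out.println(offsetsFor(0, 2, 10)); // [0, 10, 20]
        System.out.println(offsetsFor(1, 2, 10)); // [30, 40, 50]
    }
}
```

With two instances and a 10-second step this reproduces the offsets given to Writer Instance A and Writer Instance B. The resulting list could be passed as the first constructor argument of `DataProcessor`.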
Data Volume Calculation
Let’s do the math:
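- M = 6: Across both writer instances, we split 1-minute data into 6 timestamps (00s, 10s, 20s, 30s, 40s, 50s), three per instance.
- N = 6: We multiply each timestamp by 6 using different instance numbers.
- Total Increase: M × N = 6 × 6 = 36x data volume (3 × 6 = 18x from each of the two writer instances).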
This approach achieved the exact same M×N scaling effect (36x data volume) as the complex Flink architecture, but completely bypassed the need for new infrastructure.
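The 36x figure only holds if the writer slices do not overlap. As a rough sanity check, the sketch below counts the distinct (timestamp offset, instance number) pairs that a single input point expands into across all writers. The class `VolumeCheck` and the base instance number 100 are made up for illustration.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sanity check, not part of the original writer service.
final class VolumeCheck {

    /** Counts distinct (offset, instanceNo) pairs produced for one input point. */
    static int expandedPoints(List<List<Integer>> offsetsPerWriter,
                              int cardinalityMultiplier,
                              int baseInstanceNo) {
        Set<String> seen = new HashSet<>();
        for (List<Integer> offsets : offsetsPerWriter) {
            for (int offset : offsets) {
                for (int n = 0; n < cardinalityMultiplier; n++) {
                    // Identical pairs from different writers collapse into one entry
                    seen.add(offset + ":" + (baseInstanceNo + n));
                }
            }
        }
        return seen.size();
    }

    public static void main(String[] args) {
        List<List<Integer>> disjoint = List.of(List.of(0, 10, 20), List.of(30, 40, 50));
        List<List<Integer>> overlapping = List.of(List.of(0, 10, 20), List.of(20, 30, 40));
        System.out.println(expandedPoints(disjoint, 6, 100));    // 36
        System.out.println(expandedPoints(overlapping, 6, 100)); // 30
    }
}
```

The disjoint configuration yields the full 6 × 6 = 36 points per input point. The overlapping one drops to 30 because both writers emit the 20s offset, so a misconfigured slice quietly reduces the test load.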
Most importantly, it leveraged a battle-tested service that had already solved our cross-IDC network latency issues.
Head-to-Head: Comparing the Architectures
We formalized the comparison to ensure we were making the right long-term trade-offs.
| Aspect | Flink/Stream Processor Approach | Direct Writer Scaling Approach | Analysis |
|---|---|---|---|
| Time to Market (PoC) | Slow | Fast | No new infrastructure or deployment pipelines needed. |
| Operational Risk | High | Low | Introduces an unproven component vs. scaling a stable, known one. |
| Resource Efficiency | Low | High | Avoids duplicating data in a second Kafka topic, saving massive storage. |
| M×N Implementation | Supported | Supported | Both approaches successfully implement the scaling strategy. |
| Architectural Purity | High | Low | Concerns are cleanly separated vs. writer handling both transformation and I/O. |
| Network Resilience | Low | High | Flink would struggle with our IDC latency; the writer already handles it perfectly. |
For our PoC, the decision was obvious. We happily traded textbook architectural “purity” for a massive gain in speed, resource efficiency, and network resilience.
Visual Comparison of Approaches
[Diagram: flow comparison of the two approaches; only part of the diagram source survives. In the direct approach, Writer Instance 1 (00s, 10s, 20s) and Writer Instance 2 (30s, 40s, 50s) are fed from the same upstream node and both write to the TSDB. Key differences: Flink uses 2 topics, carries high latency risk, and is complex and resource-heavy; the direct approach uses 1 topic and is latency-optimized, simple, and resource-efficient.]
Conclusion: Architectural Pragmatism in Practice
This pivot from a theoretical ideal to a highly pragmatic solution reinforced several core principles of senior-level system design.
1. The Goal is Validation, Not Perfection
A “good” architecture is one that solves the business problem efficiently. Our goal was to test TSDB limits safely and quickly. By recognizing that building a perfect stream processing pipeline was a distraction from our actual goal, we saved weeks of engineering time.
2. Infrastructure Constraints Drive Design
You cannot design software in a vacuum. The specific network latency between our data centers was a hard constraint that immediately invalidated our theoretical Flink design. Truly robust architectures are built around—and optimized for—the unique limitations of their physical environments.
3. Complexity is a Cost
Every new component (like Flink) introduces operational overhead, deployment risk, and maintenance burdens. By strategically reusing and scaling an existing, battle-tested component, we achieved our high-throughput goal while keeping the system architecture as lean as possible.
This experience proved that the best architectures are rarely the most complex ones. They are the ones that balance technical rigor with operational reality to deliver results efficiently.
All technical content in this article is based on actual production experience. Specific system names and configuration values have been generalized for security.