SeungHyeon Lee

Scaling Time Series Data Processing: Discovering TSDB Limits in Practice (Part 3)

2025-09-25T00:00:00+00:00

Prologue: Testing TSDB Limits in the Real World

In Part 1, we introduced the M×N scaling strategy to test the performance limits of our Time Series Database (TSDB). In Part 2, we shared how our team’s feedback led us to drop a complex stream processing design in favor of a simpler, more practical approach using our existing data writer service.

Now comes the moment of truth: actually testing the TSDB’s limits and finding the real bottlenecks.

What we found changed how we look at system performance. The bottlenecks were not where we expected them to be. Fixing them required us to look far beyond our application code. This is the story of how we uncovered our database’s true limits and learned to scale beyond them.

Chapter 1: Preparing for the Test

Our Strategy

Our goal was simple: push the TSDB to its breaking point. Using our M×N strategy, we planned to turn up the data volume step by step until the system buckled.

Here is what we expected to see in theory:

Base throughput: ~21.7M metrics/minute
M=6 (10-second intervals): 21.7M × 6 = 130.2M expected
M=12 (5-second intervals): 21.7M × 12 = 260.4M expected

The First Obstacle: Application Bottlenecks

When we ran the tests, the results were sobering:

Test	Parallelism	M	Expected/min	Actual/min	Efficiency
1	4	6	130.2M	~73M	56%
2	6	6	130.2M	~104M	80%
3	6	12	260.4M	~130M	50%
4	6	20	434M	~130M	30%
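
The Efficiency column is simply actual throughput divided by expected throughput, where expected = base rate × M. Here is a minimal sketch of that arithmetic, assuming the ~21.7M/min base rate implied by the Expected column (130.2M ÷ 6). The class and method names are illustrative and not part of our service.

```java
// Illustrative helper for the table above; names are hypothetical.
final class ThroughputEfficiency {

    // Expected metrics/min when every source metric is expanded M times.
    static double expectedPerMinute(double basePerMinute, int m) {
        return basePerMinute * m;
    }

    // Efficiency as a percentage: actual / expected * 100.
    static double efficiencyPercent(double actualPerMinute, double basePerMinute, int m) {
        return 100.0 * actualPerMinute / expectedPerMinute(basePerMinute, m);
    }
}
```

For Test 3, `efficiencyPercent(130.0, 21.7, 12)` comes out at roughly 50%, which matches the table: doubling M doubled the expected volume while the actual volume barely moved.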

The hard truth: When we pushed more data (increased M), our efficiency dropped fast. We hit a brick wall at about 140-145 million metrics per minute, no matter how we tweaked the application settings.

What Was Going Wrong?

We checked the TSDB monitoring dashboard, and to our surprise, the database was barely breaking a sweat:

CPU Utilization: 20-40%
Peak Ingestion Rate: 18M datapoints/min
Memory/Disk: All perfectly normal

The realization: The database wasn’t the problem. Our own application was too slow to generate enough load to stress the database!

Before we could test the TSDB’s limits, we had to fix our own code.

Chapter 2: Eliminating Application Bottlenecks

Problem #1: Blocking I/O

The Issue: When our application sent data to the database, it stopped and waited for a response. During this waiting time, the thread was blocked and couldn’t process any new messages from Kafka.

Before: Synchronous Architecture


graph TD
    subgraph "Consumer Thread"
        A[Kafka Poll] --> B[Data Amplification]
        B --> C[timeSeriesService.write()]
        C -- "I/O Wait (seconds)" --> D[Write Complete]
        D --> A
    end

After: Asynchronous Architecture with Virtual Threads


graph TD
    subgraph "Consumer Thread"
        A[Kafka Poll] --> B[Data Amplification]
        B --> Q[Submit Write Task]
        Q --> A
    end

    subgraph "Virtual Thread Pool"
        Q --> VT1[Virtual Thread 1: Write Batch 1]
        Q --> VT2[Virtual Thread 2: Write Batch 2]
        Q --> VTx[Virtual Thread N...]
    end

The Fix: We used Java 21’s Virtual Threads to make the database writes asynchronous. Now, the main thread simply hands the write task to a virtual thread and immediately goes back to fetching more data.

Code Example: Asynchronous Write Implementation
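
What follows is a minimal sketch of the hand-off described above, not our production implementation. The executor is injected so the same class works with any `ExecutorService`; on Java 21 you would pass `Executors.newVirtualThreadPerTaskExecutor()`. The class name `AsyncBatchWriter`, the `blockingWrite` callback (standing in for `timeSeriesService.write()`), and the semaphore that caps in-flight writes are all assumptions added for illustration.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Semaphore;
import java.util.function.Consumer;

// Hypothetical sketch: hands each batch to an executor so the polling thread
// does not wait on database I/O.
final class AsyncBatchWriter<T> {
    private final ExecutorService executor;        // Java 21: Executors.newVirtualThreadPerTaskExecutor()
    private final Consumer<List<T>> blockingWrite; // stands in for timeSeriesService.write(...)
    private final Semaphore inFlight;              // assumption: cap on concurrent writes (backpressure)
    private final int maxInFlight;

    AsyncBatchWriter(ExecutorService executor, Consumer<List<T>> blockingWrite, int maxInFlight) {
        this.executor = executor;
        this.blockingWrite = blockingWrite;
        this.maxInFlight = maxInFlight;
        this.inFlight = new Semaphore(maxInFlight);
    }

    // Called from the consumer thread. Returns once the task is handed off;
    // it only waits when maxInFlight writes are already pending.
    void submit(List<T> batch) {
        inFlight.acquireUninterruptibly();
        try {
            executor.execute(() -> {
                try {
                    blockingWrite.accept(batch); // the slow, blocking call runs off the consumer thread
                } finally {
                    inFlight.release();
                }
            });
        } catch (RuntimeException e) {
            inFlight.release(); // e.g. the executor rejected the task
            throw e;
        }
    }

    // Blocks until every submitted write has finished.
    void drain() {
        inFlight.acquireUninterruptibly(maxInFlight);
        inFlight.release(maxInFlight);
    }
}
```

The semaphore is a deliberate design choice in this sketch. Virtual threads are cheap, so without a cap a fast consumer could queue far more pending writes than the database or the heap can absorb.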

Scaling Time Series Data Processing: A Pivot to Pragmatism (Part 2)

2025-08-27T00:00:00+00:00

Introduction

In Part 1, I explained the “M×N” strategy for scaling time series data. My original idea was to build a dedicated stream processing layer using Flink or Kafka Streams. It was a clean, theoretically perfect design that kept data transformation completely separate from data ingestion.

However, a “theoretically perfect” design in a whiteboard session often clashes with the gritty reality of a production environment.

Before diving into the Proof of Concept (PoC) development, we conducted a rigorous architectural review. This review process proved to be incredibly valuable, leading us to completely scrap the Flink idea in favor of a brutally simple, highly pragmatic alternative.

This post details that architectural pivot. It highlights how critically evaluating operational constraints guided us away from a complex new system and toward an efficient, battle-tested solution.

The Architectural Review: Idealism vs. Operational Reality

During our design review, we stress-tested the Flink proposal against the realities of our infrastructure. As we analyzed the operational requirements, we quickly identified two critical risks:

Deployment and Operational Overhead: Building a new Flink job solely for a PoC meant provisioning new clusters, managing configurations, and maintaining a new deployment pipeline. This introduced significant delay before we could even begin our core task: testing the database.
Cross-IDC Network Latency: This was the dealbreaker. Our existing data writer service uses a custom parallel consumer specifically built to handle the severe network delays between our isolated data centers (IDCs). Standard Kafka consumers struggle in this environment. If we deployed Flink in a separate IDC, it would immediately hit this exact network bottleneck, invalidating our performance tests.

It became clear that introducing a new stream processing engine would force us to solve infrastructure and network problems before we could even test the Time Series Database (TSDB).

We needed a new direction.

The Pragmatic Pivot: Scaling the Existing Writer

To solve these constraints, we pivoted to an elegant and highly practical alternative: Why not embed the M×N transformation logic directly into our existing data writer service and scale that horizontally?

The concept was simple: Have each instance of the data writer service read the same raw data from Kafka, but configure each instance to generate a specific, non-overlapping slice of the final M×N data.

How It Works

Each writer instance handles both the M (time division) and N (cardinality expansion) dimensions internally. For example, one instance handles timestamps at 00s, 10s, and 20s. Another handles 30s, 40s, and 50s. Both instances also multiply the data to increase the cardinality (N) before writing it to the database.

// Inside the existing Data Writer Service
import java.time.LocalDateTime;
import java.time.temporal.ChronoUnit;
import java.util.List;

public class DataProcessor {
    private final List<Integer> assignedTimestampOffsets; // M: time division
    private final int cardinalityMultiplier;              // N: data volume expansion

    public DataProcessor(List<Integer> assignedTimestampOffsets, int cardinalityMultiplier) {
        this.assignedTimestampOffsets = List.copyOf(assignedTimestampOffsets);
        this.cardinalityMultiplier = cardinalityMultiplier;
    }

    public void process(Metric metric) {
        // Snap the original timestamp to the start of its minute
        LocalDateTime baseTime = metric.getTimestamp().truncatedTo(ChronoUnit.MINUTES);

        // M: Time division - split into multiple timestamps
        for (int offset : assignedTimestampOffsets) {
            // N: Cardinality expansion - increase data volume
            for (int instanceOffset = 0; instanceOffset < cardinalityMultiplier; instanceOffset++) {
                Metric transformed = metric.copy();
                transformed.setTimestamp(baseTime.plusSeconds(offset));
                transformed.setInstanceNo(metric.getInstanceNo() + instanceOffset);

                emitToWriterQueue(transformed);
            }
        }
    }
}

Configuration Example

// Writer Instance A: handles 00s, 10s, 20s with 6x cardinality
DataProcessor writerA = new DataProcessor(
    Arrays.asList(0, 10, 20),  // M: 3 timestamps
    6                           // N: 6x data volume
);

// Writer Instance B: handles 30s, 40s, 50s with 6x cardinality  
DataProcessor writerB = new DataProcessor(
    Arrays.asList(30, 40, 50), // M: 3 timestamps
    6                           // N: 6x data volume
);
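
Writing the offset lists by hand is fine for two instances, but it gets error-prone as M grows. As a rough sketch, the non-overlapping slices could be derived from M and the instance count. `OffsetPartitioner` is a hypothetical helper and not part of our writer service. It assumes M divides 60 evenly.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: splits the M second-offsets within a minute into
// contiguous, non-overlapping slices, one slice per writer instance.
final class OffsetPartitioner {

    static List<List<Integer>> partition(int m, int instances) {
        if (m <= 0 || instances <= 0 || 60 % m != 0) {
            throw new IllegalArgumentException("m must divide 60 and instances must be positive");
        }
        int stepSeconds = 60 / m;
        List<List<Integer>> slices = new ArrayList<>();
        for (int i = 0; i < instances; i++) {
            slices.add(new ArrayList<>());
        }
        for (int k = 0; k < m; k++) {
            // Offset index k goes to instance floor(k * instances / m),
            // which keeps each instance's offsets contiguous.
            slices.get(k * instances / m).add(k * stepSeconds);
        }
        return slices;
    }
}
```

With `partition(6, 2)` this yields `[0, 10, 20]` and `[30, 40, 50]`, the same split as Writer Instance A and B above.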

Data Volume Calculation

Let’s do the math:

M = 6: We split 1-minute data into 6 timestamps (00s, 10s, 20s, 30s, 40s, 50s)
N = 6: We multiply each timestamp by 6 using different instance numbers.
Total Increase: M × N = 6 × 6 = 36x data volume

This approach achieved the exact same M×N scaling effect (36x data volume) as the complex Flink architecture, but completely bypassed the need for new infrastructure.

Most importantly, it leveraged a battle-tested service that had already solved our cross-IDC network latency issues.

Head-to-Head: Comparing the Architectures

We formalized the comparison to ensure we were making the right long-term trade-offs.

Aspect	Flink/Stream Processor Approach	Direct Writer Scaling Approach	Analysis
Time to Market (PoC)	Slow	Fast	No new infrastructure or deployment pipelines needed.
Operational Risk	High	Low	Introduces an unproven component vs. scaling a stable, known one.
Resource Efficiency	Low	High	Avoids duplicating data in a second Kafka topic, saving massive storage.
M×N Implementation	High	High	Both approaches successfully implement the scaling strategy.
Architectural Purity	High	Low	Concerns are cleanly separated vs. writer handling both transformation and I/O.
Network Resilience	Low	High	Flink would struggle with our IDC latency; the writer already handles it perfectly.

For our PoC, the decision was obvious. We happily traded textbook architectural “purity” for a massive gain in speed, resource efficiency, and network resilience.

Visual Comparison of Approaches

graph TB
    subgraph "Theoretical Approach (Complex Flink)"
        A1[Source Topic] --> B1[Flink Cluster 1]
        A1 --> B2[Flink Cluster 2]
        B1 --> C1[Target Topic]
        B2 --> C1
        C1 --> D1[TSDB Writer]
        D1 --> E1[TSDB]
        style A1 fill:#ffcccc
        style C1 fill:#ffcccc
        style B1 fill:#ffcc99
        style B2 fill:#ffcc99
    end

    subgraph "Pragmatic Pivot (Direct Writer Scaling)"
        A2[Source Topic] --> D2["Writer Instance 1<br/>00s, 10s, 20s"]
        A2 --> D3["Writer Instance 2<br/>30s, 40s, 50s"]
        D2 --> E2[TSDB]
        D3 --> E2
        style A2 fill:#ccffcc
        style D2 fill:#ccffcc
        style D3 fill:#ccffcc
        style E2 fill:#ccffcc
    end

    subgraph "Key Differences"
        F1["Flink: 2 Topics, High Latency Risk<br/>Complex, Resource-Heavy"]
        F2["Direct: 1 Topic, Latency Optimized<br/>Simple, Resource-Efficient"]
        style F1 fill:#ffcccc
        style F2 fill:#ccffcc
    end

Conclusion: Architectural Pragmatism in Practice

This pivot from a theoretical ideal to a highly pragmatic solution reinforced several core principles of senior-level system design.

1. The Goal is Validation, Not Perfection

A “good” architecture is one that solves the business problem efficiently. Our goal was to test TSDB limits safely and quickly. By recognizing that building a perfect stream processing pipeline was a distraction from our actual goal, we saved weeks of engineering time.

2. Infrastructure Constraints Drive Design

You cannot design software in a vacuum. The specific network latency between our data centers was a hard constraint that immediately invalidated our theoretical Flink design. Truly robust architectures are built around—and optimized for—the unique limitations of their physical environments.

3. Complexity is a Cost

Every new component (like Flink) introduces operational overhead, deployment risk, and maintenance burdens. By strategically reusing and scaling an existing, battle-tested component, we achieved our high-throughput goal while keeping the system architecture as lean as possible.

This experience proved that the best architectures are rarely the most complex ones. They are the ones that balance technical rigor with operational reality to deliver results efficiently.

All technical content in this article is based on actual production experience. Specific system names and configuration values have been generalized for security.

Scaling Time Series Data Processing: The M×N Strategy and a Stream-Based Approach (Part 1)

2025-08-25T00:00:00+00:00

Introduction

When running time series data processing systems, we often need to scale data ingestion. Two common scenarios are:

Granular Monitoring: Shortening a 1-minute collection interval to 1 second to increase data density.
System Performance Testing: Stress-testing a Time Series Database (TSDB) with massive data to find its limits.

Initially, I took a simple approach: why not just copy the existing data multiple times? However, this “simple” method caused unexpected problems. It forced me to rethink the challenge from a completely different angle.

This article shares my trial-and-error process and the improved architecture we built as a result.

The First Try: Simple Multiplication

Initial Requirement and Intuitive Solution

Current: Low-density metric data (e.g., N-minute intervals)
Target: High-density metric data (e.g., M-second intervals)

When faced with this, my first thought was to simply “copy the existing data by the required multiplier.”

// Initial implementation: Simple multiplication
public void processData(List<Metric> metrics) {
    List<Metric> expandedMetrics = new ArrayList<>();
    
    for (Metric metric : metrics) {
        // Generate data by the target multiplier
        for (int multiplier = 0; multiplier < TARGET_MULTIPLIER; multiplier++) {
            Metric duplicated = metric.copy();
            duplicated.adjustTimestamp(multiplier);
            expandedMetrics.add(duplicated);
        }
    }
    
    // Only after everything is buffered do we write to the sink
    for (Metric expanded : expandedMetrics) {
        sink.write(expanded); // By now all X * N copies sit on the heap → OOM
    }
}

Unexpected Problems

This simple code quickly revealed serious issues:

1. OutOfMemoryError (OOM)

Existing data volume: X
Multiplier: N
Result: X * N items loaded into memory simultaneously → OOM

2. Garbage Collection (GC) Pressure

The app created massive amounts of temporary objects.
This caused long GC pause times, which delayed the whole system.

3. Scalability Limitations

Memory usage grew linearly with the multiplier (X × N), so every increase pushed the heap closer to its limit.
We couldn’t use this method for tests that required even larger data volumes.

These problems led me to ask a fundamental question: “Is this approach truly practical?”

Analyzing the Problem and Finding a New Design

Limitations of the Existing Method

Why did the first approach fail? Here are the main reasons:

Memory-centric thinking: We tried to load all data into memory before processing it.
Batch processing mindset: We treated continuous streaming data as a single batch.
Lack of realism: The design didn’t match how a real-world production environment works.

The key insight: “The code wasn’t the problem. The way we thought about the data was the problem.”

Root Cause Analysis: Two Different Test Objectives

Looking closer, our initial requirement actually hid two different test objectives:

1. The M-Dimension: Increasing Data Density

Objective: More granular monitoring.
Method: 1-minute intervals → 1-second intervals (temporal refinement).
Impact: More data points for the *same* time series.
Test Target: TSDB write throughput and storage capacity limits.

2. The N-Dimension: Increasing Cardinality

Objective: System performance testing (measuring TSDB limits).
Method: Diversifying unique identifiers (e.g., server IDs, instance IDs).
Impact: An N-fold increase in the number of unique time series, i.e., N-fold cardinality.
Test Target: TSDB performance for indexing, metadata processing, and label-based queries.

Cardinality = The number of unique time series (a combination of metric name + labels).
※ The timestamp does not affect cardinality.
※ Different unique identifiers are treated as distinct time series.
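
To make the definition concrete, here is a small sketch that counts cardinality by building a series key from the metric name and labels while ignoring the timestamp. `DataPoint` and `CardinalityCounter` are illustrative types invented for this example. A real TSDB maintains this identity in its index rather than in string keys.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical data point: metric name, labels, and an epoch-second timestamp.
record DataPoint(String name, Map<String, String> labels, long epochSecond) {

    // Series identity = name + labels (sorted for a stable key).
    // The timestamp is deliberately left out.
    String seriesKey() {
        return name + new TreeMap<>(labels);
    }
}

final class CardinalityCounter {

    // Counts unique time series among the given data points.
    static int countSeries(List<DataPoint> points) {
        Set<String> keys = new HashSet<>();
        for (DataPoint p : points) {
            keys.add(p.seriesKey());
        }
        return keys.size();
    }
}
```

Two points for `cpu_usage{server="web01"}` at different timestamps count as one series. Adding a point for `server="web02"` makes it two.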

I failed initially because I didn’t understand this distinction. I only focused on “increasing data volume.”

This realization demanded a new approach. I needed to drop the “load everything into memory” mindset. Instead, I needed to embrace stream processing while handling both the M and N dimensions.

I decided to tackle the more complex M-dimension (increasing data density) first. I looked at two alternative approaches.

New Approaches for the M-Dimension: Memory Holding vs. Immediate Transformation

Note: The N-dimension (increasing cardinality) is easy to implement. We simply add unique identifiers to the data inside each processing cluster. Therefore, we will focus on the harder M-dimension here.

Approach 1: The Memory Holding Method

My first idea was to distribute the data perfectly over time.

// Pseudocode: Memory Holding Method
public class WindowedProcessor {
    // Holds every metric that arrives during the current 1-minute window
    private final List<Metric> buffer = new ArrayList<>();

    public void process(Metric metric) {
        // Collect data for a 1-minute window
        buffer.add(metric);

        if (windowComplete()) {
            // Dispatch at precise intervals (e.g., every 1 second)
            sendAt("14:00:00", createMetrics(buffer, 0));
            sendAt("14:00:01", createMetrics(buffer, 1));
            sendAt("14:00:02", createMetrics(buffer, 2));
            // ... (interval is configurable)
        }
    }
}

Advantages:

It perfectly copies the time distribution pattern of high-density monitoring.
It provides a very accurate simulation.

Disadvantages:

High memory usage: O(M × time_window).
It requires complex timers and state management.
We risk a memory explosion when we scale the N-dimension later.

Approach 2: The Immediate Transformation Method

// Pseudocode: Immediate Transformation Method
public class ImmediateProcessor {
    public void process(Metric metric) {
        LocalDateTime baseTime = metric.getTimestamp()
            .truncatedTo(ChronoUnit.MINUTES);
        
        // Immediately transform one metric into M metrics upon arrival
        for (int offset = 0; offset < INTERVAL_SECONDS; offset += TARGET_INTERVAL) {
            Metric transformed = metric.copy();
            transformed.setTimestamp(baseTime.plusSeconds(offset));
            emit(transformed);
        }
    }
}

Advantages:

Memory efficient: O(M).
Simple to code.
Scales very well.

Disadvantages:

It creates a sudden burst of data instead of spreading the load evenly over time.

The Critical Insight: “The Essence of Streaming”

While debating between the two approaches, I had an “Aha!” moment.

How Does Real-World Data Arrive?

I thought about how data actually arrives in a real monitoring environment.

Data flow in high-density monitoring:
Actual Arrival Time   Metric Data Content
──────────────────────────────────────────
12:00:03              server1_cpu (timestamp: 12:00:00)
12:00:07              server2_cpu (timestamp: 12:00:00)  
12:00:13              server1_cpu (timestamp: 12:00:10)
12:00:15              server3_cpu (timestamp: 12:00:00)
12:00:17              server2_cpu (timestamp: 12:00:10)
12:00:23              server1_cpu (timestamp: 12:00:20)
...

→ Characteristic: Each server sends data at its own, uncoordinated time.
→ Result: Data arrives as a continuous, irregular stream.

The Key Insight:

“In a real streaming environment, data arrives individually and continuously. Is there any good reason to buffer it and process it as a batch?”

This changed everything. In the real world:

Metrics arrive one by one.
They naturally spread out over time.
The system handles them as a stream, not a bulk batch.

Even though the Immediate Transformation method created small bursts, it actually mirrored real-world streaming much better than the buffering method!

Detailed Technical Review

Checking Memory Usage

Let’s look closely at the memory usage of the Immediate Transformation method.

public void process(Metric input) {
    // M objects are created momentarily
    for (int i = 0; i < M; i++) {
        Metric transformed = input.copy(); // M objects
        emit(transformed);
    }
}

The conclusion? It allocates O(M) objects per input metric, but only transiently. Each copy is emitted immediately, so as long as the sink does not hold on to it, the garbage collector can reclaim it right away. This is vastly more efficient than holding O(M × time_window) data in memory.
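
To make the memory argument concrete, here is a self-contained sketch of the Immediate Transformation method in which the sink is a plain `Consumer` and no intermediate list is ever built. `Sample` and `StreamingExpander` are stand-ins invented for this example (the real `Metric` type is not shown in this article), and the sketch assumes M divides 60 evenly.

```java
import java.time.LocalDateTime;
import java.time.temporal.ChronoUnit;
import java.util.function.Consumer;

// Hypothetical, immutable stand-in for the Metric type.
record Sample(String name, LocalDateTime timestamp, double value) {}

final class StreamingExpander {

    // Emits M copies of the input, one at a time, without collecting them.
    // Each copy becomes unreachable as soon as emit returns,
    // provided the sink does not retain it.
    static void expand(Sample input, int m, Consumer<Sample> emit) {
        if (m <= 0 || 60 % m != 0) {
            throw new IllegalArgumentException("m must be a positive divisor of 60");
        }
        LocalDateTime base = input.timestamp().truncatedTo(ChronoUnit.MINUTES);
        int stepSeconds = 60 / m;
        for (int i = 0; i < m; i++) {
            emit.accept(new Sample(input.name(), base.plusSeconds((long) i * stepSeconds), input.value()));
        }
    }
}
```

Contrast this with the first attempt, which appended every copy to a list before writing anything. Here the peak number of live copies depends on the sink, not on M.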

Checking Data Volume Impact

Someone asked me, “If we increase M, does that also increase cardinality?”

The answer is no: Changing the timestamp interval does not increase cardinality.

Cardinality only depends on the metric name + labels.
The timestamp does not change cardinality.

Example:
- cpu_usage{server="web01"} → 1 time series
- If we collect this every 10 seconds instead of 1 minute, it is still only 1 time series.

However, increasing M definitely increases the number of data points, which affects our storage and raw write performance.

Final Architecture Decision

Selected Method: Immediate Transformation

In the end, I chose the Immediate Transformation method.

flowchart LR
    A["Source Topic<br/>(Low-Density Metrics)"] --> B["Stream Processor<br/>(Transformation Engine)"]
    B --> |"Immediate M-fold Transformation<br/>(1 → M)"| C["Target Topic<br/>(High-Density Metrics)"]
    C --> D["TSDB<br/>(Time Series DB)"]

    subgraph "M-Dimension: Time Division"
        E["timestamp: T1<br/>metric{labels}: value"]
        E --> F["timestamp: T1<br/>metric{labels}: value"]
        E --> G["timestamp: T1+Δ<br/>metric{labels}: value"]
        E --> H["timestamp: T1+2Δ<br/>metric{labels}: value"]
        E --> I["..."]
    end

    B -.-> E
    style A fill:#e1f5fe
    style B fill:#f3e5f5
    style C fill:#e8f5e8
    style D fill:#fff3e0

How we scaled the N-Dimension: We ran processing engines in parallel.

Each engine: Created unique time series by altering label values.
Implementation: We added a unique identifier to the labels.
Example: metric{server="web01"} became metric{server="web01", metric_id="1"}, metric{server="web01", metric_id="2"}

→ Result: We generated N times the number of unique time series simply by changing the identifiers.

Why We Chose This

Realism: The data flow matches a real-world streaming environment.
Efficiency: It keeps memory usage low.
Simplicity: We avoided complex state and timer management.
Scalability: It runs smoothly even when we crank up the N-dimension.
Test Objective: It perfectly hits our goal of generating massive data volumes.

The Integrated M×N Scaling Strategy

I built a two-dimensional strategy to control data volume and cardinality independently.

Why test both dimensions?

M-Dimension (Data Density): Tests pure data throughput. It measures how fast the TSDB can write data and how much disk space it uses.
N-Dimension (Cardinality): Tests the TSDB’s indexing and metadata engine. High cardinality usually breaks TSDBs faster than raw data volume.

We must separate them because a TSDB handles raw data points very differently than it handles unique time series indexes.

M-Dimension: Increasing Data Points

// M-Implementation: Timestamp division
LocalDateTime baseTime = metric.getTimestamp().truncatedTo(ChronoUnit.MINUTES);

for (int offset = 0; offset < INTERVAL_SECONDS; offset += TARGET_INTERVAL) {
    Metric transformed = metric.copy();
    transformed.setTimestamp(baseTime.plusSeconds(offset));
    emit(transformed); // Generates M times the data points
}

Effect:

Cardinality Impact: None (same metric + labels).
Data Volume Impact: M-fold increase.

N-Dimension: Increasing Cardinality

// N-Implementation: Diversifying time series via unique identifiers
public class TimeSeriesIdentifierTransformer {
    private final int metricId;
    
    public TimeSeriesIdentifierTransformer(int metricId) {
        this.metricId = metricId;
    }
    
    public Metric transform(Metric input) {
        Metric transformed = input.copy();
        
        // Add a new unique identifier to the labels
        transformed.addLabel("metric_id", String.valueOf(metricId));
        
        return transformed; // Generates N-fold cardinality
    }
}

Effect:

Cardinality Impact: N-fold increase (creates new time series).
Data Volume Impact: N-fold increase.

Putting It Together

// M×N Integrated Implementation
for (int timeOffset = 0; timeOffset < M; timeOffset++) {
    for (int identifierOffset = 0; identifierOffset < N; identifierOffset++) {
        Metric transformed = metric.copy();
        
        // M: Refine the timestamp (increase data density)
        transformed.setTimestamp(baseTime.plusSeconds(timeOffset * TARGET_INTERVAL));
        
        // N: Diversify the unique identifier (increase cardinality)
        transformed.addLabel("metric_id", 
            String.valueOf(identifierOffset));
        
        emit(transformed); // Total data volume is M×N
    }
}

Final Effect:

Total Data Volume = Original Volume × M × N
Total Cardinality = Original Cardinality × N (M has no effect)

How This Actually Impacts the TSDB

Let’s look at exactly what happens to the database when we turn these dials.

Impact of M-Dimension Scaling

What it stresses:

Write I/O: The disks must write more data points per second.
Network Bandwidth: The network must transfer M times more data.
Disk Storage: The disk fills up M times faster.
Compression Efficiency: Because data points arrive closer together, compression often improves.

What we monitor:

- Write throughput (points/sec)
- Disk usage growth rate
- Memory usage for write buffers
- Query response time for time-range queries

Impact of N-Dimension Scaling

What it stresses:

Index Memory: The TSDB must create and hold indexes for new label combinations in RAM.
Metadata Management: The system does N times more work to discover and manage series.
Label-based Search: Regex queries like {server=~"web01_virtual_.*"} become much slower.
Aggregate Queries: GROUP BY operations must scan N times more series.

What we monitor:

- Memory usage for the series index
- Label query performance (milliseconds)
- Cardinality limit warnings
- Query planning time for complex aggregations

Integrated Load Testing Scenarios

With the M×N strategy, we can run targeted scenarios to find exact weaknesses:

Scenario 1: M=60, N=1  → High data density, existing cardinality
Scenario 2: M=1, N=100 → Existing density, high cardinality
Scenario 3: M=10, N=10 → Balanced load test

This lets us find out exactly which part of the TSDB breaks first.

Lessons Learned

1. Understand the Real Goal

It was important to look past the basic request of “make more data.” The true goal was measuring the performance limits of the TSDB.

2. Realism Over Perfection

Building a system that behaves like the real world is much more valuable than building a theoretically “perfect” but unnatural simulation.

3. Weigh the Trade-offs

I learned to constantly weigh performance against complexity, and perfection against practicality.

4. Talk to Your Team

Discussing these ideas with colleagues helped me fix blind spots I never would have seen on my own.

5. Take Small Steps

Starting with a simple multiplication idea and slowly refining it worked much better than trying to design a massive, complex system on day one.

Conclusion

This experience taught me how vital it is to understand how data flows over time.

It’s easy to fall into the trap of over-engineering a solution. This process reminded me that the best development involves finding the core problem, matching the real-world environment, and validating ideas with your team.

Part 2 Preview

In Part 2, I’ll explain how we tried to implement this M×N strategy in a real Proof of Concept (PoC) environment.

Technology Selection: Why we compared Kafka Streams and Flink.
PoC Architecture Design: How we planned to build the M×N engine.
Validation: Did the “Immediate Transformation” approach actually work in practice?

Spoiler Alert: When I showed this to my team, their feedback led us to scrap the Flink idea entirely. We pivoted to a completely different, much simpler approach. Discover how teamwork turned a complex architecture into a pragmatic solution in Part 2.

All technical content in this article is based on actual production experience. Specific system names and configuration values have been generalized for security.

Exception Handling Best Practices: Lessons from Effective Java

2025-08-22T00:00:00+00:00

1. Overview

Recently, while working on a software development project, I had the opportunity to think deeply about Exception Handling.

While integrating a third-party library, I encountered conflicts between the library’s guidelines and our application’s exception handling approach. This led me to reconsider exception handling practices, and I decided to revisit the concepts from Effective Java 3rd Edition - a true bible for Java developers.

As you can see from the table of contents, Exception Handling deserves an entire chapter rather than just a single item, highlighting its critical importance.

Let me explore Effective Java’s exception handling principles and reflect on how to solve the real-world problems I’ve encountered.

2. Exception Handling Principles

The opening statement sets the tone perfectly:

“Used properly, exceptions can improve a program’s readability, reliability, and maintainability. Used improperly, they can have the opposite effect.”

I completely agree with this statement, and I believe developers should always consider whether improper usage might occur.

2.1. Use Exceptions Only for Exceptional Conditions

Exceptions should be used only for truly exceptional situations that disrupt the normal flow of a program.

Using exceptions for situations that can be handled with simple conditional statements leads to performance degradation, reduced readability, and debugging difficulties.

Why shouldn’t exceptions be overused?

Performance Issues
- Throwing exceptions is relatively expensive
- Frequent exceptions in loops can significantly impact performance
Reduced Code Readability
- Too much exception handling code makes normal logic hard to read
- Overused exceptions = “control flow via exceptions” → not intuitive
Debugging Difficulties
- Unnecessary exceptions create longer stack traces, making actual problem identification harder

Proper Usage Examples

Situations for exception handling:
- Unpredictable errors: network disconnection, missing files, DB connection failures
- External system dependency failures
- Programming contract violations (IllegalArgumentException within reasonable bounds)
Situations to avoid exception handling:
- Simple conditional checks are sufficient
- Normal situations that frequently occur in loops

// Wrong: Using exceptions to skip null elements in a loop
for (String s : list) {
    try {
        process(s);
    } catch (NullPointerException e) {
        // Thrown simply because an element is null
        // A conditional check is much more efficient
    }
}

// Correct:
for (String s : list) {
    if (s != null) {
        process(s);
    }
}

// File reading where file doesn't exist -> exception appropriate
try {
    readFile("nonexistent.txt");
} catch (IOException e) {
    System.out.println("Error reading file: " + e.getMessage());
}

2.2. Use Checked Exceptions for Recoverable Conditions and Runtime Exceptions for Programming Errors

Java provides three types of throwables: checked exceptions, runtime exceptions, and errors. Here’s guidance on when to use each:

Recoverable conditions → Checked exceptions
Programming errors → Runtime exceptions

Checked Exceptions
- Force callers to handle exceptions
- Use for situations where the program can recover
Unchecked Exceptions
- Subclasses of RuntimeException
- Callers don’t need to handle them
- Represent programming errors that should be fixed through code changes

flowchart TD
    A["Should throw an exception?"] --> B{"Normal flow?"}
    B -->|Yes| C["Use conditionals<br/>if / for etc."]
    B -->|No| D["Use exceptions"]
    D --> E{"Exception type selection"}
    E -->|Recoverable situation| F["Checked Exception<br/>try-catch or throws required<br/>Example: File not found, Network error"]
    E -->|Programming error| G["Runtime Exception<br/>Code fix required<br/>Example: IndexOutOfBounds, NullPointer"]
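
As a small illustration of the two branches, the sketch below pairs a checked `IOException` that the caller can recover from (by falling back to defaults) with an unchecked exception that signals a caller bug. `ConfigLoader` and its methods are hypothetical names chosen for this example.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

final class ConfigLoader {

    // Recoverable: a missing or unreadable file is outside the program's
    // control, so the checked IOException is part of the method's contract.
    static List<String> readLines(Path path) throws IOException {
        return Files.readAllLines(path);
    }

    // The caller recovers by falling back to default settings.
    static List<String> readLinesOrDefault(Path path, List<String> defaults) {
        try {
            return readLines(path);
        } catch (IOException e) {
            return defaults;
        }
    }

    // Programming error: an out-of-range index means the calling code is
    // wrong, so an unchecked exception is thrown and the fix belongs in the code.
    static String lineAt(List<String> lines, int index) {
        if (index < 0 || index >= lines.size()) {
            throw new IndexOutOfBoundsException("index " + index + ", size " + lines.size());
        }
        return lines.get(index);
    }
}
```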

2.3. Avoid Unnecessary Checked Exceptions

Checked exceptions force callers to handle exceptions. However, if it’s not truly a recoverable situation, using checked exceptions hurts API usability and makes code messy.

Why is this problematic?

Messy caller code

try {
    obj.action();
} catch (SomeCheckedException e) {
    // Actually no recovery method available
    throw new RuntimeException(e);
}

→ Callers inevitably end up with unnecessary “catch and rethrow” code.

Reduced API usability
- Developers always have to write try-catch blocks
- APIs become unnecessarily complex

Correct Design Principles

Use runtime exceptions (Unchecked Exception) instead of checked exceptions if no recovery is possible
Use checked exceptions only when clients must respond
Returning an Optional (or, less safely, null) can be better when a missing value is an expected outcome rather than a failure
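
As a rough illustration of the Optional alternative, the sketch below uses a hypothetical UserDirectory class with made-up data. It assumes that "user not found" is an ordinary outcome for this lookup, so the method returns an Optional instead of declaring a checked exception.

```java
import java.util.Map;
import java.util.Optional;

class UserDirectory {
    // Illustrative in-memory data; a real lookup would hit a repository.
    private final Map<String, String> emailsById = Map.of("u1", "kim@example.com");

    // An absent user is an expected outcome, not a failure, so the return
    // type itself says "maybe empty" and no checked exception is declared.
    Optional<String> findEmail(String userId) {
        return Optional.ofNullable(emailsById.get(userId));
    }

    public static void main(String[] args) {
        UserDirectory directory = new UserDirectory();
        System.out.println(directory.findEmail("u1").orElse("(none)")); // kim@example.com
        System.out.println(directory.findEmail("u2").orElse("(none)")); // (none)
    }
}
```

Callers choose how to handle absence (orElse, map, ifPresent) without a try-catch block, which keeps the API lighter than a checked UserNotFoundException would.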

Examples

Wrong approach

// Caller cannot actually recover
public void connect() throws IOException {     
  // IOException thrown on connection failure
}

try {
    service.connect();
} catch (IOException e) {
    // Cannot recover but must catch and rethrow or just log
}

Improved approach

// Not recoverable → change to runtime exception
public void connect() {
    if (!canConnect()) {
        throw new IllegalStateException("Cannot connect to server");
    }
    // ... open the connection ...
}

// Caller wants to avoid the exception → provide a state-check method
if (service.canConnect()) {
    service.connect();
}

2.4. Favor the Use of Standard Exceptions

Java provides well-defined standard exception classes. Rather than defining new exception classes, using appropriate standard exceptions is advantageous for consistency, readability, and maintainability.

Standard exceptions should be the first choice, and creating new exceptions should be a last resort.

Why use standard exceptions?

Consistency
- All Java developers can easily understand the meaning
- Names like NullPointerException, IllegalArgumentException are self-explanatory
Avoiding unnecessary duplication
- No need to create custom exceptions with the same functionality as existing ones
API simplification
- Prevents unnecessary proliferation of exception classes → easier maintenance

Commonly Used Standard Exceptions

Exception	Usage
IllegalArgumentException	When arguments are invalid
IllegalStateException	When object state is inappropriate for method call
NullPointerException	When a null argument is passed where null is prohibited
IndexOutOfBoundsException	When index is out of range
ConcurrentModificationException	When an object is modified concurrently in a way that is prohibited
UnsupportedOperationException	When called method is not supported
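
The sketch below shows how several of these standard exceptions might be used together in one small class. Playlist and its methods are hypothetical names chosen for illustration. Objects.requireNonNull and Objects.checkIndex are JDK helpers (the latter since Java 9) that throw the matching standard exception.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

class Playlist {
    private final List<String> tracks = new ArrayList<>();
    private boolean closed;

    void add(String track) {
        // NullPointerException: null is prohibited for this argument.
        Objects.requireNonNull(track, "track must not be null");
        // IllegalArgumentException: the argument is non-null but invalid.
        if (track.isBlank()) {
            throw new IllegalArgumentException("track must not be blank");
        }
        // IllegalStateException: the object's state does not allow this call.
        if (closed) {
            throw new IllegalStateException("playlist is closed");
        }
        tracks.add(track);
    }

    String get(int index) {
        // IndexOutOfBoundsException: thrown by Objects.checkIndex when out of range.
        return tracks.get(Objects.checkIndex(index, tracks.size()));
    }

    void close() {
        closed = true;
    }
}
```

No custom exception class is needed here: each failure maps onto an exception whose meaning Java developers already know.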

Examples

Using standard exceptions

public void setAge(int age) {
  if (age < 0) {
      throw new IllegalArgumentException("Age cannot be negative: " + age);
  }
  this.age = age;
}

Unnecessary custom exception

// Actually IllegalArgumentException is sufficient
public class InvalidAgeException extends RuntimeException {
  public InvalidAgeException(String message) {
      super(message);
  }
}

2.5. Throw Exceptions Appropriate to the Abstraction

Exceptions thrown by a method should match the abstraction level of that method.

Lower-level implementation details should not leak through exceptions - they should be translated to exceptions appropriate for the higher abstraction level.

Why is this important?

Maintaining Encapsulation
- External APIs shouldn’t change when internal implementation technology changes
- Exposing internal exceptions leaks implementation details
Consistent API
- Callers only need to think at the method’s abstraction level
- “What situations can cause this method to fail?” is all they need to understand
Maintenance ease
- If internal technology changes (e.g., DB → file storage), client code shouldn’t need updates as long as the API’s exceptions stay the same

Wrong Example (Exposing Implementation Exceptions)

// Internal library code
public List<String> readNames() throws SQLException {
    // DB access logic
}

Problem: Clients depend on SQLException → API changes needed when DB is replaced

Correct Example (Matching Abstraction Level)

// Abstracted API
public List<String> readNames() throws DataAccessException {
    try {
        // DB access
    } catch (SQLException e) {
        throw new DataAccessException("Database read failed", e);
    }
}

Clients only need to understand “data cannot be read” in abstract terms
Internal implementation (DB vs file) can change without affecting the API

Exception Translation

Convert lower-level exceptions → higher-level abstraction exceptions
Methods:
- Exception Translation: Wrap lower exceptions in higher-level exceptions
- Exception Chaining: Pass lower exception as cause (new MyException("msg", cause))

try {
    // DB access
} catch (SQLException e) {
    throw new DataAccessException("Data access failed", e); // include cause
}
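
The following self-contained sketch shows why passing the cause matters. It assumes a hypothetical unchecked DataAccessException with a (message, cause) constructor, and a stand-in queryName() method that always fails. Neither comes from a real library.

```java
import java.sql.SQLException;

class TranslationDemo {

    // Hypothetical higher-level exception; the (message, cause) constructor
    // is what makes exception chaining possible.
    static class DataAccessException extends RuntimeException {
        DataAccessException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    // Stand-in for a low-level call that fails with a JDBC exception.
    static String queryName() throws SQLException {
        throw new SQLException("connection refused");
    }

    // Translates the low-level exception while keeping it as the cause.
    static String readName() {
        try {
            return queryName();
        } catch (SQLException e) {
            throw new DataAccessException("Data access failed", e);
        }
    }

    public static void main(String[] args) {
        try {
            readName();
        } catch (DataAccessException e) {
            // The caller sees the abstract failure; the cause stays available for debugging.
            System.out.println(e.getMessage() + " <- " + e.getCause().getMessage());
        }
    }
}
```

Running main prints "Data access failed <- connection refused". The API surface mentions only DataAccessException, yet the original SQLException is still reachable through getCause() and appears in stack traces.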

3. Conclusion: Applying Lessons to Real-World Problems

After studying the theory, let me share how I resolved the library integration problem mentioned in the overview.

3.1. Problem Situation: Two Conflicting Exception Handling Approaches

The third-party library we were integrating validates requests before they reach application controllers. The library guidelines were:

The checkPermission() method throws a single AuthorizationException on failure
This exception contains an ErrorCode indicating the failure reason (TOKEN_EXPIRED, INSUFFICIENT_PERMISSIONS)
The GlobalExceptionHandler should catch this AuthorizationException, check the internal ErrorCode, and use if/else branching to return different HTTP status codes (401, 403, etc.)

[Library-recommended approach]

@RestControllerAdvice
public class GlobalExceptionHandler {

    @ExceptionHandler(AuthorizationException.class)
    public ResponseEntity<ErrorResponse> handleAuthorizationException(AuthorizationException e) {
        // 😫 Logic to analyze the cause inside ExceptionHandler
        if (e.getErrorCode() == ErrorCode.TOKEN_EXPIRED) {
            return ResponseEntity.status(HttpStatus.UNAUTHORIZED)
                .body(new ErrorResponse("Token has expired."));
        } else if (e.getErrorCode() == ErrorCode.INSUFFICIENT_PERMISSIONS) {
            return ResponseEntity.status(HttpStatus.FORBIDDEN)
                .body(new ErrorResponse("Access denied."));
        }
        // ... various other error code branches
        return ResponseEntity.status(HttpStatus.INTERNAL_SERVER_ERROR)
            .body(new ErrorResponse("Unknown authentication error."));
    }
}

However, this conflicted with our project’s exception handling principles:

ExceptionHandler should have a single, simple responsibility: converting exception types to appropriate HTTP responses
Throw clear custom exceptions matching the root cause (e.g., ProductNotFoundException, InvalidOrderException)

3.2. Finding Solutions with Effective Java Principles

The Effective Java principles I just summarized provided clear direction:

Item 73: Throw exceptions appropriate to the abstraction
- The interceptor’s role is the abstract concept of ‘authentication/authorization’. Exposing AuthorizationException with library implementation details is inappropriate. TOKEN_EXPIRED should be translated to UnauthorizedException, and INSUFFICIENT_PERMISSIONS to ForbiddenException at a higher abstraction level.
Item 73 (continued): Exception translation
- Instead of exposing lower-level exceptions directly, I decided to apply ‘exception translation’ by wrapping them in appropriate higher-level exceptions. The Interceptor would act as an ‘adapter’, catching the library’s AuthorizationException and converting it to project-compliant exceptions.

3.3. Final Solution: Maintaining Architectural Consistency Through Exception Translation

1. Define Project-Appropriate Custom Exceptions

First, I defined exceptions appropriate for our project’s abstraction level:

// 401 Unauthorized
public class UnauthorizedException extends RuntimeException {
    public UnauthorizedException(String message) { super(message); }
}

// 403 Forbidden
public class ForbiddenException extends RuntimeException {
    public ForbiddenException(String message) { super(message); }
}

2. Apply Exception Translation in Interceptor

Then, I modified the AuthInterceptor to catch the library’s exceptions and translate them to appropriate custom exceptions:

@Component
public class AuthInterceptor implements HandlerInterceptor {

    private final AuthenticationService authService; // External library service

    // ... constructor omitted ...

    @Override
    public boolean preHandle(HttpServletRequest request, HttpServletResponse response, Object handler) {
        try {
            String token = request.getHeader("Authorization");
            authService.checkPermission(token); // This method throws AuthorizationException
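        } catch (AuthorizationException e) {
            // ✨ Exception Translation happens here
            if (e.getErrorCode() == ErrorCode.TOKEN_EXPIRED) {
                throw new UnauthorizedException("Authentication token has expired.");
            } else if (e.getErrorCode() == ErrorCode.INSUFFICIENT_PERMISSIONS) {
                throw new ForbiddenException("You don't have permission to access this resource.");
            }
            // Any other error code must still block the request instead of falling through to `return true`
            throw new IllegalStateException("Unhandled authorization error: " + e.getErrorCode(), e);
        }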
        return true;
    }
}
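
One way to keep this translation easy to unit-test is to pull the mapping into a plain method that does not depend on Spring. The sketch below is only an illustration under stated assumptions: ErrorCode, AuthorizationException, and the (message, cause) constructors are hypothetical stand-ins, not the library's or the project's actual types.

```java
class AuthTranslationSketch {

    // Hypothetical stand-ins for the library's types.
    enum ErrorCode { TOKEN_EXPIRED, INSUFFICIENT_PERMISSIONS, UNKNOWN }

    static class AuthorizationException extends RuntimeException {
        private final ErrorCode errorCode;

        AuthorizationException(ErrorCode errorCode) {
            super("authorization failed: " + errorCode);
            this.errorCode = errorCode;
        }

        ErrorCode getErrorCode() {
            return errorCode;
        }
    }

    // Hypothetical project-level exceptions that also keep the cause.
    static class UnauthorizedException extends RuntimeException {
        UnauthorizedException(String message, Throwable cause) { super(message, cause); }
    }

    static class ForbiddenException extends RuntimeException {
        ForbiddenException(String message, Throwable cause) { super(message, cause); }
    }

    // Maps each library error code to a project-level exception.
    static RuntimeException translate(AuthorizationException e) {
        switch (e.getErrorCode()) {
            case TOKEN_EXPIRED:
                return new UnauthorizedException("Authentication token has expired.", e);
            case INSUFFICIENT_PERMISSIONS:
                return new ForbiddenException("Access denied.", e);
            default:
                // Unmapped codes must still fail the request rather than pass silently.
                return new IllegalStateException("Unhandled error code: " + e.getErrorCode(), e);
        }
    }
}
```

With this shape, the interceptor's catch block could reduce to a single `throw translate(e);`, and the default branch ensures that an unexpected error code never lets a request through.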

3. Simplified ExceptionHandler

As a result, the GlobalExceptionHandler returned to its clean original role of ‘simple conversion based on exception type’:

@RestControllerAdvice
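public class GlobalExceptionHandler {

    // SLF4J logger (Lombok's @Slf4j on the class would work as well)
    private static final Logger log = LoggerFactory.getLogger(GlobalExceptionHandler.class);

    // 👍 Handler with clear roles and responsibilities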
    @ExceptionHandler(UnauthorizedException.class)
    @ResponseStatus(HttpStatus.UNAUTHORIZED)
    public ErrorResponse handleUnauthorized(UnauthorizedException e) {
        log.info(e.getMessage());
        return new ErrorResponse(e.getMessage());
    }

    @ExceptionHandler(ForbiddenException.class)
    @ResponseStatus(HttpStatus.FORBIDDEN)
    public ErrorResponse handleForbidden(ForbiddenException e) {
        log.warn(e.getMessage());
        return new ErrorResponse(e.getMessage());
    }
    
    // ... other handlers ...
}

4. Final Thoughts

While library and framework guidelines are important, when they conflict with our application’s overall design principles and consistency, it may be better to integrate them non-intrusively by adding an ‘adapter’ layer as shown above.

Ultimately, good exception handling goes beyond simply catching errors - it’s an important design activity that enhances code readability, maintainability, and overall system stability.

This article is based on real work experiences, with specific system names and configuration values generalized for security purposes.

References

Bloch, J. (2018). Effective Java (3rd Edition). Addison-Wesley Professional.
Martin, R. C. (2008). Clean Code: A Handbook of Agile Software Craftsmanship. Prentice Hall.
Fowler, M. (2018). Refactoring: Improving the Design of Existing Code (2nd Edition). Addison-Wesley Professional.
Oracle. (2021). The Java™ Tutorials - Exception Handling. Oracle Documentation.
Spring Framework Documentation. (2023). Exception Handling in Spring MVC. VMware, Inc.