
Phoenix Framework: Designing a Business Concurrent Framework from 0 to 1 - Core Design of the Concurrent Thread Pool

Background#

This article is part of the series "Designing a Concurrent Framework for Business from 0 to 1".

The first two articles covered the background of the framework design and the details of the abstract design. Today I will cover the most critical part of the concurrent framework: the core design of the concurrent thread pools. I will focus on the problems I ran into when deciding how to partition the thread pools, and the approach I ultimately adopted.

(Figure: Concurrent Invocation Group)

After dividing the dependent tasks into groups, we can execute the groups one by one to obtain all the desired results. But how to partition and configure the thread pools is the problem we now face.

Below, I strip away the complexity of the real business design and present the problem in concrete terms.
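
To make the discussion concrete, here is a minimal sketch, in Java and with hypothetical names, of the task model the figures assume: each task declares its dependencies, and tasks are grouped into layers so that layer n runs only after every earlier layer (the graph is assumed acyclic).

```java
import java.util.*;

// Hypothetical task model: each task knows which tasks it depends on.
record Task(String name, List<String> dependsOn) {}

final class TaskLayers {
    // Group tasks into dependency layers (a simple topological layering):
    // layer 0 has no dependencies; layer n depends only on layers < n.
    static List<List<Task>> group(List<Task> tasks) {
        Map<String, Task> byName = new HashMap<>();
        tasks.forEach(t -> byName.put(t.name(), t));
        Map<String, Integer> depth = new HashMap<>();
        for (Task t : tasks) depthOf(t, byName, depth);
        int maxDepth = depth.values().stream().max(Integer::compare).orElse(0);
        List<List<Task>> layers = new ArrayList<>();
        for (int i = 0; i <= maxDepth; i++) layers.add(new ArrayList<>());
        for (Task t : tasks) layers.get(depth.get(t.name())).add(t);
        return layers;
    }

    // Depth of a task = 1 + deepest dependency (memoized; assumes no cycles).
    private static int depthOf(Task t, Map<String, Task> byName, Map<String, Integer> memo) {
        Integer cached = memo.get(t.name());
        if (cached != null) return cached;
        int d = 0;
        for (String dep : t.dependsOn()) {
            d = Math.max(d, depthOf(byName.get(dep), byName, memo) + 1);
        }
        memo.put(t.name(), d);
        return d;
    }
}
```

With the tasks used in the timelines below, TaskA and TaskB would land in layer 0, TaskC and TaskD (which depend on TaskA) in layer 1, and TaskE (which depends on TaskC) in layer 2.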

Solution: Shared Thread Pool#

Solution#

Initially, I planned to submit all tasks to a single shared thread pool and let them compete for threads, as shown in the following figure:

(Figure: Shared Thread Pool)
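
A minimal sketch of this v1 idea (hypothetical code, not the framework's actual implementation): every task of every request goes into one shared pool, layer by layer.

```java
import java.util.List;
import java.util.concurrent.*;

public class SharedPoolRunner {
    // One pool shared by every task of every request -- the v1 design.
    private final ExecutorService shared = Executors.newFixedThreadPool(32);

    // Execute the dependency layers one by one; all tasks in a layer
    // compete for the same threads, across all in-flight requests.
    public void run(List<List<Runnable>> layers) throws InterruptedException {
        for (List<Runnable> layer : layers) {
            CountDownLatch done = new CountDownLatch(layer.size());
            for (Runnable task : layer) {
                shared.submit(() -> {
                    try { task.run(); } finally { done.countDown(); }
                });
            }
            done.await(); // a single slow TaskA blocks the whole layer here
        }
    }
}
```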

However, I soon discovered that once the request volume on the shared thread pool increased, a single slow task interface would rapidly drag down the success rate of the entire interface until it became unavailable.

Why does this situation occur?

Effect#

(Figure: Shared Thread Pool, execution timeline)

  • Time T1: The first wave of traffic comes in, and TaskA and TaskB execute first.
  • Requests to TaskA pile up rapidly, and its interface becomes slower and slower.
  • Time T2: The two TaskA tasks from the first wave are still unfinished when the second wave of traffic arrives:
    • TaskC and TaskD from the first wave start executing.
    • TaskA and TaskB from the second wave also obtain threads and start executing.
  • Time T3: Four TaskA tasks are now unfinished, and the first two are about to time out:
    • The TaskA tasks from the first wave are facing timeout interruption.
    • The TaskA tasks from the second wave are still running.
    • The third wave of traffic arrives with fresh TaskA and TaskB executions, complicating the situation further.
    • The first two layers of the first wave are now complete, and TaskE starts executing.
    • The first layer of the second wave is complete, and TaskC and TaskD start executing.
  • TaskA stays slow, and the situation keeps deteriorating...
  • Time Tn: By now most of the thread pool is occupied by TaskA tasks from the previous n waves of traffic, and a large number of threads have been interrupted by timeouts. Other tasks cannot compete for threads at all.

In this case, the availability of the whole interface hinges entirely on the availability of TaskA. Worse, there is a fatal side effect: other tasks either never get to run or, because of the dependencies, receive empty parameters from their unfinished upstream tasks and cannot issue normal requests. Even when the interface does return data, the data is incomplete.

This solution has the problem of "a large number of threads waiting for timeouts" and is not acceptable.

Solution: Layered Thread Pool#

Solution#

A single shared thread pool is clearly problematic. Building on that design, I tried splitting the concurrent execution into separate pools by layer, with each layer sharing one thread pool, as shown in the following figure:

(Figure: Layered Thread Pool)
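
A minimal sketch of the v2 idea, again with hypothetical code: each dependency layer gets its own pool, but all requests within a layer still share it.

```java
import java.util.List;
import java.util.concurrent.*;

public class LayeredPoolRunner {
    // One pool per dependency layer -- the v2 design. Requests still
    // share each layer's pool, so a slow task starves its own layer.
    private final List<ExecutorService> pools = List.of(
            Executors.newFixedThreadPool(16),  // layer 1: TaskA, TaskB
            Executors.newFixedThreadPool(8),   // layer 2: TaskC, TaskD
            Executors.newFixedThreadPool(4));  // layer 3: TaskE

    public void run(List<List<Runnable>> layers) throws InterruptedException {
        for (int i = 0; i < layers.size(); i++) {
            CountDownLatch done = new CountDownLatch(layers.get(i).size());
            for (Runnable task : layers.get(i)) {
                pools.get(i).submit(() -> {
                    try { task.run(); } finally { done.countDown(); }
                });
            }
            done.await();
        }
    }
}
```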

After switching to the layered shared thread pools, stress testing showed only a slight improvement in performance, far from the expected goal. Why?

Effect#

(Figure: Layered Thread Pool, execution timeline)

Again take TaskA as the example: it times out on a large scale as concurrency increases.

  • Time T1: The first wave of traffic comes in, and TaskA and TaskB execute first in thread pool 1, while thread pools 2 and 3 sit idle.
  • Requests to TaskA pile up rapidly, and its interface becomes slower and slower.
  • Time T2: The two TaskA tasks from the first wave are still unfinished when the second wave of traffic arrives:
    • TaskC and TaskD from the first wave start executing in thread pool 2.
    • As TaskC from the first wave completes, TaskE starts executing in thread pool 3.
    • TaskA and TaskB from the second wave also obtain threads in thread pool 1.
  • Time T3: Four TaskA tasks are now unfinished, and the first two are about to time out:
    • The TaskA tasks from the first wave in thread pool 1 are facing timeout interruption.
    • The TaskA tasks from the second wave in thread pool 1 are still running.
    • The third wave of traffic arrives, complicating the situation further.
    • The first two layers of the first wave are now complete, and TaskE starts executing in thread pool 3.
    • The first layer of the second wave is complete, and TaskC and TaskD start executing in thread pool 2.
  • TaskA stays slow, and the situation keeps deteriorating...
  • Time Tn: By now most of thread pool 1 is occupied by TaskA tasks from the previous n waves of traffic, and a large number of threads have been interrupted by timeouts. Since TaskC and TaskD depend on TaskA's results, they have nothing to work with:
    • TaskA, being far too slow, occupies nearly 100% of thread pool 1's resources.
    • TaskB cannot compete for threads and is interrupted by timeouts.

In the end, the interface still becomes unavailable. Fundamentally this is the same problem as the shared thread pool: a large number of threads are left waiting for timeouts.

Thread pools shared at any granularity turn out to be unacceptable, so how should the thread pools be divided for execution? The idea of divide and conquer solves this problem, which brings us to version 3.0: the "independent task thread pool" solution.

Solution: Independent Thread Pool#

No matter how a thread pool is shared, some tasks will squeeze out others. Only by giving each task its own thread pool can we avoid contention and waiting. So how should this be designed?

Solution#

Create a separate thread pool for each task to handle its traffic. The pools do not interfere with one another, and scheduling across them is left to the CPU.
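
A minimal sketch of this v3 layout, assuming a hypothetical registry keyed by task name (the pool sizes are placeholders; sizing is discussed under Optimization below):

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Supplier;

public class IndependentPools {
    // One dedicated pool per task type -- the v3 design. Sizes here are
    // placeholders; weight-based sizing is covered under Optimization.
    private final Map<String, ExecutorService> pools = Map.of(
            "TaskA", Executors.newFixedThreadPool(8),
            "TaskB", Executors.newFixedThreadPool(8),
            "TaskC", Executors.newFixedThreadPool(8),
            "TaskD", Executors.newFixedThreadPool(8),
            "TaskE", Executors.newFixedThreadPool(8));

    // Run a task on its own pool: contention stays confined to that pool,
    // so a slow TaskA can exhaust only TaskA's threads.
    public <T> CompletableFuture<T> submit(String taskName, Supplier<T> call) {
        return CompletableFuture.supplyAsync(call, pools.get(taskName));
    }
}
```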

Effect#

Because the traffic is partitioned, this design meets the goal of high availability. Still take TaskA as the example whose interface becomes slower and eventually unavailable as request concurrency grows, and add one further condition: "TaskC can only execute after TaskA completes" (a sketch of this dependency follows the timeline below).

  • Time T1: The first wave of traffic comes in, every thread pool's threads are occupied, and core scheduling and execution begin.
  • Time T2: The second wave of requests arrives while the two TaskA tasks from the first wave are still unfinished. The other thread pools steadily take on the second wave and queue it for scheduling.
  • Time T3: The third wave of requests arrives, and the situation grows more complicated:
    • The two TaskA tasks from the first wave have already timed out, so the two TaskC tasks waiting on their results in the TaskC pool fail.
    • The TaskA tasks from the second wave are still unfinished and on the verge of timing out.
    • All other thread pools execute normally.
  • After a while...
  • Time Tn:
    • TaskA has become unavailable.
    • TaskC, which depends on TaskA, gradually becomes unavailable as well.
    • All other tasks execute normally.
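
Under stated assumptions (Java 9+ for orTimeout, plus the hypothetical IndependentPools sketch above), the "TaskC only after TaskA" dependency can be wired so that a timed-out TaskA fails its dependents fast instead of leaving their threads waiting:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

public class DependentCall {
    // Reuses the hypothetical IndependentPools sketch from above.
    private final IndependentPools pools = new IndependentPools();

    public CompletableFuture<String> taskCAfterTaskA() {
        CompletableFuture<String> taskA = pools
                .submit("TaskA", this::callUpstreamA)
                .orTimeout(500, TimeUnit.MILLISECONDS); // bound TaskA's wait (Java 9+)

        // thenCompose only runs if TaskA succeeded; if TaskA timed out,
        // TaskC fails fast instead of holding a thread while it waits.
        return taskA.thenCompose(
                a -> pools.submit("TaskC", () -> callUpstreamC(a)));
    }

    // Placeholder upstream calls, not from the article.
    private String callUpstreamA() { return "A"; }
    private String callUpstreamC(String a) { return a + "->C"; }
}
```

Because thenCompose never runs when taskA completed exceptionally, a TaskA timeout propagates to TaskC immediately, matching the fail-fast behavior described at time T3.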

In this way, for a scenario that fans out to dozens or hundreds of upstream interfaces, the availability of the whole interface no longer hinges on the availability of any single upstream interface or its dependents. As long as each thread pool is monitored and alerted on individually, we can detect at runtime which upstream interfaces have failed and promptly notify the maintainers of the corresponding systems, which greatly reduces maintenance costs.
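
Because every task owns its own pool, that pool's health can be read directly off ThreadPoolExecutor. A minimal monitoring sketch (the alert channel is a placeholder):

```java
import java.util.Map;
import java.util.concurrent.ThreadPoolExecutor;

public class PoolMonitor {
    // Periodically inspect each task's dedicated pool. An exhausted or
    // heavily queued pool points directly at the failing upstream task.
    public void report(Map<String, ThreadPoolExecutor> pools) {
        pools.forEach((task, pool) -> {
            int active = pool.getActiveCount();
            int max = pool.getMaximumPoolSize();
            int queued = pool.getQueue().size();
            if (active == max && queued > 0) {
                // Hook up the real alerting channel here (placeholder).
                System.err.printf("ALERT %s: pool saturated (%d/%d active, %d queued)%n",
                        task, active, max, queued);
            }
        });
    }
}
```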

This version was the first to be pushed to the online production environment. On a single 8C 8G instance (k8s), the framework achieved a QPS of 14,000 with an interface availability of 99.96% (for reference only; results vary with cluster deployment strategy and machine performance).

However, this solution still has an obvious problem: the interfaces behind each task have very different response times. Some respond within 50 ms, some within 100 ms, and some take as long as 500 ms. Giving every task a pool of the same size is unreasonable, because it makes CPU scheduling unfair. So how can scheduling be made fairer?

Optimization#

To address this, create thread pools of different sizes based on "weights": interfaces that are slower but still return within an acceptable wait get more threads, while fast interfaces get correspondingly smaller pools. This keeps the interface available while preserving the completeness of the response fields.
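
The article does not give the exact weighting formula, so as an illustration only, one common approach is Little's law: the concurrency a pool needs is roughly target QPS times average response time in seconds, so slower interfaces naturally receive more threads.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class WeightedPoolFactory {
    // Size a task's pool from its target throughput and measured latency.
    // By Little's law, needed concurrency ~= QPS * RT(seconds); the floor
    // of 2 threads is a tuning knob, not a value from the article.
    static ExecutorService forTask(int targetQps, int avgRtMillis) {
        int size = Math.max(2, (int) Math.ceil(targetQps * avgRtMillis / 1000.0));
        return Executors.newFixedThreadPool(size);
    }
}
```

At the same target QPS, this rule gives a 500 ms interface ten times the threads of a 50 ms one, matching the intent of the weight-based sizing above.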

Conclusion#

This article discussed how the framework divides and executes concurrent tasks. In the end I adopted the independent thread pool solution and sized each task's pool based on factors such as execution time and CPU cores, so that the CPU can schedule threads as fairly as possible while meeting the interface's concurrency and high-availability requirements.

If you are interested, follow my official account or subscribe to this website. Feel free to reach out and exchange ideas. Let's become stronger together~
