Virtual thread is set to be released in JDK 21. This feature will bring significant benefits to many IO-bound frameworks. It will alleviate the burden on Java developers like me from having to write reactive-style code. The following tests have been conducted using JDK 21 EA build 25. CPU: Ryzen 4800H mobile.
JVM Memory and Startup Time
Java is presumed to consume more memory in general. Lets see how much memory a hello world program uses and explore how it grows by adding dependencies.
The maximum resident set size (RSS) memory used is measured at 44104 or approximately 44MB. This result is not bad, as I anticipated a higher value. However, let us now examine the actual heap usage.
Upon starting Visual VM, the JVM immediately spawns multiple threads to facilitate the recording and monitoring of its memory usage. Consequently, this results in an increased resident set size (RSS) of 117MB. However, it is worth noting that the heap usage remains relatively low, at approximately 11MB. Now, let us proceed to limit the memory usage and observe how low we can potentially reduce the RSS.
java -Xmx15m -Xms15m -jar target/jarfilename.jar
The resident set size (RSS) is measured at 41420 or 41MB, which appears to be the minimum memory usage of the JVM. Surprisingly, the heap usage is a mere 7MB.
I also attempted to time the "Hello World" program without any sleep. It took approximately 41ms to complete, indicating that the JVM startup time is at most 41ms.
Benchmarking Tomcat Server
So far, the JVM has been performing well. Now, let's examine its performance in serving HTTP requests. For this test, we have selected Tomcat 11-M7 embedded. This particular version offers an option to utilize virtual threads, which can be configured either in the server.xml file or programmatically within the embedded Tomcat. The Maven dependency for this setup is:
Hello World Servlet
This code is a straightforward HTTP servlet that sends a "Hello World" response.
Please note that I utilized the NIO endpoint instead of NIO2. It is important to mention that NIO2 does not dispatch requests within virtual threads.
The heap usage is ~35MB, since the heap usage is very low, lets limit to 50MB with Xmx50m
This usage is stable. The spikes in the graph are due to running benchmark. Adding below.
Disclaimer: No new threads were created during this test :). The benchmark sends 100k requests with a concurrency level set to 100. It is truly remarkable to observe that 100k requests are sent within 0.6 seconds, resulting in a request per second (req/s) rate of 165k. Furthermore, these operations are accomplished with a mere 35MB of heap usage. The resident set size (RSS) is recorded at 180MB. Let us proceed to benchmark this scenario once again using wrk.
100 Connections, -Xmx50m
326K req/s is really good. Another test was conducted with 1000 concurrent connections using a maximum heap size (-Xmx) of 50m.
1000 Connections, -Xmx50m
It is remarkable that this does not utilize more memory. The resident set size (RSS) remains at 180M, even with 1000 concurrent requests. It appears that virtual threads, in the end, do not consume a significant amount of memory.
1000 Connections, no -Xmx
Benchmarking 1000 connections without -Xmx below:
RSS at 600M and heap is at 300/400MB. This suddenly is not impressive. 300M for 1000 requests means 3GB for 10k requests and 7.5GB for 25K requests approximately. :(
25k Connections (failure)
Dispatching 1000 threads does not result in a memory usage of 1000MB. However, let us push the boundaries even further by testing with 25,000 concurrent requests.
25k Connections - tomcat
🤦 I had to fight with TCP/IP stack of linux to run this test. In addition to tweaking kernel parameters, the tomcat connector is configured to handle high concurrency.
The benchmark executed successfully, with the resident set size (RSS) reaching 2350MB and the heap reaching 2112MB. These values are considered decent for 25k concurrent connections, although not as high as the anticipated 7.5GB. It is important to note that we are discussing actual HTTP connections here, which involve more than just spinning off multiple threads.
25k Connections, 10s sleep - tomcat
Adding a sleep before writing response, because you know we use virtual threads. This test is to hold 25k connections for 10 seconds.
When blocking 25k connections, the resident set size (RSS) eventually reaches 3.8GB, while the heap size reaches 3.2GB. The req/s doesn't matter much as most of the time the server only waits in sleep. In a practical scenario it would be waiting for database or signal from another service. An example scenario is to send live score of a sport to 25k users. When live score is updated all 25k waiting connections will be sent the current live score. The memory usage is is really good, but how good is this? One of the most memory efficient and safe language is Rust. Let's compare with it just because of curiosity.
The rust example uses axum framework with a similar hello world request handler. The program is built with --release flag and the version is 1.70.0.
It seems that there is no straightforward method to limit the number of concurrent connections within the framework. I speculate that the framework has a restriction on the number of streams per HTTP/2 connection, resulting in the failure of the wrk benchmark to send 25k requests. To address this issue, I increased the number of threads in wrk as a workaround. This approach assumes that in wrk, each thread establishes a new connection, thereby reducing the number of streams per connection.
25k Connections with sleep for 10s - Rust Axum
At the end, RSS usage of the rust program is at 537MB. Compared to Java's 3.8GB Rust is 7 times more memory efficient. I would say it is not bad.
Let's go even further by switching the java server. I have a feeling tomcat uses more memory per request. Let's try vert.x web. I use smallrye mutiny flavor of Vert.x.
25k Connections, 10s sleep - Vertx Web
As you can see the benchmark ran fine, but whats unexpected is the memory usage.
250MB to handle 25k HTTP connections. With -Xmx300m, the max RSS is down to 484MB. This is unreal. I had to increment a counter inside the handler to verify if it actually dispatches that many requests. Yes I counted 25k requests dispatching on the server. Comparing RSS value of Vert.x's 484MB to Axum's 537MB, I hereby declare Java uses less memory than Rust in this scenario :).
It is worth mentioning that there might be optimizations that can be done in Rust, or perhaps there are other Rust frameworks that can perform better than Vert.x. The primary goal here is not optimization or prove what is efficient, but rather to observe how the framework/language handles threads when the code is written without any intent for optimization or performance profiling. The Rust comparison is simply used as a reference to evaluate the effectiveness of Java's virtual threads :).