Theron - C++ concurrency library

Theron performance

This page presents performance results for Theron 3.03, for five different benchmarks: ThreadRing, ParallelThreadRing, CountingActor, ProducerConsumer and PingPong. Source code for all benchmarks is provided in the 3.03 distribution; see the individual pages for more details. As well as presenting results for version 3.03, we compare them against those of the previous versions, 3.00 and 3.02.

Here are the headline figures:

  • 5 seconds for 50 million message "hops" in ThreadRing
  • Peak throughput of 10 million messages per second
  • 5 seconds for 50 million messages (25 million message/response cycles) in PingPong
  • Best-case message/response latency of around 200 nanoseconds
  • 12.3 seconds for 50 million messages in CountingActor (0.7 seconds for 3 million)

Test environment

All performance measurements were made on an Intel Xeon X5550 2.66GHz (four hyperthreaded cores, 8 hardware threads), with 6 GB RAM, running Windows 7 64-bit SP1. The results were measured with optimized 32-bit and 64-bit Visual Studio 2010 builds, built with the included Visual Studio solution, using Windows threads. Theron was configured with 16 software worker threads for all tests.
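For reference, the worker thread count is set when the framework is constructed. The following is a minimal sketch of that configuration, assuming the Theron 3.x convention of passing the thread count to the Theron::Framework constructor:

    #include <Theron/Framework.h>

    int main()
    {
        // 16 software worker threads, matching the configuration used for all
        // measurements on this page. The thread-count constructor argument is
        // assumed here from the Theron 3.x API.
        Theron::Framework framework(16);

        // ... create actors and run the benchmark ...
        return 0;
    }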

Message counts of 50 million messages were used for all tests. In the case of PingPong, this constitutes 25 million complete message/response cycles. In the case of ParallelThreadRing, the 50 million messages, or "hops", were split as evenly as possible among the 503 tokens sent around the ring (with the first 502 tokens performing 99404 hops each, and the last performing 99192 hops). In the case of the ProducerConsumer benchmark, two Producers were enabled, with each sending 25 million messages concurrently to a single Consumer.
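The hop split works out as follows: rounding 50,000,000 / 503 up gives 99,404 hops for each of the first 502 tokens, leaving 99,192 for the last. A small worked check of that arithmetic (illustration only, not part of the benchmark code):

    #include <cstdio>

    int main()
    {
        const long long totalHops = 50000000;   // total message hops in the benchmark
        const long long tokens = 503;           // tokens circulating the ring

        // Round up to find the per-token hop count for all but the last token.
        const long long hopsPerToken = (totalHops + tokens - 1) / tokens;        // 99404
        const long long lastTokenHops = totalHops - hopsPerToken * (tokens - 1); // 99192

        std::printf("%lld tokens x %lld hops + 1 x %lld hops = %lld\n",
                    tokens - 1, hopsPerToken, lastTokenHops,
                    hopsPerToken * (tokens - 1) + lastTokenHops);
        return 0;
    }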

Results

This table presents the actual times in seconds. Each time presented is the best of 5 successive measurements:

Benchmark            Theron 3.00   Theron 3.02   Theron 3.03     Theron 3.03        Theron 3.03
                                                 32-bit,         32-bit,            64-bit,
                                                 dynamic_cast    registered types   registered types
CountingActor        21.3          18.7          17.8            14.7               12.3
ParallelThreadRing   22.8          17.4          17.3            14.3               12.0
PingPong             7.5           6.4           5.9             5.5                5.0
ProducerConsumer     33            16.1          15.1            14.4               12.3
ThreadRing           7.5           6.2           6.2             5.6                5.0

Note that three different times are presented for version 3.03. The 3.03 release introduces support for 64-bit Visual Studio builds, so times for the 64-bit build are available for the first time. Additionally, the benchmarks were changed in the 3.03 release to register their message types, avoiding expensive calls to the C++ dynamic_cast operator for message type identification. For that reason we present two times for the 32-bit build: one without message type registration (comparable to the times for 3.00 and 3.02, where message types were also unregistered) and one with (showing the speedup due to avoiding dynamic_cast). For the 64-bit results, message types were registered.
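To illustrate why registration helps (this is a generic sketch of the two identification strategies, not Theron's internals), identifying a message via dynamic_cast requires an RTTI query against a polymorphic base class on every dispatch, whereas a registered type reduces the check to an integer comparison against a pre-assigned type ID:

    #include <cstdint>

    // Generic illustration only -- not Theron's implementation.

    // Approach 1: identify a message by dynamic_cast against a polymorphic base.
    struct MessageBase { virtual ~MessageBase() {} };
    struct CountMessage : MessageBase { std::uint32_t count; };

    bool HandleViaDynamicCast(MessageBase *message)
    {
        // RTTI lookup on every dispatch; relatively expensive.
        if (CountMessage *count = dynamic_cast<CountMessage *>(message))
        {
            return count->count > 0;
        }
        return false;
    }

    // Approach 2: identify a message by a registered integer type ID.
    struct TaggedMessage { std::uint32_t typeId; };
    struct TaggedCountMessage : TaggedMessage { std::uint32_t count; };
    const std::uint32_t COUNT_MESSAGE_ID = 1;   // assigned at registration time

    bool HandleViaRegisteredId(TaggedMessage *message)
    {
        // A plain integer comparison replaces the RTTI query.
        if (message->typeId == COUNT_MESSAGE_ID)
        {
            return static_cast<TaggedCountMessage *>(message)->count > 0;
        }
        return false;
    }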

Discussion

Most obvious are the successive speedups in versions 3.02 and 3.03. The PingPong and ThreadRing benchmarks, which measure peak raw message throughput and latency, show similar reductions of about 20% between 3.00 and 3.03 (ignoring for now the additional benefits of message type registration and 64-bit builds).

The other benchmarks are more complex and show different effects. The ParallelThreadRing benchmark essentially measures the effect of massive contention for shared message processing resources: the 16 software threads, the 8 hardware threads, the 4 hardware cores, the 503 actor message queues, the single actor work queue, and the shared message memory cache. The bigger reduction of around 24% in this benchmark probably shows the beneficial effect of optimizations targeting cache coherence and memory overheads, which are stressed in this benchmark due to contention.

The CountingActor and ProducerConsumer benchmarks are similar and are both probably memory-bound due to allocating large amounts of memory for queued messages (because the writing threads in each case are able to flood the unbounded message queues of the reader). Both benchmarks allocate gigabytes of memory, and the amount of memory allocated is strongly dependent on the relative speeds of the writers and reader, making these benchmarks overly sensitive to optimizations.
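As a rough illustration of the scale involved (the per-message figure below is an assumed value chosen for illustration, not a measured Theron overhead): if the writers run far enough ahead of the reader that tens of millions of messages sit queued at once, even a modest per-message allocation cost adds up to gigabytes.

    #include <cstdio>

    int main()
    {
        // Assumed illustrative figures, not measured values.
        const double queuedMessages = 50.0e6;   // worst case: every message queued at once
        const double bytesPerMessage = 64.0;    // assumed allocation plus queue-node overhead

        const double gigabytes = queuedMessages * bytesPerMessage / (1024.0 * 1024.0 * 1024.0);
        std::printf("Worst-case queued message memory: %.1f GB\n", gigabytes);
        return 0;
    }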

In addition to the speedup in raw message processing, there are further speedups from registration of message types and from support for 64-bit builds. Avoiding the expensive dynamic_cast operator by registering the message types used in the benchmarks reduces both code size and runtime overheads, resulting in a further reduction in execution times of around 8% in the PingPong and ThreadRing benchmarks. Using 64-bit builds instead of 32-bit builds then gives a significant further reduction of around 10%.

The 64-bit results for PingPong and ThreadRing indicate a peak message processing throughput of 10 million messages per second. The similarity of the times in these benchmarks suggests there is little more overhead in sending a message around a ring of 503 actors than in sending it back and forth between just two.

The result of 5 seconds for 50 million messages in ThreadRing compares well with results for implementations of the same benchmark in other systems, including actor-based languages.

This isn't a completely like-for-like comparison since Theron is considerably more lightweight than some of the languages tested, and Theron uses an M:N architecture instead of actually creating 503 separate software threads. But it does give an indication of raw message processing ability.

The PingPong result corresponds to a round trip message latency of 200 nanoseconds. The latency, or response time, is defined as the time taken to send a message from one actor to another and then receive a message back in response. In this case the messages are 32-bit integers, decremented at both sender and receiver. Because messages are copied, latency will of course be greater with larger messages.
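The latency figure follows directly from the measured execution time; the sketch below shows just that arithmetic, not the benchmark harness itself.

    #include <cstdio>

    int main()
    {
        const double elapsedSeconds = 5.0;          // measured PingPong execution time
        const double messages = 50.0e6;             // total messages sent
        const double roundTrips = messages / 2.0;   // each round trip is a message plus a response

        // 25 million round trips in 5 seconds gives 200 ns per message/response cycle.
        const double latencyNanoseconds = elapsedSeconds / roundTrips * 1.0e9;
        std::printf("Round-trip latency: %.0f ns\n", latencyNanoseconds);
        return 0;
    }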

The execution time of 12.3 seconds for 50 million messages in CountingActor can be compared to anecdotal results published online for the same benchmark in other actor-based systems, although those comparisons typically use a message count of only 3 million. For reference, Theron 3.03's result for 3 million messages can be expected to be around 0.7 seconds using a 64-bit build, roughly in the same ball-park as the figure of 1.2 seconds quoted for Erlang.

Comparing the results for ThreadRing and ParallelThreadRing gives an indication of the overheads of significant parallelism in Theron. Both benchmarks consist of sending 50 million messages, or 'hops', in a ring of 503 connected actors. The difference is that in ParallelThreadRing the 50 million hops are performed by 503 tokens in parallel (with each token doing around 99400 hops) whereas in ThreadRing the hops are performed by a single token and are effectively serialized. Because the actual processing done by the actors between receiving a token and forwarding it on is marginal (a decrement and a branch), the actors holding tokens are effectively all trying to send their token messages at the same time. The fact that the 50 million messages take only just over twice as long when performed 'in parallel' in this fashion suggests that the overheads of massive contention for shared resources are not too severe.
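For concreteness, the per-hop work in both ring benchmarks amounts to the handler sketched below: receive the token, decrement it, and forward it to the next actor in the ring. This is a simplified illustration assuming the Theron 3.x actor API (handlers registered with RegisterHandler and taking a message plus a Theron::Address); the actual benchmark sources in the distribution differ in detail, notably in how completion is signalled.

    #include <Theron/Actor.h>

    // Simplified ring member, showing only the per-hop work measured by the benchmark.
    class RingMember : public Theron::Actor
    {
    public:
        inline RingMember()
        {
            RegisterHandler(this, &RingMember::SetNext);
            RegisterHandler(this, &RingMember::Token);
        }

    private:
        // Setup: receive the address of the next actor in the ring.
        inline void SetNext(const Theron::Address &next, const Theron::Address /*from*/)
        {
            mNext = next;
        }

        // Per-hop work: a decrement and a branch, then forward the token.
        inline void Token(const int &token, const Theron::Address /*from*/)
        {
            if (token > 0)
            {
                Send(token - 1, mNext);
            }
            // In the real benchmarks the final hop signals completion to a receiver;
            // that detail is omitted here.
        }

        Theron::Address mNext;
    };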

Since ParallelThreadRing performs the 50 million hops in 503 parallel batches, we might naively expect it to be 503 times faster than ThreadRing. But in reality it's all overheads and no parallelism, since the actors do no real processing and instead spend all their time sending messages.

Conclusion

The results confirm that Theron is among the fastest available Actor Model implementations on any platform.

The 3.03 release gives a significant speedup compared to the 3.00 release, with a reduction of raw message handling overheads of around 20%. Added to this are significant speedups resulting from Theron's message type registration mechanism (an advanced feature added in version 2.06) and support for 64-bit builds introduced in 3.03.

The peak throughput of around 10 million messages per second in ThreadRing, and the round-trip latency of around 200 nanoseconds in PingPong, reflect raw performance where message queues are short, actors have only one message handler registered, and only one actor is ever sending a message at any time. The results for the other benchmarks give an indication of additional overheads due to contention and memory allocation. They suggest that performance degrades acceptably with parallel message sending among a large number of actors.