🥊 Parallel Transducing Context Fight Night: |>> (pipeline) vs =>> (fold)

Just tried it on a 64-core Linode box. Basically the same graph. A 26x speedup, which on 64 cores works out to exactly the 40% core-efficiency ratio (26/64 ≈ 0.41).

Interesting results. This is looking more familiar. Curious whether the ~60% overhead holds in general or is peculiar to this workload. There is also a bevy of JVM options that might be worth exploring.
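
For anyone who wants to poke at that, here is a sketch of a deps.edn alias with a few HotSpot flags that plausibly matter for this kind of parallel workload. The alias name and the values are just illustrative starting points to sweep, not tuned recommendations:

```clojure
;; deps.edn — hypothetical :bench alias; values are starting points, not tuned settings
{:aliases
 {:bench
  {:jvm-opts ["-XX:+UseParallelGC"          ; try different collectors
              "-XX:ActiveProcessorCount=32" ; hide the HT siblings from the JVM
              "-Xms4g" "-Xmx4g"]}}}         ; fixed heap, rules out resizing noise
```

Run the benchmark with `clj -M:bench ...` under each variant and diff the timings.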

The HT/physical core split is also something to consider (the distinction is somewhat opaque to the JVM; I think the physical topology has to be derived from the OS).
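
As a quick sanity check (standard JDK API, nothing specific to this library), the JVM itself only reports logical CPUs:

```clojure
;; Reports *logical* CPUs, HT threads included:
(.availableProcessors (Runtime/getRuntime))
;; => 64 on a "64 core" box, regardless of how many are HT siblings

;; Physical topology has to come from the OS, e.g. on Linux:
;;   $ lscpu | grep -E 'Thread|Core'
```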


This is actually something I know about.
While HT presents itself to the OS as two processors, it isn't really two processors. What it gives you is the option to parallelize some operations that have no data dependency and don't use the same execution units. So you can probably dispatch two floating-point operations in parallel, or an integer operation while other things are going on.
The interesting part is here:

[image: execution-engine block diagram of the core]

Look at the execution engine: operations can be dispatched in parallel when they can be assigned to different ports.
After that, you should use a tool like JITWatch to inspect the generated assembly; you're at a level where you want to understand what's actually going on.
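
If it helps, these are the usual HotSpot flags for producing a log JITWatch can read (the alias name is made up; `-XX:+PrintAssembly` additionally needs the hsdis disassembler installed):

```clojure
;; deps.edn — hypothetical :jit alias for a JITWatch-readable log
{:aliases
 {:jit
  {:jvm-opts ["-XX:+UnlockDiagnosticVMOptions"
              "-XX:+TraceClassLoading"
              "-XX:+LogCompilation"    ; emits hotspot_pid<N>.log for JITWatch
              "-XX:+PrintAssembly"]}}} ; needs the hsdis library installed
```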
The way the results look to me, there isn't a 60% overhead. This code is just only able to utilize one of the two threads in each HT pair (a model some have called bunk in the past), and the real overhead is more like 5%.
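
One way to test that hypothesis without touching JVM flags (a sketch; `work` is a placeholder for the real per-item job, and the sizes are arbitrary): run the same jobs on a 32-thread and a 64-thread fixed pool. If the wall times are close, the second HT thread per core isn't buying anything.

```clojure
(ns ht-check
  (:import (java.util.concurrent Callable Executors ExecutorService Future)))

;; Placeholder CPU-bound job standing in for the real workload
(defn work [x]
  (reduce + (map #(Math/sin (* x %)) (range 10000))))

(defn timed-run
  "Runs (work item) for every item on a fixed pool of n threads,
   returning wall time in seconds."
  [n items]
  (let [pool (Executors/newFixedThreadPool n)
        t0   (System/nanoTime)
        futs (doall (map (fn [x]
                           (.submit ^ExecutorService pool
                                    ^Callable (fn [] (work x))))
                         items))]
    (run! (fn [^Future f] (.get f)) futs)  ; wait for all jobs
    (.shutdown pool)
    (/ (- (System/nanoTime) t0) 1e9)))

(comment
  (timed-run 32 (range 512))   ; one thread per physical core
  (timed-run 64 (range 512)))  ; one thread per logical core
```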


@John_Newman try testing on an AMD or ARM processor; the comparison should be interesting.


I'd also raise the JVM's maximum heap. If you do more work in parallel, it seems normal to me that more heap would be needed, so if the default maximum is kept (typically a quarter of physical RAM), maybe that bottlenecks things.
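
For example (the sizes are made up; the point is just to set the maximum explicitly instead of inheriting the default):

```clojure
;; deps.edn — hypothetical :big-heap alias; 8g is an arbitrary example
{:aliases
 {:big-heap
  {:jvm-opts ["-Xms8g"    ; start at the max to avoid resize pauses
              "-Xmx8g"]}}}
```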
