Is this an efficient way of handling concurrency?

Your disk can read data into memory without the CPU being involved, up to some buffer size. So while the disk is reading a chunk into memory, the CPU is free to do other things. And newer types of disks, like SSDs, can actually read different parts of the disk in parallel, which is one reason SSDs are so much faster than traditional drives. To do that, SSDs come with their own micro-controller that handles it, again freeing the CPU to do other things.

So in general, parallel reads from disk should not be faster, but with SSDs they often are to some extent. This can even be true of HDDs. There are a few reasons I know of:

  1. It might be that you're not saturating disk IO. If, for example, you read a chunk, go process it, and only then read the next chunk, then while you were processing the first chunk you were not reading from the disk at all, so you wasted time you could have spent reading. If you parallelize your IO requests instead, then while the CPU is processing the first chunk, the disk can already be reading the next one, so as soon as the CPU is done the next chunk is ready for it, and so on (see the sketch after this list).

  2. Some OS and hardware combinations can be smarter about IO scheduling. If you tell the scheduler your next 5 IO requests up front, it might realize it is faster to serve them in a different order, based on where the data sits on the disk. But if you submit one request, go process it, then submit the next, and so on, it never gets a chance to come up with a better ordering.

  3. Some OS and hardware combinations actually support parallel reads. SSDs, for example, can read from multiple places in parallel. So again, if the scheduler and IO controller know about the whole set of requests you want, they can decide to actually run some of them in parallel.
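
To make point 1 concrete, here is a minimal sketch (not your actual code) of overlapping reads with processing by starting each read on a future; while the CPU is busy in `process-chunk`, the remaining reads keep going in the background. `process-chunk` is a hypothetical stand-in for whatever per-chunk work you do.

```clojure
(require '[clojure.java.io :as io])

;; Hypothetical stand-in for the per-chunk CPU work (parsing, etc.).
(defn process-chunk [lines]
  (count lines))

(defn process-files [paths]
  ;; Start each read on a future so the reads proceed while the CPU
  ;; is still processing earlier chunks.
  (let [reads (mapv (fn [path]
                      (future
                        (with-open [rdr (io/reader path)]
                          ;; doall realizes the lines before the reader closes.
                          (doall (line-seq rdr)))))
                    paths)]
    ;; Deref in order: while we process chunk n, later chunks are
    ;; already being read in the background.
    (mapv (comp process-chunk deref) reads)))
```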

EDIT: By the way, in your case, I don't think the disk read is the slowdown, since you mentioned things that take multiple seconds. On my machine, I can read 20 files of 50k lines of random numbers each in 50ms. I suspect the time is spent parsing each line, preparing the query, connecting to the DB, and in the network round trip for the DB to receive the request and answer, and not really in the read from disk. That said, keep in mind the network behaves somewhat like disk IO: your network card can send and receive packets from/into buffers without the CPU being involved, and obviously the DB can process things while your machine's CPU does other things as well. So there's lots of parallelization going on here between CPU and IO.
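
If you want to check that on your own machine, a rough way to time just the raw reads (no parsing, no DB work), assuming hypothetical file names chunk-0.txt through chunk-19.txt:

```clojure
(require '[clojure.java.io :as io])

;; Time only the file reads, assuming files chunk-0.txt .. chunk-19.txt exist.
(time
 (doseq [i (range 20)]
   (with-open [rdr (io/reader (str "chunk-" i ".txt"))]
     ;; dorun forces the lazy seq so the whole file is actually read.
     (dorun (line-seq rdr)))))
```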

The client library I use provides 900 connections to the target (database) over which to send the IO commands. I have 1 million IO operations to perform, in groups of 50,000 each. So basically there are 20 chunks of IO operations to perform, where each chunk contains 50,000 operations.
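
For illustration, that chunking could be expressed as something like the following (the operations themselves are just stand-ins here):

```clojure
;; Stand-in for the 1 million pending IO operations.
(def all-ops (range 1000000))

;; 20 chunks of 50,000 operations each.
(def chunks (partition-all 50000 all-ops))

(count chunks)         ;=> 20
(count (first chunks)) ;=> 50000
```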

I am uncertain what bearing the connection limit has on the operation. I assumed it was some form of write constraint, which manifested in my answer. Your actual implementation doesn't seem to care about connection limits though. Is this just noise, or is it a legitimate constraint (say "at scale")?

I need to process these 20 chunks concurrently.

This is more commonly viewed as working on 20 chunks (or even sub-batches of the 20) in parallel. As with the Rob Pike talk I linked, the broader notion of concurrency is amenable to parallelization.

In programming, concurrency is the composition of independently executing processes, while parallelism is the simultaneous execution of (possibly related) computations. Concurrency is about dealing with lots of things at once. Parallelism is about doing lots of things at once.
Rob Pike

By virtue of the fact that you are returning futures, implicitly creating your own thread pool of size 20, those independent threads will be distributed over however many physical cores you have (ideally 20). So work can be executed simultaneously, which I think is the point of your post. Since all of the work is seemingly independent, there really appears to be no concurrency concern (e.g. concurrent read/write of shared resources) outside of the writes to the database.
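
As a rough sketch of that futures-per-chunk shape (not your actual code; `write-chunk!` is a hypothetical stand-in for parsing a chunk's lines and sending them over a client connection):

```clojure
;; Hypothetical stand-in for "parse the chunk's lines and write them
;; to the database over one of the client connections".
(defn write-chunk! [chunk]
  (count chunk))

;; One future per chunk: with 20 chunks this effectively gives you 20
;; worker threads, spread across the available cores.
(defn write-all! [chunks]
  (let [futs (mapv #(future (write-chunk! %)) chunks)]
    ;; Block until every chunk has been written, keeping the results.
    (mapv deref futs)))
```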

Part of my confusion is the usage of "concurrency" with apparently parallel operational semantics. I "think" the goal here is to distribute all these operations across your resources in order to do the total work in the least amount of time possible (possibly respecting some global constraint like the connection limit). The only coordination required by any of the processing tasks seems to be communication with the database, since there is no synchronization of state or inter-process communication via channels or software transactional memory. You seem to have arbitrarily chosen to chunk the work into 50K batches. These batches are processed independently (and operationally they will be executed simultaneously on a multicore machine, at least the reading of lines and the writing to the database).

The approach I took with pipeline abstracts the chunking away. All the work is combined into a lazy seq of concatenated lines, which are submitted to a buffered channel. That channel is then processed via pipeline's semantics, in which the work is done in parallel batches (which you can parameterize, say, according to your system resources). I have 1 million pending lines to write and, say, 4 cores to distribute the work over (assuming 1 thread per core), so I spin up a pipeline with 4 threads and let it process "the work" in batches of 4. Maybe this is suboptimal and we're better off building up much larger database writes. We can still modify the processing to include similar chunking and batch-writing semantics without much problem; we could, say, read 50K lines and process that as a bulk insert to the database (that's an exercise for the reader at the moment).
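
In outline, the shape of that pipeline approach looks something like the sketch below. `write-line!` is a hypothetical stand-in for the per-line database write; the real version would go through your client library.

```clojure
(require '[clojure.core.async :as a]
         '[clojure.java.io :as io])

;; Hypothetical stand-in for "send one parsed line to the database".
(defn write-line! [line]
  (count line))

(defn run-pipeline [paths n]
  (let [lines-ch   (a/chan 1000)      ; buffered channel of pending lines
        results-ch (a/chan 1000)]
    ;; Feed every line of every file onto the channel from a background thread.
    (a/thread
      (doseq [path paths]
        (with-open [rdr (io/reader path)]
          (doseq [line (line-seq rdr)]
            (a/>!! lines-ch line))))
      (a/close! lines-ch))
    ;; Process the lines with n blocking workers (DB calls are blocking IO).
    (a/pipeline-blocking n results-ch (map write-line!) lines-ch)
    ;; Drain the results; results-ch closes once lines-ch is exhausted.
    (loop [done 0]
      (if (some? (a/<!! results-ch))
        (recur (inc done))
        done))))
```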

I think at the end of the day, you are trying to read in parallel (reading lines still takes some CPU) and write concurrently (hopefully in parallel, if the database can handle it) where possible. If things are tuned right, you can match the impedance of the database to the rate at which your system can process lines, and nice parallelism is achieved. In the worst case, some operations may lag while the overall work proceeds, though.

There is also mmap, which would allow further efficiency for parallel reads (as the iota library uses).
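
For example, a rough sketch with iota, assuming its documented API where `iota/vec` memory-maps the file and cooperates with reducers' `fold` for parallel processing:

```clojure
(require '[iota]
         '[clojure.core.reducers :as r])

;; Sum the line lengths of a file using a memory-mapped, fold-able vector.
(defn sum-of-line-lengths [path]
  (->> (iota/vec path)
       (r/filter identity)   ; iota represents blank lines as nil
       (r/map count)
       (r/fold +)))
```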

I'm not familiar with Aerospike, either, but to echo what @didibus is saying, most DBs allow you to make oodles of connections because the cost of connection setup is high and you don't want to pay that every time you read/write anything. The actual DB engine is typically parallelized on the order of the number of cores in the system. That's because concurrency (not parallelism) has a switching cost and they are going for maximum overall throughput. So, while they might expect hundreds or even thousands of connections, they expect most of those connections to be idle at any given time. Any number of concurrent requests beyond the core count will then be queued and processed as soon as possible.

Your task is to figure out what the real PARALLELISM of your system is and match your input load to that. If you go above it, it might seem like you're doing a lot "in parallel," but really all you're doing is adding queuing/dequeuing overhead, which will slow you down. I suspect that if you throttled things down on the sending side to the order of 20-40 threads and just had those push data through twice that number of connections (say 40-80), you'd probably see higher overall throughput. You'll obviously have to tune those numbers through some experimentation. I would perhaps hand one file to each thread and have those threads operate independently, pushing data through 2-10 connections each. Again, your goal is to keep the DB saturated, but not so overloaded that it's having to queue a lot of transactions.
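
One way to throttle the sending side like that is a fixed-size thread pool with one file per task; a minimal sketch (`load-file-into-db!` is a hypothetical placeholder for reading a file and pushing it through a couple of connections):

```clojure
(import '(java.util.concurrent Executors Callable))

;; Hypothetical placeholder for "read this file and write its lines
;; to the database through a couple of client connections".
(defn load-file-into-db! [path]
  (println "loading" path))

(defn load-all! [paths pool-size]
  ;; At most pool-size files are being pushed at the DB at any moment.
  (let [pool (Executors/newFixedThreadPool pool-size)]
    (try
      (->> paths
           (mapv (fn [path]
                   (.submit pool ^Callable (fn [] (load-file-into-db! path)))))
           ;; Block until every task has finished.
           (mapv #(.get %)))
      (finally
        (.shutdown pool)))))
```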

Remember, your peak throughput will be JUST BELOW saturation. Once you go beyond that, in the best case everything handles the additional load well and uses good load-management techniques (queuing, flow control, etc.), but that's not a given, and once you invoke these mechanisms your DB is doing more work to manage the load and not putting every CPU cycle into doing "DB stuff."
