This might be more of a JVM question than a Clojure question, but I hope you won’t mind if I ask it here.
I am aware of Amdahl’s law but I don’t know how to apply it.
I am looking for some intuitions or rough rules of thumb.
I am at a company that uses Hubspot for marketing. We also use many other 3rd parties:

- Luma for events
- Calendly for leadership meetings
- Stripe for payment processing
- Swarm for lead discovery
- Mailgun for email
- plus a few others
I wrote a Clojure app that uses the APIs of those 3rd-party services to pull in all their data and store it in a central database. (I am using MongoDB, in part to cope with the many divergent 3rd-party schemas that I only partly interact with – it’s not worth my time to fully map those schemas.)
I also need to find clues about our users and then push those clues to Hubspot. For example, suppose a person has the email tim@example.com, and I find that tim@example.com also appears in our Luma, Calendly, and Stripe data. We want to gather up that data and push it to the Hubspot Contact that we maintain for tim@example.com. One question, for instance, is “Where did tim@example.com first appear in our system?” To answer it, I have to look at every date in Luma, Calendly, and Stripe where tim@example.com appears, and find the earliest one.
In other words, there are a lot of background processes, each trying to find some data about our users, so we can aggregate that data and push it to the appropriate Hubspot Contact.
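To make the “first appeared” question concrete, the logic boils down to something like this sketch. (`dates-for-email` is a hypothetical helper that queries one source’s Mongo collection and returns the `java.util.Date` values where the email appears; it is not real code from my app.)

```clojure
;; Sketch only. `dates-for-email` is a hypothetical helper that queries
;; one source's Mongo collection for an email and returns java.util.Dates.
(defn first-seen
  "Earliest date at which `email` appears across our 3rd-party sources."
  [email]
  (let [dates (mapcat #(dates-for-email % email)
                      [:luma :calendly :stripe])]
    (when (seq dates)
      (apply min-key #(.getTime ^java.util.Date %) dates))))
```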
To run things in the background, I’ve been using a thread pool, and for scheduling I rely on the at-at library:
In my core.clj, in my main function, I set up a thread pool and initiate all the background tasks. I pass the same thread pool to each task so they can use it to schedule further tasks.
At first I had everything start simultaneously, but the server began to suffer, so I now start each task at a random time after startup (a random delay of up to 5 minutes), which spreads out some of the initial load.
Most of these tasks run only once every 6 hours, and most take only 5 to 20 minutes, but when they all run at the same time they strain the server.
(let [tp (at/mk-pool)]
(log/initiate tp)
(world/initiate tp)
(at/at (+ (long (rand-int 300000)) (at/now)) #(reports/initiate tp) tp)
(at/at (+ (long (rand-int 300000)) (at/now)) #(push/initiate tp) tp)
(at/at (+ (long (rand-int 300000)) (at/now)) #(delete-old-data/initiate tp) tp)
(at/at (+ (long (rand-int 300000)) (at/now)) #(algorithm/initiate tp) tp)
(at/at (+ (long (rand-int 300000)) (at/now)) #(mailgun/initiate tp) tp)
(at/at (+ (long (rand-int 300000)) (at/now)) #(stripe/initiate tp) tp)
(at/at (+ (long (rand-int 300000)) (at/now)) #(luma/initiate tp) tp)
(at/at (+ (long (rand-int 300000)) (at/now)) #(hubspot/initiate tp) tp)
(at/at (+ (long (rand-int 300000)) (at/now)) #(supabase/initiate tp) tp)
(at/at (+ (long (rand-int 300000)) (at/now)) #(calendly/initiate tp) tp)
(at/at (+ (long (rand-int 300000)) (at/now)) #(user/initiate tp) tp)
(at/at (+ (long (rand-int 300000)) (at/now)) #(judge/initiate tp) tp))
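An alternative I have been considering is to replace the random jitter with fixed, evenly spaced offsets, so the startup load is spread deterministically rather than by chance. This is only a sketch of the same setup against the same at-at calls, with an arbitrary 30-second spacing:

```clojure
;; Sketch: spread task starts evenly instead of randomly.
;; Each task begins 30 seconds after the previous one (spacing is arbitrary).
(let [tp    (at/mk-pool)
      tasks [reports/initiate push/initiate delete-old-data/initiate
             algorithm/initiate mailgun/initiate stripe/initiate
             luma/initiate hubspot/initiate supabase/initiate
             calendly/initiate user/initiate judge/initiate]]
  (log/initiate tp)
  (world/initiate tp)
  (doseq [[i task] (map-indexed vector tasks)]
    (at/at (+ (* i 30000) (at/now)) #(task tp) tp)))
```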
I am currently running this on an EC2 instance at AWS. The instance has 8 CPUs and 32 gigs of RAM.
I run htop to see how much load the server is under. Since I have 8 CPUs, I figure any load average under 8 is fine – though I gather the Linux load average counts not only runnable threads but also threads in uninterruptible I/O wait, so for I/O-heavy work like mine I’m not sure load alone tells me whether the CPUs are actually saturated.
But lately I have added some new tasks, and now the load hits 10 when all the background tasks run at once.
This server is not memory constrained. Of the 32 gigs of RAM, the most I think I’ve ever seen in use is 7 gigs, and that is rare. Even with every task running simultaneously, RAM in use is usually only 5 or 6 gigs. But the load goes up to 10, as I said.
I am trying to think about how to use the JVM scheduler to both speed things up and spread out the load.
At one point I started wrapping some database updates in their own functions and pushing them to the thread pool, and this gave me a significant speedup:
(at/at (+ 100 (at/now))
       #(world/create (merge item {:item-is-imported "yes"
                                   :imported-via-api-from "hubspot"
                                   :item-type item-type
                                   :hubspot-id hubspot-id})
                      :hubspot-id)
       tp)
But now I am wondering whether doing this also increases the load on the server.
I’m assuming that when a thread is sleeping it imposes no burden on the server.
I also assume that feeding small functions to the thread pool allows the JVM scheduler to efficiently spread work to all of the CPUs. (Said differently, the JVM scheduler would not be able to efficiently spread large tasks, involving thousands of database calls, to the different CPUs, unless I first break up those large tasks into small tasks.)
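Here is roughly what I mean by breaking a large task into small ones. Instead of one function looping over thousands of records, each record’s update becomes its own task on the pool. (A sketch only; `update-record!` is a hypothetical per-record database call, not real code from my app.)

```clojure
;; Sketch: one big task doing thousands of DB calls in sequence...
(defn import-all-monolithic [records]
  (doseq [r records]
    (update-record! r)))

;; ...versus one small task per record, so the pool can spread the
;; work across CPUs. `update-record!` is a hypothetical DB call.
(defn import-all-fine-grained [tp records]
  (doseq [r records]
    (at/at (+ 100 (at/now)) #(update-record! r) tp)))
```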
But I’m also wondering whether dividing the work into many small tasks lets the JVM put too much pressure on the server.
There is very little happening on this server, other than this one app that I’m running, and it is not public, so I have almost total control regarding how fast the tasks should be fed to the server.
By default, the at-at library from Overtone creates a thread pool with the thread count set to the number of CPUs plus two. I have accepted that default. Am I correct that there would be less strain on the server if I set the thread count to exactly the number of CPUs?
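If I understand the at-at API correctly, `mk-pool` accepts a `:cpu-count` option, so pinning the pool to exactly 8 threads would look something like this (I haven’t verified this option myself):

```clojure
;; Sketch, assuming mk-pool accepts a :cpu-count option
;; (I believe it does, but I haven't verified).
(def tp (at/mk-pool :cpu-count 8)) ; exactly one thread per CPU
```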
So, with all that as background, I am curious about:
- How do I control the pace of work?
- How do I determine which tasks put the greatest strain on the server?
- Are there any clever “emergency brakes” I can implement to keep the server from being overwhelmed?
- When can I improve performance by breaking a task into smaller, more fine-grained tasks that are fed to the thread pool independently of one another? (And when do I hit the limits described by Amdahl’s law?)
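To make the “emergency brake” question concrete, the kind of thing I have in mind is a global `java.util.concurrent.Semaphore` that every task acquires before doing heavy work, capping how many can run at once. This is only a sketch of the idea, not something I have tried; the permit count of 4 is arbitrary:

```clojure
(import 'java.util.concurrent.Semaphore)

;; Sketch: cap heavy work at 4 concurrent tasks, no matter how many
;; are scheduled. Untried; the permit count is arbitrary.
(defonce heavy-work-permits (Semaphore. 4))

(defn with-brake
  "Run `f`, but only after acquiring a permit; blocks if 4 are busy."
  [f]
  (.acquire heavy-work-permits)
  (try
    (f)
    (finally
      (.release heavy-work-permits))))

;; Usage: wrap a task's body before handing it to the pool, e.g.
;; (at/at (+ 100 (at/now)) #(with-brake do-heavy-thing) tp)
```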