Strategy/tactics to minimize unresponsive Ring handlers

Hello folks,

I am running a multi-tenant SaaS webapp. Now that the user base grows, it can occasionally happen that it becomes unresponsive; sometimes it hits an out of memory error, sometimes it seems like the threadpool is congested or something.

Are there general configuration strategies or tactics (or middleware?) to prevent this?

In my case, I run http-kit (fronted with nginx as reverse proxy), and I use compojure to create my endpoints.

It sometimes happens that when a certain handler that does a lot of IO (ie. generate a large PDF file from a lot of database queries) is hit in a bulk, and then the whole app becomes slow.

Is this something to improve in the code/app level? Or should it be more of an infrastructure concern?

My idea was that optimizing first on app level would make the most sense, and within that category, starting with like app-wide optimizations (rather than optimizing specific ā€˜notoriousā€™ endpoints)

Interested to hear your ideas!

Why is your web server building PDFs (or anything) from ā€œa lot of database queriesā€? That kind of work is like digging a ditch. If you run a ditch digging company, and the phone rings and someone orders a ditch, you do not put them on hold and run out and dig a ditch for them while other callers wait. Instead, you queue the job and let your ditch-diggers do it (with entirely separate resources) while you take other calls. The miracle of Javascript can give your client the impression that you are digging their ditch while in fact you have queued the job and hung up on them.

P.S. But by leaping to that question and ā€œgreat ideaā€ we have already been precipitous. Stuart Halloway gave a conference talk about methodical troubleshooting that you may watch for inspiration.

2 Likes

Yeah good analogy. Such a question of course induces such a response.

But thereā€™ll be challenges and complexities to work with queues, and other async things as well.

So sure, I can make tradeoffs like that on a per case basis, but I was wondering about whether thereā€™d be more global things to be done.

Via libs, middleware or other.

There are no magic/silver bullets to avoid that sort of thing ā€“ as Phill says, you need to take extra care around requests that do a lot of I/O and/or database requests and think about how to write those in a scalable way from first principles.

1 Like

@Kah0ona, itā€™s also perfectly acceptable to inform your users that your webserver is at capacity, and thinking in terms of ā€œbackpressureā€ might be respectful to the users that have existing workloads in progress. Not sure if http-kit has a rejected handler, but aleph (netty) does ā€“ you can use that to limit the total number of concurrent threads and render a message or page accordingly. Also, like @Phill suggested, moving the expensive work to an alternate timeline, even polling a job queue table could solve some parts of the problem

1 Like

We had the same problem. Ring handlers performed longer running tasks, all 4 httpkit threads were occupied, the health checks failed, Kubernetes started to restart web servers, uptime alerts got triggered.

Solved it by using virtual threads (JEP 425: Virtual Threads (Preview)). It still in preview mode and part of JDK 19 that is probably GA in September. However is running fine for several weeks now :sweat_smile:

I thought long about this problem and if I should use an early access JDK version in production. But virtual threads are the only solution that I found that solves this problem at the right level. It is transparent to the programmer and it does not add accidental complexity like async code.

You can pass the :worker-pool option to httpkit and add (Executors/newVirtualThreadPerTaskExecutor) as value.

2 Likes

ah thanks for this! So for you it was ā€˜justā€™ a matter of adding that configuration to httpkit, and then run the latest JDK on the server?

Yes, adding this to the httpkit configuration was the only change in the source code. However, adding it to the build process, deployment scripts, and development environment took much longer.

Basically, you need to add the --enable-preview flag to all places where you start the JVM. Here are the relevant pull request parts for our system:

We run our system in Google Kubernetes Engine, so I added the --enable-preview to the Docker entrypoint. For the development environment, I added it to the projectā€™s deps.edn file and the .clojure/deps.edn file, or rather the alias that is used to start the nrepl.

2 Likes

This topic was automatically closed 182 days after the last reply. New replies are no longer allowed.