I’m trying to maximize the performance of a web application I’m building with Clojure. I’ve done some reading and applied a couple of standard optimizations, but I want to be sure I’m not overlooking anything important. Here are the details of my setup and what I’ve tried so far:
- Clojure with Ring and Compojure for routing; the application is hosted on a cloud server with sufficient resources.
- PostgreSQL via JDBC; I’ve indexed my database tables and optimized queries as much as possible.
- Caching with Ehcache to store frequently accessed data.
- core.async for handling some of the background tasks.
I also checked this thread: https://clojureverse.org/t/clojure-web-server-performance but did not find a solution there. Despite these efforts, I am still occasionally seeing latency problems, especially during periods of high traffic. I’m looking for best practices specific to Clojure web applications, or more advanced optimization strategies. Some specific questions:
1. What are some effective approaches to profiling and identifying bottlenecks in a Clojure application?
2. Are there any recommended libraries or tools for monitoring and performance analysis?
3. How can I make better use of concurrency to increase throughput?
4. Are there any common pitfalls or anti-patterns to avoid when building Clojure web applications?
Any information or resources you could provide would be greatly appreciated. I’m especially interested in learning about real-world experiences and solutions that have worked for you.
With all due diligence, that should be enough to create a setup where the issue is more or less easily reproducible. Assuming the issue isn’t caused by a sporadic lack of CPU or other resources (e.g. from some background job) and instead lies within the Clojure app itself, you should be able to find it with the information on this website: https://clojure-goes-fast.com/.
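For instance, criterium (one of the tools covered on that site) gives you statistically sound timings of a suspect function straight from the REPL. A minimal sketch, where the function being measured is just a stand-in for your own code:

```clojure
;; requires the criterium dependency, e.g. criterium/criterium {:mvn/version "0.4.6"}
(require '[criterium.core :as crit])

(defn candidate-fn []
  ;; stand-in for the code path you suspect is slow
  (reduce + (map inc (range 100000))))

;; quick-bench warms up the JIT, runs many iterations, and reports
;; the mean execution time with error bounds
(crit/quick-bench (candidate-fn))
```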
I agree with p-himik’s comment that it is crucial to identify, in production, which area of your code is slower, and possibly which inputs make it slower.
I’ve highlighted “in production” because the long tail can look quite different in your offline test bench.
To better analyse this sort of issue I’ve developed a library called µ/log, which can be used for structured logging. In particular, it provides a macro called µ/trace for tracing key parts of your code.
The key difference between µ/log and other tracing libraries is that µ/log is designed to capture contextual data, for example your request parameters. This is fundamentally important for understanding which inputs produce slower responses.
Once you instrument your code with µ/trace, it should be much easier to identify and reproduce the code paths which are slower than your expected SLA.
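For illustration, here is a minimal sketch of instrumenting a Ring handler; the handler, the parameter keys, and the database call are placeholder assumptions, while the µ/log calls follow the library’s documented API:

```clojure
;; requires the com.brunobonacci/mulog dependency
(require '[com.brunobonacci.mulog :as u])

;; send events to the console; in production you'd use a publisher
;; such as :elasticsearch or :kafka instead
(u/start-publisher! {:type :console})

(defn fetch-orders-from-db [request]
  ;; placeholder for the real database query
  (Thread/sleep 50)
  {:status 200 :body "[]"})

(defn get-orders [request]
  ;; µ/trace times the body, records its outcome, and attaches the
  ;; given key/value pairs, so slow requests can be correlated with
  ;; the inputs that caused them
  (u/trace ::get-orders
    [:user-id (get-in request [:params :user-id])
     :page    (get-in request [:params :page])]
    (fetch-orders-from-db request)))
```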
Then, once you have a general idea of which key action in your system is slowing down, clj-async-profiler should provide enough detail to show exactly where the extra time goes.
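A minimal sketch of that step, assuming the com.clojure-goes-fast/clj-async-profiler dependency and the -Djdk.attach.allowAttachSelf JVM flag (required on JDK 11+); the workload function here is a stand-in for your real hot path:

```clojure
(require '[clj-async-profiler.core :as prof])

(defn simulated-work []
  ;; stand-in for the slow code path identified via µ/trace
  (reduce + (map #(Math/sqrt (double %)) (range 100000))))

;; prof/profile runs the body under the profiler and writes a
;; flamegraph under /tmp/clj-async-profiler/results/
(prof/profile
  (dotimes [_ 100] (simulated-work)))

;; recent versions can serve the generated flamegraphs over HTTP
;; (the port is arbitrary)
(prof/serve-ui 8080)
```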
A few years ago I wrote a few scripts to compare the performance of web servers in a specific scenario (see these benchmarks); the work was done to analyse the long tail in a production web service.
Although most of the scripts could be out of date now, I would recommend two tools for offline benchmarking:
- wrk2: by far the best tool for benchmarking the long tail (up to six nines, i.e. the 99.9999th percentile).
- gatling.io: has really useful charts for understanding how many RPS (requests per second) your web server can withstand before you need to scale out.
These tools are most useful for ensuring there is no regression over time; for example, you could add a load test directly into your CI/CD pipeline.
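wrk2 and Gatling are the right tools for serious load tests, but as an illustration, here is a rough sketch of a coarse latency smoke test in plain Clojure (using the JDK 11+ HttpClient, no extra dependencies) that could gate a CI build; the URL, request count, and SLA threshold are all assumptions:

```clojure
(import '(java.net URI)
        '(java.net.http HttpClient HttpRequest HttpResponse$BodyHandlers))

(defn measure-request-ms
  "Sends one request and returns its wall-clock latency in milliseconds."
  [^HttpClient client ^HttpRequest req]
  (let [t0 (System/nanoTime)]
    (.send client req (HttpResponse$BodyHandlers/discarding))
    (/ (- (System/nanoTime) t0) 1e6)))

(defn percentile [sorted-ms p]
  (nth sorted-ms (min (dec (count sorted-ms))
                      (int (* p (count sorted-ms))))))

(defn smoke-test
  "Fires n concurrent requests at url and returns p50/p99 latency."
  [url n]
  (let [client (HttpClient/newHttpClient)
        req    (.build (HttpRequest/newBuilder (URI/create url)))
        times  (->> (repeatedly n #(future (measure-request-ms client req)))
                    doall          ; launch all requests
                    (map deref)    ; wait for completion
                    sort
                    vec)]
    {:p50 (percentile times 0.50)
     :p99 (percentile times 0.99)}))

;; e.g. fail the CI build when p99 exceeds a 200 ms SLA:
;; (assert (< (:p99 (smoke-test "http://localhost:3000/health" 1000)) 200.0))
```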
If you need more help, or want to chat about how to interpret the benchmark results, I’ll be happy to help.