That’s an interesting dimension you’re adding that I think isn’t discussed enough.
For example, I brought up the point that for a traditional backend service, a type error in a strict language will throw an error and crash, or at least kick off your alarms quite quickly. That means this type of error tends to get caught early: most often at development time as you call your functions in the REPL; if not, then during testing when you run your unit or integration tests; if not, then while you’re baking your changes in staging environments; and if not, then on a one-box or some other staged rollout to production. If it escaped all that, maybe because the code path that triggers it is rarely used and no test covered it, then at least once your alarm fires for it, the fix is very quick most of the time.
Since static type checking primarily prevents this type of bug, I think, like you said, it matters to evaluate the impact and occurrence rate of such bugs for your use case.
For me personally, as I said, I feel a backend service will very rarely have such a bug escape all the way to prod. If it does, it’s most likely because it needed a rarely taken code path, which is why it wasn’t caught during your entire QA, where you ran the service many times. So it’s not a common production bug. And when it does escape to prod, a backend service will generally catch the runtime type error, fail the request, log the error, and publish a failure metric, which will cut an issue to the team so the on-call can quickly debug the cause from the log and make a quick patch. That makes it relatively low impact as well.
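To make that failure path concrete, here is a rough sketch of what I mean. The names `wrap-failures`, `total-handler`, and the `failures` atom are made up for illustration, and the atom just stands in for whatever logging and metrics setup a real service would have:

```clojure
;; Hypothetical middleware: catch any exception from the handler,
;; record it (stand-in for "log + publish a failure metric"),
;; and fail only this one request with a 500.
(defn wrap-failures [handler record-failure!]
  (fn [request]
    (try
      (handler request)
      (catch Exception e
        (record-failure! {:uri   (:uri request)
                          :error (.getMessage e)})
        {:status 500 :body "Internal Server Error"}))))

;; Hypothetical handler with a type bug on a rarely taken path:
;; if :quantity arrives as a string, + throws a ClassCastException.
(defn total-handler [request]
  {:status 200 :body (str (+ 1 (:quantity request)))})

;; Stand-in for the log/metric sink the on-call would look at.
(def failures (atom []))

(def app (wrap-failures total-handler #(swap! failures conj %)))

(app {:uri "/total" :quantity 2})   ; normal path, status 200
(app {:uri "/total" :quantity "2"}) ; type error, status 500, recorded
```

The type error is loud here: the request fails, something gets recorded, and the service keeps running for everyone else.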
Now does this occur a lot during QA? If every code change going through the CI broke the pipeline, broke the build, etc., it would also cause development pain and slow the team down. But personally that hasn’t happened for us either. Honestly, type errors of that sort are most often caught in the REPL during coding time, or in a local dev environment when devs run unit or integ tests from their machine.
Now I’d be curious: what other types of bugs would a type checker catch, if any?
I think using closed ADTs instead of open data structures, together with a type checker, can help catch typos in key lookups or access to a key that would never exist on a map. I think this kind of error is more common in development, and maybe a bigger pain point when it comes to slowing the team down. These similarly tend to get caught pretty early and rarely make it to prod, but if they did, they’d have a higher impact, because they won’t be strictly validated at runtime either. You’ll get a nil, and the code might just carry on as if the value really were nil, or nil punning kicks in and everything “does the right thing”. There may not even be an exception or error anywhere, which makes the bug silent, and only the user might start to suspect something is broken and have to report an issue to the team manually. So I’d personally consider this a bigger fish to fry if we wanted to focus on preventing some new kind of defect in Clojure.
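Here is a small sketch of the silent version of that bug. The `user` map and the `get!` helper are hypothetical, made up just for this example:

```clojure
(def user {:name "Ada" :email "ada@example.com"})

;; Typo: :emial instead of :email. No exception, just nil.
(def email (:emial user))

;; nil punning: all of these "work" on nil and hide the mistake.
(count email)       ; 0
(seq email)         ; nil
(str "To: " email)  ; "To: "

;; Hypothetical stricter lookup that fails loudly at runtime instead.
(defn get! [m k]
  (if (contains? m k)
    (get m k)
    (throw (ex-info "Missing key" {:key k :available (keys m)}))))

(get! user :email)    ; "ada@example.com"
;; (get! user :emial) ; would throw ExceptionInfo "Missing key"
```

A strict lookup like `get!` only moves the failure to runtime, so the typo path still has to be exercised. A closed type plus a type checker would flag it before the code ever runs.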
What other kinds of defects can it catch? And I think we should contrast that with defects a REPL can catch, or a code base with fewer LOC can catch, or that Spec can catch, etc. Static type checking won’t be your only tool to catch defects, so you have to contrast the occurrence and impact of defects caught by a static type checker against the occurrence and impact of defects caught by unit tests, integ tests, REPL driven development, lower LOC, simpler language constructs (maybe?), code reviews, manual QA, baking, generative spec testing, spec runtime assertions, immutability, dev training, better rested devs, etc. Not that a static type checker is mutually exclusive with any of those, but it could affect how much, how quickly, or how simply you can do any of them, so it still matters.
Finally, what you really made me think about is that the cost of a bug in production also depends on the use case, from the point of view of the user.
For example, in a hardware driver, if there is a type error and it throws an exception and crashes, that really doesn’t help me as a user. My computer just becomes unusable, maybe so unusable that even patching the driver is non-trivial, and I have to boot in safe mode or something like that to do it. So drivers really shouldn’t have type errors or anything else that makes them panic and crash. On the other hand, like I said, with a backend service, from the user’s point of view you get a few 500 error codes in some rare circumstances, and a few hours later (depending on the service SLA) it suddenly works again, with no need to patch, update, or do anything on the user side.
But what about application code? Like frontends or local apps such as a command-line tool, a text editor, a game, etc.?
I think in those cases, it depends. A type error would be annoying: some error would show up, a piece of functionality would be broken, and as a user I would need to report it manually to the devs (unless they have telemetry in the app), wait for a new patched release to be made available to me, then perform the update manually, etc.
On the other hand, if the language has hot-code reload, or is source based with readable source the user can edit themselves, or the app has a config file that allows hot-patching, then the user might be able to fix bugs on their own and not need to wait for the devs to release a fix. Maybe for a type error you’d still expect the devs to deal with it quickly, but for other types of errors, especially ones related to your particular setup and environment, having that ability as a user (if you’re savvy) is great. Emacs is a good example of this, but I’ve also seen users fix bugs in web apps with Greasemonkey scripts. So in Emacs, when I encounter a bug, I can just patch my config and the bug is gone. For tech savvy users that’s awesome.
Similarly, if you were to write software for a remote robot sent to Mars, you probably don’t want panics that crash everything, but you would also benefit from remote hot-patch functionality. Even if you used Idris, had a Coq proof that everything was bug free, and spent 200 months testing everything, you could still encounter a bug. For a million-dollar robot that took a year of space travel to reach Mars, the ability to remote debug and hot-patch it would be well worth having.
Anyways, I really like that angle. It brings concrete requirements into the equation, and that makes a lot of sense.