Creating a central documentation repository / website, codox complications

olieidel · December 31, 2017, 3:55pm

Following up on @plexus’s autodoc presentation at the ClojureBerlin december meetup (great work!), Arne also briefly touched upon Elixir having a central documentation website, namely hexdocs.pm and whether we could accomplish the same for Clojure.

Over the holidays I thought I start hacking on it, it can’t be that difficult, right? Turns out, it probably is or I am missing something (which is equally likely). Spoiler: I didn’t get very far and thought it might be useful to collect some thoughts from more experienced Clojurists (= you) first before proceeding.

Hexdocs

The basic gist of hexdocs.pm is this:

You release a new version of your library to hex.pm (the Elixir package repository), example: Ecto
It automatically builds your docs and puts them up on hexdocs.pm, example: Ecto Docs where the “root” directory always points to the docs of the latest release
Docs for old versions are archived in a directory with the name of the version, example: Ecto v2.2.4
Yes, it’s that simple.

Clojure Sketch

So my basic sketch for Clojure was this:

Get all packages from clojars
Generate docs for all packages with codox
Put them up on S3 or so.

Sounds naive? It is. Here are the nitty-gritty details:

(easy) Scrape clojars.org and get packages with version numbers
- Clojars has easy-to-process data available.
(hard) Run all packages through codox
- ~~We can’t rely on the source code being in the jar file, we therefore have to head over to the source repository as seen on clojars (mostly some GitHub link) and grab the source.~~ Not all packages have GitHub links. Update: The .jars from clojars actually include Clojure source code (99% of them), so that’s good news
- After we’ve grabbed the source, we could in theory call codox.main/generate-docs with an options map which points to the directory. This will however fail for the following (obvious, but not to me initially) reasons:
  1. codox expects the package for which the docs are to be generated to be on the classpath! Why? It requires each namespace as prerequisite for parsing it
  2. require-ing a namespace assumes all dependencies are available and on the classpath
- I see three workarounds for this, all suboptimal in their own way:
  - Write a complete codox alternative which parses the source files on a lower level without require-ing them (lots of work, doesn’t sound like a good idea). This would allow us to generate docs without having the dependencies available. My best guess is that marginalia does this.
  - Dynamically load a package and its dependencies via pomegranate before generating the docs: May not work for some packages. How to unload packages before we start loading the next package for the next doc generation cycle? In playing around with it, I couldn’t get it to work but it could totally be caused by eating too many leftover christmas cookies…
  - Read the (leiningen) project.clj, assoc some codox options to it if missing and call lein codox via the shell (clojure.java.shell/sh), find the generated files, copy them somewhere and throw everything else away again (ugh).

I’m totally at a loss here and would appreciate your thoughts!
In short: How can we efficiently run codox (or something similar) on some random package without going through leiningen and / or having to download all its dependencies?

Some more random ideas

We could not generate static html like all other doc sites do, but instead create a repository of “raw” codox maps which include all needed information (namespaces, functions, doc strings, lines, etc.). We could then archive these codox maps for each package version.
(This would save space)
We could then create a super-slim single-page app which queries this repository of “doc maps”, parses and displays them (parsing markdown etc).
This would in theory allow for some super-neat features, like search queries over all functions of all libraries (because we have the data available somewhere), saving “favorites” and caching docs for offline usage (because it’s a SPA). We could also add some klipse in-browser-evaluation voodoo to the mix.

martinklepsch · December 31, 2017, 4:40pm

Awesome, that you took the time to look into this, it’s been on my mind since the meetup as well!

I think we can rely on it for 99% of packages. Most libraries don’t ship AOT compiled byte code but Clojure source code. For libraries (minority I assume) that don’t do this we could establish some ways they can work
around this issue.

Boot could be a useful companion in this adventure. It provides a convenient utility called Pods in which we can load dependencies in an isolated manner. If we need a fresh environment we just use a fresh pod. We can also dynamically add files from jars to some directory that we create beforehand.

I think it is highly preferable to use jars as foundation for this kind of effort since not everyone may be properly tagging releases on Github or even link to their Github repo in their pom.xml.

Having raw data is an amazing idea for editor integration and many other things. I’d expect something like this also to be a welcome addition to codox if it isn’t already present.

I’m in a bit of a rush right now but would love to collaborate on this kind of effort. I’ll try to whip up a basic example of using pods and copying jar files into some directory tomorrow or so.

olieidel · December 31, 2017, 4:49pm

Wow, thanks for your detailed reply, @martinklepsch!

Jep, I just noticed while examining some .jars… Good point!

That’s a great point and I think it sounds like the way to go!

Following up on this thought, we could maybe maintain an rsynced mirror of clojars which periodically generates docs for all new package versions. If the space requirements seem reasonable (I’ll look into that), we wouldn’t have to worry much about dependencies (because we have them all, haha).

I’m a total boot newbie and didn’t know about pods at all but they sound well-suited for this purpose. Will look into it in the next few days

plexus · January 1, 2018, 9:11am

I don’t have all the answers here, but some thoughts:

It’s understandable that Codox requires all source files given the dynamic nature of Clojure, but this opens this poses some serious sandboxing challenges. How long before someone uploads a jar that, when required, starts mining bitcoins?

The vast majority of libraries are well-behaved enough that they could be analyzed statically to pull out the docs. Presumably it should be possible to feed that into codox’s generation backend. Not saying this is necessarily the best way to go, but I wouldn’t rule it out, as it would simplify things a lot.

I don’t think it’s necessary to eagerly generate all docs for all packages and all versions, instead make it lazy. Whne someone request package x version y for the first time show a message saying “your docs will be ready in a minute”, and then fetch and generate the docs in the background.

Being able to read people’s Codox config is pretty important, as it specifies which format the docstrings are in (e.g. Markdown), and also lists extra files to be included (e.g. doc/GETTING_STARTED.md). Either we read those from lein.clj or we create some standardized way to include this metadata in a jar, so going forward library authors could use that.

I think it’s fine to start with a subset that we can easily handle, e.g. projects with a Github link on Clojars that use leiningen or boot. That should already cover a pretty large amount of libs, you can deal with other cases later.

martinklepsch · January 1, 2018, 10:26am

@plexus has a good point here. Simply loading and unloading dependencies will not provide enough isolation against bad actors. We could look into clojail and similar stuff but I guess it could be easiest to just wrap it all in some container thing.

martinklepsch · January 1, 2018, 10:47am

That’s a good point as well and one that isn’t easily solved when using jars as foundation. Would be interesting to look into how hexdocs solves this configuration problem.

olieidel · January 1, 2018, 8:52pm

Yes. It would be elegant to just use the jars but there we’re missing leiningen and boot files and we would therefore have to a) guess the source directory structure, b) guess the runtime (clj / cljs / clr) (codox needs this in its opts) and c) guess the codox config (ugh). Further, as @plexus correctly noted, we’re then missing the ./docs (or similar) folder and files.

Good point!

See also codox issue #126 which sadly went nowhere.

martinklepsch · January 5, 2018, 5:31pm

I made something which currently only operates on jar files: https://github.com/martinklepsch/clj-docs

After cloning you can run something like this:

# creates derivatives-docs directory
boot build-docs --project org.martinklepsch/derivatives --version 0.2.0

or this

# creates core-docs directory
boot build-docs --project boot/core --version 2.7.2

Here are some additional thoughts which are also in the README:

EDIT To prevent things from getting out of sync please check the README directly.

juhoteperi · January 5, 2018, 6:32pm

Jars packaged with Leiningen contain project.clj on the jar file root, and under META-INF/leiningen/group-id/artifact-id. Packaging build.boot on jar file would make a bit less sense, as it is code instead of data.

danielcompton · January 7, 2018, 10:03pm

I think this is a really cool idea. https://github.com/clojars/clojars-web/issues/510 was opened to talk about hosting this kind of thing on Clojars. Clojars already has a bunch of infrastructure in place for this kind of thing, and would be a fairly stable long-term location for this. If you’re interested, I think this could be something that Clojars could help out with.

arrdem · January 7, 2018, 10:37pm

hexdocs.org is awesome! I hadn’t seen that before. I’d definitely like to see the state of Clojure’s documentation overall improve, and I agree that having better tools for writing and sharing documentation is a important to that effort. Having a single, central domain such as say http://docs.clojars.org which leverages the existing clojars property would also dramatically help discoverability and searchability via Google of hosted documentation. Spreading docs out over lots of github pages and wikis defeats a lot of the heuristics Google uses to determine page rank and leads to the repositories themselves frequently being higher link score than the doc pages.

One idea worth exploring is Maven’s existing support for classifiers. Particularly, the Java ecosystem already uses the “javadoc” classifier to deploy documentation alongside jar artifacts.

As @plexus already noted, it’d be pretty trivial to steal compute resources from clojars if there’s an execution model – java bitcoin miner and all that. Coming up with a standard for how documentation is built and distributed as a static file format solves that entire problem.

There’s already some work in this area. For Grimoire, I developed a Maven-inspired store structure for documentation, notes and examples – lib-grimoire. Later, a GSoC student I was mentoring came up with a related specification – lib-grenada.

Building support for rendering “cljdoc” jars or whatever the classifier may be sidesteps the code indexing problem. Library authors may now build and deploy their own documentation, and serving it should be fairly straightforwards. This reduces the intractable Clojure sandboxing problem down to the much better understood HTML sanitization problem of making sure that served “cljdocs” don’t for instance package malicious JavaScript or what have you.

Building a plugin to make deploy generate and push a documentation jar as shouldn’t be too hard. It also gives people a natural offline documentation story, because docs are just a jar that Maven can fetch and cache.

To take a step back here, I’d like to point out that there’s a lot more to documentation than just API documentation.

I wrote conj.io working from the assumption that all you really needed was API documentation – If you just generated API docs users would be able to discover what they wanted (or needed) and life would be good. This turned out to be very false. Grimoire’s primary value add was its cheatsheet which related symbols to each-other along logical lines. I did disfiguring damage to the site and my traffic metrics reflected it when I started making changes to the cheatsheet without appreciating this fact.

While Codox supports articles, all the uses I’ve seen treat it like Javadoc which makes this same assumption that gendocs are sufficient.

This and other experiences leads me to think that documentation needs to encompass examples, cookbooks, API documentation and full length articles all of which need to be able to reference each-other. API docs are good, but unless you can relate the API to concepts it’s difficult to navigate them. Motivational and architecture documents are far more valuable when related to examples which are themselves far better when indexed against the API.

Doing what Grimoire does and maintaining a glorified tree of maps from fully qualified names to markdown files is nowhere near sufficient. It may be adequate for the backend, but it’s not the view you want to show users. Stacks is my attempt to sketch out some forms of more general documentation as data, but it’s nowhere near ready for use.

I suggest that stealing from Grimoire/Grenada, designing a spec for some documentation index datastructure and MVPing something that builds and renders it is probably the right way to move this forwards. At least that’s what I’m trying to do with stacks.

martinklepsch · January 8, 2018, 10:32am

I agree that “stealing compute” is an issue but as someone pointed out it’s a solvable one. Containers, Faas, etc could be solutions. I believe it is critically important that a user does not have to do anything to get some documentation. In my personal experience just having some documentation is a great incentive to make it better.

Those libraries look great and I will need to look more into how we might utilize them. I believe there is significant value in providing machine & human-readable documentation.

While this doesn’t address the issue of cross-referencing in documentation I believe this can (to some degree) be solved by providing “template files”, i.e. sections and files that give some inspiration to a library author what kind of documentation should be present. If they ignore it the “template” info will show up on in their documentation, at which point they hopefully reconsider their choice not to provide any non-API docs.

One issue with these kinds of template files is that something needs to put them into their project tree.

I’d really like to get something basic up and running while maintaining some openness to future improvements.

Some other related or not so related links:

https://docusaurus.io/ may also be interesting to check out for inspiration. Mostly focuses on non-API docs as far as I can see.
graphql.github.io/notes/NewSiteArchitecture.md at source · graphql/graphql.github.io · GitHub How the GraphQL project structures their site, maybe we can take some inspiration from there for our documentation “template”.