Review: What is Data Oriented Programming?

Yehonathan_Sharvit · June 5, 2020, 11:05am

In preparation for my upcoming book about Data Oriented Programming, I am writing a short paragraph to explain what Data Oriented Programming is and how it relates to Functional Programming.

Your review comments and improvement ideas are welcome!

Simplifying a bit, we can state that the two sacred paradigms of Object Oriented Programming (OOP) are:

1. Write code as methods inside classes
2. Encapsulate data as members inside classes

In a sense, FP is a rebellion against OOP's first sacred paradigm: FP encourages us to write code inside functions that are not coupled to objects. In addition, FP treats functions as first-class citizens (e.g. we are allowed to pass functions as arguments to other functions).

Similarly, we could say that DOP is a rebellion against OOP's second sacred paradigm: DOP encourages us to represent data without coupling it to members specified in advance in a class definition. In addition, DOP treats data as a first-class citizen (e.g. we are allowed to inspect the fields of an associative collection programmatically).
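A minimal sketch of both ideas in Clojure. The `book` map and its fields are hypothetical, made up only for illustration:

```clojure
;; A hypothetical record, represented as a plain map instead of a class.
(def book {:title "Some Book" :year 2020 :authors ["A. Author" "B. Writer"]})

;; DOP side: the fields can be inspected programmatically,
;; with no class definition declaring them in advance.
(keys book)              ; the field names, as data
(contains? book :year)   ; => true
(count book)             ; => 3

;; FP side: a function (count) passed as an argument to another function (map).
(map count (:authors book))
```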

There are programming languages that embrace both FP and DOP (Clojure, JavaScript).

There are programming languages that embrace neither FP nor DOP (old C++, old C#, old Java).

There are programming languages that embrace FP without embracing DOP (e.g. Haskell, OCaml).

As far as I know, there are no programming languages that embrace DOP without embracing FP. However, it is possible to apply principles and recipes from DOP to OO languages.

jan · June 5, 2020, 12:09pm

I’d argue that FP is chiefly about controlling side effects. Functions are useful, but immutable data structures make the real difference. A good type system can help a lot, too. See also this piece by Kris Jenkins.

OOP on the other hand is mostly about message passing, hiding data, and late binding. Classes are optional.

Yehonathan_Sharvit · June 5, 2020, 4:03pm

Great piece by Kris Jenkins. Thanks for sharing.

didibus · June 5, 2020, 5:43pm

For a good explanation of FP, see my comment here: https://www.reddit.com/r/Clojure/comments/guv9xn/comment/fsm5jds

For a little overview of data-oriented programming as pursued by Clojure, see my comment here: https://news.ycombinator.com/item?id=23425698

Anthony_Leonard · June 6, 2020, 7:52pm

Not to be confused with “data oriented design”, which is (I think) currently gaining ground in the C++ and gaming communities. It espouses a “close to the metal” type of data orientation, where design is focused on getting the data from where it’s likely to be (L2 cache, next array block, etc.), transforming it, and then putting it where you need it (graphics pixel colour) as quickly as possible. It thereby dispenses with traditional design methods centred around modelling the world, layered abstractions and OO. So if your graphics card (type) changes, your world has changed, so your code and design change too… The main proponent, Mike Acton, is persuasive in abandoning traditional design as having failed, but with very different alternative solutions that make sense in his domain. All quite interesting!

In the Clojure community I hear “data driven” much more than “data oriented”?

didibus · June 7, 2020, 2:04am

Naming is the hardest problem in computer science, so I doubt we will be able to solve it.

That said, whatever their name, these are some of the ideas I see:

Tentatively known as the Data Driven style

It’s the idea that you create DSLs which are made out of descriptive data. And then there is an interpreter/compiler which performs the necessary computations/actions as described by the data DSL. Examples of this are: hiccup, garden, regal, tools.deps, lein, datomic, datascript, meander, etc.

When applied to a more normal program, this might manifest by modeling actions as a data-structure of commands to be executed by a component that consumes them.
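As a rough sketch of that last point, assuming a made-up banking domain (the `:deposit` and `:withdraw` commands and the `run-command` multimethod are hypothetical names, not from any library):

```clojure
;; The "program" is descriptive data: each command is a plain map.
(def commands
  [{:op :deposit  :amount 100}
   {:op :withdraw :amount 30}])

;; The interpreter is the only component that knows how to execute a command.
(defmulti run-command (fn [_balance command] (:op command)))

(defmethod run-command :deposit [balance {:keys [amount]}]
  (+ balance amount))

(defmethod run-command :withdraw [balance {:keys [amount]}]
  (- balance amount))

(defn run-all
  "Folds a sequence of command maps over a starting balance."
  [balance commands]
  (reduce run-command balance commands))

(run-all 0 commands) ; => 70
```

Because the commands are data, they can be logged, stored, or inspected before anything is executed.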

Tentatively known as Data Oriented style

If object oriented is the style where you model your domain using objects, data oriented is the style where you model your domain using data. Examples of this are most things in Clojure, where your domain information is mapped to Clojure data structures such as persistent maps, vectors, lists, sets, etc. In doing so, you can now use functions which are generic over the type of information, in that they work for any data, no matter what the data represents.
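A small sketch of what that genericity looks like. The `user` and `receipt` maps are hypothetical; the functions are all from clojure.core:

```clojure
;; Two unrelated domain entities, both modeled as plain maps.
(def user    {:name "Ann" :email "ann@example.com"})
(def receipt {:total 42 :currency :usd})

;; The same generic functions work on both, whatever the data represents.
(assoc user :active true)
(assoc receipt :paid true)
(select-keys user [:name])   ; => {:name "Ann"}
(update receipt :total inc)  ; => {:total 43, :currency :usd}
```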

Tentatively known as Data

We’ve been talking a lot about data, but what is it? The idea of “data” is that of having structured values. Values can be either dimensions (aka labels) or metrics (aka quantities). And data is a particular arrangement (aka structuring) of one or more values. Thus when I say data in the above two styles, I mean data structures of values.

One last thing to note is that two values are equal if they have equal dimensions and measures. And two pieces of data are equal if they have equal structural partitions and values.
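In Clojure terms, this is roughly what `=` gives you on the built-in collections. A tiny sketch with made-up values:

```clojure
;; Same structure and same values: equal, regardless of how they were written.
(= {:name "Ann" :tags #{:a :b}}
   {:tags #{:b :a} :name "Ann"})   ; => true

;; Order is part of a vector's structure, so these differ.
(= [1 2 3] [1 2 3])                ; => true
(= [1 2 3] [3 2 1])                ; => false

;; Same structure, different measure: not equal.
(= {:amount-owed 234} {:amount-owed 235}) ; => false
```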

Tentatively known as Data Oriented in gaming (but it’s a different idea)

While it shares the name, this is a very different idea: “data” here means something different, and so does Data Oriented.

In this style, data refers to computer memory, all forms of memory. I wish they’d called it Memory Oriented to be honest. Memory can be RAM, CPU caches, SSDs, HDDs, DVDs, etc.

It is the idea that you model your domain in a way that is most appropriate for the type of computer memory you will store it in. It contrasts itself with both Object Oriented, where we model the domain with objects no matter where we are going to store the domain information, and the prior Data Oriented, where we would model the domain as close as possible to how the domain itself structures the information. In this Data Oriented style, we would not structure the data how the domain structures it, but how the computer memory itself is structured. And we wouldn’t use the same values as the ones in the domain either; we would again use values that the computer memory itself supports, even at the price of losing precision (because the goal of this style is performance).
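To make the contrast concrete, here is a loose sketch in Clojure, even though this style is usually practiced in C or C++. The particle data is hypothetical, and storing each field in its own primitive array is only one possible memory-shaped layout:

```clojure
;; Domain-shaped: one map per particle.
(def particles
  [{:x 1.0 :y 2.0}
   {:x 3.0 :y 4.0}
   {:x 5.0 :y 6.0}])

;; Memory-shaped: one contiguous primitive array per field.
;; float has less precision than the doubles above; that is the trade-off.
(def xs (float-array (map :x particles)))
(def ys (float-array (map :y particles)))

(defn sum-floats
  "Sums a float array with a primitive loop over contiguous memory."
  [^floats arr]
  (areduce arr i acc 0.0 (+ acc (aget arr i))))

(sum-floats xs) ; => 9.0
```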

At least these are some of the ideas I see floating around. There are more ideas out there, and there are also many variants of the above ideas. Naming each one is a hard problem. If only we all named things using namespaced names:

org.clojure/data-oriented

No more confusion hehe

Yehonathan_Sharvit · June 7, 2020, 7:02am

I really like the clarifications you are providing @didibus
The topic I am interested in is what you call: Data oriented style.

What do you mean when you write that values can be dimensions (aka labels)?

didibus · June 7, 2020, 8:24am

Like keys on a map or things that are discrete.

{:user/name "John"
 :country :us 
 :alive true}

These are all dimensions, including the keys themselves.

While a measure would be like:

{:amount-owed 234}

Here the key :amount-owed is a dimension and 234 is a measure.

The terms come from analytics. You can think of dimensions as things you could include in a group-by, sort-by, or join-on clause, while measures are things you would plot or keep in the select clause.
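A quick sketch of that group-by idea, using made-up rows and a hypothetical helper named `total-owed-by`:

```clojure
;; :country is a dimension, :amount-owed is a measure.
(def rows
  [{:country :us :amount-owed 234}
   {:country :us :amount-owed 100}
   {:country :fr :amount-owed 50}])

(defn total-owed-by
  "Groups rows by a dimension key and sums the :amount-owed measure."
  [dimension rows]
  (into {}
        (map (fn [[k group]] [k (reduce + (map :amount-owed group))]))
        (group-by dimension rows)))

(total-owed-by :country rows) ; => {:us 334, :fr 50}
```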

Yehonathan_Sharvit · June 7, 2020, 8:38am

@didibus

What do you mean by analytics?
Could you point to where in analytics those terms come from?

didibus · June 7, 2020, 6:36pm

By analytics I mean the field of data analytics and business intelligence. I’m not sure where the canonical source would be, but for example the Google Analytics help page talks about it:

https://support.google.com/analytics/answer/1033861

Or in SAP analytics:

Most data analytics platforms use the terms dimension and measure (or metric, or fact).

Maybe this wiki page is a bit more “canonical”: https://en.m.wikipedia.org/wiki/Dimension_(data_warehouse)

slifin · June 8, 2020, 3:01pm

The best parts of Drupal are data driven when you look at them: the form API, the menu system, the schema. They recently transitioned to OO, and it looks like those data systems remained mostly intact.

I would say that they’re not directly compatible, since language primitives are typically the substrate used in data driven programs (not classes), but that doesn’t mean they can’t co-exist in the same system.

ericnormand · June 11, 2020, 3:24pm

Okay, my 2 cents. I hope they clarify.

@Yehonathan_Sharvit, your distinction between OOP and FP reminds me of the model where OOP first switches on the class, then it switches on the function, while FP first switches on the function, then it switches on the class. (Many OOP codebases use classes as an open way of creating what in a typed FP language might be a tagged union.)

I don’t think that’s quite the point you’re making, but it is related. The model sees FP and OOP as essentially isomorphic. The question is then which switch (class or function) will be the most beneficial for your problem. For example, are you expecting to add classes but have the same functions? Then OOP is better. If you are expecting to add functions but have the same classes, then FP is better.
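Here is a rough sketch of the two switches, using a hypothetical shapes example. The names `area`, `area-of`, `Shape`, `Circle` and `Square` are made up for illustration:

```clojure
;; Function first, then variant: one function that switches on the kind.
;; Adding a new function is easy; adding a new kind touches every function.
(defn area [shape]
  (case (:kind shape)
    :circle (* Math/PI (:r shape) (:r shape))
    :square (* (:side shape) (:side shape))))

;; Type first, then function: each type lists its own operations.
;; Adding a new type is easy; adding a new operation touches every type.
(defprotocol Shape
  (area-of [this]))

(defrecord Circle [r]
  Shape
  (area-of [_] (* Math/PI r r)))

(defrecord Square [side]
  Shape
  (area-of [_] (* side side)))

(area {:kind :square :side 2}) ; => 4
(area-of (->Square 2))         ; => 4
```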

I think it’s a useful reductionist view of the differences, but doesn’t quite capture what is special about them as paradigms (thinking tools), seeing both FP and OOP as merely dispatch mechanisms that you could easily translate between.

However, for the purposes of your book, this might be just what you want. My opinion might or might not be useful to you, but I’ll state it here.

OOP models things as Objects. Each object has References to zero or more other Objects. An Object with a Reference to another Object may send that Object a Message.

Objects
References
Messages

Computation largely happens by messages flowing through a large, complex, and, on the whole, unknowable network of object references. A little bit of computation happens in the Objects themselves (for instance, when you send the + 1 message to 3, that terminates in an ADD instruction on the actual machine you’re running on). If there is class-based dispatch, some computation happens there, too (for instance, class-based dispatch can replace conditionals).

FP models things as Actions, Calculations, and Data. Actions are often known as effects or side-effects. They have an effect on the world outside of the software. Calculations are timeless computations. They do not depend on when or where they are run. Data is facts about the world, and as facts, they don’t change, but also they are inert (they can’t execute as a calculation can).

FP differs primarily by recognizing that Calculations and Data are easier to work with. Actions are harder, and so we should devote more attention to getting them right.
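A small sketch of the three categories, assuming a made-up order domain (`order`, `order-total` and `print-receipt!` are hypothetical names):

```clojure
;; Data: an inert fact.
(def order {:items [{:price 10 :qty 2} {:price 5 :qty 1}]})

;; Calculation: same input, same output, whenever and wherever it runs.
(defn order-total [order]
  (reduce + (map (fn [{:keys [price qty]}] (* price qty)) (:items order))))

;; Action: depends on when it runs and affects the world outside the program.
(defn print-receipt! [order]
  (println "Printed at" (System/currentTimeMillis)
           "total:" (order-total order)))

(order-total order) ; => 25
```

Keeping `order-total` free of printing and clock reads means the hard part (the Action) stays small.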

I think Data Oriented Programming is the recognition that facts are

1. structured
2. interpreted in various ways, even in the same program

Haskell gets that data is structured, but the types make #2 very difficult. Yes, you can have different types for different means of interpretation, but they are difficult to know ahead of time and they are numerous.

A Data Orientation means you find an abstraction for storing data that allows you to capture the structure of the facts with a high fidelity. How to capture that structure is the province of data modeling. You can then interpret those facts in various ways for the many purposes of your software.

This is what Relational Databases were designed for. Get the data in there with some structure. Then the query engine can do arbitrary queries on it in something like a declarative logic language.

Clojure has a different model, but it is a model, nonetheless. Associate values with names (maps of keyword->value). Sometimes you need a collection that maintains order (vectors). And sometimes you need to check for value containment in a collection (sets with contains?).
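For readers less familiar with Clojure, those three cases look roughly like this (the values are made up):

```clojure
;; Associate values with names: a map of keyword -> value.
(def person {:name "Ann" :langs ["clojure" "sql"]})
(get person :name)        ; => "Ann"

;; A collection that maintains order: a vector.
(def steps [:parse :validate :save])
(first steps)             ; => :parse

;; Value containment: a set with contains?.
(def admins #{"ann" "bob"})
(contains? admins "ann")  ; => true
(contains? admins "eve")  ; => false
```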

Data Orientation mainly gets data into some structure, then interprets it in various ways.

One benefit is that you can write functions at two levels of generality. The first level is domain-agnostic operations that operate on data in given structures. Since there is only a small set of possible structures, these functions are very reusable and often allow for combinatorial recombination. It’s like being able to extend the SQL language.

The second level is domain-specific operations that understand something not captured in the data structure. For instance, your code might know that it can take a map under the :address key and send it to a geolookup API to turn it into a lat-long. These kinds of operations are less reusable but obviously necessary.

Most orientations focus on the second level of generality. Data orientation separates these out and leaves things as data.
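A sketch of the two levels side by side. Everything here is hypothetical: `rename-key` is a made-up generic helper, and `geolookup` is a stub standing in for a real geocoding API call, with made-up coordinates:

```clojure
;; Level 1: domain-agnostic. Works on any map, whatever it represents.
(defn rename-key [m old-key new-key]
  (-> m
      (assoc new-key (get m old-key))
      (dissoc old-key)))

;; Stub for an external geolookup service (assumption: the real one
;; would be a network call; the coordinates below are invented).
(defn geolookup [{:keys [city]}]
  (get {"Springfield" {:lat 1.0 :lng 2.0}} city))

;; Level 2: domain-specific. Knows that :address holds an address map.
(defn with-location [entity]
  (assoc entity :location (geolookup (:address entity))))

(rename-key {:addr {:city "Springfield"}} :addr :address)
(with-location {:name "Ann" :address {:city "Springfield"}})
```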

Another benefit is that you can be very fluid with your interpretation of the data. For instance, different stages of a workflow may need different pieces of data all tied to a domain entity. Would you write different types for each of those aggregates of data? Having a generic data abstraction lets you deal with this fluidly. Some would argue it’s too fluid. I think spec was supposed to help with this (but wink wink I think it made it worse and is why spec 2 was started).
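As a sketch of that fluidity, assuming a hypothetical customer entity and two made-up workflow stages:

```clojure
;; One generic map holds everything known about the entity.
(def customer
  {:id 7
   :name "Ann"
   :email "ann@example.com"
   :address {:city "Springfield"}
   :card-token "tok_123"})

;; Each stage picks the pieces it needs, without a dedicated type per stage.
(defn shipping-view [c] (select-keys c [:name :address]))
(defn billing-view  [c] (select-keys c [:id :card-token]))

(shipping-view customer) ; => {:name "Ann", :address {:city "Springfield"}}
(billing-view customer)  ; => {:id 7, :card-token "tok_123"}
```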

Eric

ericnormand · June 11, 2020, 5:13pm

If you count Erlang as OOP (since it is message passing), it could count also as DOP.

ericnormand · June 11, 2020, 5:14pm

I like the idea of FP and DOP being rebellions. Free the functions! Free the data!

delonnewman · June 11, 2020, 10:15pm

Prolog might be a good example of a language that is data-orientated, but not a functional programming language. Perl might be another.

didibus · June 11, 2020, 10:30pm

I think with Erlang and other languages that are not OO, the question is more vague, because we didn’t explain exactly what is data in DOP.

For example, Erlang has records, they’re like structs in C, and I think similar to how Haskell handles them as well. I think they compile to a tuple, so it seems it’s a compile time only concept.

In any case, is a struct/record data?

What would the criteria be?

That we can count the number of elements?
That we can iterate over the elements?
That we can dynamically add/remove elements from it?
That we can serialize/deserialize it easily?
That the schema for it is self-describing?
…

I think we’d need to answer that.

delonnewman · June 11, 2020, 10:55pm

The way I tend to think of data-oriented programming is as at least two components: generality (using general structures to model your data, e.g. lists, maps, sets, vectors) and reification (is the data abstraction a thing that you can grab and talk about easily in your code?). The “oriented” part would imply (in my mind) that the language makes these things idiomatic. Change-in-place (in many cases) breaks the second because the abstraction now takes on notions of time and place.

Some OOP languages that could be described as data-oriented are Smalltalk, Self and their children.

didibus · June 12, 2020, 12:33am

Why would you include Smalltalk as data oriented?

It seems to fail miserably at your first criterion:

delonnewman · June 12, 2020, 10:08pm

I intended lists, maps, sets, and vectors as examples of data structures that are general in nature and can be used in data-oriented programming, not as an exhaustive list. Smalltalk was an attempt to take the general and recursive nature of Lisp to the next level by defining a language in terms of only Objects and Methods.

Similarly, Prolog and SQL don’t implement those data structures but are similarly data-oriented due (in part) to their generality. Prolog describes data in terms of Relations and Rules. SQL describes data in terms of Tables.

didibus · June 13, 2020, 1:13am

What would not be data oriented then? I don’t know any language that doesn’t have some form of data-structure.

I think we have to go one level down and define some properties that the data-structures in a data oriented language must have. Take your point on generality, for example. Now re-reading, I see you meant using general structures, but I don’t know what that means. What’s a general structure? One that is popular? One that is used pervasively? I was thinking of it more in terms of general operations over the structure. That is, it doesn’t matter what the data represents: whether the map is a bank account, a user, a receipt, etc., I’ll still use the exact same functions to manipulate it.

In an OO language (though I’m not sure about Smalltalk), this generality of functions doesn’t exist. For example, just getting an element from the structure is a custom method (a so-called getter). So someone would use “getName” to get the name out of a User structure. In a data oriented language, you’d use the generic “getElementFromData” function, which returns the value at a particular key and is agnostic of the fact that the data models a User.
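To illustrate the difference in Clojure terms: the `Named` protocol and `User` type below are hypothetical and only mimic the OO getter style, while `get` is the real generic accessor from clojure.core.

```clojure
;; OO flavour: a custom getter that only exists for this type.
(defprotocol Named
  (get-name [this]))

(deftype User [name]
  Named
  (get-name [_] name))

(get-name (User. "Ann"))                  ; => "Ann"

;; Data oriented flavour: one generic accessor for any map,
;; agnostic of what the map models.
(get {:name "Ann"} :name)                 ; => "Ann"
(get {:balance 10 :owner "Ann"} :balance) ; => 10
```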

I’m not super familiar with Smalltalk, are you saying Smalltalk would have had a generic getter that works to retrieve any element of any Object no matter what the Object models in the domain?