Bad-measurement data

When your data warehouse is struggling to store your data: is it wise to add another column to mark a row as measurement noise?

Normally I wouldn’t care about storing a flag bit. But after watching the first few lectures in CS50’s Introduction to Databases with SQL (harvard.edu) it seems that storing a byte becomes a problem over millions of rows.

Deletion of row is not feasible as history of bad-measurement needs to be stored anyway (for audit purposes).

So the 2 alternatives are:

  1. store an additional column for each row which contains a bit that denotes the column as human-labeled measurement noise.

  2. Warehouse a second set of data for analysis that does not contain noise-rows and has additional cleaning as data-versioning is needed for data science anyway.

(#2 has obvious drawbacks but I was wondering about what is done in practice).

There are other alternatives, e.g. using a second table with a foreign key to the first that would only contain the true values of that flag. Maybe also splitting the main table into a few similar tables with disjoint sets of columns. Maybe adding an extra column is the best approach in your case but would greatly benefit from partitioning the table on the values in that column or maybe on a set of other columns.

Many things depend on the specifics, only one of which is your particular DBMS which you didn’t even mention. So I would recommend a hands-on consultation from a proper DBA where they can go over all of the details and come up with a set of optimal solutions for your particular use-case.

1 Like

This topic was automatically closed 182 days after the last reply. New replies are no longer allowed.