When your data warehouse is already struggling to store your data, is it wise to add another column that marks a row as measurement noise?
Normally I wouldn’t care about storing a flag bit. But after watching the first few lectures of CS50’s Introduction to Databases with SQL (harvard.edu), it seems that even one extra byte per row becomes a problem over millions of rows.
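To put a rough number on that worry, here is a back-of-the-envelope sketch. The row counts are hypothetical (I only know it is "millions of rows"), and it assumes the flag costs exactly one uncompressed byte per row, which may not match how a given engine actually stores it.

```python
# Assumption: the flag is stored as one uncompressed byte per row.
BYTES_PER_FLAG = 1


def flag_overhead_mb(rows: int, bytes_per_flag: int = BYTES_PER_FLAG) -> float:
    """Return the extra storage for the flag column in megabytes (10**6 bytes)."""
    return rows * bytes_per_flag / 1_000_000


# Hypothetical table sizes, purely for illustration.
for rows in (1_000_000, 100_000_000, 1_000_000_000):
    print(f"{rows:>13,} rows -> {flag_overhead_mb(rows):,.0f} MB extra")
```

This ignores compression and per-row overhead, so it is only an order-of-magnitude estimate.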
Deleting the rows is not feasible, as the history of bad measurements needs to be kept anyway (for audit purposes).
So the 2 alternatives are:
1. Store an additional column for each row, containing a bit that marks the row as human-labeled measurement noise.
2. Warehouse a second set of data for analysis that does not contain the noise rows and has additional cleaning applied, as data versioning is needed for data science anyway.
(#2 has obvious drawbacks, but I am wondering what is done in practice.)
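To make the first option concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The table, column, and view names (`measurements`, `is_noise`, `clean_measurements`) and the sample values are made up for illustration, and a real warehouse engine would have its own types and storage behavior. The view is just one assumed way that analysis queries could skip noise rows while the flagged rows stay available for audit.

```python
import sqlite3

# In-memory database so the sketch leaves nothing behind.
conn = sqlite3.connect(":memory:")

# Hypothetical schema: one flag column marks human-labeled noise.
conn.execute(
    """
    CREATE TABLE measurements (
        id INTEGER PRIMARY KEY,
        value REAL NOT NULL,
        is_noise INTEGER NOT NULL DEFAULT 0 CHECK (is_noise IN (0, 1))
    )
    """
)

# Made-up sample data; the third row is labeled as noise.
conn.executemany(
    "INSERT INTO measurements (value, is_noise) VALUES (?, ?)",
    [(20.1, 0), (19.8, 0), (950.0, 1), (20.3, 0)],
)

# A view that hides noise rows from analysis without deleting them.
conn.execute(
    "CREATE VIEW clean_measurements AS "
    "SELECT id, value FROM measurements WHERE is_noise = 0"
)

all_rows = conn.execute("SELECT COUNT(*) FROM measurements").fetchone()[0]
clean_rows = conn.execute("SELECT COUNT(*) FROM clean_measurements").fetchone()[0]
print(all_rows, clean_rows)
```

The full table still holds every row for audit purposes, while analysis code would read only from the view.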