Data Validation with loggit

loggit provides, first and foremost, a simple logging facility. However, the way the logs are written and retrieved allows you to analyze the log data locally, and not just in a remote log analytics tool (like Splunk). One of the most powerful ways to use loggit, and indeed the original motivation for this package, is as a data validation buffer.
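For reference, the core mechanics are just a few calls (a minimal sketch; "pipeline.log" is only an illustrative path):

library(loggit)
set_logfile("pipeline.log")         # direct log entries to a file of your choosing
loggit("INFO", "Pipeline started")  # write a log entry
read_logs()                         # read the entries back as a data frame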

Say you have a data pipeline you’ve written in R. Maybe you read some input data, perform some transformations, and then output the results to a database. However, you worry that the data being output is of low quality. Maybe the integrity of the data is impacted during the transformations, or a grouping is lost after a join. By leveraging loggit as a validation buffer, you can prevent writing out erroneous data to the database and alert your team that the data quality is to blame.

Let’s take the iris dataset as a stand-in for real data:

head(iris)
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1          5.1         3.5          1.4         0.2  setosa
#> 2          4.9         3.0          1.4         0.2  setosa
#> 3          4.7         3.2          1.3         0.2  setosa
#> 4          4.6         3.1          1.5         0.2  setosa
#> 5          5.0         3.6          1.4         0.2  setosa
#> 6          5.4         3.9          1.7         0.4  setosa

You’re tasked with aggregating the data by species, finding the mean of each measurement, and outputting the results. Easy enough; the rest of the prep work (renaming the columns in iris to be neater, and so on) was done somewhere else in the analysis pipeline, and you’d named that cleaned data frame iris_0:
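That cleanup isn’t shown in the pipeline, but it might have looked something like this (a sketch, assuming nothing more than a column rename):

iris_0 <- iris
names(iris_0) <- tolower(gsub(".", "_", names(iris_0), fixed = TRUE))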

head(iris_0)
#>   sepal_length sepal_width petal_length petal_width species
#> 1          5.1         3.5          1.4         0.2  setosa
#> 2          4.9         3.0          1.4         0.2  setosa
#> 3          4.7         3.2          1.3         0.2  setosa
#> 4          4.6         3.1          1.5         0.2  setosa
#> 5          5.0         3.6          1.4         0.2  setosa
#> 6          5.4         3.9          1.7         0.4  setosa
iris_agg <- aggregate(. ~ species, data = iris_0, mean)
iris_agg
#>      species sepal_length sepal_width petal_length petal_width
#> 1     setosa     5.073333    3.488889     1.482222   0.2511111
#> 2 versicolor     5.936000    2.770000     4.260000   1.3260000
#> 3  virginica     6.588000    2.974000     5.552000   2.0260000

Nice and compact.

However, you’ve been hearing from downstream that your aggregations don’t seem right. You’ve tried to look through your code to find why that might be, but nothing stands out; and frankly, you haven’t found the time to dig any deeper. It would be nice if you’d written a way to catch any miscalculations automatically, based on business logic.

This is where loggit can help! A good workflow I like to use is to have all my code in functions (you should do this anyway), and then have separate, similarly-named validation functions that execute right before the end of the analysis functions:

some_function <- function(df_in) {
  # Do your regular transformations, modeling, etc.
  df_out <- aggregate(in_some_way, df_in)
  # Just before returning from the function, call the validator, which logs out
  # the result
  validate_some_function(df_out, df_in)
  # Then, return or exit as usual
  df_out
}

validate_some_function <- function(df_out, df_in) {
  df_in_expected <- some_code_to_get_df_in_to_look_like_df_out
  if (df_out$value != df_in_expected$value) {
    loggit("ERROR", sprintf("Actual (%s) != Expected (%s)", df_out$value, df_in_expected$value))
  }
}

Then, at the very end of your pipeline, script, etc., before the data is written out, you can check whether you captured any data quality errors during the run (a check that should live in its own function):

logdata <- read_logs()
logdata <- logdata[logdata$log_lvl == "ERROR", ]
if (nrow(logdata) > 0) {
  print(logdata)  # show the offending entries before failing
  stop("Data validation failures detected! Review above!")
}

This will terminate the pipeline and print an informative set of data to review (what’s included depends entirely on how you logged the data, and how you structure that failure message). Deferring the check to the very end also lets the full pipeline keep executing, so you can see all the issues you wanted to track in a single run.
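Since that check belongs in its own function anyway, here is one way to package it (stop_if_validation_failures is a hypothetical name, not part of loggit):

stop_if_validation_failures <- function() {
  logdata <- read_logs()
  errors <- logdata[logdata$log_lvl == "ERROR", ]
  if (nrow(errors) > 0) {
    print(errors)  # surface the offending entries
    stop("Data validation failures detected! Review above!")
  }
  invisible(TRUE)
}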

Returning to our iris example: you suspect that the sepal_length field is behind the data quality complaints. You can construct a (very targeted) validator for it like so:

validate_aggregate_iris <- function(iris_out, iris_in) {
  actual_mean <- mean(iris_out$sepal_length)
  expected_mean <- mean(iris_in$Sepal.Length)
  if (actual_mean != expected_mean) {
    loggit("ERROR", sprintf("Means differ! (actual = %.4f, expected = %.4f)", actual_mean, expected_mean))
  }
}

validate_aggregate_iris(iris_agg, iris)
#> {"timestamp": "2021-02-27T20:37:41-0600", "log_lvl": "ERROR", "log_msg": "Means differ! (actual = 5.8658__COMMA__ expected = 5.8433"}

Ah-ha! It was (at least) Sepal.Length that was causing the issue! Now you have an excuse to dig through your code (and can no longer blame it on “source data quality”). You find that you had this tiiiny line somewhere else in your code, where you subset the data for some reason:

iris_0 <- iris[iris$Sepal.Length > 4.5, ]

Now, you can either keep the subset and write the validation with that in mind (as sketched below), or remove the subset operation entirely. Either way, careful planning and using loggit to track pipeline quality helped narrow down the issue.
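If you keep the subset, the validator can account for it. Here is one sketch (the 4.5 cutoff mirrors the subset above, and all.equal() avoids exact floating-point comparison):

validate_aggregate_iris <- function(iris_out, iris_in) {
  # Mirror the intentional subset before computing the expectation
  iris_in <- iris_in[iris_in$Sepal.Length > 4.5, ]
  expected <- aggregate(Sepal.Length ~ Species, data = iris_in, mean)$Sepal.Length
  if (!isTRUE(all.equal(iris_out$sepal_length, expected))) {
    loggit("ERROR", "Aggregated sepal_length means differ from expectation")
  }
}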

In many ways, this feels like unit testing your data quality. It’s also infinitely flexible: you can run validations in loops to prevent code repetition, you can use other packages like validate to generate more validation output and log each result with loggit, and more. You can write as many of these validation functions as you think necessary – I had a project with nearly 50 once!
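For example, a loop over the numeric columns (a hypothetical sketch; the named vector maps output column names to their input counterparts) logs one entry per failing column:

cols <- c(sepal_length = "Sepal.Length", sepal_width = "Sepal.Width",
          petal_length = "Petal.Length", petal_width = "Petal.Width")
for (out_col in names(cols)) {
  actual   <- mean(iris_agg[[out_col]])
  expected <- mean(iris[[cols[[out_col]]]])
  if (!isTRUE(all.equal(actual, expected))) {
    loggit("ERROR", sprintf("%s: actual (%.4f) != expected (%.4f)", out_col, actual, expected))
  }
}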

Keep in mind that loggit only provides the means to track your job logs; the implementation is entirely up to you – and that’s what makes it both unobtrusive and powerful!