Classification of Bugs: Chapter 1 - Null and Empty
10 April 2020
I am not the best programmer in the world. I wouldn’t know or care to know what it means to be that. Are you the fastest to implement a feature? Are you handy with multiple languages and frameworks? Is your code the most bug-free? Is it the most readable code? All of the above? I don’t know. People have written potential descriptions. Judge for yourself. I am deviating… deep breath
The point is - if there is one developer skill that gives me an adrenaline rush that makes me project myself as Sherlock Holmes in my head, it is debugging. I am a sucker for bugs. I will drop everything to help you if I find out you have a pest problem with your software. I am not always helpful, but it is an aspiration to be the best debugger there ever was. After gdb, of course.
This aspiration has led me to refine my debugging practice into a system. I am not the first person to write about this. However, people usually preach about tools and techniques to find bugs faster and fix them robustly. All that’s good, but let me be more specific about what gives me the high - guessing the source by looking at the symptom, without looking at the code. When I open the code and a debugging tool, it should be only to verify my hypothesis. My practice is to maintain a mental database of common programming styles. Guess the architecture, then guess the source of the bug based on common mistakes within that architecture. Maybe somebody has already made a super exhaustive list for this as well. The World Wide Web is quite wide. But anyway, this blog series is an attempt to write down my database, mostly for my own reference.
Null and Empty are not the same
Problem domain: Filtering data
The error: Treating nulls and empties as equals
The symptom: Data leaks in/out of your pipeline.
You know this. Everyone seems to know this, at least when you say it out loud. Yet I keep seeing this mistake. To reiterate the concept: in programming languages, at least the common ones, there is a difference between null and empty or 0. I won’t go into type theory because I don’t know it well, and it’s not required to understand this. A null variable does not have a value. A variable set to an empty string does have a value - one that happens to contain nothing. Maybe a numeric example is more natural to emphasise this. A variable holding the value 0 is not the same as a null variable.
In most languages, null has its own place in the type system. You cannot perform the same set of operations on it as on a real value; it is well distinguished. In Python, you cannot do '2' + None; you get a TypeError. Some languages are more lenient: in JavaScript, '2' + null gives you '2null'. There is also a huge debate around null references. To be clear, that debate is conceptually along the same lines but unrelated to the problem domain here.
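A minimal Python sketch of the distinction, using made-up variable names. It shows that null, the empty string, and 0 all look "falsy" to a truthiness check, even though only one of them is actually null:

```python
# Three different "nothings": no value, an empty value, and a zero value.
missing = None
empty = ""
zero = 0

# All three are falsy, so a bare truthiness check cannot tell them apart.
all_falsy = [not missing, not empty, not zero]

# An identity check against None singles out the real null.
is_null = [value is None for value in (missing, empty, zero)]

# None does not support string concatenation; the empty string does.
try:
    "2" + missing
    concat_error = None
except TypeError as exc:
    concat_error = type(exc).__name__

concat_empty = "2" + empty

print(all_falsy)     # [True, True, True]
print(is_null)       # [True, False, False]
print(concat_error)  # TypeError
print(concat_empty)  # 2
```

The truthiness shortcut is convenient, which is probably why filters written with it end up treating nulls and empties as equals.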
Data pipelines are potentially humongous these days. A developer working on one section of the pipe often pays little attention to the details of the sections upstream and downstream. This lack of context is not the problem. It’s a good thing. You should not have to understand everything about a codebase to work with it, and data pipelines are nothing special in that regard. The problem is a lack of discipline in dealing with null data cells.
Imagine a scenario. You are working with a dataset which has a column representing a person’s name. You want to extend the dataset with a column holding the person’s address. You are using a function getAddress(name) which returns the person’s address given the name. You try this out, and you find that getAddress throws if the name is null. You have no control over the data or the function. You look at the data, and you say, “Well, maybe we should just throw out the rows which have a null name.” It’s not that unreasonable. Those rows are either garbage or irrelevant or so few that you don’t care. Maybe the broad purpose of your software is voided if there is no valid name for a data row. You add a filter that drops the nulls. Everything is good.
A few epochs later, the software starts breaking. You are a different developer this time. You trace the problem to the getAddress(name) function. It turns out that when name is an empty string, getAddress(name) returns null. Later in the pipeline, a null address breaks a few things. You see the filter for rows with a null name. It’s right there. What do you do? You add an OR condition to the filter, also removing rows whose name is an empty string. What happens now? Well, the developer responsible for generating the data you are reading would fill in an empty string when the name was unknown. The null names were true errors, but the empty-string names were information. Inevitably, some rows leak, and someone has to come back to this code and fix it again.
What could have been different? Well, fix the getAddress(name) function or write a wrapper around it. Upstream data ingestion should have documentation that specifies what an empty name string means and, more importantly, the developer fixing the downstream pipeline should care to read it. That’s not the point. You’ll easily find yourself in a situation where you have no control over externalities, and you must deal with it in the downstream code. Even then, maybe there should be a null address filter. Maybe data filters should be reviewed more carefully so that possible leakages are identified before they occur. That’s not the point either.
This is not a lecture on programming practices or style. This is a lecture on diagnosis. You’ll look for filters when you find data leaking. But when there are a lot of filters and the leakage is mysterious, keep a keen watch for filter statements that are too aggressive with empty values when they should be restricted to nulls.
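The name and address story can be sketched in a few lines of Python. Everything here is hypothetical: get_address is a stand-in for the external getAddress(name), written only to mimic the behaviour described above (it throws on a null name and returns null for an empty one), and the rows are invented.

```python
from typing import Optional

# Invented lookup table, for illustration only.
ADDRESS_BOOK = {"Ada": "12 Analytical Way"}


def get_address(name: Optional[str]) -> Optional[str]:
    """Hypothetical stand-in for the external getAddress(name)."""
    if name is None:
        raise ValueError("name must not be null")
    if name == "":
        return None  # empty name in, null address out
    return ADDRESS_BOOK.get(name)


rows = [
    {"id": 1, "name": "Ada"},
    {"id": 2, "name": None},  # a true error upstream
    {"id": 3, "name": ""},    # upstream's way of saying "name unknown"
]

# First developer's filter: drop null names only.
null_only = [row for row in rows if row["name"] is not None]

# Second developer's filter: the truthiness check drops "" as well as None.
aggressive = [row for row in rows if row["name"]]

# The diagnosis: which rows does the aggressive filter lose that the
# null-only filter keeps?
kept_ids = {row["id"] for row in aggressive}
silently_dropped = [row["id"] for row in null_only if row["id"] not in kept_ids]

# With the null-only filter, the empty name survives and yields a null address.
addresses = [get_address(row["name"]) for row in null_only]

print([row["id"] for row in null_only])   # [1, 3]
print([row["id"] for row in aggressive])  # [1]
print(silently_dropped)                   # [3]
print(addresses)                          # ['12 Analytical Way', None]
```

When hunting a leak, diffing the output of the existing filter against a strictly null-only version, as silently_dropped does here, is a quick way to test the hypothesis before reading much else.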
The address and name story was only one example. Data that should be filtered out can start leaking “in” to your set as well. You were filtering out everything null, but the upstream designers started using empty or “impossible” values to denote erroneous measurements. For example, maybe it is a column that records the height of a human being. No human being would have a height equal to 0 cm, so they find it perfectly acceptable to put that number since it’s “understood”. But that’s for humans. The code is dumb. Unless there is explicit code that sets reasonable limits on human height, 0 is just another numeric value, and an invalid measurement is best represented as a null.
Gosh, this is sounding more and more like a rant. It’s not meant to be, I am sorry. Let me put the moral of the story very simply and wrap this up: stay on the lookout for code that filters on null/empty values when you have data leakages.
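For the height example, here is a rough sketch of how a sentinel 0 slips past a null-only filter, and how an explicit range check turns it back into a null. The sample data and the bounds of 30 cm and 280 cm are assumptions made up for illustration, not values from any real pipeline.

```python
from typing import List, Optional

# Invented sample: one real null and one "impossible" sentinel value.
heights_cm: List[Optional[float]] = [172.0, None, 0.0, 158.5]

# A null-only filter lets the sentinel 0.0 leak in.
null_filtered = [h for h in heights_cm if h is not None]
mean_with_leak = sum(null_filtered) / len(null_filtered)

# Assumed plausibility limits; pick whatever suits the actual data.
MIN_HEIGHT_CM = 30.0
MAX_HEIGHT_CM = 280.0


def normalise_height(height: Optional[float]) -> Optional[float]:
    """Map implausible heights to None so that null-only filters catch them."""
    if height is None or not (MIN_HEIGHT_CM <= height <= MAX_HEIGHT_CM):
        return None
    return height


cleaned = [normalise_height(h) for h in heights_cm]
valid = [h for h in cleaned if h is not None]
mean_clean = sum(valid) / len(valid)

print(null_filtered)  # [172.0, 0.0, 158.5]
print(cleaned)        # [172.0, None, None, 158.5]
print(mean_clean)     # 165.25
```

The symptom to watch for is an aggregate such as mean_with_leak drifting away from mean_clean for no obvious reason.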