> NULL indicates a missing value, and NaN is Not a Number.
That’s actually less true than it sounds. One of the primary functions of NaN is to be the result of 0/0, so there it means that there could be a value but we don’t know what it is because we didn’t take the limit properly. One of the primary functions of NULL is to say that a tuple satisfies a predicate except we don’t know what this one position is—it certainly is something out in the real world, we just don’t know what. These ideas are what motivate the comparison shenanigans both NaN and NULL are known for.
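For concreteness, a minimal NumPy sketch of the NaN half of that (NumPy is assumed here only because plain Python raises ZeroDivisionError on 0.0/0.0 instead of following IEEE 754):

```python
import numpy as np

# IEEE 754: 0/0 produces NaN ("could be a value, we just didn't take the limit")
with np.errstate(invalid="ignore"):
    x = np.divide(0.0, 0.0)

print(x)            # nan
print(x == x)       # False -- NaN is unordered, even against itself
print(x != x)       # True
print(np.isnan(x))  # True  -- the reliable way to test for it
```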
There’s certainly an argument to be made that the actual implementation of both of these ideas is half-baked in the respective standards, and that they are half-baked differently, so we shouldn’t confuse them. But I don’t think it’s fair to say that they are just completely unrelated. If anything, it’s Python’s None that doesn’t belong.
Related to this, someone below posted a link to Wes McKinney’s blog post discussing pandas’ limitations and how PyArrow works around them:
> 4. Doing missing data right
> All missing data in Arrow is represented as a packed bit array, separate from the rest of the data. This makes missing data handling simple and consistent across all data types. You can also do analytics on the null bits (AND-ing bitmaps, or counting set bits) using fast bit-wise built-in hardware operators and SIMD.
> The null count in an array is also explicitly stored in its metadata, so if data does not have nulls, we can choose faster code paths that skip null checking. With pandas, we cannot assume that arrays do not have null sentinel values and so most analytics has extra null checking which hurts performance. If you have no nulls, you don’t even need to allocate the bit array.
> Because missing data is not natively supported in NumPy, over time we have had to implement our own null-friendly versions of most key performance-critical algorithms. It would be better to have null-handling built into all algorithms and memory management from the ground up.
Note that the post is from 2017, and pandas now has (optional) support for PyArrow-backed dataframes, so some of the critiques presented there have since been addressed.
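A quick sketch of what the Arrow representation looks like from Python (assuming pyarrow is installed and a pandas version with the optional Arrow-backed dtypes, i.e. 2.x):

```python
import pyarrow as pa
import pandas as pd

# Arrow stores validity as a separate packed bitmap, one bit per value,
# and keeps the null count in the array's metadata.
arr = pa.array([1, None, 3, None, 5])
print(arr.null_count)          # 2 -- no scan over the data needed
validity, values = arr.buffers()
print(validity.to_pybytes())   # the packed validity bitmap itself

# pandas can now (optionally) use Arrow memory and its null handling
s = pd.Series([1, None, 3], dtype="int64[pyarrow]")
print(s.dtype)                 # int64[pyarrow]
print(s.isna().tolist())       # [False, True, False] -- no NaN sentinel involved
```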
Why don't we build a numeric system that just has these mathematical constants and definitions built in, in a portable and performant manner?
> One of the primary functions of NULL is to say that a tuple satisfies a predicate except we don’t know what this one position is—it certainly is something out in the real world, we just don’t know what.
I'm not sure I understand this. Can you explain it with an example?
What I mean is, the ivory-tower relational model is that each table (“relation”) represents a predicate (Boolean-valued function) accepting as many arguments as there are columns (“attributes”), and the rows (“tuples”) of the table are an exhaustive listing of those combinations of arguments for which the predicate yields true (“holds”). E.g. the relation Lived may contain a tuple (Edgar Codd, 1923-08-19, 2003-04-18) to represent the fact that Codd was born on 19 August 1923 and died on 18 April 2003.
One well-established approach[1] to NULLs is that they relax the above “closed-world assumption” of exhaustiveness to encompass values we don’t know. For example, the same relation could also contain the tuple (Leslie Lamport, 1941-02-07, NULL) to represent that Lamport was born on 7 February 1941 and is still living.
We could then try to have some sort of three-valued logic to propagate this notion of uncertainty: define NULL = x to be (neither true nor false but) NULL (“don’t know”), and then any Boolean operation involving NULL to also yield NULL; that’s more or less what SQL does. As far as I know, it’s possible to make this consistent, but it’ll always be weaker than necessary: a complete solution would instead assign a variable to each unknown value of this sort and recognize that those values are, at the very least, equal to themselves, but dealing with this is NP-complete, as it essentially amounts to implementing Prolog.
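A quick illustration of that three-valued propagation, using SQLite from Python purely as a convenient stand-in for standard SQL behaviour (fetched NULLs come back as None):

```python
import sqlite3

con = sqlite3.connect(":memory:")

queries = [
    "SELECT NULL = NULL",  # NULL: two unknowns are not known to be equal
    "SELECT NULL = 1",     # NULL: comparing with an unknown is unknown
    "SELECT NULL AND 0",   # 0:    false AND anything is false regardless
    "SELECT NULL OR 1",    # 1:    true OR anything is true regardless
    "SELECT NULL AND 1",   # NULL: here the unknown operand decides
]
for q in queries:
    print(q, "->", con.execute(q).fetchone()[0])
```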
In the broader DB theory context and not strictly Pandas, think of a `users` table with a `middle_name` column. There's a difference between `middle_name is null` (we don't know whether they have a middle name or what it is if they do) and `middle_name = ''` (we know that they don't have a middle name).
In that case, `select * from users where middle_name is null` gives us a set of users to prod for missing information next time we talk to them. `...where middle_name = ''` gives us a list of people without a middle name, should we care.
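To make that concrete, a small sqlite3 sketch (the `users` table and its rows are made up for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (name TEXT, middle_name TEXT)")
con.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [
        ("Alice", "Beth"),  # known middle name
        ("Bob", ""),        # known to have no middle name
        ("Carol", None),    # unknown -- stored as NULL
    ],
)

# People to prod for missing information next time we talk to them
print(con.execute(
    "SELECT name FROM users WHERE middle_name IS NULL").fetchall())  # [('Carol',)]

# People we know have no middle name
print(con.execute(
    "SELECT name FROM users WHERE middle_name = ''").fetchall())     # [('Bob',)]
```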
> If a string-based field has null=True, that means it has two possible values for “no data”: NULL, and the empty string. In most cases, it’s redundant to have two possible values for “no data;” the Django convention is to use the empty string, not NULL.
I've watched numerous Django devs discover that other DBs and ORMs do not remotely consider NULL and the empty string to be the same thing.