> NULL indicates a missing value, and NaN is Not a Number.
That’s actually less true than it sounds. One of the primary functions of NaN is to be the result of 0/0, so there it means that there could be a value but we don’t know what it is because we didn’t take the limit properly. One of the primary functions of NULL is to say that a tuple satisfies a predicate except we don’t know what this one position is—it certainly is something out in the real world, we just don’t know what. These ideas are what motivate the comparison shenanigans both NaN and NULL are known for.
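For concreteness, a minimal NumPy sketch of the NaN half of that (NumPy is assumed here only because plain Python raises ZeroDivisionError on 0.0/0.0 instead of following IEEE 754):

```python
import numpy as np

# IEEE 754: 0/0 produces NaN ("could be a value, we just didn't take the limit")
with np.errstate(invalid="ignore"):
    x = np.divide(0.0, 0.0)

print(x)            # nan
print(x == x)       # False -- NaN is unordered, even against itself
print(x != x)       # True
print(np.isnan(x))  # True  -- the reliable way to test for it
```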
There’s certainly an argument to be made that the actual implementation of both of these ideas is half-baked in the respective standards, and that they are half-baked differently, so we shouldn’t confuse them. But I don’t think it’s fair to say that they are just completely unrelated. If anything, it’s Python’s None that doesn’t belong.
Related to this, someone below posted a link to Wes McKinney’s blog post discussing pandas’ limitations and how PyArrow works around them:
> 4. Doing missing data right
> All missing data in Arrow is represented as a packed bit array, separate from the rest of the data. This makes missing data handling simple and consistent across all data types. You can also do analytics on the null bits (AND-ing bitmaps, or counting set bits) using fast bit-wise built-in hardware operators and SIMD.
> The null count in an array is also explicitly stored in its metadata, so if data does not have nulls, we can choose faster code paths that skip null checking. With pandas, we cannot assume that arrays do not have null sentinel values and so most analytics has extra null checking which hurts performance. If you have no nulls, you don’t even need to allocate the bit array.
> Because missing data is not natively supported in NumPy, over time we have had to implement our own null-friendly versions of most key performance-critical algorithms. It would be better to have null-handling built into all algorithms and memory management from the ground up.
Note that the post is from 2017, and pandas now has (optional) support for PyArrow-backed dataframes, so some of the critiques presented there have since been addressed.
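A quick sketch of what the Arrow representation looks like from Python (assuming pyarrow is installed and a pandas version with the optional Arrow-backed dtypes, i.e. 2.x):

```python
import pyarrow as pa
import pandas as pd

# Arrow stores validity as a separate packed bitmap, one bit per value,
# and keeps the null count in the array's metadata.
arr = pa.array([1, None, 3, None, 5])
print(arr.null_count)          # 2 -- no scan over the data needed
validity, values = arr.buffers()
print(validity.to_pybytes())   # the packed validity bitmap itself

# pandas can now (optionally) use Arrow memory and its null handling
s = pd.Series([1, None, 3], dtype="int64[pyarrow]")
print(s.dtype)                 # int64[pyarrow]
print(s.isna().tolist())       # [False, True, False] -- no NaN sentinel involved
```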
Why don't we build a numeric system that just has these mathematical constants and definitions built in, in a portable and performant manner?
> One of the primary functions of NULL is to say that a tuple satisfies a predicate except we don’t know what this one position is—it certainly is something out in the real world, we just don’t know what.
I'm not sure I understand this. Can you explain it with an example?
What I mean is, the ivory-tower relational model is that each table (“relation”) represents a predicate (Boolean-valued function) accepting as many arguments as there are columns (“attributes”), and the rows (“tuples”) of the table are an exhaustive listing of those combinations of arguments for which the predicate yields true (“holds”). E.g. the relation Lived may contain a tuple (Edgar Codd, 1923-08-19, 2003-04-18) to represent the fact that Codd was born on 19 August 1923 and died on 18 April 2003.
One well-established approach[1] to NULLs is that they relax the above “closed-world assumption” of exhaustiveness to encompass values we don’t know. For example, the same relation could also contain the tuple (Leslie Lamport, 1941-02-07, NULL) to represent that Lamport was born on 7 February 1941 and is still living.
We could then try to have some sort of three-valued logic to propagate this notion of uncertainty: define NULL = x to be (neither true nor false but) NULL (“don’t know”), and then any Boolean operation involving NULL to also yield NULL; that’s more or less what SQL does. As far as I know, it’s possible to make this consistent, but it’ll always be weaker than necessary: a complete solution would instead assign a variable to each unknown value of this sort and recognize that those values are, at the very least, equal to themselves, but dealing with this is NP-complete, as it essentially amounts to implementing Prolog.
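A quick illustration of that three-valued propagation, using SQLite from Python purely as a convenient stand-in for standard SQL behaviour (fetched NULLs come back as None):

```python
import sqlite3

con = sqlite3.connect(":memory:")

queries = [
    "SELECT NULL = NULL",  # NULL: two unknowns are not known to be equal
    "SELECT NULL = 1",     # NULL: comparing with an unknown is unknown
    "SELECT NULL AND 0",   # 0:    false AND anything is false regardless
    "SELECT NULL OR 1",    # 1:    true OR anything is true regardless
    "SELECT NULL AND 1",   # NULL: here the unknown operand decides
]
for q in queries:
    print(q, "->", con.execute(q).fetchone()[0])
```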
In the broader DB theory context and not strictly Pandas, think of a `users` table with a `middle_name` column. There's a difference between `middle_name is null` (we don't know whether they have a middle name or what it is if they do) and `middle_name = ''` (we know that they don't have a middle name).
In that case, `select * from users where middle_name is null` gives us a set of users to prod for missing information next time we talk to them. `...where middle_name = ''` gives us a list of people without a middle name, should we care.
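To make that concrete, a small sqlite3 sketch (the `users` table and its rows are made up for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (name TEXT, middle_name TEXT)")
con.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [
        ("Alice", "Beth"),  # known middle name
        ("Bob", ""),        # known to have no middle name
        ("Carol", None),    # unknown -- stored as NULL
    ],
)

# People to prod for missing information next time we talk to them
print(con.execute(
    "SELECT name FROM users WHERE middle_name IS NULL").fetchall())  # [('Carol',)]

# People we know have no middle name
print(con.execute(
    "SELECT name FROM users WHERE middle_name = ''").fetchall())     # [('Bob',)]
```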
> If a string-based field has null=True, that means it has two possible values for “no data”: NULL, and the empty string. In most cases, it’s redundant to have two possible values for “no data;” the Django convention is to use the empty string, not NULL.
I've watched numerous Django devs discover that other DBs and ORMs do not remotely consider NULL and the empty string to be the same thing.