Half-open intervals are why I try as much as possible to stay away from languages that use 1-based indexing (Lua, Julia, Matlab, R, ...) - 1-based indexing lends itself to closed intervals because an array of N elements has [1,N] as its index range, whereas 0-based indexing lends itself to half-open intervals because an array of N elements has [0,N) as its index range.
-------
However, I know of one case where closed intervals really shine. Consider displaying a zoomable map in tiles. On a given zoom level, each tile has some coordinates (x;y) where x and y are integers denoting the virtual column and row. Suppose that we allow zooming out by a factor of 2, so that two-by-two tiles are aggregated into a single tile. Then a natural choice for the coordinates of the zoomed-out tile is (floor(x/2);floor(y/2)), that is, divide by two and round down. Suppose that a dataset has data on tile coordinates [x1,x2]×[y1,y2], meaning that there's only data on tiles (x;y) where x1≤x≤x2 and y1≤y≤y2. These are closed intervals, but stay with me - the reason they are nice in this case is because of how you compute the range of valid tile coordinates when you zoom out: the range becomes [floor(x1/2),floor(x2/2)]×[floor(y1/2),floor(y2/2)] - that is, you simply divide the range endpoints by two and round down. If you try to do this with half-open intervals, then you need some +1/-1 shenanigans, which are exactly what I normally try to avoid by going for half-open intervals.
I see why you mention 1-based indexing, but it’s mostly a different beast. I’ll defend 1-based indexing here because, having used Lua a lot, the indexing issue just goes away.
People like what they’re used to. Most coders are used to 0-based, but we all start life 1-indexed. We begin counts with 1: chapter 1, the 1st floor of a building (in the US), 1 AD, the 1st of Dec, etc.
In C/C++, 0-based makes sense because we often care about pointers and pointer arithmetic. But in languages without pointers, that reason goes away. And you can still use half-open intervals with 1-based indexing.
It's a matter of (natural language) terminology. It's the difference between asking "which year of your life are you currently in" as opposed to "how many full years have elapsed since you were born?"
Agreed. My previous reply just pointed out that 1 is not "how many full years have elapsed since you were born?". The answer to this question is still 0.
Can you explain what you mean by "having used Lua a lot, the indexing issue just goes away"? Most modern languages have facilities (iterator combinators) that make indexing less common than e.g. in C, but in my experience, you can never completely escape indexes, and for the indexes that I do have to deal with, I really want half-open intervals and 0-based indexing.
And this makes things really confusing. 1999 is in the 20th century even though the year starts with "19". 2000 is still in the 20th century; the 21st century starts in 2001. "BC" dates feel like negative numbers but they are not. Using closed intervals, [10 BC, 10 AD] spans 20 years, but [10 AD, 30 AD] spans 21 years.
I don't see how your example works. Consider how to implement zoom-out generically.
We can do the 1-D case and lose nothing compared to the 2-D case.
Take a sequence of 2n tiles labelled [k+1 k+2 ... k+2n] that we wish to map onto a sequence of n tiles labelled [l+1 ... l+n]:
map e = (e - (k+1)) `intdiv` 2 + (l+1)
where intdiv is just the usual truncating division
Mapping from a 1-based array to a 1-based array instantiates k = l = 0, so
map e = (e - 1)/2 + 1
Clearly the simplest map comes from instantiating k = l = -1, which is 0-based:
map e = e/2
Now consider the half-open case. It must work the same way, because the 2 tiles beyond the last one map to the 1 tile beyond the target sequence; it couldn't work any other way.
The win comes from using a 0-based array, not from the choice between closed and half-open.
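A quick Python check of the instantiations above (a sketch; Python's // is floor division, which agrees with truncating division for the non-negative labels used here):

    def tile_map(e, k, l):
        # generic map from labels k+1 .. k+2n onto labels l+1 .. l+n
        return (e - (k + 1)) // 2 + (l + 1)

    # 1-based (k = l = 0): map e = (e - 1)/2 + 1
    assert [tile_map(e, 0, 0) for e in range(1, 7)] == [1, 1, 2, 2, 3, 3]
    # 0-based (k = l = -1): collapses to map e = e/2
    assert [tile_map(e, -1, -1) for e in range(6)] == [e // 2 for e in range(6)]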
I don’t think your mapping example works. You’re manipulating a range in a way that is inherently lossy and incorrect, and have chosen a method that is trivially not reversible—try to go the other direction, which you will want to do in projecting from tile space to the original coordinate space, and your closed range introduces error. So using a closed range only helped in one direction, and gave you a misleading sense that your range was still equivalent. Consider also what happens with the next tile along: you have overlap between adjacent tiles unless your tiles are aligned to a multiple of two; and this demonstrates why it’s such a risky proposition, because you’re baking in unstated assumptions which will have a habit of coming back to bite you later when you want to relax them.
My observation with this kind of data is that people work in a theoretically-continuous coordinate space (though it’s probably represented in floating-point numbers so that it’s strictly discrete, just at very high precision), and, when forced to deal with coarser resolutions like tile pixel grids, immediately project back so they can work in the preferred coordinate space. Even if your coordinate spaces are all square, you will have margins of error however you do this, due to quantisation error—and does the pixel at (0, 0) show the data from (−0.5, −0.5) to (0.5, 0.5), or from (0, 0) to (1, 1)?
"does the pixel at (0, 0) show the data from (−0.5, −0.5) to (0.5, 0.5), or from (0, 0) to (1, 1)?" - that's an interesting question, but it is independent of tile grids.
You're right that the "ideal datasets" of this world aren't discrete and pixelised, however, real-world pixel-based datasets exist, and real-world pixel-based screens exist, and you want to be able to explore large pixel-based datasets on pixel-based screens, which can be done in a really nice and crisp way if you just put one data pixel on each screen pixel. Suppose that these datasets live in some common cartesian coordinate system but have different extents. The issue of storing the tile ranges that cover each dataset is a real proposition - it's not "lossy and incorrect" once you've accepted that pixels are what you have to deal with as input.
If I want a system where zooming out aggregates 2x2 tiles to a single tile, then I have to decide what happens if there's an odd number of tile rows on a zoom level: Should the number of tile rows on the coarser zoom level be half rounded down, or half rounded up? It seems natural to take the half rounded up, and then fill out the missing tile row with blank pixels. What are the tile row numbers on the coarser zoom level? Suppose we start with tiles on rows 2,3,4,5,6 - that's [2,6] as a closed interval or [2,7) as a half-open interval. On the next zoom level, the tiles are aggregated so that rows 2 and 3 end up on row 1, rows 4 and 5 end up on row 2, and row 6 ends up on row 3. This means we have tiles on rows [1,3] or [1,4). In general, if you have tiles on rows [a,b], then the next zoom level will have tiles on rows [floor(a/2), floor(b/2)]. Expressed with half-open intervals instead, if you have tiles on rows [a,b), then the next zoom level will have tiles on rows [floor(a/2), floor((b+1)/2)). I don't like the +1 in the formula for half-open, so I'll take the closed interval formulation any day.
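For concreteness, here is a small Python sketch of the two formulations above (function names are mine; // is floor division):

    def zoom_out_closed(lo, hi):
        # closed interval [lo, hi]: just floor-divide both endpoints
        return lo // 2, hi // 2

    def zoom_out_half_open(lo, hi):
        # half-open interval [lo, hi): needs the +1 complained about above
        return lo // 2, (hi + 1) // 2

    # rows 2..6: [2, 6] closed, [2, 7) half-open
    assert zoom_out_closed(2, 6) == (1, 3)      # rows [1, 3]
    assert zoom_out_half_open(2, 7) == (1, 4)   # rows [1, 4)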
I wonder if a lot of Lua's 1-indexed pain would be avoided by encouraging for-each style loops. Unless you need a specific sub range within an array, for-each will get you pretty far.
That's a big thing with Julia actually. A lot of people first getting into it criticise it for its use of 1-based indexing (I'm not the biggest fan of it either tbh), but in reality, there are very few cases in which you actually directly interact with array indices at all. And even if you do, the idiomatic way to iterate over the indices of an array is to use the `eachindex` function, which has the benefit of working with multidimensional arrays, or arrays that start at an index other than 1.
Yeah, but it's hard to argue with the fact that pairs/ipairs in loops are 25% to 2.5x slower than using `for i = 1,#tbl do v = tbl[i] end`. Performance isn't everything, but it's not nothing either.
> Never, ever, ever use [closed, closed] intervals
I’m not really a fan of “never” or “always” rules when it comes to programming. I’ve found it’s usually better to have “make sure to justify deviations from” heuristics.
I usually use [closed..open) ranges (as they are called in Swift), but sometimes, an inclusive range is a lot more appropriate, for expressing an operation (for example, I may express a range as start...end, as opposed to start..<(end + 1)).
Advice is supposed to convey information simply and efficiently, and "asterisks" for every edge case just muddle it. Now my pet peeve has moved to folk wheedling about 100% accuracy in advice. ;-)
"Don't run a red light, unless you're on a bicycle and you know that the intersesction you're at is on a weighted sensor that you're not going to set off. In which case, check your local laws on how to get through the intersection. Also, this likely applies to motorcycles. But not all motorcycles, so check your motorcycle's weight and cross-reference it to the weight sensors in your intersection (you'll need to contact your municipal engineering corps to get this). Also, one should look up for falling pianos at this point as well, just in case, because I can't say the word 'never'."
This is the sort of thing that's great for helping engineers go from junior to mid-level and then dangerous for going from mid to senior. Thinking clearly about exceptions to "rules" is one of the most important skills you can impart.
Don't Repeat Yourself is great advice... but only if you start thinking about all the caveats where over-application leads to architectural disasters.
The GP is largely correct here: If your advice includes "never" or "always", you're probably doing more long-term harm than good.
That's why I have “make sure to justify deviations from” in there.
It's easy enough to make that policy, but you run the risk of making the development process too unwieldy to implement (See: Taligent's Style Guide).
I would hope that most style guides have caveats for deviation (disclaimer: I haven't read a style guide in eons, so I don't know what they generally say, these days).
I'd say that "We should do it this way, but, if you think your way is better, convince me." is a good approach.
Really, I see a lot of problems, caused by corporations' obsession with hiring lots of bad programmers, and trying to force them to be good programmers, by setting up strict boundaries, as opposed to just hiring a few good programmers, in the first place (which, admittedly, has its own issues). This is nothing new. I have seen this since the 1980s.
The really cool thing about software, is its flexibility. I feel that the enormous toolbox at my disposal, along with my experience in using said tools, makes it possible for me to develop some pretty good stuff. If someone tried to force me to do a "lowest common denominator" approach, I don't think the results would be good.
There is still a difference between leaving out special cases to give more succinct advice and going out of your way to repeat the words "never"/"ever" three times to imply the absolute absence of any exceptions.
If conveying information is the goal, then that is crappy advice because it conveys no information. The information you're not conveying is that "running red lights adds significant risk of injury and death for both yourself and others because you cannot easily see cross traffic with enough time to save yourself and them."
This is not only informative about stopping at red lights but also about when it might be ok to go through them without enumerating every grain of sand on the beach.
> Good advice comes with a rationale so you can tell when it becomes bad advice. If you don't understand why something should be done, then you've fallen into the trap of cargo cult programming, and you'll keep doing it even when it's no longer necessary or even becomes deleterious.
"We should be careful to get out of an experience only the wisdom that is in it -and stop there; lest we be like the cat that sits down on a hot stove-lid. She will never sit down on a hot stove-lid again -and that is well; but she will never sit down on a cold one anymore."
Agreed: in programming you can always find a counterargument to any "never ever use X". I've noticed that people often treat such phrases dogmatically and then apply them in another context, where they might not be true.
I prefer phrasing it like "in 90% of cases it is better to use X over Y", to leave room for other options. But that's not a catchy phrase if you want to write a blog post.
Nevertheless, I really like the points the author gave - things I hadn't thought about before.
I'll repeat the sentiment that "always" and "never" rules usually have their fair share of exceptions.
And while I agree that [closed, open) intervals are often the best choice... sometimes what you want is [closed, closed], or (open, open), and it's nice to use a language that makes that easy.
For example, Raku makes it easy to do:
[closed, closed] with ($a .. $b)
[closed, open) with ($a ..^ $b)
(open, open) with ($a ^..^ $b)
(open, closed] with ($a ^.. $b)
You should keep your code consistent across your organization, so that a large number of programmers knows how your code works. You should have a "default writing style", and the "default writing style" should be used unless you have very, very, very good reasons to avoid it. (And an errant +1 or -1 here and there isn't a good enough reason to switch).
There are four styles of intervals. Let's say you want to represent a loop of 5 iterations numbered [555, 556, 557, 558, 559]. You've got:
* [closed, closed] -- [555, 559]
* [closed, open> -- [555, 560>
* <open, closed] -- <554, 559]
* <open, open> -- <554, 560>
There's not much difference to any of these four. As long as you pick a singular choice, get comfortable with its quirks, and make it consistent across your organization, you get benefits.
The main reason we do [closed, open> is because Dijkstra (father of structured programming back in the 1960s) argued for [closed, open> when presented with all these options (and argued for zero-indexing as well).
The "[closed, closed]" set is one-too-small (559 - 555 == 4), so you need to add +1 to the representation.
The <open, open> set is one-too-large: (560 - 554) == 6. So this too seems prone to off by one error.
[closed, open> and <open, closed] are both the correct size of 5 when you subtract, but both "include" a number that doesn't exist. In [closed, open>, the latter number isn't part of the array (560 is "one past the end"), while in <open, closed], the first number isn't part of the array.
Make that what you will of it. [closed, open> became a programmer convention because of these reasons. The important bit is to know all the quirks / off by one errors associated with this representation.
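A Python sketch of the size arithmetic behind these quirks (variable names are mine):

    lo, hi = 555, 559                      # the five iterations 555..559

    assert hi - lo + 1 == 5                # [closed, closed]: needs the +1
    assert (hi + 1) - lo == 5              # [closed, open>: plain subtraction
    assert hi - (lo - 1) == 5              # <open, closed]: plain subtraction
    assert (hi + 1) - (lo - 1) - 1 == 5    # <open, open>: needs the -1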
Dijkstra is a smooth talker and sometimes he's able to convince people that an argument is settled even when it is not... Here, he conveniently ignores examples that are better for [closed,closed], such as iterating backwards over the range. That is more annoying for [closed,open) because it turns into the abhorrent (open,closed].
An argument I prefer is this: [closed,closed] is better for representing an index into the range, while 0-based with [closed,open) is better for offsets. They are not the same, as C would have you believe, and there are situations where one behaves nicer than the other.
> That is more annoying for [closed,open) because it turns into the abhorent (open,closed].
What's wrong with (open, closed] ?? Especially if you're going backwards?
for(int i=559; i>554; i--){
//do stuff with i
}
Perhaps the most important reason to do zero-indexed + [closed,open> and <open, closed] is because these are the defaults the C / C++ / Java for loops assume you're doing.
It's all non-intuitive because humans are really bad with off-by-one errors.
------
EDIT: I guess familiarity with [closed, closed] is somewhat important, because other programmers like the form. The pattern here is: for(int i=start_closed; i<=end_closed; i++), if anyone's curious.
The problem with that example is that you had to ±1 both of the ends of the original range. That is, you started with [555,560) and had to rewrite it to (554,559] for the reverse loop.
I remember at the time when C++ ranges were getting standardized there was a talk about an alternative replacement for iterators: splitting them into things that point at elements and between them. That solved the reversal / backward iteration problem rather neatly, although IIRC there wasn’t always a clear choice of which kind of thing should now be returned in place of a single iterator.
(Of course, that doesn’t help at all if you’re using raw indices, but then again, people used to describe screen coordinates as pointing between “little square” pixels rather than at sample locations back in the days of the original Macintosh and 16-bit Windows.)
> You should keep your code consistent across your organization, so that a large number of programmers knows how your code works.
I would argue that for this reason it's not just important to have a consistent style across your organization, but also to use the convention that your slice of the industry has settled on, which for most companies is [closed, open). If there's no particularly good reason to use one over another, why add extra overhead to the onboarding process?
Here's one quirk I once encountered in the real world: a route represented by a sequence of distance, elapsed-time tuples. You can calculate the average speed to each waypoint - except for the zero-time waypoint, if it is included in the list.
I guess we normally think of intervals starting at some point and extending until the next interval starts, but that is not necessarily the most useful or 'natural' way of representing the situation.
> I guess we normally think of intervals starting at some point and extending until the next interval starts
Given how bad people are with the fencepost problem in practice... (Each pair of adjacent fenceposts covers 1 meter of distance. How many meters do 100 fenceposts cover?), I'd say humans normally don't think of intervals very much.
Instead, we programmers have to think of intervals way more than other humans. So we need to develop systems and shortcuts for our brain to handle these cases quickly, concisely, and then move onto the next problem immediately.
-------
It doesn't matter what style you use, as long as you're fast and precise with that style (and as long as your coworkers are also fast and precise with the style).
Half-closed is generally good because it tends to reduce off-by-one errors.
But this article overstates the case. Especially for floating-point, where the distinction between a < b and a <= b is not straightforward. Some times you’re modeling closed intervals, and in that case using closed intervals is the right decision.
The “splitting by time” section should probably just be removed, since it confuses the point and doesn’t add anything. The scenario doesn’t really make sense (if you wanted this you’d store the registration time and get the hour you’d be better off by truncating the value, not using an interval). Also, if you’re going to be doing math on time values, you better know the precision of the values you’re working with (among many other things). Intervals, however closed or open aren’t going to help you there.
Maybe I shouldn’t criticize so much, since I agree with the general point. But this makes the case awkwardly.
I am more surprised it is about numbers. I found it more important to take care when dealing with date ranges in databases. And SQL "BETWEEN" lures you into using [from, to], which is a great way to shoot yourself in the foot. So just my 2 cents:
- what are you going to do when you need to split [a,b] into multiple parts, for performance reasons? [a,c], (c,b] ? Can your codebase handle different interval types? Having [a,b) everywhere is simpler
- stuff like 23:59:59.999999999 is a code smell - well, until the database rounds it up to the next day, then it turns into a bug
A system I work on chose to use [closed, closed] intervals for date time intervals, where multiple intervals should be adjacent, but not overlapping. So the intervals look like 2022-11-22T00:00:00.000Z - 2022-11-22T23:59:59.999Z. Of course the codebase is sprinkled with +/- 1ms, and off-by-one issues causing date times to match _no_ interval. [closed, open) would've been so much easier to work with and reason about.
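A minimal Python illustration of that failure mode (dates chosen to match the example above):

    from datetime import datetime, timedelta

    day = datetime(2022, 11, 22)
    closed_end = day + timedelta(days=1) - timedelta(milliseconds=1)  # 23:59:59.999

    # a timestamp finer than the chosen 1ms precision falls into the gap:
    t = closed_end + timedelta(microseconds=500)   # 23:59:59.999500
    assert not (day <= t <= closed_end)            # matches no closed interval

    # half-open: the end is simply the next interval's start, no gap possible
    half_open_end = day + timedelta(days=1)
    assert day <= t < half_open_end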
Let's say you want to find accommodation from day x to y, would you rather have [x, y + 1) as the interval in the code?
In other words, I think the blog post lists problems the author has had and in those cases an open-closed interval would have worked better. But one could come up with an opposite case, would that be equally true?
IIRC (bit rusty now) python makes you do exactly [x, y + 1) in code and it pisses me off which is why I did my own trivial end-is-included range for loops.
Otherwise it would result in surprises like range(10) giving 11 values, violating convention. I admit I usually have to pause to remember whether it's inclusive of the upper bound or not.
Why would you want a half-open interval when booking an AirBnB or flight? If I search for flights from February 24 to February 24 I don’t expect the empty interval.
AirBnB bookings are actually half-open, because you book a number of nights - so if you book December 1 - 3, you book the nights of December 1 and 2 - excluding 3. This seems expected to me.
However, as an AirBnB host, this bit me once because if you set that you're available on December 1, 2 and 3, guests can actually book December 1 - 4. But this is more because availability uses individual dates (almost like closed ranges) instead of half-open ranges like bookings do.
For booking flights, though, it's more like you're booking two individual flights on specific dates, it's not really a range at all, and so it doesn't really matter whether you think of it as open or closed. The flight search engine might want to check that the second flight departs after the first flight lands, but that has more to do with the flight times than the dates you enter.
Ignoring user experience, I would argue that this is actually an implicit half-open interval. You aren't really saying "give me flights from feb 24 to feb 24", you're saying "give me flights ON feb 24", which can be rephrased as "give me flights that take off feb 24th at midnight up to but not including february 25th at midnight".
Flights are just a set of two dates, not a range, like the other comment mentions.
I thought about it, and this applies to an AirBnB or, more generally, a vacation, too. A vacation from Nov 25 to Nov 27 doesn't mean you stay from Nov 25 0:00 to Nov 27 23:59:59; it means "arrive sometime Nov 25, leave sometime Nov 27" - a set of two dates.
Quite a few APIs use a pair of `{start, length}` instead, which in the context of the post's example, is even clearer. Empty interval would be `length == 0`, time interval would be a single array of `starts`, etc. Fewer subtractions (to get length) usually end up nicer too.
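A sketch of the equivalence, assuming integer offsets (function names are mine):

    def to_half_open(start, length):
        # {start, length} carries the same information as [start, start + length)
        return start, start + length

    def to_start_length(lo, hi):
        return lo, hi - lo

    assert to_half_open(*to_start_length(3, 7)) == (3, 7)
    assert to_half_open(3, 0) == (3, 3)   # the empty interval is just length == 0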
It's the default in Python, and it's more helpful than not, so I would say it's a good design decision. But "always" is a dangerous word in engineering, and in this case, definitely not warranted.
Case in point: last week I worked with a list of dates, and I needed the last date to bracket my sliding windows as a time period cleanly.
I think this advice could be better summed up into: to minimise off-by-one errors, choose a consistent strategy for describing intervals, and stick with it as much as is sensible.
If by "enumerable of discrete integers" you meant Enumerable.Range, it's not just that. Other examples include String.Substring, Array.Fill, Stream.Read, Span constructor etc.
If you meant it's not practical for non-integer types such as floats or dates, then of course you are right.
I'd say (bound, range) is an acceptable way of defining an interval; it can save space for multidimensional intervals, though at the cost of increased computation.
I have recently been on the other side of this argument for a specific case.
In our case, we (users of our API) are to specify date ranges representing a list of partitions. So we are not counting nights between dates, but rather specifying a set of daily or hourly buckets.
Here (maybe even only here) I argue that inclusive ranges feel more intuitive.
I find it much more intuitive to represent the 1st 7 days of January as
['2022-01-01', '2022-01-07']
compared to
['2022-01-01', '2022-01-08').
Another very common example is to specify the last 7 days (incl. 'today'), in which case I find the inclusive form ending at 'today' more natural than the half-open form ending at 'tomorrow'.
*half-open intervals. There have been times I needed an (open, closed] interval. But there have also been times when I wanted a fully closed interval because I was setting an arbitrary limit, and it's easier to tell a user "100 is the maximum" versus "it must be below 100" - so what value should be put in to max it out, 99? 99.99? etc.
To add some nuance I'd say that if you're dividing a larger interval into smaller subintervals then a half-open one is probably what you want.
I know Dijkstra's paper and it's short, good and should be read but this article is wrong in saying always. It feels like a newbie programmer came across a good thing then lost all proportion; use the right tool for the right job, as ever.
This depends on use case. If you are doing a frontend layer on top of closed-open, then the frontend will have to handle the points the article is rambling about.
Users are used to selecting a date range in closed-closed format, for example.
It depends on the use-case. Selecting an airbnb booking for example, you pay by the night, and the last day is check-out day, so it's sort of [bookingStart, bookingEnd).
Though from a user perspective, they are still there on the final day. Still, I think there's some nuance to this, as it's generally understood the date range they select isn't the number of nights they expect to pay for.
Wholeheartedly agree that sticking - where possible - to [closed, open) is a good idea. It has helped me tremendously when implementing, e.g., computational geometry algorithms. Robust triangle rasterization comes to mind.
Another interesting point: in the weird corner of the world where I grew up, half-open intervals were always denoted : [low_bound, hi_bound[
I am of course completely biased, but I've always found this notation much more elegant and intuitively obvious than the [low_bound, hi_bound) that seems to be the prevalent norm in the anglo world.
Using '[' after the upper bound clearly shows that we're open at the top whereas the ')' is fairly arbitrary.
And while I'm on the topic of weird culture-induced quasi-arbitrary biases: I had a math teacher that would bark (and I mean BARK!) at us if we ever used '>' in inequalities.
The justification was that with this constraint, all inequalities ended up written and laid out with its two members respecting the standard "left-to-right" drawing of the real line, which made it much easier to picture what was going on geometrically.
It also enforced consistency throughout a long demonstration - one less thing added to the cognitive load.
He was made fun of a lot by the student body, of course, but later in life, as a programmer, I have found myself sticking to the habit: I always force myself to mostly use '<', very rarely '<=', and almost never '>' or '>='.
I find this makes code much more readable, just like back in the days of my old teacher with math inequalities, and pretty much for the exact same reasons.
Of course, doing that does not help at all when reading other folks' code, those uncivilized heretical users of the 'greater than' form.
Python uses "closed open" intervals with `range(0, n)`, the reverse is then `range(n - 1, -1, -1)`, which is then highly unintuitive. This in connection with 0-based array indexing makes certain algorithms then very cumbersome. For example Knuth-Shuffle. In Python this is:
    from random import randrange

    x = [10, 20, 30, 40, 50]
    for i in range(len(x) - 1, 0, -1):
        r = randrange(i + 1)
        x[i], x[r] = x[r], x[i]
    print(x)
With 1-based indexing and inclusive ranges it would be much more understandable:
    a[] = [ 10 20 30 40 50 ]
    for i = len a[] downto 2
        r = random i
        swap a[i] a[r]
    end
    print a[]
In Python, reversed(range(0, n)) (which is also a thing you can write—does that help?) is indeed written range(n-1, -1, -1), but I think I recently encountered a language where that was written as (the local syntax’s equivalent of) range(n, 0, -1), that is to say the rule was not “start inclusive, end exclusive” but “low inclusive, high exclusive”. I’m not sure what language it was and whether this option ultimately ends up more intuitive, but in any case it’s a valid option.
(One difference is that the start-end rule can work polymorphically with equality only, while the low-high rule requires an ordering.)
Nope: unlike, say, sorted(), reversed() will never force an iterator into a sequence—it will either call __reversed__ or fall back to __len__ and __getitem__. An iterator needs to know how to reverse itself, or it’s out of luck. (Which is quite annoying because it forces you to write a class rather than a generator function.) Ranges are in the former category, they know how to reverse themselves[1]. (Yet for some reason range(0, 42) and range(41, -1, -1) repr themselves as range(0, 42) and range(41, -1, -1) respectively while reversed(range(0, 42)) is instead <range_iterator object at ...>. Truly, I cannot touch anything without finding a bug or at least a suspiciously insectoid lifeform.)
I know Knuth wrote it that way, but there's no good reason why the iteration should be downwards. That is, you could write:
for i in range(1, len(x))
and you would still get an unbiased shuffle algorithm.
The upwards iteration has a few advantages:
• You can start shuffling before you know how big the input is.
• Algorithm R for reservoir sampling can be seen as a specialized version of shuffling, in which you skip the work that wouldn't affect the first k items, or would only affect their order.
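For concreteness, the upward variant in Python (a sketch; still an unbiased Fisher-Yates shuffle):

    from random import randrange

    def shuffle_upwards(x):
        for i in range(1, len(x)):
            r = randrange(i + 1)     # r in [0, i], so x[i] may stay in place
            x[i], x[r] = x[r], x[i]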
But I've bookmarked it, in case I run into someone who thinks they disagree, in which case I can offload the explanation.
I had a similar incident with colleagues who had discovered the "Default" trait and started adding defaults to everything, including things that didn't have good defaults, and things where they didn't mean default but actually something quite specific such as "empty". The canonical "don't do that!" blog post didn't exist, so I had to create one.
So, if I want to book the night from Jan 13th to Jan 14th on AirBNB (which is the very first example given in the post), I should search for "Arrival Jan13th, Departure Jan 13th"?
As with any piece of advice, there are cases where it applies and cases where it does not apply. Or, in other words, someone can claim "Well, of course!" for the opposite advice.
I am entrusted with a database application for a tour operator. The actual use cases for start and end dates can be quite tricky.
For some of them arrival and departure date are most relevant, for others it is arrival and nights (or first and last date of the evening of the night). In the latter case, think of a calendar that shows the occupancy of a room for the manager. The only reasonable way to do this is to mark the days where the room is occupied in the evening and ignore the morning.
With rental cars it is even trickier, because there a 24-hour cycle from pickup time to drop-off time is the basis for counting the days of rental.
Some metrics are best made available to managers in two versions for convenience, e.g. trip price per person per day vs. trip price per person per night (where day = night + 1).
For the business logic (except for rental cars, which use a modified one) I implemented a special general purpose Duration class with the following properties:
DateOfArrival
DateOfDeparture
DateFirst /* same as DateOfArrival */
DateLast /* day before DateOfDeparture */
Nights
Days /* Nights plus 1 */
The class comes with some operations for combining Durations or calculate/check-for intersections, etc.
With this class, it is easy to switch between an open and a closed interval view of the same data, and it is very transparent in the source code which is actually used.
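A minimal Python sketch of such a class (my reconstruction from the properties listed above, not the actual implementation):

    from datetime import date, timedelta

    class Duration:
        def __init__(self, arrival: date, departure: date):
            self.date_of_arrival = arrival        # DateOfArrival / DateFirst
            self.date_of_departure = departure    # DateOfDeparture (half-open end)

        @property
        def date_last(self):                      # DateLast: day before departure
            return self.date_of_departure - timedelta(days=1)

        @property
        def nights(self):                         # Nights
            return (self.date_of_departure - self.date_of_arrival).days

        @property
        def days(self):                           # Days: Nights plus 1
            return self.nights + 1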
No, a [closed, open) interval of [Jan 13, Jan 13) would give you 0 nights.
A [closed, open) interval of calendar dates incidentally does correspond to the number of nights spent. So booking [Jan 13, Jan 14) would mean spending one night. The hotel gives you until, say, 11:00 to get out the next day without paying for that date.
Half-open intervals are neat because they concatenate without overlap.
So [Jan 13, Jan 14) + [Jan 14, Jan 15) = [Jan 13, Jan 15).
This is a very convenient property when dealing with dates in particular, but also indices over arrays.
So it boils down to what is being represented: an amount of items versus an amount of _boundaries_ across items.
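In code, the no-overlap concatenation is a one-line boundary check (a sketch; the year is mine):

    from datetime import date

    def concat(a, b):
        # half-open intervals [lo, hi) concatenate iff they share the boundary
        assert a[1] == b[0], "gap or overlap"
        return (a[0], b[1])

    jan13_14 = (date(2023, 1, 13), date(2023, 1, 14))
    jan14_15 = (date(2023, 1, 14), date(2023, 1, 15))
    assert concat(jan13_14, jan14_15) == (date(2023, 1, 13), date(2023, 1, 15))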
When I'm booking that stay between Jan 13th and Jan 14th, and I am presented with options, the one that says "sleeps 2" does not mean [1, 2) but rather [1, 2], and the one that says "sleeps 6" does not mean [1, 6) but rather [1, 6].
When looking for a booking you are using an interval.
When looking at room capacity you are looking for a minimum. This is not the same as an interval. If you are looking for "sleeps 6", that is your bare minimum acceptable capacity. You are not interested in the [1, 6] interval. Nothing below 6 makes sense.
Even if you wanted to force the interval use here, I'd say you'd be looking for the [6, INF) interval. That is, rooms with capacity of at least 6 (but having more capacity is fine).
I think this has to do with the nature of the metric underneath. Closed-open intervals are the way to go for integers. However, they don't seem to be a good fit for sampling from a continuous space.
When you are working on a continuous space, what exactly is the difference in a program whether you use a closed-closed or half-open interval?
The only case that comes to my mind where it makes a difference at all is when you take discrete values from the interval and happen to hit exactly the end of the interval. But then the difference comes from the "discrete" part again. As long as you work on the continuous space, everything you typically do (e.g. integral, average, ...) gives the same result no matter whether the end is part of the interval.
(You mentioned sampling, but I'd suggest that sampling by just taking values without lowpass filtering first is a bad idea anyway, and with a filter you are back to an integral which doesn't care about open/closed)
From the abstract: "Drawing a floating-point number uniformly at random from an interval [a, b) is usually performed by a location-scale transformation of some floating-point number drawn uniformly from [0, 1). Due to the weak properties of floating-point arithmetic, such a transformation cannot ensure respect of the bounds, uniformity or spatial equidistributivity."
If, however, a and b are dates, these operations are nonsensical, as is using anything else except closed intervals: for date range comparisons it is much more understandable to use start-of-day on a and end-of-day on b.
> If, however, a and b are dates, these operations are nonsensical
If you encode dates as number of days since some fixed date, they can be seen as numbers on a timeline, so the start date and the length (in days) of a "date interval" could be useful information. If you mean that subtracting two dates represented as strings of "YYYY-MM-DD" form is nonsensical... well, duh.
> for date range comparisons it is much more understandable to use start-of-day on a and end-of-day on b
I personally agree with this, but then again, "more understandable" is a subjective notion. I don't see how it makes open intervals for dates nonsensical in itself.
"You're arriving on 15th and going home 23rd" implies that you're there 23-15=8 full days (unless we count the arrival/departure half-days as full days, in which case the arithmetic itself breaks down, so it doesn't matter which intervals we use mathematically) - that's an open interval.
Encoding dates in data in any way other than native date datatypes or full ISO date/timestamp strings is incredibly dangerous and prone to tz and dst conversion issues. It still might fit some niche use, though. I was thinking more along the lines of some sort of Date entities.
Your example, as you point out, is flawed with respect to reality: no system would record arrival/departure dates and expect to calculate an accurate length of stay, except in full days, as in b-a-1 = 7 days, effectively making the example use a closed interval if it were implemented in reality.
Imagine a hotel reservation system using just dates. Arrival on the 15th and departure on the 23rd would mean a reservation from the 15th to the 22nd, while the 23rd would already be free for someone else to book.
This is only because many programming languages don’t have proper calendar-date data types. If they had, the values would just represent the dates, without an implied time-of-day, and adding/subtracting integers would simply add/subtract the respective number of days.
Don't call your integer bound vars `start` and `end`, please.
Either use `start` and `endExclusive`, or `start` and `length` - this greatly reduces confusion.
In my experience, half opened integer intervals lead to fewer `- 1` in the code.
I prefer `pastEnd`, perhaps in reaction to the original SGI STL, where the documentation called it "past end", but the actual names in the code were just `end`.
...where the documentation is very careful to say "one past the end of the range", and then they proceed to just call `v.end()`. I always thought that was silly.
Both forms are valid according to ISO 31-11.
Your form is often called the "French notation" and I assume used more commonly in France and countries with French influences.
I'm from the US and I was taught the way it is in the article title. Can't say I've ever seen that before.
It's an interesting notation, but I feel like if I encountered it out in the wild I might assume it was a typo if it was only on one side. (IE: If I saw `[100, 200[` I would think they meant `[100, 200]`)
Right, I think the hotel room use case helps a lot with the intuitions here. And it helps to see the COUNTEREXAMPLE:
* Hotel A [Jan 10, Jan 13] means 4 nights; same as [Jan 10, Jan 14).
* Hotel B [Jan 13, Jan 15] means 3 nights; same as [Jan 13, Jan 16).
These intervals overlap! That is, if I try to chain them together, I book Jan 13 at both Hotel A and Hotel B.
With half-open intervals this is really easy to spot: you cannot concatenate unless the open end and the closed begin are the same.
So you can concat [Jan 10, Jan 14) with [Jan 14, Jan 16); BUT you have an overlap if you see this [Jan 10, Jan 14) with [Jan 13, Jan 16) as in the previous examples.
With closed intervals this is hard to see. As in the example by @parekhnish, it seems that [Jan 10, Jan 13] and [Jan 13, Jan 15] are concatenable with no overlap, but the final operation ends up booking Jan 13 twice.
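The overlap test itself is also simplest in half-open form (a sketch; the year is mine):

    from datetime import date

    def overlaps(a, b):
        # half-open [lo, hi): overlap iff each interval starts before the other ends
        return a[0] < b[1] and b[0] < a[1]

    hotel_a = (date(2023, 1, 10), date(2023, 1, 14))   # [Jan 10, Jan 14)
    hotel_b = (date(2023, 1, 13), date(2023, 1, 16))   # [Jan 13, Jan 16)
    assert overlaps(hotel_a, hotel_b)                  # Jan 13 is booked twice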
An open interval doesn’t include its boundaries. It’s like a room missing one or more walls, and which hence is “open” towards the respective side(s).
Imagine a series of adjacent boxes with shared walls, that you cut up. Depending on how you cut them up, each wall will stay with one of the boxes, which thus will remain “closed” on that side, while the other box who shared the same wall is now open on the side where you separated the boxes.
    +———+———+
    |   |   |
    +———+———+
        |
        V
    +———+ ———+
    |   |    |
    +———+ ———+
    closed open
    [0,1]  (1,2]
In an (imprecise) way, the closed set is a water balloon and the open set doesn't have the rubber balloon wrapping the water - I would consider the water balloon to be 'closed' in this case.
As it goes with maxims like this, it depends on the problem domain. In probability theory and statistics, a cumulative distribution function is defined as F(x) := Pr(X <= x), not Pr(X < x).
Like others are saying, it is consistency within the code base that probably matters the most.
The notation here really bothered me. The author defines the [closed, open) interval [a, b) as the set of all numbers x that fulfill a ≤ x < b. Good so far, but when they talk about the empty interval [a, a) we get into problems, because a ≤ x < a is a nonsensical statement: x cannot be at least a and strictly less than a at the same time.
I think this is a problem when borrowing math concepts to programming. What the author is really talking about here is slicing, not intervals, and the slicing behavior is hopefully well defined on the construct you are working with, most of the time in a manner that makes sense to each construct, or in a consistent manner to other related constructs in the language.
If the author would stick with programming concepts, I don't think this is a rule we should abide by; rather, it's a guideline which can be employed. And I think most programmers value consistency, so this really isn't that much of an issue.
I was just watching YouTube series on arithmetic coding and I was wondering why the intervals were specified this way. Makes a lot more sense now. The "b - a" use case is quite prevalent there.
Yes, that's it. I've not seen a ligature used that way in text before. English has words that can be spelled with ligatures to this day, but not for the sound "st"
It's just a fancy typographic ligature, not a spelling “ligature” (which, in turn, is more of a letter with an optional decomposition as an anti-ligature) like ash (æ) or ethel (œ).
It’s the empty interval. It’s the set of numbers x for which x >= 2 and x < 2, which is none, and hence the empty set. It’s the equivalent of this for loop (for integers):
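In Python terms, presumably something like:

    for x in range(2, 2):   # [2, 2): the body never executes
        ...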
This is not universal advice. If it’s nonsensical for your app to have start-end events of zero length, nothing in the article applies, and using closed intervals does make things a bit cleaner.