ISO-8601, YYYY, yyyy, and why your year may be wrong (ericasadun.com)
222 points by ingve on Dec 26, 2018 | 123 comments



It probably would have been better to use entirely different letters (xxxx instead of YYYY) to reduce sources of human error.

Notwithstanding these small issues, ISO 8601 is a godsend, but even with this spec it's amazing how many times we just get it plain wrong when dealing with time. Time is hard! It gets even worse in binary formats, dealing with leap second tables, time zones, daylight saving time, different epochs, etc., which is why I developed the smalltime [1] format.

[1] https://github.com/kstenerud/smalltime


The Twitter bug in December 2014 was caused by using %G instead of %Y. So that won't help much. G comes before Y alphabetically, as does almost any other letter...


G comes before Y alphabetically

That's irrelevant. There's a comment I made on a discussion about this a while ago:

https://news.ycombinator.com/item?id=17059958

It's sad that most people seem to put the blame on everything else except the developer --- who was simply not exercising any common sense or thinking critically. There's a very good reason for how the format specifiers were assigned, and anyone who doesn't notice the pattern (and the surprising deviations from it) has no one to blame but themselves.


But what is relevant is that the manpage explanation is still "precise but misleading":

http://manpages.ubuntu.com/manpages/cosmic/man3/strftime.3.h...

"The ISO 8601 week-based year (see NOTES) with century as a decimal number. The 4-digit year corresponding to the ISO week number (see %V). "

It's still misleading in the sense of: "do we want an ISO-standardized year? Hell yeah! We've found what we want."

It induces all the wrong reflexes: is "ISO datetime" the name of the most standard date format, the one that can't be read wrongly? Yes. ( https://en.wikipedia.org/wiki/ISO_8601 ) Is it "ISO 8601"? Yes. Do we get a "four digit" "decimal" year? Yes. Hm... it's "week-based"? "Well, every year has weeks, so I guess it's all right." It has references to some other part of the documentation? "That means the details probably don't matter much here." Etc.

The proper documentation would be: "%G (short for "wronG in most use cases") returns the "week-number-based-otherwise-wronG-year" used only for the special number-of-weeks-based calendar representations, according to the ISO 8601 rules for such representations, and it only by accident sometimes looks like the common calendar year. Should not be used unless it's specifically needed to produce such a number-of-week-based calendar. For the details about these calendars see (the reference)."

In the latter case the "don't touch this unless you know you need exactly this" is explicit. It even gives a good mnemonic for remembering the "wrongness" of using it in most cases. Good documentation is really important, even if "traditional" *nix users feel some kind of satisfaction in having the most misleading or useless documentation, to the point that when somebody asks a specific question they point to the very man pages that don't answer that exact question (there are many such examples on the web). And I've personally seen exactly these same self-satisfied programmers behaving just like the users they subconsciously (or consciously) mock, that is, being equally clueless, once they are in front of some other, probably even slightly better, manual covering a topic they are not familiar with. Like setting up a darned printer.

Being user hostile is never a virtue, and never something that should be supported or explained away.


You are omitting the requirement of common sense, or should I say discouraging it, and that is a dangerous path to go down. Almost all of the letters are strongly mnemonic. Someone who does not even think about what %y or %Y would be, after reading %G and thinking it correct, is going to have trouble with a lot of other things too.


> Someone who does not even think about what %y or %Y would be, after reading %G and thinking it correct, is going to have trouble with a lot of other things too.

I'm surely not questioning that specific claim that you make now. However, my major claim is:

Everybody is stupid outside the area of their narrow specialty, and more than that, even those working in areas familiar to them will not always have ideal circumstances in which to use the manuals or the APIs. Therefore designing anything only for those who have infinite time and concentration to spend on your product is inherently wrong.

Specifically, I can imagine a person who under ideal conditions would spend the necessary days learning all the details of these formats and date use cases, but who in other circumstances has to produce a result fast or while distracted, while the other people in charge of confirming what that person produced also fail to recognize the error, to the point that the subtly wrong implementation is never properly tested. Which is what provably happens in practice. And I have also really seen people capable of learning and remembering a huge number of (to me) unnecessary switches for commands x, y, z, x1, y1, z1, etc., who, in some other still-not-too-stressful situation, were unable to manage to install the aforementioned printer on the same OS.

In the same sense, the writer of a manual who spends days learning about all the features he documents should not assume that his readers will spend the same amount of time, or will have the conditions necessary to figure out all the nuances that were "obvious" to the writer at that specific moment. In fact, the same forces that produce buggy code are the ones that produce poor documentation.

So we should not try to excuse either bad code or bad documentation, but instead cultivate empathy for the possibly less-than-ideal conditions of our users.

And I claim that documentation which was obviously produced without understanding what most of its users need to be able to read from it easily is bad documentation, and that it should be recognized as such.


> (see NOTES)

If you skip over this huge red flag, sure. "See notes", "see below" &c. are almost invariably a bad sign: they indicate that there's complexity here that's too long to explain in-line.

Reacting to that with "Hell yeah! We've found what we want." is beyond salvation.


No. What you put in “see notes” notes is the boring technical detail such as how week numbers are calculated and which document or paragraph this particular implementation holds authoritative over conflicting alternatives.

What you put in the man page is the clear statement that for example YYYY “week year” should not be used where yyyy “calendar year” would usually be more appropriate: point out the confusing alternatives and which one the unaware reader is most likely to need.


... overlooking the fact that you aren't sure what it means for a year to be week-based in the first place, since years aren't a part of a week and weeks don't fit evenly into years.

Literally the next sentence suggests that %Y is broadly similar, and that %G has exceptional cases:

> The ISO 8601 week-based year (see NOTES) with century as a decimal number. The 4-digit year corresponding to the ISO week number (see %V). This has the same format and value as %Y, except that if the ISO week number belongs to the previous or next year, that year is used instead. (TZ) (Calculated from tm_year, tm_yday, and tm_wday.)

The %Y text, by the way, is much shorter:

> The year as a decimal number including the century. (Calculated from tm_year)
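
To see the difference concretely, here's a quick Python sketch (Python passes %G/%V/%u through to the platform's strftime, so this assumes a libc that supports them, e.g. glibc):

    from datetime import date

    d = date(2018, 12, 31)            # a Monday that ISO 8601 assigns to week 1 of 2019
    print(d.strftime("%Y-%m-%d"))     # 2018-12-31  (calendar year)
    print(d.strftime("%G-%m-%d"))     # 2019-12-31  (week-based year: nonsense in this layout)
    print(d.strftime("%G-W%V-%u"))    # 2019-W01-1  (the representation %G actually exists for)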

I find it hard to fault the manual here; the only improvement I can see making is specifying that %Y is the calendar year, and perhaps noting that it is more appropriate when months are used. Of course, some people (e.g. GGP) feel that in this situation some condescension is deserved, and it may be true that the few users who would be misled when they really do want to use a week-based year are more than outweighed by those who would then correctly choose %Y instead.

The objection to this isn't technical, because the perceived slight isn't technical. But there is a sentiment, more popular in some circles than others, that the user should be respected as a responsible adult who knows what they're doing, that the programmer using the interface is a peer, and that if they aren't, or are operating while impaired, it behooves them to realize this and either sober up or pass the buck. This isn't necessarily true. Sometimes, e.g. if the interface is exposed to end-users in a field for preferred date format on an invoice or something, it's totally inappropriate. But it is a somewhat traditional assumption, especially for ~system-level interfaces like strftime.


There’s also the issue of week-of-year calendars being useful to only a very small niche, so it’s basically the technical jargon of one speciality imposed on the rest of the world, the same way lawyers debate commas and treat "and" and "or" completely differently than sane people do.

Familiarity with strftime is part of gaining experience in the field, and will usually involve using YYYY until it breaks and then finding articles like this one written over the last two decades explaining that you have tripped over a loose paver that everyone else in the industry already knows about (but at least now you understand why there are so many broken wrists).

There’s a similar problem in orbital mechanics where people try doing a calculation using Kepler’s equations, but because they’re using degrees they get the wrong result. Of course the equation is entirely in radians, which they quickly learn about but not before suggesting that the languages or APIs they are using need new features to help other people avoid falling into that trap.


You are not wrong. But everyone is stupid some of the time, and documentation should take that into account.


People complain a lot about PHP but it's not always horrible. Excerpt from the documentation:

o - ISO-8601 week-numbering year. This has the same value as Y, except that if the ISO week number (W) belongs to the previous or next year, that year is used instead. (added in PHP 5.1.0)

(https://secure.php.net/manual/en/function.date.php)

We had this problem a long time ago, back when ASP was used. We really needed this year-week numbering but it didn't work: it was hard-coded to 52 weeks per year and the year was the true calendar year. Someone had to manually fix it every few years when it didn't add up.


In Java `x` is reserved for time zone offset: https://axibase.com/docs/atsd/shared/time-pattern.html#patte...


I thought you’re supposed to use Z for that? What’s the difference then?


There's an example column at that link; it has to do with the exact formatting of the offset.


Z is just shorthand for offset +0.


It would be nice if, at the beginning of the post or somewhere around it, the author had said it was about Swift, or that the blog is focused on it.

I understand other languages may have similar problems, but they will certainly have very different formats.


> they will certainly have very different formats

Not necessarily; as the article discussed, the format characters are part of a Unicode standard. Apple/Swift Foundation uses the ICU library[0] for the heavy lifting of date formatting, which is certainly widely available (by design).

[0]:http://site.icu-project.org/


That reference seems to be offline at the moment...

I don't think it's widely used. It is great that some standards body took that problem into their scope, but it may take a long while until most languages agree on anything here.


Yeah, what crazy obscure languages expose a binding to strftime???


> I don't think it's widely used.

You may wish to look at this: http://site.icu-project.org/#TOC-Who-Uses-ICU-


Java is affected in exactly the same way.


I thought the same; context is everything. This isn't an issue with a date format, but with a date format as used by a specific language (or specific languages).


IMO there’s no valid reason for date parsers to accept YYYY by default. There’s no sensible use case where you’d want to mix week-based dates with month-based ones, and the latter type is way more common.

So why not have a separate “WeekOfYearDateFormatter” subclass for the rare use case? The default class could then explicitly fail when you’re accidentally using YYYY in a format string, saving you from a weird end-of-year bug that might spoil your holiday.


It would reduce the value of a pattern-based API as an idea, namely that you can specify any format with a combination of special characters. But you are right: this case is so special that it would make sense to protect the majority of developers from misuse. If a builder pattern is used for date formatters, then it could throw an exception unless a certain flag is also set:

    var formatter = format("ww-e-YYYY")
                    .withWoYCalculations()
                    .build();
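
Spelled out a bit more, the guard could look something like this (a rough Python sketch; SafeDateFormat and allow_week_year are hypothetical names, not any real API):

    class SafeDateFormat:
        """Pattern-based formatter that refuses week-based years unless opted in."""

        def __init__(self, pattern, allow_week_year=False):
            if "YYYY" in pattern and not allow_week_year:
                raise ValueError(
                    "pattern uses the week-based year 'YYYY'; did you mean 'yyyy'? "
                    "Pass allow_week_year=True if this is really intentional."
                )
            self.pattern = pattern

    SafeDateFormat("yyyy-MM-dd")                          # fine
    SafeDateFormat("ww-e-YYYY", allow_week_year=True)     # explicit opt-in
    # SafeDateFormat("YYYY-MM-dd")                        # raises ValueError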


One of the big benefits of using these APIs is often that you implicitly accept a greater range of user formats than you anticipated.


Most people, for their entire career, should literally never write out a date/time-format string. Even once. If you want ISO8601, use that constant, don't write it out.

Use pre-defined formats, like 8601, 3339, Long, Short, etc. Or datetime skeletons if your system supports them and you MUST do something non-standard, and even then do a day or three of research before typing the first letter. Basically nothing else is even remotely acceptable for internationalization (and whatever the "intercalendarization" equivalent would be), and anything else stands a good chance of getting even the extreme basics like "yyyy" wrong.
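
For instance, in Python the round trip needs no format string at all (a minimal sketch; fromisoformat is 3.7+, and it stands in here for whatever predefined constant or formatter your platform offers):

    from datetime import datetime, timezone

    # Emit and re-parse ISO 8601 / RFC 3339 without ever typing a pattern.
    now = datetime.now(timezone.utc)
    s = now.isoformat(timespec="seconds")   # e.g. '2018-12-26T14:03:07+00:00'
    parsed = datetime.fromisoformat(s)      # round-trips with no format string in sight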


IIRC, ISO8601 isn't a single constant, fixed format. It's a variety of optional representations that generally follow the rule least-to-most specific. For example 19991231 would be as valid as 1999-12-31.


Yeah, good point on 8601. But that's even more of a reason to use a pre-defined 8601-parser/printer/whatever instead of a hand-written format string.

If you're using 8601 for an interchange format... don't. But if you're forced to emit a hard-coded format for whatever consumer can't be forced to do things sanely, this would be one of those exceptions to "most people". And it should immediately stop after that exception.


https://tools.ietf.org/html/rfc3339 exists as a profile of ISO 8601 for precisely that reason, its representations are much more restricted.


In Postgres `extract(week from t)` has a similar danger. If you combine it with `extract(year from t)` then you land in the wrong place. You need to combine it with `extract(isoyear from t)` instead. That bug is sort of the opposite of this one: instead of parsing a date and using the ISO year by mistake, it's formatting a date and omitting the ISO year by mistake.


This seems substantially less likely to be a problem since

(a) G-V-u is relatively rare compared to Y-m-d

(b) The choice-of-year problem is inherent to week-based dates, but not to month-based dates, since years are not week-aligned, so knowing the difference is table stakes for implementors of week-based timestamps

Now, "isoyear" seems like an awful name in various ways, not least of which is people seeing "iso" and assuming that this is what they really want (i.e. the same problem all over again)...


Java:

new java.text.SimpleDateFormat("YYYY-MM-dd").parse("2018-12-30") returns Sun Dec 31 2017

new java.text.SimpleDateFormat("YYYY-MM-dd").format(new java.util.Date("12/30/2018")) returns 2019-12-30

java.time from Java 8 is also affected.

java.time.format.DateTimeFormatter.ofPattern("YYYY-MM-dd").format(java.time.LocalDate.parse("2018-12-30")) returns 2019-12-30


Javadoc uses yyyy for year since at least 1.5

https://docs.oracle.com/javase/1.5.0/docs/api/java/text/Simp...

I've never seen YYYY used in Java.


Isn't it fun that while we're all discussing 'yyyy' vs 'YYYY', it seems like we're missing that what everyone actually should use most of the time is 'uuuu'?



Yes, indeed so. That's why I answered in the java thread. Sorry for not making that clear.


Some additional stories I enjoyed about this “feature” (which I’d never heard of):

https://rachelbythebay.com/w/2018/04/20/iso/

https://rachelbythebay.com/w/2018/05/13/dates/


This appears to be about the DateFormatter class in Swift.

However, it states there that YY is part of the Unicode standard, so I imagine it might affect other languages as well.


Foundation (where DateFormatter is defined) does use Unicode’s standard, so this should be language agnostic.


A lot of languages take their date/time formatting from C strftime (and quite a few simply use light wrappers around actual strftime), where the format code for ISO year is %G.

And FWIW, Python's (strftime-based) datetime library won't let you mix ISO and non-ISO format codes. Trying to use %G with %m, for example, raises an exception, as does trying to use %Y with %V (%V is the ISO week number format code).


And just to clarify a bit: the specific restriction Python imposes is that if a strptime() format string contains one of %G (ISO year) or %V (ISO week number), it must also contain the other one, and must contain a day-of-week format code (%A, %a, %u, or %w).

Examples:

'%G/%m' is illegal; it contains %G without %V, and does not contain a weekday format code. Attempting to call strptime() with this format raises ValueError.

'%V/%u' is illegal; it contains a weekday format, but has %V without %G. Raises ValueError.

'%G/%V' is illegal; it contains both %G and %V, but does not contain a weekday format code. Raises ValueError.

'%G/%V/%u' is legal; it contains both %G and %V, and contains a weekday format code.

'%G/%V/%w' is legal; it contains %G and %V and a weekday format code. It's a bad idea, though, because %w numbers days 0-6 starting Sunday, while ISO (%u) numbers them 1-7 starting Monday.

If you need to work with ISO week date formats for some reason, you should stick to one of these two format strings:

'%G-W%V-%u'

or

'%GW%V%u'

The date of this comment (December 26, 2018) comes out as either '2018-W52-3' or '2018W523' using those format strings.
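
A quick sketch of that round trip (assuming Python 3.6+, where these directives are supported):

    from datetime import datetime

    d = datetime.strptime("2018-W52-3", "%G-W%V-%u")
    print(d.date())                   # 2018-12-26
    print(d.strftime("%G-W%V-%u"))    # 2018-W52-3
    print(d.strftime("%GW%V%u"))      # 2018W523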


What is the rationale for forcing the presence of the day of week? It seems plausible that a weekly report, generated every Sunday for the previous week, would have %GW%V as the title. Seems more correct than using %V together with %w at least.


I don't know for certain, but what I would guess is that strptime() without a day-of-week indicator is ambiguous.

strptime() produces a datetime object, which consists of year, month, day, hour, minute, second, microsecond, time zone, fold. If you do something like "2018-12" with format "%Y-%m", strptime() will fill in the remaining arguments with day=1 and all time components set to zero, so what you get is datetime(year=2018, month=12, day=1, hour=0, minute=0, second=0, microsecond=0).
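
A minimal check of that fill-in behavior:

    from datetime import datetime

    # Unspecified fields get unambiguous defaults: day=1 and all time fields zero.
    print(datetime.strptime("2018-12", "%Y-%m"))   # 2018-12-01 00:00:00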

That works because it's unambiguous -- there aren't multiple possible numbering schemes for the day of the month in the strptime() formatting options.

But there are multiple possible numbering schemes for the day of the week, which means a year + week with no day-of-week format code is ambiguous. Worse, the two options don't even share a start: one of them begins numbering at 0 (Sunday) and the other at 1 (Monday).

So I'd guess the insistence on a day-of-week format code is to force you to indicate which day-numbering scheme you want, in order to avoid the possible ambiguity.

(and you might think it's reasonable to assume if someone uses ISO year + ISO week number, they'd also want ISO day-of-week number, but we're talking about dates and times here, and "reasonable" left the building a long time ago)


How hard would it be to make a linter for the common cases of this? I'd expect most format strings to be literals (no concatenation or anything), passed to the formatting API in a way that can be easily statically determined. Any use of YYYY without ww should be assumed to be a mistake.

It can also be detected at runtime. Debug builds could warn or crash.

Future APIs just shouldn't have YYYY mean this. Use another letter.
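
A crude sketch of such a linter (hypothetical and regex-based, in Python; a real tool would want to resolve which API the literal is actually passed to):

    import re
    import sys

    # Flag quoted format strings that use the week-based year 'YYYY' without a
    # week-of-year field 'ww' -- in practice that is almost always a yyyy typo.
    FORMAT_LITERAL = re.compile(r'''["']([^"'\n]*YYYY[^"'\n]*)["']''')

    def lint(source):
        for match in FORMAT_LITERAL.finditer(source):
            fmt = match.group(1)
            if "ww" not in fmt:
                yield "suspicious 'YYYY' without 'ww': %r" % fmt

    if __name__ == "__main__":
        for warning in lint(sys.stdin.read()):
            print(warning)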





A little amusing that PHP, everyone’s favorite whipping language, eliminates this particular footgun (it uses ‘o’ for the week-based year).


I’ve seen it done. Had to patch a few, too.


‘o’ comes first alphabetically, so I can see how this can still be an issue.


Maybe YYYY should be changed to IIII to minimize confusion.

Related article (heavily discussed on HN in the past):

http://rachelbythebay.com/w/2018/04/20/iso/


Seems like a good candidate for a GitHub sitewide search, with (manually filed) issues against improper uses (maybe a cut and paste) to alert callers who are using it incorrectly.


This is one of those bugs that scare me the most, because they are literally a time bomb.

Twitter was affected by this a few years back https://www.google.com/amp/s/amp.theguardian.com/technology/...


Agreed. Calendar/time-related bugs are brutally hard to test for, one reason I was so pessimistic about Y2K.


Software i18n is still hard to do, even in 2018.

Not completely related to the post, but some time ago I had to build a calendar widget in JavaScript capable of displaying ISO and Hijri dates simultaneously. It turns out the most reliable way to convert between the two is to use a lookup table, similar to what Java Time does. The algorithmic implementations available started to drift in odd ways after a certain number of days.


Are these date format strings part of ISO-8601 or standardized in some way? They look pretty much identical in JS and Python.

If so, maybe a big help in preventing human error would be editor plugins that verbalize what a given string represents. I found the regex websites that do this invaluable for learning and validating regexes.


> Are these date format strings part of ISO-8601 or standardized in some way? They look pretty much identical in JS and Python.

They’re part of Unicode.

> maybe a big help in preventing human error are editor plugins that verbalize what a given string represents

This is a good idea, but it’s easy to get this wrong. There are websites that let you enter a format string and it will format the date for you as a “preview”, and they often have a list of format specifiers that you can pick from. The issue is that it’s easy to pick YYYY because it might end up coming first in the list and have a description like “the year”, which makes it seem no different than the one you’d want to use.


They're very different from the format strings in Python, which takes its format strings from the C standard library (https://en.cppreference.com/w/c/chrono/strftime), which is different from (and older than) these Unicode format strings. The same holds for other libc-based languages like Perl or R.

JS does not have native datetime format strings AFAIK, and 3rd party libs such as moment.js often invent their own strings.


Case sensitivity is a mistake in every platform that embraces it. I will die on that hill.


I'll go ahead and voice some support. My position isn't as extreme, but I do think that case-sensitivity is overused and should not be the default. There are places where it makes sense and should be used, but not many compared to how much it's used in the wild.

This is especially true for programming languages. Case sensitivity means that fooBar and FooBar can be valid identifiers in the same scope but refer to different bindings. I see many ways that can produce errors and very few (possibly none) where it can help create well-structured code. If the names shadow, clash, or override, then the error cases become much easier to see and diagnose.

Honestly one of my favorite tiny things about programming in Lisp is the identifier rules. The caps insensitivity, plus '-' being available because there are no infix operators, means that multi-word identifiers are easier to type (no shift key) and IMHO more aesthetically pleasing than underscores or snake case.


Lisp isn't actually case insensitive. The reader is case-folding by default. You can set your READTABLE-CASE to change this: http://clhs.lisp.se/Body/26_glo_c.htm#case_sensitivity_mode

I can see why this is available to the reader (it has to handle arbitrary characters), but I have no idea why anyone would ever want :INVERT.


That depends on Lisp dialect; the above is true for ANSI Common Lisp.

Scheme is basically case sensitive. R5RS was insensitive; they repented from the idiocy and made R6RS sensitive. Finally, they caved in and made it configurable.

Emacs Lisp is case sensitive; FOO and foo are different symbols.

TXR Lisp, ditto.

Case insensitivity is arbitrarily stupid; you're taking two subranges of a character set and making them equal. It ignores the fact that there are semantic differences linked to case: like Pole is someone from Poland, whereas pole is one end of a magnet.

I see foo and Foo as completely different; I've been programming in C and using Unix file systems for thirty years.

Case insensitivity is complicated in Unicode. To be implemented correctly, it must handle every script that exhibits a case-like duality or plurality whereby the same sounds are encoded with multiple sets of related glyphs:

CLISP fails here:

  [1]> (eq 'ångstrom 'Ångstrom)
  T
  [2]> (eq 'フバル 'ふばる)
  NIL
How about SBCL and others?

If you make things completely sensitive, so that different codepoints are different characters, you don't have to implement reams and reams of tables for handling all of the scripts.

There are other issues there, like similarities between glyphs and groups of glyphs that aren't in a case-like relationship. These are worse than case, even; we can have a symbol ffi that is a ligature, and one that is just the three ASCII letters.

We don't even have to go to Unicode to find visual confusion issues: should we fold together 1 and l so that someone doesn't name different variables poll and po11? These two look more similar to me than B and b, or E and e. In some typefaces it's worse than in others. People have exploited this sort of thing in obtaining vanity license plates that are hard to transcribe correctly for police officers.

All sorts of language features can be abused. We can have nested scopes such that an identifier can have different bindings in the same apparent scope. We don't need foo and Foo to create confusion: we can just have nested let-s with foo present at different levels of nesting. Should that be banned?

Basically, the right guiding principle is to trust the users of programming languages to be grownups.

By the way, in a number of mainstream functional languages, FooBar would be a type and fooBar a value, enforced at the language level. That is monumentally idiotic; you basically have case sensitivity and the implementation still has to recognize case distinctions. If we use Japanese names in Haskell, which is the type: the one starting with katakana or the hiragana one? How do you use scripts that don't exhibit case?


> Case insensitivity is arbitrarily stupid; you're taking two subranges of a character set and making them equal.

I think this is backwards. Languages are top-down, and character sets are bottom-up.

A language can reasonably choose to use the common concept of "the 26 English letters (ignoring case)". This would require dealing with such implementation details as "subranges of character sets" only because that's the platform that computers today present.

It's not unique to characters. Many programming languages say that 2 is equal to 2.0 (and 2/1 and (2,0) and ...), even though their internal bit patterns are usually completely different. We simply decided that this is the semantic meaning that we wish to expose to users, and the implementor has to work out how to deal with IEEE FP and such.

> It ignores the fact that there are semantic differences linked to case: like Pole is someone from Poland, whereas pole is one end of a magnet.

There are also semantic differences from context and from pronunciation -- English is pretty stupid -- but I don't think anyone would suggest appending every identifier in a program with a pronunciation key, or the OED meaning. Just because a word without capitalization can be ambiguous doesn't mean a word with its capitalization is unambiguous, or even that this would be desirable if it were.


Case-insensitivity is annoying to me, and much more difficult to implement than case-sensitivity. Yes, yes, one should always implement Unicode normalization or at least form-insensitive comparisons, but that's subtly different. Even if you implement normalization / form-insensitivity, case-insensitivity still requires significantly more constant data tables. And you really have to have normalization support if you're going to do case-insensitivity.


> If we use Japanese names in Haskell, which is the type: the one starting with katakana or the hiragana one?

That sounds like the wrong question to ask. Neither script has uppercase, and you're not mixing them in a word. A better question would be: should scripts without case support be valid Haskell identifiers to begin with.


Yeah, I could see a language taking the position that referencing fooBar as FooBar is an error, and that level of case-sensitivity makes sense to me (my name is Martin not martin) but allowing both Martin and martin to exist as distinct identifiers is crazy.

On the subject of spaces, I do wonder why no language has used enough punctuation to make spaces allowed in identifiers. I mean, most Algol-derived languages could get away with it if they put punctuation between type keywords and variables:

    private, my class: my instance = my factory function(my parameter one, my parameter two);


> make spaces allowed in identifiers

IIRC from my days at Cambridge, Algol68 allowed that. (Our local variant was Algol68C.)


Lisp allows spaces in identifiers, or any other character. You just have to escape them:

    (defun |my + function| (x y) (+ x y))
    (|my + function| 2 2)
(Or maybe write a reader macro to deal with them in some other way.)


Hah, that's even uglier than the SQL "just put square brackets around it" approach.

We're looking for something nicer than underscores and hyphens.


Given that whitespace is the primary way most languages separate tokens, yeah, putting whitespace inside tokens is going to be ugly.


The Robot Framework testing language does allow spaces in identifiers, and it is a terrible idea.


Typically case-insensitivity is a filesystem feature rather than a pervasive feature -- if some programming language is case-sensitive (most are!) then that's what it has to be regardless of platform/OS. So your comment doesn't quite make sense.

The problem here isn't case-sensitivity anyways. The problem is that many programmers think Y instead of y for years because that's how it is in many contexts. E.g., the C strftime() and strptime() functions use %Y for four-digit year.

I don't think case sensitivity is a mistake. Case is a mistake. But we have it, and we have case-sensitive and case-insensitive things. It is what it is.


What about passwords? match?("AAA", "aaa") == true is a bug, to me.


... okay, I'll grant passwords as a rare exception, but I do wonder if "correct horse battery staple" isn't a better solution to the problem that password case sensitivity solves.


When you get into international, your intuition breaks down. Unicode is a zoo. Greek, for example, has TWO lower case forms of capital sigma depending on where it's placed in the word, so how would your system handle that? Latin has similar quirks.


Yes, but string equivalence in Unicode is a hairy mess even without cases. You get all kinds of places where two characters are semantically and visually identical but are different code points. Raw binary comparison of Unicode without normalizing first is a bad idea even without cases.

Folding the cases is a continuation of the normalization operation you were going to do anyways.


You need more tables for case folding.

Also, case-folding doesn't round-trip in Unicode. In particular, only to-lower-case conversions work well enough.

Case is a mistake. Case-insensitivity is another mistake though. Don't do it. It's a pain for users (me, for example).


Technically, doesn’t English have two forms of lower case “s”?

“s” and “ſ” - the latter being u017F.


From the perspective of Unicode, no. What you're looking for here is what Unicode calls "equivalence", and it comes in two variations: canonical equivalence and compatibility equivalence.

For example, "é" can be written as either U+00E9 LATIN SMALL LETTER E WITH ACUTE, or as the sequence U+0065 LATIN SMALL LETTER E followed by U+0301 COMBINING ACUTE ACCENT. These two options have canonical equivalence; what this means is that Unicode treats them as two ways of specifying exactly the same thing.

Now, consider "½". That's U+00BD VULGAR FRACTION ONE HALF. Generally you can replace that with the sequence U+0031 DIGIT ONE, U+002F SOLIDUS, U+0032 DIGIT TWO ("1/2"). This is not quite the same thing; most places where someone writes "½" can safely be replaced by "1/2", but not necessarily all, and it definitely doesn't work in reverse. This is compatibility equivalence, and under compatibility equivalence "½" maps to "1/2".

So to get to your actual question: U+017F LATIN SMALL LETTER LONG S has compatibility equivalence with U+0073 LATIN SMALL LETTER S. But U+03C2 GREEK SMALL LETTER FINAL SIGMA does not have any type of equivalence with U+03C3 GREEK SMALL LETTER SIGMA.

If you follow the general recommendations for things like comparing Unicode identifiers, you'll apply normalization to form NFKC (which decomposes by canonical equivalence, then recomposes by compatibility equivalence); this will turn a "ſ" into a "s". It will never turn a "ς" into a "σ".
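
You can check this with Python's unicodedata module (a small sketch):

    import unicodedata

    # Long s has a compatibility mapping to plain s...
    print(unicodedata.normalize("NFKC", "\u017f") == "s")        # True
    # ...but final sigma and sigma are related only by case, not by equivalence.
    print(unicodedata.normalize("NFKC", "\u03c2") == "\u03c3")   # False
    # Case folding, by contrast, does map final sigma onto sigma.
    print("\u03c2".casefold() == "\u03c3")                       # True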


If you're just comparing strings then just do character-at-a-time comparison: decompose (no need to recompose) one character at a time (look ma, no allocation needed), compare the two decomposed characters' codepoints, then fail or move on to the next character. I call this form-insensitive string comparison.


Inventing your own pseudo-normalization of Unicode is a worse idea than using the actual normalization forms Unicode defines.

Also, if you think you can decompose without allocating memory... well, try a code point like U+FDFA.

For reference, its decomposition is:

U+0635 U+0644 U+0649 U+0020 U+0627 U+0644 U+0644 U+0647 U+0020 U+0639 U+0644 U+064A U+0647 U+0020 U+0648 U+0633 U+0644 U+0645

(and that doesn't begin to touch any of the potential issues with variant forms, homoglyph attacks, etc.)


There's nothing pseudo about it. Normalizing both inputs first and then comparing, or normalizing one character at a time and comparing as you go, are equivalent. There is a maximum number of codepoints in a canonical decomposition (or at least there used to be).

This is actually implemented in ZFS. (And also character-at-a-time normalization for hashing.)

I don't see how homoglyphs enter the picture. Can you explain?


Unix filesystems?

Names in programming languages?

English?

Care to elaborate?


Case-sensitivity is such a rare concept in English that any API or programming language that is based on it cannot be semantically thought of as English, even though it uses English words for all its function and class names.

I've used systems that are case sensitive and systems that are case insensitive, and the former are responsible for an order of magnitude number of bugs more than the latter.


> Case sensitivity is a mistake in every platform that embraces it.

> I've used systems that are case sensitive and systems that are case insensitive, and the latter are responsible for an order of magnitude number of bugs more than the former.

Do you have "latter" and "former" transposed in the second quote? If not, I'm confused.


Oops, yes, corrected.


Rare?

Practically every programming language is case-sensitive. Unix filesystems are generally case-sensitive.

Your statements make no sense anyway. What is the antecedent of 'it' in "... any API or programming language that is based on it ..."? Since when are APIs / programming languages "semantically thought of as English"?!


CasE-SenSITIVITY IS rarE IN ENGLISH? I WiSh SOMEone would'VE TOLD mE EARliER!


When was the last time you had someone refuse to deliver a letter to YOUR NAME in YOUR CITY? Ever have trouble catching a plane because your drivers license says Alice but your boarding pass says ALICE?

Normal people don't think of case as changing the value in the same way that even abbreviations or alternate spellings do — I know people who've had trouble boarding planes because they have ID/boarding passes which are inconsistent on, say, using ü or ue – especially annoying when the latter usage was just to work around a bug in someone else's system.


Capitalization of proper nouns is a core semantic feature of English. While I'm sure you'll be tempted to hand-wave it away as a special case (see what I did there) of case-sensitivity, it nonetheless refutes your argument.


The problem comes primarily from systems that allow identifiers (or other semantically meaningful concepts like format strings) to differ only by case. This could be considered a side-quest of the naive implementation of case-sensitivity.

"Case-sensitive but identifiers cannot differ only by case" would be a reasonable compromise.

"Martin" and "martin" are obviously not different names. Martin may be more correct, but allowing "martin" as a separate name is worse.


No, as you'd learn if you tried to find counter-examples for the cases I mentioned. Computer systems commonly treat terms which differ only in case as separate values but humans almost always try to do the opposite. People have preferences for proper usage but that's quite different from treating alternate case values as separate because people try to recognize the intention even if it's poorly expressed.


> Capitalization of proper nouns is a core semantic feature of English.

It is a core spelling rule of English. They could have gone the German way, with all nouns capitalised and compound words without spaces.


Tell it to the lawyers writing disclaimers in uppercase.


That's an artifact of years past, still alive because everyone is too afraid to do something about it, and most lawyers just don't care about styling.

Every time I copy a license to the project, I edit the formatting to use Markdown bold (or sometimes italic/emphasis) instead of ALL THAT UPPERCASE SCREAMING. Much easier for the eyes.

IANAL, but I believe that was the intent: just to highlight the important sections, to be even safer with "you have been warned about this and can't say you haven't noticed it". If any lawyer out there knows whether this is a correct understanding or not, it would be interesting to hear.


Golang's time formatting seemed odd.

Now it looks like genius.


When designing a new API, it makes sense to consider this kind of thing: How are people going to use it, and how are people going to screw it up? Kids these days copy and paste everything so bugs tend to multiply if you (god forbid) make a popular API. Date formatter APIs are opaque and anyone trying to use one in the morning or before the 12th day of a month has to check the documentation, so what do we actually gain with this abstraction?

There's a small enough number of formats you might need that it makes sense to try to enumerate them:

* ISO8601 (and get it right, it's a comma not a dot)

* that weird ISO8601 variant that uses a dot

* kdb's .z.p

* DJB's TAI

* ISO/IEC 9899 asctime/ctime

* IEEE 1003.1 "ls" format

* Yankee-doodle format/other localised formats

* RFC1123/RFC7231

* RFC2109

* RFC822/RFC2822

* Fancy "X units since/until" relative time

Do you really need so many others that you should have "yyyy-MM-dd" anywhere in your code? Each of these is trivial to construct from a struct tm/Date object, so you'll end up with fewer bugs if you stop making up mini-languages for dates and just do it directly. Oh, and your code will be faster.
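
As a sketch of "just do it directly" (Python here; the struct tm version in C looks much the same, and the Z suffix assumes the value is already UTC):

    from datetime import datetime, timezone

    t = datetime(2018, 12, 26, 14, 3, 7, tzinfo=timezone.utc)

    # ISO 8601 / RFC 3339, written straight from the components -- no mini-language.
    iso = (f"{t.year:04d}-{t.month:02d}-{t.day:02d}"
           f"T{t.hour:02d}:{t.minute:02d}:{t.second:02d}Z")
    print(iso)   # 2018-12-26T14:03:07Z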


Both comma and dot are allowed. My copy of ISO 8601-2004 says in section 4.2.2.4 “Representations with decimal fraction“:

<< If a decimal fraction is included, lower order time elements (if any) shall be omitted and the decimal fraction shall be divided from the integer part by the decimal sign specified in ISO 31-0, i.e. the comma [,] or full stop [.]. Of these, the comma is the preferred sign. >>

Also, RFC 3339 is useful to have as a specific standard profile of ISO 8601, because 8601 comprises several formats with lots of options.


> Both comma and dot are allowed.

They are distinct formats: You're not going to alternate them in your output. You're going to output one or the other.

> Also, RFC 3339 is useful to have as a specific standard profile of ISO 8601, because 8601 comprises several formats with lots of options.

RFC3339 describes ISO8601 well enough. Is there something you think I missed?


I didn't notice any glaring omissions, tho if you are including DJB's binary format you should also include the NTP and PTP scales.

RFC 3339 doesn't include the ISO 8601 week calendar formats, nor does it include the 8601 syntax for time periods. 3339 is a lot simpler than 8601.


> and get it right, it's a comma not a dot

I would have upvoted your comment but for this. In the Anglosphere, we always and without exception use a dot for decimal fractions. Yes, many other cultures use a comma, but when writing English (as you yourself do) one always and without exception uses a dot.

(yes, I'm aware that ISO 8601:2004 states that the comma is preferred: it’s simply wrong)


And then there's the fun case of an unknown date of birth, when you end up with 00-00-yyyy in your passport.


Yes! That's a very good point!

How is a date-formatting mini-language supposed to sensibly deal with this?

RFC3339 specifically prohibits the idea of unknown dates/months...


There are proposed extensions — for example, the Library of Congress[1] has proposed "X" for unknown field values and ".." for when only one part of a range is known:

https://www.loc.gov/standards/datetime/edtf.html

1. Disclaimer: my employer but I don't work in that group or on that project


Thanks for that.

I think this calls for a better "date" object though, rather than advocating for the programmable formatter, no?


Oh man... I found these kinds of dates while dealing with MARC21 records. It's a problem, as it forces us to duplicate data and use some complex replacements in SQL when we need to query by date.


Support of legacy data formats is one of the reasons. Of course, it’s very unlikely that someone would ever use dd-MM-YYYY intentionally.


Why exactly do you think a legacy format will benefit from a special mini-language?

It's not going to be faster, and as we can see: It's more likely to have bugs.


Because it's a compact and user-friendly way of specifying such formats. Some strange behavior in edge cases doesn't invalidate the benefit of not writing a new parser every time you need to handle some old representation of a date.


Formatting dates with a format string is an anti-pattern. Format strings are too easy to get wrong and don't scale when you need to account for different locales. JavaScript's Intl functions do this correctly by providing an API that accepts a locale and options like weekday: narrow|short|long, year: numeric|2-digit, etc.


This sounds to me like a really weird YYYY2k problem.

In all seriousness, that is weird. In JavaScript you can't even format a datetime string, as far as I've researched; the only way is to import some third-party library, otherwise you're concatenating a bunch of function calls to piece together the string you want.


Perhaps not part of ECMA standards, but browsers/node have Intl.DateTimeFormat as the standard library

https://developer.mozilla.org/en-US/docs/Web/JavaScript/Refe...


I believe I've tried that approach before too, but fell short when my boss wanted a specific format that just wasn't doable with that approach. I'm still a little shocked that this isn't part of ECMA somehow, given how many other programming languages support time format strings.


The whole point of Intl.DateTimeFormat is to intentionally not give you the specific format your boss wants, because your boss is almost certainly wrong about how a date should be formatted in many cases.

You tell the API you want the date and the hours and minutes, but not seconds. It localizes the result correctly for the visitor’s preferred culture (language-country).

If you show someone from England “4/5/2019”, they will read it as 4 May, about a month after the April 5 that someone from the USA sees in the same string.


Using 'e' in the 'ww-e-YYYY' example is pretty bad form, as it is dependent on your local system. So on a US system, 1 in the 'e' field means Sunday, but on, say, a German system, it would mean Monday. Although I cannot figure out which one would be system independent.


The same format string for "week years" was introduced in Java 7: http://www.juandebravo.com/2015/04/10/java-yyyy-date-format/


Ran into this type of thing last year. Definitely a weird one to debug by looking at things, but easy to figure out when you print out the dates or read documentation.


I get how the year went forward at the end of the previous year in your example, but how in the world in this post's article did it move back? I'm so confused.


I think that's the date parser's behavior in Swift (if no additional information is given in the input string, revert to the first day of the week before the first week of the year). The article I linked to displays what the formatter does.


Got bitten by this one, I can't remember the specific details other than the bug was due to using YYYY in a system that parsed log lines.

The bug manifested itself around this time of the year, so not a great time to be called out to find billions of log lines are being rejected by the system!


This website degrades unusually poorly with 3rd-party JS disabled. It looks fine, and I was really confused until I realized the main content was in images that had been omitted without any kind of placeholders.


Seems like the site got crushed (I get a database error)


https://outline.com/xSadBb

TLDR: ww-e-YYYY gives you week number, day in week, and the ISO year in which to count weeks. yyyy-MM-dd gives you the calendar year, month, and day. Using YYYY when you mean yyyy gives you unexpected results.

YYYY-MM-dd unexpectedly - or expectedly, depending on what you expect from a programming language and ISO spec combo - gives week zero (you didn't specify ww), day zero (again you didn't specify e), which means it gives you the first day of the last full week of the preceding year. 2019-1-1 parsed with YYYY-MM-dd will return a Date of December 23rd, 2018.



