I don't think `/{year}/{slug}.html` is what people mean when they talk about "ugly" URLs.
That moniker, at least for me, is reserved to links that look something like this: `/{endpoint}/{long_hash}?__gtr[0]&__jd__[df]=%ezaz54%d/{another_very_long_hash}[c__f]/`
When we implemented URLs for Django, our nemesis was Vignette, a popular CMS at the time (~2003) which frequently included commas in long, weird URLs.
It's hard to find an example of one of those now, because the kind of sites that tolerated weird comma-infested URLs in 2003 aren't the kind of sites that meticulously maintain those URLs in working order for 20+ years!
Wow, when I woke up this morning I had no clue that THE Simon Willison would be replying to my comment!
Right now, I’m knee-deep in coding my Django app. I totally dig how the framework kinda "forces" you to write neat URLs ― it’s one of my favorite things about it. This might seem silly, but I actually take immense pride in crafting simple, elegant URLs, even if the majority of the users won't even notice it.
As for the comma infested URLs, the website of one of the major news outlets in my country manifests such behavior. It always puzzled me as to what tech stack they were using. I'm not sayin they still use it today (as Vignette went belly up in 2009), but this can be a heritage from those days.
I've really enjoyed using Django since I first got to know it back in the 2.2 days; I've used nothing else for my projects, big or small. I'm head over heels for every bit of it and have been recommending it to my friends for years!
Big thanks to you, Simon, for helping create this awesome piece of tech!
My recollection of the "old days" may be a bit hazy, but I think comma-delimited parameters were a workaround for frameworks that did not support multiple values (or for users not knowing how to handle them).
Example of a "correct" URL:
`?value=A&value=B&value=C`
Complete frameworks would have a method that returned the values as a list. Some, like PHP, required ugly workarounds where you had to name the parameter using the array syntax: `value[]=A&value[]=B&value[]=C`
Even if the framework supported multiple values, many preferred the shorter version, `value=A,B,C`, and split the values in code instead
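Both conventions are easy to demonstrate with Python's standard library (a sketch; the query strings are just the examples from above):

```python
from urllib.parse import parse_qs

# Repeated keys: a spec-friendly parser returns every value as a list.
repeated = parse_qs("value=A&value=B&value=C")
print(repeated["value"])  # ['A', 'B', 'C']

# The comma-delimited shortcut arrives as a single string,
# so the application has to split it in code itself.
packed = parse_qs("value=A,B,C")
print(packed["value"][0].split(","))  # ['A', 'B', 'C']
```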
Django actually has a special mechanism for dealing with `?value=A&value=B`:
```python
values = request.GET.getlist("value")
# values is now ["A", "B"]
```
We built it that way because we had seen the weird bugs that cropped up with the PHP solution, where passing ?q[]=x to a PHP application that expected ?q=x could result in an array passed to code that expected a string.
I don't know if it's something from the old days or not, but IIRC URLs have a semicolon separator (;) that would go before the ?. I have never seen it being used. I'm betting it's even less supported than commas!
In the OG RFC 2396, each _path segment_ can specify parameters similar to query parameters, but using a semicolon to separate them from the main segment value instead of a question mark. This has effects e.g. when calculating relative URLs. This is now obsolete, but many URL-parsing libraries still have an API for it for compatibility.
I may have misunderstood your initial comment. Was Vignette a nemesis because letting people migrate to Django from it while preserving URLs involved commas, or was it just a nemesis in general and you're pointing out a flaw in how they did URLs? If the latter then yeah there's no point in me mentioning a mainstream use of commas in URLs.
I think it's a specific reference to one of the tenets of Cool URIs Don't Change, which was that you should drop the file extension from URIs. So, indeed, not that ugly, but also, not cool, according to the good people of the W3C, back in the day.
microsoft teams is a good example of ugly urls. it could be just a couple of letters that are mapped in a backend database, but the urls feel like there is a whole javascript file encoded in there
Unlikely to change over what timeframe? Image formats on the web have moved from .gif to .jpeg/.png to .webp to .avif. Video and audio formats have always been a mess. For a time it seemed things would move to .xhtml.
That the page is sent to your browser as HTML is not a defining attribute and could very well depend on HTTP content negotiation.
I think the point being made is that the contents of the file will be HTML whether it's a static file on disk or dynamically generated using PHP. This may be more obvious when thinking about dynamically generated SVG or PDF. PHP, Node, or Python would be implementation details. HTML is the content type, and that is not likely to change.
This is an aspirational abstraction. HTML will probably outlast most websites, and those .gifs are probably /foo.gif on every site too. Even if that somehow changes, it won't break the existing URLs. Less confusing to just call it what it is for the time being.
That is a very formal way of looking at it. Moreover, this is rather simple hypertext, not an image. HTML, or a remarkably similar and compatible descendant of it, is likely to remain in use for centuries.
That's an implementation detail that doesn't make sense in the addressing scheme. Like adding "brick house" to the end of every mailing address when the destination is made of bricks.
> What about if an mp3 is at the end of a URL? Is that an implementation detail that doesn't make sense? Just take off the .mp3 extension?
Yes, why not? Just because file extensions matter to certain systems doesn't mean they do for others, and nothing about a URL to a file is required to match its DOS/Windows friendly file name.
> GET /<artistname>/<albumname>/<songname>/download HTTP/1.1
> It's nice in browser history to see foo.mp3 and know it's an mp3.
TBH I agree. I personally do my best to ensure the extension in the URL matches the document type on sites I run, but my point was that it's not in any way required, and it's actually somewhat common for it not to be the case, whereas the person I was replying to seemed to think it mattered.
Not seeing any advantage, and seeing a serious disadvantage.
The hiding of the index file name in a folder is kind of a quirk: it's automatic behavior being taken advantage of to make "nice" URIs, but it's actually hiding useful information.
File extensions, while a DOS/Windows thing, I've found to be an extremely useful convention on Unix, Linux, and just about any other system I've used (though I can't remember what we did on the VAX box we used in the 90s).
If the extension is there because that's what the file is on the server, that's wrong. If the extension is there because the endpoint will return that type of content, I'm fine with it.
I've put blobs of JSON in a URL before. It was dirty but I thought it was better than having pages with no direct URLs or breaking the browser's history.
For my personal website, I have gone back and forth on using "cool URIs" without the ".html" extension. Initially when I began building my website in the early 2000s, I configured my web server to handle requests to /blog/{slug} by serving the corresponding {slug}.html file stored on the disk. However, over time, I opted for simplicity and got rid of such server configurations. I now simply expose /blog/{slug}.html in the URLs.
> File name extension. This is a very common one. "cgi", even ".html" is something which will change. You may not be using HTML for that page in 20 years time, but you might want today's links to it to still be valid.
But I have been running my website for over 20 years now and I do think I'll stick with ".html" for the foreseeable future. This combined with the fact that I strictly use relative links for cross-linking between pages, for loading CSS, images, favicons, etc. means that I can browse my website offline (directly from my local disk) too just by opening the local index.html file on my web browser.
I recently thought through this problem and came up with the concept of building a list of "candidates" for a given URL. The caller then loops through and returns the first candidate that actually exists. It's a nice boundary between functions. I wrote up my solution in literate markdown (and javascript) here [0].
(Apart from supporting optional extensions, this code also supports throwing an error if someone prepends dots to the URL, which, for me, indicates someone probing the server for weaknesses, not a legit request.)
The funny thing is that I still often use file extensions since IntelliJ can only let me easily navigate/check existence if I use the extension.
Eventually I'll support slugs in the filename by just ignoring everything after the first dash.
How I wish they were right about .html... I wish we had something else by now.
Personally I'm a fan of including a post ID in the URL, e.g. /category/123/post-name, because if you want or need to change the URL later, you can simply parse the ID back out of the URL to create redirects. A lot of sites of all scales don't implement redirects, which makes me sad.
I think there was a news site acquired by Bloomberg, I forgot the name. When you visited an article in the old domain, it redirected to a landing page on Bloomberg saying it was part of Bloomberg now instead of redirecting to its new URL.
> How I wish they were right about .html... I wish we had something else by now.
You can thank the browser complexity moat for that. If browsers were simpler to implement someone would have started experimenting with this (markdown at least) years ago and other browsers would have picked it up.
PDF is done via an internal plugin; a standards-compliant web browser doesn't have to do anything with PDF, but the major browsers ship an internal type handler for it.
A similar type handler is engaged for XML. You can utilize W3C standards to implement a custom markup language using XML/XSLT and have it work across browsers without plugins.
SVG is vector graphics.
For another full markup to even be considered, there would have to be one that's widely adopted and realized through plugins first. Nobody is making interventions in standards to open up avenues for easy implementation of custom markups when those markups are used by 0.001% of publishers.
Yeah, sending a .md for client-side rendering would allow the client to reformat it more easily based on user preferences. Then again, Safari/Firefox reader mode already do an ok job with HTML for this.
But we could go so much further than reader mode. Users should have way more control over how content is rendered. But I'm something of an extremist. I don't really consider CSS/JS part of the web.
I don't really agree about CSS/JS, but either way, I've been in plenty of situations operating informational sites that just want to serve mixed text/image without worrying too much about how it's formatted. Unfortunately there isn't such an option. Regular HTML tags are supposed to do this, but most browsers won't format those in a modern-looking way. It'd save a lot of collective time if they could.
When those "informational" sites were the norm 15 years ago, browsers like Opera had user CSS that you could just override a site's styles with, and a number of presets. You could format a site to look like C64 BASIC.
The stuff you're talking about isn't about browsers, it's about the websites.
If you had a website that used javascript to parse MD or any other markup and spit it out as trivial HTML with a light DOM, client-side formatting could do everything you want.
The problem is that modern websites use patterns that work around users' ability to customize the presentation of the website. They do not want you to look at their site the way you want.
Browsers can reformat clean HTML easily in theory, but I mean the defaults aren't nice, and most users aren't changing them. You have to use CSS to make a site look good by default.
I guess the best solution to that isn't browser-side .md rendering, though.
This is somewhat stupid from my angle (the W3C recommendation).
I don't expect that url.html is a static HTML file; I expect it to be server-side generated in 2024. For me, site.com/page and site.com/page.html are the same, and I do not expect different behavior on the client side. So I may switch backend engines every year, and I'll just route the requests from page.html and that's it.
What's way worse than this is using non-HTML extensions for emitting html. I go to pichost.com/image.jpg and I get a webpage served. This is a bad pattern and it needs to go away. I'm not even going into responding differently depending on user-agent or referrer, if you have combination of these you get JPG returned, if you don't you get a webpage returned.
> What's way worse than this is using non-HTML extensions for emitting html. I go to pichost.com/image.jpg and I get a webpage served. This is a bad pattern and it needs to go away. I'm not even going into responding differently depending on user-agent or referrer, if you have combination of these you get JPG returned, if you don't you get a webpage returned.
It's mostly based on the Accept header these days (browsers don't tend to include HTML there in image contexts) and the Referer should have been removed decades ago. This means browsers (the ones with a large market share at least) are 100% complicit in enabling this behavior.
Agreed... but not what I was talking about.
HTTP has no files or extensions; it's just a URL that someone named dot-something. Since it doesn't have to be that file type behind it, I don't expect it to be.
The internal framework we have at my company directly ties the extension of the endpoint to the expected mimetype returned from the controller. So with endpoint.html / endpoint.xml / endpoint.json / endpoint.csv you always know what you are getting. Only the implemented extensions work, defined per controller; no magic here.
There is an escape mechanism for making endpoints without an extension but we rarely use it.
It’s a weird design I probably wouldn’t make these days, but for debugging at a glance it’s honestly pretty nice to look at the stream of requests and just know the type of each.
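A rough sketch of that extension-to-mimetype design in Python (the names and table are mine; the internal framework described above surely differs):

```python
MIME_BY_EXT = {
    "html": "text/html",
    "xml": "application/xml",
    "json": "application/json",
    "csv": "text/csv",
}

def response_type(endpoint: str, implemented: set[str]) -> str:
    """Map endpoint.ext to a mimetype, honoring the per-controller whitelist."""
    _, _, ext = endpoint.rpartition(".")
    if ext not in implemented or ext not in MIME_BY_EXT:
        raise LookupError(f"extension {ext!r} not implemented for this controller")
    return MIME_BY_EXT[ext]
```

The per-controller `implemented` set is what keeps this from being "magic": an unlisted extension fails loudly instead of falling back to a default.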
That's an interesting choice. I like it from an ease-of-use perspective, but I don't love it from the perspective of knowing what you're actually accessing: if it's a .json URL, I'm expecting to be served a static JSON file rather than a script that's serving me JSON dynamically. I feel the same way about certain uses of HTTP status codes: if I get a 404, I'd expect it to be because the page wasn't found, not because a POST parameter was wrong. The worst offenders don't even serve an error message with the status code, but I'm getting off track here.
That's clearly incorrect semantics, and should be 400 Bad Request. Unfortunately the semantics of HTTP status codes are unenforceable with some obvious exceptions.
There's no excuse for not implementing them properly, however. I'm less of a fan of the existence of verbs, which I consider to be a part of the URI which isn't in the URI itself. Things would be better if one URI was one endpoint, rather than potentially as many endpoints as there are verbs.
Most people have a /blog/{slug} directory with an index.html inside it. This is also a nice place to put images and other files you only include in a single page.
This sounds like a (critical) bug with Cloudflare Pages to me. No hosting provider should be fiddling with the url scheme, especially with permanent redirects. That's invasive and wrong. If it's an official policy or "feature" then someone at Cloudflare made a BIG mistake.
> Pages will also redirect HTML pages to their extension-less counterparts: for instance, /contact.html will be redirected to /contact, and /about/index.html will be redirected to /about/.
Yeah, the permanent redirect is what really sounds weird to me. Those can be really invasive and should not be used lightly. I rarely use them these days because back when I did it was almost always a mistake.
IIRC, using a permanent redirect makes sure search engines treat the two URLs as pointing to the same page, accumulating all "page rank" to that one page, rather than treating it as two separate pages.
Many hosting providers -- and many web servers, going back decades -- offer this functionality, because a lot of people want it.
Keep in mind that this is Cloudflare Pages, not Cloudflare in general. Cloudflare Pages is a product where you give it a bunch of files, and it serves them as a web site. You don't have your own server behind Cloudflare in this case.
Serving a web site based on a directory of files is tricky, because URL space and filesystem space are a little bit different. Files on disk need to have file extensions to indicate their type, but URLs are not supposed to have file extensions, because their type is indicated by the `Content-Type` header. So if you are taking a bunch of files and serving them as a site, you need to figure out how to transform the type info in the URLs into Content-Type headers in an appropriate way. This is a solution to that.
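That extension-to-Content-Type transformation is exactly what stdlib tables like Python's `mimetypes` exist for:

```python
import mimetypes

# The mapping a file-based server consults when it turns an on-disk
# file extension into a Content-Type header for the response.
print(mimetypes.guess_type("contact.html")[0])  # text/html
print(mimetypes.guess_type("logo.svg")[0])      # image/svg+xml
print(mimetypes.guess_type("song.mp3")[0])      # audio/mpeg
```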
Another remapping that nearly every file-based web server does is, if the URL turns out to be a directory, it returns a redirect to add `/` to the end, and then from there it serves the file called `index.html` in that directory. Again, this is needed because URL space and filesystem space don't exactly match: a directory on the filesystem cannot itself have byte content, it can only contain files. But a URL that is a directory can also directly serve content, so you have to figure out how to resolve that.
`index.html` remapping is pretty much universally accepted. But it's true that people have differing opinions on extension-stripping. The extension is redundant, but some people would rather keep it just to make it clearer how URLs map to files. Fair enough.
Unfortunately Cloudflare Pages does not have a setting for this right now. It has chosen to implement only the most popular approach. This is a product decision, and of course some people will disagree with it. You can submit a feature request, or you can use a different product that works the way you want (there are tons of them out there). But it's not a "bug" that the product has not chosen to implement your specific preferences.
(Disclosure: I work for Cloudflare, but not specifically on Pages.)
One of the most unexpected and unwelcome features; like many others, I only found out about this once my pages went live and users had cached the redirects.
Yeah this should definitely be opt-in. Cloudflare are infrastructure, and infrastructure should strongly prefer to be as neutral as possible on decisions that have the potential to break things.
It's quite normal for static sites to do something weird here, like having folders that all contain index.html, and then having settings to strip (or add) the final slash.
There are so many different flavors that the only somewhat neutral default is what Apache does... still, it's not much :)
The "coolness" of the URI is measured by how non-changing it is.
Including ".html" in the URL when you're first creating a site signifies a risk that it'll change in the future, because it's evidence you went along with what was easiest to get the backend technology to serve your content, and as the backend changes over time, you'll do that again, changing the visible URI as you go and causing bitrot.
But if you picked ".html" and stuck with it, that's now the cool URL, and you should use web server configuration to make sure it remains that way, even if the backend technology has changed completely.
For an extreme example, when eBay started, everything was cgi.ebay.com/ws/ISAPI.dll?ViewItem=blah (or something like that), which has many specific technology implications! But it stayed that way while they changed out all that technology over the years. (I see that now they’ve gone more abstract, though.)
> GitHub Pages does something similar: If you request /path, it will serve up /path.html. [This] does not lock me into anything at all.
This is how I decided to configure my nginx as well for my web page, but note that it still locks you into something: you will still end up seeing links out there that reference /path without the extension and you will need to set up all future web servers to find the right resource on that URL. (Even if that is by adding files to the file system rather than writing web server configuration.)
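For reference, the common nginx idiom for this (a sketch; adjust the location and fallbacks to your own layout) is a `try_files` chain that checks the bare path, then the .html file, then a directory index:

```nginx
location / {
    # /blog/slug -> blog/slug, then blog/slug.html, then blog/slug/index.html
    try_files $uri $uri.html $uri/ =404;
}
```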
My opinion is: as long as a URL 3xxs to the latest content destination, it's still a cool URL. The goal, I think, should not be to create a web that is crusty, calcified, and ever unchanging, but rather a web that is adaptable and dynamic, where producers have the freedom to leave breadcrumbs and consumers have the intelligence to follow them.
When did "URI" become a thing? Was it not cool enough to call them URLs, so they had to make another abbreviation that looks very similar? I'll bet there's supposed to be a difference, but they're totally used interchangeably.
The Wikipedia page on URIs has examples that look a lot like URLs. Seems it's trying to say that URLs are only for WWW addresses, but Postgres refers to things like "jdbc:postgresql://host:port/database" as URLs: https://www.postgresql.org/docs/6.4/jdbc19100.htm
Or maybe the presence of host:port qualifies it as a URL.
A URI (identifier) is a unique reference to a resource of some kind.
One type of URI is a URN (name), e.g. doi:10.5281/ZENODO.31780 - a unique name for a resource, but no instructions on how to obtain it
Another type of URI is a URL (location), e.g. https://doi.org/10.5281/ZENODO.31780 - same resource in this case, but now we know we can obtain it via the HTTPS protocol
Few people call the address in the web browser a "URI" any more, even though technically it is one. Your JDBC URL is a URL, as is "mailto:president@whitehouse.gov" or "tel:+44-118-999-881-999-119-7253"
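The distinction is visible in how Python's `urllib.parse` breaks these apart (reusing the DOI examples from above):

```python
from urllib.parse import urlparse

url = urlparse("https://doi.org/10.5281/ZENODO.31780")
print(url.scheme)  # https
print(url.netloc)  # doi.org
print(url.path)    # /10.5281/ZENODO.31780

# The URN form still parses, but carries no network location at all.
urn = urlparse("doi:10.5281/ZENODO.31780")
print(urn.scheme)        # doi
print(urn.netloc == "")  # True
```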
I get what they were going for here, but ehhh, the only useful designation is URL. And even acknowledging that URIs exist, it's overly broad to refer to http://... as one. I remember seeing "URI" a lot in some ObjC libraries to refer to URLs; it was just confusing.
Yup, URNs were part of the "semantic web" craze, so you could e.g. record facts about a book with isbn: scheme URNs. Nothing much consequential ever came from all that committee busywork, but people got to pontificate and sound smart talking about reification and so on. I still wonder who paid for all of it.
Seems like URNs fit into the XML/XMPP/SOAP genre, old bloated stuff. For some reason there had to be a whole fad for people to realize you can just shove data into JSON and it's good enough.
The only difference between a URI and URL is semantic - URLs point to resources over a network, URIs point to resources that could be anywhere. Colloquially they're used interchangeably.
I highly recommend reading Weaving the Web by TBL. He explains how URI (identifier) was the term he wanted but he settled on URL (locator) because of politics. The semantics are actually fairly important IMO. Does your URI represent a resource's identity or where that resource is?
URI came first and URL was adopted for dubious reasons. Personally, I now use URL for user-facing things because more people know what that is, and URI when talking to other developers because it sparks conversations like this which I think are useful.
But it's hard for a monitoring tool to tell which part of a URL is the API endpoint (which you want to report on) and which is user data (which you don't want to report on). I wish people used the query portion of the URL for user data, so it's syntactically distinct from the path.
I've seen some static site generators sidestep this issue by always putting HTML files into its own directory and relying on `index.html` being correctly handled. That hindered my attempt to use HTTP content negotiation for multilingual sites (e.g. `foo.en.html`), unfortunately.
If I manually put those files there, yes. But those generators wouldn't know that part of the file name and would put `foo.en.md` at `foo.en/index.html`, for example. It can be fixed later, sure, but it's still annoying and often breaks other features in the generator.
If you don't link without .html, then you won't break anything. That's what the author is saying.
In general, trying random URLs, having them accidentally work, and then having them stop working later, when you weren't linked from anywhere, is not something that counts as a broken link.
Say, for example, you added "?page=123" to a URL that had no pagination. The normal page opens but ignores the parameter. Then later pagination is added, so that same parameter now gets you a 404, because there's no such page. Was the URL "broken"? No.
> But Cloudflare’s redirect is permanent and has been public for a few weeks, therefore all Google search results were pointing to the cleaned up URLs. If I wanted to move to a different static site host, I would have to install additional redirects so that none of those links break, just to clean up a mess I didn’t cause.
The "would have to" remark is odd. It's too late; you'll need to install redirects to stop those links from breaking anyway. Whether GitHub supports this automatically doesn't change anything. You may as well have not switched.
I didn't realize CloudFlare would forcefully start redesigning your URLs to their taste. This is absolute nonsense, I can't believe they do that. Really poor choice.
Note this is Cloudflare Pages, not Cloudflare in general. Cloudflare Pages is a product that hosts static content on Cloudflare. You upload your files to Cloudflare, and it serves them, you don't have your own server.
Many static content hosting services have this exact behavior. In fact, many web servers have offered this behavior, going back decades, because it's what a lot of people want. It's kind of needed to work around the fact that files usually indicate their type by filename extension, but URLs are not supposed to have such extensions since they indicate their file type by `Content-Type` header.
(I work for Cloudflare but not on Pages specifically.)
Thanks for the clarification, but even if that's what people want, then CloudFlare should ask them if they want it or not, at the very least allow them to opt out of it. According to OP's story it seems there's no (obvious) way to opt out of this.
A Hackernews discovers that when you outsource not only server space, but also server software, and therefore give up control over URI routing, it may differ between providers. News at 11.
"Cool URIs Don't Change" was always such a pretentious page to begin with.
No, just because I hosted something for a while does not mean I am obligated to host that resource in the exact same way for eternity. There is no contract, implicit, social, or otherwise, that I will continue to provide that free thing for you in a way that is convenient to you personally in perpetuity.
Nah, it craps all over site operators for their lack of "forethought".
Oh, you didn't perfectly lay out your URIs in the initial design? Too bad, you're saddled with the unending burden of maintaining redirects forever or you're not "cool". Should have known the company was going to move to Markdown static site generation five years before Markdown was invented.
Miss me with that shit. Link rot is the burden of the link author, not the target.
Supporting redirects can be simple, depending on your SSG (and it's possible to write extensions for most of them, so this could be something that responds to a post's frontmatter). It could just generate an HTML file with this
<meta http-equiv="refresh" …>
in the head, and some HTML/CSS to make it pretty. It's not ideal, but I assume search engines support it (dunno if there are any additional SEO implications).
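A complete redirect stub, for the curious, could look something like this (the target path is made up):

```html
<!doctype html>
<html>
  <head>
    <!-- 0-second client-side redirect to the post's new home -->
    <meta http-equiv="refresh" content="0; url=/blog/new-slug/">
    <link rel="canonical" href="/blog/new-slug/">
    <title>Moved</title>
  </head>
  <body>
    <p>This post moved to <a href="/blog/new-slug/">/blog/new-slug/</a>.</p>
  </body>
</html>
```

The `rel="canonical"` hint tells search engines which URL should accumulate ranking, which a plain meta refresh alone doesn't guarantee.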
> the unending burden of maintaining redirects forever
Right because keeping a list of source->destination and configuring your current server based on that is such a burden...
> Miss me with that shit. Link rot is the burden of the link author, not the target.
The link author isn't the one making the changes, the target is. The link author might not even be alive anymore. Expecting others to untangle your mess is ... not cool.
> Should have known the company was going to move to Markdown static site generation five years before Markdown was invented.
Okay, but you did know, right? Maybe not that the new thing would be called Markdown or exactly when but that there would be a new thing. The W3C sure knew and told you. That's why they wrote e.g. this paragraph:
> Software mechanisms. Look for "cgi", "exec" and other give-away "look what software we are using" bits in URIs. Anyone want to commit to using perl cgi scripts all their lives? Nope? Cut out the .pl. Read the server manual on how to do it.
Nope, unless you're bankrupt you're supposed to host forever:
> Pretty much the only good reason for a document to disappear from the Web is that the company which owned the domain name went out of business or can no longer afford to keep the server running.
If you don't mind the self-promotion: I am building a link-checker service that also monitors all your website's links, so if you forget to set up a redirect after moving or renaming some pages, you get a notification.
Mind you, this feature is still under development, but this is the ultimate goal of my app.
It is currently in free beta if you are interested in giving it a go: https://bernard.app
> No, just because I hosted something for awhile does not mean I am obligated to host that resource in the exact same way for eternity. There is no contract, implicit, social, or otherwise that I will continue to provide that free thing for you in a way that is convenient to you personally in perpetuity.
> That moniker, at least for me, is reserved to links that look something like this: `/{endpoint}/{long_hash}?__gtr[0]&__jd__[df]=%ezaz54%d/{another_very_long_hash}[c__f]/`
Now, that's an ugly URL!