You might not know this, but I've been working on a thing. Well, finally, after many years' work, it has been published as an RFC.
So, of course, I've thought of a bunch of things that I wish I'd added, or done differently.
A big one is that I wish I'd thought to split it into two files: the normative standards track spec that defines the scheme, and an informative document covering all the non-standard stuff in Appendix E. That means the contentious things people do (and in many cases have done for decades) that could never be included in the main standard for political reasons, but that you probably need to be able to deal with if you want to interact on the open internet anyway.
I would totally use that as the title.
The reason for two files is that the core spec, being very stable, is probably not going to change much; but in contrast the informative bit, which documents the crazy stuff people do on the wacky internet, is liable to drift and warp and change over time. If we wanted to update the second part we'd have to re-release the entire document.
And now some politics: how do you justify pushing out a document that updates or obsoletes a standards track spec but doesn't actually change the spec? It's much easier to replace an informational memo.
I also wish I'd been able to find a better way to address Windows' quirks and UNC strings. Some of the non-normative appendix content used to be in the main spec, but somebody on the mailing list complained that I was giving too much attention to "Windoze" (presumably because 2017 will be the year of Linux on the desktop?). As a result, all the dumb quirks about dealing with drive letters, resolving relative references and ".." segments and all that, and how many slashes to put after "file:", were relegated to an appendix – and, I regret to say, in some cases completely forgotten about.
And so a lot of text that would have removed edge cases and resolved historical quirky behaviour—and made "file:" URIs really widely interoperable—is not actually standardised. I mean, it's written there, and sometimes I even tried to say "you probably really want to do this", but someone didn't like Windows so I couldn't make it really real.
I guess I could just write it in my blog. Yeah, that sounds cool. Here you go, an officially unofficial guide to using "file:" URIs by the guy who wrote the spec:
file:/foo/bar.baz and file:c:/foo/bar.baz are perfectly legitimate, unambiguous, and beautiful.
file:/c:/foo/bar.baz is fine, too, if you prefer that aesthetic.
file:///foo/bar.baz and file:///c:/foo/bar.baz have been working absolutely perfectly for decades, if you don't want to rock the boat.
file://c:/foo/bar.baz – and particularly file://c|/foo/bar.baz – are just... no. Don't do that. This isn't 1997. We have standards.
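If you want those rules as code, here is a minimal Python sketch under a few assumptions. The function name parse_local_file_uri and its (drive, path) return shape are made up for illustration and don't come from the RFC. It only covers the local forms listed above, and leaves UNC paths and percent-decoding alone.

```python
import re

# Optional leading slash, a single drive letter, a colon, then an absolute path.
_DRIVE_PATH = re.compile(r"^/?([A-Za-z]):(/.*)$")


def parse_local_file_uri(uri):
    """Split a local file: URI into (drive or None, absolute path).

    Hypothetical helper: accepts "file:" followed by zero, one, or three
    slashes and refuses anything that puts text in the authority slot.
    """
    scheme, sep, rest = uri.partition(":")
    if not sep or scheme.lower() != "file":
        raise ValueError("not a file: URI")
    if rest.startswith("///"):
        # Empty authority: drop "//" and keep the slash that starts the path.
        rest = rest[2:]
    elif rest.startswith("//"):
        # "file://c:/..." or "file://c|/...": the drive sits where a host goes.
        raise ValueError("non-empty authority; refusing to guess")
    match = _DRIVE_PATH.match(rest)
    if match:
        return match.group(1), match.group(2)
    if rest.startswith("/"):
        return None, rest
    raise ValueError("no absolute path found")


for example in ("file:/foo/bar.baz", "file:c:/foo/bar.baz", "file:///c:/foo/bar.baz"):
    print(example, "->", parse_local_file_uri(example))
```

Raising an error for the two-slash forms, rather than quietly repairing them, is a design choice of this sketch. A more forgiving consumer might choose to repair them instead.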
\ . Ain't nobody got time for that.
file:////example.org/Qux/foo/bar.baz is obviously pointing to this file on an SMB share: \\example.org\Qux\foo\bar.baz
file://///example.org/Qux/foo/bar.baz is acceptable, if a bit... y'know... slashy.
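The four-slash and five-slash forms can be mapped to a UNC string with very little code. The sketch below is only that: file_uri_to_unc is a hypothetical name, it handles just the two slashy shapes shown above, and it does no percent-decoding.

```python
def file_uri_to_unc(uri):
    """Turn file:////host/share/path (or the five-slash form) into a UNC string.

    Hypothetical helper for illustration; other file: URI shapes are rejected.
    """
    prefix = "file:"
    if not uri.lower().startswith(prefix):
        raise ValueError("not a file: URI")
    rest = uri[len(prefix):]
    stripped = rest.lstrip("/")
    slash_count = len(rest) - len(stripped)
    if slash_count not in (4, 5) or not stripped:
        raise ValueError("not one of the slashy UNC forms")
    # UNC strings start with two backslashes and use backslash separators.
    return "\\\\" + stripped.replace("/", "\\")


print(file_uri_to_unc("file:////example.org/Qux/foo/bar.baz"))
print(file_uri_to_unc("file://///example.org/Qux/foo/bar.baz"))
```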
If you're looking at file:///d:/foo/bar/baz.htm and you see a reference like <img src="/foo/bar/pong.png"> you know it should resolve to file:///d:/foo/bar/pong.png – even if your current directory is in C:\ somewhere.
<a href="/f:/oof/rab/zab.htm"> resolves to file:///f:/oof/rab/zab.htm.
Anyone writing <a href="/a:foo/bar.baz"> or <link href="/e:../bar.baz"> is not trying to interoperate – they're looking for exploits. Don't fall for it.
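Here is roughly how those three resolution rules could look in Python. This is a sketch with assumptions: resolve_absolute_ref is a hypothetical name, it only handles references that start with a single "/", it expects the base to be in the file:///X:/ form, and it does not remove ".." segments.

```python
import re

_BASE_DRIVE = re.compile(r"^file:///([A-Za-z]):/")
_REF_OWN_DRIVE = re.compile(r"^/[A-Za-z]:/")   # e.g. /f:/oof/rab/zab.htm
_REF_ANY_DRIVE = re.compile(r"^/[A-Za-z]:")    # e.g. /a:foo or /e:../bar


def resolve_absolute_ref(base, ref):
    """Resolve a path-absolute reference against a file: URI with a drive letter."""
    base_match = _BASE_DRIVE.match(base)
    if base_match is None:
        raise ValueError("base must look like file:///X:/...")
    if not ref.startswith("/") or ref.startswith("//"):
        raise ValueError("only path-absolute references are handled here")
    if _REF_OWN_DRIVE.match(ref):
        # The reference names its own drive, so it wins.
        return "file://" + ref
    if _REF_ANY_DRIVE.match(ref):
        # Drive letter not followed by "/": drive-relative trickery, refuse it.
        raise ValueError("suspicious drive-relative reference")
    # Otherwise stay on the drive the base document came from.
    return "file:///" + base_match.group(1) + ":" + ref


base = "file:///d:/foo/bar/baz.htm"
print(resolve_absolute_ref(base, "/foo/bar/pong.png"))
print(resolve_absolute_ref(base, "/f:/oof/rab/zab.htm"))
```

Note that a generic RFC 3986 resolver knows nothing about drive letters, which is why the drive is carried over by hand here.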
Whatever goes between file:// and the next / is confused and broken and there'll always be someone who gets it wrong, so just don't write anything in there.
<a href="/%E3%81%A1"> may mean many things to many people. (/ち in UTF-8, /πüí in CP-437, /Ta~ in EBCDIC, etc.) Just avoid the whole mess – use an IRI.
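To see the ambiguity for yourself, decode the same three escaped octets under different charsets. This is a small sketch, and it makes two assumptions: cp037 stands in for "EBCDIC" (there are many EBCDIC code pages), and the leading "/" is treated as a literal character rather than a byte to be decoded.

```python
from urllib.parse import unquote_to_bytes

# The percent-escapes only pin down octets, not characters.
octets = unquote_to_bytes("%E3%81%A1")

# Same octets, three different guesses at the charset.
readings = {}
for charset in ("utf-8", "cp437", "cp037"):
    readings[charset] = "/" + octets.decode(charset)

for charset, text in readings.items():
    print(charset, text)
```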
file:c:/reçu.txt always means exactly that, even if it gets turned into 0043 003a 005c 0072 0065 00e7 0075 002e 0074 0078 0074 in NTFS's UTF-16 encoding, or 43 3a 5c 72 65 87 75 2e 74 78 74 in MS-DOS's CP-437.
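You can reproduce both of those dumps with a few lines of Python. This sketch assumes the file ends up at C:\reçu.txt with a capital C, since that is what the code units above spell out. It also prints UTF-16 code units big-endian purely for readability, which says nothing about how NTFS lays them out on disk.

```python
# Assumed local path for the IRI file:c:/reçu.txt (capital C is an assumption).
path = "C:\\reçu.txt"

# One group of four hex digits per UTF-16 code unit.
utf16_units = path.encode("utf-16-be").hex(" ", 2)

# One pair of hex digits per CP-437 byte.
cp437_bytes = path.encode("cp437").hex(" ")

print(utf16_units)
print(cp437_bytes)
```

The characters in the IRI stay the same either way, and only the bytes underneath differ.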
<a href="~matty/.plan"> doesn't mean what it does in bash, and you know it doesn't.
The same goes for $HOME and %SystemRoot% and all that sort of guff.
Abide by these guidelines and, while not necessarily adhering to the strictest interpretation of a Standards Track RFC, at the least you'll be a well-intentioned and interoperable member of the internet community.