RFC 8089: The "file" URI Scheme

You might not know this, but I've been working on a thing. Well, finally, after many years' work, it has been published as an RFC.

So, of course, I've thought of a bunch of things that I wish I'd added, or done differently.

A big one is that I wish I'd thought to split it into two files: the normative standards track spec that defines the scheme, and an informative document covering all the non-standard stuff in Appendix E. That's the contentious things people do (and in many cases have done for decades) that could never be included in the main standard for political reasons, but that you probably need to be able to deal with if you want to interact on the open internet anyway.

I would totally use that as the title.

The reason for two files is that the core spec, being very stable, is probably not going to change much; but in contrast the informative bit, which documents the crazy stuff people do on the wacky internet, is liable to drift and warp and change over time. If we wanted to update the second part we'd have to re-release the entire document.

And now some politics: how do you justify pushing out a document that updates or obsoletes a standards track spec but doesn't actually change the spec? It's much easier to replace an informational memo.

I also wish I'd been able to find a way to better address Windows' quirks and UNC strings. Some of the non-normative appendix content used to be in the main spec, but somebody on the mailing list complained that I was giving too much attention to "Windoze" (presumably because 2017 will be the year of Linux on the desktop?). As a result, all the dumb quirks about dealing with drive letters, resolving relative references and ".." segments, and how many slashes to put after "file:" were relegated to an appendix – and, I regret to say, in some cases completely forgotten about.

And so a lot of text that would have removed edge cases and resolved historical quirky behaviour—and made "file:" URIs really widely interoperable—is not actually standardised. I mean, it's written there, and sometimes I even tried to say "you probably really want to do this", but someone didn't like Windows so I couldn't make it really real.

I guess I could just write it in my blog. Yeah, that sounds cool. Here you go, an officially unofficial guide to using "file:" URIs by the guy who wrote the spec:

An Officially Unofficial Guide to Using "file:" URIs by the Guy Who Wrote the Spec

  • file:/foo/bar.baz and file:c:/foo/bar.baz are perfectly legitimate, unambiguous, and beautiful.
  • ... and file:/c:/foo/bar.baz is fine, too, if you prefer that aesthetic.
  • ... and file:///foo/bar.baz and file:///c:/foo/bar.baz have been working absolutely perfectly for decades, if you don't want to rock the boat.
  • file://c:/foo/bar.baz – and particularly file://c|/foo/bar.baz – are just... no. Don't do that. This isn't 1997. We have standards.
  • While we're there: don't use \. Ain't nobody got time for that.
  • file:////example.org/Qux/foo/bar.baz is obviously pointing to this file on an SMB share: \\example.org\Qux\foo\bar.baz
  • ... and file://///example.org/Qux/foo/bar.baz is acceptable, if a bit... y'know... slashy.
  • ... and if you don't speak SMB, no one is forcing you to implement it. Just recognise that that's what the link means.
  • If you're in Windows and you're in an HTML document at file:///d:/foo/bar/baz.htm and you see a reference like <img src="/foo/bar/pong.png"> you know it should resolve to file:///d:/foo/bar/pong.png – even if your current directory is in C:\ somewhere.
  • ... and you know that <a href="/f:/oof/rab/zab.htm"> resolves to file:///f:/oof/rab/zab.htm
  • ... and anyone writing <a href="/a:foo/bar.baz"> or <link href="/e:../bar.baz"> is not trying to interoperate – they're looking for exploits. Don't fall for it.
  • Anything you write between file:// and the next / is confused and broken and there'll always be someone who gets it wrong, so just don't write anything in there.
  • This reference <a href="/%E3%81%A1"> may mean many things to many people. (/ち in UTF-8, /πüí in CP-437, /Ta~ in EBCDIC, etc.) Just avoid the whole mess – use an IRI.
  • ... and if you want a counter-example, this UTF-8 IRI: file:c:/reçu.txt always means exactly that, even if it gets turned into 0043 003a 005c 0072 0065 00e7 0075 002e 0074 0078 0074 in NTFS's UTF-16 encoding, or 43 3a 5c 72 65 87 75 2e 74 78 74 in MS-DOS's CP-437.
  • This reference: <a href="~matty/.plan"> doesn't mean what it does in bash, and you know it doesn't.
  • ... same with $HOME and %SystemRoot% and all that sort of guff.
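
Two of the rules above are mechanical enough to sketch in code: keeping the drive letter when resolving a root-relative reference, and the way one percent-encoded string decodes differently under different charsets. This is a rough Python illustration under stated assumptions, not anything RFC 8089 specifies. The names resolve_root_relative and decodings are invented for this sketch, the resolver only handles references that start with a single "/", and raising an error for "/a:foo"-style references is just one reading of "don't fall for it". Python's cp037 codec stands in for "EBCDIC", which is also an assumption, since there are many EBCDIC code pages.

```python
import re
from urllib.parse import unquote_to_bytes

# A leading "/c:" drive segment that is followed by "/" or the end of the path.
_DRIVE = re.compile(r"^/[A-Za-z]:(?=/|$)")


def resolve_root_relative(base_path: str, ref: str) -> str:
    """Resolve a root-relative reference against the path of a file: URI.

    Hypothetical helper: the base's drive letter is kept unless the
    reference names a drive of its own.
    """
    if not ref.startswith("/") or ref.startswith("//"):
        raise ValueError("this sketch only handles root-relative references")
    if _DRIVE.match(ref):
        return ref  # e.g. "/f:/oof/rab/zab.htm" carries its own drive
    if re.match(r"^/[A-Za-z]:", ref):
        # "/a:foo" or "/e:../bar": drive-relative trickery (assumed policy: refuse)
        raise ValueError("suspicious drive-relative reference: " + ref)
    drive = _DRIVE.match(base_path)
    return (drive.group(0) if drive else "") + ref


def decodings(percent_encoded: str, codecs=("utf-8", "cp437", "cp037")) -> dict:
    """Decode the same percent-encoded octets under several charsets."""
    raw = unquote_to_bytes(percent_encoded)
    return {codec: raw.decode(codec) for codec in codecs}


if __name__ == "__main__":
    base = "/d:/foo/bar/baz.htm"
    print("file://" + resolve_root_relative(base, "/foo/bar/pong.png"))
    # file:///d:/foo/bar/pong.png
    print("file://" + resolve_root_relative(base, "/f:/oof/rab/zab.htm"))
    # file:///f:/oof/rab/zab.htm

    print(decodings("%E3%81%A1"))
    # {'utf-8': 'ち', 'cp437': 'πüí', 'cp037': 'Ta~'}

    # The IRI side of the argument: the character stays the same, only the
    # bytes on disk differ per encoding.
    print("reçu".encode("cp437").hex(" "))
    # 72 65 87 75
```

The resolver deliberately works on paths rather than whole URIs, so it sidesteps the question of how many slashes follow "file:"; prepending "file://" to its result gives the three-slash form.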

Abide by these guidelines and, while not necessarily adhering to the strictest interpretation of a Standards Track RFC, at the least you'll be a well-intentioned and interoperable member of the internet community.

Matthew Kerwin

CC BY-SA 4.0
