Voting

The Note You're Voting On

22 years ago
--- 1) About "reserved" characters in URLS:

Beware that RFC 1738 specifies that the characters "{", "}", "|", "\", "^", "~", "[", "]", and "`" are all considered unsafe and SHOULD be URL-encoded with a "%xx" triplet within *ALL* URLs.

However, some HTTP URLs seem to use the "~" character as a prefix for a user account for example:
http://www.any.host.domain/~user/subpath/page.html?query#fragment

This usage is acceptable, but the RFC specifies that "%7E" should be used instead of "~" in the path component. HTTP servers should accept "~" as being equivalent to "%7E", and according to the RFC, the "%7E" form should be the canonical one.

However, some HTTP servers are not fully complying to this RFC and consider "%7E" differently from "~" (i.e. they consider it as being part of a path component name, and search a directory name containing a "~" character, instead of mapping the "~user" path component to a user's directory. In that case, these non compliant HTTP server will not find the resource associated to that URL and may return a 404 error or other errors such as an access denied.

When using rawurlencode() on such HTTP URLs, it's best to consider this legacy usage, by using str_replace() on the result to convert back "/%7E" to "/~", so that the URLs will correctly map to the legacy use of the "~" character by these servers. On compliant HTTP servers, they will treat the "~" unsafe character equivalently with the "%7E" recommanded form, so they will automatically canonicalize the "~" character into "%7E".

--- 2) Encoding of hostnames in URLs

Finally, beware that host domain names parts in URLs *MUST NOT* be encoded with rawurlencode(), as the "[" and "]" are valid delimiters that *MUST* be used to reference an IPv6 address or other hostnames that don't fit to the restricted set of characters allowed in a host name (the "[" and "]" characters MUST be used if the hostname includes characters such as ":" which is typically used to specify an alternate non-default port number).

The encoding of host names uses another encoding, required to encode international domain names, with a base-64 encoding of Unicode characters and a "bq--" prefix. This encoding must be used only on individual subdomain parts (separated by "." characters). This encoding does not use any "%xx" triplets.

So NEVER use urlencode() or rawurlencode() on an unparsed URL, unless this full URL is part of a query parameter string!

--- 3) Encoding of username/passwords in URLs:

There is no standard to specify a password in a URL. In fact, there's a legacy usage of the ":" character to separate a username from a password, but it is strongly discouraged. The RFC does not attempt to specify a semantic to the authentication part of an URL (before the "@" character and the hostname part).

If you need to encode a password, always use rawurlencode() on username and passwords separately, and then insert the ":" character to separate both components. Don't use urlencode() (which could use a "+" to encode a space, and would not work because usernames and passwords consider "+" and spaces as being different!)
<< Back to user notes page