Anonymous
22 years ago
rawurlencode() MUST not be used on unparsed URLs.

rawurlencode() should not be used on host and domain name parts. These may include international characters, which are encoded within each domain part by a "q--" prefix followed by a special encoding of the international name (currently in testbed).

rawurlencode() may be used on usernames and passwords separately (so that it won't encode the ':' and '@' separators).

rawurlencode() must not be used on whole paths (which may contain '/' separators): the ['path'] element of a parsed URL must first be exploded into individual "directory" names, and each name encoded on its own. A directory or filename that contains a space must not be encoded with urlencode() but with rawurlencode(), so that the space appears as a '%20' hex sequence (not '+').
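
The path rule can be sketched as follows. The helper name encode_path() is hypothetical (it is not part of PHP), and the sketch assumes the incoming path has not been percent-encoded already; otherwise the '%' characters would be encoded a second time.

```php
<?php
// Hypothetical helper: encode each path segment separately so that the
// '/' separators survive and a space becomes '%20' rather than '+'.
function encode_path(string $path): string
{
    $segments = explode('/', $path);
    return implode('/', array_map('rawurlencode', $segments));
}

echo encode_path('/my docs/read me.txt'), "\n"; // /my%20docs/read%20me.txt
```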

rawurlencode() must not be used to encode the ['query'] element of a parsed URL. Instead you must use the urlencode() function:

Typical queries use the '&' separator between parameters. This '&' separator is only a convention, used by the application/x-www-form-urlencoded format that HTML forms produce with the default GET method. However, when an HTML page references a URL that contains static query parameters, these '&' separators should be written in the HTML code as '&amp;' for HTML conformance. This is not part of the URL specification, but of the HTML encapsulation! Some browsers forget this, and send '&amp;' with their HTTP GET query. You may wish to replace '&amp;' with '&' when parsing and validating URLs. This should be done BEFORE calling urlencode() on query parts.
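
A minimal sketch of that order of operations, under stated assumptions: normalize_query() is a hypothetical helper name, and decoding each name and value with urldecode() before re-encoding is my own choice to avoid double encoding (it means a literal '+' in the input is read as a space).

```php
<?php
// Hypothetical helper: rebuild a raw query string with urlencode() applied
// to each name and value, after undoing a stray HTML '&amp;' escape.
function normalize_query(string $query): string
{
    // Treat an HTML-escaped separator as a plain '&' before splitting.
    $query = str_replace('&amp;', '&', $query);

    $pairs = [];
    foreach (explode('&', $query) as $pair) {
        if ($pair === '') {
            continue; // skip empty parameters such as in 'a=1&&b=2'
        }
        $parts = explode('=', $pair, 2);
        // Assumption: decode first so already-encoded input is not encoded twice.
        $name = urlencode(urldecode($parts[0]));
        $pairs[] = isset($parts[1])
            ? $name . '=' . urlencode(urldecode($parts[1]))
            : $name;
    }
    return implode('&', $pairs);
}

echo normalize_query('q=php manual&amp;lang=en'), "\n"; // q=php+manual&lang=en
```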

The ['fragment'] part of a parsed URL (after the first '#' separator found in any URL) must not be encoded with this rawurlencode() function but instead by urlencode().

Validating a URL sent in an HTTP request is therefore more complicated than you might think. It must be done only on parsed URLs (where the basic elements of the URL have been split apart); you must then explode the path components, and check for '&amp;' sequences in the query or fragment parts.

The next thing to do is to check the URL scheme that you want to support (for example, only 'http', 'https', or 'ftp').

You may wish to check the ['port'] part to see if it's really a decimal integer between 1 and 65535.
You may wish to remove the default port number used by the URL schemes you want to support (for example the port '80' for 'http', the port '21' for 'ftp', the port '443' for 'https'), and restrict severely all port numbers below 1024, or some critical ports below 140 (this includes DNS and NetBios ports).
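
The scheme and port checks might look like the sketch below. check_port() is a hypothetical helper, the whitelist only holds the three schemes named above, and refusing every non-default port below 1024 is one possible reading of "restrict severely". Note that parse_url() returns the port as an integer, so it would need a string cast before being passed in.

```php
<?php
// Hypothetical helper: returns the port to keep in the rebuilt URL,
// null when it can be dropped (missing, or the scheme's default), or
// false when the scheme or the port is rejected.
function check_port(string $scheme, ?string $port)
{
    // Assumed whitelist with default ports, from the schemes named above.
    $defaults = ['http' => 80, 'https' => 443, 'ftp' => 21];

    $scheme = strtolower($scheme);
    if (!isset($defaults[$scheme])) {
        return false; // unsupported scheme
    }
    if ($port === null || $port === '') {
        return null; // no explicit port
    }
    if (!ctype_digit($port)) {
        return false; // not a decimal integer
    }
    $n = (int) $port;
    if ($n < 1 || $n > 65535) {
        return false; // out of range
    }
    if ($n === $defaults[$scheme]) {
        return null; // default port: drop it
    }
    if ($n < 1024) {
        return false; // assumed policy: refuse other privileged ports
    }
    return $n;
}
```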

Then you may wish to control the ['host'] part strictly (it is in fact a full host domain name or an IP address). Forbid host names that don't contain at least one dot, that start with a dot, that contain two consecutive dots, that start or finish with a '-' dash, or that contain '.-' or '-.' (invalid in all domain names). Forbid those that contain two dashes in a position other than the second and third characters of a domain name part, or where the two dashes are not followed by at least one other character. Forbid top level domain names that have only one non-numeric character, or more than 6 characters (".museum" is, for now, the longest acceptable TLD). Check that pseudo-TLD names that are pure integers are in fact between 0 and 255; in that case check that the host is a valid IPv4 address by comparing it to long2ip(ip2long($host)), ...
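
A sketch covering a subset of these host rules; it does not implement the double-dash rule. is_acceptable_host() is a hypothetical helper, and the TLD length limit is a parameter because the 6-character figure reflects the time the note was written (longer TLDs have been delegated since).

```php
<?php
// Hypothetical helper covering a subset of the host rules listed above.
function is_acceptable_host(string $host, int $maxTldLength = 6): bool
{
    // Must contain a dot, must not start with one, no consecutive dots.
    if (strpos($host, '.') === false || $host[0] === '.' || strpos($host, '..') !== false) {
        return false;
    }
    $labels = explode('.', $host);
    $tld = end($labels);

    // A purely numeric last label means the host must be a valid IPv4 address.
    if ($tld !== '' && ctype_digit($tld)) {
        $long = ip2long($host);
        return $long !== false && long2ip($long) === $host;
    }

    foreach ($labels as $label) {
        // Letters, digits and '-' only; no leading or trailing '-'.
        // This also rejects '.-' and '-.' sequences and empty labels.
        if (!preg_match('/^[a-z0-9]([a-z0-9-]*[a-z0-9])?$/i', $label)) {
            return false;
        }
    }
    $len = strlen($tld);
    return $len >= 2 && $len <= $maxTldLength;
}
```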

Once this is done, apply rawurlencode() to the user name, password and exploded path elements, and urlencode() to the query and fragment parts, as described above, to recreate a complete and validated URL.
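
Putting the per-part rules from earlier in this note together, a reassembly step could look like the sketch below: rawurlencode() for the user name, password and each path segment, and urlencode()-style encoding for the query and fragment. rebuild_url() is a hypothetical helper. Decoding each part before re-encoding is an assumption made to avoid double encoding, and parse_str() is a shortcut that rewrites some parameter names (for example, dots become underscores).

```php
<?php
// Hypothetical helper: parse a URL, encode each part on its own, reassemble.
function rebuild_url(string $url): ?string
{
    $p = parse_url($url);
    if ($p === false || !isset($p['scheme'], $p['host'])) {
        return null; // not a usable absolute URL
    }
    $out = $p['scheme'] . '://';
    if (isset($p['user'])) {
        // User and password are encoded separately so ':' and '@' stay separators.
        $out .= rawurlencode(rawurldecode($p['user']));
        if (isset($p['pass'])) {
            $out .= ':' . rawurlencode(rawurldecode($p['pass']));
        }
        $out .= '@';
    }
    $out .= $p['host']; // the host is validated, never percent-encoded
    if (isset($p['port'])) {
        $out .= ':' . $p['port'];
    }
    if (isset($p['path'])) {
        // Encode each path segment on its own; decode first (assumption).
        $segments = array_map('rawurldecode', explode('/', $p['path']));
        $out .= implode('/', array_map('rawurlencode', $segments));
    }
    if (isset($p['query'])) {
        parse_str(str_replace('&amp;', '&', $p['query']), $params);
        // http_build_query() writes spaces as '+', like urlencode().
        $out .= '?' . http_build_query($params, '', '&');
    }
    if (isset($p['fragment'])) {
        $out .= '#' . urlencode(urldecode($p['fragment']));
    }
    return $out;
}

echo rebuild_url('http://example.com/a%20b/c?q=x+y'), "\n";
```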
