@OpenGeek - as far as I can tell, my reasoning was correct, even though my fix (removing the line) was the wrong answer. A different solution is below.
Also, to clarify: the filtering I’m talking about is this filtering:
$alias = preg_replace('/[^\.%A-Za-z0-9 _-]/', '', $alias); // strip non-alphanumeric characters
Explanation, and my possible solution below:
[tt]RFC3986 defines the syntax for URLs (actually URIs, but that’s a moot point) explicitly and states which characters are allowed in a URL. This includes little more than English letters and numbers from the lower half of the ASCII chart.[/tt]
That leads me to believe legal URIs can only use those valid characters (which is what the line I commented out filters for).
To solve this, we would need to encode any other characters using percent-encoding (again, see the page I linked to).
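To illustrate what that filter does in practice, here is a small sketch. The sample aliases are my own made-up values, not anything from the actual code base. Because the pattern has no /u modifier it works byte by byte, so every byte of a multi-byte UTF-8 character is stripped.

```php
<?php
// The filter pattern from the line quoted above.
$filter = '/[^\.%A-Za-z0-9 _-]/';

// Hypothetical sample aliases, for illustration only.
$ascii = preg_replace($filter, '', 'My Page_1.html');
$utf8  = preg_replace($filter, '', 'café-menu');

echo $ascii, "\n"; // plain ASCII passes through untouched
echo $utf8, "\n";  // the two bytes of "é" are removed
```

So an alias written entirely in a non-Latin script would come out of the filter nearly empty.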
Excerpted:
[tt] GET /wiki/%D7%A2%D7%9E%D7%95%D7%93_%D7%A8%D7%90%D7%A9%D7%99 HTTP/1.1
Host: he.wikipedia.org
In order to understand what this percent encoding means, you need to know a bit about Unicode. Basically, the Unicode URL is encoded in UTF8 and each byte of the UTF8-encoded string is encoded using percent encoding. The browser apparently recognized this specific encoding scheme (which isn’t documented anywhere I could fine) and displays nice internationalized URLs for the user.[/tt]
This is both a valid URI as far as the RFCs go, and it passes the UTF-8 through unmangled.
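As a quick check of that claim, this sketch takes the path segment from the excerpt above, decodes it with PHP's rawurldecode(), and encodes it again with rawurlencode(). The variable names are just mine. If the round trip is lossless, the UTF-8 bytes really do survive intact.

```php
<?php
// Path segment copied from the excerpted request line.
$path = '%D7%A2%D7%9E%D7%95%D7%93_%D7%A8%D7%90%D7%A9%D7%99';

// Decoding turns each %XX triplet back into one raw byte.
$decoded = rawurldecode($path);

// Re-encoding percent-encodes every byte outside the unreserved set.
$reencoded = rawurlencode($decoded);

echo strlen($decoded), "\n";                          // byte length of the UTF-8 title
echo ($reencoded === $path ? 'same' : 'differs'), "\n";
```

The underscore is left alone because it is in the unreserved set. Everything else is a two-byte UTF-8 sequence, each byte written as %XX.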
Further reference here also -
http://en.wikipedia.org/wiki/Percent-encoding
Again, relevant bits excerpted:
[tt]
The generic URI syntax mandates that new URI schemes that provide for the representation of character data in a URI must, in effect, represent characters from the unreserved set without translation, and should convert all other characters to bytes according to UTF-8, and then percent-encode those values. This requirement was introduced in January 2005 with the publication of RFC 3986. URI schemes introduced before this date are not affected.[/tt]
Having read this, I think we should simply be doing the following:
Check whether the string is UTF-8.
If it is, convert the UTF-8 text to percent-encoding. This should just be a matter of adding a
$alias = rawurlencode($alias); to the previous line. Then the A-Z filter would be (a) irrelevant and (b) unnecessary.
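Here is a rough sketch of those two steps, not a tested patch. The function name encodeAlias() is hypothetical, and falling back to the old filter for input that is not valid UTF-8 is my own assumption, since I haven’t decided what should happen in that case. The UTF-8 check uses preg_match() with the /u modifier so it doesn’t depend on the mbstring extension.

```php
<?php
// Hypothetical helper: the name and the fallback branch are assumptions.
function encodeAlias(string $alias): string
{
    // preg_match() with the /u modifier returns 1 only for valid UTF-8 input.
    if (preg_match('//u', $alias) === 1) {
        // Percent-encode every byte outside the RFC 3986 unreserved set.
        return rawurlencode($alias);
    }

    // Assumed fallback: keep the existing byte-wise filter for anything else.
    return preg_replace('/[^\.%A-Za-z0-9 _-]/', '', $alias);
}

echo encodeAlias('עמוד_ראשי'), "\n";
echo encodeAlias('my page'), "\n";
```

Note that rawurlencode() also encodes spaces (as %20) and literal % signs (as %25), so the result is not identical to what the old filter would let through for plain ASCII aliases.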
Thoughts?
The basic concepts and what I’ve explained are also available at the W3C site here -
http://www.w3.org/International/articles/idn-and-iri/