UTF-8 encoding doesn't work 100% anymore

46 Posts

computersolutions.cn Reply #31, 15 years, 3 months ago

I’ve commented out that line and it’s all OK; now my UTF-8 characters work.

Offending line: $alias = preg_replace('/[^\.%A-Za-z0-9 _-]/', '', $alias);

Next time, I’ll check the code first before pulling hair out!
Chinese is now working for me!

Eg - http://www.citrixtrainingseries.com/cn/Locations/北京.html

I don’t see how to log in or register in the bug ticket section, so can one of you add this as a bug?
It could probably go into Bug 874, as it’s more or less related.

171 Posts

etal Reply #32, 15 years, 3 months ago

I’d just like to confirm this bug. I tried using aliases in Hebrew a few days ago and had similar results.

24,544 Posts

BobRay Reply #33, 15 years, 3 months ago

Good catch. The login link is at the upper right. The "Signup" link is below the login button. It would help a lot if you could identify the file and line number. I think this bug also exists in Revolution. I couldn’t paste your alias in Revo successfully.

Did I help you? Buy me a beer
Get my Book: MODX:The Official Guide
MODX info for everyone: http://bobsguides.com/modx.html
My MODX Extras
Bob's Guides is now hosted at A2 MODX Hosting

46 Posts

computersolutions.cn Reply #34, 15 years, 3 months ago

@Bobray, sorry, not seeing that at all for the signup.

http://modxcms.com/bugs/

Just see login, no signup. My forum login isn’t working for that.

I’ve already said the filename in the above post, but I’ll repeat again -

Modx 0.963

Line 707* in manager/processors/save_content.processor.php
In stripalias()

Line - $alias = preg_replace('/[^\.%A-Za-z0-9 _-]/', '', $alias); // strip non-alphanumeric characters

*I may be 1 or 2 line numbers off, I added/removed debug code at various points in the code to find out where things b0rked, before I saw the obvious answer!
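
To illustrate why that line breaks non-Latin aliases, here is a minimal sketch that runs the same pattern over an ASCII alias and a UTF-8 one. The variable names are made up for the example; the Chinese string is written as escaped bytes so the result does not depend on how the file is saved.

```php
<?php
// The filter quoted above: keep only ., %, A-Z, a-z, 0-9, space, _ and -.
$pattern = '/[^\.%A-Za-z0-9 _-]/';

$ascii   = preg_replace($pattern, '', 'Locations');
// "\xE5\x8C\x97\xE4\xBA\xAC" is 北京 encoded as UTF-8 (six bytes).
$chinese = preg_replace($pattern, '', "\xE5\x8C\x97\xE4\xBA\xAC");

var_dump($ascii);   // string(9) "Locations"
var_dump($chinese); // string(0) ""
```

Without the /u modifier the pattern works byte by byte, and none of the UTF-8 bytes fall in the allowed set, so the whole alias is stripped.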

Regards,

Lawrence.

173 Posts

simonok Reply #35, 15 years, 2 months ago

I had issues with UTF-8 when upgrading MODX. As far as I could tell, the problem was that I was using phpmysql. Never back up the database with it; do it yourself on the server. I have now stopped using the phpmysql GUI.

I made my first site with modx
------------------------
Shopping blog
Sky+ HD

MODX Staff
10,725 Posts

opengeek Reply #36, 15 years, 2 months ago

Sorry, somehow I missed the part about just the alias being affected. The strip alias function does this intentionally. What needs to happen is this behavior needs to be moved into a plugin and then custom transliteration can be performed there rather than in the core. The core should only enforce not using characters that are illegal in URLs. Some work has already been done on this but it needs to be finalized and applied; I’ll prioritize this work for inclusion in Evolution 1.0 and Revolution 2.0 releases.

Jason Coward
Chief Architect @ MODX
http://www.jasoncoward.com | http://twitter.com/drumshaman | https://github.com/opengeek

24,544 Posts

BobRay Reply #37, 15 years, 2 months ago

Quote from: computersolutions.cn at Feb 15, 2009, 08:19 AM

@Bobray, sorry, not seeing that at all for the signup.

http://modxcms.com/bugs/

Ah . . . The "bugs & requests" link at the upper right of this page is supposed to take you to:

http://svn.modxcms.com/jira/

46 Posts

computersolutions.cn Reply #38, 15 years, 2 months ago

@OpenGeek -

Should we not be re-encoding non-ASCII UTF-8 characters into their URL-encoded state then, as opposed to filtering them out completely?

Take a look here for some discussion on this.

http://www.gooli.org/blog/unicode-and-permalinks/

MODX Staff
10,725 Posts

opengeek Reply #39, 15 years, 2 months ago

I think you are confusing yourself a little. UTF-8 does not equal Unicode; it is one way of representing Unicode that is backwards compatible with ASCII. The filtering you are talking about is called transliteration, and depending on your language, how (or even if) it is performed needs to differ. Thus, IMO, we need culture-specific plugins to handle the transliteration, with a default that simply makes sure no illegal characters are used in the URL (i.e. to use Unicode characters in the URL you would have to install a plugin to allow it, and one of those could simply be responsible for URL encoding rather than transliteration).
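
As a small illustration of the UTF-8 versus Unicode distinction, the sketch below counts the bytes and the code points of the same string. It assumes PCRE was built with UTF-8 support (the usual case); the variable names are invented for the example.

```php
<?php
// 北京 is two Unicode code points (U+5317, U+4EAC), stored here as six UTF-8 bytes.
$word = "\xE5\x8C\x97\xE4\xBA\xAC";

$byteCount  = strlen($word); // strlen() counts bytes, not characters
$codePoints = preg_split('//u', $word, -1, PREG_SPLIT_NO_EMPTY); // /u splits per code point

echo $byteCount, "\n";         // 6
echo count($codePoints), "\n"; // 2
echo bin2hex('A'), "\n";       // 41: an ASCII character keeps its single-byte value in UTF-8
```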

46 Posts

computersolutions.cn Reply #40, 15 years, 2 months ago

@OpenGeek - as far as I read things, I think I’m correct. Removing the line was the wrong answer (a different solution is below), but I had the right reasons!

Also, to clarify: the filtering I’m talking about is this filtering:
$alias = preg_replace('/[^\.%A-Za-z0-9 _-]/', '', $alias); // strip non-alphanumeric characters

Explanation, and my possible solution below:

[tt]RFC3986 defines the syntax for URLs (actually URIs, but that’s a moot point) explicitly and states which characters are allowed in a URL. This includes little more than English letters and numbers from the lower half of the ASCII chart.[/tt]

That leads me to believe legal URIs can only use those valid characters (which is what the line I commented out filters for).

To solve this, we would need to encode the characters using percent-encoding (again, see the page I linked to).

Excerpted:

[tt] GET /wiki/%D7%A2%D7%9E%D7%95%D7%93_%D7%A8%D7%90%D7%A9%D7%99 HTTP/1.1
Host: he.wikipedia.org

In order to understand what this percent encoding means, you need to know a bit about Unicode. Basically, the Unicode URL is encoded in UTF8 and each byte of the UTF8-encoded string is encoded using percent encoding. The browser apparently recognized this specific encoding scheme (which isn’t documented anywhere I could find) and displays nice internationalized URLs for the user.[/tt]

This is both a valid URI as far as the RFCs are concerned, and it passes the UTF-8 through unmangled.
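
As a quick check of the excerpt, the sketch below decodes the percent-encoded Wikipedia path back to its UTF-8 bytes and encodes it again. It only uses PHP's built-in rawurldecode() and rawurlencode(); the variable names are invented for the example.

```php
<?php
// Path segment taken from the request line in the excerpt above.
$path = '%D7%A2%D7%9E%D7%95%D7%93_%D7%A8%D7%90%D7%A9%D7%99';

$decoded   = rawurldecode($path);    // the UTF-8 bytes of the Hebrew page title
$reencoded = rawurlencode($decoded); // back to the ASCII-only form

echo strlen($decoded), "\n";    // 17 bytes: eight 2-byte letters plus "_"
var_dump($reencoded === $path); // bool(true)
```

The round trip is lossless, which is the point: the URL on the wire is plain ASCII, yet the original UTF-8 text can be recovered from it.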

Further reference here also - http://en.wikipedia.org/wiki/Percent-encoding

Again, relevant bits excerpted:
[tt]
The generic URI syntax mandates that new URI schemes that provide for the representation of character data in a URI must, in effect, represent characters from the unreserved set without translation, and should convert all other characters to bytes according to UTF-8, and then percent-encode those values. This requirement was introduced in January 2005 with the publication of RFC 3986. URI schemes introduced before this date are not affected.[/tt]

Having read this, I think we should simply be doing the following:
Check whether the string is UTF-8.
If it is, convert the UTF-8 text into percent-encoding. This should just be a matter of adding

$alias = rawurlencode($alias); to the previous line. Then the A-Z filter would be (A) irrelevant and (B) unnecessary.
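
A minimal sketch of that idea, assuming PHP 7 or later. The function name encode_alias is hypothetical and not part of MODX, and falling back to the original filter for invalid UTF-8 is an assumption made for the example, not something decided in this thread.

```php
<?php
// Hypothetical helper (not a MODX function): percent-encode an alias
// instead of stripping every non-ASCII character from it.
function encode_alias(string $alias): string
{
    // preg_match() with the /u modifier returns false when the subject
    // is not valid UTF-8, which makes it usable as a validity check.
    if (preg_match('//u', $alias) !== 1) {
        // Assumption: for invalid input, fall back to the original ASCII-only filter.
        return preg_replace('/[^\.%A-Za-z0-9 _-]/', '', $alias);
    }

    // Valid UTF-8: encode every byte outside the unreserved set as %XX.
    return rawurlencode($alias);
}

echo encode_alias("\xE5\x8C\x97\xE4\xBA\xAC"), "\n"; // %E5%8C%97%E4%BA%AC (北京)
echo encode_alias('my page'), "\n";                  // my%20page
```

Note that rawurlencode() also encodes spaces (as %20) and literal percent signs (as %25), both of which the original filter let through unchanged, so this is a change in behaviour for existing ASCII aliases as well.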

Thoughts?

The basic concepts and what I’ve explained are also available at the W3C site here -
http://www.w3.org/International/articles/idn-and-iri/