    • 5699
    • 46 Posts
    computersolutions.cn Reply #31, 15 years, 3 months ago
    I’ve commented out that line and it’s all OK - now my UTF-8 characters work.

    Offending line: $alias = preg_replace('/[^\.%A-Za-z0-9 _-]/', '', $alias);

    Next time, I’ll check the code first before pulling hair out!
    Chinese is now working for me!

    Eg - http://www.citrixtrainingseries.com/cn/Locations/北京.html

    I don’t see how to log in or register in the bug ticket section, so could one of you add this as a bug?
    It could probably go into Bug 874, as it’s more or less related.
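A minimal sketch of why that line drops non-Latin aliases. The pattern is the one quoted in the post; the wrapper function name `strip_alias_ascii_only` is hypothetical and only there for illustration. Without the `/u` modifier the pattern is matched byte by byte, and every byte of a multi-byte UTF-8 character falls outside the allowed class, so the whole character is removed.

```php
<?php
// Hypothetical wrapper around the filter quoted in the post above.
function strip_alias_ascii_only(string $alias): string
{
    // Keeps only . % A-Z a-z 0-9 space _ and -; everything else is deleted.
    return preg_replace('/[^\.%A-Za-z0-9 _-]/', '', $alias);
}

// A plain ASCII alias passes through untouched.
$latin = strip_alias_ascii_only('Locations-2009');

// Every byte of a UTF-8 Chinese alias is >= 0x80, so nothing survives.
$chinese = strip_alias_ascii_only('北京');

var_dump($latin);
var_dump($chinese);
```

With the filter in place the Chinese alias comes back empty, which matches the behaviour reported above.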
      • 9130
      • 171 Posts
      I’d just like to confirm this bug. I tried using aliases in Hebrew a few days ago and had similar results.
        • 3749
        • 24,544 Posts
        Good catch. The login link is at the upper right. The "Signup" link is below the login button. It would help a lot if you could identify the file and line number. I think this bug also exists in Revolution. I couldn’t paste your alias in Revo successfully.
          Did I help you? Buy me a beer
          Get my Book: MODX:The Official Guide
          MODX info for everyone: http://bobsguides.com/modx.html
          My MODX Extras
          Bob's Guides is now hosted at A2 MODX Hosting
          • 5699
          • 46 Posts
          computersolutions.cn Reply #34, 15 years, 3 months ago
          @Bobray, sorry, not seeing that at all for the signup.

          http://modxcms.com/bugs/

          Just see login, no signup. My forum login isn’t working for that.

          I’ve already said the filename in the above post, but I’ll repeat again -

          Modx 0.963

          Line 707* in manager/processors/save_content.processor.php
          In stripalias()

          Line - $alias = preg_replace('/[^\.%A-Za-z0-9 _-]/', '', $alias); // strip non-alphanumeric characters

          *I may be 1 or 2 line numbers off, I added/removed debug code at various points in the code to find out where things b0rked, before I saw the obvious answer!

          Regards,

          Lawrence.
            • 27305
            • 173 Posts
            I had issues with UTF-8 when upgrading MODX. As far as I could tell, the problem was that I was using phpmysql. Never back up the database with it; do it yourself on the server. I have now stopped using the phpmysql GUI.
            • Sorry, somehow I missed the part about just the alias being affected. The strip alias function does this intentionally. What needs to happen is this behavior needs to be moved into a plugin and then custom transliteration can be performed there rather than in the core. The core should only enforce not using characters that are illegal in URLs. Some work has already been done on this but it needs to be finalized and applied; I’ll prioritize this work for inclusion in Evolution 1.0 and Revolution 2.0 releases.
                • 3749
                • 24,544 Posts
                Quote from: computersolutions.cn at Feb 15, 2009, 08:19 AM

                @Bobray, sorry, not seeing that at all for the signup.

                http://modxcms.com/bugs/


                Ah . . . The "bugs & requests" link at the upper right of this page is supposed to take you to:

                http://svn.modxcms.com/jira/
                  • 5699
                  • 46 Posts
                  computersolutions.cn Reply #38, 15 years, 2 months ago
                  @OpenGeek -

                  Should we not be re-encoding non-ASCII UTF-8 characters into their URL-encoded form, rather than filtering them out completely?

                  Take a look here for some discussion on this.

                  http://www.gooli.org/blog/unicode-and-permalinks/



                  • I think you are confusing yourself a little. UTF-8 does not equal Unicode; it is one way of representing Unicode that is backwards compatible with ASCII. The filtering you are talking about is called transliteration, and depending on your language, how (or even if) it is performed needs to differ. Thus, IMO, we need culture-specific plugins to handle the transliteration, with a default that simply makes sure no illegal characters are used in the URL (i.e. to use Unicode characters in the URL you would have to install a plugin to allow it, and one of those could simply be responsible for URL encoding rather than transliteration).
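A rough sketch of the arrangement described above, assuming a strict default cleaner plus an optional pluggable one. None of these names come from MODX; `default_alias_cleaner` and `make_alias` are hypothetical, and a real plugin would hook in through the MODX event system rather than a callable argument.

```php
<?php
// Hypothetical default: keep only RFC 3986 "unreserved" characters.
function default_alias_cleaner(string $alias): string
{
    return preg_replace('/[^A-Za-z0-9._~-]/', '', $alias);
}

// Hypothetical entry point: an installed "plugin" (any callable) replaces
// the default behaviour, e.g. a transliterator or a percent-encoder.
function make_alias(string $title, ?callable $plugin = null): string
{
    $cleaner = $plugin ?? 'default_alias_cleaner';
    return $cleaner($title);
}

// Default behaviour: the non-ASCII part is dropped.
var_dump(make_alias('beijing-北京'));

// A plugin that only percent-encodes, using PHP's built-in rawurlencode().
var_dump(make_alias('北京', 'rawurlencode'));
```

The point of the split is that the core stays conservative while language-specific handling lives outside it.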
                      • 5699
                      • 46 Posts
                      computersolutions.cn Reply #40, 15 years, 2 months ago
                      @OpenGeek - as far as I read things, I think I’m correct (removing the line was the wrong answer, and a different solution is below, but I had the right reasons!)

                      Also, to clarify: the filtering I’m talking about is this filtering:
                      $alias = preg_replace('/[^\.%A-Za-z0-9 _-]/', '', $alias); // strip non-alphanumeric characters

                      Explanation, and my possible solution below:

                      [tt]RFC3986 defines the syntax for URLs (actually URIs, but that’s a moot point) explicitly and states which characters are allowed in a URL. This includes little more than English letters and numbers from the lower half of the ASCII chart.[/tt]

                      That leads me to believe legal URIs can only use valid characters (which the line I commented out filters for).

                      To solve this, we would need to encode the characters using percent-encoding (again, see the page I linked to).

                      Excerpted:

                      [tt] GET /wiki/%D7%A2%D7%9E%D7%95%D7%93_%D7%A8%D7%90%D7%A9%D7%99 HTTP/1.1
                      Host: he.wikipedia.org

                      In order to understand what this percent encoding means, you need to know a bit about Unicode. Basically, the Unicode URL is encoded in UTF8 and each byte of the UTF8-encoded string is encoded using percent encoding. The browser apparently recognized this specific encoding scheme (which isn’t documented anywhere I could find) and displays nice internationalized URLs for the user.[/tt]
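The excerpt's example can be reproduced with PHP's built-in `rawurlencode()` and `rawurldecode()`. This is only an illustrative sketch: the Hebrew page title is written with `\u{...}` escapes (PHP 7+) so the byte sequence is unambiguous, and the variable names are made up for the example.

```php
<?php
// The Hebrew page title from the excerpt, written as code point escapes.
$title = "\u{05E2}\u{05DE}\u{05D5}\u{05D3}_\u{05E8}\u{05D0}\u{05E9}\u{05D9}";

// Each Hebrew letter is two bytes in UTF-8; each byte becomes one %XX triplet.
// The underscore is an unreserved character and is left as-is.
$encoded = rawurlencode($title);
var_dump($encoded);

// Decoding restores the original UTF-8 bytes exactly.
var_dump(rawurldecode($encoded) === $title);
```

The encoded value is the path segment shown in the GET request above.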

                      This is both a valid URI as far as the RFCs are concerned, and it passes the UTF-8 through unmangled.

                      Further reference here also - http://en.wikipedia.org/wiki/Percent-encoding

                      Again, relevant bits excerpted:
                      [tt]
                      The generic URI syntax mandates that new URI schemes that provide for the representation of character data in a URI must, in effect, represent characters from the unreserved set without translation, and should convert all other characters to bytes according to UTF-8, and then percent-encode those values. This requirement was introduced in January 2005 with the publication of RFC 3986. URI schemes introduced before this date are not affected.[/tt]

                      Having read this, I think we should simply be doing the following:
                      Check whether the string is UTF-8.
                      If it is, convert the UTF-8 text into percent-encoding. This should just be a matter of adding

                      $alias = rawurlencode($alias); on the line before the filter. Then the A-Z filter would be irrelevant and unnecessary.
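A minimal sketch of that proposal, not the actual MODX fix. The function name `encode_alias` is hypothetical, and the UTF-8 check uses `preg_match('//u', ...)`, which is an assumption about how the check might be done (it returns 1 only when the subject is valid UTF-8).

```php
<?php
// Hypothetical helper: percent-encode first, then apply the existing filter.
function encode_alias(string $alias): string
{
    // An empty pattern with /u matches only if the subject is valid UTF-8.
    $isUtf8 = preg_match('//u', $alias) === 1;

    if ($isUtf8) {
        // Turns each non-unreserved byte into %XX (spaces become %20).
        $alias = rawurlencode($alias);
    }

    // The original filter now only sees ASCII, so encoded bytes survive.
    return preg_replace('/[^\.%A-Za-z0-9 _-]/', '', $alias);
}

var_dump(encode_alias('北京'));
var_dump(encode_alias('my page'));
var_dump(encode_alias("a\xFF")); // malformed UTF-8 is not encoded, then filtered
```

Two caveats with this sketch: plain ASCII is also valid UTF-8, so every alias goes through `rawurlencode()`, and an alias that already contains `%` would be encoded a second time.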

                      Thoughts?

                      The basic concepts and what I’ve explained are also available at the W3C site here -
                      http://www.w3.org/International/articles/idn-and-iri/