• [PLUGIN] SEO Strict URLs

  • JeremyL Reply #1, 5 years, 8 months ago

    SEO Strict URLs enforces the use of strict URLs to prevent duplicate content.

    Official Repository Release: http://www.modxcms.com/SEO-Strict-URLs-1.0-1337.html
    New support thread: http://modxcms.com/forums/index.php/topic,12452.0.html

    ApoXX has taken this plugin to a new level, and it now seems stable enough for an official release. From my understanding, it will become part of the core in a future release.

    Note that the code only works on 0.9.5+

    If you find any bugs please let us know.

    Current Features:
    # 301 Redirect from /index.php?id=8 to /alias.html
    # 301 Redirect from /page, /page/ to /page.html
    # 301 Redirect from non domain.com url to www.domain.com url (requires .htaccess edit)
    # If you switch Friendly URLs off, it redirects /page.html and /page to /index.php?id=48 - this enforces the options that have been selected.
    # 301 Redirect /{site_start}, /{site_start}.html to root folder / (no page in url)
    # Menu links pointing to "www.domain.com/{start_page}" are changed to "www.domain.com" (Can turn off)
    # If a page is marked as a folder in the modx system, it shows as /dir/ and not /dir.html (can turn off)
    # 301 Redirect /folder to /folder/ to mimic apache, only happens when file is marked as folder (can be turned off)
    # Individual URL and link rewriting. For example, you can set rss feed to default to feed.rss, not feed.rss.html. Also, menu links can link directly to the folder name rather than having to go through an extra level of redirects.

    ToDo
    # Any suggestions?

    Change Log
    Version 1.0 (summary of all versions up to 1.0)
    --------------
    # FIXED: All known bugs as of 02/24/2007
    # ADDED: Added ability to enforce alternate url extensions such as .cc or .xml

    Version 0.7
    --------------
    # ADDED: If a page is marked as a folder in the modx system, it shows as /dir/ and not /dir.html (can turn off)
    # ADDED: 301 Redirect /folder to /folder/ to mimic apache, only happens when file is marked as folder (can be turned off)

    Version 0.6
    -------------
    # FIXED BUG: When MODx is installed in a subfolder, calling /index.html caused an infinite redirect
    # FIXED BUG: When MODx is installed in a subfolder, any redirect added an extra folder to the redirect URL


    Install:


    Plugin name: SEO Strict URLs
    Description: <strong>1.0.0</strong> Enforces the use of strict urls to prevent dup content

    Plugin configuration:
    &editDocLinks=Edit document links;int;1 &makeFolders=Rewrite containers as folders;int;1 &override=Enable manual overrides;int;0 &overrideTV=Override TV name;string;seoOverride


    On Install: Check the OnWebPageInit & OnWebPagePrerender boxes in the System Events tab.

    For overriding documents, create a new template variable (TV) named seoOverride (or whatever was defined in the configuration) with the following options:

    Input Type: DropDown List Menu
    Input Option Values: Disabled==-1||Base Name==0||Append Extension==1||Folder==2
    Default Value: -1
    NOTE: You must set "Enable manual overrides" in plugin configuration to 1 to enable this TV
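
    The override values map onto URL shapes roughly as shown here. This is a minimal sketch rather than the plugin's own code: overrideUrl() and its parameters are hypothetical names, and the mapping is read off the switch statement in the plugin source.

```php
<?php
// Hypothetical helper, not part of the plugin. $dir is the parent path with a
// trailing slash, $suffix is the friendly URL suffix (for example ".html").
function overrideUrl($dir, $alias, $suffix, $option)
{
    switch ((int) $option) {
        case 0:   // Base Name: bare alias, no suffix
            return $dir . $alias;
        case 1:   // Append Extension: alias plus suffix, even for containers
            return $dir . $alias . $suffix;
        case 2:   // Folder: alias with a trailing slash
            return $dir . $alias . '/';
        default:  // Disabled (-1): no override, the plugin's normal rules apply
            return null;
    }
}
```

    For example, a document with the alias feed.rss and the option set to Base Name would be served as /feed.rss rather than /feed.rss.html.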

    // Strict URLs
    // version 1.0
    // Enforces the use of strict URLs to prevent duplicate content.
    // By Jeremy Luebke @ www.xuru.com
    // Contributions by Brian Stanback @ www.stanback.net
    
    // On Install: Check the "OnWebPageInit" & "OnWebPagePrerender" boxes in the System Events tab.
    // Plugin configuration: &editDocLinks=Edit document links;int;1 &makeFolders=Rewrite containers as folders;int;1 &override=Enable manual overrides;int;0 &overrideTV=Override TV name;string;seoOverride
    
    // For overriding documents, create a new template variable (TV) named seoOverride with the following options:
    //    Input Type: DropDown List Menu
    //    Input Option Values: Disabled==-1||Base Name==0||Append Extension==1||Folder==2
    //    Default Value: -1
    // NOTE: You must set "Enable manual overrides" in plugin configuration to 1 to enable this TV
    
    //  # Include the following in your .htaccess file
    //  # Replace "example.com" &  "example\.com" with your domain info
    //  RewriteCond %{HTTP_HOST} .
    //  RewriteCond %{HTTP_HOST} !^www\.example\.com [NC]
    //  RewriteRule (.*) http://www.example.com/$1 [R=301,L]
    
    // Begin plugin code
    $e = &$modx->event;
    
    if ($e->name == 'OnWebPageInit') 
    {
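       // $_SERVER['HTTPS'] may be unset (plain HTTP) or 'off' (IIS), so test for both
       $myProtocol = (!empty($_SERVER['HTTPS']) && $_SERVER['HTTPS'] != 'off') ? 'https' : 'http';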
       $s = $_SERVER['REQUEST_URI'];
       $parts = explode("?", $s);  
    
       $documentIdentifier = ($modx->documentIdentifier) ? $modx->documentIdentifier : $modx->config['error_page'];  // Set error page document ID if page is not found
       $alias = $modx->aliasListing[$documentIdentifier]['alias'];
       $isoverride = 0;  // Set to 1 below when the override TV selects "Base Name"
       $isfolder = ($makeFolders && count($modx->getChildIds($documentIdentifier, 1)) > 0) ? 1 : 0;
    
       if ($override && $overrideOption = $modx->getTemplateVarOutput($overrideTV, $documentIdentifier))
       {
          switch ($overrideOption[$overrideTV])
          {
             case 0:
                $isoverride = 1;
                break;
             case 1:
                $isfolder = 0;
                break;
             case 2:
                $makeFolders = 1;
                $isfolder = 1;
          }
       }
    
       if ($isoverride)
       {
          $strictURL = preg_replace('/[^\/]+$/', $alias, $modx->makeUrl($documentIdentifier));
       }
       elseif ($isfolder && $makeFolders)
       {
          $strictURL = preg_replace('/[^\/]+$/', $alias, $modx->makeUrl($documentIdentifier)) . "/";
       }
       else
       {
          $strictURL = $modx->makeUrl($documentIdentifier);
       }
    
       $myDomain = $myProtocol . "://" . $_SERVER['HTTP_HOST'];
       $newURL = $myDomain . $strictURL;
       $requestedURL = $myDomain . $parts[0];
    
       if ($documentIdentifier == $modx->config['site_start'])
       {
          if ($requestedURL != $modx->config['site_url'])
          {
             // Force redirect of site start
             header("HTTP/1.1 301 Moved Permanently");
             // Strip conflicting id/q from the query string; tolerate requests with no query string at all
             $qstring = ltrim(preg_replace("#(^|&)(q|id)=[^&]+#", '', isset($parts[1]) ? $parts[1] : ''), '&');
             if ($qstring) header('Location: ' . $modx->config['site_url'] . '?' . $qstring);
             else header('Location: ' . $modx->config['site_url']);
             exit;
          }
       }
       elseif ($parts[0] != $strictURL)
       {
          // Force page redirect
          header("HTTP/1.1 301 Moved Permanently");
          // Strip conflicting id/q from the query string; tolerate requests with no query string at all
          $qstring = ltrim(preg_replace("#(^|&)(q|id)=[^&]+#", '', isset($parts[1]) ? $parts[1] : ''), '&');
          if ($qstring) header('Location: ' . $strictURL . '?' . $qstring);
          else header('Location: ' . $strictURL);
          exit;
       }
    }
    elseif ($e->name == 'OnWebPagePrerender')
    {
       if ($editDocLinks)
       {
          $myDomain = $_SERVER['HTTP_HOST'];
          $furlSuffix = $modx->config['friendly_url_suffix'];
          $baseUrl = $modx->config['base_url'];
          $o = &$modx->documentOutput; // get a reference of the output
    
          // Reduce site start to base url
          $overrideAlias = $modx->aliasListing[$modx->config['site_start']]['alias'];
          $overridePath = $modx->aliasListing[$modx->config['site_start']]['path'];
          $o = preg_replace("#((href|action)=\"|$myDomain)($baseUrl)?($overridePath/)?$overrideAlias$furlSuffix#", '${1}' . $baseUrl, $o);
    
          if ($override)
          {
             // Replace manual override links
             $sql = "SELECT tvc.contentid as id, tvc.value as value FROM " . $modx->getFullTableName('site_tmplvars') . " tv ";
             $sql .= "INNER JOIN " . $modx->getFullTableName('site_tmplvar_templates') . " tvtpl ON tvtpl.tmplvarid = tv.id ";
             $sql .= "LEFT JOIN " . $modx->getFullTableName('site_tmplvar_contentvalues') . " tvc ON tvc.tmplvarid = tv.id ";
             $sql .= "LEFT JOIN " . $modx->getFullTableName('site_content') . " sc ON sc.id = tvc.contentid ";
             $sql .= "WHERE sc.published = 1 AND tvtpl.templateid = sc.template AND tv.name = '$overrideTV'";
             $results = $modx->dbQuery($sql);
             while ($row = $modx->fetchRow($results))
             {
                $overrideAlias = $modx->aliasListing[$row['id']]['alias'];
                $overridePath = $modx->aliasListing[$row['id']]['path'];
                switch ($row['value'])
                {
                   case 0:
                      $o = preg_replace("#((href|action)=\"($baseUrl)?($overridePath)?|$myDomain$baseUrl$overridePath/?)$overrideAlias$furlSuffix#", '${1}' . $overrideAlias, $o);
                      break;
                   case 2:
                      $o = preg_replace("#((href|action)=\"($baseUrl)?($overridePath)?|$myDomain$baseUrl$overridePath/?)$overrideAlias$furlSuffix/?#", '${1}' . rtrim($overrideAlias, '/') . '/', $o);
                      break;
                }
             }
          }
    
          if ($makeFolders)
          {
             // Replace container links
             foreach ($modx->documentListing as $id)
             {
                if (count($modx->getChildIds($id, 1)))
                {
                      $overrideAlias = $modx->aliasListing[$id]['alias'];
                      $overridePath = $modx->aliasListing[$id]['path'];
                      $o = preg_replace("#((href|action)=\"($baseUrl)?($overridePath)?|$myDomain$baseUrl$overridePath/?)$overrideAlias$furlSuffix/?#", '${1}' . rtrim($overrideAlias, '/') . '/', $o);
                }
             }
          }
       }
    }
    


  • Boby Reply #2, 5 years, 8 months ago

    Looks interesting, but I think this should be optional:
    If a page is marked as a folder in the modx system, it should be only seen as /dir/ and not /dir.html (can turn off)

    Think about documents with comments: they need to be marked as a directory because comments are stored as child documents.

    All the rest seems pretty cool.


  • PaulGregory Reply #3, 5 years, 8 months ago

    @Boby: That's probably covered by the "(can turn off)" bit.

    @JeremyL: Please remind people that they need to check the "OnWebPageInit" box in the System Events tab.

    I haven't changed the .htaccess - as far as I can comprehend, this is only needed for the "Redirect from non domain.com url to www.domain.com url" bit, which I don't need.

    I can confirm the main features work as advertised, and I have identified 2 additional features that could be considered plus points.

    1) If you switch Friendly URLs off, it redirects /page.html and /page to /index.php?id=48 - so essentially, this enforces the options that have been selected.
    2) Visiting www.domain.com redirects to www.domain.com/index.html - a good and useful anti-duplication feature.

    And there are 2 negative results - although to be fair these are both predictable from the above post information.

    3) I have a page with the alias feed.xml that I want to refer to as feed.xml rather than feed.xml.html - obviously this plugin redirects. However, I imagine I could either hack the plugin to truncate the $strictURL before any second ".", to solve that, or (possibly safer) replace .xml.html with .xml and .css.html with .css - if this is something other people will want to do, it would be worth including that in the main plugin.

    4) This broke my form. I assume this is due to the way that anything extra in the URL is considered to be "wrong", and is the same as the BUG in the todo list. This is therefore a fairly major oversight and really needs resolving ASAP. I think you need to do two things:
    i) collect all the &x=y values, miss out and add them back onto the $newURL.
    ii) strip the &x=y bits from $s before running the comparison.
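
    The two steps above can be sketched as follows. This is only an illustration under stated assumptions: buildRedirectTarget() is a hypothetical helper, not part of the plugin, and it assumes that q and id are the only parameters MODx itself uses for routing.

```php
<?php
// Hypothetical helper, not part of the plugin: returns the URL to redirect to,
// or null when the requested path already matches the strict URL.
function buildRedirectTarget($requestUri, $strictUrl)
{
    // Step ii: compare only the path, never the query string.
    $parts = explode('?', $requestUri, 2);
    $path = $parts[0];
    $query = isset($parts[1]) ? $parts[1] : '';

    if ($path === $strictUrl) {
        return null; // Already canonical; GET parameters stay untouched.
    }

    // Step i: keep every parameter except the routing keys q and id.
    parse_str($query, $params);
    unset($params['q'], $params['id']);
    $kept = http_build_query($params);

    return $kept === '' ? $strictUrl : $strictUrl . '?' . $kept;
}
```

    With this approach a request for /blog.html?start=2 is left alone, while /index.php?id=8&start=2 would be sent to /blog.html?start=2.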

    But a good start, and neither obstacle unsurmountable.

    EDIT: Have reread the bug text "BUG: Make blog.html?start=2 look like blog2.html. Currently blog.html?start=2 does not work." I disagree. Getting blog2.html to load the blog page and pass it is a massively complicated thing to do and is a feature not a bug fix. Getting blog.html?start=2 to work is the priority, as per 4) above.

    EDIT 2: To clarify, this works fine with POST requests, but obviously GET requests are killed.


  • JeremyL Reply #4, 5 years, 8 months ago

    @PaulGregory - What version are you testing on?


    Think about documents with comments: they need to be marked as a directory because comments are stored as child documents.

    Yea, that's the reason I was going to add the "can turn off" bit, because I knew some people wouldn't want this. But you have to remember, many people will try to link to the directory a page is in. So if the blog pages are in /blog/, they will link there just by habit. It will 301 redirect to /blog.html, but I would personally rather have a blog post look like an index page in a folder than have people linking to the wrong place.

    need to check the "OnWebPageInit" box in the System Events tab.

    Done

    1) If you switch Friendly URLs off, it redirects /page.html and /page to /index.php?id=48 - so essentially, this enforces the options that have been selected.
    2) Visiting www.domain.com redirects to www.domain.com/index.html - a good and useful anti-duplication feature.

    1) Great idea. I'll add it to the list. Not sure if a 404 will be thrown up first and prevent the feature though. Will take some testing.
    2) I was actually doing it the opposite way "# 301 Redirect /index, /index.html to root folder / (no page in url)". Search engines see www.domain.com/ as the root of the whole website. They see /index.html as a different document. It would be a bad idea to 301 redirect the root of the website to a document in my opinion. Instead /index.html will redirect back to the root www.domain.com/.

    I'll try and come up with some good solutions for #3 & #4 when I get home this afternoon from work.


  • PaulGregory Reply #5, 5 years, 8 months ago

    The one I tested it on happened to be 0.9.5 rev 1392.

    But you misunderstand. 1 & 2 are not feature requests, they are features that I have discovered already work whilst testing.

    So you can take out "If you switch Friendly URLs off, it redirects /page.html and /page to /index.php?id=48 - this enforces the options that have been selected." from the ToDo list and move it to the features list.

    Quote from: JeremyL at Sep 22, 2006, 09:00 AM
    2) I was actually doing it the opposite way "# 301 Redirect /index, /index.html to root folder / (no page in url)". Search engines see www.domain.com/ as the root of the whole website. They see /index.html as a different document. It would be a bad idea to 301 redirect the root of the website to a document in my opinion. Instead /index.html will redirect back to the root www.domain.com/.

    Ah, I see that on your to-do now. But whatever your future intention, the current situation is that it goes domain.com > domain.com/index.html.

    To make the site_home page go the other way, you'd probably have to build the target strict URL of the site_home, and change the headers accordingly. You have to be careful here as it's easy to end up with an infinite loop; indeed in testing a possible solution my browser detected the infinite loop and no longer follows 301 redirects for my test domain. I'm hoping a reboot will solve that but in the meantime have reverted the change I made.

    --

    I got 3 to work fine for me just adding two simple replaces after the target URL is built:
       $strictURL = $modx->makeUrl($modx->documentIdentifier);
       $strictURL = str_replace(".css.html",".css",$strictURL);
       $strictURL = str_replace(".xml.html",".xml",$strictURL);
    

    It's not pretty, but on further consideration it's probably faster than any generically useful way. So it is probably worth including this hack as a tip in the documentation, but not including it in the actual plugin.
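
    For comparison, a more generic variant of the same hack might look like this. It is a sketch only: stripDoubleExtension() is a hypothetical helper, the extension list is an assumption, and unlike the str_replace calls above it only touches the end of the URL.

```php
<?php
// Hypothetical helper, not part of the plugin: drops the friendly URL suffix
// when the alias already ends in one of the listed extensions.
function stripDoubleExtension($url, $suffix = '.html', array $extensions = array('css', 'xml'))
{
    if ($suffix === '') {
        return $url; // Nothing was appended, so there is nothing to strip.
    }
    foreach ($extensions as $ext) {
        $double = '.' . $ext . $suffix; // e.g. ".xml.html"
        if (substr($url, -strlen($double)) === $double) {
            return substr($url, 0, -strlen($suffix)); // keep ".xml", drop ".html"
        }
    }
    return $url;
}
```

    Because only listed extensions are affected, an alias such as script.aculo.us.html would keep its suffix.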

    Incidentally, it's probably worth pointing out that I hardwire references to feed.xml and stylesheet.css where I use them because [~11~] adds the .html (which I don't want)!

    Finally,
    # 301 Redirect /folder to /folder/ to mimic apache
    That too should be optional, I'd prefer /news to /news/



  • JeremyL Reply #6, 5 years, 8 months ago

    But you misunderstand. 1 & 2 are not feature requests, they are features that I have discovered already work whilst testing.

    That's what I get for being in a hurry this morning and skimming the post.

    Incidentally, it's probably worth pointing out that I hardwire references to feed.xml and stylesheet.css where I use them because [~11~] adds the .html (which I don't want)!

    Yea, that's what I need to think about. I actually consider this a bug in the FURL logic. Maybe a bug report on 0.9.5 might get it fixed before launch. I need to study what non html pages are being rewritten and what extensions are being forced on them.

    To make the site_home page go the other way, you'd probably have to build the target strict URL of the site_home, and change the headers accordingly. You have to be careful here as it's easy to end up with an infinite loop; indeed in testing a possible solution my browser detected the infinite loop and no longer follows 301 redirects for my test domain. I'm hoping a reboot will solve that but in the meantime have reverted the change I made.

    I was actually thinking of a simple conditional statement to check and see if the URI was www.domain.com/index* and doing the redirect and if it was domain.com/ then the main plugin code would be skipped.

    *Deleted unfinished sentence. Not sure how that happened and can't remember what I was saying before. Oh well*


  • PaulGregory Reply #7, 5 years, 8 months ago

    I assume the rest of your post will come along in a moment.

    MODx's FURL system - it just needs an option of "Don't add default extension to aliases with a . in them". I might see if I can suggest an update to the code before 0.9.5 hits. It has to be an option because previous versions have promoted the use of . in aliases (one page is called script.aculo.us.html, for example). Actually, I'd vote in favour of incorporating your plugin features into MODx. They really help MODx live up to the claim of SEO CMS.

    Quote from: JeremyL at Sep 22, 2006, 10:53 AM
    I was actually thinking of a simple conditional statement to check and see if the URI was www.domain.com/index* and doing the redirect and if it was domain.com/ then the main plugin code would be skipped. I think that

    Well, you can't literally check for index.* because that might not be the name of the home page. That's why I tried building a comparison string based on $modx->makeUrl(1); - although again, the home page might not be 1 - I just couldn't find the proper variable for site_home. I think this line will work; I just got something simple wrong and can't do any more tests for a bit.


  • JeremyL Reply #8, 5 years, 8 months ago

    Scratch this post, I found it.


  • opengeek Reply #9, 5 years, 8 months ago

    The configuration setting is called site_start and is available programmatically as $modx->config['site_start'] or via templates as [(site_start)]
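
    Using that setting, the home page check discussed earlier in the thread could be sketched like this. siteStartRedirect() is a hypothetical helper written for illustration; in a plugin the arguments would presumably come from $modx->documentIdentifier, $modx->config['site_start'] and $modx->config['site_url'].

```php
<?php
// Hypothetical helper, not part of the plugin: decides whether a request for
// the site start document should be redirected to the site root.
function siteStartRedirect($docId, $siteStartId, $requestedUrl, $siteUrl)
{
    if ($docId != $siteStartId) {
        return null; // Not the home page; other rules handle it.
    }
    if ($requestedUrl === $siteUrl) {
        return null; // Already at the root; redirecting again would loop.
    }
    return $siteUrl; // e.g. /index.html goes back to the root
}
```

    Returning null when the request is already the root is what avoids the infinite redirect loop mentioned above.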


  • JeremyL Reply #10, 5 years, 8 months ago

    I updated the code. It now redirects to www.domain.com/ and not /index.html.

    The internal links that MODx writes still point to /index.html. I now need to add something that parses the document so that internal links point to the right URL (domain.com).

    I think I'll need to brainstorm and study some code to figure out where these urls in the menus are written and where that can be rewritten. Any recommendations are welcome.
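
    One possible shape for that link rewriting, applied to the rendered page output, is sketched here. rewriteSiteStartLinks() is a hypothetical helper and not the plugin's implementation; it assumes links are written as href="..." or action="..." with double quotes.

```php
<?php
// Hypothetical helper, not part of the plugin: rewrites links to the site
// start page (for example /index.html) so that they point at the base URL.
function rewriteSiteStartLinks($html, $baseUrl, $alias, $suffix)
{
    // Escape the literal URL so that "." in the suffix is not a wildcard.
    $target = preg_quote($baseUrl . $alias . $suffix, '#');

    // Match only when the URL ends there: a closing quote or a query string.
    $pattern = '#((?:href|action)=")' . $target . '(?=["?])#';

    return preg_replace($pattern, '${1}' . $baseUrl, $html);
}
```

    A link such as href="/index.html" would become href="/", while href="/index.html.bak" is left alone.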