SEO-friendly Pagination with getPage

☆ A M B ☆
318 Posts

freejung Reply #1, 13 years, 5 months ago

Update 09/20/11:

Google has now introduced the rel next and rel prev convention for pagination. http://www.google.com/support/webmasters/bin/answer.py?answer=1663744

Given this, I would now recommend that you do use the technique described here, along with the next and prev links as specified by Google to achieve optimal SEO pagination.

Here is a bit of a hack I set up to do the next and prev links. I'm sure there's a better way to do it, but this works.

The idea is to check the page.nav placeholder for the anchor text of the next and previous links, and set placeholders for the corresponding rel="next" and rel="prev" link elements. This assumes you're using the SEO-friendly pagination technique outlined below; if not, you'll have to modify the code to point the next and prev links to the URLs of the appropriate paginated pages. Note that the code builds URLs of the form alias-2.html, which matches the variant of the rewrite rule without the "-page-" part; if you kept "-page-" in your rule, change '-' to '-page-' where the aliases are built. In this code, my next and prev nav links have the anchor text » and « respectively; replace those with the actual anchor text of your own next and previous nav links. Also, replace "html" with whatever you are using for a suffix.

Otherwise, just put this code in a snippet that you call right after the getPage call (which should be at the very beginning of the template, and you should set getPage to output to a placeholder). Make sure you put a property set on this snippet that contains the property pageVarKey, which should be the same as the pageVarKey property you are using for getPage. The way I did this was to just add this code to the paginationSEO snippet described below.

// Read the current page number from the request; default to page 1.
$properties =& $scriptProperties;
$properties['page'] = (isset($_REQUEST[$properties['pageVarKey']]) && ($page = intval($_REQUEST[$properties['pageVarKey']]))) ? $page : 1;

// getPage must already have run so that page.nav is populated.
$st_nav = $modx->getPlaceholder('page.nav');
$st_prevLink = '';
$st_nextLink = '';

if ($st_nav != '') {
    $st_alias = $modx->resource->get('alias');

    // A « in the nav means there is a previous page.
    if (strpos($st_nav, '«') !== false) {
        $prevPageNo = $properties['page'] - 1;
        // Page 1 lives at the plain alias, so only pages 3 and up link back to a numbered URL.
        $st_prev_alias = ($properties['page'] > 2) ? $st_alias . '-' . $prevPageNo : $st_alias;
        $st_prevLink = '<link rel="prev" href="/' . $st_prev_alias . '.html" />';
    }

    // A » in the nav means there is a next page.
    if (strpos($st_nav, '»') !== false) {
        $nextPageNo = $properties['page'] + 1;
        $st_next_alias = $st_alias . '-' . $nextPageNo;
        $st_nextLink = '<link rel="next" href="/' . $st_next_alias . '.html" />';
    }
}

// Empty placeholders are set when there is no prev or next page.
$modx->setPlaceholder('paginated-prev-link', $st_prevLink);
$modx->setPlaceholder('paginated-next-link', $st_nextLink);

Update 06/21/11:

After extensive analysis of the Google Panda update, my recommendation for SEO purposes is that in most cases you should not use the technique described here. Post-Panda, you may not want Google to index your paginated pages as separate URLs, as this may be seen as "shallow" content. If your paginated pages contain lots of unique, valuable content you might still use this technique, but in that case you might simply be better off putting the unique content on another URL altogether.

My current recommendation in most cases would be to leave GetPage completely alone and let search engines decide how to handle your paginated pages.

Disclaimer: take this advice with a grain of salt and use at your own risk. Nobody really knows how Panda works yet, so it's all guesswork at this point. Here's a great article about it on Search Engine Land:

http://searchengineland.com/why-google-panda-is-more-a-ranking-factor-than-algorithm-update-82564

/Update

GetPage rocks! It's a very powerful and easy way to do pagination. Unfortunately, the default implementation is not ideal from an SEO perspective. I've set up more search-optimized pagination with getPage, and I wanted to share my method in case anyone finds it useful or wants to improve upon it.

Disclaimer: I am very experienced with SEO, so I'm confident that the SEO practices in my implementation are sound, but I'm a total n00b at MODx development, so I'm sure there's a better way to do this. This method seems to be working for me, but YMMV. Also, it involves modifying your .htaccess file, and I can't guarantee that my rewrite rule won't conflict with other rules you already have set up, so if you've customized your .htaccess, exercise caution and check for possible conflicts.

There are three main problems with pagination using getPage, from an SEO perspective:

1: non-SEO-friendly URLs

The URLs of getPage pages look like /widgets.html?page=2. For ideal SEO, each page should have a unique plain URL with no query string.

Solution:

First, we need a RewriteRule in .htaccess to rewrite a friendly URL to the page URLs. Add a rule like the following to your .htaccess, _above_ the rule for normal MODx friendly URLs (replace html with whatever extension you use for text/html if different):

RewriteRule ^(.*)-page-([0-9]+)\.html$ $1.html?page=$2 [L,QSA]

This is similar to the friendly URL rule for normal MODx friendly URLs. You can replace the word "page" in the first part with whatever you want, or, if you are certain that you will never use an alias ending in a number, you can just leave out the -page- part altogether. The variable "page" on the right side should be replaced with whatever you are using for the pageVarKey property in getPage, if different.

Now the page at /widgets.html?page=2 can be accessed at /widgets-page-2.html.

Now we have to modify the page navigation templates so that they link to the friendly URLs. To do this, make a new property set for getPage and replace

[[+href]]

with

[[*alias]]-page-[[+pageNo]].html

in all of the navigation template properties.

2: Duplicate page 1 content

For good SEO, each page should only be accessed through a single URL. If more than one URL accesses the same content, this creates a duplicate page and can dilute your link juice, as well as leading to the wrong URL version being indexed and ranked.

The setup above will link back to page 1 from the subsequent pages by linking to /widgets-page-1.html. That's a problem, because this will display content identical to /widgets.html. This is also a problem with the default implementation of getPage, and it would be ideal if getPage were modified so that the subsequent pages link back to page 1 with just the plain alias. However, I don't want to modify getPage, because I want my setup to be compatible with subsequent versions.

Solution:

The only solution I can think of at the moment is to 301 redirect widgets-page-1.html to widgets.html. This is not an ideal solution, because a small amount of link juice is lost when you link through a 301, but it's much better than linking to duplicate pages. Simply add another line to .htaccess above the other rule like this:

RewriteRule ^(.*)-page-1\.html$ $1.html [R=301,L]
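
Putting the two rules together, the relevant part of .htaccess might look roughly like the sketch below. It assumes the stock MODx Revolution friendly-URL block (RewriteBase / and the index.php?q= rule) and the "html" suffix; both are assumptions, so compare against your own file before copying anything.

```apache
RewriteEngine On
RewriteBase /

# Page 1 has no numbered URL: 301 it to the plain alias.
RewriteRule ^(.*)-page-1\.html$ $1.html [R=301,L]

# Internally map /widgets-page-2.html to /widgets.html?page=2.
RewriteRule ^(.*)-page-([0-9]+)\.html$ $1.html?page=$2 [L,QSA]

# Stock MODx friendly-URL rule (assumed; check your own file).
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^(.*)$ index.php?q=$1 [L,QSA]
```

The order matters: the page-1 redirect has to come first, otherwise the general pagination rule would match /widgets-page-1.html before the redirect gets a chance.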

Update 09/20/11: It's occurred to me that you could also accomplish this by writing a snippet to pull the page.nav placeholder and remove the "page-1" from it. Maybe I'll give that a try and post the code if it works.
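
Here is a rough, untested sketch of what such a snippet could look like. The function name stripPageOneLinks is made up for illustration and is not part of getPage; the sketch assumes the nav links are built as [[*alias]]-page-[[+pageNo]].html as described above, and that the snippet is called after getPage has set the page.nav placeholder.

```php
<?php
// Hypothetical helper (not part of getPage): point links for page 1
// at the plain alias instead of "<alias>-page-1.<suffix>".
if (!function_exists('stripPageOneLinks')) {
    function stripPageOneLinks($nav, $alias, $suffix = 'html')
    {
        // The suffix is part of the search string, so links such as
        // "widgets-page-10.html" are left untouched.
        return str_replace(
            $alias . '-page-1.' . $suffix,
            $alias . '.' . $suffix,
            $nav
        );
    }
}

// Inside a MODx snippet $modx is already defined; the guard only
// lets the helper be loaded on its own without errors.
if (isset($modx) && $modx->resource) {
    $nav = (string) $modx->getPlaceholder('page.nav');
    $alias = $modx->resource->get('alias');
    $modx->setPlaceholder('page.nav', stripPageOneLinks($nav, $alias));
}
```

With this in place the 301 rule would only be needed as a safety net for links from other sites that still point at the -page-1 URL.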

3: Duplicate page titles, descriptions, and canonical links

For ideal SEO, each page of the site should have a unique title, meta description, and canonical link. However, all pages created with getPage will have the same head section, thus creating duplication. I wrote a simple snippet to solve this problem by using the page number from $_REQUEST to set placeholders for the title, description, and canonical link, using template variables to provide an array of titles and descriptions.

First you should set up template variables called "pagination-titles" and "pagination-descriptions" to hold the titles and descriptions. The values of these should be a list of titles and descriptions you want to use for your pages, separated by double-commas, like this: "page 1 title,,page 2 title,,page 3 title,,page 4 title" etc.

If you have a lot of pages, you probably want to set up some sort of sensible default values based on the pagetitle or some other variable. I use a TV called "primary-keyword" to store the primary keyword for each page, and I base my default titles and descriptions off of that.

Snippet: "paginationSEO"

It has only two properties: "pageVarKey", which should be set to the same value as the pageVarKey you use for getPage (its default should be "page", which is the default for pageVarKey); and "page", which should be set to 0 by default, as in getPage.


$properties =& $scriptProperties;
$properties['page'] = (isset($_REQUEST[$properties['pageVarKey']]) && ($page = intval($_REQUEST[$properties['pageVarKey']]))) ? $page : 1;

$st_titles = ',,' . $modx->resource->getTVValue('pagination-titles');//add an empty value for the 0th element
$ar_titles = explode(',,', $st_titles);
$st_paginated_title = (isset($ar_titles[$properties['page']])) ? $ar_titles[$properties['page']] : $ar_titles[1];

$st_descriptions = ',,' . $modx->resource->getTVValue('pagination-descriptions');
$ar_descriptions = explode(',,', $st_descriptions);
$st_paginated_description = (isset($ar_descriptions[$properties['page']])) ? $ar_descriptions[$properties['page']] : $ar_descriptions[1];

$st_paginated_alias = ($properties['page']!=1) ? $modx->resource->get('alias') . '-' . $properties['page'] : $modx->resource->get('alias');

$modx->setPlaceholder('paginated-title', $st_paginated_title);
$modx->setPlaceholder('paginated-description', $st_paginated_description);
$modx->setPlaceholder('paginated-alias', $st_paginated_alias);

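This creates placeholders [[+paginated-title]], [[+paginated-description]] and [[+paginated-alias]] that can be used to construct the appropriate head tags. If the number of pages exceeds the provided number of titles/descriptions, it just re-uses the values for the first page. Note that paginated-alias is built as alias-N, without "-page-", so if your rewrite rule uses "-page-", change the '-' in the snippet to match.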

Replace the title, description and canonical link in the head section of your template with this:

<meta name="description" content="[[+paginated-description]]"/>
<title>[[+paginated-title]]</title>
<link rel="canonical" href="[[++site_url]][[+paginated-alias]].html" />

This framework could be extended to provide other unique content for each page, such as headers and so forth, as needed. [ed. note: esnyder last edited this post 12 years, 7 months ago.]

freejung Reply #2, 13 years, 5 months ago

NOTE: I just realized that the code I originally posted will not work for more than 10 pages (I never have more than 10 pages, so it doesn’t matter to me). I edited the post above to fix that -- the regex in the first RewriteRule now matches a page number with any number of digits.

☆ A M B ☆
536 Posts

Last Of The Romans Reply #3, 13 years, 5 months ago

cool stuff, is that working for Evo?

palma non sine pulvere

183 Posts

mikrobi Reply #4, 13 years, 5 months ago

Great job. But do you really need to put that much effort into it?

I wouldn’t treat every widget.html?page=xy as a single page which should be stored in the index of a search engine. Often it makes no sense to provide different meta information because the main content of every widget.html?page=xy remains the same. For instance, I’m using getPage for pagination of user comments. The main content always remains the same; only a few comments below the content change when you switch to another page.

For me it makes more sense to add a rel="nofollow" or rel="noindex" to the pagination anchor tags, so that a search engine bot will ignore the pagination tags!?

Add-On to easily manage your multilingual sites: Babel

freejung Reply #5, 13 years, 5 months ago

lastoftheromans, no, isn't getPage for Revo only? Anyway, this method is for Revo. I'm not sure how you would go about doing pagination in Evo; I haven't tried. [ed. note: esnyder last edited this post 12 years, 7 months ago.]

freejung Reply #6, 13 years, 5 months ago

mikrobi, you're quite right, in many cases it might not make sense to have your subsequent pages indexed as separate "pages" by a search engine. In that case you might not need to do anything at all. SEs will probably handle your pages just fine as-is.

Personally I would not put a nofollow tag on the pagination links, because that would waste link juice. It would still count as a link for purposes of counting the number of links on the page, but it would not pass juice to the subsequent pages, resulting in an overall loss of juice.

Instead, I would simply put a canonical link at the top of the page that points to the main URL of the page. That would provide a strong hint to search engines to treat the whole thing as one "page" rather than as a collection of pages. I have a page on my site where the page is called with different query strings to produce very slightly different content, and I put a canonical link on it, and in Google it's just indexed as a single page.

You could do this by simply using the same type of canonical link tag that you would use for any normal page in MODx (most people should use this code in all of their templates anyway, regardless of whether they do any pagination), which should look like this:

<link rel="canonical" href="[[++site_url]][[*id:isnot=`[[++site_start]]`:then=`[[~[[*id]]]]`]]" />

Update 09/20/11: Google does not recommend using the canonical link element to point to the first page. It apparently works, but Google doesn't want you to do it. Instead, they recommend using rel next and rel prev links as described here: http://www.google.com/support/webmasters/bin/answer.py?answer=1663744

However, in many cases the subsequent pages might be worth indexing. Mine is such a case -- each subsequent page contains a unique list of sub-pages, and I want the paginated pages and the subpages all to be indexed as individual "pages."

If you do have unique content on each page (for example, if you are listing articles or, in my case, pictures) and you don't get the subsequent pages indexed, you may be missing out on chances to rank for variations of your keywords.

So yeah, if you don't care about indexing the subsequent pages, don't worry about it. If you do want them indexed, this method will probably accomplish that well. [ed. note: esnyder last edited this post 12 years, 7 months ago.]

freejung Reply #7, 13 years, 5 months ago

Another note: if you are already using getPage to do pagination and you want to switch to using this method, you will probably want to figure out a way to 301 redirect your existing paginated pages to the new friendly URLs. Check to see if the paginated pages are being indexed in Google by doing a search like site:example.com and look to see if they are listed as separate pages. If they are, you should redirect them.

This is actually pretty tricky: you can’t just redirect the old pages straight to the new ones, because the new friendly URLs are rewritten back to the old query-string form internally, which would create an infinite loop.

I think it should be possible to do it by choosing a new value for your pageVarKey and then redirecting the old paginated URLs to the new friendly ones with a 301 redirect. However, I’m struggling with the exact code to do this redirect in .htaccess, as it involves redirecting a URL with a query string to one without a query string. If you’re not sure how to do this, hold off on implementing this method and I’ll post code to do it as soon as I can.
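
One possible approach, offered as an untested sketch rather than a known-good solution, is to match on %{THE_REQUEST} instead of the rewritten URL. THE_REQUEST holds the raw request line the browser sent (for example "GET /widgets.html?page=2 HTTP/1.1") and is not changed by internal rewrites, so the friendly-URL rule should not re-trigger the redirect. The sketch assumes pageVarKey is "page", the suffix is "html", the "-page-" URL form is in use, and "page" is the first query parameter.

```apache
# Sketch only: place above the other pagination rules.
# Matches the original request line, e.g. "GET /widgets.html?page=2 HTTP/1.1".
RewriteCond %{THE_REQUEST} ^[A-Z]+\s/([^?\s]+)\.html\?page=([0-9]+)(&|\s)
# %1 is the path without the suffix, %2 is the page number.
# The trailing "?" drops the old query string from the redirect target.
RewriteRule ^ /%1-page-%2.html? [R=301,L]
```

Note that this drops any other query parameters, and that an old /widgets.html?page=1 URL would take two hops (first to /widgets-page-1.html, then to /widgets.html via the page-1 redirect).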

freejung Reply #8, 12 years, 10 months ago

I’ve updated the original post to reflect the massive changes to SEO practice caused by the recent Google Panda update. The basic upshot is that I no longer recommend using the technique described in the original post in most cases, unless you have exceptionally unique and valuable content on your paginated pages.

freejung Reply #9, 12 years, 7 months ago

I've updated the original post again with recommendations and code taking into account Google's current stance on pagination. In the current environment, I would recommend using this method to do pagination, along with the rel next and rel prev links as described by Google. I posted code above to do this.

834 Posts

MarkG Reply #10, 12 years, 1 month ago

I was having an issue with multiple category pages

/library?start=10
/library?start=20
/library?start=30
/library?start=40

So I'm trying a robots.txt exclusion to see if that works.

Disallow: /library?start=*

Thoughts?

Content Creator and Copywriter