    • 23183
    • 102 Posts
    I'm just busy writing my robots.txt but I'm not sure which pages I should disallow. Is it worth disallowing pages like registration success, login success, logout, etc.?

    Also, I am using FURLs. Should I use these in robots.txt, or page IDs? I am trying to stop robots from crawling anything to do with rating. Would the following work:

    User-agent: *
    Disallow: /*?page=*&atrGrp=*

    or would it be

    User-agent: *
    Disallow: /*&atrGrp=*

    Edit: Does this look about right if I'm using FURLs?

    User-agent: *
    Disallow: /manager/
    Disallow: /thanks-for-logging-in
    Disallow: /reg-success
    Disallow: /logout
    Disallow: /tos
    Disallow: /profile
    Disallow: /*&atrGrp=*
    Disallow: /*/*&atrGrp=*

    Should I add the login and register pages to the disallow list?
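
To see what the wildcard rules above would and would not catch, here is a minimal Python sketch. It assumes Google-style matching, where `*` matches any run of characters and a trailing `$` anchors the end of the URL; not every crawler supports these extensions. The helper names (`rule_to_regex`, `is_blocked`) and the `/shop...` URLs are made up for illustration.

```python
import re

def rule_to_regex(rule):
    """Translate a robots.txt path rule into a compiled regex.

    Assumption: Google-style matching, where '*' matches any run of
    characters, a trailing '$' anchors the end of the URL, and anything
    else is a literal that must match from the start of the path.
    """
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    # Escape each literal piece, then join the pieces with '.*' for every '*'.
    body = ".*".join(re.escape(part) for part in rule.split("*"))
    return re.compile(body + ("$" if anchored else ""))

def is_blocked(path, rules):
    """Return True if any Disallow rule matches the path from its start."""
    return any(rule_to_regex(rule).match(path) for rule in rules)

# Hypothetical URLs, checked against the rules proposed in the post.
print(is_blocked("/shop?page=2&atrGrp=5", ["/*&atrGrp=*"]))         # True
print(is_blocked("/shop/item?page=2&atrGrp=5", ["/*&atrGrp=*"]))    # True
print(is_blocked("/shop?atrGrp=5", ["/*&atrGrp=*"]))                # False
print(is_blocked("/shop?page=2&atrGrp=5", ["/*?page=*&atrGrp=*"]))  # True
```

Under these assumptions, `/*&atrGrp=*` already covers nested paths, so `/*/*&atrGrp=*` adds nothing. It does miss URLs where `atrGrp` is the first query parameter, because no `&` precedes it there.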
      • 10525
      • 247 Posts
      Strange why no-one ever answered this question in nearly 4 years... I thought it would have gathered more interest.

      Anyway, I need to know pretty much the same answer. Namely, what is the required/correct format for listing an individual page in the robots.txt file?

      Is it simply:
      Disallow: /index.php?id=xxx

      More importantly, is it possible for me to disallow all of the resources within a single parent resource? I have a resource container called 'sandbox'. In there I keep all my development/test/experimental resources. These may be published, but are always hidden from the menu. I just want to be able to hide from robots anything that I dump in this container.

      Is this possible?
        • 42046
        • 436 Posts
        Simply using Disallow: /sandbox/ will stop anything in your sandbox container from being indexed.
          • 10525
          • 247 Posts
          You sure Dan? Wouldn't that just disallow a physical directory of that name, as opposed to a generated one?

          Sottwell said this:

          http://forums.modx.com/thread/28404/revolution-robots#dis-post-404014

          but it doesn't touch on containers.
            • 42046
            • 436 Posts
            The friendly URLs are the ones that are crawled and appear in Google or wherever. Play around with Google Webmaster Tools: robots.txt will exclude container contents. It doesn't matter if it's a real directory or not.

            A more bulletproof way of locking down a test area of a site, though, is just to password-protect it with .htaccess.
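
The point that a Disallow rule works on the URL path, not on directories on disk, can be checked with Python's standard `urllib.robotparser`. This is only a sketch: the `example.com` host and the page names under `/sandbox/` are invented for illustration, and the parser does plain prefix matching, which is all this rule needs.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: block everything under the 'sandbox' container.
robots_txt = """\
User-agent: *
Disallow: /sandbox/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

def allowed(path):
    """Return True if a generic crawler may fetch this path on the example site."""
    return parser.can_fetch("*", "http://example.com" + path)

# The parser never touches the file system; it only compares URL prefixes.
print(allowed("/sandbox/test-page.html"))          # False
print(allowed("/sandbox/nested/experiment.html"))  # False
print(allowed("/about.html"))                      # True
```

Any friendly URL that starts with `/sandbox/` is treated as blocked, whether or not a physical directory of that name exists.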
              • 10525
              • 247 Posts
              Cheers Dan.
                • 27106
                • 147 Posts
                Quote from: absent42 at Nov 02, 2013, 06:14 PM
                A more bulletproof way of locking down a test area of a site, though, is just to password-protect it with .htaccess.
                The .htaccess method looks good but will require you to log in to see these pages even with the manager open. I suspect it also means that /thanks-for-logging-in and /reg-success and others like them will become inaccessible. There is a conceptual difference between pages which you want unindexed and pages which you want unpublished; "Not published" should be a subset of "not indexed".

                So here's another idea:

                1. Create a directory for resources you want to be available to the system but not seen by Google, within site searches etc. Within this directory, create a sub-directory for pages you do not want published. (This structure pays off in other ways too. For instance, you can then use getResources to write a chunk listing the pages that you want SimpleSearch not to search through.)
                2. Create a new document called "robots" with content type "text". You now have a robots.txt file.
                3. Inside the document, put the code below:
                User-agent: *
                Disallow: /manager/
                Disallow: /[[~666]]
                Sitemap: http://mysite.com[[~11]]

                ... where 666 is the id of your "not indexed" directory (and 11 is the id of your sitemap, if you have one).

                You now have one directory that most search engines won't index, no matter what you call it or where you put it.

                Does this make sense? Does it have gotchas I haven't spotted?
                  David Walker
                  Principal, Shorewalker DMS
                  Phone: 03 8899 7790
                  Mobile: 0407 133 020
                  • 9995
                  • 1,613 Posts
                  Nice one :)
                    Evolution user, I like the back-end speed and simplicity :)
                    • 27106
                    • 147 Posts
                    Quote from: PH!L at Nov 14, 2009, 12:17 PM
                    I'm just busy writing my robots.txt but I'm not sure which pages I should disallow. Is it worth disallowing pages like registration success, login success, logout, etc.?

                    Just to clarify my earlier comment: under the system I outlined, pages like registration success, login success, logout, etc. should go in your "Not indexed" directory.
                      • 19872
                      • 1,078 Posts
                      I got Redactor turned off by unchecking rich-text for the resource.

                      However, Google is giving me a 404 error, even though I can navigate to the robots.txt file with my browser. Is it perhaps better to just create the robots.txt file using DW or TextWrangler and upload it to the server?

                      User-agent: *
                      Disallow: /manager/
                      Sitemap: http://mydomain.com/sitemap.xml

                      If I type robots.txt into the field that says "enter URL if blocked" and then click Test, a green ALLOWED displays. I'm also able to see the contents if I click on "See live robots.txt".

                      Very confusing. I can't really tell whether my robots.txt file is accessible or not.
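
One way to narrow this down is to look at the HTTP status code the server actually returns, since a browser will happily display a page body even when the status is 404. Whether that is the cause here is not confirmed anywhere in the thread. The Python sketch below uses only the standard library; `fetch_status` is a made-up helper name, and the Googlebot-style User-Agent string merely imitates a crawler request, it is not a real Google fetch.

```python
import urllib.error
import urllib.request

# Assumption: this User-Agent only imitates Googlebot for comparison purposes.
CRAWLER_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

def fetch_status(url, user_agent=CRAWLER_UA):
    """Return (HTTP status code, Content-Type header) for a GET request to url."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return response.status, response.headers.get("Content-Type")
    except urllib.error.HTTPError as error:
        # 4xx and 5xx responses raise HTTPError; the code is what we want to see.
        return error.code, error.headers.get("Content-Type")

# Example call, using the placeholder domain from the post:
#     fetch_status("http://mydomain.com/robots.txt")
```

A result of 200 with a `text/plain` content type would suggest the file is reachable. A 404 alongside visible content would suggest the resource is being served with an error status.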