-
- 247 Posts
Strange that no one has answered this question in nearly 4 years... I thought it would have gathered more interest.
Anyway, I need pretty much the same answer: what is the correct format for listing an individual page in the robots.txt file?
Is it simply:
Disallow: /index.php?id=xxx
More importantly, is it possible for me to disallow all of the resources within a single parent resource? I have a resource container called 'sandbox'. In there I keep all my development/test/experimental resources. These may be published, but are always hidden from the menu. I just want to be able to hide from robots anything that I dump in this container.
Is this possible?
-
- 436 Posts
Simply using Disallow: /sandbox/ will tell well-behaved robots not to crawl anything in your sandbox container.
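If you want to check a rule like this before relying on it, here is a minimal sketch using Python's standard urllib.robotparser module. The page paths and the is_allowed helper are made up for illustration, and it assumes friendly URLs place the container's children under /sandbox/.

```python
from urllib.robotparser import RobotFileParser

# The rule from the post above, applied to every user agent.
ROBOTS_TXT = """\
User-agent: *
Disallow: /sandbox/
"""


def is_allowed(robots_txt: str, path: str, agent: str = "*") -> bool:
    """Return True if `agent` may fetch `path` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


if __name__ == "__main__":
    # Hypothetical paths: one inside the container, one outside it.
    for path in ("/sandbox/test-page.html", "/about.html"):
        print(path, "allowed" if is_allowed(ROBOTS_TXT, path) else "blocked")
```

Note that the rule is a plain prefix match, so it only covers URLs that actually start with /sandbox/.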
-
- 436 Posts
Friendly URLs are the ones that get crawled and appear in Google or wherever. Play around with Google Webmaster Tools: robots.txt will exclude the container's contents, and it doesn't matter whether it's a real directory or not.
A more bulletproof way of locking down a test area of a site, though, is just to password-protect it with .htaccess.
-
- 147 Posts
Quote from: absent42 at Nov 02, 2013, 06:14 PM
A more bullet proof way of locking down a test area of a site though is just to password protect it with htaccess.
The .htaccess method looks good but will require you to log in to see these pages even with the manager open. I suspect it also means that /thanks-for-logging-in and /reg-success and others like them will become inaccessible. There is a conceptual difference between pages which you want unindexed and pages which you want unpublished; "Not published" should be a subset of "not indexed".
So here's another idea:
- Create a directory for resources you want to be available to the system but not seen by Google, site searches, etc. Within this directory, create a sub-directory for pages you do not want published. (This structure pays off in other ways too. For instance, you can then use getResources to write a chunk listing the pages that you want SimpleSearch not to search through.)
- Create a new document called "robots" with content type "text". You now have a robots.txt file.
- Inside the document, put the code below:
User-agent: *
Disallow: /manager/
Disallow: /[[~666]]
Sitemap: http://mysite.com/[[~11]]
... where 666 is the id of your "not indexed" directory (and 11 is the id of your sitemap, if you have one).
You now have one directory that most search engines won't index, no matter what you call it or where you put it.
Does this make sense? Does it have gotchas I haven't spotted?
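One way to look for gotchas is to expand the link tags by hand and run the result through a parser. The sketch below is only an illustration, not MODX's own tag parser: the id-to-URL mapping is hypothetical, and it assumes link tags expand to relative URLs with no leading slash, which is why the template here puts a slash in front of each tag.

```python
import re
from urllib.robotparser import RobotFileParser

# Hypothetical mapping of resource ids to the relative URLs their link tags yield.
RESOURCE_URLS = {666: "not-indexed/", 11: "sitemap.xml"}

# The robots document content, with a slash before each link tag.
TEMPLATE = """\
User-agent: *
Disallow: /manager/
Disallow: /[[~666]]
Sitemap: http://mysite.com/[[~11]]
"""


def render_link_tags(template: str, urls: dict) -> str:
    """Replace each [[~id]] tag with the URL mapped to that id (illustration only)."""
    def replace(match: re.Match) -> str:
        return urls[int(match.group(1))]

    return re.sub(r"\[\[~(\d+)\]\]", replace, template)


rendered = render_link_tags(TEMPLATE, RESOURCE_URLS)

# Feed the rendered text to the standard-library robots.txt parser.
parser = RobotFileParser()
parser.parse(rendered.splitlines())

if __name__ == "__main__":
    print(rendered)
    # Hypothetical paths: a page nested in the container, and an ordinary page.
    for path in ("/not-indexed/unpublished/draft.html", "/index.html"):
        print(path, "allowed" if parser.can_fetch("*", path) else "blocked")
```

Without the slash after the domain, the Sitemap line would render as http://mysite.comsitemap.xml under the same assumption, so it is worth viewing the published robots.txt once to see what the tags really produce on your site.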
David Walker
Principal, Shorewalker DMS
Phone: 03 8899 7790
Mobile: 0407 133 020
-
- 1,613 Posts
nice one
Evolution user, I like the back-end speed and simplicity
-
- 147 Posts
Quote from: PH!L at Nov 14, 2009, 12:17 PM
I'm just busy writing my robots.txt but not sure what pages I should disallow. Is it worth disallowing pages like registration success, login success, logout, etc.?
Just to clarify my earlier comment: under the system I outlined, pages like registration success, login success, logout, etc. should go in your "Not indexed" directory.