This article follows Mod_Rewrite and .htaccess, which explains how an .htaccess file can be used to prevent search engines from indexing non-www web pages that contain exactly the same content as their www counterparts. By removing the 'duplicate content' we avoid the risk of being 'downgraded' by Google and other search engines.
Exactly the same principle applies to web page addresses like:
http://www.mysite.com/index.php
http://www.mysite.com/subfolder/index.php
when we want the content to be displayed only on:
http://www.mysite.com/
http://www.mysite.com/subfolder/
This can be done by using ModRewrite to permanently redirect (eg):
http://www.mysite.com/index.php
to
http://www.mysite.com/
The file index.php continues to exist on the website but there's no need for 'index.php' to appear in the page address for its content to be displayed. The same applies to 'index.html', 'default.html' (etc) and to 'index' pages located in sub-folders, eg '/subfolder/index.php' or '/subfolder/another/index.php'. Those filenames should never normally be displayed to the visitor. The process of hiding them is sometimes referred to as the canonicalization of index pages.
The .htaccess file
For websites running on the Apache web server (as many do), the Mod_Rewrite module can be enabled so that an .htaccess file placed in the root folder can contain rules on how web page requests should be rewritten or redirected by the 'rewriting engine'. The Mod_Rewrite rules to achieve the effect we want here are:
#
Options +FollowSymLinks
RewriteEngine On
#
# REDIRECT /folder/index.php to /folder/
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^/]+/)*index\.php\ HTTP/
RewriteRule ^(([^/]+/)*)index\.php$ http://www.mysite.com/$1 [R=301,L]
#
Piece by piece…
A line beginning with hash (#) is ignored by the web server and is useful to split up the rules visually, and to add comments.
Options +FollowSymLinks
RewriteEngine On
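For the rewriting engine to work in an .htaccess file, Options FollowSymLinks must be enabled and RewriteEngine must be set to On. The FollowSymLinks requirement is a security restriction imposed by Apache: if the option is not enabled for the folder, the rewrite rules cannot be used.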
# REDIRECT /folder/index.php to /folder/
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^/]+/)*index\.php\ HTTP/
RewriteRule ^(([^/]+/)*)index\.php$ http://www.mysite.com/$1 [R=301,L]
The .htaccess file eliminates the duplicate content problem by redirecting the visitor (and Google) from every web page address on the site that contains the superfluous index.php to the folder (directory) in which that file resides. Exactly the same content is presented as if the index.php file itself were being viewed, but index.php doesn't appear in the browser's address bar.
How the Mod_Rewrite works
(1) RewriteCond
Looking first at RewriteCond, we need to specify the conditions under which the RewriteRule will be processed by the server. Here, we want our rule to apply only when the visitor has explicitly requested an 'index.php' page on the domain. This prevents the .htaccess file from triggering an 'infinite loop' on the server, in which the redirect keeps repeating itself. When a folder address such as '/subfolder/' is requested, the server internally serves that folder's index.php, and without a condition the RewriteRule would match that internal request and redirect again. The condition tests the original request as sent by the browser: if it contains 'index.php', it has not yet been redirected. If it has been redirected, it won't contain 'index.php' and the RewriteRule won't be applied.
%{THE_REQUEST}
In this part, %{THE_REQUEST} is a standard server variable, in this instance the full request line sent by the visitor's browser, because that's what we're going to try to match in the second part. In RewriteCond, a server variable is written as %{NAME}: a % sign followed by the variable name in curly brackets.
^[A-Z]{3,9}\ /([^/]+/)*index\.php\ HTTP/
This second part is known as the 'condition pattern': a regular expression that the variable is tested against. The ^ caret anchors the match to the start of the request line. Looking at the rest of the regular expression in detail:
[A-Z]{3,9}\ matches from 3 to 9 occurrences of any uppercase letter (eg 'GET') followed by a space, which is escaped with a backslash.
/([^/]+/)* matches a forward slash followed by any quantity (including none) of [one or more characters that are not forward slashes, followed by a forward slash], eg '/subfolder1/subfolder2/'.
index\.php\ matches 'index.php' followed by a space - the backslashes are required to 'escape' (i) the dot metacharacter (to make it into a real dot) and (ii) the space before 'HTTP/'.
HTTP/ matches 'HTTP/'.
Why do we need all this? Because we're testing our condition against %{THE_REQUEST} - the first line of the client's request (the 'request line') for an 'index' page, which is typically something like:
GET /index.php HTTP/1.1
or
GET /subfolder1/index.php HTTP/1.1
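To make the condition pattern easier to experiment with, here is a minimal sketch that tests it against a few request lines. It uses Python's re module as a stand-in for Apache's regular expression engine, which is an assumption that should hold for a pattern this simple. The function name and the last three sample request lines are made up for illustration.

```python
import re

# Condition pattern copied unchanged from the RewriteCond line.
# Python's `re` module stands in for Apache's regex engine here (assumption).
COND_PATTERN = re.compile(r"^[A-Z]{3,9}\ /([^/]+/)*index\.php\ HTTP/")


def explicitly_requests_index(request_line: str) -> bool:
    """Return True when the request line itself names an index.php page."""
    return COND_PATTERN.search(request_line) is not None


# The first two lines come from the article; the rest are illustrative.
samples = [
    "GET /index.php HTTP/1.1",
    "GET /subfolder1/index.php HTTP/1.1",
    "GET / HTTP/1.1",
    "GET /subfolder1/ HTTP/1.1",
    "GET /index.php?page=2 HTTP/1.1",
]

for line in samples:
    print(f"{line!r} -> {explicitly_requests_index(line)}")
```

In this sketch only the first two request lines satisfy the condition. The folder-only requests do not, and neither does the request with a query string, because the pattern expects a space immediately after 'index.php'.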
(2) RewriteRule
Looking now at the RewriteRule, it contains three essential parts.
^(([^/]+/)*)index\.php$
This first part is the pattern that the requested path must match for the rule to apply. In an .htaccess file the path is tested without its leading forward slash. The ^ caret symbol defines the start, (([^/]+/)*) is a designated variable (using brackets) containing a regular expression that matches any quantity (including none) of [one or more characters that are not forward slashes, followed by a forward slash], eg 'subfolder1/subfolder2/', index\.php matches 'index.php', and the $ symbol defines the end.
http://www.mysite.com/$1
This second part is the address we want the visitor to be redirected to. It consists of the domain's root folder (homepage) plus the designated variable from the first part, expressed as $1.
In the above example, whatever the designated variable (([^/]+/)*) matched in the requested path is added by the server, as $1, to the end of http://www.mysite.com/. If the requested 'index' page is the site's homepage, the $1 variable will be empty and the server will simply redirect to http://www.mysite.com/. If the requested 'index' page is in a subfolder and the designated variable's value is 'folder1/', the server will redirect to http://www.mysite.com/folder1/.
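The way $1 is filled in can be sketched in a few lines of Python. This is only an approximation of what Apache does: the function name is hypothetical, Python's re module stands in for Apache's regex engine, and the removal of the leading slash imitates how a root-level .htaccess file sees the requested path.

```python
import re
from typing import Optional

# Pattern and target copied from the article's RewriteRule.
RULE_PATTERN = re.compile(r"^(([^/]+/)*)index\.php$")
TARGET_PREFIX = "http://www.mysite.com/"


def redirect_target(url_path: str) -> Optional[str]:
    """Return the redirect address for a path, or None if the rule does not match."""
    # A root-level .htaccess rule is matched against the path without its
    # leading slash, so the sketch strips one slash before matching.
    relative = url_path[1:] if url_path.startswith("/") else url_path
    match = RULE_PATTERN.match(relative)
    if match is None:
        return None
    # group(1) plays the role of $1 in the RewriteRule substitution.
    return TARGET_PREFIX + match.group(1)


for path in ["/index.php", "/folder1/index.php", "/folder1/", "/about.php"]:
    print(path, "->", redirect_target(path))
```

For '/index.php' the captured group is empty, so the target is just the homepage; for '/folder1/index.php' it is 'folder1/', which is appended to the homepage address. Paths that do not end in 'index.php' are left alone.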
[R=301,L]
This third part, the flags, designates any special instructions that might be needed, in this instance R=301 for a permanent redirect and L for 'last rule' so that no further rules are processed for the request once this one has matched.
The full rewrite rule is thus:
RewriteRule ^(([^/]+/)*)index\.php$ http://www.mysite.com/$1 [R=301,L]
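The article mentions earlier that the same approach applies to 'index.html', 'default.html' and similar files, but only shows the pattern for index.php. As a rough sketch, the pattern could be widened with an alternation. The alternation (index|default)\.(php|html?) below is an assumption of this example, not part of the original rules, and it is checked here with Python's re module rather than on a live Apache server.

```python
import re
from typing import Optional

# Hypothetical widened version of the RewriteRule pattern. It accepts
# index.php, index.html, index.htm, default.php, default.html, default.htm.
WIDENED_PATTERN = re.compile(r"^(([^/]+/)*)(index|default)\.(php|html?)$")


def folder_for(relative_path: str) -> Optional[str]:
    """Return the folder part (the $1 value) or None if the path is not an index page."""
    match = WIDENED_PATTERN.match(relative_path)
    return match.group(1) if match else None


for path in ["index.html", "folder1/default.html", "folder1/page.html"]:
    print(path, "->", repr(folder_for(path)))
```

If a pattern like this were used in the RewriteRule, the filename part of the RewriteCond pattern would need the same change, otherwise the condition would still only allow requests for index.php through.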
The RewriteRule in action
Here, again, is the full .htaccess file:
#
Options +FollowSymLinks
RewriteEngine On
#
# REDIRECT /folder/index.php to /folder/
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^/]+/)*index\.php\ HTTP/
RewriteRule ^(([^/]+/)*)index\.php$ http://www.mysite.com/$1 [R=301,L]
#
In plain English, it's saying that "if someone asks for a folder's 'index.php' page, redirect them to the folder address without 'index.php', and tell them (Googlebot included) that the move is permanent."
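The plain-English summary, and the reason the condition is needed to avoid a redirect loop, can be illustrated with a small simulation. This is a simplified model under stated assumptions, not a reproduction of Apache's internals: the handle function and its use_condition switch are hypothetical, and the step that turns a folder address into that folder's index.php is a stand-in for the server's own index-page lookup.

```python
import re

# Both patterns are copied from the article's .htaccess file.
COND = re.compile(r"^[A-Z]{3,9}\ /([^/]+/)*index\.php\ HTTP/")
RULE = re.compile(r"^(([^/]+/)*)index\.php$")


def handle(request_line: str, use_condition: bool = True):
    """Simulate one request; return ('redirect', address) or ('serve', path)."""
    _method, path, _protocol = request_line.split(" ")
    # Assumed stand-in for the server's index lookup: a folder address is
    # served from that folder's index.php. The request line is not changed.
    internal = path + "index.php" if path.endswith("/") else path
    rule_match = RULE.match(internal[1:])  # matched without the leading slash
    if use_condition:
        cond_ok = COND.search(request_line) is not None
    else:
        cond_ok = True
    if rule_match and cond_ok:
        return ("redirect", "http://www.mysite.com/" + rule_match.group(1))
    return ("serve", internal)


print(handle("GET /index.php HTTP/1.1"))             # explicit index.php request
print(handle("GET / HTTP/1.1"))                      # folder request, with condition
print(handle("GET / HTTP/1.1", use_condition=False)) # folder request, no condition
```

With the condition in place, a request for '/' is served from index.php without any redirect. With the condition switched off, the same request is redirected to the homepage address again, which is the address the browser has just asked for, so the redirect would never end.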
See this in action by typing http://www.patricktaylor.com/index.php into an HTTP header viewer. The first response header received is HTTP/1.1 301 Moved Permanently and the second is HTTP/1.1 200 OK. And of course it can be tested by attempting to view http://www.patricktaylor.com/index.php in your browser.