Monday, July 20, 2009

Mod_Rewrite and .htaccess

Grabbing code snippets off the web and re-using them on one's own websites is easy enough to do. Every web designer solves a problem this way at one time or another. Having done so, why not take a little trouble to understand what the code is doing? This article looks at a simple example of a code snippet and attempts to demystify some of the so-called voodoo surrounding rewriting URLs with .htaccess.

Mod_Rewrite is an Apache web server module that is often installed on shared web hosting packages. If the module is available, a special file named .htaccess can be uploaded to the server, containing rules on how web page requests should be handled 'behind the scenes' by the 'rewriting engine'. The .htaccess file is normally placed in a website's root folder to apply its effect to all pages on the domain.

Why have an .htaccess file?

An .htaccess file is important to any webmaster who is interested in a good ranking in search engines, especially Google. It has many uses, the most basic being to prevent search engines from indexing different URLs that serve exactly the same content.

A simple .htaccess example: the canonical URL

Consider two web pages:

http://www.mysite.com/
http://mysite.com/

Technically, these two URLs are different pages, but they contain exactly the same content when viewed. If Google indexes both, there's a risk that one, or the other, or both, will be 'downgraded' by Google as 'duplicate content'. With the .htaccess file, this can be prevented by nominating only one as the 'canonical' homepage. Here's an example of what to put in the file:

#
Options +FollowSymLinks
RewriteEngine On
#
# REDIRECT to canonical url
RewriteCond %{HTTP_HOST} ^mysite\.com [NC]
RewriteRule ^(.*)$ http://www.mysite.com/$1 [R=301,L]
#

Piece by piece…

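A line beginning with a hash (#) is ignored by the web server. Such lines are useful for splitting up the rules visually and for adding comments. A comment must sit on a line of its own; it cannot follow a directive on the same line.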

Options +FollowSymLinks
RewriteEngine On

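For the rewriting engine to work in an .htaccess file, the FollowSymLinks option must be enabled and RewriteEngine must be set to On. Apache imposes the FollowSymLinks requirement for security reasons; on many hosts the option is already enabled, in which case the Options line does no harm.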
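
If there is any doubt about whether the module is loaded on a given host, the same two directives can be wrapped in an IfModule block. This is only a sketch and is not part of the original snippet. It assumes the host allows Options to be overridden in .htaccess, which varies between hosting providers.

```apache
# Sketch (assumption: the host allows Options overrides in .htaccess).
# Everything inside the block is skipped when mod_rewrite is not loaded,
# instead of triggering a server error on every request.
<IfModule mod_rewrite.c>
    Options +FollowSymLinks
    RewriteEngine On
</IfModule>
```

The trade-off is that a missing module then fails silently: the site keeps working, but no redirect happens, so the result should still be tested.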

# REDIRECT to canonical url
RewriteCond %{HTTP_HOST} ^mysite\.com [NC]
RewriteRule ^(.*)$ http://www.mysite.com/$1 [R=301,L]

The 'canonical URL' is the preferred internet address for a web page, and in the above instance is any page at http://www.mysite.com/. The .htaccess file removes the duplicate content problem by redirecting the visitor (and Google) from the non-www version to the with-www version. This means that only canonical URLs will ever be served, for all the pages on the domain, not just the homepage.

How Mod_Rewrite works

(1) RewriteCond

Looking firstly at RewriteCond, we need to specify the conditions under which the RewriteRule will be processed by the server, and here, we want our rule to apply only when a visitor (or Google) attempts to view http://mysite.com/any-page (without www).

%{HTTP_HOST}

In this first part, %{HTTP_HOST} is a standard server variable, in this instance the host (domain name) sent with the request, because that's what we're going to try to match in the second part. In RewriteCond, a server variable is written as %{NAME}: a percent sign followed by the variable name in curly braces.

^mysite\.com

This second part is the condition pattern. The ^ caret symbol anchors the match to the start of the host name, and mysite\.com is the pattern to be matched, in this instance the host mysite.com without www (HTTP_HOST holds only the host name, not the http:// prefix). The backslash before the dot is needed to 'escape' it, because in a regular expression the dot is a special 'metacharacter' that matches any single character. Escaping the dot converts it back to a normal character, a plain dot.
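
Because the pattern is anchored only at the start, it would also match a longer host name that merely begins with mysite.com. A stricter variant, offered as a sketch and not as a correction to the original, adds $ to anchor the end as well:

```apache
# Sketch: match the host name mysite.com exactly, not just as a prefix.
# ^ anchors the start, $ anchors the end, \. matches a literal dot.
# Caveat: a Host header that carries a port (mysite.com:8080) would no
# longer match this stricter pattern.
RewriteCond %{HTTP_HOST} ^mysite\.com$ [NC]
```

Whether the extra strictness is wanted depends on how the site is served; for a typical shared hosting account on the default port, either form redirects the bare domain.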

[NC]

This third part is known as the flag. [NC] stands for no case (case-insensitive).

The full rewrite condition is thus:

RewriteCond %{HTTP_HOST} ^mysite\.com [NC]

(2) RewriteRule

Looking now at the RewriteRule, it contains three essential parts.

^(.*)$

This first part is the pattern that the requested URL path is matched against (in an .htaccess file, the path is tested without its leading slash). The ^ caret symbol marks the start, (.*) is a capture group (the brackets) containing a regular expression that matches any sequence of characters, and the $ symbol marks the end.

http://www.mysite.com/$1

This second part is the substitution: the address that the request is turned into. It consists of the canonical URL, plus whatever was captured by the brackets in the first part, expressed as $1. If the pattern had two sets of brackets we could use $1 and $2.

In the above example, the (.*) (any combination of characters, eg: 'about-us.html') is added by the server, after the page has been requested, as $1 to the end of http://www.mysite.com/ to make http://www.mysite.com/about-us.html.
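
To illustrate $1 and $2 together, here is a hypothetical rule with two sets of brackets. The articles/ path and the article.php script are invented for this example and have nothing to do with the canonical URL snippet. Note that there is no R flag, so this one really is handled behind the scenes and the visitor's address bar does not change.

```apache
# Hypothetical example: the articles/ path and article.php are invented
# for illustration only.
# A request for articles/42/my-title is rewritten internally to
# /article.php?id=42&slug=my-title
# $1 is the first bracketed group (the digits), $2 the second (the slug).
RewriteRule ^articles/([0-9]+)/([a-z-]+)$ /article.php?id=$1&slug=$2 [L]
```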

[R=301,L]

This third part, the flags, gives any special instructions for the rule: in this instance R=301 to send a permanent redirect to the browser, and L for 'last rule', so that once this rule has matched, no later rules in the file are applied to the request.
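
Browsers tend to cache a 301 redirect, which can make a mistake awkward to undo while experimenting. One cautious approach, sketched here as a suggestion and not as part of the original snippet, is to test with a temporary redirect first and switch to 301 once the rule behaves as expected:

```apache
# Sketch for testing only: the same rule with a temporary (302) redirect.
# Change R=302 back to R=301 once the redirect is confirmed to work.
RewriteCond %{HTTP_HOST} ^mysite\.com [NC]
RewriteRule ^(.*)$ http://www.mysite.com/$1 [R=302,L]
```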

The full rewrite rule is thus:

RewriteRule ^(.*)$ http://www.mysite.com/$1 [R=301,L]

The RewriteRule in action

Here, again, is the full .htaccess file:

#
Options +FollowSymLinks
RewriteEngine On
#
# REDIRECT to canonical url
RewriteCond %{HTTP_HOST} ^mysite\.com [NC]
RewriteRule ^(.*)$ http://www.mysite.com/$1 [R=301,L]
#

In plain English, it's saying: "if someone tries to open any page on our website without www at the front, redirect them to the version of the page with www, and tell every client, Google(bot) included, that the move is permanent."
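
The same technique works in the other direction for a site that prefers its address without www. The following is a sketch under that assumption, reusing the placeholder domain from above:

```apache
# Sketch: make the non-www address the canonical one instead.
# Use this in place of the www rule, never alongside it, or the two
# redirects would bounce requests back and forth.
Options +FollowSymLinks
RewriteEngine On
RewriteCond %{HTTP_HOST} ^www\.mysite\.com [NC]
RewriteRule ^(.*)$ http://mysite.com/$1 [R=301,L]
```

Which version to treat as canonical is a matter of preference; what matters is choosing one and redirecting the other to it.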

The redirect can be tested by typing a web page address like http://patricktaylor.com/mod_rewrite-htaccess into an HTTP header viewer. The first response received is HTTP/1.1 301 Moved Permanently and, once the redirect has been followed, the second is HTTP/1.1 200 OK. And of course the addition of www can be tested by pasting http://patricktaylor.com/ into your browser's address bar.

A general note: on some shared web hosting accounts, the .htaccess file can't be seen when the root folder is opened in an FTP client. This can often be corrected by enabling server side filtering in the FTP client program and setting the remote filter as -rtaF. The precise details of how to do this will vary from one program to another.
