Saturday, February 9, 2008

Web site performance - don't forget cache-control / content expiration (and a partial explanation of what browsing looks like from the HTTP perspective)

Cache-Control is a header field inside the HTTP header that instructs the web browser and/or proxy server how to handle the data transferred. The main use of this directive is cache handling.

Usually, when you surf to a web page for the first time, the process at the application layer (OSI layer 7) is as follows:
- The browser sends an HTTP/S request for a specific URI. For example, this is the request sent to Google when requesting the advanced-search page:

GET /advanced_search?hl=iw HTTP/1.1
Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, application/vnd.ms-excel, application/vnd.ms-powerpoint, application/msword, application/x-icq, application/x-shockwave-flash, application/xaml+xml, application/vnd.ms-xpsdocument, application/x-ms-xbap, application/x-ms-application, application/x-silverlight, */*
Accept-Language: en-us
Accept-Encoding: gzip, deflate
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; InfoPath.1; .NET CLR 2.0.50727; .NET CLR 3.0.04506.30; .NET CLR 3.0.04506.590; .NET CLR 3.0.04506.648; .NET CLR 3.5.21022)
Host: www.google.com
Connection: Keep-Alive
Cookie: PREF=ID=XXXXXXXXXXXXXXXXXX


- The web server gets the request and responds. If everything went well, the response starts with the status line "HTTP/1.1 200 OK", followed by the response headers and then the HTML content of the requested page.

- Once the client has finished getting the content, the browser parses the data and extracts all of the assets (JavaScript, Flash, images etc.) it needs to get from the server in order to assemble and render the page. Since in this scenario it is the first time the user surfs to this page, the browser has none of the assets stored on the hard drive in its "temporary internet files". The browser opens connections to the server with some level of parallelism (IE opens 2 TCP connections while Firefox, not very politely, opens 6 TCP connections) and requests the assets. This is what it looks like when the browser requests the Google logo image:
GET /intl/en_com/images/logo_plain.png HTTP/1.1

- The server sends back the image, and the HTTP header of the response includes the Expires directive:
Expires: Sun, 17 Jan 2038 19:14:07 GMT

- If the user's browser is using the default configuration, it will store the Google logo PNG file together with its expiration date. If the user surfs to the Google search page again before the expiration date, the browser will not request the image from the web server. It will be taken from the hard drive (or memory - depends on the browser and the settings).

- If the user surfs to the search page again after the expiration date, the browser will detect that it has the Google logo in its cache but that its validity period has expired. The browser will add to the request an "If-None-Match" header field with the ID (ETag) of the asset, or an "If-Modified-Since" header field with the date of its cached copy:
If-None-Match: "8e9bc4e4e50c71:76f"

- The server will examine the request and decide whether the image on the server has the same ID (meaning it has not changed) or a new ID. If the asset has changed, the server will send the new asset with the "HTTP/1.1 200 OK" code. If the asset is still the same, the server will respond with:

HTTP/1.1 304 Not Modified
Date: Sat, 09 Feb 2008 11:29:39 GMT
Etag: "8e9bc4e4e50c71:76f"

This tells the browser that its cached copy of the asset is still valid, so the image itself is not sent again.
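
To make the flow above a bit more concrete, here is a minimal sketch in Python of the decisions the browser makes for a cached asset. The function names (is_fresh, revalidation_headers, handle_response) are made up for illustration and are not taken from any real browser; real caches follow more rules than this.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def is_fresh(expires_header, now):
    """Return True if the cached copy can be used without asking the server."""
    # Parse an HTTP date such as "Sun, 17 Jan 2038 19:14:07 GMT".
    expires_at = parsedate_to_datetime(expires_header)
    return now < expires_at


def revalidation_headers(cached_etag=None, cached_last_modified=None):
    """Build the conditional request headers sent once the copy has expired."""
    headers = {}
    if cached_etag is not None:
        headers["If-None-Match"] = cached_etag
    if cached_last_modified is not None:
        headers["If-Modified-Since"] = cached_last_modified
    return headers


def handle_response(status_code, cached_body, response_body):
    """304 means the cached copy is still valid; otherwise use the new body."""
    if status_code == 304:
        return cached_body
    return response_body
```

For example, with the Expires value shown above, is_fresh("Sun, 17 Jan 2038 19:14:07 GMT", datetime.now(timezone.utc)) stays true until 2038, so no request is sent for the logo at all.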


Now, as you can see, the performance hit caused by forgetting to use cache-control can be severe: unnecessary round-trips, a slow site, throughput problems and, overall, a crappy user experience.

So how can you set this cache control?

It can be done within your HTML code by adding meta tags such as meta http-equiv="Expires", or you can do it within your web server. In the IIS console, just right-click the asset (for a specific asset policy) or a folder containing the assets and select: Properties->HTTP Headers->Enable content expiration, then choose the correct policy.
In the Apache web server you can do it in the httpd.conf file - you can read a much more detailed explanation about Apache configuration here
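
Whichever server you use, the end result is a couple of extra headers on the response. As a rough sketch (not the IIS or Apache implementation), this is how those headers could be built in Python. The function name expiration_headers is hypothetical; max-age is the standard Cache-Control way of saying "valid for this many seconds".

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime


def expiration_headers(max_age_seconds, now=None):
    """Return Cache-Control and Expires headers for the given lifetime."""
    if now is None:
        now = datetime.now(timezone.utc)
    expires_at = now + timedelta(seconds=max_age_seconds)
    return {
        # Relative lifetime in seconds.
        "Cache-Control": f"max-age={max_age_seconds}",
        # Absolute expiration time as an HTTP date in GMT.
        "Expires": format_datetime(expires_at, usegmt=True),
    }
```

Calling expiration_headers(3600) gives an asset a one-hour lifetime, counted from the moment the response is built.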

Just remember - use it wisely. If you over-cache your site, your visitors will suffer from strange behavior and out-of-date data/look-and-feel. Organize your asset hierarchy on the web server so that it is easy to set a different policy for each asset type.
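
One way to picture "a different policy for each asset type" is a small lookup table. The sketch below keys the policy by file extension rather than by folder, and the lifetimes in it are made-up values for illustration only, so pick your own. POLICY_BY_EXTENSION and cache_control_for are hypothetical names.

```python
import posixpath

# Hypothetical lifetimes in seconds per asset type (illustrative values only).
POLICY_BY_EXTENSION = {
    ".png": 30 * 24 * 3600,  # images rarely change
    ".jpg": 30 * 24 * 3600,
    ".js": 7 * 24 * 3600,    # scripts change more often
    ".css": 7 * 24 * 3600,
    ".html": 0,              # pages should always be revalidated
}
DEFAULT_MAX_AGE = 0


def cache_control_for(path):
    """Pick a Cache-Control value based on the asset's file extension."""
    extension = posixpath.splitext(path)[1].lower()
    max_age = POLICY_BY_EXTENSION.get(extension, DEFAULT_MAX_AGE)
    if max_age == 0:
        # "no-cache" lets the browser keep a copy but forces revalidation.
        return "no-cache"
    return f"max-age={max_age}"
```

Unknown asset types fall back to revalidating every time, which is the safer default if you are worried about over-caching.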
