My main goal when I started looking at Google App Engine (3 days ago) was to use it as a “CDN for the rest of us”: a way to cache static content (initially) and have that content distributed across all of Google's infrastructure (maybe the most powerful cloud right now).
What do we want?
- Create a CDN that is easy to update and free of charge for static resources (images, css, js)
- Consume as little bandwidth as possible by leveraging the If-Modified-Since/Last-Modified/304 Not Modified model
Hands-on:
The first step, of course, was to search Google for some help; Andreas Krohn's post helped a lot to get started.
But I wanted to go further and handle the If-Modified-Since requests that modern browsers send, and that is where the Google framework and a little Python come to the rescue.
Note: I’m assuming you’ve already installed the Python environment and the Google App Engine SDK.
First of all let me give you two little .bat files that are useful:
Start the test webserver (test.bat):
dev_appserver.py c:\ipsojobscloud
Upload your application to the cloud (update.bat):
appcfg.py update c:\ipsojobscloud
Note: simply replace c:\ipsojobscloud with the folder you are working in, the one that contains your app.yaml.
Then I set up the app.yaml; it’s very simple (16 lines):
application: ipsojobscloud
version: 1
runtime: python
api_version: 1

handlers:
- url: /favicon.ico
  static_files: favicon.ico
  upload: favicon.ico

- url: /images/favicon.ico
  static_files: favicon.ico
  upload: favicon.ico

- url: /.*
  script: cacheheaders.py
This app.yaml simply tells GAE the name of the application (ipsojobscloud) and the version we’re working on (use only the major release number; GAE automatically takes care of the .x when you upload).
Then we specify two handlers for the favicon.ico static file and a catch-all handler that routes every other request to the Python script cacheheaders.py.
With that environment set, we simply code the cacheheaders.py file. Let’s see it in detail.
The skeleton of the file is:
import wsgiref.handlers
from google.appengine.ext import webapp

class MainPage(webapp.RequestHandler):
    def get(self, dir, file, extension):
        ...

def main():
    application = webapp.WSGIApplication([(r'/(.*)/([^.]*).(.*)', MainPage)], debug=False)
    wsgiref.handlers.CGIHandler().run(application)

if __name__ == "__main__":
    main()
Here we import the webapp framework and define the class MainPage. In the main section, the only change from the GAE sample is the regular expression used to match our requests. In r'/(.*)/([^.]*).(.*)' the r prefix marks a Python raw string, so backslashes are not treated as escapes (a common convention when writing regular expressions). The pattern takes one slash, followed by an arbitrary number of characters and another slash: /(.*)/. The parentheses tell the regular expression to keep the string between the two slashes as a variable. The next part, ([^.]*)., takes all characters except a dot and puts them into the second variable, then consumes one more character (the dot is not escaped, so strictly speaking it matches any character, not only a literal dot). Finally, we take the rest of the input as a third variable with (.*).
This regular expression is designed to capture only paths like /images/helloworld.gif, where the variables are images, helloworld and gif respectively.
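To see what the pattern captures, here is a small standalone sketch using Python's re module outside of App Engine. The helper name split_request_path is made up for this example, and the use of fullmatch assumes the framework matches the pattern against the whole request path.

```python
import re

# Same pattern as in the handler mapping above.
# Note the unescaped dot after the second group: it matches any character.
PATTERN = re.compile(r'/(.*)/([^.]*).(.*)')

def split_request_path(path):
    """Return (dir, file, extension), or None when the path does not match.

    Hypothetical helper for illustration; it is not part of webapp.
    """
    match = PATTERN.fullmatch(path)
    return match.groups() if match else None

print(split_request_path('/images/helloworld.gif'))
# ('images', 'helloworld', 'gif')
```

A path without a dot, such as /images/helloworld, still matches because of the unescaped dot, but it yields an empty extension, which the extension check in the handler then rejects.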
Note: Of course that’s not a complete solution; we can only have one folder depth, but improving that is a good exercise for the reader.
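One possible way to approach that exercise, as a rough sketch only: let the directory group contain slashes, escape the dot, and refuse ".." segments. The pattern and the helper name split_nested_path are assumptions of this example, not part of the original script, and the folder whitelist in the handler would also need to change (for instance, to check only the first path segment).

```python
import re

# Hypothetical pattern: the directory part may contain slashes,
# the dot is escaped, and file name and extension may not contain '/' or '.'.
NESTED = re.compile(r'/(.+)/([^/.]+)\.([^/.]+)')

def split_nested_path(path):
    """Return (directory, name, extension), or None for paths we refuse."""
    match = NESTED.fullmatch(path)
    if match is None:
        return None
    directory, name, extension = match.groups()
    # Reject '..' segments so a request cannot climb out of the served folders.
    if '..' in directory.split('/'):
        return None
    return directory, name, extension

print(split_nested_path('/images/icons/small/ok.png'))
# ('images/icons/small', 'ok', 'png')
```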
What you need to know is that when a request arrives, it is mapped to the get method with the parameters dir, file and extension (and don’t forget the first “self” parameter).
Let’s see the code of the get function in detail:
First, check the validity of the parameters received and set the correct content-type based on the extension:
    def get(self, dir, file, extension):
        if (dir!='js' and dir!='css' and dir!='images'):
            self.error(404)
            return
        if (extension!='js' and extension!='css' and extension!='jpg' and extension!='png' and extension!='gif'):
            self.error(404)
            return
        if extension=='js':
            self.response.headers['Content-Type'] = 'application/x-javascript'
        elif extension=='css':
            self.response.headers['Content-Type'] = 'text/css'
        elif extension=='jpg':
            self.response.headers['Content-Type'] = 'image/jpeg'
        elif extension=='gif':
            self.response.headers['Content-Type'] = 'image/gif'
        elif extension=='png':
            self.response.headers['Content-Type'] = 'image/png'
Note: the first two ifs are completely optional. They check that the dir variable is in our list of valid dirs (js, css, images) and that the file extension is in our allowed list (js, css, jpg, png, gif). Change those checks or remove them completely at your convenience.
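If the list of folders or extensions grows, the chain of ifs can get long. A minimal alternative sketch keeps the same values in a set and a dictionary; the names ALLOWED_DIRS, CONTENT_TYPES and content_type_for are invented for this example.

```python
# Allowed folders and extension-to-MIME mapping, copied from the checks above.
ALLOWED_DIRS = frozenset(['js', 'css', 'images'])
CONTENT_TYPES = {
    'js': 'application/x-javascript',
    'css': 'text/css',
    'jpg': 'image/jpeg',
    'gif': 'image/gif',
    'png': 'image/png',
}

def content_type_for(directory, extension):
    """Return the Content-Type to send, or None when the request should get a 404."""
    if directory not in ALLOWED_DIRS:
        return None
    # dict.get returns None for extensions that are not in the allowed list.
    return CONTENT_TYPES.get(extension)
```

Inside the handler, a None result would translate into self.error(404), and any other value would be assigned to the Content-Type header. Adding a new file type then means adding one dictionary entry.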
And now the tricky part:
        try:
            import os
            import datetime
            path = dir+'/'+file+"."+extension
            info = os.stat(path)
            lastmod = datetime.datetime.fromtimestamp(info[8])
            if self.request.headers.has_key('If-Modified-Since'):
                dt = self.request.headers.get('If-Modified-Since').split(';')[0]
                modsince = datetime.datetime.strptime(dt, "%a, %d %b %Y %H:%M:%S %Z")
                if modsince >= lastmod:
                    # The file is older than the cached copy (or exactly the same)
                    self.error(304)
                    return
                else:
                    # The file is newer
                    self.output_file(path, lastmod)
            else:
                self.output_file(path, lastmod)
        except:
            self.error(404)
            return
First we import some packages (os, datetime), then create a variable “path” with the complete path of the file we want to retrieve:
    path = dir+'/'+file+"."+extension
Then we take the file's info from the operating system and keep the last modified date in the lastmod variable. Note that if an error occurs (a non-existing file, for example), the except part will be executed, returning a 404 Not Found response to the browser.
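One detail worth keeping in mind: info[8] is the st_mtime field, and datetime.fromtimestamp converts it to the machine's local time, while HTTP dates are always expressed in GMT. That only lines up when the server clock runs in UTC, which is an assumption about the hosting environment. The sketch below, written for a current Python 3 interpreter rather than the App Engine runtime, reads the modification time explicitly as UTC; the helper name last_modified_utc is made up for this example.

```python
import datetime
import os

def last_modified_utc(path):
    """Return the file's modification time as a naive UTC datetime.

    Returns None when the file cannot be read (the case the handler
    answers with a 404).
    """
    try:
        mtime = os.stat(path).st_mtime
    except OSError:
        return None
    # HTTP dates have one-second resolution, so drop the fractional part.
    utc = datetime.datetime.fromtimestamp(int(mtime), datetime.timezone.utc)
    return utc.replace(tzinfo=None)
```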
In the following lines we scan the request headers looking for an If-Modified-Since header; if we find it, we take the date part and parse it:
    if self.request.headers.has_key('If-Modified-Since'):
        dt = self.request.headers.get('If-Modified-Since').split(';')[0]
        modsince = datetime.datetime.strptime(dt, "%a, %d %b %Y %H:%M:%S %Z")
Then we compare the last modification date of the file against the If-Modified-Since date and act accordingly. Note that self.error(304) returns a 304 (Not Modified) response code to the browser:
    if modsince >= lastmod:
        # The file is older than the cached copy or the same
        self.error(304)
        return
    else:
        # The file is newer
        self.output_file(path, lastmod)
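The same decision can be isolated in a small function that is easy to test without a server. This is only a sketch for a current Python 3 interpreter: it swaps strptime for email.utils.parsedate_to_datetime from the standard library, and the function name is_not_modified is invented here. It assumes lastmod_utc is a naive datetime expressed in UTC.

```python
import datetime
from email.utils import parsedate_to_datetime

def is_not_modified(if_modified_since, lastmod_utc):
    """Return True when the browser's cached copy is still valid (send a 304)."""
    if not if_modified_since:
        return False
    # Some old browsers appended "; length=NNN" to the date; keep only the date.
    date_part = if_modified_since.split(';')[0].strip()
    try:
        since = parsedate_to_datetime(date_part)
    except (TypeError, ValueError):
        # Unparseable header: fall back to sending the file.
        return False
    if since.tzinfo is not None:
        # Normalise to naive UTC so both sides of the comparison match.
        since = since.astimezone(datetime.timezone.utc).replace(tzinfo=None)
    return since >= lastmod_utc.replace(microsecond=0)
```

Treating a malformed header as "modified" is a deliberately conservative choice: the worst case is sending the file once more than necessary.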
self.output_file(path, lastmod) is a method we have defined to avoid code duplication:
    def output_file(self, path, lastmod):
        import datetime
        try:
            self.response.headers['Cache-Control'] = 'public, max-age=31536000'
            self.response.headers['Last-Modified'] = lastmod.strftime("%a, %d %b %Y %H:%M:%S GMT")
            expires = lastmod+datetime.timedelta(days=365)
            self.response.headers['Expires'] = expires.strftime("%a, %d %b %Y %H:%M:%S GMT")
            # Binary mode, so images are not altered on platforms that translate line endings
            fh = open(path, 'rb')
            self.response.out.write(fh.read())
            fh.close()
            return
        except IOError:
            self.error(404)
            return
As you can see, we import datetime to manipulate dates and try to do the following:
- Set the Cache-Control header to make the response as cacheable as possible
- Set the Last-Modified header (IMPORTANT! The first time we send the file, the browser keeps its Last-Modified date; that is the value it will send back in later If-Modified-Since requests, to which we will usually respond 304 Not Modified!)
- Calculate an expiry date in the future (we’ve used 365 days)
- Set the Expires header to this value (last-modified + 365 days)
- Open the file, send it to the output and finally close the file
- Return, because once we have output the file we’re done
Note: If something goes wrong, we return a standard Not Found (404) response.
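The header values from the list above can also be built in one place and checked without a server. The following is a sketch for a current Python 3 interpreter; cache_headers and http_date are hypothetical names, and email.utils.formatdate stands in for strftime because it always emits English day and month names regardless of locale. It assumes lastmod_utc is a naive datetime in UTC.

```python
import calendar
import datetime
from email.utils import formatdate

ONE_YEAR = datetime.timedelta(days=365)

def http_date(moment_utc):
    """Format a naive UTC datetime as an HTTP date, e.g. '... 19:43:31 GMT'."""
    return formatdate(calendar.timegm(moment_utc.timetuple()), usegmt=True)

def cache_headers(lastmod_utc):
    """Build the caching headers for a file last modified at lastmod_utc."""
    expires = lastmod_utc + ONE_YEAR
    return {
        # 365 days expressed in seconds: 31536000
        'Cache-Control': 'public, max-age=%d' % (ONE_YEAR.days * 24 * 60 * 60),
        'Last-Modified': http_date(lastmod_utc),
        'Expires': http_date(expires),
    }
```

Because Expires is computed from the file's modification date rather than from the current time, a file that has not changed for more than a year gets an Expires value in the past. HTTP/1.1 clients give Cache-Control max-age precedence over Expires, so this mostly matters for very old clients; basing Expires on the current time would be an alternative.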
Conclusions:
We’ve improved the latency of requests for static files by putting them into the cloud, and we keep the bandwidth used in the cloud to a minimum by answering If-Modified-Since requests correctly, all in only about 70 lines of code.
One of the advantages of Google App Engine over Amazon S3 is that GAE is free up to 5 million page views a month, which gives us a good chance to try this kind of feature without spending cash.
You can see the speed improvement online in all the ipsojobs.com pages right now!
Some screenshots taken from Firebug:
[Screenshot: first request]
[Screenshot: second request]
[Screenshot: detail of a request]
Full source of cacheheaders.py:
import wsgiref.handlers
from google.appengine.ext import webapp

class MainPage(webapp.RequestHandler):
    def output_file(self, path, lastmod):
        import datetime
        try:
            self.response.headers['Cache-Control'] = 'public, max-age=31536000'
            self.response.headers['Last-Modified'] = lastmod.strftime("%a, %d %b %Y %H:%M:%S GMT")
            expires = lastmod+datetime.timedelta(days=365)
            self.response.headers['Expires'] = expires.strftime("%a, %d %b %Y %H:%M:%S GMT")
            # Binary mode, so images are not altered on platforms that translate line endings
            fh = open(path, 'rb')
            self.response.out.write(fh.read())
            fh.close()
            return
        except IOError:
            self.error(404)
            return

    def get(self, dir, file, extension):
        if (dir!='js' and dir!='css' and dir!='images'):
            self.error(404)
            return
        if (extension!='js' and extension!='css' and extension!='jpg' and extension!='png' and extension!='gif'):
            self.error(404)
            return
        if extension=='js':
            self.response.headers['Content-Type'] = 'application/x-javascript'
        elif extension=='css':
            self.response.headers['Content-Type'] = 'text/css'
        elif extension=='jpg':
            self.response.headers['Content-Type'] = 'image/jpeg'
        elif extension=='gif':
            self.response.headers['Content-Type'] = 'image/gif'
        elif extension=='png':
            self.response.headers['Content-Type'] = 'image/png'
        try:
            import os
            import datetime
            path = dir+'/'+file+"."+extension
            info = os.stat(path)
            lastmod = datetime.datetime.fromtimestamp(info[8])
            if self.request.headers.has_key('If-Modified-Since'):
                dt = self.request.headers.get('If-Modified-Since').split(';')[0]
                modsince = datetime.datetime.strptime(dt, "%a, %d %b %Y %H:%M:%S %Z")
                if modsince >= lastmod:
                    # The file is older than the cached copy (or exactly the same)
                    self.error(304)
                    return
                else:
                    # The file is newer
                    self.output_file(path, lastmod)
            else:
                self.output_file(path, lastmod)
        except:
            self.error(404)
            return

def main():
    application = webapp.WSGIApplication([(r'/(.*)/([^.]*).(.*)', MainPage)], debug=False)
    wsgiref.handlers.CGIHandler().run(application)

if __name__ == "__main__":
    main()