Avid readers may have noticed a new function I added relatively silently this week, the how-on-earth-did-anyone-find-this-article function. For example, if you go to this article about coffee, you will see that someone has found this article by googling "hvordan lage aprikoslikør" (Norwegian for "how to make apricot liqueur"), at least at the time of writing. So how can I know that?
As you open a web page, what actually happens is that your browser sends something called an http request to the web server. This request contains a bunch of information, most importantly which page you want to open, but also some other stuff. Some of that information is required, and if it is missing, it will result in an error, whereas some is optional. For example, most browsers allow you to select which languages you prefer to see websites in. Not that many webpages are actually available in more than one language, or at least not in both Norwegian and English, but calcuttagutta currently is, so you can test this if you want.
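To make this a bit more concrete, here is a minimal Python sketch of what such a request might look like as plain text, and how a server could pick it apart. The path and host name are made up for illustration, and a real browser sends more headers than these two. In HTTP/1.1, Host is the required header; Accept-Language is one of the optional ones.

```python
# A hypothetical, heavily simplified HTTP request. The path and host
# are invented for this example; real browsers send many more headers.
RAW_REQUEST = (
    "GET /articles/coffee/ HTTP/1.1\r\n"
    "Host: www.example.com\r\n"          # required in HTTP/1.1
    "Accept-Language: nb,en;q=0.8\r\n"   # optional: preferred languages
    "\r\n"
)


def parse_request(raw):
    """Split a raw request into method, path and a dict of headers."""
    head = raw.split("\r\n\r\n", 1)[0]
    lines = head.split("\r\n")
    method, path, _version = lines[0].split(" ")
    headers = {}
    for line in lines[1:]:
        # Split on the first colon only, since values may contain colons.
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return method, path, headers


method, path, headers = parse_request(RAW_REQUEST)
print(method, path)
print(headers["accept-language"])
```

Here `nb,en;q=0.8` would mean "Norwegian Bokmål first, English as a second choice".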
Another piece of optional information, and the one which concerns us here, is the http_referer (sent by the browser as the Referer header). If you click a link to go to a new page, the http request will contain information about which page you came from. If you don't want to give up this information, it is possible to disable it, but most people probably neither know nor care. So for example, if you googled "hvordan lage aprikoslikør", your http_referer would look like this:
http://www.google.com/search?hl=en&source=hp&q=hvordan%20lage%20aprikoslik%C3%B8r&aq=f&aqi=&aql=&oq=&gs_rfai=
which is exactly what it says in your address bar when you do this search. The first part of this url
http://www.google.com/
is pretty straightforward: it means you came from google. To also cover people using the regional varieties of google, I check if the referer starts with
http://www.google.
and if it does, we are most likely dealing with a google search. Now, the next part of the url,
search?hl=en&source=hp&q=hvordan%20lage%20aprikoslik%C3%B8r&aq=f&aqi=&aql=&oq=&gs_rfai=
is the interesting bit. First, there is search, which is presumably the name of the piece of code which handles the actual search. Then there is a question mark, and a bunch of ampersands. The question mark indicates the beginning of something called a querystring, which technically is still part of the url (its query component). If url is even the proper, technical term; url, uri, iri, I get confused sometimes. But yes, the querystring contains information which is passed along to the server. It consists of a list of variable and value pairs, separated by ampersands. The one we are interested in is the variable q, which contains the string you searched for, and in this case has the value
q=hvordan%20lage%20aprikoslik%C3%B8r
For silly historical reasons, urls usually don't contain non-standard characters like ø, or spaces, which are instead replaced by %20 for space, and, apparently, %C3%B8 for ø. It is, however, an easy matter to replace these with the proper characters, and that is the story of how I know what you searched for when you found this page.
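The decoding step can be tried on its own. The following is a small sketch using Python 3's standard `urllib.parse` module (not necessarily the version of Python the site runs on). It shows that %C3%B8 is simply the two bytes that make up ø in UTF-8, each written as a percent sign followed by two hex digits.

```python
from urllib.parse import unquote, unquote_plus

encoded = "hvordan%20lage%20aprikoslik%C3%B8r"

# "ø" takes two bytes in UTF-8, 0xC3 and 0xB8, hence %C3%B8.
# A space is the single byte 0x20, hence %20.
assert "ø".encode("utf-8") == b"\xc3\xb8"

# unquote() replaces the percent escapes and decodes them as UTF-8.
decoded = unquote(encoded)
print(decoded)  # hvordan lage aprikoslikør

# Search engines often write spaces as "+" instead of %20;
# unquote_plus() handles both spellings.
print(unquote_plus("hvordan+lage+aprikoslik%C3%B8r"))  # hvordan lage aprikoslikør
```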
For those interested, the actual code looks like this:
from urllib import unquote_plus

if request.META.has_key('HTTP_REFERER'):
    if request.META['HTTP_REFERER'].startswith('http://www.google.'):
        queries = request.META['HTTP_REFERER'].split('?')[1].split('&')
        for query in queries:
            if query.split('=')[0] == 'q':
                article.google_count += 1
                article.last_google_hit = unquote_plus(query.split('=')[1])
                article.save()
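The code above is written for Python 2 (`has_key` and `urllib.unquote_plus` no longer exist in Python 3), and it will raise an error if a google referer happens to have no question mark in it. As a rough sketch of how the same idea might look in Python 3, the function below lets the standard library do the splitting and decoding. The name `google_search_terms` is invented for this example, and the function deliberately leaves out the `article` bookkeeping so it can be run on its own.

```python
from urllib.parse import urlsplit, parse_qs


def google_search_terms(referer):
    """Return the decoded search string from a google referer, or None.

    Hypothetical helper, not the code the site actually runs.
    """
    if not referer:
        return None
    parts = urlsplit(referer)
    host = parts.hostname or ""
    # Same prefix idea as the original check: matches www.google.com,
    # www.google.no, www.google.co.uk and so on. Like the original,
    # it would also accept an impostor such as www.google.example.org.
    if not host.startswith("www.google."):
        return None
    # parse_qs splits on "&" and "=", percent-decodes the values,
    # and drops variables with empty values such as "aqi=".
    values = parse_qs(parts.query).get("q")
    return values[0] if values else None


referer = (
    "http://www.google.com/search?hl=en&source=hp"
    "&q=hvordan%20lage%20aprikoslik%C3%B8r&aq=f&aqi=&aql=&oq=&gs_rfai="
)
print(google_search_terms(referer))  # hvordan lage aprikoslikør
```

In a view, the referer could be fetched with something like `request.META.get('HTTP_REFERER')`, which gives `None` instead of an error when the browser does not send one.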
-Tor Nordam