Insuma XML integration tutorial

Table of Contents

1. Basics
2. Search Form & XML-Query Generation
3. Connecting to the Search Engine
4. Compiling Complex Queries
5. Outputting XML-Results to HTML-Page. DOM Version
6. Outputting XML-Results to HTML-Page. SAX Version
7. Making it Nice
8. Download


1. Basics

What is it all for and how should it work? Until we know the answers to these questions, we can't go any further.

Fortunately we have another way. While it may be a little bit trickier, the solution provided by Insuma is much more powerful. The search engine, InsumaFocus, does only what it's intended for - the search. All the rest, including the formatting and sending results to a user, is done by the site owner. Moreover, queries aren't limited to input strings because the XML interface allows us to construct the queries, however complicated they may be.

What do we need? First of all, we need a search form. It may be as simple as a single input string or as complicated as a page full of inputs, checkboxes and radio buttons. It all depends on your needs. The data from the search form (the CGI query) is sent to a CGI script (all examples given will be written in Python, but you may use any language you like: Perl, PHP, ASP, or even C). That script should do the following. First, it translates the CGI parameters to the XML query format (a detailed description may be found in the Insuma XML Handbook), then sends the query to the InsumaFocus search engine and retrieves the results (also in XML format). After that, it reformats the results and outputs them to the user's browser (in HTML). It should also provide result paging: search engines never output all results in one big document, since there may be too many of them. Rather, they split them into groups of 10 to 20 entries, allowing a user to move back and forth between the pages.

NB: In the following examples, there are some line breaks added to make long lines readable. Because of them, you may encounter an error if you try copy-pasting those lines to your scripts. Complete listings without extra line-breaks are provided in the Download section.



2. Search Form & XML-Query Generation

Let's start from the very beginning. Say, we want to create a search form that only has an input string and a search button. The HTML code may look like this:

search.html:
<html>
<head>
  <title>Test search page v.1</title>
  <meta http-equiv="Content-type" content="text/html; charset=iso-8859-1">
</head>
<body>
  <h1>This is a page with a search form.</h1>
  <form action="/cgi-bin/client.py">
  Enter your query here:
    <input name="query" size="20" />
    <input type="hidden" name="show_max" value="10" />
    <input type="hidden" name="highlighting" value="on" />
    <input type="hidden" name="ranking" value="on" />
    <input type="submit" value="Go!" />
  </form>
</body>
</html>

Here are a few lines of interest. First is the action attribute of the form tag. As you probably know, it designates the script that the form data is sent to. In our case, it is the script that does cgi-to-xml translation, requests and fetches information from InsumaFocus, translates it to HTML and outputs it to a user. This script is the main part of our work. Of course, the name and the path to this script depend on you. Just keep in mind, usually (in most web-server installations) scripts should be located in a special folder (often called /cgi-bin) in order to be executable. Outside of that folder, scripts usually can't be executed.

Also, take note of the hidden field named "show_max". This field sets how many of the documents found by InsumaFocus are printed on one page. We could hardcode this number in the client.py script, but passing the value as a parameter gives us greater flexibility.

Ranking and highlighting parameters might also be hardcoded. They are supposed to determine whether the document rank (a float value between 0 and 1 that shows document relevance) should be printed and whether the found words should be highlighted in the summary.

Now we should handle the data sent to our script, client.py. Converting the search query to Unicode is a good idea. We assume that the data is sent to the script in Latin-1 encoding, because that is the encoding used in search.html. Some parameters are supposed to be integers, so we use the int() function. If a parameter is absent, we fall back to a default value.

client.py (beginning):
#!/usr/bin/python

import sys
sys.stderr = sys.stdout

print 'Content-Type:text/html'
print

import cgi
import types
import urllib2
from xml.sax import make_parser

# configuration options:

# encoding:
ENC = 'iso-8859-1'
# attributes to be requested:
ATTRS_SHOWN = 'author title'
# search engine URL:
ENGINE = 'http://www.insuma.de/cgi-bin/xml/web2xml.py'
# default values:
DEF_hilight = 'on'
DEF_ranking = 'on'
DEF_entries = '10'
DEF_start_after = '1'
# (they're strings and not integers due to the way we use them.)

try:
    data = cgi.parse()
except:
    print "Can't parse sent data!"
    sys.exit()

query      = unicode(data.get('query', [''])[0], ENC)
entries    = int(data.get('show_max', [DEF_entries])[0])
start_after = int(data.get('start_after', [DEF_start_after])[0])
hilight    = data.get('highlighting', [DEF_hilight])[0]
ranking    = data.get('ranking', [DEF_ranking])[0]

if not query:
    print "No query specified."
    sys.exit()

if hilight != 'on': hilight = 'off'
if ranking != 'on': ranking = 'off'
# (they're either 'on' or 'off', nothing else.)

To make this code a bit clearer, let's talk about how cgi.parse() works. If the cgi module can't parse the data, or if no query was specified, the script terminates with an error message. Otherwise, we get a dictionary (i.e. an associative array) whose keys are variable names and whose values are lists of the values sent. For example, the string

category=search+engines&name=altavista&name=askjeeves

will be parsed as

'category' => ['search engines']
'name'     => ['altavista', 'askjeeves']

Then, the get method of a dictionary helps in retrieving values in case we don't really know whether a key is present or not. dictionary.get(key, default_value) will return dictionary[key] if it is defined, and default_value if it isn't. We set these default values to be one-element lists in order to make them match the type of the dictionary elements.
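
The dictionary-of-lists shape and the get() idiom described above can be tried out in isolation. The following is a minimal sketch for Python 3 (the tutorial's listings target Python 2), where the cgi module is no longer available in recent releases; it assumes urllib.parse.parse_qs as a stand-in that returns the same kind of dictionary. The helper names first_value and int_value are hypothetical and not part of client.py; int_value additionally falls back to the default when a value is not a number.

```python
from urllib.parse import parse_qs


def first_value(params, key, default=''):
    """Return the first value sent for key, or default if the key is absent."""
    return params.get(key, [default])[0]


def int_value(params, key, default):
    """Like first_value(), but fall back to default if the value is not an integer."""
    try:
        return int(first_value(params, key, str(default)))
    except ValueError:
        return default


# The example string from the text, plus a deliberately malformed show_max.
params = parse_qs('category=search+engines&name=altavista&name=askjeeves&show_max=ten')

print(params['category'])                  # ['search engines']
print(params['name'])                      # ['altavista', 'askjeeves']
print(repr(first_value(params, 'query')))  # '' because the key is absent
print(int_value(params, 'show_max', 10))   # 10 because 'ten' is not a number
```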

The data encoding will probably match the encoding we set for search.html.

Then, we form an xml query. Mainly, it's just string concatenation. In our oversimplified case, it should be something like this:

xml_query = """<query max_results="%i" start_from="%i" show_attrs="%s">
    <condition attr="body" predicate="match" value="%s" />
</query>""" %(entries, start_after, ATTRS_SHOWN, query)

Note that here we introduce the start_after parameter, which is passed to the engine as the start_from attribute. It is used for paging the search results. For example, to display results 21 to 30, we set max_results to ten and start_after to 21. (For historical reasons the page-size parameter has three different names: show_max as a CGI parameter, entries inside client.py and max_results in XML. In each case it means the number of matching documents per page.)
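
Two details of the simple query are easy to get wrong: the arithmetic that turns a page number into a starting index, and special characters in the user's input (an ampersand or a quote would break plain string concatenation). The following Python 3 sketch illustrates both under stated assumptions: the function names start_for_page and simple_query are hypothetical, and xml.sax.saxutils.quoteattr from the standard library is used to quote attribute values. It is not the code from client.py.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr


def start_for_page(page, per_page):
    """1-based index of the first result shown on a 1-based page number."""
    return (page - 1) * per_page + 1


def simple_query(text, page=1, per_page=10, attrs='author title'):
    """Build the single-condition query; quoteattr() adds the quotes itself."""
    return ('<query max_results="%d" start_from="%d" show_attrs=%s>\n'
            '    <condition attr="body" predicate="match" value=%s />\n'
            '</query>') % (per_page, start_for_page(page, per_page),
                           quoteattr(attrs), quoteattr(text))


# Page 3 with ten results per page starts at result 21, as in the text.
xml_text = simple_query('AT&T "rocks" <now>', page=3)
print(xml_text)

# Parsing the result back confirms that it is well-formed XML
# and that the search text survived unchanged.
root = ET.fromstring(xml_text)
print(root.get('start_from'))               # 21
print(root.find('condition').get('value'))  # AT&T "rocks" <now>
```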

We want substring matches to appear higher in the result list than other matches. For example, if we search for "million dollars", we want "one thousand ways to earn one million dollars" to be found earlier than "one million ways to lose one thousand dollars". To achieve this, we add a substring match to the query:

xml_query = """<query max_results="%i" start_from="%i" show_attrs="%s">
   <or>
   <condition attr="body" predicate="match" value="%s"/>
      <and>
      <condition attr="body" predicate="match" value="%s"/>
      <condition attr="body" predicate="bmatch" value="&apos;%s&apos;"/>
      </and>
   </or>
</query>""" %(entries, start_after, ATTRS_SHOWN, query, query, query)

Remember that the match predicate is required for getting the results sorted in the proper way. That's a feature of the InsumaFocus search engine, and we have to take it as given. The score of documents with a substring match will be much higher than that of documents without one. Consequently, these documents will be delivered on the first pages of the search results.
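
As nesting grows, building the query as a tree instead of a string keeps the tags balanced and the attribute values escaped automatically. The following Python 3 sketch rebuilds the nested or/and query above with xml.etree.ElementTree from the standard library. The function name ranked_substring_query is hypothetical; the element and attribute names are the ones used in the listing above.

```python
import xml.etree.ElementTree as ET


def ranked_substring_query(text, per_page=10, start=1, attrs='author title'):
    """Return the or/and query from the tutorial as an XML string."""
    query = ET.Element('query', max_results=str(per_page),
                       start_from=str(start), show_attrs=attrs)
    either = ET.SubElement(query, 'or')
    # Plain match: finds and ranks documents containing the words.
    ET.SubElement(either, 'condition',
                  attr='body', predicate='match', value=text)
    both = ET.SubElement(either, 'and')
    # match again (needed for ranking) plus the quoted substring via bmatch.
    ET.SubElement(both, 'condition',
                  attr='body', predicate='match', value=text)
    ET.SubElement(both, 'condition',
                  attr='body', predicate='bmatch', value="'%s'" % text)
    return ET.tostring(query, encoding='unicode')


xml_query = ranked_substring_query('million dollars')
print(xml_query)
```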



3. Connecting to the Search Engine

Now we should send our request to the search script. It expects a request in multipart form, the same format in which files are sent in an HTTP upload. In general, the HTTP request we send to the server should look like this:

POST http://www.insuma.de/cgi-bin/xml/web2xml.py
Content-Length: 400
Content-Type: multipart/form-data; boundary="TeRMiNaToR"

--TeRMiNaToR
Content-Disposition: form-data; name="xml_query"
Content-Type: text/xml

<?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE query SYSTEM "insuma_query.dtd">
<query show_attrs="title description">
    <condition predicate="match" attr="body" value="search terms"/>
</query>

--TeRMiNaToR--

Here TeRMiNaToR is a string used as boundary marker. You may use any string you like, provided that it's very unlikely to appear in a query.
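
Instead of trusting a fixed boundary string, the script can generate one and check that it does not occur in the payload. The following Python 3 sketch shows this under stated assumptions: the function name multipart_body is hypothetical, the boundary comes from uuid.uuid4(), and lines are separated with CRLF as the MIME specification prescribes (the tutorial's listings use plain newlines; whether the engine accepts both is not stated in the text).

```python
import uuid


def multipart_body(xml_query, encoding='iso-8859-1'):
    """Return (boundary, body_bytes) for a single xml_query form field."""
    payload = ('<?xml version="1.0" encoding="%s"?>\r\n'
               '<!DOCTYPE query SYSTEM "insuma_query.dtd">\r\n'
               '%s' % (encoding, xml_query))

    # Pick a random boundary and make sure it cannot be confused with content.
    boundary = uuid.uuid4().hex
    while boundary in payload:
        boundary = uuid.uuid4().hex

    body = ('--%s\r\n'
            'Content-Disposition: form-data; name="xml_query"\r\n'
            'Content-Type: text/xml\r\n'
            '\r\n'
            '%s\r\n'
            '--%s--\r\n') % (boundary, payload, boundary)
    return boundary, body.encode(encoding)


boundary, body = multipart_body(
    '<query show_attrs="title">'
    '<condition predicate="match" attr="body" value="search terms"/>'
    '</query>')
print(body.decode('iso-8859-1'))
```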

In order to use urllib2 library, which provides objects for sending and retrieving data over HTTP connections, we should form a Request object and initialize it with the data we have:

import urllib2

data = """--TeRMiNaToR
Content-Disposition: form-data; name="xml_query"
Content-Type: text/xml

<?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE query SYSTEM "insuma_query.dtd">
""" + xml_query.encode(ENC) + """

--TeRMiNaToR--"""

request = urllib2.Request(ENGINE)
request.add_header('Content-type',
        'multipart/form-data; boundary="TeRMiNaToR"')
# (the Content-Length header is set automatically)
request.add_data(data)

Now we are ready to connect.

try:
    answer = urllib2.urlopen(request)
except urllib2.HTTPError, a:
    print a.read()
    sys.exit()
except urllib2.URLError:
    print "<strong>Can't connect to the search engine!</strong>"
    print "</body></html>"
    sys.exit()

In this piece of code we connect to the search engine. On success, we get a file-like object, answer, from which we can read data. If the request fails, we either get an HTTPError exception and just output what it returns (HTTPError is itself a file-like object that may contain useful information about the problem), or, in the case of URLError (the generic exception for urllib2), print a simple error message.
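
For readers on Python 3, where urllib2 was split into urllib.request and urllib.error, the same connection logic might look like the sketch below. The function names build_request and fetch are hypothetical, and the timeout value is an assumption. Note that HTTPError must be caught before URLError, because it is a subclass of it.

```python
import urllib.error
import urllib.request

# Search engine URL as given in the tutorial's configuration section.
ENGINE = 'http://www.insuma.de/cgi-bin/xml/web2xml.py'


def build_request(body, boundary, url=ENGINE):
    """Wrap an already encoded multipart body (bytes) in a POST request."""
    request = urllib.request.Request(url, data=body, method='POST')
    request.add_header('Content-Type',
                       'multipart/form-data; boundary="%s"' % boundary)
    # Content-Length is added automatically when the request is sent.
    return request


def fetch(request, timeout=10):
    """Return (ok, payload_bytes); payload describes the problem on failure."""
    try:
        with urllib.request.urlopen(request, timeout=timeout) as answer:
            return True, answer.read()
    except urllib.error.HTTPError as err:
        # The server answered with an error status; its body may explain why.
        return False, err.read()
    except urllib.error.URLError as err:
        # No usable answer at all (DNS failure, refused connection, ...).
        return False, str(err.reason).encode('utf-8')


request = build_request(b'--B\r\n...\r\n--B--\r\n', 'B')
print(request.get_method(), request.full_url)
```

Calling fetch(request) performs the actual network round trip, so it is left out of the example run.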

A remark on backwards compatibility. Someone may wish to use the examples from this tutorial with an InsumaStarter installation. It is definitely possible. The only field of the InsumaStarter search form that isn't covered in the tutorial is collectionID. There's really no need to use collectionID in the example, as the collection is uniquely specified by the search engine URL. However, if someone still wants to use collectionIDs and doesn't want to hardcode the URL (for example, to use one script for searching in different collections), the complete script example (see the Download section) makes it possible to use them. All required information on usage is available in its comments.



4. Compiling Complex Queries

Sometimes we want more flexibility than a simple input line can give us. For example, one may need to search in meta-tags (the headers of the document) rather than in the document body. Another use is limiting documents by date. Of course, complex queries greatly depend on which attributes are stored in the Insuma database and on what a user needs.

Example 1.

Let's talk about Insuma's default attributes for an HTML document.

For details read the Insuma XML Handbook.

Let's write a simple example: a page for searching in title, headers or keywords either in an exact or morphological way.

search1.html:
<html>
<head>
  <title>Test search page</title>
  <meta http-equiv="Content-type" content="text/html; charset=iso-8859-1">
</head>
<body>
<form action="http://www.example.com/script.cgi">
    <table>
    <col align="right" /><col />
    <tr><th align="center">
      Field
      </th><th>
      Morphological
      </th></tr>
    <tr><td>
      Headers: <input name="headings" size="20" />
      </td><td>
      <input type="checkbox" name="headings_morpho" />
      </td></tr>
    <tr><td>
      Title: <input name="title" size="20" />
      </td><td>
      <input type="checkbox" name="title_morpho" />
      </td></tr>
    <tr><td>
      Keywords: <input name="keywords" size="20" />
      </td><td>
      <input type="checkbox" name="keywords_morpho" />
      </td></tr>
    </table>
    <input type="radio" name="logic" checked="checked"
            value="or">Any of them<br>
    <input type="radio" name="logic" value="and">All of them<br>
    <input type="hidden" name="show_max" value="10" />
    <input type="submit" value="Go!" />
  </form>
</body>
</html>

So far, so good. We have three input fields with checkboxes attached to them. How should we construct an XML query from these fields? We look at which input fields are filled in and whether the corresponding checkboxes are checked. Then we check the logic setting and construct the query. It should look something like this:

fields = ['title', 'headings', 'keywords']
# these are the names of all possible fields
query_fields = {}
# these will be the fields a user filled in

try:
    sent_data = cgi.parse()
except:
    print "Can't parse sent data!"
    sys.exit()

entries    = int(sent_data.get('show_max', [DEF_entries])[0])
start_after = int(sent_data.get('start_after', [DEF_start_after])[0])

for f in fields:
    # the text typed into the field (empty if the field was left blank)
    value = unicode(sent_data.get(f, [''])[0], ENC)
    if value != '':
        # a checked checkbox is sent with the form; an unchecked one is absent
        if sent_data.get(f + '_morpho', [0])[0] != 0:
            query_fields[f + '_morpho'] = value
        else:
            query_fields[f] = value


# Construct the XML query using the values extracted from
# the CGI form fields.

logic = sent_data.get('logic', ['or'])[0]
if logic not in ('or', 'and'):
    logic = 'or'

if len(query_fields) == 0:
    print 'No search terms specified. Please return to fill them in.'
    sys.exit()

def condition(field, value):
    return u'<condition attr="%s" predicate="match" value="%s" />' %\
            (field, value)

query_header = """<query max_results="%i" start_from="%i"
               show_attrs="title score author">""" % (entries, start_after)
               
query_footer = """</query>"""

query_conditions = ''
for k, v in query_fields.items():
    query_conditions += condition(k, v)
    
if len(query_fields) > 1 and logic == 'or':
    xml_query = query_header +\
            '<or>' + query_conditions + '</or>' + query_footer
else:
    xml_query = query_header +  query_conditions + query_footer

A couple of remarks. We only check whether logic is or, because the and tag may be omitted. In general, a query consists of and-ed conditions, each of which may in turn be a simple condition, an and expression or an or expression. As the default HTML collection has only textual fields, I use only the match predicate in the examples. The only alternative is to replace match with bmatch, but the two are much alike. Whereas the simple query interface is always the same, complex search pages greatly depend on one's needs and abilities. Something useful in one collection may be totally useless or even impossible in another.
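
The nesting rule from the remarks above (top-level conditions are and-ed, and each of them may itself be an or or and expression) can be made concrete with a few small helpers. This is a Python 3 sketch; the helper names cond, any_of, all_of and query are hypothetical, and the attribute names are the ones from the form above.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr


def cond(attr, value, predicate='match'):
    """A single condition element with safely quoted attribute values."""
    return '<condition attr=%s predicate=%s value=%s />' % (
        quoteattr(attr), quoteattr(predicate), quoteattr(value))


def any_of(*parts):
    """At least one of the given sub-expressions must hold."""
    return '<or>' + ''.join(parts) + '</or>'


def all_of(*parts):
    """All sub-expressions must hold (useful when nested inside any_of)."""
    return '<and>' + ''.join(parts) + '</and>'


def query(*parts, per_page=10, start=1, attrs='title score author'):
    """Top-level parts are and-ed by the engine, so no <and> wrapper is added."""
    return '<query max_results="%d" start_from="%d" show_attrs=%s>%s</query>' % (
        per_page, start, quoteattr(attrs), ''.join(parts))


# (title OR keywords contains "python") AND headings contains "tutorial"
xml_query = query(
    any_of(cond('title', 'python'), cond('keywords', 'python')),
    cond('headings', 'tutorial'))

root = ET.fromstring(xml_query)
print([child.tag for child in root])   # ['or', 'condition']
```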

And then, everything as before:

data = """--TeRMiNaToR
Content-Disposition: form-data; name="xml_query"
Content-Type: text/xml

<?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE query SYSTEM "insuma_query.dtd">
""" + xml_query.encode(ENC) + """

--TeRMiNaToR--"""

etc...

Example 2.

Now we will discuss making query strings with boolean operators. For example, we want to provide three input lines. One is for keywords that should all be found in a document, one for keywords of which at least one is required to be found, and one input for words that should not be found in a document. The following HTML describes the appropriate form:

search2.html:
<html>
<head>
  <title>Test search page</title>
  <meta http-equiv="Content-type" content="text/html; charset=iso-8859-1">
</head>
<body>
<form action="http://www.example.com/script2.cgi">
    All of these: <input name="all" size="20" /><br />
    Any of these: <input name="any" size="20" /><br />
    None of these: <input name="none" size="20" /><br />
    <input type="hidden" name="show_max" value="10" />
    <input type="submit" value="Go!" />
  </form>
</body>
</html>

Now, our script will get three text lines: all, any and none.

try:
    sent_data = cgi.parse()
except:
    print "Can't parse sent data!"
    sys.exit()

entries    = int(sent_data.get('show_max', [DEF_entries])[0])
start_after = int(sent_data.get('start_after', [DEF_start_after])[0])

all        = unicode(sent_data.get('all', [''])[0], ENC)
any        = unicode(sent_data.get('any', [''])[0], ENC)
none       = unicode(sent_data.get('none', [''])[0], ENC)

How should the query look? We want to use boolean operators, so we need bmatch predicate in our query. For example, if all = "Lennon Mercury", any = "Knopfler Clapton", and none = "Jackson Madonna", then the condition in the query should look like this:

<condition attr="body" predicate="bmatch"
    value="+Lennon +Mercury Knopfler Clapton -Jackson -Madonna" />

But there is a little problem: bmatch predicate doesn't support ranking. So in order to get the results in reasonable order, we should add a simple match predicate to the existing bmatch predicate:

<condition attr="body" predicate="match"
    value="Lennon Mercury Knopfler Clapton" />

Now there's nothing complicated left. We only need to split the query strings into words, add boolean operators to them and compile the query.

all_bool = ' '.join(['+' + word for word in all.split()])
# first, we split the line into words, then put a '+' in front
# of each word and join the words with spaces.
# (an empty input gives an empty string, not a stray '+'.)
any_bool = any
# we needn't change anything
none_bool = ' '.join(['-' + word for word in none.split()])

match_condition = """<condition attr="body" predicate="match"
    value="%s %s" />""" % (all, any)
bmatch_condition = """<condition attr="body" predicate="bmatch"
    value="%s %s %s" />""" % (all_bool, any_bool, none_bool)

query_header = """<query max_results="%i" start_from="%i"
               show_attrs="title score author">""" % (entries, start_after)
query_footer = """</query>"""
xml_query = query_header + match_condition + bmatch_condition + query_footer
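
The word-prefixing step can be checked separately from the rest of the script. The following Python 3 sketch reproduces the Lennon/Mercury example from the text; the function names bmatch_value and match_value are hypothetical. Empty inputs simply contribute nothing, so no stray '+' or '-' ends up in the value.

```python
def bmatch_value(all_words, any_words, none_words):
    """Combine the three input lines into one bmatch value string."""
    terms = ['+' + word for word in all_words.split()]    # required words
    terms += any_words.split()                            # optional words
    terms += ['-' + word for word in none_words.split()]  # forbidden words
    return ' '.join(terms)


def match_value(all_words, any_words):
    """Words for the accompanying match condition, which provides ranking."""
    return ' '.join(all_words.split() + any_words.split())


print(bmatch_value('Lennon Mercury', 'Knopfler Clapton', 'Jackson Madonna'))
# +Lennon +Mercury Knopfler Clapton -Jackson -Madonna
print(match_value('Lennon Mercury', 'Knopfler Clapton'))
# Lennon Mercury Knopfler Clapton
```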


5. Outputting XML-Results to HTML-Page. DOM Version

Once we've got an XML answer from the search engine, the next problem is to parse and output the results. Fortunately, the biggest part of our work has already been done by others. XML parsers are available in all widely used programming languages, including Python, which we're using. There are several XML parsers for Python; we'll use the xml.dom.minidom library. It's a DOM parser, i.e. it creates a DOM (Document Object Model) of an XML document, and it's reasonably lightweight. We won't need anything complicated, so the simplest library is enough for our work. However, don't forget that DOM parsers are generally slow because they can't start working before the transmission of data is finished. They may also require a rather large amount of memory, especially when you work with big documents (in our case, search results with big total_hits values).

Suppose that buffer string contains the XML document generated by the search engine, that is the search result. We parse it using the parseString() function from the xml.dom.minidom library:

from xml.dom.minidom import parseString

buffer = answer.read()
# we read all content of the "answer"
# in a string "buffer"

result = parseString(buffer)

In the XML structure generated by the search engine we have a document object that contains an object tree. The root element is result, which has three attributes (hits, total_hits and total_hits_exact, and we don't need any of them right now), and a few document sub-elements. Technically, the number of document sub-elements is equal to hits attribute value, but we really don't need this fact at the moment.
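
To make the structure concrete, here is a Python 3 sketch that parses a small result document and reads the attributes of the result element. The sample XML is invented for illustration from the description above (the attribute values and URLs are assumptions, not real engine output).

```python
from xml.dom.minidom import parseString

# Hypothetical sample, shaped like the result structure described in the text.
SAMPLE = b"""<?xml version="1.0" encoding="iso-8859-1"?>
<result hits="2" total_hits="35" total_hits_exact="1">
  <document url="http://www.example.com/a.html"><title>Page A</title></document>
  <document url="http://www.example.com/b.html"><title>Page B</title></document>
</result>"""

dom = parseString(SAMPLE)
root = dom.documentElement          # the <result> element

hits = int(root.getAttribute('hits'))
total_hits = int(root.getAttribute('total_hits'))
documents = root.getElementsByTagName('document')
urls = [d.getAttribute('url') for d in documents]

print(root.tagName, hits, total_hits)   # result 2 35
print(len(documents) == hits)           # True
print(urls)
```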

Then we go one level deeper. Every document element has one required attribute, url, and three optional (we'll speak of them later). It also contains some sub-elements that depend on the search engine configuration. They are for providing meta-information on documents, such as author name, document summary, keywords, etc. The document body may also be passed as a body sub-element of the document. We will code our script to be flexible enough so that you won't spend much time migrating from one Insuma installation to another. Now all we need to do is walk through all the document records in our search result, grab URLs and meta-information from them and print them to a browser.

Since Python is an object-oriented language, it would be nice to describe some classes. We probably need a class for storing and handling those documents with all their attributes and sub-elements. Such a class may look like this:

cDoc.py:

from __main__ import ENC
from string import join

def getElementContent(node):
    s = ''
    for e in node.childNodes:
        if e.nodeType == node.TEXT_NODE:
            s = s + e.data
    return s


class cDocument:
    def __init__(self, doc):
        self.url = doc.getAttribute('url')
        self.list = {}
        for e in doc.childNodes:
            if e.nodeType <> e.ELEMENT_NODE: continue
            name = e.tagName
            if name not in self.list.keys():
                self.list[name] = []
            self.list[name].append(getElementContent(e))

    def print_doc(self):
        print '<div>'
        print '<strong>URL:</strong> <a href="' +\
                self.url.encode(ENC) + '">' + self.url.encode(ENC) +\
                '</a><br>'
        for key, val in self.list.items():
            print '<strong>' + key.encode(ENC) + '</strong>: '
            print join(val, '; ').encode(ENC)
            print '<br>'
        print '</div><br>'

What is it all about? First comes the getElementContent() function, which is introduced only because the parser tends to split text nodes into parts when they contain newlines or entities. Technically, this function gathers the content of all the text sub-nodes of a given element. We use it to join together text nodes that were split.

Next, the class is described. For now, it has two methods: one for initialization and one for printing. At the initialization step, our class takes a node (we will feed document nodes to it), stores its url attribute in a property, and creates a dictionary whose keys are the tag names of the sub-elements of the given element and whose values are lists of the contents of those sub-elements (remember that a document may have, e.g., more than one author). So, if we've got a document element that has a url attribute equal to 'http://www.microsoft.com', two author sub-elements with values 'Bill Gates' and 'Steve Ballmer', and one summary sub-element with value 'Windows are here to stay', then our cDocument instance cDoc will look like this:

cDoc.url = 'http://www.microsoft.com'
cDoc.list['author'] = ['Bill Gates', 'Steve Ballmer']
cDoc.list['summary'] = ['Windows are here to stay']

So, we've got a class initializer that can determine what information and meta-information is available and store it in its properties. Of course, this is possible only because we know the structure of InsumaFocus' XML result format, and because it's kept simple. Besides, in real life we don't really need to make our script that smart, because the set of information returned depends only on the Insuma installation and on the search query sent to the engine. As long as your script is intended to work only with your own Insuma installation, you will normally know the complete set of information fields (title, summary, keywords, etc.) to be returned.
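
The dictionary-building step of the initializer can be tried on its own. The following Python 3 sketch feeds the example document from the text to a function that collects sub-element contents by tag name. The names element_text and collect_fields are hypothetical; the logic mirrors getElementContent() and cDocument.__init__() without the printing part.

```python
from xml.dom.minidom import parseString


def element_text(node):
    """Join the data of all direct text children of node."""
    return ''.join(child.data for child in node.childNodes
                   if child.nodeType == child.TEXT_NODE)


def collect_fields(doc_node):
    """Map each sub-element tag name to the list of its text contents."""
    fields = {}
    for child in doc_node.childNodes:
        if child.nodeType != child.ELEMENT_NODE:
            continue  # skip whitespace text nodes between elements
        fields.setdefault(child.tagName, []).append(element_text(child))
    return fields


# The example document from the text, written out as XML.
SAMPLE = ('<document url="http://www.microsoft.com">'
          '<author>Bill Gates</author><author>Steve Ballmer</author>'
          '<summary>Windows are here to stay</summary></document>')

doc = parseString(SAMPLE).documentElement
fields = collect_fields(doc)

print(doc.getAttribute('url'))   # http://www.microsoft.com
print(fields['author'])          # ['Bill Gates', 'Steve Ballmer']
print(fields['summary'])         # ['Windows are here to stay']
```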

Besides the initializer, we also have a printing method. There isn't much to say about it: it just prints the information, separating multiple entries (such as two authors in the example) with semicolons, and adding some fancy markup. It will generate the following output:

URL: http://www.microsoft.com
author: Bill Gates; Steve Ballmer
summary: Windows are here to stay

After that, we need just a few keystrokes to bring to the browser window information about all documents found by the InsumaFocus search engine and described in result structure:

print 'Content-Type:text/html'
print
print """<html>
<head>
<title>Search Page</title>
</head>
<body>"""


from xml.dom.minidom import parseString
from cDoc import *
documents = result.getElementsByTagName('document')
for entry in documents:
    doc = cDocument(entry)
    doc.print_doc()

print '</body></html>'

Parsing the results using another approach (SAX) is described in the next section.



6. Outputting XML-Results to HTML-Page. SAX Version

Another way of parsing XML is SAX (it stands for "Simple API for XML"). SAX parsers do not create a document model in the computer's memory; instead, they just read the XML and react when something happens: an element starts or ends, etc. They aren't convenient if you want to manipulate the document a lot, but they are very fast: a parser can start working even before the search engine has finished searching (but, of course, after it has found something).

We will use xml.sax library. It is rather simple to use: we should only describe a handler class.

tut_handler.py:

error_class = ''
error_message = ''

from xml.sax import handler, SAXException

# we will use a few global variables
# from the main script
from __main__ import ENC, start_after


class insuma_handler (handler.ContentHandler):
    def __init__(self):
        """These are the attributes of the insuma_handler class:"""
        handler.ContentHandler.__init__(self)

        self.total_hits = 0
        self.total_hits_exact = 0
        self.hits = 0
        self.text_buf = ''
        self.level = 0
        self.counter = start_after
        
       
    def set_attributes(self, attrs):
        """This method sets attribute values"""
        
        self.total_hits = int(attrs.get('total_hits', '0'))
        self.hits = int(attrs.get('hits', '0'))
        self.total_hits_exact = int(attrs.get('total_hits_exact', '0'))
        
        
    def startElement(self, name, attrs):
        global error_class
        if name == 'result':
            self.set_attributes(attrs)
        elif name == 'error':
            self.level = -1
            error_class = attrs['class']
        elif name == 'document':
            self.level = 1
            print str(self.counter) + '. <a href="' +\
                    attrs['url'].encode(ENC) + '">' +\
                    attrs['url'].encode(ENC) + '</a><br>'
            self.counter += 1
        elif self.level == 1:
            print '<strong>' + name + ':</strong> '
            self.level = 2
            self.text_buf = ''
 
    def endElement(self, name):
        if self.level == 2:
           print self.text_buf
        if self.level > 0:
           self.level -= 1
           print '<br>'
        elif self.level == -1:
            raise SAXException('Insuma error', None)

    def characters(self, content):
        global error_message
        if self.level == 2:
            self.text_buf += content.encode(ENC)
        elif self.level == -1:
            error_message += content

(In the source file, you can find a lot of comments.)

As you can see, the startElement() and endElement() methods have a name parameter that contains the current element (i.e. tag) name. We use the level attribute to keep track of where we are. The characters() method is invoked when the parser meets some character data; we just store the text in a buffer and print this buffer when an element ends (converting it from Unicode to Latin-1). The reason we can't print the text directly from the characters() method is that the characters event is triggered every time an entity is met, and because of this, keyword highlighting (which we'll discuss later) would be broken. When highlighting is on, the XML may contain something like this:

<summary>This is a text with a &lt;strong&gt;keyword&lt;/strong&gt;</summary>

After parsing, our script will print the following html:

<strong>summary:</strong>
This is a text with a <strong>keyword</strong>

And if we printed directly from characters(), we would get the following:

<strong>summary:</strong>
This is a text with a 
<
strong
>
keyword
<
/strong
>

That is the only reason not to print the data immediately, but rather to store it in a buffer for a while.
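
The buffering idea can be shown in a few lines. The following Python 3 sketch collects the chunks delivered by characters() and joins them only when the element ends. The class name SummaryHandler and the sample document are hypothetical; the sample follows the summary example above.

```python
import xml.sax


class SummaryHandler(xml.sax.ContentHandler):
    """Collect the text of every <summary> element in one piece."""

    def __init__(self):
        super().__init__()
        self.buffer = None     # None while we are outside <summary>
        self.summaries = []

    def startElement(self, name, attrs):
        if name == 'summary':
            self.buffer = []   # start collecting chunks

    def characters(self, content):
        # May be called several times for one element, e.g. around entities.
        if self.buffer is not None:
            self.buffer.append(content)

    def endElement(self, name):
        if name == 'summary':
            self.summaries.append(''.join(self.buffer))
            self.buffer = None


SAMPLE = (b'<result><document url="http://www.example.com/">'
          b'<summary>This is a text with a '
          b'&lt;strong&gt;keyword&lt;/strong&gt;</summary>'
          b'</document></result>')

handler = SummaryHandler()
xml.sax.parseString(SAMPLE, handler)
print(handler.summaries[0])
# This is a text with a <strong>keyword</strong>
```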

To invoke and use this module, we should put the following lines in the main script:

import tut_handler

parser = make_parser()
handler = tut_handler.insuma_handler()
parser.setContentHandler(handler)

parser.parse(answer)


7. Making it Nice

Pages & navigation.

The first thing we haven't done yet is paging the results. Paging can be achieved easily, with a little help from the total_hits attribute of the result element. The following code should be self-explanatory:

# for DOM version:
e = result.getElementsByTagName('result')[0]
total_hits = int(e.attributes['total_hits'].nodeValue)
# (the result variable isn't the <result> element itself:
# it's the document node, the only child of which
# is the <result> element.)

# in SAX version we should add following lines:
# to insuma_handler.__init__() method:
#   self.total_hits = 0
# to insuma_handler.startElement() method:
#   self.total_hits = int(attrs.get('total_hits', '0'))
# to the main script, after the parsing is done:
#   total_hits = handler.total_hits
    
from urllib import urlencode

def print_pager():
    """ This function prints page navigation:
    <<Previous | 1 | 2 | 3 | 4 | Next>>"""

    # if there's only one page of results:
    if total_hits <= entries:
        return

    params = {'query': query.encode(ENC), 'show_max': str(entries)}
    link = 'client.py?' + urlencode(params)
    # round up, so that a partly filled last page is counted
    # and an exact multiple doesn't produce an empty extra page:
    num_of_pages = (total_hits + entries - 1) / entries
    if start_after > 1:
        print """<a href="%s&start_after=%i">&lt;&lt;Previous</a> |""" %\
                (link, start_after - entries)
    for i in range(num_of_pages):
    # same as "for (i=0; i<num_of_pages; i++)"
        if start_after != i*entries + 1:
            print """<a href="%s&start_after=%i">%i</a> """ %\
                    (link, i*entries + 1, i + 1)
        else:
            print str(i + 1) + ' '
        # if not last page, print "pipe" character:
        if i < num_of_pages - 1:
            print '| '

    if total_hits >= entries + start_after:
        print """| <a href="%s&start_after=%i">Next&gt;&gt;</a>""" %\
                (link, entries + start_after)


print_pager()

This will output the navigation bar of the following kind:

<<Previous | 1 | 2 | 3 | 4 | Next>>
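
The arithmetic behind the navigation bar is worth checking separately from the HTML output, since the number of pages is easy to get wrong by one. The following Python 3 sketch computes the same links as data. The function names page_count, page_starts and pager_items are hypothetical; start_after is the 1-based index of the first result on the current page, as in the tutorial.

```python
def page_count(total_hits, per_page):
    """Number of pages, rounding up (20 hits at 10 per page is 2 pages)."""
    return (total_hits + per_page - 1) // per_page


def page_starts(total_hits, per_page):
    """start_after value for every page: 1, 1 + per_page, ..."""
    return [page * per_page + 1
            for page in range(page_count(total_hits, per_page))]


def pager_items(total_hits, per_page, start_after):
    """Return (label, start) pairs; start is None for the current page."""
    if total_hits <= per_page:
        return []  # a single page needs no navigation
    items = []
    if start_after > 1:
        items.append(('<<Previous', max(1, start_after - per_page)))
    for number, start in enumerate(page_starts(total_hits, per_page), 1):
        items.append((str(number), None if start == start_after else start))
    if start_after + per_page <= total_hits:
        items.append(('Next>>', start_after + per_page))
    return items


print(page_starts(35, 10))       # [1, 11, 21, 31]
print(pager_items(35, 10, 11))
```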

Rating the results.

You probably want to show relevance scores for the documents found. The scores may be retrieved using the score field: you just add it to the show_attrs attribute of the query. The score is a float value between 0 and 1.

<query max_results="%i" start_from="%i" show_attrs="author title score">

In our script, we should add the following somewhere after retrieving the value of ranking, but before compiling the query:

if ranking == 'on':
    ATTRS_SHOWN += ' score'

Then just handle the score as any other attribute of the result. For example, our script will output the line with its value:

URL: http://www.example.com/insuma/
author: Alex Babanin
title: Insuma example page
score: 0.12345

Highlighting.

At the moment, highlighting can't be turned on or off: it is always present in the summary field, and only there. So, if you include summary in the show_attrs attribute, you'll get a summary with search terms highlighted. In our script we should add the following:

if hilight == 'on':
    ATTRS_SHOWN += ' summary'

Error handling.

We have already done some error handling: checking whether the connection is established, etc. But there are other errors: those discovered by the search engine. They are reported in an XML structure much like the one in which search results come. The top-level element is also result; it contains a single error element that has a class attribute and human-readable content, that is, a plain-text description of the error or debugging information. The class may be either internal or input. The first means the engine encountered an internal error, and the second means the error was triggered by a malformed query.

Handling of the errors is rather simple in case of a DOM-parser. We should only check whether the result element contains document or error elements, and then either print search results or an error message. In the case of a SAX-parser, error handling is a little bit trickier, though still easy.
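
For the DOM case, the check might look like the following Python 3 sketch. The function names node_text and read_result are hypothetical, and both sample documents are invented from the description above (an error element with a class attribute inside result), not captured engine output.

```python
from xml.dom.minidom import parseString


def node_text(node):
    """Join the data of all direct text children of node."""
    return ''.join(child.data for child in node.childNodes
                   if child.nodeType == child.TEXT_NODE)


def read_result(xml_bytes):
    """Return (error, urls); error is None or an (error_class, message) pair."""
    root = parseString(xml_bytes).documentElement
    errors = root.getElementsByTagName('error')
    if errors:
        error = errors[0]
        return (error.getAttribute('class'), node_text(error).strip()), []
    urls = [d.getAttribute('url')
            for d in root.getElementsByTagName('document')]
    return None, urls


# Hypothetical samples shaped like the structures described in the text.
ERROR_SAMPLE = b'<result><error class="input">Unknown predicate</error></result>'
OK_SAMPLE = (b'<result hits="1" total_hits="1">'
             b'<document url="http://www.example.com/" /></result>')

print(read_result(ERROR_SAMPLE))   # (('input', 'Unknown predicate'), [])
print(read_result(OK_SAMPLE))      # (None, ['http://www.example.com/'])
```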

We should add a few lines to the insuma_handler class we described above:

class insuma_handler (handler.ContentHandler):
    def __init__(self):
        (not changed)
        
    def startElement(self, name, attrs):
        global error_class
        if name == 'result':
            self.set_attributes(attrs)
        elif name == 'error':
            # remember the error class; the message text follows
            self.level = -1
            error_class = attrs['class']
        elif name == 'document':
            self.level = 1
            print str(self.counter) + '. <a href="' +\
                    attrs['url'].encode(ENC) + '">' +\
                    attrs['url'].encode(ENC) + '</a><br>'
            self.counter += 1
        elif self.level == 1:
            print '<strong>' + name + ':</strong> '
            self.level = 2
            self.text_buf = ''

    def endElement(self, name):
        if self.level == 2:
            print self.text_buf
        if self.level > 0:
            self.level -= 1
            print '<br>'
        elif self.level == -1:
            # the whole error message has been read by now.
            # (SAXException must be imported from xml.sax
            # at the top of tut_handler.py)
            raise SAXException('Insuma error', None)

    def characters(self, content):
        global error_message
        if self.level == 2:
            self.text_buf += content.encode(ENC)
        elif self.level == -1:
            error_message += content

We introduced two global variables for storing the error class and the error message, in order to make them available later when we print the error message. We also introduced an exception; it is raised after we have stored all information about the error. The insuma_handler.level attribute, as before, shows what element we are in: 1 for document, 2 for its sub-elements; now we add -1 for the error element.
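
As an alternative to module-level globals, the error class and message can travel inside the exception itself. The following Python 3 sketch shows that variation; it is not the tutorial's handler. The names InsumaError, ErrorAwareHandler and run are hypothetical, and the sample documents are invented from the error format described above.

```python
import xml.sax


class InsumaError(xml.sax.SAXException):
    """Raised when the engine reports an error instead of results."""

    def __init__(self, error_class, message):
        super().__init__(message)
        self.error_class = error_class


class ErrorAwareHandler(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.error_class = None   # set while we are inside <error>
        self.error_text = []
        self.urls = []

    def startElement(self, name, attrs):
        if name == 'error':
            self.error_class = attrs.get('class', '')
        elif name == 'document':
            self.urls.append(attrs['url'])

    def characters(self, content):
        if self.error_class is not None:
            self.error_text.append(content)

    def endElement(self, name):
        if name == 'error':
            # The whole message has been read; hand it to the caller.
            raise InsumaError(self.error_class,
                              ''.join(self.error_text).strip())


def run(xml_bytes):
    """Return ('ok', urls) or ('error', error_class, message)."""
    handler = ErrorAwareHandler()
    try:
        xml.sax.parseString(xml_bytes, handler)
    except InsumaError as err:
        return ('error', err.error_class, err.getMessage())
    return ('ok', handler.urls)


print(run(b'<result><error class="input">Bad query</error></result>'))
print(run(b'<result><document url="http://www.example.com/" /></result>'))
```

Catching only InsumaError also keeps genuine XML parsing errors (SAXParseException) distinguishable from errors reported by the engine.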

At the moment, we raise an exception in case the result element contains an error message instead of search results. The next step is to handle this exception:

# we should replace the line
# parser.parse(answer)
# with the following code:


tut_handler.error_message = ''
tut_handler.error_class = ''
# initialization of the globals
# (they live in the tut_handler module, where the handler sets them)

try:
    parser.parse(answer)
except:
    if tut_handler.error_class == '':
        # that is, the exception wasn't raised by our code;
        # there is a real error in the xml.
        print "Insuma encountered an error!"
        print answer.read()
    else:
        print '<strong>'
        if tut_handler.error_class == 'internal':
            print 'The query wasn\'t processed due to an internal error:'
        else:
            print 'Insuma search engine encountered an error in your query:'
        print '</strong><pre>'
        print tut_handler.error_message.encode(ENC)
        print '</pre>'


8. Download

Here are the files from the tutorial, in a single list. The script client.py differs slightly from what we described in the tutorial, but everything works the same way. It is supposed to work with simple queries only (ones that come from search.html), and doesn't understand queries from search1.html or search2.html.