Insuma Tech FAQ
This FAQ is intended for the technical staff of Insuma
customers and for third-party integrators of Insuma
software.
What is the name of Insuma's User-Agent?
The crawler of the Insuma search engine identifies
itself as:
InsumaScout/1.15
where 1.15 is the current version number. You can forbid the Insuma crawler from
accessing certain directories of your website using a robots.txt file, or within your
web publishing network. The web server passes this string to your scripts in the
environment variable HTTP_USER_AGENT.
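As a rough illustration of the robots.txt approach, the sketch below uses Python's standard urllib.robotparser to check which URLs a given rule set would leave open to the crawler. The /private/ directory and the example URLs are hypothetical; only the User-Agent string comes from this FAQ.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: keep the Insuma crawler out of /private/
ROBOTS_TXT = """\
User-agent: InsumaScout
Disallow: /private/
"""

def is_allowed(robots_txt, user_agent, url):
    """Return True if the rules in robots_txt let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

if __name__ == "__main__":
    for url in ("http://www.mysite.co.uk/private/report.html",
                "http://www.mysite.co.uk/news/index.html"):
        print(url, is_allowed(ROBOTS_TXT, "InsumaScout/1.15", url))
```

The version suffix (/1.15) does not need to appear in robots.txt; the parser compares only the product name.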
How can I regulate scores?
Scores are returned as values between 0 and 1. How the values are distributed within
this range is less predictable: the behavior of scores can differ from installation
to installation depending on word counts, the number of documents, etc. You can
fine-tune it by changing the score_beta and score_theta parameters in the config file.
The latter is -5, which is correct for most cases. Change score_beta by an order of
magnitude (a factor of 10) at a time and see how the scores change in the result set.
I changed parameters in the config file, but there seems to be no effect?
If your installation is based on mod_python, you
should restart the HTTP daemon so that the configuration is re-read.
My website uses java for navigation and not all pages are indexed. What do
I do?
Java navigation can cause problems for the crawler. Turn off Java in your browser
and you will see your website from the crawler's point of view. If some of
the pages or areas are unreachable, take one of the following actions:
- include the unreachable areas in the start URLs in the control center;
- link the unreachable areas with plain HTML links, so that you can get
there from the main page without Java;
- create a shadow listing on your website that lists all the contents, and
include it as a start page in the control center.
A document is on the website but cannot be found. Why?
The crawler starts indexing your website from the start URLs you have entered
in the control center. Most often, it is the main page of your website, like http://www.mysite.co.uk/.
Make sure that the missing document can be reached from there (i.e., you can click
through until the document appears). If this is not the case, you have a so-called
"crawler pocket" on your website.
A "crawler pocket" is a page or group of pages that cannot be reached through
HTTP 1.0 links. This can happen if the documents are only reachable through
a pulldown selection form, a database search, or a Java applet.
Please add a representative URL from the "crawler pocket" to the start URLs in the
Control Center so that the crawler can index these documents. If the documents are
not linked to each other, you may enter the entire list of missing documents.
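To make the idea of a "crawler pocket" concrete, here is a minimal sketch that walks a link graph from the start URLs and reports the pages that are never reached. The site map and the function name are hypothetical, and the real crawler may work differently; the sketch only illustrates reachability.

```python
from collections import deque

def find_unreachable(links, start_urls):
    """Return the pages in `links` that cannot be reached from `start_urls`.

    `links` maps every known page to the list of pages it links to.
    """
    seen = set(start_urls)
    queue = deque(start_urls)
    while queue:
        page = queue.popleft()
        for target in links.get(page, ()):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return sorted(set(links) - seen)

# Hypothetical site: the archive pages are only reachable through a search form,
# so no ordinary link leads to them from the main page.
SITE_LINKS = {
    "/": ["/news.html", "/about.html"],
    "/news.html": ["/"],
    "/about.html": [],
    "/archive/2001.html": ["/archive/2002.html"],
    "/archive/2002.html": [],
}

if __name__ == "__main__":
    print(find_unreachable(SITE_LINKS, ["/"]))
```

Adding one representative URL of the pocket (here /archive/2001.html) to the start URLs is enough when the pages inside the pocket link to each other.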
Some pages appear in the result set several times. What do I do?
Pages can appear several times if they can be reached with
different values in the CGI parameters, for example:
http://www.mysite.co.uk/mypage.asp?counter=234
http://www.mysite.co.uk/mypage.asp?counter=235
...
Every different URL (including all CGI parameters) is considered a separate page
by the crawler. Returning to the same page with a different value of a CGI
parameter causes so-called "crawler loops", which can go on indefinitely (until the
quota is reached). Take one of the following actions to prevent this:
- Include the variable parameter in the "Modify URL" section
of the control center. Eliminating the parameter makes the URLs
identical, so the page is not visited more than once.
- Disallow this area of your website in the "Disallow
URLs" section of the control center.
- If your website uses a framework (CMS) to show the contents, a customized
interface for the crawler may be necessary. Contact the manufacturer of your
framework.
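The effect of eliminating a variable parameter can be sketched as follows. This is only an illustration of the principle using Python's standard urllib.parse; the function name is hypothetical and it does not describe how the "Modify URL" section is implemented.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def strip_params(url, params_to_drop):
    """Return url with the named query parameters removed."""
    parts = urlsplit(url)
    kept = [(key, value)
            for key, value in parse_qsl(parts.query, keep_blank_values=True)
            if key not in params_to_drop]
    return urlunsplit(parts._replace(query=urlencode(kept)))

if __name__ == "__main__":
    urls = [
        "http://www.mysite.co.uk/mypage.asp?counter=234",
        "http://www.mysite.co.uk/mypage.asp?counter=235",
    ]
    # Both URLs collapse to the same address, so the page is visited once.
    print({strip_params(u, {"counter"}) for u in urls})
```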
If you ordered the deduplication module, you can activate it in the configuration
file crawler.conf:
deduplicate_size = 0.03
deduplicate_diff = 0.02
The deduplication algorithm checks the difference between pages and excludes
a subsequent page if it has the same or similar contents as a previous one. The above
is the typical setting. The first step, finding candidates for duplicates,
is based on document size and is regulated by the parameter deduplicate_size. The
second step, the actual check of the difference in contents, is regulated by the
parameter deduplicate_diff.
Set both parameters to 0 to deduplicate only absolutely identical
documents. Set it to -1 to switch deduplication off.
The crawler must be restarted for the changes to take effect.
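The two-step check can be pictured with the sketch below. It is not the module's actual algorithm: it assumes that both parameters are relative tolerances (0.03 meaning a 3% size difference, 0.02 meaning a 2% content difference), that a negative value switches the check off, and it uses Python's difflib as a stand-in for the real content comparison.

```python
from difflib import SequenceMatcher

def is_duplicate(text_a, text_b, size_tol=0.03, diff_tol=0.02):
    """Two-step duplicate check: cheap size filter, then content comparison."""
    # Assumption: a negative tolerance means deduplication is switched off.
    if size_tol < 0 or diff_tol < 0:
        return False
    larger = max(len(text_a), len(text_b))
    if larger == 0:
        return True  # two empty documents are identical
    # Step 1: only documents of similar size are candidates.
    if abs(len(text_a) - len(text_b)) / larger > size_tol:
        return False
    # Step 2: actual comparison of the contents.
    similarity = SequenceMatcher(None, text_a, text_b, autojunk=False).ratio()
    return 1.0 - similarity <= diff_tol

if __name__ == "__main__":
    page = "The quick brown fox jumps over the lazy dog. " * 20
    print(is_duplicate(page, page + "!"))        # nearly identical pages
    print(is_duplicate(page, page + "!", 0, 0))  # only exact copies count
```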
Re-indexing is not incremental; all content gets indexed all over again. What do
I do?
If you use a CGI script to show the contents, make sure that your script sends
a Last-Modified: line in the HTTP header. You can check this by pointing your browser
to one of the pages and looking at the page information in the browser menu.
The date can be taken from the attributes of the file where the contents reside, or
from the last modification date of the database where the content of your website
is stored. If no Last-Modified: information is present, the crawler re-crawls
all the pages completely.
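A minimal sketch of producing such a header in a Python CGI script, assuming the content lives in a file whose modification time is the relevant date. The function names are hypothetical; the date format is the standard HTTP one, generated here with email.utils.formatdate.

```python
import os
from email.utils import formatdate

def last_modified_header(mtime):
    """Format a Unix timestamp as an HTTP Last-Modified header line."""
    return "Last-Modified: " + formatdate(mtime, usegmt=True)

def response_headers(content_path):
    """Build the CGI header lines for a page whose content is stored in a file."""
    return [
        "Content-Type: text/html",
        last_modified_header(os.path.getmtime(content_path)),
    ]

if __name__ == "__main__":
    # A CGI script prints the headers, an empty line, and then the page body.
    print(last_modified_header(0))
```

If the content comes from a database, the timestamp of the last change to the relevant record can be passed to last_modified_header instead.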
A document or a page is not there anymore, why does it
appear in the search results?
You have to wait until the next scheduled re-indexing. Check the schedule,
or set it up, in the control center. If the document(s) remain there after
re-indexing, the following can be done:
What are the weights in the control center about?
The weights influence the order of the documents in the
result set.
Normally the "factory settings" of the weights
can be left unchanged.
Say you wish to boost results matched in the original document so that they come
before the results matched by morphology. You may set the weights as follows:
body: 100
body_morpho: 10
Or, you want to boost results from the field <title> to come
before matches in the <body> of the page:
title: 300
body: 100
The absolute numbers are not important; only the
ratios between them matter.
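One way to picture "considered in relation" is to normalize the weights so that they sum to 1. The sketch below is purely illustrative (the actual ranking formula is not described here, and the function name is hypothetical); it shows that 300:100 and 3:1 describe the same setting.

```python
from fractions import Fraction

def relative_weights(weights):
    """Scale integer weights so that they sum to 1, keeping their ratios."""
    total = sum(weights.values())
    return {field: Fraction(value, total) for field, value in weights.items()}

if __name__ == "__main__":
    print(relative_weights({"title": 300, "body": 100}))
    print(relative_weights({"title": 3, "body": 1}))
```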
Normally, there is no need to adjust the weight of an attribute of type integer
or string if a certain value is required for all results (e.g., vip=1). If
the query combines conditions, one of which requires the value and the other
does not, then weighting the attribute makes sense. Set the weight of the attribute
vip 10 times larger than that of the <body> to boost the results with a
matching vip value. Changes to the weights take effect immediately. You can
test them by searching in another window.
Why can't I find any words shorter than 4 characters?
By default, only terms of 4 or more characters are indexed. This is an appropriate
setting for searching large amounts of text with a generic vocabulary. If your
vocabulary includes shorter keywords that should be found, please add the following
setting to the configuration file (normally my.cnf) of your MySQL installation:
# set the minimum word length for fulltext indexing
ft_min_word_len = 1
The change takes effect only after both a restart of the
MySQL daemon and the next re-indexing:
/etc/init.d/mysql restart
How can I do an advanced search, exclude terms etc?
The XML syntax allows all kinds of advanced boolean queries. Refer to the
Integration Tutorial and the XML Handbook for details and the formal DTD. Below
are a few quick examples.
Negation should be used with predicate bmatch:
<?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE result SYSTEM "insuma_search_result.dtd">
<query max_results="10" start_from="1" show_attrs="title description">
<and>
<condition predicate='match' attr='body' value='John Lennon'/>
<condition predicate='bmatch' attr='body' value='John -Lennon'/>
</and>
</query>
It is important to keep using the predicate match as well for proper ranking, because
bmatch ranks in an integer manner (1 - keyword present, 0 - keyword absent).
The predicate like is intended for searching by parts of words:
%uff% finds "buffer", "stuff", "fluffy"; %uff finds only "stuff".
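The behavior of the % wildcard in the like examples can be reproduced with a few lines of Python. This sketch only mimics the pattern semantics described above (% stands for any run of characters, matched against a whole word); it makes no claim about how the predicate is implemented, and the helper names are hypothetical.

```python
import re

def like_to_regex(pattern):
    """Translate a like pattern with % wildcards into a compiled regex."""
    parts = (re.escape(part) for part in pattern.split("%"))
    return re.compile(".*".join(parts))

def like(pattern, word):
    """Return True if the whole word matches the like pattern."""
    return like_to_regex(pattern).fullmatch(word) is not None

WORDS = ["buffer", "stuff", "fluffy", "tough"]

if __name__ == "__main__":
    print([w for w in WORDS if like("%uff%", w)])
    print([w for w in WORDS if like("%uff", w)])
```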
How can I boost a substring so the exact match is always on top (also called
"phrase search")?
Suppose you enter the phrase "Rowan Atkinson" and want to see results containing this
phrase (or substring) on top, before documents that merely contain many occurrences of
"Atkinson" or "Rowan". Add a bmatch condition to your XML query, grouped by OR with
the rest. With bmatch you can match a substring using quotes. Below is an example
that finds only documents matching the exact phrase (substring):
<?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE result SYSTEM "insuma_search_result.dtd">
<query max_results="10" start_from="1" show_attrs="title description">
<and>
<condition predicate='match' attr='body' value='Rowan Atkinson'/>
<condition predicate='bmatch' attr='body' value='"Rowan Atkinson"'/>
</and>
</query>
If both keywords must be present, anywhere in the document,
use bmatch once again:
<?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE result SYSTEM "insuma_search_result.dtd">
<query max_results="10" start_from="1" show_attrs="title description">
<and>
<condition predicate='match' attr='body' value='Rowan Atkinson'/>
<condition predicate='bmatch' attr='body' value='+Rowan +Atkinson'/>
</and>
</query>
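Queries like the ones above can be generated rather than written by hand. The following sketch builds the query element with Python's xml.etree.ElementTree, which takes care of escaping the quotes inside attribute values. The function name and its parameters are hypothetical; the element and attribute names are copied from the examples in this FAQ, and the XML declaration and DOCTYPE lines would still have to be prepended.

```python
import xml.etree.ElementTree as ET

def build_query(conditions, group="and", max_results=10, start_from=1,
                show_attrs=("title", "description")):
    """Build a <query> with the given (predicate, attr, value) conditions."""
    query = ET.Element("query", {
        "max_results": str(max_results),
        "start_from": str(start_from),
        "show_attrs": " ".join(show_attrs),
    })
    # All conditions are placed inside one grouping element, as in the examples.
    grouping = ET.SubElement(query, group)
    for predicate, attr, value in conditions:
        ET.SubElement(grouping, "condition",
                      {"predicate": predicate, "attr": attr, "value": value})
    return ET.tostring(query, encoding="unicode")

if __name__ == "__main__":
    print(build_query([
        ("match", "body", "Rowan Atkinson"),
        ("bmatch", "body", '"Rowan Atkinson"'),
    ]))
```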
How do I set up search in categories?
You may want your search form to restrict the search to a
certain area of the website ("News", "Archive", "Members", etc.).
This can be done using the index attribute
category.
When sending queries to the XML interface, you can
limit the search to certain categories (values of the
attribute category). Your search form may include
a select box or checkboxes that limit the search
to a certain category.
The standard category module allows 2 ways of defining
the categories:
- URL-derived categories: in the control center,
enter URL-matching substrings:
/path/category1/ My_category_1
http://www.mysite.com/path/category2/ Tourism
In this case the crawler will set attribute category
of the document:
http://www.mysite.com/path/category2/museums.html
to the value Tourism, which can later be used in the
query.
- Categories passed via a meta tag: enter in the
control center:
<META NAME=MODE CONTENT="category1"> My_category_name_1
<META NAME=MODE CONTENT="category2"> (Default category name: category2)
Multiple values of the attribute category are allowed.
The contents should be reindexed for changes to take effect.
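The URL-derived variant amounts to a substring lookup, which can be sketched as below. The rule list repeats the two example entries from above; the function name is hypothetical and the crawler's real matching rules (for example, precedence between overlapping substrings) are not specified here.

```python
# (URL substring, category name) pairs, as entered in the control center
CATEGORY_RULES = [
    ("/path/category1/", "My_category_1"),
    ("http://www.mysite.com/path/category2/", "Tourism"),
]

def categories_for(url, rules=CATEGORY_RULES):
    """Return every category whose URL substring occurs in url."""
    return [name for substring, name in rules if substring in url]

if __name__ == "__main__":
    print(categories_for("http://www.mysite.com/path/category2/museums.html"))
```

Returning a list rather than a single value reflects the fact that multiple values of the attribute category are allowed.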
Does Insuma crawler respect robots meta-tag?
Yes. The formal syntax is:
CONTENT="ALL | NONE | NOINDEX | INDEX | NOFOLLOW | FOLLOW"
default = empty = "ALL"
"NONE" = "NOINDEX, NOFOLLOW"
The CONTENT field is a comma-separated list:
- INDEX: search engine robots should include this page.
- FOLLOW: robots should follow links from this page to other pages.
- NOINDEX: the page is not indexed, although its links can be explored.
- NOFOLLOW: the page can be indexed, but no links are explored.
- NONE: robots can ignore the page.
Examples:
<META NAME="ROBOTS" CONTENT="ALL">
<META NAME="ROBOTS" CONTENT="INDEX,NOFOLLOW">
<META NAME="ROBOTS" CONTENT="NOINDEX,FOLLOW">
<META NAME="ROBOTS" CONTENT="NONE">
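The rules above translate into two yes/no decisions per page: whether to index it and whether to follow its links. A minimal sketch of that interpretation, with a hypothetical function name and with unknown tokens simply ignored (an assumption):

```python
def parse_robots_meta(content):
    """Return (index, follow) flags for the CONTENT value of a robots meta tag."""
    index, follow = True, True  # default = empty = "ALL"
    for token in content.upper().split(","):
        token = token.strip()
        if token == "ALL":
            index, follow = True, True
        elif token == "NONE":  # same as "NOINDEX, NOFOLLOW"
            index, follow = False, False
        elif token == "INDEX":
            index = True
        elif token == "NOINDEX":
            index = False
        elif token == "FOLLOW":
            follow = True
        elif token == "NOFOLLOW":
            follow = False
    return index, follow

if __name__ == "__main__":
    for value in ("ALL", "INDEX,NOFOLLOW", "NOINDEX,FOLLOW", "NONE"):
        print(value, parse_robots_meta(value))
```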
What are the hardware requirements for the software?
The minimal configuration is:
- 1 GHz CPU Intel Celeron or Athlon
- 40 GB HD
- 1 GB RAM
Consider a better configuration, which will improve the
speed of search and indexing:
- More RAM will speed up search
- Hardware RAID will speed up search
- More powerful CPU (or a second CPU)
will speed up indexing, clustering, etc.
- Attach several SCSI hard drives if you have
multiple collections. This will speed up parallel
access to them.
Where you can save $$$:
Less important are an upgrade from a Celeron to a "real" Intel CPU
(only critical for multimedia and 3D) and a huge HD
(the rule of thumb is: the index takes approximately
as much space as the plain text being indexed).
In the case of an in-house installation, what happens on my server
during the indexing process?
When the re-indexing is scheduled (manually from the Control
Center or automatically on a regular basis) the following
actions take place:
- The crawler starts, establishes HTTP connections to the
target webpages, and downloads them to the index database
(normally MySQL). The frequency of downloads and the
bandwidth used can be regulated through the config file.
The crawling process takes minutes for smaller websites
and can run into hours for larger portals.
- The crawled pages are parsed, and their attributes are extracted
and added to the index. The CPU load can run high, though
the software takes only unused CPU, giving priority to
other interactive processes (e.g. the HTTP server or the serving
of search queries).
- Hard drive space is required for the index (permanent
use), and some additional hard drive space is required for temporary
storage during re-indexing. An example of temporary
space usage: the crawler downloads a PDF document and
saves it for a moment before it is passed to the
PDF parser. The size of the temporary cache can be
regulated. The rule of thumb is: the index takes
approximately as much hard drive space as the
plain text (stripped of formatting) being indexed.
- At the end of the re-indexing, the index is optimized
for speed. Only the resources (cache, memory, ...) of the
backend database (normally MySQL) are used for this, in
accordance with its settings.
- Done. Indexer frees all memory, CPU, and hard drive
resources. The search engine is now serving search queries
only.
How can I boost selected results to the top of the hit list?
There is a boost clause available for this in the XML interface.
See the example below:
<query max_results="10" start_from="1" show_attrs="title relevance">
<condition attr="body" predicate="match" value="Synthese"/>
<condition attr="relevance" predicate="ge" value="0"/>
<boost predicate="eq" attr="relevance" value="2" coef="0.4"/>
<boost predicate="eq" attr="relevance" value="1" coef="0.2"/>
</query>
It boosts the documents that satisfy the boost condition.
In the example above, all documents with relevance greater than or
equal to zero are presented in the result list. Among those,
the documents with relevance values 2 and 1 are boosted
by coefficients 0.4 and 0.2 respectively.
Boosting with a coef of 0.4 means that the scores of the boosted
documents are increased by 0.4 times the score of the top
document (the document with the maximum score before boosting).
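The arithmetic can be followed with a small worked example. The document names and scores below are made up, and the sketch assumes each document matches at most one boost clause, as in the query above.

```python
def apply_boosts(docs, boosts):
    """Add coef * (top score before boosting) to every document that is boosted.

    docs   -- list of (name, score, relevance) tuples
    boosts -- dict mapping a relevance value to its coef
    """
    top_score = max(score for _, score, _ in docs)
    boosted = [(name, score + boosts.get(relevance, 0.0) * top_score)
               for name, score, relevance in docs]
    # Highest boosted score first, as in a hit list.
    return sorted(boosted, key=lambda item: item[1], reverse=True)

if __name__ == "__main__":
    docs = [("a", 0.8, 0), ("b", 0.6, 2), ("c", 0.5, 1)]
    for name, score in apply_boosts(docs, {2: 0.4, 1: 0.2}):
        print(name, round(score, 2))
```

Here the top score before boosting is 0.8, so document b gains 0.4 * 0.8 = 0.32 and moves ahead of document a, while document c gains 0.2 * 0.8 = 0.16 and stays behind it.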