Dynamic Document Search Engine - Part 1
M.Murali Dharan
Introduction:
I started working with PHP six months ago, and reading articles on the Internet gave me a better
understanding of the language. I then began developing software for “Online Journals” that
can search the contents of documents. You can find articles on devarticles.com that
perform keyword, title, and author searches. This article gives you a brief introduction to
Document-Based Search.
What is Document Search?
In a Dynamic Document Search, every word in the document is parsed (read) and matched against the search words.
Results are displayed based on the matches found.
Reading every word of every article and matching it against the search words across thousands or even lakhs (hundreds of
thousands) of documents is a very expensive task. In addition, PHP by default limits a script to 30 seconds of execution
time (the max_execution_time setting).
Prerequisites:
To understand this article, you should have a fair knowledge of PHP. To run the examples on your machine,
you need Apache, PHP, and MySQL installed and configured. I used PHP version 4.3.1 and MySQL 2.2.3.
Building Database:
The database consists of three tables: the Content table, the Keyword table, and the Link table. The Content table holds
each article’s title and abstract. The Keyword table holds the keywords, and its keyword field is indexed. The Link table
holds pairs of keyword id and content id.
The SQL statements for creating these three tables are shown below.
Content Table:
CREATE TABLE content (
  contid mediumint(9) NOT NULL auto_increment,
  title text,
  abstract longtext,
  PRIMARY KEY (contid)
) TYPE=MyISAM;
Keyword Table:
CREATE TABLE keytable (
  keyid mediumint NOT NULL auto_increment,
  keyword varchar(100) default NULL,
  PRIMARY KEY (keyid),
  KEY keyword (keyword)
) TYPE=MyISAM;
Link Table:
CREATE TABLE link (
  keyid mediumint NOT NULL,
  contid mediumint NOT NULL
) TYPE=MyISAM;
Preparing Database:
An input interface with an HTML form is created to enter the title and the document text. When the form is submitted,
the title and the abstract are stored in the content table, and the newly generated content id is kept temporarily
in a variable. In the next step, an ‘Upload Engine’ parses each word in the abstract and processes the whole text.
It removes common words such as is, was, and, if, so, else, and then, and stores each remaining word in the wordmap array,
making sure that every word has only one entry in the array.
For every word in the wordmap array, the keyword table is searched for a match. If there is a match, the key id of the
matching keyword and the content id generated earlier are stored in the link table. Otherwise, the new keyword is inserted
into the keyword table, and the link table is updated with the newly generated key id and the content id. With that, the
database is prepared.
The code snippet given below explains every step of the program.
Searching the keyword table separately for every word is a long process and reduces the efficiency of the program. To avoid
this, all the keywords in the keyword table are first loaded into an associative array, $allWords. PHP implements
associative arrays as hash tables, so looking up a word by key is very fast. Here is the function.
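As a minimal sketch (not the author's original listing), the keyword lookup array could be built as follows. The helper name buildAllWords() and the shape of the input rows are assumptions for illustration; in a real script the rows would come from a query such as SELECT keyid, keyword FROM keytable.

```php
<?php
// Hypothetical helper: turns rows fetched from keytable into a lookup map.
// Each row is assumed to be an array with 'keyid' and 'keyword' entries.
function buildAllWords(array $rows): array
{
    $allWords = [];
    foreach ($rows as $row) {
        // keyword => keyid, so a later lookup is a single array access
        $allWords[$row['keyword']] = (int) $row['keyid'];
    }
    return $allWords;
}

// Sample rows standing in for the result of the SELECT on keytable.
$rows = [
    ['keyid' => 1, 'keyword' => 'search'],
    ['keyid' => 2, 'keyword' => 'engine'],
];
$allWords = buildAllWords($rows);

// isset() answers "is this keyword already known?" without another query.
$knownEngine = isset($allWords['engine']);
$knownPaging = isset($allWords['paging']);
```

With the map in memory, checking a word costs one array access instead of one query per word.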
Common Words:
$COMMON_WORDS is an associative array whose keys are words that are commonly used
in the English language. These words have to be removed while parsing the file.
$COMMON_WORDS=array("a"=>1, "as"=>1);
You can add as many common words as you like. See the source code for the full list of common words.
ExtractWords() Function:
This function filters words by allowing only alphabetic characters. To implement this, I used a technique called a
STATE MACHINE to filter the characters.
Alphabetic characters are treated as STATE1, and all other characters (numeric and special characters)
as STATE0. Initially the machine is in STATE0.
While parsing, when it encounters an alphabetic character the machine switches to (or stays in) STATE1; any other
character puts it back in STATE0 and ends the current word. As a result, each word contains only alphabetic characters,
and the list of words is returned as an array to the calling function.
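The two-state machine described above can be sketched as follows. This is an illustrative version under stated assumptions, not the original ExtractWords() listing: it assumes ASCII input and lower-cases each word so that stored keywords line up with the lower-cased search words used later.

```php
<?php
// Sketch of a two-state word extractor.
// STATE0 (0) = outside a word, STATE1 (1) = inside a run of alphabetic characters.
function ExtractWords(string $text): array
{
    $words = [];
    $current = '';
    $state = 0;                       // the machine starts in STATE0
    $length = strlen($text);

    for ($i = 0; $i < $length; $i++) {
        $ch = $text[$i];
        if (ctype_alpha($ch)) {
            $state = 1;               // alphabetic: enter or stay in STATE1
            $current .= strtolower($ch);
        } elseif ($state === 1) {
            $words[] = $current;      // leaving STATE1 ends the current word
            $current = '';
            $state = 0;
        }
    }
    if ($state === 1) {
        $words[] = $current;          // flush a word that ends the text
    }
    return $words;
}

$words = ExtractWords("PHP 4 is fun, isn't it?");
// $words is now ['php', 'is', 'fun', 'isn', 't', 'it']
```

Note that digits and punctuation act purely as separators, so "isn't" is split into two words.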
FilterCommonAndDuplicateWords() Function:
This function is called after the ExtractWords() function. It goes through the extracted words and removes common words
such as ‘a’, ‘is’, ‘was’, and ‘and’. The remaining words are treated as valid; duplicates among them are removed, and the
result is stored in an associative array, $wordMap, which is returned to the calling function.
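A minimal sketch of this filtering step is shown here. The short $COMMON_WORDS list is a stand-in for the article's full list, and passing it as a parameter is an assumption made to keep the example self-contained.

```php
<?php
// Short stand-in list; the article's full list of common words is longer.
$COMMON_WORDS = ['a' => 1, 'as' => 1, 'is' => 1, 'was' => 1, 'and' => 1];

function FilterCommonAndDuplicateWords(array $words, array $commonWords): array
{
    $wordMap = [];
    foreach ($words as $word) {
        if (isset($commonWords[$word])) {
            continue;                 // skip common words
        }
        $wordMap[$word] = 1;          // assigning the same key twice keeps one entry
    }
    return $wordMap;
}

$wordMap = FilterCommonAndDuplicateWords(
    ['search', 'is', 'a', 'search', 'engine'],
    $COMMON_WORDS
);
// $wordMap is now ['search' => 1, 'engine' => 1]
```

Using the word itself as the array key removes duplicates as a side effect, so no separate de-duplication pass is needed.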
Process Form function():
This is the core part of the upload program. It is called after the words have been filtered and the common and duplicate
words removed. First, the function inserts the title and abstract into the content table and stores the newly generated
content id in $contentId. Then it updates the keyword and link tables.
For every word in the $wordMap array, if the word already exists in the keyword table, the function
inserts its key id and the content id into the link table. If the word is not found, it inserts the new word into the
keyword table, stores the newly generated key id in $keyId, and then inserts that
key id and the content id into the link table.
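The keyword and link update can be sketched with in-memory arrays standing in for the two tables. This is an assumption made so the example runs without a database: the function name linkWords() is hypothetical, and a real version would issue INSERT statements and read the auto-increment id from the database instead.

```php
<?php
// $allWords plays the role of keytable (keyword => keyid),
// $links plays the role of the link table (rows of keyid, contid).
function linkWords(array $wordMap, int $contentId, array &$allWords, array &$links): void
{
    foreach (array_keys($wordMap) as $word) {
        if (isset($allWords[$word])) {
            $keyId = $allWords[$word];                        // keyword already known
        } else {
            $keyId = $allWords ? max($allWords) + 1 : 1;      // simulates auto_increment
            $allWords[$word] = $keyId;                        // "insert" into keytable
        }
        $links[] = ['keyid' => $keyId, 'contid' => $contentId];
    }
}

$allWords = ['search' => 1, 'engine' => 2];
$links = [];
linkWords(['engine' => 1, 'paging' => 1], 7, $allWords, $links);
// 'engine' reuses key id 2; 'paging' is new and receives key id 3.
```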
The following code snippet is the starting point of execution and calls all of the functions above. It connects
to the database server and selects the database. Initially the form() function is called, which lets you enter the title
and abstract of the document.
Search Engine:
A PHP script makes it possible to query the database through an HTML form. It works like any other search
engine: the user enters a word in a text box, hits enter, and the interface presents a result page with links to the pages
that contain the word searched for.
In this example, the order in which the results are displayed is determined by the number of search
words that appear in each document.
Declare an associative array $CommonWords that contains common words like ‘is’, ‘in’, ‘was’, etc.
First, convert all the search words to lower case.
$search_keywords=strtolower(trim($keywords));
Next, perform an explode operation on the search words to store each search word in an array.
The code is shown here.
$arrWords = explode(" ", $search_keywords);
Next, remove duplicate words from $arrWords.
$arrWords = array_unique($arrWords);
In a search operation, we first have to remove common words like ‘is’, ‘in’, ‘was’, and so on. This refines our search
criteria. To implement this, the common words are stored in the associative array $CommonWords.
Next, remove the common words from the search words. The remaining search words are stored in $searchWords, and the
common words that were removed are stored in $junkWords. Here is the code.
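The whole preparation of the search words can be sketched as below. It is an illustrative version, not the original listing: the sample input and the short $CommonWords list are made up, and preg_split() is swapped in for explode() so that repeated spaces do not produce empty entries.

```php
<?php
$CommonWords = ['is' => 1, 'in' => 1, 'was' => 1];   // short stand-in list

$keywords = '  Paging is IN php  paging ';            // sample user input
$search_keywords = strtolower(trim($keywords));

// Split on runs of whitespace, then drop duplicate words.
$arrWords = array_unique(
    preg_split('/\s+/', $search_keywords, -1, PREG_SPLIT_NO_EMPTY)
);

$searchWords = [];
$junkWords = [];
foreach ($arrWords as $word) {
    if (isset($CommonWords[$word])) {
        $junkWords[] = $word;         // common word, ignored by the search
    } else {
        $searchWords[] = $word;       // word that will actually be searched
    }
}
// $searchWords is ['paging', 'php'] and $junkWords is ['is', 'in']
```

Keeping the removed words in $junkWords makes it possible to tell the user which words were ignored.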
We can display results in two ways.
Type 1: Display the document if all the search words are present in the document.
Type 2: Display the document if any one of the search words is present.
If you want to perform the Type 1 operation, include the following code snippet in your program.
//count the number of search words and store it in a variable
$noofSearchWords=count($searchWords);
$noofSearchWords stores the number of search words. Later, after looking up the search words in the keyword table, we get
a set of records, and we can perform a logical AND operation on them to display only the desired results.
If $noofSearchWords is equal to the number of records, the next part of the program gets
executed. Otherwise, “NO SEARCH RESULT FOUND” is displayed.
In the next step, we search the keyword table for the words in the $searchWords array. The following code snippet
returns the list of key ids that match the query.
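One way to build such a lookup is sketched here. The table and column names come from the schema above, but the helper name buildKeywordQuery() is hypothetical, and the use of placeholders with PDO is an assumption; the original article, written for PHP 4, would have used a different database API.

```php
<?php
// Builds a parameterised query that looks up all search words at once.
function buildKeywordQuery(array $searchWords): array
{
    if (count($searchWords) === 0) {
        throw new InvalidArgumentException('At least one search word is required.');
    }
    // One "?" placeholder per search word, e.g. "?, ?" for two words.
    $placeholders = implode(', ', array_fill(0, count($searchWords), '?'));
    $sql = "SELECT keyid, keyword FROM keytable WHERE keyword IN ($placeholders)";
    return [$sql, array_values($searchWords)];
}

list($sql, $params) = buildKeywordQuery(['paging', 'php']);

// With a PDO connection in $pdo (not created here), the query could be run as:
//   $stmt = $pdo->prepare($sql);
//   $stmt->execute($params);
//   $keyIds = $stmt->fetchAll(PDO::FETCH_COLUMN, 0);
```

Passing the words as parameters rather than concatenating them into the SQL string keeps user input out of the query text.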
As discussed earlier, if you need to perform the Type 1 operation, you have to check whether the number of search words
equals the number of records returned by the query. If they are equal, you can proceed to the next step; otherwise,
display a message that no search results were found.
Here is the code.
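A minimal sketch of the Type 1 check follows. The helper name allWordsFound() and the sample key ids are assumptions for illustration; $keyIds stands for the key ids returned by the keyword query.

```php
<?php
// True only when every search word produced a record in the keyword table.
function allWordsFound(array $searchWords, array $keyIds): bool
{
    $noofSearchWords = count($searchWords);
    return $noofSearchWords > 0 && $noofSearchWords === count($keyIds);
}

$searchWords = ['paging', 'php'];
$keyIds = [12];                       // sample: only one of the two words was found

if (allWordsFound($searchWords, $keyIds)) {
    $message = 'Continue with the link table lookup.';
} else {
    $message = 'NO SEARCH RESULT FOUND';
}
```

This check only confirms that each word exists somewhere in the keyword table. To require all words in the same document, the per-document match count would also have to equal the number of search words.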
The following code searches the link table for occurrences of the key ids. It returns an array that contains the
content ids.
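The counting step can be sketched as below, assuming the array is keyed by content id and holds the number of matched search words as its value. The helper name countMatches() and the sample rows are hypothetical; in a real script the rows would come from a query on the link table restricted to the matching key ids.

```php
<?php
// Counts, for each content id, how many of the matched key ids link to it.
// Each (keyid, contid) pair is assumed to occur only once in the link table.
function countMatches(array $linkRows): array
{
    $contArray = [];
    foreach ($linkRows as $row) {
        $contId = (int) $row['contid'];
        $contArray[$contId] = ($contArray[$contId] ?? 0) + 1;
    }
    return $contArray;
}

// Sample rows standing in for the link table query result.
$linkRows = [
    ['keyid' => 2, 'contid' => 7],
    ['keyid' => 3, 'contid' => 7],
    ['keyid' => 2, 'contid' => 9],
];
$contArray = countMatches($linkRows);
// Document 7 matches two search words, document 9 matches one.
```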
Sort the array in descending order of its values. This orders the documents from the highest number of occurrences to
the lowest. For example, if the number of search words is four, documents matching 4 words are displayed first, then 3,
then 2, and finally 1.
//Sort array in descending order of its values, keeping the keys
arsort($contArray, SORT_NUMERIC);
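As a small illustration with made-up numbers, arsort() reorders the entries by value while keeping each content id attached to its count:

```php
<?php
// content id => number of matched search words (sample data)
$contArray = [9 => 1, 7 => 3, 4 => 2];

// Highest count first; the keys (content ids) stay with their values.
arsort($contArray, SORT_NUMERIC);

// The content ids in display order.
$orderedIds = array_keys($contArray);
// $contArray is now [7 => 3, 4 => 2, 9 => 1] and $orderedIds is [7, 4, 9]
```

The second argument of arsort() is a comparison flag such as SORT_NUMERIC; the descending direction comes from the function itself.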
In the next step, we fetch the title and the first 200 words of the abstract from the content table into an array,
$FoundRef.
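Cutting an abstract down to its first 200 words could look like the sketch below. The helper name firstWords(), the trailing "..." marker, and the layout of the $FoundRef entry are assumptions made for illustration.

```php
<?php
// Returns the first $limit words of $text, adding "..." when text was cut off.
function firstWords(string $text, int $limit = 200): string
{
    $words = preg_split('/\s+/', trim($text), -1, PREG_SPLIT_NO_EMPTY);
    if (count($words) <= $limit) {
        return implode(' ', $words);
    }
    return implode(' ', array_slice($words, 0, $limit)) . '...';
}

// Sample row standing in for a record fetched from the content table.
$row = ['contid' => 7, 'title' => 'Paging in PHP', 'abstract' => 'one two  three four'];

$FoundRef = [];
$FoundRef[] = [
    'contid'  => $row['contid'],
    'title'   => $row['title'],
    'excerpt' => firstWords($row['abstract'], 3),   // small limit for the demo
];
```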
Finally, we display the results in the browser. Here is the code.
The HTML page that gets input from the user is given below.
The function getmicrotime() returns the current time with microsecond precision. It is called at the start and
at the end of the search process.
Conclusion:
In Part 1, the search engine searches for the occurrence of words in the document. Part 2 is slightly modified so that,
when we upload a document, the number of occurrences of each word is stored in the link table. The search engine then ranks
documents by the number of occurrences of each search word. For example, if the word ‘paging’ occurred 11 times and
‘programs’ occurred 12 times, then the rank for the document is 11 + 12 = 23.
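The ranking rule can be sketched as a simple sum. The helper name rankDocument() and the word-to-count array are assumptions for illustration; Part 2 may store and combine the counts differently.

```php
<?php
// $occurrences: word => number of times it appears in one document.
function rankDocument(array $occurrences, array $searchWords): int
{
    $rank = 0;
    foreach ($searchWords as $word) {
        $rank += $occurrences[$word] ?? 0;   // words not in the document add nothing
    }
    return $rank;
}

$occurrences = ['paging' => 11, 'programs' => 12, 'php' => 5];
$rank = rankDocument($occurrences, ['paging', 'programs']);
// $rank is 11 + 12 = 23
```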
Source Code: