|
 
Slapping together a search engine for your database is easy with PHP and
MySQL.
Clay Johnson
So you've got a dynamic site, filled with all sorts of user inputs, whether
it be a 'phorum', or like
my own site at knowpost.com. The site htdig.org will take care of indexing and searching your html pages, but if you are like me, you have very few html pages, and must of your
"content" resides in BLOBs in your database. You can't do anything useful using a like %searchword% query, it just isn't coming back relevant.
There has to be a better way, and indeed there is, with a few easy steps.
Here's how to slap one together:
Part one: BNR--Blob Noise Reduction
The first problem with your content is that it is filled with clunky
"noisewords," like "a,the,where,look"
Things that are there to help us humans to communicate, but really don't
have anything to do with relevance.
We gotta get rid of those. I've included a big list of noisewords
(noisewords.txt) for you to use, modify
or mutilate. Essentially, what we're trying to do here is get all those
noisewords out of your data, and build
a table with two columns, the word, and its indicator (the content
associated with it). We want something that
will eventually look like this:
+------+------------+ | qid | word | +------+------------+ | 6 | links | | 5 | Fire | | 5 | topics | | 5 | related | | 5 | Shakespeare| | 4 | people | | 4 | Knowpost | | 3 | cuba | | 3 | cigar | +------+------------+
Lets create our table now--
CREATE TABLE search_table( word VARCHAR(50), qid INT)
Next, since you want to make all your data compatible, not just new data, we
need to grab your sticky
blobs, and their identifiers out of your database:
<?php $query = "SELECT blob,identifier FROM your_table"; $result = mysql_query($query); $number = mysql_numrows($result); $j = 0; WHILE ($j < $number) {
/* Your "blob" */ $body = mysql_result($result,$j,"blob");
/* Your "identifier */ $qid = mysql_result($result,$j,"qid");
/* Open the noise words into an array */
$noise_words = file("noisewords.txt"); $filtered = $body;
/* Got to put a space before the first word in the body, so that we can recognize the word later */ $filtered = ereg_replace("^"," ",$filtered);
/* Now we suck out all the noisewords, and transform whats left into an array */
/* Brought to you by poor ereg coding! */ for ($i=0; $i < count($noise_words); $i++) { $filterword = trim($noise_words[$i]); $filtered = eregi_replace(" $filterword "," ",$filtered); }
$filtered = trim($filtered); $filtered = addslashes($filtered); $querywords = ereg_replace(",","",$filtered); $querywords = ereg_replace(" ",",",$querywords); $querywords = ereg_replace("\?","",$querywords); $querywords = ereg_replace("\(","",$querywords); $querywords = ereg_replace("\)","",$querywords); $querywords = ereg_replace("\.","",$querywords); $querywords = ereg_replace(",","','",$querywords); $querywords = ereg_replace("^","'",$querywords); $querywords = ereg_replace("$","'",$querywords);
/* We should now have something that looks like 'Word1','Word2','Word3' so lets turn it into an array */ $eachword = explode(",", $querywords);
/* and finally lets go through the array, and place each word into the database, along with its identifier */
for ($k=0; $k < count($eachword); $k++) { $inputword = "INSERT INTO search_table VALUES($eachword[$k],$qid)"; mysql_query($inputword); }
/* Get the next set of data */ $j++; }
?>
That script just handles your old data. You'll want to include a similar
function to strip the noisewords
out for every time new information comes into your database, through user
input, your input, etc...
so that your search engine is updated on the fly.
Part 2: Searching the Table
Now you have an easy to-use table of keywords and their associations. How do
you query this table? Here's
what I do:
First I format each searchterms passed into the script as
'word1','word2','word3' and stick it in a string
called $querywords.
Then I throw them into this SQL query:
SELECT count(search_table.word) as score, search_table.qid,your_table.blob FROM search_table,your_table WHERE your_table.qid = search_table.qid AND search_table.word IN($querywords) GROUP BY search_table.qid ORDER BY score DESC";
Set that query to $search, and print out the results like so:
<?php
$getresults = mysql_query($search); $resultsnumber = mysql_numrows($getresults);
IF ($resultsnumber == 0) {
PRINT "Your search returned no results. Try other keyword(s).";
} ELSEIF ($resultsnumber > 0) {
PRINT "Your search returned $resultsnumber results<BR>Listing them in order of relevance<BR><BR>"; for($count = 0; $count < $resultsnumber; $count++) { $body = mysql_result($getresults,$count,"blob"); $qid = mysql_result($getresults,$count,"qid"); //tighten up the results $body2print = substr($body, 0, 100); $cnote = $count+1; PRINT "$cnote. <a href=yourcontent.php3?qid=$qid> <i>$body2print...</i></a><BR>"; } }
?>
Presto, you've got keyword searching for your database, complete with
relevancy ranking. It may not be Google
or altavista.
It may not support all those fancy boolean operators, or excite's (*cough*)
conceptual mapping technology. But
it works, its quick
and enough to handle your user's demand.
Here's that list of noisewords:
noisewords.txt -------------- a about after ago all almost along also am an and answer any anybody anywhere are aren't around as ask at bad be been before being best better between big but by can can't come could couldn't day did didn't do does don't down each either else even ever every everybody everyone far find for found from get go going gone good got had has have haven't having her here hers him his home how href I if in into is isn't it its know large less like little looking look many me more most must my near never new news no none not nothing of off often old on once only or other our ours out over page please question rather recent she should sites small so some something sometime somewhere than true thank that the their theirs them then there these they this those though through thus time times to too under until untrue up upon use users version very via want was way web were what when where which who whom whose why wide will with within without world worse worst would www yes yet you your yours How
--clay
|