Over at Developer.com I recently penned an article titled “Implement Data Indexing and Search with Lucene and Solr,” which introduced readers to the powerful Apache Lucene text search engine library. In my opinion, one of the most important takeaways of that article is that Lucene makes document-based text search possible but is not itself a search application. An end user cannot simply plug into it and begin sifting through a pile of electronic documents!
Rather, you take advantage of Lucene either by writing custom Java code capable of indexing the desired documents and providing a search interface (as I demonstrated in the aforementioned article using Lucene’s bundled demo), or by using one of the several Lucene implementations written in languages such as Perl, Python, or Ruby. You also have the option of looking towards one of the search platform implementations built atop Lucene, such as Solr (also introduced in the Developer.com article).
In this article I’ll show you how to take the second approach, using a Lucene implementation written in a language other than Java. Specifically, we’ll use PHP’s most prominent Lucene implementation, which also happens to be part of the Zend Framework: the Zend_Search_Lucene component.
Introducing Zend_Search_Lucene
The Zend_Search_Lucene component is a PHP 5-based Lucene implementation capable of indexing and searching several document types, among them HTML, Excel 2007, PowerPoint 2007, Word 2007, and XML. Additionally, you can use this component to supplant MySQL’s useful but limited full-text search feature. As with all Zend Framework components, you can tightly integrate Zend_Search_Lucene into your Zend Framework applications or use it separately within any PHP application. I’ve documented the latter approach in the PHPBuilder.com article “Running PHP and Zend Framework Scripts from the Command Line.” For the purposes of this demonstration, I’ll show you how to integrate the component into a Zend Framework application.
PHP + Lucene: Indexing a Database
Suppose you created an online service for job seekers, allowing them to generate an appealing downloadable resume simply by entering their contact information, education, and employment history into a Web form. Besides generating the resume, the service stores the job seeker’s information in a database, which employers can search in return for a small monthly fee.
You’d like to tout employers’ ability to perform power searches that allow them to comb over every conceivable characteristic of job seekers’ resumes, including being able to retrieve only resumes containing a specific term or phrase, and those that specifically do not contain a particular term or phrase. Sounds like a job made for Lucene, thanks to its powerful query parser syntax (see the Zend_Search_Lucene documentation for a list of minor differences from the original Lucene implementation)!
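To make the "contains" and "does not contain" searches concrete, here is a minimal sketch of how such a query string might be assembled. In Lucene's query syntax a leading + marks a clause that must match, a leading - marks one that must not, and double quotes delimit a phrase. The function names buildResumeQuery() and quotePhrase() are hypothetical helpers invented for this illustration; they are not part of Zend_Search_Lucene.

```php
<?php
/**
 * Wraps a term or phrase in double quotes so it is treated as one phrase.
 * Simplification for this sketch: embedded double quotes are dropped
 * rather than escaped.
 */
function quotePhrase($phrase)
{
    return '"' . str_replace('"', '', trim($phrase)) . '"';
}

/**
 * Builds a Lucene-style query string from phrases that must appear
 * and phrases that must not appear.
 */
function buildResumeQuery(array $required, array $excluded)
{
    $clauses = array();

    foreach ($required as $phrase) {
        // "+" marks a clause that must match
        $clauses[] = '+' . quotePhrase($phrase);
    }

    foreach ($excluded as $phrase) {
        // "-" marks a clause that must not match
        $clauses[] = '-' . quotePhrase($phrase);
    }

    return implode(' ', $clauses);
}

// Resumes mentioning "Zend Framework" and "MySQL" but not "Perl"
echo buildResumeQuery(array('Zend Framework', 'MySQL'), array('Perl')), "\n";
// prints: +"Zend Framework" +"MySQL" -"Perl"
```

A string built this way could then be handed to the index's find() method. User-supplied input containing other query-syntax characters would likely need more careful escaping than this sketch performs.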
In order to make a newly uploaded resume immediately available to prospective employers, we’ll index each job seeker’s information at the time it’s added to the database. To do so, we’ll index the searchable data and an associated identifier that links that data to its database record.
    // ... insert resume data into database

    // Retrieve the last insert ID
    $id = $db->lastInsertId();

    // Open the existing index
    $index = Zend_Search_Lucene::open('/var/www/dev.example.com/lucene-index');

    // Build a document pairing the searchable text with its database key
    $doc = new Zend_Search_Lucene_Document();
    $doc->addField(Zend_Search_Lucene_Field::UnIndexed('dbid', $id));
    $doc->addField(Zend_Search_Lucene_Field::Text('name', $form->getValue('name')));
    $doc->addField(Zend_Search_Lucene_Field::UnStored('education', $form->getValue('education')));
    $doc->addField(Zend_Search_Lucene_Field::UnStored('experience', $form->getValue('experience')));

    // Add the document to the index
    $index->addDocument($doc);
This snippet begins by opening the Lucene index using the static open() method. Unfortunately, open() works only on an index that already exists; the Zend_Search_Lucene component requires you to use a separate static method named create() in order to create the index. Therefore, you’ll want to run the create() method separately before opening the index for the first time.

Additionally, this snippet adds four fields to the index:
dbid: represents the primary key associated with the record just added to the database
name: contains the job seeker’s name
education: contains the job seeker’s supplied education history
experience: contains the job seeker’s supplied experience

Each of these fields is identified by a specific field type. The UnIndexed type identifies a non-searchable field that is returned with search results. The Text type identifies a field that is both searchable and returned with search results. The UnStored type identifies data that should be tokenized and indexed, but not stored in its entirety within the index. This is useful when you’re using Zend_Search_Lucene in conjunction with a database, because the full text already lives in the database and can be retrieved using the stored dbid value.

Still other field types exist; be sure to consult the documentation and Zend_Search_Lucene source code for more details.
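
Returning to the create() requirement described earlier, one way to avoid running a separate setup step is to wrap the decision in a small helper. The sketch below is only one possible approach: indexDirectoryIsEmpty() and openOrCreateIndex() are hypothetical names invented for this illustration, and the code assumes Zend Framework 1 is available on the include path. It treats a missing or empty directory as "no index yet" and calls create(); otherwise it calls open().

```php
<?php
// Hypothetical helpers for illustration; not part of the Zend Framework.

/**
 * Returns true when $path does not exist or contains no entries.
 */
function indexDirectoryIsEmpty($path)
{
    if (!is_dir($path)) {
        return true;
    }

    // scandir() always lists "." and "..", so ignore them
    return count(array_diff(scandir($path), array('.', '..'))) === 0;
}

/**
 * Opens the index at $path, creating it first if nothing is there yet.
 */
function openOrCreateIndex($path)
{
    // Assumption: Zend Framework 1 is on the include path
    require_once 'Zend/Search/Lucene.php';

    if (indexDirectoryIsEmpty($path)) {
        return Zend_Search_Lucene::create($path);
    }

    return Zend_Search_Lucene::open($path);
}
```

With a helper like this, the open() call in the indexing snippet could be replaced by openOrCreateIndex('/var/www/dev.example.com/lucene-index'). Note that a directory containing files that are not a valid index would still cause open() to fail, so this check is a convenience rather than a guarantee.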