Using PHP to parse RSS
Save the above XML as a file called phpbuilder.rss, as we’re going to use it in the following examples. You can of course use any existing RSS feed, but for demonstration purposes it’ll be easier if your example is exactly the same as the one I’m using. Let’s look at the built-in PHP functions we’ll be using.
* xml_parser_create()
* xml_set_element_handler($xml_resource,$start_element_function,$end_element_function)
* xml_set_character_data_handler($xml_resource,$character_data_handler)
* xml_parse($xml_resource,$data)
The first, xml_parser_create(), creates an XML parser and returns a resource handle that the other functions use. Since an XML feed is made up of a number of elements, and these can differ from feed to feed, we need to apply logic as we traverse the feed. Elements can contain multiple sub-elements and attributes. To assist with this, xml_set_element_handler() defines the functions to be called when an element opens or closes. It takes three arguments: the first is the XML resource handle returned by xml_parser_create(), the second is the name of a function called automatically when a new element is reached during parsing, and the third is the name of the function called automatically when the end of an element is reached. (The latter two arguments can also be arrays containing an object and a method name, which the main examples here don’t use.) Calling a function at the start and end of an element is all very well, but we also need to perform some logic on the character data between the tags. xml_set_character_data_handler() sets the function that handles this. The first of its two arguments is again the XML resource handle, and the second is the name of the function called with the character data during parsing.
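For the curious, the array form of the handler arguments might look like the sketch below. The EventLogger class, its method names and the one-line XML string are all made up for illustration; only the xml_* calls are PHP’s own. It simply records each callback so you can see the order in which the parser fires them.

```php
<?php
// Hypothetical helper class (not part of the article's script): it records
// every callback the parser makes so the call order can be inspected.
class EventLogger
{
    public $events = array();

    public function startElement($xp, $name, $attributes)
    {
        $this->events[] = "Start $name";
    }

    public function endElement($xp, $name)
    {
        $this->events[] = "End: $name";
    }

    public function characterData($xp, $data)
    {
        // Ignore whitespace-only chunks between tags
        if (trim($data) !== '') {
            $this->events[] = "Data: $data";
        }
    }
}

$logger = new EventLogger();
$xp = xml_parser_create();

// Each handler is given as array(object, method name) instead of a function name
xml_set_element_handler($xp, array($logger, 'startElement'), array($logger, 'endElement'));
xml_set_character_data_handler($xp, array($logger, 'characterData'));

// Made-up one-line document, parsed in one call (third argument true = final chunk)
xml_parse($xp, '<channel><title>Example</title></channel>', true);

echo implode("\n", $logger->events), "\n";
```

Keeping the handlers on an object means any state they share can live in properties, which is one way to avoid global variables once the script grows.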
xml_parse() is the function that actually starts parsing the feed. It takes the XML resource handle as the first argument and a string containing a portion of the feed as the second. An optional third argument indicates whether the string is the last piece of data in this parse, but we won’t use that this time either. Let’s start with a simple skeleton to see how this all works:
<?php
$rssFeeds = array('phpbuilder.rss');
// for now we'll just have the one file, but this can later be expanded

// Loop through the array (just one element for now) and read the feed
foreach ($rssFeeds as $feed) {
    readFeeds($feed);
}

// The function to be called when a start element is read. For now we'll
// just echo some output
function startElement($xp, $name, $attributes) {
    echo "Start $name <br>";
}

// The function to be called when an end element is read
function endElement($xp, $name) {
    echo "End: $name<br>";
}

function readFeeds($feed) {
    $fh = fopen($feed, 'r');   // open file for reading
    $xp = xml_parser_create(); // create an XML parser resource
    // define which functions to call when an element is started/ended
    xml_set_element_handler($xp, "startElement", "endElement");
    // feed the file to the parser in chunks of 4096 bytes
    while ($data = fread($fh, 4096)) {
        if (!xml_parse($xp, $data)) {
            return 'Error in the feed';
        }
    }
}
?>
If you run this script, it’ll output the following:
Start RSS
Start CHANNEL
Start PUBDATE
End: PUBDATE
Start DESCRIPTION
End: DESCRIPTION
Start LINK
End: LINK
Start TITLE
End: TITLE
Start WEBMASTER
End: WEBMASTER
Start LANGUAGE
End: LANGUAGE
Start ITEM
Start TITLE
End: TITLE
Start LINK
End: LINK
Start DESCRIPTION
End: DESCRIPTION
End: ITEM
Start ITEM
Start TITLE
End: TITLE
Start LINK
End: LINK
Start DESCRIPTION
End: DESCRIPTION
End: ITEM
End: CHANNEL
End: RSS
Note how startElement() and endElement() are called. It’s important that you understand this, which is why I’ve started with this skeleton, and not particularly useful, piece of code. This mechanism may be tricky to understand at first, but it is quite fundamental to the way XML is parsed.
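If the order of the calls is hard to picture, it can help to indent each line by how deeply nested the element is. The following is only a sketch: the $depth counter, the $lines array and the cut-down XML string are my own additions for illustration, and the handlers are written as closures so the snippet stands on its own.

```php
<?php
$depth = 0;        // hypothetical counter: how many elements are currently open
$lines = array();  // collected output, one entry per callback

$start = function ($xp, $name, $attributes) use (&$depth, &$lines) {
    $lines[] = str_repeat('  ', $depth) . "Start $name";
    $depth++; // everything up to the matching end tag is one level deeper
};

$end = function ($xp, $name) use (&$depth, &$lines) {
    $depth--; // back to the level of the matching start tag
    $lines[] = str_repeat('  ', $depth) . "End: $name";
};

$xp = xml_parser_create();
xml_set_element_handler($xp, $start, $end);

// Cut-down stand-in for phpbuilder.rss
$xml = '<rss><channel><title>Feed</title><item><title>Post</title></item></channel></rss>';
xml_parse($xp, $xml, true);

echo implode("\n", $lines), "\n";
```

Each End: line comes out at the same indentation as its matching Start line, and the children of ITEM sit one level deeper than ITEM itself, which makes it easier to see that the parser walks the document strictly in order and never jumps around.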
Now let’s add the character data handler function. This is the one that actually does something with the data between the open and close tags. Add the characterDataHandler() function and the xml_set_character_data_handler() call, as shown below:
<?php
$rssFeeds = array('phpbuilder.rss');

// Loop through the array, reading the feeds one by one
foreach ($rssFeeds as $feed) {
    readFeeds($feed);
}

function startElement($xp, $name, $attributes) {
    echo "Start $name<br>";
}

function endElement($xp, $name) {
    echo "End: $name<br>";
}

// Called with the text found between the open and close tags
function characterDataHandler($xp, $data) {
    echo "Data: $data<br>";
}

function readFeeds($feed) {
    $fh = fopen($feed, 'r');   // open file for reading
    $xp = xml_parser_create(); // create an XML parser resource
    // define which functions to call when an element is started/ended
    xml_set_element_handler($xp, "startElement", "endElement");
    // define which function to call for character data
    xml_set_character_data_handler($xp, "characterDataHandler");
    while ($data = fread($fh, 4096)) {
        if (!xml_parse($xp, $data)) {
            return 'Error in the feed';
        }
    }
}
?>
Start RSS
Data:
Data:
Start CHANNEL
Data:
Data:
Start PUBDATE
Data: Thu, 29 Sep 2006 15:16:13 GMT
End: PUBDATE
Data:
Data:
Start DESCRIPTION
Data: Newest Articles and How-To’s on PHPBuilder.com
End: DESCRIPTION
Data:
Data:
Start LINK
Data: https://phpbuilder.com
End: LINK
Data:
Data:
Start TITLE
Data: PHPBuilder.com New Articles
End: TITLE
Data:
Data:
Start WEBMASTER
Data: [email protected]
End: WEBMASTER
Data:
Data:
Data:
Start LANGUAGE
Data: en-us
End: LANGUAGE
Data:
Data:
Start ITEM
Data:
Data:
Start TITLE
Data: In Case You Missed It…The Week of September 26, 2006
End: TITLE
Data:
Data:
Start LINK
Data: https://phpbuilder.com/columns/weeklyroundup20060926.php3
End: LINK
Data:
Data:
Start DESCRIPTION
Data: In Case You Missed It…The Week of September 26, 2006
End: DESCRIPTION
Data:
Data:
End: ITEM
Data:
Data:
Start ITEM
Data:
Data:
Start TITLE
Data: In Case You Missed It…The Week of September 19, 2006
End: TITLE
Data:
Data:
Start LINK
Data: https://phpbuilder.com/columns/weeklyroundup20060919.php3
End: LINK
Data:
Data:
Start DESCRIPTION
Data: In Case You Missed It…The Week of September 19, 2006
End: DESCRIPTION
Data:
Data:
End: ITEM
Data:
Data:
End: CHANNEL
Data:
End: RSS
Note that the empty Data: rows come from the whitespace (line breaks and indentation) between the tags in the phpbuilder.rss file. Now that we have a good idea how the mechanism works, let’s do some useful parsing. We’re going to keep things easy to follow, if not particularly elegant, and use global variables to keep track of what’s happening. Rewrite the three functions as follows:
function startElement($xp, $name, $attributes) {
    global $item, $currentElement;
    // the other functions will always know which element we're parsing
    $currentElement = $name;
    // by default the parser converts element names to uppercase
    if ($currentElement == 'ITEM') {
        // We're only interested in the contents of the item element.
        // This flag keeps track of where we are
        $item = true;
    }
}

function endElement($xp, $name) {
    global $item, $currentElement, $title, $description, $link;
    if ($name == 'ITEM') {
        // If we're at the end of the item element, display
        // the data, and reset the globals
        echo "<b>Title:</b> $title<br>";
        echo "<b>Description:</b> $description<br>";
        echo "<b>Link:</b> $link<br><br>";
        $title = '';
        $description = '';
        $link = '';
        $item = false;
    }
}

function characterDataHandler($xp, $data) {
    global $item, $currentElement, $title, $description, $link;
    if ($item) {
        // Only add to the globals if we're inside an item element.
        switch ($currentElement) {
            case "TITLE":
                // We use .= because this function may be called multiple
                // times for one element.
                $title .= $data;
                break;
            case "DESCRIPTION":
                $description .= $data;
                break;
            case "LINK":
                $link .= $data;
                break;
        }
    }
}
Here’s what the above changes attempt to do. We need to know which particular element we’re working with at any one time. Inside startElement(), we create a global variable, $currentElement, which is set every time startElement() is called. It is assigned a string containing the name of the current element. By default the XML parser applies what PHP calls case folding, which means that element names are automatically converted to uppercase. Then, in characterDataHandler(), we check what this variable is set to (using the switch statement), and append the data, or the contents of the tag, to an appropriately named variable (one of $title, $description or $link). These are also global, as they are used for display purposes in endElement(). For now we’ll only worry about these three compulsory elements – you can easily extend this to include other, optional, elements later. Here’s what the script incorporating the above changes outputs:
Title: In Case You Missed It…The Week of September 26, 2006
Description: This week Elizabeth brings us news of an upcoming free
webcast called Design Patterns in PHP, the schedule for the Fall Zend conference,
security alerts for Moveable Type and phpBB, the release of Zend Platform 2,
XAMPP for Linux, the latest PEAR/PECL releases and much more!
Link: https://phpbuilder.com/columns/weeklyroundup20060926.php3
Title: In Case You Missed It…The Week of September 19, 2006
Description: This week Elizabeth brings us news of the release of PEAR
1.4, Zend Studio 5 Beta, a security vulnerability with PHP-Nuke, the release
of a SimpleTest plugin for PHPEclipse, a patch for phpMyAdmin, the latest
PEAR/PECL releases and much, much more!
Link: https://phpbuilder.com/columns/weeklyroundup20060919.php3
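If the global variables bother you, one possible variation is to collect the items into an array and leave the display to the caller. Treat the following as a sketch rather than the definitive way to do it: the parseItems() function, the array keys and the sample feed string are all my own invention. It uses the optional third argument of xml_parse() to mark the final chunk of data, and asks the parser for an error message when parsing fails.

```php
<?php
// Hypothetical helper, not from the article: returns an array of items,
// each an array with 'title', 'description' and 'link' keys.
function parseItems($xml)
{
    $items = array();
    $current = null; // the item being built, or null outside an item element
    $element = '';   // name of the element currently open

    $xp = xml_parser_create();

    xml_set_element_handler(
        $xp,
        function ($xp, $name, $attributes) use (&$current, &$element) {
            $element = $name;
            if ($name == 'ITEM') {
                $current = array('title' => '', 'description' => '', 'link' => '');
            }
        },
        function ($xp, $name) use (&$items, &$current, &$element) {
            if ($name == 'ITEM') {
                $items[] = array_map('trim', $current); // drop stray whitespace
                $current = null;
            }
            $element = ''; // text after a closing tag belongs to no element
        }
    );

    xml_set_character_data_handler(
        $xp,
        function ($xp, $data) use (&$current, &$element) {
            $key = strtolower($element); // names arrive in uppercase
            if ($current !== null && isset($current[$key])) {
                $current[$key] .= $data; // may arrive in several chunks
            }
        }
    );

    // The third argument tells the parser this is the final chunk of data
    if (!xml_parse($xp, $xml, true)) {
        $message = xml_error_string(xml_get_error_code($xp));
        $line = xml_get_current_line_number($xp);
        throw new RuntimeException("Error in the feed: $message (line $line)");
    }

    return $items;
}

// Made-up feed with a single item
$xml = '<rss><channel><title>Feed</title><item><title>First post</title>'
     . '<link>https://example.com/1</link><description>Hello</description>'
     . '</item></channel></rss>';

foreach (parseItems($xml) as $item) {
    echo "Title: {$item['title']}\n";
    echo "Description: {$item['description']}\n";
    echo "Link: {$item['link']}\n\n";
}
```

Because the whole feed is passed in as one string here, you would read the file with something like file_get_contents() first; for very large feeds the chunked fread() loop from the earlier listings is gentler on memory.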
Conclusion
It’s been more convoluted than you may perhaps have expected (compare this to reading and parsing a file!) but you should now be able to successfully make some use of a basic RSS feed. In part 2, next week, we look at combining multiple feeds, and making use of some of the other elements. Until then!