Building a PHP RSS Aggregator

By Voja Janjic
on February 6, 2013

RSS stands for Really Simple Syndication. It is a Web format that allows website owners to distribute their latest and frequently updated content in a standardized way. An RSS feed is simply an XML document that can be read by RSS reader software or parsed with the built-in XML functions of programming languages such as PHP or Java. This article focuses on building an RSS aggregator in PHP.

RSS File Structure

First, let’s introduce the structure of an RSS feed. As mentioned earlier, the feed is a standardized XML file. Although there are several versions of the RSS standard, a document typically has the following structure:

<?xml version="1.0" encoding="UTF-8" ?>
<rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/">
<channel>
        <title>RSS feed title</title>
        <description>RSS feed description</description>
        <link>http://www.yourdomain.com</link>

        <item>
                <title>Article title</title>
                <description>Article description.</description>
                <link>http://www.yourdomain.com/article.html</link>
                <guid>unique string per item – usually the url</guid>
                <pubDate>Fri, 28 Sep 2012 15:30:00 +0000</pubDate>
                <content:encoded><![CDATA[Article <b>description</b>]]></content:encoded>
        </item>

</channel>
</rss>

At the beginning of the document, the encoding and the RSS version are defined. After that comes data describing the feed itself, such as its title and description. Further below, the feed items are defined; their number is not limited. Items usually have the following data: title, description, link, guid and pubDate. The title and description describe the content, link is the full URL of that content, guid is a unique identifier of the particular article (usually its URL) and pubDate is the time when the content was published. Like all other dates and times in RSS, pubDate conforms to the Date and Time Specification of RFC 822.
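As a small illustration of the date format, the sketch below converts a pubDate string into a Unix timestamp using PHP's built-in DateTime class and the DATE_RSS format constant. The helper name parsePubDate is hypothetical, and falling back to strtotime() for feeds that deviate from the format is an assumption about how lenient you want to be.

```php
<?php
/**
 * Convert an RSS pubDate string to a Unix timestamp.
 * Returns null when the date cannot be parsed.
 * (parsePubDate is an illustrative helper, not a PHP built-in.)
 */
function parsePubDate($pubDate)
{
    $pubDate = trim($pubDate);

    // DATE_RSS is PHP's built-in format string "D, d M Y H:i:s O".
    $date = DateTime::createFromFormat(DATE_RSS, $pubDate);
    if ($date !== false) {
        return $date->getTimestamp();
    }

    // Assumption: be lenient and let strtotime() try other layouts.
    $timestamp = strtotime($pubDate);
    return $timestamp === false ? null : $timestamp;
}

$timestamp = parsePubDate('Fri, 28 Sep 2012 15:30:00 +0000');
echo gmdate('Y-m-d H:i:s', $timestamp), "\n"; // 2012-09-28 15:30:00
```

Storing the result as an integer matches the INT columns used for dates later in this article.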

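There are many other tags that can be used in RSS feeds, but they are optional and less common. However, one more tag is now widely used: content:encoded. It comes from the RSS content module rather than from core RSS 2.0, so the feed has to declare the content namespace (xmlns:content="http://purl.org/rss/1.0/modules/content/") on its rss element. Its purpose is to complement the description tag, which normally holds only a short summary: content:encoded carries the full content of the item, including HTML tags, usually wrapped in a CDATA section as shown in the example above.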
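Because content:encoded lives in its own XML namespace, it cannot be read like an ordinary child element. The following sketch shows one way to read it with SimpleXMLElement::children(), falling back to description when the tag is missing. The function name getItemContent and the sample feed are assumptions made for illustration.

```php
<?php
// Namespace URI of the RSS content module.
const CONTENT_NS = 'http://purl.org/rss/1.0/modules/content/';

/**
 * Return the richest content available for one feed item:
 * content:encoded if present, otherwise description.
 * (getItemContent is an illustrative helper name.)
 */
function getItemContent(SimpleXMLElement $entry)
{
    // Namespaced children must be requested by namespace URI.
    $encoded = (string) $entry->children(CONTENT_NS)->encoded;
    return $encoded !== '' ? $encoded : (string) $entry->description;
}

// Sample feed used only for this demonstration.
$xml = <<<XML
<?xml version="1.0" encoding="UTF-8" ?>
<rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/">
<channel>
    <title>RSS feed title</title>
    <item>
        <title>First</title>
        <description>Plain summary.</description>
        <content:encoded><![CDATA[Article <b>description</b>]]></content:encoded>
    </item>
    <item>
        <title>Second</title>
        <description>Only a summary.</description>
    </item>
</channel>
</rss>
XML;

$feed = new SimpleXMLElement($xml);
$contents = [];
foreach ($feed->channel->item as $entry) {
    $contents[] = getItemContent($entry);
}
print_r($contents);
```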

RSS Aggregator in PHP

The main task of an RSS aggregator is to combine data from multiple feeds and display it as one list. Although this can be done on the fly, fetching every feed on each page view would be quite slow, so we will store the data in database tables and read it from there:

CREATE TABLE IF NOT EXISTS `feed` (
  `id` INT(10) NOT NULL AUTO_INCREMENT ,
  `title` VARCHAR(255) NOT NULL ,
  `url` VARCHAR(255) NOT NULL ,
  `last_access` INT(10) NOT NULL ,
  `frequency` INT(5) NOT NULL ,
  PRIMARY KEY (`id`) ,
  UNIQUE INDEX `url_UNIQUE` (`url` ASC) )
ENGINE = MyISAM;

CREATE TABLE IF NOT EXISTS `article` (
  `id` INT(10) NOT NULL AUTO_INCREMENT ,
  `title` VARCHAR(255) NOT NULL ,
  `content` TEXT NOT NULL ,
  `url` VARCHAR(255) NOT NULL ,
  `pub_date` INT(10) NOT NULL ,
  `insert_date` INT(10) NOT NULL ,
  PRIMARY KEY (`id`) ,
  UNIQUE INDEX `url_UNIQUE` (`url` ASC) )
ENGINE = MyISAM;

The first table maintains the list of RSS feeds that will be fetched, while the second one contains the fetched articles. Now, let’s write a script that fetches the content of all RSS feeds from the list and inserts it into the database. This PHP script should be run as a cron job and should contain the following code:

<?php
// NOTE: the mysql_* functions were removed in PHP 7;
// on current PHP versions use mysqli or PDO instead.

// load the list of feeds
$feeds = array();
$q = mysql_query("SELECT * FROM feed");
while($row = mysql_fetch_array($q))
{
        $feeds[] = $row;
}

$now = time();
foreach($feeds as $feed)
{
        //check if RSS feed should be fetched (frequency is stored in minutes)
        if($feed['last_access'] + ($feed['frequency'] * 60) <= $now)
        {
                $content = file_get_contents($feed['url']); // this could be done with cURL
                if(!$content)
                {
                        continue;
                }

                try
                {
                        $xml = new SimpleXMLElement($content);
                }
                catch(Exception $e)
                {
                        continue; // the feed is not well-formed XML
                }

                //loop through all RSS items
                foreach($xml->channel->item as $entry)
                {
                        //escape every value before storing the item into the database
                        $item['title'] = mysql_real_escape_string((string) $entry->title);
                        $item['content'] = mysql_real_escape_string((string) $entry->description);
                        $item['guid'] = mysql_real_escape_string((string) $entry->link);
                        $item['pub_date'] = (int) strtotime((string) $entry->pubDate); // element names are case-sensitive
                        $item['insert_date'] = time();

                        $insert_q = "INSERT IGNORE INTO article (title,content,url,pub_date,insert_date) VALUES (
                                '".$item['title']."',
                                '".$item['content']."',
                                '".$item['guid']."',
                                '".$item['pub_date']."',
                                '".$item['insert_date']."'
                                )";
                        mysql_query($insert_q);
                }

                //change feed last access, only for feeds that were actually fetched
                mysql_query("UPDATE feed SET last_access='".time()."' WHERE id='".(int) $feed['id']."'");
        }
}
?>
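The scheduling rule used by the script (fetch a feed only once its frequency interval, stored in minutes, has elapsed since last_access) can be isolated into a small function, which makes the comparison easy to verify on its own. This is a minimal sketch; the name isFeedDue is hypothetical, and the array keys are assumed to match the columns of the feed table.

```php
<?php
/**
 * Decide whether a feed is due for fetching.
 * $feed['last_access'] is a Unix timestamp, $feed['frequency'] is in minutes.
 * (isFeedDue is an illustrative helper name.)
 */
function isFeedDue(array $feed, $now)
{
    // Earliest moment at which the feed may be fetched again.
    $nextFetch = (int) $feed['last_access'] + (int) $feed['frequency'] * 60;
    return $nextFetch <= $now;
}

$now  = 1000000;
$feed = ['last_access' => $now - 1800, 'frequency' => 60];

var_dump(isFeedDue($feed, $now));        // bool(false): only 30 of 60 minutes have passed
var_dump(isFeedDue($feed, $now + 1800)); // bool(true): the full hour has elapsed
```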

First, the list of RSS feeds is fetched from the database and stored in an array. I have used the basic PHP functions for accessing the database, but feel free to replace them with your own database class. After that, we loop through the array and check which feeds should be fetched. A feed should be fetched if the number of seconds elapsed between the last access time and the current time is at least the feed’s access frequency. Notice that the frequency is stored in minutes, so it has to be multiplied by 60. If this condition is true, the feed’s XML content is fetched by using one of the PHP functions for getting the source of a URL. In the code above, I have used the file_get_contents function, which returns false on failure. That is why we need to check whether an error occurred:

$content = file_get_contents($feed['url']);
if(!$content)
{
	continue;
}

This code simply skips a feed that cannot be accessed. If you want different error handling, alter the code in the if clause. The file_get_contents call can also be replaced by a cURL-based function:

$content = connect($feed['url']);
if(!$content)
{
	continue;
}

Where the function “connect” is:

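function connect($url)
{
        $ch = curl_init($url);

        // return the response as a string instead of printing it
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
        // follow redirects, as file_get_contents does by default
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
        // do not let one slow feed block the whole cron run
        curl_setopt($ch, CURLOPT_TIMEOUT, 30);

        // curl_exec returns false on failure
        $response = curl_exec($ch);
        curl_close($ch);
        return $response;
}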

Once the XML source of the RSS feed has been retrieved, it needs to be parsed. PHP has a built-in class for parsing XML: SimpleXMLElement. We create an instance of this class, loop through all items in the feed and store them in the database. Finally, the feed’s last access time is updated.
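The parsing step can also be written as a standalone function that turns the XML source into plain PHP arrays, which keeps it separate from the database code. The sketch below is one possible shape under stated assumptions: parseFeedItems is a hypothetical name, and returning an empty array for malformed input is an illustrative choice.

```php
<?php
/**
 * Parse the XML source of an RSS feed into an array of items.
 * Returns an empty array if the source is not a usable feed.
 * (parseFeedItems is an illustrative helper name.)
 */
function parseFeedItems($content)
{
    if (trim((string) $content) === '') {
        return [];
    }

    // Collect libxml errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    try {
        $xml = new SimpleXMLElement($content);
    } catch (Exception $e) {
        // The constructor throws when the XML is not well-formed.
        $xml = null;
    }
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if ($xml === null || !isset($xml->channel->item)) {
        return [];
    }

    $items = [];
    foreach ($xml->channel->item as $entry) {
        $items[] = [
            'title'    => (string) $entry->title,
            'content'  => (string) $entry->description,
            'url'      => (string) $entry->link,
            // Element names are case-sensitive: pubDate, not pubdate.
            // strtotime() yields false for an unparseable date.
            'pub_date' => strtotime((string) $entry->pubDate),
        ];
    }
    return $items;
}

$source = '<rss version="2.0"><channel><item>'
        . '<title>Article title</title>'
        . '<description>Article description.</description>'
        . '<link>http://www.yourdomain.com/article.html</link>'
        . '<pubDate>Fri, 28 Sep 2012 15:30:00 +0000</pubDate>'
        . '</item></channel></rss>';

print_r(parseFeedItems($source));
```

Each returned array uses the same keys as the columns of the article table, so it can be passed to whatever insert code you use.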

This is the code for the background PHP script. To display the aggregated articles, query the article table with ordinary MySQL queries and present the data in any way you want.
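As one possible way to display the stored articles, the sketch below turns rows from the article table into an HTML list. It assumes the rows were already fetched with a query such as SELECT title, url, pub_date FROM article ORDER BY pub_date DESC LIMIT 20. The renderArticleList name and the markup are illustrative choices, not part of the original script.

```php
<?php
/**
 * Render article rows as an HTML list, newest first if the
 * query ordered them that way.
 * (renderArticleList is an illustrative helper name.)
 */
function renderArticleList(array $articles)
{
    $html = "<ul>\n";
    foreach ($articles as $article) {
        // Feed data is untrusted, so escape it before output.
        $html .= sprintf(
            "  <li><a href=\"%s\">%s</a> <small>%s</small></li>\n",
            htmlspecialchars($article['url'], ENT_QUOTES, 'UTF-8'),
            htmlspecialchars($article['title'], ENT_QUOTES, 'UTF-8'),
            gmdate('Y-m-d H:i', (int) $article['pub_date'])
        );
    }
    return $html . "</ul>\n";
}

// Sample rows standing in for a database result.
$rows = [
    [
        'title'    => 'Tips & <b>tricks</b>',
        'url'      => 'http://www.yourdomain.com/article.html?a=1&b=2',
        'pub_date' => 1348846200,
    ],
];

echo renderArticleList($rows);
```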

Creating an Advanced RSS Aggregator

The PHP code above shows how to build a basic RSS aggregator. However, there are a few more features to consider if you are building a full aggregator, such as limiting the number of characters in the description, stripping HTML from the description, storing the content:encoded tag, filtering stop words, and so on.
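The first two of these features, stripping HTML and limiting the length of the description, could look roughly like the sketch below. It assumes UTF-8 input and the mbstring extension; the makeSummary name, the default limit of 200 characters and the trailing "..." are illustrative choices.

```php
<?php
/**
 * Build a plain-text summary from an HTML description.
 * Requires the mbstring extension (an assumption of this sketch).
 * (makeSummary is an illustrative helper name.)
 */
function makeSummary($html, $maxChars = 200)
{
    // Remove tags, then turn entities such as &amp; back into characters.
    $text = html_entity_decode(strip_tags($html), ENT_QUOTES, 'UTF-8');
    // Collapse runs of whitespace left behind by the removed markup.
    $text = trim(preg_replace('/[ \t\r\n]+/', ' ', $text));

    if (mb_strlen($text, 'UTF-8') <= $maxChars) {
        return $text;
    }

    // Cut at the limit, then back up to the end of the last full word.
    $cut = mb_substr($text, 0, $maxChars, 'UTF-8');
    $lastSpace = mb_strrpos($cut, ' ', 0, 'UTF-8');
    if ($lastSpace !== false) {
        $cut = mb_substr($cut, 0, $lastSpace, 'UTF-8');
    }
    return $cut . '...';
}

echo makeSummary('<p>Article <b>description</b> with &amp; more text</p>', 20), "\n";
// Article description...
```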