Build a PHP Link Scraper with cURL

By PHP Builder Staff

on January 14, 2010

We’re going to do WHAT?

You heard me! We’re going to build a robot that scrapes links from web pages and dumps them in a database. Then it reads those links from the database and follows them,
scraping up the links on those pages, and so on ad infinitum (or until your server times out or your database fills up, whichever comes first).

I actually built this a few years ago because I had grandiose visions of becoming the next Google. Clearly, that did not happen, mostly because my localhost, database, and
bandwidth are not infinite. Yet this little robot has quite interesting applications and uses if you really have the time to play with and fine-tune it. I did not really explore those
options but I encourage you to do so. To begin, let’s have a look at the groundwork.

The cURL Component

cURL (or “client for URLs”) is a command-line tool for getting or sending files using URL syntax. It was first released in the late 1990s by Daniel Stenberg as a way to transfer files via protocols such as HTTP, FTP, Gopher, and many others, via a command-line interface. Since then, many more contributors have participated in further developing cURL, and the tool is widely used today.

As an example, the following command is a basic way to retrieve a page from example.com with cURL:


curl www.example.com

Using cURL with PHP

PHP is one of the languages that provide full support for cURL. (The PHP manual lists all the functions you can use for cURL.) Luckily, PHP also enables you to use cURL without invoking the command line, making it much easier to use cURL while the server is executing. The example below demonstrates how to retrieve the home page of example.com using cURL and PHP and save it to a file.


<?php

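// Open a cURL handle for the page and a file to save it in
$ch = curl_init("http://www.example.com/");
$fp = fopen("example_homepage.txt", "w");

// Write the response body to the file and leave out the response headers
curl_setopt($ch, CURLOPT_FILE, $fp);
curl_setopt($ch, CURLOPT_HEADER, 0);

// Run the request, then release the handle and the file
curl_exec($ch);
curl_close($ch);
fclose($fp);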
?>

The Link Scraper

For the link scraper, you will use cURL to get the content of the page you are looking for, and then you will use the DOM extension to grab the links and insert them into your database. I assume you can build the database from the information below; it is really simple stuff: a links table with url, gathered_from, and visited columns.


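// Note: the mysql_* functions were removed in PHP 7; use mysqli or PDO on current versions.
$result = mysql_query("SELECT url FROM links WHERE visited != 1");

if ($result)
{
	while ($row = mysql_fetch_array($result))
	{
		// The page to scrape on this pass, and the name the robot announces
		$target_url = $row['url'];
		$userAgent = 'ScraperBot';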

This grabs each unvisited URL from the database table inside a simple while loop.


$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
curl_setopt($ch, CURLOPT_URL, $target_url);

After initializing cURL, you use curl_setopt() to set the User-Agent header of the HTTP request, and then tell cURL which page you are hoping to retrieve.


curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 20);

You’ve set a few more options with curl_setopt(). This time, you made sure that the request fails when the server returns an HTTP error, that redirects are followed, and that the page comes back as a string instead of being printed. You also set the timeout of each page followed to 20 seconds. Usually, a standard server will stop a PHP script after 30 seconds, but if you run this from your localhost you should be able to set up a server with no time limit.


$html = curl_exec($ch);
if ($html === false)
{
	echo "ERROR NUMBER: ".curl_errno($ch);
	echo "ERROR: ".curl_error($ch);
	exit;
}

Grab the actual page by executing the cURL request with curl_exec(), which applies the options you set. If an error occurs, its number and description are available from curl_errno() and curl_error(), respectively. Obviously, if such an error exists, you exit the script.
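As a minimal sketch, the request steps so far can be gathered into one function. The name fetch_page(), its parameters, and the choice to return null instead of exiting are assumptions made for illustration, not part of the original script; a crawler that stops on the first dead link will not get very far, so this version logs the error and moves on.

```php
<?php
// Hypothetical helper (not in the original script): fetch one page and
// return its body as a string, or null if the request fails.
function fetch_page(string $url, string $userAgent = 'ScraperBot', int $timeout = 20): ?string
{
    $ch = curl_init();
    curl_setopt_array($ch, [
        CURLOPT_URL            => $url,
        CURLOPT_USERAGENT      => $userAgent,
        CURLOPT_FAILONERROR    => true,      // treat HTTP status >= 400 as a failure
        CURLOPT_FOLLOWLOCATION => true,      // follow redirects
        CURLOPT_AUTOREFERER    => true,      // set Referer when following a redirect
        CURLOPT_RETURNTRANSFER => true,      // return the body instead of printing it
        CURLOPT_TIMEOUT        => $timeout,  // seconds allowed for the whole request
    ]);

    $html = curl_exec($ch);
    if ($html === false) {
        // Record the failure and let the caller skip this URL
        error_log(sprintf('cURL error %d for %s: %s', curl_errno($ch), $url, curl_error($ch)));
        return null;
    }

    return $html;
}
```

Inside the while loop, a call such as $html = fetch_page($target_url); followed by a null check could then stand in for the separate curl_setopt() calls.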


$dom = new DOMDocument();
@$dom->loadHTML($html);

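Next, you create a document model of your HTML (that you grabbed from the remote server) and set it up as a DOM object. The @ suppresses the warnings that loadHTML() raises for the malformed markup found on many real pages.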


$xpath = new DOMXPath($dom);
$href = $xpath->evaluate("/html/body//a");

Use XPath to grab all the links on the page.
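To see the DOM and XPath steps in isolation, here is a small sketch that runs against an inline HTML string instead of a downloaded page. The function name extract_links() and the sample markup are made up for the example, and the query //a[@href] is a slight variation on the one above: it skips anchors that carry no href attribute.

```php
<?php
// Hypothetical helper: return the href of every <a> element in an HTML string.
function extract_links(string $html): array
{
    $dom = new DOMDocument();
    // Suppress warnings caused by malformed real-world markup
    @$dom->loadHTML($html);

    $xpath = new DOMXPath($dom);
    $links = [];
    // //a[@href] matches only anchors that actually have an href attribute
    foreach ($xpath->query('//a[@href]') as $anchor) {
        $links[] = $anchor->getAttribute('href');
    }

    return $links;
}

// Made-up sample page with two real links and one named anchor
$sample = '<html><body><p><a href="http://www.example.com/">Home</a> '
        . '<a name="top">No href</a> <a href="/about">About</a></p></body></html>';

print_r(extract_links($sample));
```

Note that the second link comes back as the relative path /about, exactly as written in the page; the robot would need to resolve such paths against the page URL before it could follow them.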


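		// Remember which page these links came from
		$gathered_from = mysql_real_escape_string($target_url);

		for ($i = 0; $i < $href->length; $i++) {
			$data = $href->item($i);
			// Escape the link before putting it into the SQL string
			$url = mysql_real_escape_string($data->getAttribute('href'));
			$sql = "INSERT INTO links (url, gathered_from) VALUES ('$url', '$gathered_from')";
			mysql_query($sql) or die('Error, insert query failed');
			echo "Successful Link Harvest: ".$url;
		}

		// Flag the page as visited so the SELECT above skips it next time
		mysql_query("UPDATE links SET visited = 1 WHERE url = '$gathered_from'");
	}
}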

Dump all the links into the database, as well as the URL they are gathered from, just so you never go back there again. A more intelligent system might have a separate table for URLs already visited, as well as a normalized relationship between the two. Going a step further than just grabbing the links enables you to harvest images or entire HTML documents as well. This is kind of where you start when building a search engine. Although I now know of better ways, they aren’t half as much fun.
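The bookkeeping described here can be sketched with PDO and prepared statements. Everything specific in this example is an assumption: the schema is inferred from the column names used in the queries, an in-memory SQLite database stands in for MySQL so the sketch runs without a server, and save_link() and mark_visited() are hypothetical helper names.

```php
<?php
// Assumption: in-memory SQLite via PDO stands in for the article's MySQL database.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

// Assumed schema, inferred from the columns the article's queries use.
$db->exec('CREATE TABLE links (
    url           TEXT PRIMARY KEY,
    gathered_from TEXT,
    visited       INTEGER NOT NULL DEFAULT 0
)');

// Hypothetical helper: store a link once; duplicates are silently skipped.
// INSERT OR IGNORE is SQLite syntax; MySQL spells it INSERT IGNORE.
function save_link(PDO $db, string $url, string $gatheredFrom): void
{
    $stmt = $db->prepare('INSERT OR IGNORE INTO links (url, gathered_from) VALUES (?, ?)');
    $stmt->execute([$url, $gatheredFrom]);
}

// Hypothetical helper: flag a page as crawled so it is not selected again.
function mark_visited(PDO $db, string $url): void
{
    $stmt = $db->prepare('UPDATE links SET visited = 1 WHERE url = ?');
    $stmt->execute([$url]);
}

save_link($db, 'http://www.example.com/', 'seed');
save_link($db, 'http://www.example.com/about', 'http://www.example.com/');
save_link($db, 'http://www.example.com/about', 'http://www.example.com/'); // duplicate, ignored
mark_visited($db, 'http://www.example.com/');

// Only pages that have not been crawled yet remain in the queue
$pending = $db->query('SELECT url FROM links WHERE visited != 1')->fetchAll(PDO::FETCH_COLUMN);
print_r($pending);
```

Making url the primary key is one simple way to keep the robot from queueing the same page twice, and the bound parameters keep a hostile href from being interpreted as SQL.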

Creating your own search engine may seem naively ambitious, but I hope this little bit of code did inspire you a bit. If so, I implore you to harvest information and content from other places responsibly.

Until next time,

Marc Steven Plotz

Download: PHP_LinkScraper_source.zip
