PHP Simple HTML DOM Parser: Editing HTML Elements in PHP

By Vojislav Janjic
on September 7, 2011

Simple HTML DOM Parser is a PHP 5+ class that helps you manipulate HTML elements. It is not limited to valid HTML; it can also work with markup that fails W3C validation. You locate elements with selectors similar to those in jQuery: by id, class, tag name, attribute, and more. Elements can also be added, deleted or altered. Although this is a powerful and easy-to-use DOM editing script, you have to be careful about memory leaks. I will explain how to avoid them later in this tutorial.

Getting Started with PHP Simple HTML DOM Parser

After uploading the class file and including it in your script, create an instance of the Simple HTML DOM class. The DOM object can then be loaded in three ways:

  1. Load HTML from a URL
  2. Load HTML from a string
  3. Load HTML from a file

<?php
// Include the class file (adjust the path to wherever you uploaded it)
require_once 'simple_html_dom.php';

// Create a DOM object; this is required whichever way you load the HTML
$html = new simple_html_dom();

// Load HTML from a URL
$html->load_file('http://www.mydomain.com/');

// Load HTML from a string
$html->load('<html><body>Hello world!</body></html>');

// Load HTML from an HTML file
$html->load_file('path/to/file/test.html');
?>

If you want to load HTML from a URL but need more control over the HTTP request (headers, timeouts, redirects), I suggest using cURL to fetch the HTML into a string and then loading the DOM object from that string.
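The cURL approach could look roughly like the sketch below. The helper name fetch_html, the timeout default and the user agent string are assumptions made for illustration; only the curl_* functions and constants come from PHP itself.

```php
<?php
// Hypothetical helper: fetch a page with cURL and return its body,
// or null on a transport error or a non-2xx HTTP status.
function fetch_html($url, $timeoutSeconds = 10)
{
    $ch = curl_init($url);
    curl_setopt_array($ch, array(
        CURLOPT_RETURNTRANSFER => true,             // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,             // follow redirects
        CURLOPT_CONNECTTIMEOUT => $timeoutSeconds,
        CURLOPT_TIMEOUT        => $timeoutSeconds,
        CURLOPT_USERAGENT      => 'ExampleBot/1.0', // placeholder value
    ));

    $body   = curl_exec($ch);                       // false on failure
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);

    if ($body === false || $status < 200 || $status >= 300) {
        return null;
    }
    return $body;
}

// Usage, with the placeholder URL from the earlier example:
// $body = fetch_html('http://www.mydomain.com/');
// if ($body !== null) {
//     $html = new simple_html_dom();
//     $html->load($body);
// }
```

Returning null on any failure keeps the calling code to a single check before the string is handed to load.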

Find HTML Elements with PHP Simple HTML DOM Parser

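HTML DOM elements are found using the find function. Called with a selector only, it returns an array of element objects; called with a selector and an index, it returns a single element object, or null if nothing matches. Element objects support the same functions as the document object, so you can call find on them as well. Let’s see some examples: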

<?php
// Find all elements by tag name, e.g. a. This returns an array of element objects.
$a = $html->find('a');

// Find the Nth element, counting from 0. This returns an object, or null if not found.
$a = $html->find('a', 0);

// Find the element whose id equals a certain value, e.g. the div with id="main"
$main = $html->find('div[id=main]', 0);

// Find all <div> elements that have an id attribute
$divs = $html->find('div[id]');

// Find all elements, of any tag, that have an id attribute
$divs = $html->find('[id]');
?>

Id and class selectors are supported too, again similar to those in jQuery, and several selectors can be combined with commas:

<?php
// Find all elements where id=container. Keep in mind that two elements with the same id is not valid HTML.
$ret = $html->find('#container');

// Find all elements where class=foo
$ret = $html->find('.foo');

// Find elements matching any of several tag names, e.g. all anchors and all images
$ret = $html->find('a, img');

// Combine tag names with an attribute test, e.g. all anchors and images that have a title attribute
$ret = $html->find('a[title], img[title]');
?>

Descendant selectors are also allowed:

<?php
// Find all <li> in <ul>
$ret = $html->find('ul li');

// Find all <li> with class="selected" in <ul>
$ret = $html->find('ul li.selected');
?>

Parent, child and sibling elements can be selected by using built-in functions:

<?php
// $e is an element object, e.g. one returned by find()

// Returns the parent of a DOM element
$e->parent();

// Returns the element's children in an array
$e->children();

// Returns a specified child by number, starting from zero. If the child is not found, null is returned.
$e->children(0);

// Returns the first child of an element, or null if not found
$e->first_child();

// Returns the last child of an element, or null if not found
$e->last_child();

// Returns the previous sibling of an element, or null if not found
$e->prev_sibling();

// Returns the next sibling of an element, or null if not found
$e->next_sibling();
?>
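Because these functions return null when there is nothing more to return, they can be chained in a loop. The sketch below walks every sibling that follows an element; the helper name following_sibling_tags is hypothetical, and the code assumes that simple_html_dom.php is already included and that $e came from find.

```php
<?php
// Hypothetical helper: collect the tag names of all siblings after $e.
// The loop stops when next_sibling() returns null.
function following_sibling_tags($e)
{
    $tags = array();
    for ($s = $e->next_sibling(); $s !== null; $s = $s->next_sibling()) {
        $tags[] = $s->tag;   // tag holds the element's tag name
    }
    return $tags;
}

// Usage, assuming the document contains a <ul> with several <li> items:
// $first = $html->find('ul li', 0);
// if ($first !== null) {
//     print_r(following_sibling_tags($first));
// }
```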

Use Attribute Operators

Attribute selectors also support operators that match an attribute’s value exactly, by prefix, by suffix, or by substring:

  • [attribute] – Select HTML DOM elements which have a certain attribute
  • [attribute=value] – Select all elements which have the specified attribute with a certain value
  • [attribute!=value] – Select all elements which don’t have the specified attribute with a certain value
  • [attribute^=value] – Select all elements with the specified attribute whose value begins with the specified value
  • [attribute$=value] – Select all elements with the specified attribute whose value ends with the specified value
  • [attribute*=value] – Select all elements with the specified attribute whose value contains the specified value
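To make the value comparisons in the list above concrete, here is a plain PHP sketch of what each operator tests once an attribute is present. It is an illustration only, not the parser's own code; the function name attribute_value_matches and the sample href values are made up for the example, and the bare [attribute] presence test is left out because it does not compare values.

```php
<?php
// Illustration only: mirrors the operator descriptions for an attribute
// that exists on the element. This is not how the parser is implemented.
function attribute_value_matches($actual, $operator, $expected)
{
    switch ($operator) {
        case '=':   // exact match
            return $actual === $expected;
        case '!=':  // anything but an exact match
            return $actual !== $expected;
        case '^=':  // value begins with $expected
            return strncmp($actual, $expected, strlen($expected)) === 0;
        case '$=':  // value ends with $expected
            return $expected === ''
                || substr($actual, -strlen($expected)) === $expected;
        case '*=':  // value contains $expected
            return $expected === '' || strpos($actual, $expected) !== false;
        default:
            throw new InvalidArgumentException("Unknown operator: $operator");
    }
}

// Which of these sample href values would a selector like a[href$=.pdf] pick?
$hrefs = array('/docs/guide.pdf', '/docs/guide.html', 'http://example.com/a.pdf');
$pdfs = array();
foreach ($hrefs as $href) {
    if (attribute_value_matches($href, '$=', '.pdf')) {
        $pdfs[] = $href;
    }
}
// $pdfs now holds the two values that end in ".pdf"
```

With the parser itself, the equivalent query would be written as a selector string, for example $html->find('a[href$=.pdf]').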

Access DOM Element Attributes with PHP Simple HTML DOM Parser

Element attributes are exposed as properties of the element object:

<?php
// In this example we select href attribute of an anchor. If the attribute is a non-value attribute (e.g. checked, selected…), it will return true or false.
$link = $a->href;
?>

Or:

<?php
$link = $html->find('a',0)->href;
?>

Each object has four special attributes:

  1. tag – returns the tag name
  2. innertext – returns inner HTML of an element
  3. outertext – returns outer HTML of an element
  4. plaintext – returns plain text (without HTML tags)
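Attribute access and the special attributes are often combined with find. The sketch below assumes simple_html_dom.php is already included and that $html is a loaded DOM object; the helper names list_links and first_heading_text are hypothetical. It also shows the null check that an indexed find needs, since reading a property from null fails.

```php
<?php
// Hypothetical helper: collect href and visible text for every anchor.
function list_links($html)
{
    $links = array();
    foreach ($html->find('a') as $a) {   // no index: an array, possibly empty
        $links[] = array(
            'href' => $a->href,
            'text' => trim($a->plaintext),
        );
    }
    return $links;
}

// Hypothetical helper: text of the first <h1>, or null when there is none.
function first_heading_text($html)
{
    $h1 = $html->find('h1', 0);          // with index: an object or null
    return $h1 === null ? null : trim($h1->plaintext);
}

// Usage:
// print_r(list_links($html));
// echo first_heading_text($html);
```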

Edit HTML Elements with PHP Simple HTML DOM Parser

Editing attributes is similar to reading their value:

<?php
// Change or set an attribute value. If the attribute is a non-value attribute (e.g. checked, selected…), set its value to true or false.
$a->href = 'http://www.mydomain.com';

// Remove an attribute
$a->href = null;

// Check if an attribute exists
if (isset($a->href)) {
    // do something here
}
?>

There are no special functions to append or remove elements, but you can get the same result by assigning to outertext. Each statement below is an independent example:

<?php
// Wrap an element
$e->outertext = '<div class="wrap">' . $e->outertext . '</div>';

// Remove an element
$e->outertext = '';

// Append an element after $e
$e->outertext = $e->outertext . '<div>foo</div>';

// Insert an element before $e
$e->outertext = '<div>foo</div>' . $e->outertext;
?>

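To save the DOM document, dump the DOM object back into a string with the save function. Using the object where a string is expected, for example with echo, gives the same result:

<?php
// Dump the modified DOM tree back into a string
$doc = $html->save();

// Display the page
echo $doc;
?>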

Avoid PHP Simple HTML DOM Parser Memory Leak

There is one thing you need to be careful about when using Simple HTML DOM Parser: memory leaks. The parsed tree contains circular references between elements, which PHP 5 does not always free on its own, so memory use grows with every document you load. Leaks can slow down your website, or even make it unusable for a few minutes. To prevent this, clear each object before loading a new one. Working with 2 or 3 objects at a time is not a problem, but loading many objects without clearing the previous ones can be. Objects are cleared in the following way:

<?php
$html->clear();
?>
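When many documents are processed in a loop, the call to clear belongs at the end of each iteration. The following is a minimal sketch, assuming simple_html_dom.php is already included as in the earlier examples; the function name count_links and the file paths in the usage comment are hypothetical.

```php
<?php
// Hypothetical helper: count the anchors in each source (file path or URL),
// releasing every parsed document before the next one is loaded.
function count_links(array $sources)
{
    $counts = array();
    foreach ($sources as $source) {
        $html = new simple_html_dom();
        $html->load_file($source);
        $counts[$source] = count($html->find('a'));

        // Free the parsed tree, then drop the reference to the object.
        $html->clear();
        unset($html);
    }
    return $counts;
}

// Usage, with placeholder paths:
// print_r(count_links(array('pages/one.html', 'pages/two.html')));
```

Only plain values (here, integers) leave the loop body, so nothing keeps a cleared document alive by accident.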

Author’s Note

Simple HTML DOM parser is a very powerful script which you can use to access HTML DOM through PHP, but don’t abuse it. If you are fetching data from external sites, use the content according to the rules of fair use.