DOM XML: An Alternative to Expat
Overview: An alternative to expat.
There are many xml tutorials for php on the web, but few show how to parse xml using DOM. I would like to take this opportunity to show there is an alternative to the widespread SAX implementation for php programmers.
DOM (Document Object Model) and SAX (Simple API for XML) have different philosophies on how to parse xml. The SAX engine is event-driven: when it comes across a tag, it calls an appropriate function to handle it. This makes SAX very fast and efficient. However, writing code for it can feel like being trapped inside an eternal loop, and you find yourself using many global variables and conditional statements.
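To make the event-driven style concrete, here is a minimal sketch using php's expat-based xml_* functions. The handler names (start_tag, end_tag, text) and the variables $in_title and $title are made up for illustration. Notice how the handlers have to share global state just to remember where they are in the document.

```php
<?php

# sample document; single quotes keep $12.99 literal
$xml = '<book type="paperback"><title>Red Nails</title><price>$12.99</price></book>';

# global state shared by the handlers
$in_title = false;   # are we currently inside <title>?
$title = "";         # text collected so far

# called for every opening tag
function start_tag($parser, $name, $attrs)
{
    global $in_title;
    # expat upper-cases tag names by default (case folding)
    if ($name == "TITLE")
        $in_title = true;
}

# called for every closing tag
function end_tag($parser, $name)
{
    global $in_title;
    if ($name == "TITLE")
        $in_title = false;
}

# called for character data; may fire more than once per text run
function text($parser, $data)
{
    global $in_title, $title;
    if ($in_title)
        $title .= $data;
}

$parser = xml_parser_create();
xml_set_element_handler($parser, "start_tag", "end_tag");
xml_set_character_data_handler($parser, "text");

if (!xml_parse($parser, $xml, true))
    die(xml_error_string(xml_get_error_code($parser)));

echo "title: $title\n";
```

Every extra piece of data you want to pull out tends to need another flag and another conditional in each handler.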
On the other hand, the DOM method is somewhat memory intensive. It loads an entire xml document into memory as a hierarchy. The upside is that all of the data is available to the programmer organized much like a family tree. This approach is more intuitive, easier to use, and affords better readability.
In order to use the DOM functions, you must configure php by specifying the '--with-dom' argument. They are not a part of the standard configuration. Here is a sample compilation.
%> ./configure --with-dom --with-apache=../apache_1.3.12
%> make
%> make install
How DOM structures XML
Because DOM loads an entire xml string or file into memory as a tree, we can manipulate the data as a whole. To show what xml looks like as a tree, take this xml document as an example.
<?xml version="1.0"?>

<book type="paperback">
	<title>Red Nails</title>
	<price>$12.99</price>
	<author>
		<name first="Robert" middle="E" last="Howard"/>
		<birthdate>9/21/1977</birthdate>
	</author>
</book>
The data would be structured like this.
DomNode book
	|
	|-->DomNode title
	|		|
	|		|-->DomNode text
	|
	|-->DomNode price
	|		|
	|		|-->DomNode text
	|
	|-->DomNode author
			|
			|-->DomNode name
			|
			|-->DomNode birthdate
					|
					|-->DomNode text
Any text enclosed within tags is really a node in itself. For instance, "Red Nails" is a child node of title, and "$12.99" is a child node of price.
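If you want to see this tree for yourself, the following sketch walks the sample document and prints each node's name, indented by depth. It is an assumption on my part that you are running PHP 5 or later: it uses the DOMDocument class from the newer DOM extension, not the domxml functions covered in this article. The function name dump_tree is made up for illustration.

```php
<?php

$xml = '<book type="paperback">
	<title>Red Nails</title>
	<price>$12.99</price>
	<author>
		<name first="Robert" middle="E" last="Howard"/>
		<birthdate>9/21/1977</birthdate>
	</author>
</book>';

# return the names of a node and its descendants, one per line,
# indented by depth in the tree
function dump_tree($node, $depth = 0)
{
    $out = str_repeat("    ", $depth) . $node->nodeName . "\n";

    # only element nodes are searched for children
    if ($node->nodeType == XML_ELEMENT_NODE)
    {
        foreach ($node->childNodes as $child)
        {
            # skip text nodes holding only whitespace between tags
            if ($child->nodeType == XML_TEXT_NODE && trim($child->nodeValue) == "")
                continue;
            $out .= dump_tree($child, $depth + 1);
        }
    }
    return $out;
}

$doc = new DOMDocument();
$doc->loadXML($xml);
echo dump_tree($doc->documentElement);
```

Text nodes show up under the name "#text", which matches the "DomNode text" entries in the diagram above.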
The Objects Used In DOM
At this point, you are probably wondering what a DomNode is. This is a good place to start talking about the objects that are included in the module. There are five objects defined by DOM: DomDocument, DomNode, DomAttribute, DomDtd, and DomNamespace. We are going to focus primarily on the DomDocument and DomNode objects because they are the most useful.
The Node object
Here is an overview of what the DomNode object contains.
class DomNode
	properties:
		name
		content
		type
	methods:
		lastchild() 
		children() 
		parent() 
		new_child( $name,$content ) 
		getattr( $name ) 
		setattr( $name,$value ) 
		attributes() 
The properties need some elaboration.
  • The name property is the actual tag name of the node. A node which refers to the title tags would have the name 'title'.
  • The content property is usually empty. However, text nodes use this property to hold text.
  • The type property is a constant which defines exactly what kind of object the node is. There can be several types of DomNode objects. A list of the constants is online at http://www.php.net/manual/ref.domxml.php. For example, a DomNode containing text would have a type of XML_TEXT_NODE.
The methods need to be explained, as well.
  • lastchild() returns the last entry from a node's children.
  • parent() returns a node's parent. For instance, the parent of our title node would be 'book'.
  • children() returns an array of a node's child nodes. For example, the children of node author would be 'name' and 'birthdate'.
  • new_child() takes a name and some content as arguments and adds a new DomNode to its children.
  • getattr() and setattr() both deal with attributes. One fetches the value, the other sets it.
  • attributes() returns an array of DomAttribute objects.
The DomDocument object
The DomDocument object is also important.
class DomDocument
	properties:
		version 
		encoding 
		standalone
		type
	methods:
		root() 
		children() 
		add_root( $node ) 
		dtd() 
		dumpmem() 
The properties are pretty self-explanatory.
  • 'version' refers to the xml version of the document.
  • 'encoding' refers to the text encoding.
  • 'standalone' is a boolean value determining whether the document is standalone or not.
  • The 'type' property has already been explained. A Document object will most likely have the type of XML_DOCUMENT_NODE.
The methods are pretty simple too.
  • root() returns the root node of a document. If we loaded our sample xml file as a DomDocument object, the root node would refer to 'book'.
  • children() works just as it did in DomNode.
  • add_root() adds a new root node to the xml document. You would use this if you wanted to supplant the 'book' node with another node.
  • dtd() returns the xml document's dtd.
  • dumpmem() returns a string representation of the xml data.
The DomDocument Object Returned By xmltree()
xmltree(), a function which I haven't introduced yet, returns a kind of DomDocument object which may give you trouble. This object has no methods, just properties in place of methods, and it is a true nested tree structure.
class DomDocument
	properties:
		version
		encoding
		standalone
		name
		content
		type
		attributes
		children
It is just as easy to use. For instance, instead of using a method to get a node's children, just access its 'children' property. 'children' and 'attributes' are both arrays.
The Other Objects
I will list the other objects and their properties and methods just for reference. We won't be dealing with them in this article.
class Attribute
	properties:
		name
		content
	methods:
		name()

class Dtd
	properties:
		extid
		sysid
		name

class Namespace
Using the Objects
The DOM module only has three functions: xmldoc(), xmldocfile(), and xmltree(). The rest of the time, we will be dealing with the objects. All three functions return DomDocument objects. Here are examples of how you load xml data into your php script:

<?php

# to load xml from a string
# use either of these
$doc = xmldoc( $xmlstr );
$tree = xmltree( $xmlstr );

# to load xml from a file
$doc = xmldocfile( $xmlfile );

?>
All three functions will fail with an error if the xml cannot be parsed correctly. DOM will not validate xml for you, so you must find another way of doing that, perhaps through another program like xmllint.
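As a rough illustration of catching a parse failure in code, here is a sketch that feeds a deliberately broken document to the parser and reports what went wrong. This assumes PHP 5 or later and its DOMDocument class together with the libxml_* error functions, which are not part of the domxml module described here. Like the functions above, it only checks that the xml is well formed, not that it is valid.

```php
<?php

# collect parse errors instead of printing warnings
libxml_use_internal_errors(true);

# the name tag is never closed, so this is not well formed
$bad_xml = "<employee><name>Matt</employee>";

$doc = new DOMDocument();
$ok = $doc->loadXML($bad_xml);

if (!$ok)
{
    # each entry is a LibXMLError object with a message and line number
    foreach (libxml_get_errors() as $error)
        echo "line {$error->line}: " . trim($error->message) . "\n";
    libxml_clear_errors();
}
```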
A Simple Example
Let's start with a simple example to tie everything together.

<?php

# make an example xml document to play with
$xmlstr = "<" . "?" . "xml version=\"1.0\"" . "?" . ">";
$xmlstr .=
"
<employee>
    <name>Matt</name>
    <position type=\"contract\">Web Guy</position>
</employee>
"
;

# load xml data ($doc becomes an instance of
# the DomDocument object)
$doc = xmldoc($xmlstr);

# get root node "employee"
$employee = $doc->root();

# get employee's children ("name","position")
$nodes = $employee->children();

# let's play with the "position" node
# so we must iterate through employee's
# children in search of it
while ($node = array_shift($nodes))
{
    if ($node->name == "position")
    {
        $position = $node;
        break;
    }
}

# get position's type attribute
$type = $position->getattr("type");

# get the text enclosed by the position tag
# shift the first element off of position's children
# (array_shift() needs a variable, not a function result)
$children = $position->children();
$text_node = array_shift($children);

# access the content property of the text node
$text = $text_node->content;

# echo out the position and type
echo "position: $text<BR>";
echo "type: $type";

?>
The example should print out the following:
position: Web Guy
type: contract
The while loop is essential for finding the position node. The employee node really has five child nodes: three text, one name, and one position. The text nodes contain the newlines and indentation between the tags. This may seem strange at first, but DOM treats any string (even one containing only whitespace) as text and makes an appropriate node for it.
If you want to ensure that the employee node only has two child nodes, you will have to write the xml entry like this.
<employee><name>Matt</name><position type="contract">Web Guy</position></employee>
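You can count the child nodes yourself to see the whitespace effect. The sketch below assumes PHP 5 or later and its DOMDocument class (not the domxml functions used in this article). It loads the indented employee entry twice: once as is, and once with the preserveWhiteSpace property switched off, which asks the parser to drop whitespace-only text nodes.

```php
<?php

# the indented employee entry from the simple example
$xml = "<employee>\n"
     . "    <name>Matt</name>\n"
     . "    <position type=\"contract\">Web Guy</position>\n"
     . "</employee>";

# default behaviour: whitespace between tags becomes text nodes
$doc = new DOMDocument();
$doc->loadXML($xml);
$with_whitespace = $doc->documentElement->childNodes->length;

# must be set before loading the document
$doc = new DOMDocument();
$doc->preserveWhiteSpace = false;
$doc->loadXML($xml);
$without_whitespace = $doc->documentElement->childNodes->length;

echo "children with whitespace nodes: $with_whitespace\n";
echo "children without whitespace nodes: $without_whitespace\n";
```

With whitespace preserved, the count includes the three text nodes described above. With it switched off, only name and position remain.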
A Longer Example
Here is a longer example of how to extract info from an xml doc. Suppose we have a file called employees.xml containing employee entries.
<?xml version="1.0"?>

<employees company="zoomedia.com">
	<employee>
		<name>Matt</name>
		<position type="contract">Web Guy</position>
	</employee>

	<employee>
		<name>George</name>
		<position type="full time">Mad Hacker</position>
	</employee>

	<employee>
		<name>Wookie</name>
		<position type="part time">Hairy SysAdmin</position>
	</employee>
</employees>
Here's how you would extract this info in your php script.

<?php

# iterate through an array of nodes
# looking for a text node
# return its content
function get_content($parent)
{
    $nodes = $parent->children();
    while ($node = array_shift($nodes))
        if ($node->type == XML_TEXT_NODE)
            return $node->content;
    return "";
}

# get the content of a particular node
function find_content($parent,$name)
{
    $nodes = $parent->children();
    while ($node = array_shift($nodes))
        if ($node->name == $name)
            return get_content($node);
    return "";
}

# get an attribute from a particular node
function find_attr($parent,$name,$attr)
{
    $nodes = $parent->children();
    while ($node = array_shift($nodes))
        if ($node->name == $name)
            return $node->getattr($attr);
    return "";
}

# load xml doc
$doc = xmldocfile("employees.xml") or die("What employees?");

# get root Node (employees)
$root = $doc->root();

# get an array of employees' children
# that is each employee node
$employees = $root->children();

# shift through the array
# and print out some employee data
while ($employee = array_shift($employees))
{
    # skip the whitespace text nodes between employee entries
    if ($employee->type == XML_TEXT_NODE)
        continue;

    $name = find_content($employee,"name");
    $pos = find_content($employee,"position");
    $type = find_attr($employee,"position","type");

    echo "$name the $pos, $type employee<br>";
}

?>
You should see the following in your browser.
Matt the Web Guy, contract employee
George the Mad Hacker, full time employee
Wookie the Hairy SysAdmin, part time employee
Another example (adding data)
Since the xml is loaded into memory as a tree, we can easily manipulate the data. We can add branches or nodes when necessary.
Say we want to add an employee to our xml file.

<?php

# quick function for making child nodes
function make_node($parent,$name,$content)
{
    # adds a new child node to parent node
    $parent->new_child($name,$content);

    # return the newly added child
    return $parent->lastchild();
}

# load xml file and get root node
$doc = xmldocfile("employees.xml") or die("Do you even have any employees?");
$root = $doc->root();

# add a new, empty employee node to the root
$newguy = make_node($root,"employee","");

# add the new guy's name
make_node($newguy,"name","New Guy");

# add his position
$position = make_node($newguy,"position","Backup Gnome");

# set the 'type' attribute
$position->setattr("type","intern");

# dump our altered xml doc to the browser
echo $doc->dumpmem();

?>
This will print the xml to the browser, so you will most likely have to use 'View Source' in order to see the data.
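For comparison, here is a sketch of the same idea written against the DOMDocument class that ships with PHP 5 and later. This is an assumption about your setup, not part of the domxml module described in this article. To keep it self-contained, it loads a one-employee document from a string instead of reading employees.xml.

```php
<?php

# a cut-down version of employees.xml
$xml = '<employees company="zoomedia.com">'
     . '<employee><name>Matt</name>'
     . '<position type="contract">Web Guy</position></employee>'
     . '</employees>';

$doc = new DOMDocument();
$doc->loadXML($xml);
$root = $doc->documentElement;

# add a new, empty employee node to the root
# (appendChild returns the node it just added)
$newguy = $root->appendChild($doc->createElement("employee"));

# add the new guy's name
$newguy->appendChild($doc->createElement("name", "New Guy"));

# add his position and set its 'type' attribute
$position = $newguy->appendChild($doc->createElement("position", "Backup Gnome"));
$position->setAttribute("type", "intern");

# dump the altered xml doc as a string
echo $doc->saveXML();
```

Here createElement() and appendChild() together play the role of new_child(), setAttribute() stands in for setattr(), and saveXML() for dumpmem().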
Conclusion
That's pretty much all there is to DOM xml. It's a simple approach to parsing and manipulating xml in your scripts. I hope this article sheds more light on this dusty corner of php.
-- Matt
References
DOM reference
http://www.w3.org/TR/
libxml, an essential library for building dom
ftp://ftp.gnome.org/pub/GNOME/stable/sources/libxml/
php domxml source
php-4.0.2/ext/domxml.