
Using XML – Part 6: Validation

By PHP Builder Staff
on August 28, 2008


So far this series has focused on XML technologies and how they can be
utilised from PHP 5. One subject we have not yet touched upon is XML
validation. This article explores three application-independent XML
validation standards: DTDs, the XML Schema language and the XSLT-based
Schematron language. I will demonstrate how to validate XML in PHP and
how PHP 5's XSL extension can be used to validate XML using the
Schematron language.

Throughout this article I will be using the Library XML seen in part 1
of this series. An up-to-date, namespaced example including a doctype is
contained within the ZIP file accompanying this article.

The XML standard states that an XML document must be well formed: for
example, attribute values must be enclosed in double or single quotes
and character data must be escaped accordingly. A document which does
not conform to these constraints is not an XML document, and this
simple structural check is carried out by the majority of XML parsers.
It is, however, by no means the be-all and end-all of XML data
validation. To draw a metaphor from a real world scenario, a well formed
XML document is like a well built building. It is structurally sound,
but if the only requirement is that it is a building, how does one know
that the building contains the right rooms, the required number of
floors and the correct décor? If the building is not built to the
correct specification, it may as well be a pile of rubble.

Back to XML: being well formed is simply not enough to make an XML
document valid. You must ensure that the data and elements you need are
all present, appear in the correct order and contain the data you
expect. This validation can be carried out at the application level.
For example, in PHP the DOM API can be used to validate the document
structure.

PHP 5:

$categories = array();

$XMLCategories = $xml->getElementsByTagName('categories')->item(0);

if ($XMLCategories) {
    foreach ($XMLCategories->getElementsByTagName('category') as $categoryNode) {
        /* notice how we get attributes */
        $cid = $categoryNode->getAttribute('id');

        if (!$cid) continue; // if the id attribute is not present, ignore the element

        $categories[$cid] = $categoryNode->firstChild->nodeValue;
    }
} else {
    die('No Categories Found');
}

The above example uses the sample code from
the first article in the series and checks that the categories element
contains category elements and that each one contains an id attribute.
While this validation works, it
destroys the portable nature of XML and forces applications to agree on
a validation standard before using the XML data.

Validation using Document Type Definitions (DTDs)

DTDs were included as part of the original XML 1.0 specification. They
have been around since the days of SGML (the standard in which XML has
its roots) and, as the name suggests, a DTD defines the structure of a
document: the elements that can appear, the order in which they can
appear, the attributes they can have and the data they can contain.
A DTD is referenced in the document type declaration at the top of an
XML document, or can be included inside the document type declaration
as an in-line DTD. Both forms can be combined:
<?xml version="1.0"?>
<!DOCTYPE library SYSTEM "library.dtd" [
<!ENTITY % nsp "lib:">

<!ENTITY % nss ":lib">
] >
The above document type declaration references an external DTD,
library.dtd, and also includes its own in-line definitions (the
internal subset). Entity and attribute declarations made in the
internal subset take precedence over declarations of the same name in
the external DTD (an element, by contrast, may only be declared once).
In the above example, the entity nsp is declared in the local DTD,
which thus overrides the declaration of that entity in the external DTD.
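
As a rough illustration of this precedence rule, the sketch below writes a tiny external DTD to a temporary file and loads a document whose internal subset redeclares one of its entities. The note element, the who entity and the temporary file are hypothetical and exist only for this demonstration. It also assumes the temporary path can be used directly as a SYSTEM identifier, which holds on Unix-like systems.

```php
<?php
// Hypothetical external DTD, written to a temporary file for this demonstration only.
$dtdPath = tempnam(sys_get_temp_dir(), 'dtd');
file_put_contents($dtdPath, "<!ELEMENT note (#PCDATA)>\n<!ENTITY who \"external\">\n");

// The internal subset (between [ and ]) redeclares the same entity.
$xml = '<?xml version="1.0"?>' . "\n"
     . '<!DOCTYPE note SYSTEM "' . $dtdPath . '" [' . "\n"
     . '  <!ENTITY who "internal">' . "\n"
     . ']>' . "\n"
     . '<note>&who;</note>';

libxml_use_internal_errors(true);   // collect libxml errors instead of emitting warnings

$doc = new DOMDocument();
// LIBXML_DTDLOAD loads the external DTD; LIBXML_NOENT substitutes entity references.
$doc->loadXML($xml, LIBXML_DTDLOAD | LIBXML_NOENT);

$isValid = $doc->validate();
$text    = $doc->documentElement->textContent;

unlink($dtdPath);

echo $text, PHP_EOL;                            // the value taken from the internal subset
echo $isValid ? 'valid' : 'invalid', PHP_EOL;
```

If the internal declaration is removed from the DOCTYPE, the same script should pick up the value declared in the external file instead.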
The external DTD which defines the library XML is as follows:

<?xml version="1.0" encoding="iso-8859-1" ?>
<!-- An external DTD may begin with a text declaration like the one above. It is optional, but when present it must state the encoding -->
 
<!--
DTDs are not namespace aware. However, a namespace prefix and suffix can be
included as entities in the DTD, which can then be overridden in the DOCTYPE declaration or by changing
the DTD.

-->

<!-- Namespace prefix: should be overridden in the local instance where it is not the default namespace -->
<!ENTITY % nsp "" >

<!-- Namespace suffix: should be overridden in the local instance where it is not the default namespace -->
<!ENTITY % nss "" >

<!-- xmlns entity declaration -->
<!ENTITY % nsdec "xmlns%nss;" >
 
<!-- each POSSIBLE namespaced element is now declared as an entity -->
<!ENTITY % library "%nsp;library" >
<!ENTITY % categories "%nsp;categories" >
<!ENTITY % authors "%nsp;authors" >
<!ENTITY % books "%nsp;books" >
<!ENTITY % book "%nsp;book" >
<!ENTITY % category "%nsp;category" >
<!ENTITY % author "%nsp;author" >
<!ENTITY % title "%nsp;title" >
<!ENTITY % publisher "%nsp;publisher" >
<!ENTITY % cover "%nsp;cover" >
<!ENTITY % synopsis "%nsp;synopsis" >
 
<!-- element and attribute list declarations now use the element
entities declared above to define the XML document -->

<!ELEMENT %library; (%categories;,%authors;, %books;) >
 
<!-- allow for the possible declaration of an XML schema, which must use the XSI namespace -->
<!ATTLIST %library;
%nsdec; CDATA #FIXED "http://www.phpbuilder.com/adam_delves/library_xml"
xmlns:xsi CDATA #FIXED "http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation CDATA #IMPLIED >

 
<!-- the categories element must contain at least one category element, indicated by + -->
<!ELEMENT %categories; ((%category;)+) >
 
<!ELEMENT %category; (#PCDATA) >
<!-- the ID attribute is declared as optional as the category element also appears as a child
element of the book element, where it should reference a valid ID from a category element
in the categories section
-->

<!ATTLIST %category; id CDATA #IMPLIED >
 
<!-- the authors element must contain at least 1 author element, indicated by + -->
<!ELEMENT %authors; ((%author;)+) >
 
<!-- the ID attribute is declared as optional as the author element also appears as a child
element of the book element, where it should reference a valid ID from an author element
in the authors section -->

<!ELEMENT %author; (#PCDATA) >
<!ATTLIST %author; id CDATA #IMPLIED >
 
<!-- the books element may contain 0 or more book elements, indicated by the * -->
<!ELEMENT %books; ((%book;)*) >
 
<!-- a sequence of elements is declared using commas; the elements must appear in the same order as specified in
the sequence. The cover and synopsis elements are optional (indicated by a ?): each may occur at most
once -->

<!ELEMENT %book; ((%title;),(%publisher;),(%category;)+,(%author;)+,(%cover;)?,(%synopsis;)?) >
 
<!-- the isbn attribute is required. the hascover attribute is optional, but when specified may contain
only the values yes and no. no is the default -->

<!ATTLIST %book;
isbn CDATA #REQUIRED
hascover (yes | no) "no" >

 
<!-- these elements may only contain character data, comments and processing instructions -->
<!ELEMENT %title; (#PCDATA) >
<!ELEMENT %publisher; (#PCDATA) >
<!ELEMENT %cover; (#PCDATA) >
<!ELEMENT %synopsis; (#PCDATA) >

DTDs are not without their shortcomings. The library XML DTD demonstrates a few of these:
  • DTDs do not support XML namespaces.
    Namespaces were an addition to the XML 1.0 standard, created to prevent
    element name conflicts in more complex documents. Each element in the
    DTD must be referenced using its fully qualified name. Therefore the
    above DTD uses two entities, nsp (namespace prefix) and nss (namespace
    suffix), which should be overridden in the XML document if
    namespaces are used. Each element name is then declared as an entity
    that includes the prefix.

  • All element and attribute declarations are
    global to the XML document. As demonstrated in the XML, the author and
    category elements are used in two contexts. The declarations
    for these elements must therefore be flexible enough to encompass them
    both. This means the id attribute cannot be made mandatory for the
    category and author elements in the category list and author list,
    because the same declaration also covers their use inside a book element.

  • Unique attributes can be defined in a DTD
    through the use of the ID type, but only at a global level. Each author
    and category must have a unique id attribute, but the id need only be
    unique within the context of the categories or authors list. Using the
    DTD ID type would require every id to be unique across all authors and
    categories in the document. This is not so much a limitation of DTDs as
    a feature, since ID attributes provide a way of uniquely identifying an
    element, which in turn allows use of the DOM method getElementById().

  • DTDs allow you to define optional elements
    (with ?), elements that may occur zero or more times (with *), elements
    that must occur at least once (with +) and elements that must occur
    exactly once. However, they do not allow you to enforce an upper limit on
    the number of elements that may occur.

  • DTDs support several types of data including
    enumerations, as demonstrated by the hascover attribute declaration for the
    book element. However, they do not allow you to declare more specific
    data types such as numbers and booleans, and do not support custom data
    types.
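
The global nature of the ID type can be seen in a small experiment. The sketch below uses a hypothetical, cut-down library document (not the one from the ZIP file) in which both id attributes are declared with the ID type. It is only meant to show that an id shared between a category and an author is rejected, and that getElementById() becomes usable once the document has been validated.

```php
<?php
libxml_use_internal_errors(true);   // keep validity errors out of the output

// Hypothetical miniature library: both id attributes are declared with the ID type.
$doctype = <<<DTD
<!DOCTYPE library [
  <!ELEMENT library (category, author)>
  <!ELEMENT category (#PCDATA)>
  <!ELEMENT author (#PCDATA)>
  <!ATTLIST category id ID #REQUIRED>
  <!ATTLIST author id ID #REQUIRED>
]>
DTD;

function loadLibrary($xml)
{
    $doc = new DOMDocument();
    $doc->loadXML($xml);
    return $doc;
}

// Distinct ids: the document validates and elements can be fetched by id.
$distinct = loadLibrary($doctype
    . '<library><category id="c1">Fiction</category><author id="a1">Jane Example</author></library>');
$distinctValid = $distinct->validate();
$author        = $distinct->getElementById('a1');
$authorName    = $author ? $author->textContent : null;

// The same id on a category and an author: ID values are global to the document.
$clash = loadLibrary($doctype
    . '<library><category id="x1">Fiction</category><author id="x1">Jane Example</author></library>');
$clashValid = $clash->validate();

var_dump($distinctValid, $authorName, $clashValid);
```

Note that ID values must also be valid XML names, so purely numeric ids such as "1" would be rejected as well.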

Validating against a DTD in PHP
There are two ways to validate a document against its DTD using PHP 5's
DOM extension. The first and preferred way is to validate the document
as it is parsed. This involves setting a flag before the XML is loaded:

PHP 5:

$library = new DOMDocument("1.0");
$library->validateOnParse = true;

libxml_use_internal_errors(true); // collect libxml errors instead of emitting PHP warnings
libxml_clear_errors();

if (!$library->load('library.xml')) {
    die('Error Loading Document');
}

if (libxml_get_last_error()) {
    die('Error Parsing Document');
}

The validateOnParse property causes the DOMDocument object to do
exactly that. Notice how the libxml functions are used to check for
validation errors: although it arguably should, the load() function
does not return false if DTD validation fails.
In order to have entities replaced and default attribute values set,
pass the appropriate libxml constants (LIBXML_NOENT and LIBXML_DTDATTR
respectively) as the second, optional argument of the load() function.
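
To see what the parser actually complained about, the collected errors can be inspected with libxml_get_errors(). The following is a minimal sketch using a hypothetical in-line document rather than the library XML: the required isbn attribute is deliberately missing, and LIBXML_DTDATTR is passed so that the default value of hascover is filled in.

```php
<?php
libxml_use_internal_errors(true);   // collect errors so they can be inspected afterwards
libxml_clear_errors();

// Hypothetical document: the required isbn attribute is missing, hascover has a default.
$xml = <<<XML
<?xml version="1.0"?>
<!DOCTYPE book [
  <!ELEMENT book (#PCDATA)>
  <!ATTLIST book
    isbn CDATA #REQUIRED
    hascover (yes | no) "no">
]>
<book>Example Title</book>
XML;

$doc = new DOMDocument();
$doc->validateOnParse = true;
$doc->loadXML($xml, LIBXML_DTDATTR);   // LIBXML_DTDATTR adds default attribute values

// The default declared in the DTD has been applied to the element.
$hascover = $doc->documentElement->getAttribute('hascover');

// Each LibXMLError object carries the line number and a message.
$errors = libxml_get_errors();
foreach ($errors as $error) {
    printf("Line %d: %s\n", $error->line, trim($error->message));
}
libxml_clear_errors();
```

Reporting every entry from libxml_get_errors() is usually more helpful than the single error returned by libxml_get_last_error(), at the cost of a little more code.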
The second way is to use the validate() method of the DOMDocument object. Because the XML document has
already been loaded, entity replacements cannot be carried out when
using the validate() method. In particular, any entity declarations
contained within the XML document type declaration are ignored and do
not override the external declarations.

 

PHP 5:

$library = new DOMDocument("1.0");
$library->validateOnParse = false; // no validation while parsing; validate() is called explicitly below

$library->load('library.xml');

if (!$library->validate()) {
    die('DTD Validation failure.');
}

A document can also be validated against its DTD when loading XML into a SimpleXML object, using the LIBXML constants:


PHP 5:

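libxml_clear_errors();

// the second argument is the class of the returned object; the third takes the libxml options
$library = simplexml_load_file('library.xml', 'SimpleXMLElement', LIBXML_DTDVALID);

if (libxml_get_last_error()) {
    die('Error validating / loading XML');
}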

While DTDs have their limitations, they are still the de facto standard
in XML validation: they are supported by the vast majority of XML
parsers and, being part of the XML 1.0 specification itself, do not
rely on any additional standard. One of their main strengths is that
they are applied to the XML as it is parsed. This allows for the
creation of custom entities, such as the &copy; entity in HTML, which
are replaced as the document is loaded. Further information on the
capabilities of DTDs can be found in the specification.

The XML Schema Language
The XML Schema language was designed to supersede DTDs and address
their limitations, allowing finer control over validation. XML Schemas
are themselves written in XML, making them extensible and easy to
understand. They also include full support for XML namespaces. Because
XML Schema is a W3C standard, the majority of XML parsers include
support for schema validation.

The importance of namespaces in XML Schema
Namespaces play an important role in XML schema validation. All schemas
should (but are not required to) be declared with the target namespace
of the XML they are validating. Failure to define a namespace in the
schema and the XML may result in naming conflicts. The root element of
the library XML declares the following namespaces:
<library xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.phpbuilder.com/adam_delves/library_xml library-xml.xsd"
xmlns="http://www.phpbuilder.com/adam_delves/library_xml">
The location of the XML schema is referenced in the root element using
the schemaLocation attribute, and the library namespace is declared as
the default namespace. The full XML schema is included in the ZIP file
which accompanies this article. Below are some of the key features of
the XML Schema language.
  • Elements can be of a simple type or complex type.

    A simple type element may contain only simple character data, similar to the #PCDATA type in DTDs:

    <xs:element name="author" type="xs:int" maxOccurs="unbounded" />

  • A complex type is an element which may
    contain attributes, other elements, a mixture of other elements and
    character data or a custom type:

    <xs:complexType>
      <!-- elements in an all group may appear 0 or 1 times in any order -->
      <xs:all>
        <!-- the minOccurs attribute effectively makes these elements mandatory -->
        <xs:element minOccurs="1" type="lib:authorsDef" name="authors" />
        <xs:element minOccurs="1" type="lib:categoriesDef" name="categories" />
        <xs:element minOccurs="1" type="lib:booksDef" name="books" />
      </xs:all>

      <!-- declaration of an optional name attribute -->
      <xs:attribute name="name" type="xs:string" />
    </xs:complexType>

  • XPath expressions are used to define unique
    key constraints. They are not limited to attribute values: they
    can also be applied to element content and any other data selected by
    an XPath expression. Unique keys are also applied to the ids in the
    author and category lists.

    <xs:key name="isbnUnique">
      <xs:selector xpath="lib:books/lib:book" />
      <xs:field xpath="@isbn" />
    </xs:key>

  • Reference constraints can also be applied to
    XML data. They are defined in a similar manner to the unique key
    constraints and use an XPath expression to select the data that the
    reference applies to. Notice how the qualified name (the name
    including the namespace prefix) is used to refer to the name assigned
    to a key constraint, in this case the cidUnique key applied to the
    category ids.

    <xs:keyref name="validCategory" refer="lib:cidUnique">
      <xs:selector xpath="lib:books/lib:book/lib:category" />
      <xs:field xpath="." />
    </xs:keyref>

  • A schema is not defined in
    terms of a single root element. In fact it can be included as part of
    another schema, and can itself include
    declarations and custom types that are defined in other schemas. This is one of
    XML Schema's biggest strengths.
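
A unique key constraint of this kind can be tried out in isolation. The sketch below is an assumption-laden miniature, not the schema from the ZIP file: it uses a hypothetical urn:example:library namespace and a books element whose book children must carry unique isbn attributes, and validates two in-line documents with schemaValidateSource().

```php
<?php
libxml_use_internal_errors(true);   // suppress warnings for the invalid document

// Hypothetical schema and namespace, loosely modelled on the isbnUnique key.
$schema = <<<XSD
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:lib="urn:example:library"
           targetNamespace="urn:example:library"
           elementFormDefault="qualified">
  <xs:element name="books">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="book" maxOccurs="unbounded">
          <xs:complexType>
            <xs:simpleContent>
              <xs:extension base="xs:string">
                <xs:attribute name="isbn" type="xs:string" use="required" />
              </xs:extension>
            </xs:simpleContent>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
    <xs:key name="isbnUnique">
      <xs:selector xpath="lib:book" />
      <xs:field xpath="@isbn" />
    </xs:key>
  </xs:element>
</xs:schema>
XSD;

function matchesSchema($xml, $schema)
{
    $doc = new DOMDocument();
    $doc->loadXML($xml);
    return $doc->schemaValidateSource($schema);
}

$unique    = '<books xmlns="urn:example:library"><book isbn="111">One</book><book isbn="222">Two</book></books>';
$duplicate = '<books xmlns="urn:example:library"><book isbn="111">One</book><book isbn="111">Two</book></books>';

$uniqueOk    = matchesSchema($unique, $schema);      // distinct isbn values
$duplicateOk = matchesSchema($duplicate, $schema);   // repeated isbn violates the key

var_dump($uniqueOk, $duplicateOk);
```

Because the key is declared on the books element, the isbn values only need to be unique within that element, which is exactly the scoping that the DTD ID type cannot express.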

Validating against XML Schemas in PHP
Schema validation is carried out after the XML document has been
loaded. In PHP 5, the DOMDocument object provides the schemaValidate()
method. To validate the current document against an XML schema, simply
supply it with the path of the XML schema file. For XML documents with
a schema declared in the root element, it is possible to write a small
method to carry out schema validation automatically.

PHP 5:

$library = new SchemaDOMDocument("1.0");
$library->validateOnParse = true;

$library->load('library.xml');
$library->validateXMLSchemas();

class SchemaDOMDocument extends DOMDocument
{
    public function validateXMLSchemas()
    {
        $schemaLocation = $this->documentElement->getAttributeNS('http://www.w3.org/2001/XMLSchema-instance', 'schemaLocation');

        if (!$schemaLocation) {
            throw new DOMException('No schemas found');
        }

        /* the schemaLocation contains pairs of values separated by spaces. The first value in each pair
           is the namespace to be validated. The second is a URI defining the location of the schema.

           validate each namespace using the provided URI
         */
        $pairs = preg_split('/\s+/', trim($schemaLocation));
        $pairCount = count($pairs);

        if ($pairCount <= 1) {
            throw new DOMException('Invalid schema location value.');
        }

        $valid = true;

        // the schema URIs sit at the odd indexes: 1, 3, 5, ...
        for ($x = 1; $x < $pairCount; $x += 2) {
            $valid = $this->schemaValidate($pairs[$x]) && $valid;
        }

        if (!$valid) {
            throw new DOMException('XML Schema Validation Failure');
        }

        return true;
    }