Using XML Part 6 – Validation
technologies and how they can be utilised using PHP 5. A subject we
have not yet touched upon is XML validation. This article explores
the application-independent XML validation standards: DTDs, the XML
Schema language and the XSLT-based Schematron language. I will
demonstrate how to validate XML in PHP and how PHP 5’s XSL extension
can be used to validate XML using the Schematron language.
Library XML seen in part 1 of this series. An up to date,
namespaced example including a doctype is contained within the ZIP file
accompanying this article.
must be well formed, attributes must be enclosed in double or single
quotes, and character data must be escaped accordingly. An XML document
which does not conform to these constraints is not an XML document, and
this simple structural validation is carried out by the majority of XML
parsers. This is, however, by no means the be-all and end-all of XML
data validation. To draw an analogy with a real-world scenario, a
well-formed XML document is like a well-built building. It is
structurally sound, but if the only requirement is that it is a
building, how does one know that it contains the right rooms, the
required number of floors and the correct décor? If the building is not
to the correct specification, it may as well be a pile of rubble.
enough to constitute a valid XML document. You must ensure that the
data and elements you need are all present, appear in the correct order
and contain the data you expect. This validation can be carried out at
the application level. For example, in PHP the DOM API can be used to
validate the document structure.
PHP 5:
$categories = array();
$XMLCategories = $xml->getElementsByTagName('categories')->item(0);
if ($XMLCategories) {
    foreach ($XMLCategories->getElementsByTagName('category') as $categoryNode) {
        /* notice how we get attributes */
        $cid = $categoryNode->getAttribute('id');
        if (! $cid) continue; // if the id attribute is not present, ignore the element
        $categories[$cid] = $categoryNode->firstChild->nodeValue;
    }
} else {
    die('No Categories Found');
}
the first article in the series and checks that the categories element
contains category elements and that each one contains an id attribute.
While this validation works, it
destroys the portable nature of XML and forces applications to agree on
a validation standard before using the XML data.
XML 1.0 specification. The DTD has been around since the days of SGML
(the standard in which XML has its roots) and, as the name suggests,
a DTD defines the structure of a document: the elements that can
appear, the order in which they can appear, the attributes they can
have and the data they can contain.
declaration at the top of an XML document or can be included inside
the document type declaration as an in-line DTD:
<?xml version="1.0"?>
<!DOCTYPE library SYSTEM "library.dtd" [
<!ENTITY % nsp "lib:">
<!ENTITY % nss ":lib">
] >
an external DTD, library.dtd, and includes its own document type
definition. Entity and attribute declarations made in the local DTD
take precedence over those in the external DTD, because the local DTD
is read first and the first declaration encountered is binding. In the
above example, the entity nsp is declared in the local DTD, which will
thus override any declaration of the entity in the external DTD.
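The precedence rule can be seen in isolation. The following is a minimal sketch, not the library example itself: to stay self-contained it places both declarations of a hypothetical publisher entity in a single in-line DTD rather than splitting them between the DOCTYPE and an external file. It relies on the XML rule that the first declaration of an entity is the one that counts, which is the same rule that lets a local declaration win over an external one.

```php
<?php
// Hypothetical document: the "publisher" entity is declared twice.
$xml = <<<'XML'
<?xml version="1.0"?>
<!DOCTYPE note [
  <!ELEMENT note (#PCDATA)>
  <!ENTITY publisher "Local Press">
  <!ENTITY publisher "External Press">
]>
<note>&publisher;</note>
XML;

$doc = new DOMDocument();
// LIBXML_NOENT asks the parser to substitute entity references while loading
$doc->loadXML($xml, LIBXML_NOENT);

// The first declaration encountered is the one that is used
$resolved = $doc->documentElement->textContent;
echo $resolved, "\n"; // Local Press
```

The second declaration is simply ignored, which is why an external DTD can safely ship default values for entities such as nsp and nss.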
<?xml version="1.0" encoding="iso-8859-1" ?>
<!-- An external DTD may begin with a text declaration such as the one above; it is optional -->
<!--
DTDs are not namespace aware. However, a namespace prefix and suffix can be
declared as entities in the DTD, which can then be overridden in the DOCTYPE declaration or by changing
the DTD.
-->
<!-- Namespace prefix should be overridden in the local instance where it is not the default namespace -->
<!ENTITY % nsp "" >
<!-- Namespace suffix should be overridden in the local instance where it is not the default namespace -->
<!ENTITY % nss "" >
<!-- xmlns entity declaration -->
<!ENTITY % nsdec "xmlns%nss;" >
<!-- each POSSIBLE namespaced element is now declared as an entity -->
<!ENTITY % library "%nsp;library" >
<!ENTITY % categories "%nsp;categories" >
<!ENTITY % authors "%nsp;authors" >
<!ENTITY % books "%nsp;books" >
<!ENTITY % book "%nsp;book" >
<!ENTITY % category "%nsp;category" >
<!ENTITY % author "%nsp;author" >
<!ENTITY % title "%nsp;title" >
<!ENTITY % publisher "%nsp;publisher" >
<!ENTITY % cover "%nsp;cover" >
<!ENTITY % synopsis "%nsp;synopsis" >
<!-- element and attribute list declarations now use the element
entities declared above to define the XML document -->
<!ELEMENT %library; (%categories;,%authors;, %books;) >
<!-- allow for the possible declaration of an XML schema. Must use the XSI namespace -->
<!ATTLIST %library;
%nsdec; CDATA #FIXED "https://phpbuilder.com/adam_delves/library_xml"
xmlns:xsi CDATA #FIXED "http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation CDATA #IMPLIED >
<!-- the categories element must contain at least one category element, indicated by + -->
<!ELEMENT %categories; ((%category;)+) >
<!ELEMENT %category; (#PCDATA) >
<!-- the ID attribute is declared as optional as the category element also appears as a child
element of the book element, where it should reference a valid ID from a category element
in the categories section
-->
<!ATTLIST %category; id CDATA #IMPLIED >
<!-- the authors element must contain at least 1 author element, indicated by + -->
<!ELEMENT %authors; ((%author;)+) >
<!-- the ID attribute is declared as optional as the author element also appears as a child
element of the book element, where it should reference a valid ID from an author element
in the authors section -->
<!ELEMENT %author; (#PCDATA) >
<!ATTLIST %author; id CDATA #IMPLIED >
<!-- the books element may contain 0 or more book elements, indicated by the * -->
<!ELEMENT %books; ((%book;)*) >
<!-- a sequence of elements is declared using commas; the elements must appear in the same order as specified in
the sequence. The cover and synopsis elements are optional (indicated by a ?): they may occur only
once, but their occurrence is optional -->
<!ELEMENT %book; ((%title;),(%publisher;),(%category;)+,(%author;)+,(%cover;)?,(%synopsis;)?) >
<!-- the isbn attribute is required. The hascover attribute is optional, but when specified may contain
only the values yes and no. no is the default -->
<!ATTLIST %book;
isbn CDATA #REQUIRED
hascover (yes | no) "no" >
<!-- these elements may only contain character data, comments and processing instructions -->
<!ELEMENT %title; (#PCDATA) >
<!ELEMENT %publisher; (#PCDATA) >
<!ELEMENT %cover; (#PCDATA) >
<!ELEMENT %synopsis; (#PCDATA) >
- DTDs do not support XML namespaces. Namespaces were an addition to
the XML 1.0 standard created to prevent element name conflicts in more
complex documents. Each element in the DTD must be referenced using its
fully qualified name. The above DTD therefore uses two entities, nsp
(namespace prefix) and nss (namespace suffix), which should be
overridden in the XML document if namespaces are used. Each element is
then declared with the prefix as an entity.
- All element and attribute declarations are global to the XML
document. As demonstrated in the XML, the author and category elements
are used in two contexts. The element declarations for these elements
must therefore be flexible enough to encompass them both. This means
the required id attribute of the category and author elements cannot be
enforced when it is part of a category list or author list.
- Unique attributes can be defined in a DTD through the use of the ID
type, but only at a global level. Each author and category must have a
unique id attribute, but the id need only be unique within the context
of the categories and authors lists. Using the DTD ID type would
require that all authors and categories be unique across the whole
document. This is not a limitation of DTDs but a feature, as using ID
types in attributes provides a way of uniquely identifying an element.
This then allows use of the DOM method getElementById().
- DTDs allow you to define optional elements (with ?), elements that
may occur zero or more times (with *), elements that must occur at
least once (with +) and elements that must occur exactly once. However,
they do not allow you to enforce an upper limit on the number of
elements that may occur.
- DTDs support several types of data including enumerations, as
demonstrated by the hascover attribute declaration for the book
element. However, they do not allow you to declare more specific data
types such as numbers and booleans, and do not support custom data
types.
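To make the point about the ID type concrete, here is a minimal sketch using a cut-down, hypothetical categories document with an in-line DTD (not the library DTD above). The isValidAgainstInlineDtd() helper is my own name for illustration. Note that ID values must be valid XML names, so they cannot start with a digit.

```php
<?php
// Helper (hypothetical name): load a string and validate it against its in-line DTD
function isValidAgainstInlineDtd($xml)
{
    $previous = libxml_use_internal_errors(true); // collect errors instead of printing warnings
    $doc = new DOMDocument();
    $ok = $doc->loadXML($xml) && $doc->validate();
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $ok;
}

$dtd = <<<'DTD'
<!DOCTYPE categories [
  <!ELEMENT categories (category+)>
  <!ELEMENT category (#PCDATA)>
  <!ATTLIST category id ID #REQUIRED>
]>
DTD;

$unique    = $dtd . '<categories><category id="c1">Fiction</category><category id="c2">Reference</category></categories>';
$duplicate = $dtd . '<categories><category id="c1">Fiction</category><category id="c1">Reference</category></categories>';

var_dump(isValidAgainstInlineDtd($unique));    // bool(true)
var_dump(isValidAgainstInlineDtd($duplicate)); // bool(false): the same ID is used twice

// Because id is declared with the ID type, the element can be looked up directly
$doc = new DOMDocument();
$doc->validateOnParse = true;
$doc->loadXML($unique);
$name = $doc->getElementById('c2')->textContent;
echo $name, "\n"; // Reference
```

If an author element in the same document also used c1 as its id, validation would fail as well, which is the global uniqueness described in the list above.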
validated using PHP 5’s DOM extension. The first and preferred way is
to validate it as it is parsed. This involves setting a flag before the
XML is loaded:
PHP 5:
$library = new DOMDocument("1.0");
$library->validateOnParse = true;
libxml_clear_errors();
if (!$library->load($file)) {
    die('Error Loading Document');
}
if (libxml_get_last_error()) {
    die('Error Parsing Document');
}
DOMDocument object to do exactly that. Notice how the libxml
functions are used to check for validation errors. Although one might
expect it to, the load() function does not return false if DTD
validation fails. In order to have entities replaced and default
attribute values set, pass the appropriate libxml constants
(LIBXML_NOENT and LIBXML_DTDATTR respectively) to the second, optional
argument of the load() function.
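As a rough illustration of those constants, the sketch below validates a small, hypothetical book document while it is parsed. It uses loadXML() with an in-line DTD so that it is self-contained; loadXML() takes the same options argument as load(). The pub entity and the ISBN value are made up for the example.

```php
<?php
$xml = <<<'XML'
<?xml version="1.0"?>
<!DOCTYPE book [
  <!ELEMENT book (#PCDATA)>
  <!ATTLIST book
    isbn     CDATA      #REQUIRED
    hascover (yes | no) "no">
  <!ENTITY pub "Example Press">
]>
<book isbn="0000000000">Published by &pub;</book>
XML;

libxml_use_internal_errors(true); // collect problems rather than emitting PHP warnings
libxml_clear_errors();

$doc = new DOMDocument();
$doc->validateOnParse = true;
// LIBXML_NOENT substitutes entities, LIBXML_DTDATTR adds default attribute values
$doc->loadXML($xml, LIBXML_NOENT | LIBXML_DTDATTR);

$errors = libxml_get_errors(); // empty when the document is valid
libxml_clear_errors();

$hascover = $doc->documentElement->getAttribute('hascover');
$text     = $doc->documentElement->textContent;

echo $hascover, "\n"; // no
echo $text, "\n";     // Published by Example Press
```

The hascover attribute never appears in the source document; it is supplied from the default declared in the DTD.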
already been loaded; entity replacements cannot be carried out when
using the validate method. In particular, any entity declarations
contained within the XML document type declaration are ignored and do
not override the external declarations.
PHP 5:
$library = new DOMDocument("1.0");
$library->validateOnParse = true;
$library->load('library.xml');
// validate() returns false when the document does not conform to its DTD
if (!$library->validate()) {
    die('DTD Validation failure.');
}
PHP 5:
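libxml_clear_errors();
// the second argument is the class of the returned object; SimpleXMLElement is the default
$library = simplexml_load_file('library.xml', 'SimpleXMLElement', LIBXML_DTDVALID);
if (libxml_get_last_error()) {
    die('Error validating / loading XML');
}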
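The examples above only report that something went wrong. When the reason matters, the libxml error functions can return every recorded problem. The following is a minimal sketch; dtdValidationErrors() is a hypothetical helper name and the sample documents are made up for the illustration.

```php
<?php
// Hypothetical helper: return the validation messages for a document with an in-line DTD
function dtdValidationErrors($xml)
{
    $previous = libxml_use_internal_errors(true); // buffer errors instead of raising warnings
    libxml_clear_errors();

    $doc = new DOMDocument();
    if ($doc->loadXML($xml)) {
        $doc->validate(); // validity errors are added to the libxml error buffer
    }

    $messages = array();
    foreach (libxml_get_errors() as $error) {
        $messages[] = sprintf('line %d: %s', $error->line, trim($error->message));
    }

    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $messages;
}

$dtd = <<<'DTD'
<!DOCTYPE book [
  <!ELEMENT book (#PCDATA)>
  <!ATTLIST book isbn CDATA #REQUIRED>
]>
DTD;

$valid   = dtdValidationErrors($dtd . '<book isbn="0000000000">A title</book>');
$invalid = dtdValidationErrors($dtd . '<book>A title</book>'); // required isbn is missing

foreach ($invalid as $message) {
    echo $message, "\n";
}
```

The exact wording of the messages comes from libxml and may differ between versions, so it is safer to test whether the list is empty than to match on the text.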
While DTDs have their limitations, they are still the de facto
standard in XML validation: they are supported by the vast majority of
XML parsers and, being part of the XML 1.0 specification itself, do not
rely on an additional standard. DTDs still play an important part in
XML validation. One of their main strengths is that they are applied to
the XML as it is parsed. This allows for the creation of custom
entities, such as the &copy; entity in HTML, which are replaced as the
document is loaded. Further information on the capabilities of DTDs can
be found in the specification.
supersede DTDs and address their limitations, allowing further control
over validation. XML Schemas are written in XML, making them extensible
and easy to understand. They also include full support for XML
namespaces. As XML Schema is a W3C standard, all but a few XML parsers
include support for schema validation.
schema validation. All schemas should (but are not required to) be
declared with the target namespace of the XML they are validating.
Failure to define a namespace in the schema and the XML may result in
naming conflicts. The library XML is declared using the following
namespace:
<library xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="https://phpbuilder.com/adam_delves/library_xml library-xml.xsd"
xmlns="https://phpbuilder.com/adam_delves/library_xml">
in the root element using the schemaLocation attribute and the default
namespace used for the library XML. The full XML schema is included in
the ZIP file which accompanies this article. Below are some of the key
features of the XML Schema language.
- Elements can be of a simple type or a complex type. A simple type
element may contain only simple character data, similar to the #PCDATA
type in DTDs:
<xs:element name="author" type="xs:int" maxOccurs="unbounded" />
- A complex type is an element which may contain attributes, other
elements, a mixture of other elements and character data, or a custom
type:
<xs:complexType>
  <!-- elements in an all group may appear 0 or 1 times in any order -->
  <xs:all>
    <!-- the minOccurs attribute effectively makes these elements mandatory -->
    <xs:element minOccurs="1" type="lib:authorsDef" name="authors" />
    <xs:element minOccurs="1" type="lib:categoriesDef" name="categories" />
    <xs:element minOccurs="1" type="lib:booksDef" name="books" />
  </xs:all>
  <!-- declaration of an optional name attribute -->
  <xs:attribute name="name" type="xs:string" />
</xs:complexType>
- XPath expressions are used to define unique key constraints. These
are not limited to attribute values; they can also be applied to
element content and any other data selected by an XPath expression.
Unique keys are also applied to the ids in the author and category
lists.
<xs:key name="isbnUnique">
  <xs:selector xpath="lib:books/lib:book" />
  <xs:field xpath="@isbn" />
</xs:key>
- Reference constraints can also be applied to XML data. They are
defined in a similar manner to the unique key constraints and use an
XPath expression to select the data that the reference applies to.
Notice how the refer attribute uses the qualified name (the name
including the namespace prefix) of a key constraint, in this case the
key declared for the category ids.
<xs:keyref name="validCategory" refer="lib:cidUnique">
  <xs:selector xpath="lib:books/lib:book/lib:category" />
  <xs:field xpath="." />
</xs:keyref>
- The schema itself is not defined in terms of its root element. In
fact, it can be included as part of another schema, or it can itself
include definitions, declarations and custom types that are defined in
other schemas. This is one of XML Schema’s biggest strengths.
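The key constraint can be tried out without the full library schema. The sketch below is an assumption-laden miniature: the urn:example:library namespace, the cut-down books schema and the validatesAgainst() helper are all invented for the illustration, and schemaValidateSource() is used so that the schema can live in a string rather than a file.

```php
<?php
// Hypothetical helper: validate an XML string against an XML Schema held in a string
function validatesAgainst($xml, $xsd)
{
    $previous = libxml_use_internal_errors(true); // keep schema errors out of the output
    $doc = new DOMDocument();
    $ok = $doc->loadXML($xml) && $doc->schemaValidateSource($xsd);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $ok;
}

$xsd = <<<'XSD'
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:lib="urn:example:library"
           targetNamespace="urn:example:library"
           elementFormDefault="qualified">
  <xs:element name="books">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="book" minOccurs="0" maxOccurs="unbounded">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="title" type="xs:string" />
            </xs:sequence>
            <xs:attribute name="isbn" type="xs:string" use="required" />
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
    <!-- every book under this books element must have a distinct isbn -->
    <xs:key name="isbnUnique">
      <xs:selector xpath="lib:book" />
      <xs:field xpath="@isbn" />
    </xs:key>
  </xs:element>
</xs:schema>
XSD;

$ns = 'xmlns="urn:example:library"';
$distinct = "<books $ns><book isbn=\"111\"><title>A</title></book><book isbn=\"222\"><title>B</title></book></books>";
$repeated = "<books $ns><book isbn=\"111\"><title>A</title></book><book isbn=\"111\"><title>B</title></book></books>";

var_dump(validatesAgainst($distinct, $xsd)); // bool(true)
var_dump(validatesAgainst($repeated, $xsd)); // bool(false): isbnUnique is violated
```

Unlike the DTD ID type, the key is scoped by its selector, so the same value could still be used by a differently selected set of elements elsewhere in the document.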
document has been loaded. In PHP 5, the DOMDocument object provides the schemaValidate() method. To validate the current document against
an XML schema, simply supply it with the path of the XML schema file.
For XML documents with a schema declared in the root element, it is
possible to write a small function to carry out schema validation
automatically.
PHP 5:
$library = new SchemaDOMDocument("1.0");
$library->validateOnParse = true;
$library->load('library.xml');
$library->validateXMLSchemas();

class SchemaDOMDocument extends DOMDocument
{
    public function validateXMLSchemas()
    {
        $schemaLocation = $this->documentElement->getAttributeNS('http://www.w3.org/2001/XMLSchema-instance', 'schemaLocation');
        if (! $schemaLocation) {
            throw new DOMException('No schemas found');
        }
        /* the schemaLocation contains pairs of values separated by whitespace. The first value in each pair
           is the namespace to be validated. The second is a URI defining the location of the schema.
           Validate each namespace using the provided URI
        */
        $pairs = preg_split('/\s+/', trim($schemaLocation)); // split on runs of whitespace
        $pairCount = count($pairs);
        if ($pairCount <= 1) {
            throw new DOMException('Invalid schema location value.');
        }
        $valid = true;
        // the schema URIs sit at the odd indexes (1, 3, 5, ...)
        for ($x = 1; $x < $pairCount; $x += 2) {
            $valid = $this->schemaValidate($pairs[$x]) && $valid;
        }
        if (! $valid) {
            throw new DOMException('XML Schema Validation Failure');
        }
        return true;
    }
}
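The pair-splitting step in the class above can also be pulled out and checked on its own. This is only a sketch of one way to do it: parseSchemaLocation() is a hypothetical function name, and rejecting an odd number of tokens is my own assumption about how a malformed value should be treated.

```php
<?php
// Hypothetical helper: turn an xsi:schemaLocation value into a namespace => schema URI map
function parseSchemaLocation($value)
{
    // split on any run of whitespace and drop empty tokens
    $tokens = preg_split('/\s+/', trim($value), -1, PREG_SPLIT_NO_EMPTY);
    $count  = count($tokens);

    // the value must consist of complete namespace / location pairs
    if ($count === 0 || $count % 2 !== 0) {
        throw new InvalidArgumentException('Invalid schema location value.');
    }

    $map = array();
    for ($i = 0; $i < $count; $i += 2) {
        $map[$tokens[$i]] = $tokens[$i + 1];
    }
    return $map;
}

// the schemaLocation value used by the library XML in this article
$map = parseSchemaLocation("https://phpbuilder.com/adam_delves/library_xml   library-xml.xsd");

foreach ($map as $namespace => $location) {
    echo $namespace, ' => ', $location, "\n";
}
```

Keeping the namespace alongside each location makes it possible to report which namespace failed validation rather than only that a failure occurred.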