This series has so far focused on XML
technologies and how they can be utilised using PHP 5. A subject we
have not yet touched upon is XML validation. This article will
explore the application-independent XML validation standards: DTD's,
the XML Schema language and the XSLT-based Schematron language. I will
demonstrate how to validate XML in PHP and how PHP 5's XSL
extension can be used to validate XML using the Schematron language.
Throughout this article I will be using the
Library XML seen in part 1 of this series. An up to date,
namespaced example including a doctype is contained within the ZIP file
accompanying this article.
The XML standard states that an XML document
must be well formed: attributes must be enclosed in double or single
quotes and character data must be escaped accordingly. A document which
does not conform to these constraints is not an XML document, and this
simple structural validation is carried out by the majority of XML
parsers. This is, however, by no means the be all and end all of XML data
validation. To draw a comparison with a real world scenario, a well formed
XML document is like a well built building. It is structurally sound,
but if the only requirement is that it is a building, how does one know
that the building contains the right rooms, the required number of
floors and the correct décor? If the building is not built to the
correct specification, it may as well be a pile of rubble.
Returning to XML, being well formed is simply not
enough to constitute a valid XML document. You must ensure that the
data and elements you need are all present, in the correct order, and
contain the data you expect. This validation can be carried out at the
application level. For example, in PHP the DOM API can be used to
validate the document structure.
if ($XMLCategories) {
    foreach ($XMLCategories->getElementsByTagName('category') as $categoryNode) {
        /* notice how we get attributes */
        $cid = $categoryNode->getAttribute('id');
        if (! $cid) continue; // if the id attribute is not present, ignore the element
        $categories[$cid] = $categoryNode->firstChild->nodeValue;
    }
} else {
    die('No Categories Found');
}
The above example uses the sample code from
the first article in the series and checks that the categories element
contains category elements and that each one contains an id attribute.
While this validation works, it
destroys the portable nature of XML and forces applications to agree on
a validation standard before using the XML data.
Validation using Document Type Definitions (DTD's)
DTD's were included as part of the original
XML 1.0 specification. DTD's have been around since the days of SGML
(the standard in which XML has its roots) and, as their name suggests,
they define the structure of a document: the elements that can
appear, the order in which they can appear, the attributes they can
have and the data they can contain.
A DTD is referenced in the document type
declaration at the top of an XML document, or can be included inside
the document type declaration as an in-line DTD:
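As a minimal sketch, such a declaration could look like the one embedded in the PHP string below. The lib: prefix and the entity values are assumptions made for illustration; library.dtd and the nsp and nss entities are the ones discussed in this article.

```php
<?php
// Hypothetical instance document: the DOCTYPE names the external DTD
// (library.dtd) and adds an in-line DTD that declares two entities.
$xml = <<<'XML'
<?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE lib:library SYSTEM "library.dtd" [
  <!ENTITY % nsp "lib:">
  <!ENTITY % nss ":lib">
]>
<lib:library xmlns:lib="http://www.phpbuilder.com/adam_delves/library_xml"/>
XML;

$doc = new DOMDocument();
// No DTD-related flags are passed, so library.dtd itself is not fetched here;
// only the document and its in-line declarations are parsed.
$doc->loadXML($xml);

echo $doc->doctype->name, "\n";     // root element name given in the DOCTYPE
echo $doc->doctype->systemId, "\n"; // reference to the external DTD
```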
The above document type declaration references
an external DTD, library.dtd, and includes its own in-line
declarations. Entity and attribute declarations made in-line take
precedence over declarations of the same name in the external DTD. In the above example, the entity
nsp is declared in the local DTD, which will thus override any
declaration of the entity in the external DTD.
The external DTD which defines the library XML is as follows:
<?xml version="1.0" encoding="iso-8859-1"?>
<!-- An external DTD may begin with a text declaration like the one above; it is optional -->

<!-- By default, DTD's do not support namespaces. However, a namespace suffix and prefix
     can be included as entities in the DTD, which can be overridden in the DOCTYPE
     declaration or by changing the DTD. -->

<!-- Namespace prefix should be overridden in the local instance where it is not the default namespace -->
<!ENTITY % nsp "">

<!-- Namespace suffix should be overridden in the local instance where it is not the default namespace -->
<!ENTITY % nss "">

<!-- each POSSIBLE namespaced element is now declared as an entity -->
<!ENTITY % library "%nsp;library">
<!ENTITY % categories "%nsp;categories">
<!ENTITY % authors "%nsp;authors">
<!ENTITY % books "%nsp;books">
<!ENTITY % book "%nsp;book">
<!ENTITY % category "%nsp;category">
<!ENTITY % author "%nsp;author">
<!ENTITY % title "%nsp;title">
<!ENTITY % publisher "%nsp;publisher">
<!ENTITY % cover "%nsp;cover">
<!ENTITY % synopsis "%nsp;synopsis">

<!-- element and attribute list declarations now use the element entities declared above
     to define the XML document -->
<!ELEMENT %library; (%categories;, %authors;, %books;)>

<!-- allow for possible declaration of an XML schema. Must use the XSI namespace -->
<!ATTLIST %library;
    %nsdec; CDATA #FIXED "http://www.phpbuilder.com/adam_delves/library_xml"
    xmlns:xsi CDATA #FIXED "http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation CDATA #IMPLIED
>

<!-- the categories element must contain at least one category element, indicated by + -->
<!ELEMENT %categories; ((%category;)+)>

<!ELEMENT %category; (#PCDATA)>
<!-- the id attribute is declared as optional as the category element also appears as a
     child element of the book element, where it should reference a valid id from a
     category element in the categories section -->
<!ATTLIST %category; id CDATA #IMPLIED>

<!-- the authors element must contain at least one author element, indicated by + -->
<!ELEMENT %authors; ((%author;)+)>

<!-- the id attribute is declared as optional as the author element also appears as a
     child element of the book element, where it should reference a valid id from an
     author element in the authors section -->
<!ELEMENT %author; (#PCDATA)>
<!ATTLIST %author; id CDATA #IMPLIED>

<!-- the books element may contain 0 or more book elements, indicated by the * -->
<!ELEMENT %books; ((%book;)*)>

<!-- a sequence of elements is declared using commas; the elements must appear in the same
     order as specified in the sequence. The cover and synopsis elements are optional
     (indicated by a ?): each may occur at most once -->
<!ELEMENT %book; ((%title;), (%publisher;), (%category;)+, (%author;)+, (%cover;)?, (%synopsis;)?)>

<!-- the isbn attribute is required. The hascover attribute is optional, but when specified
     may contain only the values yes and no. no is the default -->
<!ATTLIST %book;
    isbn CDATA #REQUIRED
    hascover (yes | no) "no"
>

<!-- these elements may only contain character data, comments and processing instructions -->
<!ELEMENT %title; (#PCDATA)>
<!ELEMENT %publisher; (#PCDATA)>
<!ELEMENT %cover; (#PCDATA)>
<!ELEMENT %synopsis; (#PCDATA)>
DTD's are not without their shortcomings. The library XML DTD demonstrates a few of these:
DTD's do not support XML namespaces.
Namespaces were an addition to the XML 1.0 standard, created to prevent
element name conflicts in more complex documents. Each element in the
DTD must be referenced using its fully qualified name. Therefore the
above DTD uses two entities, nsp (namespace prefix) and nss (namespace
suffix), which should be overridden in the XML document if
namespaces are used. Each element is then declared with the prefix as
an entity.
All elements and attribute declarations are
global to the XML document. As demonstrated in the XML, the author and
category elements are used in two contexts. The element declarations
for these elements must therefore be flexible enough to encompass them
both. This means the required id attribute of the category and author
elements cannot be enforced when it is part of a category list or
author list.
Unique attributes can be defined in a DTD
through the use of the ID type, but only at a global level. Each author
and category must have a unique id attribute, but the id need only be
unique within the context of the categories and authors lists. Using the
DTD ID type would require every id value to be unique across the whole
document, so an author and a category could not share the same id.
This is not a limitation of DTD's but a feature, as using ID types in
attributes provides a way of uniquely identifying an element. This then
allows use of the DOM function getElementById().
DTD's allow you to define optional elements
(with ?), elements that may occur zero or more times (with *), elements
that must occur at least once (with +) and elements that must occur
exactly once. However, they do not allow you to enforce an upper limit on
the number of elements that may occur.
DTD's support several types of data including
enumerations, as demonstrated by the hascover attribute declaration for the
book element. However, they do not allow you to declare more specific
data types such as numbers and booleans, and do not support custom data
types.
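The point about ID attributes and getElementById() can be illustrated with a short sketch. The cut-down categories document and its in-line DTD below are assumptions made for the example and are not the library DTD shown above; note that ID values must be valid XML names, so they cannot begin with a digit.

```php
<?php
// Hypothetical document whose in-line DTD declares the id attribute with the ID type.
$xml = <<<'XML'
<?xml version="1.0"?>
<!DOCTYPE categories [
  <!ELEMENT categories (category+)>
  <!ELEMENT category (#PCDATA)>
  <!ATTLIST category id ID #REQUIRED>
]>
<categories>
  <category id="c1">Fiction</category>
  <category id="c2">Travel</category>
</categories>
XML;

$doc = new DOMDocument();
$doc->validateOnParse = true; // validate against the DTD so the ID attributes are registered
$doc->loadXML($xml);

// The element can now be looked up directly by its ID value.
$category = $doc->getElementById('c2');
echo $category->nodeValue, "\n";
```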
Validating against a DTD in PHP
There are two ways in which a document can be
validated against its DTD using PHP 5's DOM extension. The first and preferred way is
to validate it as it is parsed. This involves setting a flag before the
XML is loaded:
PHP 5:
$file = 'library.xml';
$library = new DOMDocument("1.0");
$library->validateOnParse = true;
libxml_clear_errors();
if (!$library->load($file)) {
    die('Error Loading Document');
}
if (libxml_get_last_error()) {
    die('Error Parsing Document');
}
The validateOnParse property causes the
DOMDocument object to do exactly that. Notice how the libxml
functions are used to check for validation errors. Although it
should, the load function does not return false if DTD validation fails.
In order to have entities replaced and default attribute values set,
pass the appropriate libxml constants to the second, optional
argument of the load() function.
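A minimal sketch of passing these constants follows. It uses loadXML() with a small made-up document and in-line DTD so that it is self-contained; the same options argument is accepted by load().

```php
<?php
// Hypothetical document: the in-line DTD supplies a default attribute value and an entity.
$xml = <<<'XML'
<?xml version="1.0"?>
<!DOCTYPE book [
  <!ELEMENT book (#PCDATA)>
  <!ATTLIST book hascover (yes | no) "no">
  <!ENTITY pub "Example Press">
]>
<book>&pub;</book>
XML;

libxml_use_internal_errors(true); // collect errors instead of emitting warnings
libxml_clear_errors();

$doc = new DOMDocument();
// LIBXML_DTDLOAD  - load the external DTD, if one is referenced
// LIBXML_DTDATTR  - apply default attribute values declared in the DTD
// LIBXML_NOENT    - replace entity references with their values
// LIBXML_DTDVALID - validate against the DTD while parsing
$doc->loadXML($xml, LIBXML_DTDLOAD | LIBXML_DTDATTR | LIBXML_NOENT | LIBXML_DTDVALID);

$errors = libxml_get_errors();
$book = $doc->documentElement;
echo $book->getAttribute('hascover'), "\n"; // default value taken from the DTD
echo $book->textContent, "\n";              // entity replaced by its value
```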
The second way is to use the validate() function of the DOMDocument object. Because the XML document has
already been loaded, entity replacements cannot be carried out when
using the validate() method. In particular, any entity declarations
contained within the XML document type declaration are ignored and do
not override the external declarations.
PHP 5:
$library = new DOMDocument("1.0");
$library->load('library.xml');
if (! $library->validate()) {
    die('DTD Validation failure.');
}
A DTD can also be validated when loading XML into a SimpleXML object, using the LIBXML constants:
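As a minimal sketch, the LIBXML_DTDVALID constant can be passed as the third argument of simplexml_load_string() (or simplexml_load_file()). The helper name loadValidated() and the sample documents are hypothetical. Validation failures do not necessarily prevent the object from being created, so the libxml error list is checked as well.

```php
<?php
// Hypothetical helper: parse with DTD validation and return the object plus any errors.
function loadValidated(string $xml): array
{
    libxml_use_internal_errors(true);
    libxml_clear_errors();
    $sx = simplexml_load_string($xml, 'SimpleXMLElement', LIBXML_DTDLOAD | LIBXML_DTDVALID);
    $errors = libxml_get_errors();
    libxml_clear_errors();
    return array($sx, $errors);
}

$dtd = '<!DOCTYPE categories ['
     . '<!ELEMENT categories (category+)>'
     . '<!ELEMENT category (#PCDATA)>'
     . '<!ATTLIST category id CDATA #IMPLIED>'
     . ']>';

// A document that satisfies the DTD and one that does not (no category element).
list($good, $goodErrors) = loadValidated($dtd . '<categories><category id="1">Fiction</category></categories>');
list($bad, $badErrors)   = loadValidated($dtd . '<categories/>');

echo count($goodErrors), " error(s) in the first document\n";
echo count($badErrors), " error(s) in the second document\n";
```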
The XML Schema language was designed to
supersede DTD's and address their limitations, allowing further control
over validation. XML Schemas are written in XML, making them extensible
and easy to understand. They also include full support for XML
namespaces. Because XML Schema is a W3C standard, all but a few XML parsers
include support for schema validation.
The importance of namespaces in XML Schema
Namespaces play an important role in XML
schema validation. All schemas should (but are not required to) be
declared with the target namespace of the XML they are validating.
Failure to define a namespace in the schema and the XML may result in
naming conflicts. The library XML is declared using the following
namespace:
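A sketch of a root element declared in this way is shown below, embedded in a PHP string so that the values can be read back. The namespace URI is the one used throughout this article; the schema file name library-xml.xsd is taken from the validation example later on, and its use in the schemaLocation attribute here is an assumption.

```php
<?php
// Hypothetical root element: default namespace plus a schemaLocation hint.
$xml = <<<'XML'
<?xml version="1.0"?>
<library xmlns="http://www.phpbuilder.com/adam_delves/library_xml"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://www.phpbuilder.com/adam_delves/library_xml library-xml.xsd"/>
XML;

$doc = new DOMDocument();
$doc->loadXML($xml);
$root = $doc->documentElement;

// The namespace the library element belongs to.
echo $root->namespaceURI, "\n";

// The namespace / schema location pair declared on the root element.
$location = $root->getAttributeNS('http://www.w3.org/2001/XMLSchema-instance', 'schemaLocation');
echo $location, "\n";
```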
The location of the XML schema is referenced
in the root element using the schemaLocation attribute, alongside the default
namespace used for the library XML. The full XML schema is included
in the ZIP file which accompanies this article. Below are some of the
key features of the XML Schema language.
Elements can be of a simple type or complex type.
A simple type element may contain only simple character data, similar to the #PCDATA type in DTD's, for example an element declared with the built-in xs:string type.
A complex type is an element which may
contain attributes, other elements, a mixture of other elements and
character data or a custom type:
<xs:complexType>
    <!-- elements in an all group may appear 0 or 1 times in any order -->
    <xs:all>
        <!-- the minOccurs attribute effectively makes these elements mandatory -->
        <xs:element minOccurs="1" type="lib:authorsDef" name="authors"/>
        <xs:element minOccurs="1" type="lib:categoriesDef" name="categories"/>
        <xs:element minOccurs="1" type="lib:booksDef" name="books"/>
    </xs:all>
    <!-- declaration of an optional name attribute -->
    <xs:attribute name="name" type="xs:string"/>
</xs:complexType>
xPath expressions are used to define unique
key constraints. These are not limited to attribute values; they
can also be applied to element content and any other data selected by
an xPath expression. Unique keys are also applied to the id's in the
author and category lists.
Reference constraints can also be applied to
XML data. They are defined in a similar manner to the unique key
constraints and use an xPath expression to select the data that the
reference applies to. Notice how the qualified name (element name
including the namespace prefix) is used to refer to the name assigned
to the key constraint above.
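The key and reference constraints described above can be sketched as follows. The cut-down schema is written for illustration only and is not the schema from the ZIP file; the constraint names categoryKey and bookCategoryRef, and the helper functions, are hypothetical.

```php
<?php
// Hypothetical, cut-down schema: category ids form a key, and each
// book/category value must refer to one of those ids.
$xsd = <<<'XSD'
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:lib="http://www.phpbuilder.com/adam_delves/library_xml"
           targetNamespace="http://www.phpbuilder.com/adam_delves/library_xml"
           elementFormDefault="qualified">
  <xs:element name="library">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="categories">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="category" maxOccurs="unbounded">
                <xs:complexType>
                  <xs:simpleContent>
                    <xs:extension base="xs:string">
                      <xs:attribute name="id" type="xs:string" use="required"/>
                    </xs:extension>
                  </xs:simpleContent>
                </xs:complexType>
              </xs:element>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
        <xs:element name="books">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="book" minOccurs="0" maxOccurs="unbounded">
                <xs:complexType>
                  <xs:sequence>
                    <xs:element name="category" type="xs:string" maxOccurs="unbounded"/>
                  </xs:sequence>
                </xs:complexType>
              </xs:element>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
    <!-- key: every category id in the categories list must be present and unique -->
    <xs:key name="categoryKey">
      <xs:selector xpath="lib:categories/lib:category"/>
      <xs:field xpath="@id"/>
    </xs:key>
    <!-- reference: the qualified name lib:categoryKey refers to the key above -->
    <xs:keyref name="bookCategoryRef" refer="lib:categoryKey">
      <xs:selector xpath="lib:books/lib:book/lib:category"/>
      <xs:field xpath="."/>
    </xs:keyref>
  </xs:element>
</xs:schema>
XSD;

// Build a small library document whose single book refers to the given category id.
function libraryXml(string $bookCategory): string
{
    return '<library xmlns="http://www.phpbuilder.com/adam_delves/library_xml">'
         . '<categories><category id="1">Fiction</category></categories>'
         . '<books><book><category>' . $bookCategory . '</category></book></books>'
         . '</library>';
}

// Validate a document string against the schema string.
function isValid(string $xml, string $xsd): bool
{
    libxml_use_internal_errors(true); // keep validation warnings out of the output
    $doc = new DOMDocument();
    $doc->loadXML($xml);
    return $doc->schemaValidateSource($xsd);
}

var_dump(isValid(libraryXml('1'), $xsd)); // the book refers to an existing category id
var_dump(isValid(libraryXml('9'), $xsd)); // the book refers to an id that is not defined
```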
The Schema itself is not defined in
terms of its root element. In fact it can be included as part of
another schema or the schema itself can include definitions and
declarations and custom types that are defined in other schemas. This is one of
XML schema's biggest strengths.
Validating against XML Schemas in PHP
Schema validation is carried out after the XML
document has been loaded. In PHP 5, the DOMDocument object provides the schemaValidate() method. To validate the current document against
an XML schema, simply supply it with the path of the XML schema file.
For XML documents with a schema declared in the root element, it is
possible to write a small function to carry out schema validation
automatically.
PHP 5:
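class SchemaDOMDocument extends DOMDocument
{
    public function validateXMLSchemas()
    {
        $schemaLocation = $this->documentElement->getAttributeNS('http://www.w3.org/2001/XMLSchema-instance', 'schemaLocation');
        if (! $schemaLocation) {
            throw new DOMException('No schemas found');
        }
        /* the schemaLocation contains pairs of values separated by spaces. The first value in each pair
           is the namespace to be validated. The second is a URI defining the location of the schema */
        $pairs = preg_split('/\s+/', trim($schemaLocation));
        $valid = (count($pairs) % 2 == 0); // an odd number of values means an incomplete pair
        /* validate the document against each schema location in turn (every second value) */
        for ($i = 1; $valid && $i < count($pairs); $i += 2) {
            $valid = $this->schemaValidate($pairs[$i]);
        }
        if (! $valid) {
            throw new DOMException('XML Schema Validation Failure');
        }
        return true;
    }
}

$library = new SchemaDOMDocument("1.0");
$library->validateOnParse = true;
$library->load('library.xml');
$library->validateXMLSchemas();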
The above example extends the DOMDocument
class to include a validateXMLSchemas() method. This method attempts to
read the schemaLocation attribute in the root element of the XML. This
attribute contains pairs of values: the first value is the
namespace to which the schema applies and the second is the location of
the schema used to validate XML in that namespace.
The XML Schema language provides a robust way
of defining the structure of an XML document. The Web Services
Description Language (WSDL) uses XML Schema as a means of
defining the structure of SOAP messages. This article by no means
covers every aspect of the language. The full specification can be
found on the W3C website.
XML Schema is not the only XML-based
validation language. The simpler RelaxNG validation language is
also supported by the DOM extension through the relaxNGValidate() function of the DOMDocument object.
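A minimal sketch of RelaxNG validation follows. It uses relaxNGValidateSource(), which takes the schema as a string instead of a file path, so that the example is self-contained; the schema and sample documents are assumptions made for illustration.

```php
<?php
// Hypothetical RelaxNG schema: a categories element holding one or more
// category elements, each with an id attribute and text content.
$rng = <<<'RNG'
<element name="categories" xmlns="http://relaxng.org/ns/structure/1.0">
  <oneOrMore>
    <element name="category">
      <attribute name="id"/>
      <text/>
    </element>
  </oneOrMore>
</element>
RNG;

// Validate a document string against the RelaxNG schema string.
function isValidRelaxNG(string $xml, string $rng): bool
{
    libxml_use_internal_errors(true); // keep validation warnings out of the output
    $doc = new DOMDocument();
    $doc->loadXML($xml);
    return $doc->relaxNGValidateSource($rng);
}

var_dump(isValidRelaxNG('<categories><category id="1">Fiction</category></categories>', $rng));
var_dump(isValidRelaxNG('<categories/>', $rng)); // no category element present
```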
Schematron Validation
Despite its flexibility, the XML Schema
language still has its limitations. One of its main limitations is the
lack of support for document navigation. For example, there is no way
to require or forbid an element or attribute based on the value
and/or the existence of another element or attribute. It also
lacks friendly error reporting, leaving this to the
parser that validates the document. The Schematron language
fills these gaps. It is an xPath-based XML language that allows the
user to define validation assertions and can be used to obtain factual
information about the document.
The genius behind Schematron is its
implementation. Any language that provides support for XSLT can also
support Schematron validation. It works using a three tier XSLT
transformation. The Schematron schema is first transformed using a meta
style sheet (a variety of which can be downloaded from the ASCC Site). This turns the schema into an XSL file that will act as
the validation engine for the instance of XML being validated. The XML to
be validated is then transformed using the XSL validation engine. The
result of this transformation is the validation result. It contains a
list of failed assertions and reports giving information about the XML
document.
The library XML document can be further validated using a Schematron schema as follows:
The cover element is optional and is only
permitted when the hascover attribute of the book element is
set to yes. The cover element defines an alternative name for the image
file that contains the image of the book cover. If it is included when the
hascover attribute is not set to yes, validation will fail.
A book may be assigned multiple
categories or authors. However, it cannot be assigned the same author
or category more than once. Although this type of unique constraint can
be applied in the schema language, the Schematron language allows us to
produce a custom error message when a duplicate is found.
The Schematron schema that validates the library XML is as follows:
Schematron Schema - library-schematron.xml:
<?xml version="1.0" encoding="UTF-8"?>
<sch:schema xmlns:sch="http://www.ascc.net/xml/schematron">
    <!-- ensure the correct namespace is used when validating the library XML -->
    <sch:ns prefix="lib" uri="http://www.phpbuilder.com/adam_delves/library_xml"/>

    <!-- give the validation instance a title -->
    <sch:title>Library XML Contextual Validation</sch:title>

    <!-- rules are grouped in patterns; the pattern may be given an optional name -->
    <sch:pattern>
        <!-- each rule contains a list of assertions and/or reports that are applied to the selected context -->

        <!-- apply the following rules and assertions when the hascover attribute of the book element is NOT set to yes -->
        <sch:rule context="lib:library/lib:books/lib:book[@hascover!='yes']">
            <!-- an assertion is a test which, if it fails, causes validation to fail (&lt; is the escaped form of <):
                 if the hascover attribute of the book element is not yes, the number of cover elements must be zero -->
            <sch:assert test="count(lib:cover) &lt; 1">
                Book cover not expected when hascover is set to no.
            </sch:assert>
        </sch:rule>

        <!-- apply the following rules and assertions to each author element which is a child of the book element -->
        <sch:rule context="lib:library/lib:books/lib:book/lib:author">
            <sch:let name="current" value="."/>
            <sch:assert test="count(parent::node()/lib:author[text() = $current]) = 1">
                Duplicate Author: <sch:value-of select="/lib:library/lib:authors/lib:author[@id=$current]"/>
            </sch:assert>
        </sch:rule>

        <!-- apply the following rules and assertions to each category element which is a child of the book element -->
        <sch:rule context="lib:library/lib:books/lib:book/lib:category">
            <!-- the let element allows you to assign a value to a variable which can be used in xPath expressions -->
            <sch:let name="current" value="."/>
            <sch:assert test="count(parent::node()/lib:category[text() = $current]) = 1">
                <!-- use of value-of to give more information about the error using the $current variable defined above -->
                Duplicate Category: <sch:value-of select="/lib:library/lib:categories/lib:category[@id=$current]"/>
            </sch:assert>
        </sch:rule>

        <!-- apply these rules and assertions to the books element -->
        <sch:rule context="lib:books">
            <!-- unlike an assertion, a report does not cause a validation failure. The xPath expression
                 within the test attribute must evaluate to true for the report to succeed -->
            <sch:report test="lib:book">
                Library contains <sch:value-of select="count(lib:book)"/> books.
            </sch:report>
        </sch:rule>
    </sch:pattern>
</sch:schema>
Like XSL, the Schematron language uses xPath
expressions to select the rule context nodes and to carry out tests in
assertions and reports. The use of xPath allows for detailed
examination of the XML document being validated.
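To show what such an assertion test evaluates, the duplicate author check can be tried directly with PHP's DOMXPath class. This is only a sketch: the sample document is made up, and because DOMXPath has no variable binding the value of $current is spliced into the expression, which is safe here only because the ids contain no quote characters.

```php
<?php
// Hypothetical book that lists author 1 twice.
$xml = <<<'XML'
<library xmlns="http://www.phpbuilder.com/adam_delves/library_xml">
  <books>
    <book>
      <author>1</author>
      <author>2</author>
      <author>1</author>
    </book>
  </books>
</library>
XML;

$doc = new DOMDocument();
$doc->loadXML($xml);

$xpath = new DOMXPath($doc);
$xpath->registerNamespace('lib', 'http://www.phpbuilder.com/adam_delves/library_xml');

$duplicates = array();
// The rule context: every author element that is a child of a book element.
foreach ($xpath->query('/lib:library/lib:books/lib:book/lib:author') as $author) {
    $current = $author->nodeValue;
    // The assertion test, evaluated relative to the context node.
    $count = $xpath->evaluate('count(parent::node()/lib:author[text() = "' . $current . '"])', $author);
    if ($count != 1) {
        $duplicates[] = $current; // the assertion would fail for this author
    }
}
$duplicates = array_values(array_unique($duplicates));
print_r($duplicates);
```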
Validating Schematron Schemas in PHP
To validate a Schematron schema in PHP the XSL
extension is required. This enables the XSLT transformation on the
schema and the XML to be validated. If you are unfamiliar with XSLT,
read the third article in this series which gives a brief
overview of the XSL language and some examples.
To make validation of the Schematron schema
simple I have created a Schematron validation class and several
Schematron exception objects. The full source code and meta stylesheet
are included in the ZIP file that accompanies this article. The class
also includes the ability to validate the document against its DTD and
an XML Schema:
/* create a new Schematron validator using the path of the Schema */
$s = new Schematron('library-schematron.xml');
$s->XML_SCHEMA = 'library-xml.xsd'; // set the location of the XML Schema
$s->VALIDATE_DTD = true; // force DTD validation
try {
    $doc = $s->validateFile('library.xml');
} catch (SchematronValidationException $schematronValidationException) {
    /* even if validation fails the DOMDocument object of the XML being validated is still available
       through the getDoc() function of the SchematronValidationException object */
    $doc = $schematronValidationException->getDoc();
} catch (SchematronException $schematronException) {
    /* this exception is thrown if the document fails to load, or schema or DTD validation fails */
    $doc = null;
}
/* the information from reports is available in the schematronReports property of the document
   as an array of SchematronReport objects */
$reports = ($doc === null) ? array() : $doc->schematronReports;
First, an instance of the Schematron validation
object is created and initialised with the path of the Schematron
schema. The constructor of the Schematron class loads
the schema into a DOMDocument and transforms it into XSL using a
custom meta stylesheet.
PHP 5:
public function __construct($schemaPath)
{
    $this->STYLESHEET_PATH = dirname(__FILE__) . '/' . $this->STYLESHEET_PATH;
    /* load the custom Schematron meta-stylesheet (XSLT) into a DOM -
       throw an exception if it fails
    */
    $this->metaStylesheet = new DOMDocument("1.0");
    if (! $this->metaStylesheet->load($this->STYLESHEET_PATH)) {
        throw new SchematronException('Error Loading Meta-stylesheet.');
    }
    // load schema into a DOM - throw an exception if it fails
    $schema = new DOMDocument("1.0");
    if (! $schema->load($schemaPath)) {
        throw new SchematronException('Error Loading Schematron Schema');
    }
    // transform the schema into a new DOMDocument
    $validatingGenerator = new XSLTProcessor;
    $validatingGenerator->importStylesheet($this->metaStylesheet);
    if (! ($validating = $validatingGenerator->transformToDoc($schema))) {
        throw new SchematronException('Error generating validation engine.');
    }
    /* load the newly generated XSL into an XSLT processor */
    $this->validationEngine = new XSLTProcessor;
    $this->validationEngine->importStylesheet($validating);
}
The Schematron object exposes several
validation functions including validateFile(), validateXML() and validateDoc().
The validateFile() and validateXML() functions both create an instance
of a DOMDocument before calling the validateDoc() function which
carries out the actual validation:
PHP 5:
public function validateDoc(DOMDocument $doc)
{
    $schematronReports = array(); // initialise the array of reports
    if ($this->VALIDATE_DTD && (! $doc->validateOnParse)) { // only validate DTD, if it has not already been validated
        if (! $doc->validate()) {
            throw new SchematronException('DTD Validation Failure');
        }
    }
    /* validate against an XML schema only if present */
    if (! is_null($this->XML_SCHEMA)) {
        if (! $doc->schemaValidate($this->XML_SCHEMA)) {
            throw new SchematronException('XML Schema Validation Failure');
        }
    }
    /* transform the XML i.e: validate it - if an error occurs during validation
       throw an exception. N.b: this is not a Schematron assertion */
    if (! ($newDoc = $this->validationEngine->transformToDoc($doc))) {
        throw new SchematronException('Error validating XML.');
    }
    $asserts = $newDoc->getElementsByTagName('failedAssert'); // get a list of failed assertions
    $reports = $newDoc->getElementsByTagName('reportFact'); // get a list of reports
    if ($reports->length > 0) {
        /* add each report to the reports array */
        foreach ($reports as $report) {
            $location = $report->firstChild->nodeValue;
            $description = $report->childNodes->item(1)->nodeValue;
            /* each report is a SchematronReport object */
            $schematronReports[] = new SchematronReport($description, $location);
        }
    }
    $doc->schematronReports = $schematronReports; // add the reports to the DOMDocument object
    if ($asserts->length == 0) { // validation succeeded
        return $doc;
    } else { // validation failed
        /* initialise the array of assertions */
        $assertArray = array();
        foreach ($asserts as $assert) {
            /* assumption: failedAssert elements use the same child layout as the reportFact elements above */
            $location = $assert->firstChild->nodeValue;
            $description = $assert->childNodes->item(1)->nodeValue;
            /* if the SHOW_WARNINGS property is set to true, trigger a warning containing assertion information */
            if ($this->SHOW_WARNINGS) {
                $msg = "$description ($location)";
                trigger_error("Schematron Validation Error: $msg", E_USER_WARNING);
            }
            /* load each assertion in to a SchematronAssertion object */
            $assertArray[] = new SchematronAssertion($description, $location);
        }
        /* throw a validation exception */
        throw new SchematronValidationException($doc, $assertArray);
    }
}
If the Schematron validation produces any
failed assertions, a SchematronValidationException is thrown. This can
then be caught and, as demonstrated in the output, traversed like an
array in a foreach construct. Each assertion is loaded into a
SchematronAssertion object that contains the message and the location
in the document that caused the assertion to fail.
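One way an exception can be made traversable in a foreach construct is to implement the IteratorAggregate interface. The sketch below illustrates that technique only; the class name ExampleValidationException is hypothetical and this is not the source of the SchematronValidationException class, which is in the ZIP file.

```php
<?php
// Hypothetical exception that can be looped over like an array of assertions.
class ExampleValidationException extends Exception implements IteratorAggregate
{
    private $assertions;

    public function __construct(array $assertions)
    {
        parent::__construct('Validation failed');
        $this->assertions = $assertions;
    }

    // foreach calls this method to obtain the items to iterate over.
    public function getIterator(): Traversable
    {
        return new ArrayIterator($this->assertions);
    }
}

$messages = array();
try {
    throw new ExampleValidationException(array('Duplicate Author', 'Duplicate Category'));
} catch (ExampleValidationException $e) {
    foreach ($e as $assertion) {
        $messages[] = $assertion;
        echo $assertion, "\n";
    }
}
```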
Conclusion
Validating data is crucial in any application
that handles data from an untrusted source, especially
when that data comes from an external source. The DTD, XML Schema and
Schematron languages all define standards that enable application
independent validation of data, while preserving the portability and
extensibility of the document. This article has shown you some of the
methods available to you in PHP 5 that enable you to validate XML data
using these standards, and demonstrated how to create a class which
encapsulates DTD, XML Schema and Schematron validation to ensure that
the XML document conforms to both structural and business rules.
Validating XML is however a resource intensive
process. The guidelines below should be followed to maximise the
performance of your application when using validation:
Only validate XML from external sources (i.e. data from an
untrusted third party or data which is editable by others). There is no
need to validate XML generated by your application or any other
application you use which produces valid documents that are not sent
over the Internet.
Once you have validated an XML document, save a copy of the
validated document in cache. Ensure this copy is obtained from the
saveXML() method of the DOMDocument object, as this will contain
the entity replacements from the DTD validation. Only revalidate the
document if it has been changed.
Save copies of DTD's and XML schemas on the same file system as
the application. By all means, provide a public copy of the validation
documents, but always use local copies in your application. Using local
copies of validation documents also increases security, as obtaining
them from an external/public resource means you have no control over
any changes made.
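The caching guideline above can be sketched as follows. The function name loadWithCache(), the use of file modification times to detect changes and the validation callback are all assumptions made for illustration; any of the validation methods shown in this article could be supplied as the callback.

```php
<?php
// Hypothetical helper: validate a document once, then serve the cached copy
// until the source file changes.
function loadWithCache(string $source, string $cache, callable $validate): DOMDocument
{
    clearstatcache(); // make sure file times are read fresh
    $doc = new DOMDocument();

    // Reuse the cached copy while it is at least as new as the source file.
    if (is_file($cache) && filemtime($cache) >= filemtime($source)) {
        $doc->load($cache);
        return $doc;
    }

    $doc->load($source);
    if (! $validate($doc)) {
        throw new RuntimeException('Validation failure: ' . $source);
    }

    // Save the validated document as produced by saveXML().
    file_put_contents($cache, $doc->saveXML());
    return $doc;
}

// Usage with temporary files and a stand-in validation callback that counts its calls.
$source = tempnam(sys_get_temp_dir(), 'lib');
$cache  = sys_get_temp_dir() . '/' . uniqid('libcache', true) . '.xml';
file_put_contents($source, '<library><books/></library>');

$validations = 0;
$validate = function (DOMDocument $doc) use (&$validations) {
    $validations++;
    return $doc->documentElement->nodeName === 'library';
};

$first  = loadWithCache($source, $cache, $validate); // validates and writes the cache
$second = loadWithCache($source, $cache, $validate); // served from the cache

unlink($source);
unlink($cache);
```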
In the final installment of this series, I will
be showing you how XML fits in with databases, the tools database
management systems provide for XML, where and when to use it and the
pros and cons of native XML databases.