Introduction to Extensible Markup Language for Chemistry and Biosciences

Alex Amies   March 2, 2006

Chemical and biosciences data needs to be represented in a structured and unambiguous way so that computers can store it, read it back later, and attach meaning to it.  In the past many different structures were invented to do this, including comma-separated value (CSV) files, binary files, and relational database schemas.  One of the issues was that many of these formats were proprietary, which led to incompatibility between software systems.  Extensible Markup Language (XML) was created to overcome this problem.  The authority for XML is the World Wide Web Consortium (W3C) [1], which has various working groups and makes specifications available at its web site.

Chemical Markup Language (CML)

One type of XML that may form the basis for much work in chemistry and the biosciences is Chemical Markup Language (CML) [2].  This is a variety of XML designed to represent molecular information.  At the time of writing the current version of CML is 2.1.1.

What would you do with CML?  Many things.  For example, you could use it to store chemical formulas and then display the molecules in graphical formats that are meaningful to people.  To see an example of this go to the CML demo at Adobe [3] (you will need to use Internet Explorer).  Here is a screenshot:

Adobe CML Viewer

If you drag your mouse around the 3D representation it will rotate, which helps in visualizing the molecule.  The CML Source window shows the raw XML.  Let's start by discussing a simpler XML document and come back to this example later.

Reading XML Documents

Here is an example from the CML Schema:

<cml>
  <molecule id="m1">
    <atomArray>
      <atom elementType="N"/>
      <atom elementType="O"/>
    </atomArray>
  </molecule>
</cml>


The first tag is <cml>. This is the top level tag.  The next tag is <molecule id="m1">.  If I hadn't used CML before I wouldn't know what this meant, although I can easily guess because the CML authors gave it a meaningful name.  How would I find out for sure?  I would go to the XML Schema and check out the documentation for it.  The CML Schema reference says that the <molecule> tag is "a container for atoms, bonds and submolecules."  The 'id' attribute is used as a unique identifier so that the molecule can be referred to from elsewhere.  Similarly, the <atomArray> is "a container for a list of atoms."  The tag <atom elementType="N"/> specifies a nitrogen atom and the tag <atom elementType="O"/> an oxygen atom.

The tag </atomArray> closes the <atomArray> element.  In XML every opening tag must be closed with a matching </...> tag for the document to be well formed.  Also, tags must be fully enclosed within other tags and cannot overlap.  For example, <a><b></b></a> is well-formed XML but <a><b></a></b> is not.  If a tag does not have anything inside it then the shorthand <.../> can be used to both open and close the empty tag.
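
The nesting rules are easy to try out with a parser.  Here is a minimal sketch, assuming Python and its standard xml.etree.ElementTree module (my choice for illustration; any XML parser will reject the overlapping example in the same way).  The helper name is_well_formed is made up for this example.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if text parses as well-formed XML, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        # The parser raises ParseError for unclosed or overlapping tags.
        return False


print(is_well_formed("<a><b></b></a>"))  # properly nested tags
print(is_well_formed("<a><b></a></b>"))  # overlapping tags
print(is_well_formed("<a><b/></a>"))     # empty-tag shorthand
```

The first and third documents are accepted and the second is rejected.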

XML Documents

The current versions of XML are 1.0 and 1.1.  An XML 1.1 parser (a parser is a program that reads XML) must also accept XML 1.0 documents, but the reverse is not guaranteed, and most documents, including the ones here, use 1.0.  Let's go over some of the basic meanings of the Adobe ethanol CML document from above:

<?xml version="1.0"?>

This line, the XML declaration, appears at the top of most XML documents (it is optional in XML 1.0).  It uses the same <? ... ?> syntax as a processing instruction.  The line also identifies the document as XML version 1.0.  The next line

<!DOCTYPE cml SYSTEM "http://www.xml-cml.org/dtd/cml1_0_1.dtd">

identifies the particular type of XML.  In this case it is CML.  It points to the document type definition (DTD), which, judging by the file name, defines CML version 1.0.1.  The next line

<cml title="ethanol" id="cml_ethanol_karne" xmlns="x-schema:cml_schema_ie_02.xml">

We saw the cml tag before.  This time it has a title, which is used for documentation purposes, and an id to uniquely identify it.  The CML from the Adobe demo has a lot of information.  In addition to atom array information there is also bond, stereochemistry, and spectra information.

XML Schema

During the change from CML1 to CML2, Chemical Markup Language made the transition from a DTD to an XML Schema [4].  These days XML Schema is preferred over DTDs.  XML Schema is a more powerful way of defining the structure of XML documents and it allows multiple XML languages to be mixed together.  With DTDs there is no straightforward way to make use of other schemas when creating a new schema.  In other words, you could in practice only use one DTD at a time, and if you wanted to use a type from somewhere else you would have to cut and paste it.

Let's look at a CML2 document.

<?xml version="1.0" encoding="UTF-8" ?>
<molecule convention="MDLMol" id="arginine" title="ARGININE"
          xmlns='http://www.xml-cml.org/schema'>
  <atomArray>
    <atom id="a1" elementType="C" hydrogenCount="0" x2="0.7386" y2="0.1493"/>
    <atom id="a2" elementType="C" hydrogenCount="0" x2="-0.3772" y2="-0.6129"/>
    <atom id="a3" elementType="C" hydrogenCount="0" x2="2.3376" y2="-0.6129"/>
    <atom id="a4" elementType="C" hydrogenCount="0" x2="-1.5008" y2="0.1493"/>
    <atom id="a5" elementType="C" hydrogenCount="0" x2="3.8344" y2="0.1493"/>
    <atom id="a6" elementType="C" hydrogenCount="0" x2="2.3376" y2="-2.4004"/>
    <atom id="a7" elementType="N" hydrogenCount="0" x2="-2.5222" y2="-0.6129"/>
    <atom id="a8" elementType="O" hydrogenCount="0" x2="4.7891" y2="-0.4439"/>
    <atom id="a9" elementType="O" hydrogenCount="0" x2="3.8344" y2="1.6776"/>
    <atom id="a10" elementType="C" hydrogenCount="0" x2="-3.4376" y2="0.4047"/>
    <atom id="a11" elementType="N" hydrogenCount="0" x2="-4.6634" y2="-0.6129"/>
    <atom id="a12" elementType="N" hydrogenCount="0" x2="-3.4376" y2="2.5301"/>
    <atom id="a13" elementType="H" hydrogenCount="0" x2="2.8837" y2="-1.21"/>
  </atomArray>
  <bondArray>
    <bond atomRefs2="a1 a2" order="1"/>
    <bond atomRefs2="a1 a3" order="1"/>
    <bond atomRefs2="a2 a4" order="1"/>
    <bond atomRefs2="a3 a5" order="1"/>
    <bond atomRefs2="a3 a6" order="1"><stereo>W</stereo></bond>
    <bond atomRefs2="a4 a7" order="1"/>
    <bond atomRefs2="a5 a8" order="1"/>
    <bond atomRefs2="a5 a9" order="2"/>
    <bond atomRefs2="a7 a10" order="1"/>
    <bond atomRefs2="a10 a11" order="1"/>
    <bond atomRefs2="a10 a12" order="1"/>
    <bond atomRefs2="a3 a13" order="1"><stereo>H</stereo></bond>
  </bondArray>
</molecule>

In the second line

<molecule convention="MDLMol" id="arginine" title="ARGININE" xmlns='http://www.xml-cml.org/schema'>

the xmlns (XML namespace) attribute places the CML tags in a namespace, which keeps them distinct from tags in some other namespace that might happen to have the same names.  This type of declaration is called a default namespace.  There can be only one default namespace in effect at a particular location in a document, and it is given by the value of the xmlns attribute.  In this case it is 'http://www.xml-cml.org/schema'.  In the XML Schema for CML this is the target namespace.  Let's look at an example with multiple namespaces and namespace qualifiers.

 
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:cml="http://www.xml-cml.org/schema">
  <xsl:template match="/">
    <cml:molecule id="a1" ref="" role="" title="Methane">
      <cml:atomArray>
        <cml:atom id="a1" elementType="C" dictRef="" hydrogenCount="4"/>
      </cml:atomArray>
    </cml:molecule>
  </xsl:template>
</xsl:stylesheet>

In this case I am using the same CML schema but I have put the qualifier (prefix) 'cml' on the CML tags to separate them from the other tags.  Those other tags have the qualifier 'xsl', for eXtensible Stylesheet Language (XSL).  Thus the use of qualifiers enables tags from different XML variants to be mixed together.  The names of the qualifiers 'cml' and 'xsl' can be chosen arbitrarily.  I could have used 'blort' instead of 'cml', provided I used the correct namespace Uniform Resource Identifier (URI), http://www.xml-cml.org/schema, to identify it as CML.  eXtensible Stylesheet Language is a language for transforming XML to something else (that something else could be some more XML).  I will discuss it below.
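
To check that the qualifier really is arbitrary, here is a small sketch, again assuming Python's standard xml.etree.ElementTree module.  The helper names methane and element_names are made up for this example, and the methane fragment is trimmed down from the stylesheet above.  The parser reports each tag together with its namespace URI, so the same names come back whether the tags are written with 'cml', with 'blort', or with a default namespace.

```python
import xml.etree.ElementTree as ET

CML_NS = "http://www.xml-cml.org/schema"


def methane(prefix):
    """Build the methane fragment with a given qualifier ('' = default namespace)."""
    p = prefix + ":" if prefix else ""
    decl = "xmlns:" + prefix if prefix else "xmlns"
    return (
        f'<{p}molecule {decl}="{CML_NS}" title="Methane">'
        f'<{p}atomArray>'
        f'<{p}atom id="a1" elementType="C" hydrogenCount="4"/>'
        f'</{p}atomArray>'
        f'</{p}molecule>'
    )


def element_names(text):
    """Return the namespace-qualified tag names in document order."""
    return [element.tag for element in ET.fromstring(text).iter()]


names = [element_names(methane(prefix)) for prefix in ("cml", "blort", "")]
print(names[0][0])                       # {http://www.xml-cml.org/schema}molecule
print(names[0] == names[1] == names[2])  # True
```

What identifies a tag as CML is the URI, not the qualifier that happens to be written in front of it.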

How do I know what can go inside a molecule tag?  I could look up the documentation and read it, or I could look up the schema itself.  For molecule I would find this:

  <xsd:element name="molecule" id="el.molecule"> 
    <xsd:complexType>
      <xsd:sequence>
      ...
        <xsd:element ref="atomArray"/>
        <xsd:element ref="bondArray" minOccurs="0"/>
      ...
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>

The meaning of this is: a molecule is a complex type, which is a sequence of ... atom array, bond array, and ....  The minOccurs="0" on bondArray means that the bond array is optional.  You can see that XML Schema is a kind of XML itself and is somewhat self-describing, although verbose.  The attributes of molecule (convention, id, and title) are defined in a similar way.
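
Because a schema is itself XML, an ordinary XML parser can read it.  The sketch below, assuming Python's standard xml.etree.ElementTree module, lists what may go inside molecule.  Several things here are my assumptions rather than part of the real CML schema: the fragment is simplified (the parts elided with "..." above are just left out), I have wrapped it in an xsd:schema element, I assume the xsd qualifier is bound to the standard W3C XML Schema namespace, and the helper name describe_children is made up.

```python
import xml.etree.ElementTree as ET

XSD_NS = "http://www.w3.org/2001/XMLSchema"

# Simplified fragment: the elided "..." parts of the real schema are omitted.
schema_fragment = """<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="molecule" id="el.molecule">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element ref="atomArray"/>
        <xsd:element ref="bondArray" minOccurs="0"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>"""


def describe_children(schema_text, element_name):
    """List (child name, required?) pairs for the named element's sequence."""
    ns = {"xsd": XSD_NS}
    root = ET.fromstring(schema_text)
    path = (
        f"xsd:element[@name='{element_name}']"
        "/xsd:complexType/xsd:sequence/xsd:element"
    )
    children = []
    for child in root.findall(path, ns):
        # In XML Schema, minOccurs defaults to 1 when it is not given.
        required = child.get("minOccurs", "1") != "0"
        children.append((child.get("ref"), required))
    return children


print(describe_children(schema_fragment, "molecule"))
```

For this fragment the result says that atomArray is required and bondArray is not.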

How do I get Data out of XML Documents?

In the text above I focussed on reading and understanding XML documents.  A bioscience or health care subject matter expert may be interviewed by a software engineer who is doing the data modelling for a new XML language.  Or you may read journal articles that refer to XML.  In these cases a little background in reading XML documents won't hurt.  However, if you do any work in bioinformatics you will likely have to write programs or scripts to process XML.  To actually do something with XML programmatically you will need to parse it to extract data from it.

Many programming languages have XML parsing built into them.  Java 5.0 has XML parsing libraries included as part of the platform [5].  The Eclipse Modeling Framework [6] extends this with wizards for building object models and XML editors from XML Schema.  For C++ the Xerces C++ project provides an open source parser [7].
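
Python is another option, since its standard library includes the xml.etree.ElementTree parser.  As a rough sketch of what extracting data looks like, the script below reads a shortened version of the arginine document from earlier (three atoms and two bonds, copied from that listing) and pulls out the atoms and bonds.  The function name read_molecule and the tuple layout are my own choices for illustration, and the code assumes numeric bond orders as in that listing.

```python
import xml.etree.ElementTree as ET

# Map a qualifier of our choosing onto the CML namespace URI for lookups.
NS = {"cml": "http://www.xml-cml.org/schema"}

# Shortened copy of the arginine document: atoms a3, a5 and a9 only.
doc = """<molecule convention="MDLMol" id="arginine" title="ARGININE"
          xmlns='http://www.xml-cml.org/schema'>
  <atomArray>
    <atom id="a3" elementType="C" hydrogenCount="0" x2="2.3376" y2="-0.6129"/>
    <atom id="a5" elementType="C" hydrogenCount="0" x2="3.8344" y2="0.1493"/>
    <atom id="a9" elementType="O" hydrogenCount="0" x2="3.8344" y2="1.6776"/>
  </atomArray>
  <bondArray>
    <bond atomRefs2="a3 a5" order="1"/>
    <bond atomRefs2="a5 a9" order="2"/>
  </bondArray>
</molecule>"""


def read_molecule(text):
    """Return (title, atoms keyed by id, list of bonds) from a CML molecule."""
    root = ET.fromstring(text)
    atoms = {}
    for atom in root.findall("cml:atomArray/cml:atom", NS):
        atoms[atom.get("id")] = (
            atom.get("elementType"),
            float(atom.get("x2")),
            float(atom.get("y2")),
        )
    bonds = []
    for bond in root.findall("cml:bondArray/cml:bond", NS):
        # atomRefs2 holds the ids of the two bonded atoms, separated by a space.
        first, second = bond.get("atomRefs2").split()
        # Assumes a numeric bond order, as in the listing above.
        bonds.append((first, second, int(bond.get("order"))))
    return root.get("title"), atoms, bonds


title, atoms, bonds = read_molecule(doc)
print(title)
for first, second, order in bonds:
    print(atoms[first][0], atoms[second][0], order)
```

Note that the lookups have to name the namespace, because the document puts every tag in the CML default namespace.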

You may also need to create web services.  Web services are a type of protocol for communication between computers over the internet.  The W3C Web Services Activity [8] and the Web Services Project @ Apache [9] sites are useful references for this.


Please send me ideas and opinions by email at webmaster@medicalcomputing.net or add comments to my blog.  The content may become part of the web site.

© 2006 Alex Amies