Chemical and biosciences data needs to be represented in a structured and unambiguous way so that computers can store it, read it back later, and attach meaning to it. In the past many different structures were invented to do this, including comma-separated values (CSV) files, binary files, and relational database schemas. One of the issues was that the formats were proprietary, which led to incompatibility between software systems. Extensible Markup Language (XML) was created to overcome this problem. The authority for XML is the World Wide Web Consortium (W3C)1, which has various working groups and makes specifications available at its web site.
Chemical Markup Language (CML)
One type of XML that may form the basis for much work in chemistry and
the biosciences is Chemical Markup Language
(CML)2.
This is a variety of XML designed to represent molecular information.
The current version of CML is 2.1.1.
If you drag your mouse around the 3D representation it will rotate, which helps in visualizing the molecule. The raw XML is shown in the CML Source window. Let's start by discussing a simpler XML document and get back to this example later.
Here is an example from the CML Schema. It begins with
<cml>. This is the top-level
tag. The next tag is <molecule id="m1">.
If I hadn't used CML before I wouldn't know what this
meant, although I can easily guess because the CML authors gave it a
meaningful name. How would I find out for sure? I would go
to the XML Schema and check out the documentation for it. The CML
Schema reference says that the <molecule>
tag is "a container for atoms, bonds and submolecules."
The 'id' attribute is used as a unique identifier so that the molecule
can be referred to from elsewhere. Similarly, the <atomArray>
is "a container for a
list of atoms." The tag <atom elementType="N"/>
specifies a nitrogen atom and the tag <atom
elementType="O"/> an oxygen atom.
The tag </atomArray> closes the <atomArray>
element. In XML every opening tag must be closed with a matching </...>
tag for the document to be well formed.
Also, elements must be fully enclosed within other elements
and cannot overlap. For example, <a><b></b></a>
is well-formed XML but <a><b></a></b>
is not. If an element has nothing inside it, the
shorthand <.../>
can be used to both open and close the empty element.
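The nesting and closing rules can be checked mechanically. The following is a minimal sketch using the XML parser in Python's standard library; the helper name is_well_formed is my own, not part of CML or of any library, and Python is used only for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if `text` parses as well-formed XML, False otherwise.

    Hypothetical helper for illustration: it only checks well-formedness,
    not validity against a DTD or an XML Schema.
    """
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Properly nested: <b> is opened and closed inside <a>.
nested_ok = is_well_formed("<a><b></b></a>")

# Overlapping: <a> is closed while <b> is still open.
overlapping_ok = is_well_formed("<a><b></a></b>")

# The empty-element shorthand parses to the same element as an
# explicit opening and closing pair.
shorthand = ET.fromstring('<atom elementType="N"/>')
long_form = ET.fromstring('<atom elementType="N"></atom>')
same_element = (shorthand.tag == long_form.tag
                and shorthand.attrib == long_form.attrib)

print(nested_ok, overlapping_ok, same_element)
```

A parser that rejects the overlapping case is doing exactly what the rule above describes: it refuses to build a tree because the tags do not nest.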
<?xml version="1.0"?>
This line, the XML declaration, appears at the top of most XML documents. It uses the same <?
... ?> delimiters as a processing instruction.
The line also identifies
the document as XML version 1.0. The next line
<!DOCTYPE cml SYSTEM
"http://www.xml-cml.org/dtd/cml1_0_1.dtd">
identifies the particular type of XML. In this case it is CML. It
points to the document type definition (DTD), which defines this
version of CML. The next line
<cml title="ethanol" id="cml_ethanol_karne"
xmlns="x-schema:cml_schema_ie_02.xml">
We saw the cml tag before. This time it has a title, which is
used for documentation
purposes. It also has an id to uniquely identify it. The CML
from the Adobe demo has a lot of information. In
addition
to atom array information there is also bond, stereochemistry, and
spectra information.
During the change from CML1 to CML2, Chemical Markup Language made the transition from a DTD to an XML Schema4. These days XML Schema is preferred over DTDs. XML Schema is a more powerful way of defining the structure of XML documents and it allows multiple XML languages to be mixed together. With DTDs there is no way to make use of other schemas when creating a new schema. In other words, you could only use one DTD at a time, and if you wanted to use a type from somewhere else you would have to cut and paste it.
Let's look at a CML2 document.
In the second line
<molecule convention="MDLMol" id="arginine"
title="ARGININE" xmlns='http://www.xml-cml.org/schema'>
the xmlns
(XML namespace) attribute places the CML tags in a
namespace, distinct from tags in some other namespace that might
happen to have the same tag names. This type of definition is
called a default namespace. There can be only one default namespace
in effect at a particular location in a document, and it is given by
the value of the xmlns attribute. In this case it is
'http://www.xml-cml.org/schema'. In the XML Schema for
CML this is the target namespace. Let's look at an example with
multiple namespaces and namespace qualifiers.
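As a minimal sketch of such a mixed document, the Python snippet below holds one in a string and lists which namespace each element belongs to. The stylesheet content, the molecule id, and the helper split_name are assumptions made for illustration; only the two namespace URIs are real.

```python
import xml.etree.ElementTree as ET

# Hypothetical document mixing XSL and CML elements. The structure is
# invented for illustration; the namespace URIs are the real ones.
doc = """<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:cml="http://www.xml-cml.org/schema">
  <xsl:template match="/">
    <cml:molecule id="m1">
      <cml:atomArray>
        <cml:atom elementType="N"/>
      </cml:atomArray>
    </cml:molecule>
  </xsl:template>
</xsl:stylesheet>
"""


def split_name(tag):
    """Split ElementTree's "{uri}local" tag form into (uri, local).

    Assumes the element is in a namespace, which is true for every
    element in the document above.
    """
    uri, _, local = tag[1:].partition("}")
    return uri, local


names = [split_name(element.tag) for element in ET.fromstring(doc).iter()]
for uri, local in names:
    print(f"{local:12} {uri}")
```

The parser resolves each qualifier to its URI, so the listing pairs the stylesheet and template elements with the XSL namespace and the molecule, atomArray, and atom elements with the CML namespace.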
In this case I am using the same CML schema but I have put the
qualifier
'cml' on the CML schema tags to separate them from the other
tags. Those other tags have the qualifier 'xsl', for eXtensible
Stylesheet Language (XSL). Thus the use of qualifiers enables
tags from different XML variants to be mixed together. The names
of the qualifiers 'cml' and 'xsl'
can be changed arbitrarily. I could have used 'blort' instead of
'cml', provided I used the correct namespace uniform resource identifier
(URI),
http://www.xml-cml.org/schema, to identify it as CML. eXtensible
Stylesheet Language is a language for transforming XML into something
else (that something else could be some more XML). I will discuss
it below.
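The claim that the qualifier is arbitrary can be checked with a short sketch. It parses the same hypothetical one-element document written three ways (qualifier 'cml', qualifier 'blort', and a default namespace) plus one version with no namespace at all, and compares the names the parser reports. The documents are invented for illustration.

```python
import xml.etree.ElementTree as ET

CML_NS = "http://www.xml-cml.org/schema"

# Four hypothetical spellings of a single empty molecule element.
variants = {
    "prefix cml": f'<cml:molecule xmlns:cml="{CML_NS}" id="m1"/>',
    "prefix blort": f'<blort:molecule xmlns:blort="{CML_NS}" id="m1"/>',
    "default namespace": f'<molecule xmlns="{CML_NS}" id="m1"/>',
    "no namespace": '<molecule id="m1"/>',
}

# ElementTree reports each name as "{namespace-uri}local-name",
# so the qualifier itself never appears in the result.
tags = {label: ET.fromstring(text).tag for label, text in variants.items()}

for label, tag in tags.items():
    print(f"{label:18} {tag}")
```

The first three spellings give the same name because only the URI matters. The last one gives a bare molecule, which a namespace-aware program would not treat as CML.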
How do I know what can go inside a molecule tag? I could look
up the documentation and read it or I could look up the
schema itself. For molecule I would find this
The meaning of this is: a molecule is a complex type, which is a sequence of ... atom array, bond array, and .... You can see that XML Schema is itself a kind of XML and is somewhat self-describing, although verbose. The attributes of molecule (convention, id, and title) are defined in a similar way.
In the text above I focussed on reading and understanding XML documents. A bioscience or health care subject matter expert may be interviewed by a software engineer who is doing the data modelling needed to formulate a new XML language. Or you may read journal articles that refer to XML. In these cases a little background in reading XML documents won't hurt. However, if you do any work in bioinformatics you will likely have to write programs or scripts to process XML. To actually do something with XML programmatically you will need to parse it to extract data from it.
Many programming languages have XML parsing built into them.
Java 5.0 has XML parsing libraries included as part of the platform5. The
Eclipse Modeling Framework6 extends this by providing
wizards for building object models and XML editors from XML Schema.
For C++ the Xerces C++ project provides an open
source parser7.
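To give a feel for what parsing to extract data looks like, here is a minimal sketch that counts atoms by element type. It uses Python's standard library rather than the Java or C++ parsers mentioned above, purely for brevity. The CML2-style fragment is hypothetical and does not describe a real molecule.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Maps a qualifier used in the search path below to the CML namespace URI.
# The qualifier "cml" is my own choice; only the URI matters.
NAMESPACES = {"cml": "http://www.xml-cml.org/schema"}

# Hypothetical CML2-style fragment, invented for illustration.
doc = """<?xml version="1.0"?>
<molecule id="m1" title="example" xmlns="http://www.xml-cml.org/schema">
  <atomArray>
    <atom id="a1" elementType="N"/>
    <atom id="a2" elementType="O"/>
    <atom id="a3" elementType="O"/>
  </atomArray>
</molecule>
"""

root = ET.fromstring(doc)

# Find every atom inside the atomArray and read its elementType attribute.
atoms = root.findall("cml:atomArray/cml:atom", NAMESPACES)
counts = Counter(atom.get("elementType") for atom in atoms)

print(root.get("title"), dict(counts))
```

The search path has to be namespace qualified because the document declares a default namespace. A search for plain atomArray/atom would find nothing.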
You may also need to create web services. A web service is a software system that communicates with other computers over the internet using standard protocols. The W3C Web Services Activity8 and the Web Services Project @ Apache9 sites are useful references for this.