Introduction to Extensible Markup Language for Chemistry and
Biosciences (Continued)
Alex Amies March 2, 2006
Transforming XML into other formats
Chemical Markup Language and other varieties of XML may need to be
transformed into different formats. Examples include HTML for display
in web browsers, PDF, Structured Query Language (SQL) database
statements, and other varieties of XML. For example, we may want to
transform CML v1 into CML v2. The Extensible Stylesheet Language (XSL)
may be used to do this. XSL Transformations (XSLT) is the part of XSL
that describes transformations, and an XSLT processor is a tool that
takes an XSLT style sheet and uses it to transform an XML document into
something else. The style sheet specifies the rules for doing the
transformation.
Let's look at an example XSLT style sheet. This is a simple example
that outputs an HTML summary of a CML1 file. The file itself is
cml1toxhtml.xslt.
 1  <?xml version="1.0"?>
 2  <!-- A style sheet for transforming Chemical Markup Language v1 to HTML -->
 3  <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
 4
 5    <xsl:output method="html" indent="yes"/>
 6
 7    <!-- Template rule outputs the basic outline for a HTML page. -->
 8    <xsl:template match="/">
 9      <html>
10        <head>
11          <title>Extract from a Chemical Markup Language file</title>
12        </head>
13        <body>
14          <h1>Extract from a Chemical Markup Language File</h1>
15          <xsl:apply-templates/>
16        </body>
17      </html>
18    </xsl:template>
19
20    <!-- HTML fragment for a CML molecule tag -->
21    <xsl:template match="molecule">
22      <h2>Molecule <xsl:value-of select="@title"/></h2>
23      <xsl:apply-templates select="atomArray"/>
24    </xsl:template>
25
26    <!-- HTML fragment for a CML atomArray tag -->
27    <xsl:template match="atomArray">
28      <h3>Atom Array</h3>
29      <xsl:apply-templates/>
30    </xsl:template>
31
32    <!-- HTML fragment for a CML atom tag -->
33    <xsl:template match="atom">
34      <br/>atom id: <xsl:value-of select="@id"/>, <xsl:value-of select="string"/>
35    </xsl:template>
36
37  </xsl:stylesheet>
Line 1 is the XML declaration, which lets the processor know that this
is an XML file. Line 3 is the top-level stylesheet tag that encloses
the whole style sheet. It includes a namespace attribute. The mix of
XSLT tags, XML tags, and HTML tags here illustrates the importance of
namespaces for differentiating the different varieties of markup.
Line 5 is an output instruction that tells the processor that the
output will be an HTML file and asks for it to be indented to make it
easier for humans to read.
Line 8 starts an XSL template block, the top-level template that
outputs the outline of the HTML file. The HTML outline is given in
lines 9 through 17. Line 15 tells the processor to look for other
rules to apply. The next rule to be found by the XSLT processor will
be the template for molecule on line 21. This outputs a level 2 HTML
heading tag, h2, with the text 'Molecule' and the value of the title
attribute of the molecule. Line 23 then applies templates to the
atomArray children of the molecule only. The block in lines 27 through
30 matches an atomArray tag. It outputs a level 3 HTML heading tag, h3,
and then tells the processor to apply the remaining templates. The
remaining template on lines 33 through 35 matches an atom tag. It
outputs the atom id and the element type given in the string tag. This
is not a comprehensive XSLT style sheet and many tags are ignored.
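The rules described above can be tried out without any files. The following is a minimal sketch and is not part of the article's download: the class name TemplateRuleDemo, its transform helper, and the two-atom DEMO molecule are made up for illustration. It keeps only the molecule, atomArray, and atom rules and feeds them an in-memory fragment.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

/** Hypothetical demo class; not part of the article's download. */
class TemplateRuleDemo {

    // The molecule, atomArray, and atom rules from cml1toxhtml.xslt,
    // without the rule that outputs the page outline.
    static final String XSLT =
        "<xsl:stylesheet xmlns:xsl='http://www.w3.org/1999/XSL/Transform' version='1.0'>"
      + "<xsl:output method='html'/>"
      + "<xsl:template match='molecule'>"
      + "<h2>Molecule <xsl:value-of select='@title'/></h2>"
      + "<xsl:apply-templates select='atomArray'/>"
      + "</xsl:template>"
      + "<xsl:template match='atomArray'>"
      + "<h3>Atom Array</h3><xsl:apply-templates/>"
      + "</xsl:template>"
      + "<xsl:template match='atom'>"
      + "<br/>atom id: <xsl:value-of select='@id'/>, <xsl:value-of select='string'/>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // A made-up two-atom fragment in the style of a CML v1 file.
    static final String CML =
        "<molecule title='DEMO'>"
      + "<atomArray>"
      + "<atom id='a1'><string builtin='elementType'>C</string>"
      + "<float builtin='x2'>0.7386</float></atom>"
      + "<atom id='a2'><string builtin='elementType'>N</string></atom>"
      + "</atomArray>"
      + "</molecule>";

    /** Applies the style sheet text to the XML text and returns the result as a string. */
    static String transform(String xslt, String xml) throws TransformerException {
        Transformer transformer = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(xslt)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws TransformerException {
        System.out.println(transform(XSLT, CML));
    }
}
```

Running it should print an h2 heading, an h3 heading, and one br line per atom. The coordinate in the float tag does not appear in the output, because the atom rule outputs only the id attribute and the string child.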
Let's use this stylesheet to transform the CML v1 file arginine.cml. The first few lines of
the file are shown here.
<molecule convention="MDLMol" id="arginine" title="ARGININE">
  <date day="22" month="11" year="1995">
  </date>
  <atomArray>
    <atom id="a1">
      <string builtin="elementType">C</string>
      <float builtin="x2">0.7386</float>
      <float builtin="y2">0.1493</float>
    </atom>
    ...
  </atomArray>
  ...
</molecule>
Now we need an XSLT processor. The Sun Java 5.0 platform includes XSL
processing features [5]. A number of other vendors and open source
projects have XSLT processors, including Microsoft. I am using a Java
XSLT processor but you may use any other type with the XSL style sheet
above. The Java file is XmlTransformer.java.
 1  import java.io.File;
 2  import javax.xml.transform.Result;
 3  import javax.xml.transform.Transformer;
 4  import javax.xml.transform.TransformerConfigurationException;
 5  import javax.xml.transform.TransformerException;
 6  import javax.xml.transform.TransformerFactory;
 7  import javax.xml.transform.stream.StreamResult;
 8  import javax.xml.transform.stream.StreamSource;
 9
10  /*
11   * Creates an XSLT transformer from the style sheet named by the first argument and applies
12   * it to the XML document named by the second. The result is written to the third argument.
13   */
14  public class XmlTransformer {
15    /**
16     * @param argv argv[0] is the name of the style sheet, argv[1] is the name of the input xml file,
17     *             argv[2] is the name of the output file.
18     */
19    public static void main(String[] argv) {
20      System.out.println("Source style sheet: " + argv[0]);
21      System.out.println("Input xml document: " + argv[1]);
22      System.out.println("Output html document: " + argv[2]);
23      TransformerFactory transformerFactory = TransformerFactory.newInstance();
24      StreamSource xsltSource = new StreamSource(new File(argv[0]));
25      StreamSource xmlSource = new StreamSource(new File(argv[1]));
26      Result result = new StreamResult(new File(argv[2]));
27      try {
28        Transformer transformer = transformerFactory.newTransformer(xsltSource);
29        transformer.transform(xmlSource, result);
30      } catch (TransformerConfigurationException e) {
31        e.printStackTrace();
32      } catch (TransformerException e) {
33        e.printStackTrace();
34      }
35    }
37  }
This program loads an XSLT transformer with a given style sheet and
transforms the input file to an output file. In lines 1 through 8
the relevant classes are imported. Line 14 declares the class
and line 19 declares the main method. Line 23 instantiates the
transformer factory, which will create the transformer. Line 24
creates the input source that the XSLT file will be read from, using
the first argument from the command line. Line 25 creates the input
source that the input XML file will be read from, using the second
argument from the command line. Line 26 defines the result that
the output file will be written to, using the third argument from the
command line. Line 28 uses the transformer factory to create a
new transformer with the XSLT input source. Line 29 takes the
input XML file and transforms it to the output file. Lines 30
through 33 catch any exceptions generated during the transformation
and print a stack trace.
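The program above reads argv without checking it, so running it with fewer than three arguments ends in an ArrayIndexOutOfBoundsException. The following is one possible variant, offered as a sketch rather than as the article's code: the class name CheckedXmlTransformer, the run method, and the exit codes are assumptions made for illustration. It uses the same JAXP calls but checks the argument count and reports a failed transformation in one line.

```java
import java.io.File;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

/** Hypothetical variant of XmlTransformer; the names and exit codes are illustrative. */
class CheckedXmlTransformer {

    /** Returns 0 on success, 1 if the transformation fails, 2 if the arguments are wrong. */
    static int run(String[] argv) {
        if (argv.length != 3) {
            System.err.println(
                "Usage: java CheckedXmlTransformer <style sheet> <input xml> <output file>");
            return 2;
        }
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new File(argv[0])));
            transformer.transform(new StreamSource(new File(argv[1])),
                                  new StreamResult(new File(argv[2])));
            return 0;
        } catch (TransformerException e) {
            // TransformerConfigurationException is a subclass of TransformerException,
            // so this single catch covers both cases handled in the original program.
            System.err.println("Transformation failed: " + e.getMessageAndLocation());
            return 1;
        }
    }

    public static void main(String[] argv) {
        System.exit(run(argv));
    }
}
```

Returning a status from run rather than calling System.exit inside it keeps the method easy to call from other code or from a test.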
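To compile and run the program, enter these commands on a DOS command
line. On UNIX, use a forward slash in the source path. The -d option
writes the compiled class to the classes directory, which must already
exist, so that the class path in the second command can find it.
> javac -d classes src\XmlTransformer.java
> java -cp classes XmlTransformer cml1toxhtml.xslt arginine.cml arginine.html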
The output is shown below and in the file arginine.html.
<html>
<head>
<META http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>Extract from a Chemical Markup Language file</title>
</head>
<body>
<h1>Extract from a Chemical Markup Language File</h1>
<h2>Molecule ARGININE</h2>
<h3>Atom Array</h3>
<br>atom id: a1, C
<br>atom id: a2, C
<br>atom id: a3, C
<br>atom id: a4, C
<br>atom id: a5, C
<br>atom id: a6, N
<br>atom id: a7, N
<br>atom id: a8, O
<br>atom id: a9, O
<br>atom id: a10, C
<br>atom id: a11, N
<br>atom id: a12, N
<br>atom id: a13, H
</body>
</html>
The article Tools for Working with Chemical XML provides an XSLT
stylesheet for transforming CML1 to CML2. The W3C page The Extensible
Stylesheet Language Family (XSL) is a fundamental reference for
this [10]. The current version is 1.0 and there is a candidate
recommendation for version 1.1. Other references are Kay, XSLT: A
Programmer's Reference [11] and the Apache Xalan Project [12].
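To give a feel for an XML-to-XML transformation, here is a minimal sketch. It is not the stylesheet from the article cited above. It assumes that CML2 carries values such as elementType, x2, and y2 as attributes of the atom tag rather than as child tags, and it handles nothing else (no namespaces, no bonds). The class name AtomAttributeDemo and its helper method are made up for illustration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

/** Hypothetical demo class; this is not the stylesheet from the cited article. */
class AtomAttributeDemo {

    static final String XSLT =
        "<xsl:stylesheet xmlns:xsl='http://www.w3.org/1999/XSL/Transform' version='1.0'>"
      + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
      // Identity rule: copy every node and attribute that no other rule matches.
      + "<xsl:template match='@*|node()'>"
      + "<xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>"
      + "</xsl:template>"
      // Assumed mapping: each child tag with a builtin attribute becomes an
      // attribute of atom, named after the value of builtin.
      + "<xsl:template match='atom'>"
      + "<atom>"
      + "<xsl:copy-of select='@*'/>"
      + "<xsl:for-each select='*[@builtin]'>"
      + "<xsl:attribute name='{@builtin}'><xsl:value-of select='.'/></xsl:attribute>"
      + "</xsl:for-each>"
      + "</atom>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // The first atom of the arginine fragment shown earlier in this article.
    static final String CML1 =
        "<molecule convention='MDLMol' id='arginine' title='ARGININE'>"
      + "<atomArray>"
      + "<atom id='a1'>"
      + "<string builtin='elementType'>C</string>"
      + "<float builtin='x2'>0.7386</float>"
      + "<float builtin='y2'>0.1493</float>"
      + "</atom>"
      + "</atomArray>"
      + "</molecule>";

    /** Applies the style sheet text to the XML text and returns the result as a string. */
    static String transform(String xslt, String xml) throws TransformerException {
        Transformer transformer = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(xslt)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws TransformerException {
        System.out.println(transform(XSLT, CML1));
    }
}
```

The identity rule copies everything unchanged, and the more specific atom rule takes precedence for atom tags, so only the atoms are rewritten. A real CML1 to CML2 conversion would need more rules than this.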
Other XML Languages for Chemistry and BioSciences
A number of XML languages that are relevant to biosciences are listed
on the XML for Molecular Biology web site [13], including
- Bioinformatic Sequence Markup Language (BSML) - an extensible language specification and container for bioinformatic data [14]
- chadoxml - development of a generic model organism system database [15]
- RNAML - expresses data on RNA sequence and structure [16]
- BlastXML - describes Basic Local Alignment Search Tool output [17]
An interesting research project is the W3C Semantic Web Health Care
and Life Sciences Interest Group. Languages and projects include
- BioDASH - a prototype of a Drug Development Dashboard that associates disease, compounds, drug progression stages, molecular biology, and pathway knowledge
- Partners Healthcare Systems on 'Clinical Knowledge Management'
- Gene Ontology - provides a controlled vocabulary to describe gene and gene product attributes in any organism
- Clinical Data Interchange Standards Consortium
- Foundational Model of Anatomy at the University of Washington
About the Author
Alex Amies (alexamies@yahoo.com) has a Bachelor of Science in Computer
Science from the University of New South Wales, Australia and a Master
of Science in Engineering from Stanford University. He is currently
working for IBM as a Senior Software Engineer. He lives in Irvine,
California.
Related Reading
- Alex Amies, 2006. Tools for Working with Chemical Markup
Language at www.medicalcomputing.net/cmltools.html.
- Alex Amies, 2006. Basic Biological Chemistry with Chemical
Markup Language at www.medicalcomputing.net/biological_chem_computer.html.
References
1. World Wide Web Consortium (W3C), 2004. W3C Recommendation: Extensible Markup Language (XML) 1.0 (Third Edition), www.w3.org/XML.
2. The Chemical Markup Language (CML) project is hosted by SourceForge at cml.sourceforge.net.
3. The CML SVG demo was created and hosted by Adobe Systems, SVG Zone at www.adobe.com/svg/demos/main.html.
4. World Wide Web Consortium (W3C), 2004. W3C Recommendation: XML Schema Part 0: Primer Second Edition, www.w3.org/TR/xmlschema-0.
5. Sun Microsystems, Java XML Technology Page at java.sun.com/xml.
6. Eclipse Organization, the Eclipse Modeling Framework at www.eclipse.org/emf.
7. Apache Foundation, Xerces-C++ Version 2.7.0 at xml.apache.org/xerces-c.
8. World Wide Web Consortium (W3C), Web Services Activity Architecture Domain at www.w3.org/2002/ws.
9. Apache Foundation, Web Services Project @ Apache at ws.apache.org.
10. World Wide Web Consortium (W3C), The Extensible Stylesheet Language Family (XSL) Architecture Domain page at www.w3.org/Style/XSL. Extensible Stylesheet Language (XSL) Version 1.1 W3C Candidate Recommendation 20 February 2006 is at www.w3.org/TR/2006/CR-xsl11-20060220.
11. Kay, M 2000. XSLT: A Programmer's Reference, Wrox Press.
12. Apache Foundation, Xalan Project at xalan.apache.org/index.html.
13. Paul Gordon, XML for Molecular Biology at www.visualgenomics.ca/gordonp/xml.
14. Bioinformatic Sequence Markup Language (BSML) 3.1 DTD at www.bsml.org/resources/default.asp.
15. Generic Model Organism System Database project, hosted on SourceForge at sourceforge.net/projects/gmod.
16. RNAML XML Schema Definition 1.1, 2002. Laboratoire de Bioinformatique Théorique at www-lbit.iro.umontreal.ca/rnaml.
17. BlastXML DTD at www.ncbi.nlm.nih.gov/blast.
18. World Wide Web Consortium (W3C), Semantic Web Health Care and Life Sciences Interest Group at www.w3.org/2001/sw/hcls.
Please send me ideas and opinions by email at
webmaster@medicalcomputing.net or add comments to my blog. The content
may become part of the web site.
© 2006 Alex Amies