Introduction to Extensible Markup Language for Chemistry and Biosciences (Continued)

Alex Amies   March 2, 2006

Transforming XML into other formats

Chemical Markup Language and other varieties of XML may need to be transformed into different formats.  Examples include HTML for display in web browsers, PDF, Structured Query Language (SQL) database statements, and other varieties of XML.  For example, we may want to transform CML v1 into CML v2.  The Extensible Stylesheet Language (XSL) may be used to do this.  XSL Transformations (XSLT) is the part of XSL that describes such transformations, and an XSLT processor is a tool that takes an XSL file and uses it to transform an XML document into something else.  The XSL document specifies the rules for doing the transformation.

Let's look at an example XSLT style sheet.  This is a simple example that outputs an HTML summary of a CML1 file.  The file itself is cml1toxhtml.xslt.

1    <?xml version="1.0"?>
2    <!-- A style sheet for transforming Chemical Markup Language v1 to HTML -->
3    <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
4
5      <xsl:output method="html" indent="yes"/>
6
7      <!-- Template rule outputs the basic outline for a HTML page. -->
8      <xsl:template match="/">
9        <html>
10         <head>
11           <title>Extract from a Chemical Markup Language file</title>
12         </head>
13         <body>
14           <h1>Extract from a Chemical Markup Language File</h1>
15           <xsl:apply-templates/>
16         </body>
17       </html>
18     </xsl:template>
19
20     <!-- HTML fragment for a CML molecule tag -->
21     <xsl:template match="molecule">
22       <h2>Molecule <xsl:value-of select="@title"/></h2>
23       <xsl:apply-templates  select="atomArray"/>
24     </xsl:template>
25
26     <!-- HTML fragment for a CML atomArray tag -->
27     <xsl:template match="atomArray">
28       <h3>Atom Array</h3>
29       <xsl:apply-templates/>
30     </xsl:template>
31
32     <!-- HTML fragment for a CML atom tag -->
33     <xsl:template match="atom">
34       <br/>atom id: <xsl:value-of select="@id"/>, <xsl:value-of select="string"/>
35     </xsl:template>
36
37   </xsl:stylesheet>  

Line 1 is the XML declaration letting the processor know that this is an XML file.  Line 3 is the top-level stylesheet tag that encloses the whole style sheet.  It includes a namespace attribute that binds the xsl prefix to the XSLT namespace.  The mix of XSLT tags, XML tags, and HTML tags here illustrates the importance of namespaces for differentiating the different varieties of markup.  Line 5 is an output instruction letting the processor know that the output will be an HTML file and asking it to indent the output to make it easier for humans to read.

Line 8 starts an XSL template block, which is the top-level template that outputs the outline of the HTML file.  The HTML outline is given in lines 9 through 17.  Line 15 tells the processor to look for other rules to apply to the child nodes.  The next rule to be found by the XSLT processor will be the template for molecule on line 21.  Line 22 outputs a level 2 HTML heading tag h2 with the text 'Molecule' and the value of the title attribute of the molecule, and line 23 applies templates to the atomArray children only.  The block in lines 27 through 30 matches an atomArray tag.  It outputs a level 3 HTML heading tag h3 and then tells the processor to apply templates to the children of the atomArray.  The remaining template on lines 33 through 35 matches an atom tag.  It outputs the atom id and the element type given in the string tag.  This is not a comprehensive XSLT style sheet and many tags are ignored.
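The way the template rules pick out some tags and ignore others can be tried on a small scale.  The following is a minimal sketch, not part of the original files: the class name TemplateRuleDemo, the abbreviated style sheet string, and the one-atom sample document are assumptions made for illustration.  It keeps the style sheet and the CML fragment in memory and returns the result as a string.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

/** Illustrative sketch: applies a cut-down copy of the template rules to a CML fragment in memory. */
class TemplateRuleDemo {

    // Abbreviated copy of the molecule, atomArray and atom rules (hypothetical, for illustration).
    static final String STYLE_SHEET =
        "<xsl:stylesheet xmlns:xsl='http://www.w3.org/1999/XSL/Transform' version='1.0'>"
      + "<xsl:output method='html'/>"
      + "<xsl:template match='molecule'>"
      + "<h2>Molecule <xsl:value-of select='@title'/></h2>"
      + "<xsl:apply-templates select='atomArray'/>"
      + "</xsl:template>"
      + "<xsl:template match='atomArray'>"
      + "<h3>Atom Array</h3><xsl:apply-templates/>"
      + "</xsl:template>"
      + "<xsl:template match='atom'>"
      + "<br/>atom id: <xsl:value-of select='@id'/>, <xsl:value-of select='string'/>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // One-atom sample document in the style of a CML v1 file (assumed data).
    static final String SAMPLE_CML =
        "<molecule id='arginine' title='ARGININE'>"
      + "<date day='22' month='11' year='1995'></date>"
      + "<atomArray><atom id='a1'>"
      + "<string builtin='elementType'>C</string>"
      + "<float builtin='x2'>0.7386</float>"
      + "</atom></atomArray>"
      + "</molecule>";

    /** Transforms the XML text with the style sheet text and returns the output as a string. */
    static String transform(String styleSheet, String xml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(styleSheet)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            // Wrapped in an unchecked exception only to keep the sketch short.
            throw new IllegalStateException("Transformation failed", e);
        }
    }

    public static void main(String[] argv) {
        System.out.println(transform(STYLE_SHEET, SAMPLE_CML));
    }
}
```

The result should contain the molecule title and the atom id with its element type, but not the x2 coordinate, because no rule selects the float tags.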

Let's use this stylesheet to transform the CML v1 file arginine.cml.  The first few lines of the file are shown here.


<molecule convention="MDLMol" id="arginine" title="ARGININE">
  <date day="22" month="11" year="1995">
  </date>
  <atomArray>
    <atom id="a1">
      <string builtin="elementType">C</string>
      <float builtin="x2">0.7386</float>
      <float builtin="y2">0.1493</float>
    </atom>
...
  </atomArray>
...
</molecule>

Now we need an XSLT processor.  The Sun Java 5.0 platform includes XSL processing features [5].  A number of other vendors and open source projects, including Microsoft, provide XSLT processors.  I am using a Java XSLT processor, but you may use any other type with the XSL style sheet above.  The Java file is XmlTransformer.java.


1    import java.io.File;
2    import javax.xml.transform.Result;
3    import javax.xml.transform.Transformer;
4    import javax.xml.transform.TransformerConfigurationException;
5    import javax.xml.transform.TransformerException;
6    import javax.xml.transform.TransformerFactory;
7    import javax.xml.transform.stream.StreamResult;
8    import javax.xml.transform.stream.StreamSource;
9
10   /*
11    * Creates an XSLT transformer from the style sheet named by the first argument and applies it to
12    * the XML document named by the second argument.  The result is written to the third argument.
13    */
14   public class XmlTransformer {
15       /**
16        * @param argv argv[0] is the name of the style sheet, argv[1] is the name of the input XML file,
17        *        argv[2] is the name of the output file.
18        */
19       public static void main(String[] argv) {
20           System.out.println("Source style sheet: " + argv[0]);
21           System.out.println("Input xml document: " + argv[1]);
22           System.out.println("Output html document: " + argv[2]);
23           TransformerFactory transformerFactory = TransformerFactory.newInstance();
24           StreamSource xsltSource = new StreamSource(new File(argv[0]));
25           StreamSource xmlSource = new StreamSource(new File(argv[1]));
26           Result result = new StreamResult(new File(argv[2]));
27           try {
28               Transformer transformer = transformerFactory.newTransformer(xsltSource);
29               transformer.transform(xmlSource, result);
30           } catch (TransformerConfigurationException e) {
31              e.printStackTrace();
32           } catch (TransformerException e) {
33              e.printStackTrace();
34           }
35       }
36   }

This program loads an XSLT transformer with a given style sheet and transforms the input file to an output file.  In lines 1 through 8 the relevant classes are imported.  Line 14 declares the class and line 19 declares the main method.  Line 23 instantiates the transformer factory, which will create the transformer.  Line 24 creates the input source that the XSLT file will be read from, using the first argument from the command line.  Line 25 creates the input source that the input XML file will be read from, using the second argument from the command line.  Line 26 defines the result that the output file will be written to, using the third argument from the command line.  Line 28 uses the transformer factory to create a new transformer with the XSLT input source.  Line 29 takes the input XML file and transforms it to the output file.  Lines 30 through 34 catch any exceptions generated while creating or running the transformer and print a stack trace.
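The listing above assumes that all three arguments are supplied.  As a hedged variation, the sketch below checks the argument count and reports failures through a return code instead of a stack trace.  The class name CheckedXmlTransformer, the run method, and the choice of return codes are my own assumptions for illustration and are not part of XmlTransformer.java.

```java
import java.io.File;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

/** Illustrative variant of XmlTransformer with an argument check (hypothetical class). */
class CheckedXmlTransformer {

    /** Applies the style sheet to the input document and writes the result to the output file. */
    static void transform(File styleSheet, File input, File output) throws TransformerException {
        Transformer transformer =
            TransformerFactory.newInstance().newTransformer(new StreamSource(styleSheet));
        transformer.transform(new StreamSource(input), new StreamResult(output));
    }

    /** Returns 0 on success, 1 for a usage error, and 2 if the transformation fails. */
    static int run(String[] argv) {
        if (argv.length != 3) {
            System.err.println(
                "Usage: java CheckedXmlTransformer <style sheet> <input xml> <output file>");
            return 1;
        }
        try {
            transform(new File(argv[0]), new File(argv[1]), new File(argv[2]));
            return 0;
        } catch (TransformerException e) {
            // TransformerConfigurationException is a subclass, so one catch covers both cases.
            System.err.println("Transformation failed: " + e.getMessageAndLocation());
            return 2;
        }
    }

    public static void main(String[] argv) {
        System.exit(run(argv));
    }
}
```

The commands that follow use the original XmlTransformer class.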

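To compile and run the program, enter these commands on a DOS command line.  On UNIX, use a forward slash in the path.  The -d option places the compiled class in the classes directory, which must already exist, so that the second command can find it.

> javac -d classes src\XmlTransformer.java
> java -cp classes XmlTransformer cml1toxhtml.xslt arginine.cml arginine.html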

The output is shown below and in the file arginine.html.


<html>
<head>
<META http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>Extract from a Chemical Markup Language file</title>
</head>
<body>
<h1>Extract from a Chemical Markup Language File</h1>
<h2>Molecule ARGININE</h2>
<h3>Atom Array</h3>
   
<br>atom id: a1, C
    <br>atom id: a2, C
    <br>atom id: a3, C
    <br>atom id: a4, C
    <br>atom id: a5, C
    <br>atom id: a6, N
    <br>atom id: a7, N
    <br>atom id: a8, O
    <br>atom id: a9, O
    <br>atom id: a10, C
    <br>atom id: a11, N
    <br>atom id: a12, N
    <br>atom id: a13, H
  </body>
</html>
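The same approach works for the other output formats mentioned at the start of this section, such as SQL statements.  The following is a minimal sketch under stated assumptions: the table name atom and its columns id and element are hypothetical, as are the class name CmlToSqlDemo and the two-atom sample document.  The style sheet uses the text output method, and the values are not escaped, so it is only suitable for trusted input.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

/** Illustrative sketch: turns CML v1 atom tags into SQL INSERT statements (hypothetical schema). */
class CmlToSqlDemo {

    // Style sheet with the text output method; one INSERT statement is written per atom tag.
    static final String STYLE_SHEET =
        "<xsl:stylesheet xmlns:xsl='http://www.w3.org/1999/XSL/Transform' version='1.0'>"
      + "<xsl:output method='text'/>"
      + "<xsl:template match='/'><xsl:apply-templates select='//atom'/></xsl:template>"
      + "<xsl:template match='atom'>"
      + "<xsl:text>INSERT INTO atom (id, element) VALUES ('</xsl:text>"
      + "<xsl:value-of select='@id'/>"
      + "<xsl:text>', '</xsl:text>"
      + "<xsl:value-of select=\"string[@builtin='elementType']\"/>"
      + "<xsl:text>');&#10;</xsl:text>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Two-atom sample document in the style of a CML v1 file (assumed data).
    static final String SAMPLE_CML =
        "<molecule id='arginine' title='ARGININE'><atomArray>"
      + "<atom id='a1'><string builtin='elementType'>C</string>"
      + "<float builtin='x2'>0.7386</float></atom>"
      + "<atom id='a6'><string builtin='elementType'>N</string></atom>"
      + "</atomArray></molecule>";

    /** Returns the SQL text produced by applying the style sheet to the given CML document. */
    static String toSql(String cml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(STYLE_SHEET)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(cml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            // Wrapped in an unchecked exception only to keep the sketch short.
            throw new IllegalStateException("Transformation failed", e);
        }
    }

    public static void main(String[] argv) {
        System.out.print(toSql(SAMPLE_CML));
    }
}
```

Only the style sheet differs from the HTML example.  The Java code that drives the transformation is the same in both cases.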

The article Tools for Working with Chemical XML provides an XSLT stylesheet for transforming CML1 to CML2.  The W3C page The Extensible Stylesheet Language Family (XSL) is a fundamental reference for this [10].  The current version is 1.0 and there is a candidate recommendation for a version 1.1.  Other references are Kay, XSLT: A Programmer's Reference [11] and the Xalan Apache Project [12].

Other XML Languages for Chemistry and Biosciences

A number of XML languages relevant to the biosciences are listed on the XML for Molecular Biology web site [13], including

An interesting research project is the W3C Semantic Web Health Care and Life Sciences Interest Group.  Languages and projects include


About the Author

Alex Amies (alexamies@yahoo.com) has a Bachelor of Science in Computer Science from the University of New South Wales, Australia, and a Master of Science in Engineering from Stanford University.  He is currently working for IBM as a Senior Software Engineer.  He lives in Irvine, California.

Related Reading

  1. Alex Amies, 2006.  Tools for Working with Chemical Markup Language at www.medicalcomputing.net/cmltools.html.
  2. Alex Amies, 2006.  Basic Biological Chemistry with Chemical Markup Language at www.medicalcomputing.net/biological_chem_computer.html.

References

  1. World Wide Web Consortium (W3C), 2004. W3C Recommendation: Extensible Markup Language (XML) 1.0 (Third Edition), www.w3.org/XML.   
  2. The Chemical Markup Language (CML) project is hosted by SourceForge at cml.sourceforge.net.
  3. The CML SVG demo was created and hosted by Adobe Systems, SVG Zone at www.adobe.com/svg/demos/main.html.
  4. World Wide Web Consortium (W3C), 2004.  W3C Recommendation: XML Schema Part 0: Primer Second Edition, www.w3.org/TR/xmlschema-0.
  5. Sun Microsystems, Java XML Technology Page at java.sun.com/xml.
  6. Eclipse Organization, the Eclipse Modeling Framework at www.eclipse.org/emf.
  7. Apache Foundation, Xerces-C++ Version 2.7.0 at xml.apache.org/xerces-c.
  8. World Wide Web Consortium (W3C), Web Services Activity Architecture Domain at www.w3.org/2002/ws.
  9. Apache Foundation, Web Services Project @ Apache at ws.apache.org.
  10. World Wide Web Consortium (W3C), The Extensible Stylesheet Language Family (XSL) Architecture Domain page at www.w3.org/Style/XSL.  Extensible Stylesheet Language (XSL) Version 1.1, W3C Candidate Recommendation 20 February 2006, is at www.w3.org/TR/2006/CR-xsl11-20060220.
  11. Kay, M 2000.  XSLT: A Programmer's Reference, Wrox Press.
  12. Apache Foundation, Xalan Project at xalan.apache.org/index.html.
  13. Paul Gordon, XML for Molecular Biology at www.visualgenomics.ca/gordonp/xml.
  14. Bioinformatic Sequence Markup Language (BSML) 3.1 DTD at www.bsml.org/resources/default.asp.
  15. Generic Model Organism System Database project, hosted on SourceForge at sourceforge.net/projects/gmod.
  16. RNAML XML Schema Definition 1.1, 2002. Laboratoire de Bioinformatique Théorique at www-lbit.iro.umontreal.ca/rnaml.
  17. BlastXML DTD at www.ncbi.nlm.nih.gov/blast.
  18. World Wide Web Consortium (W3C), Semantic Web Health Care and Life Sciences Interest Group at www.w3.org/2001/sw/hcls.


© 2006 Alex Amies