Working with Common Molecular File Formats

Alex Amies March 31, 2006

Abstract

This document discusses some common molecular file formats, including how they are used and how to convert between them. It also lists a few sources of freely available chemical data on the Internet. The intended audience is anyone who wants to work with molecular data and has a basic understanding of chemistry and computer usage.

Introduction

When working with different software tools in chemistry, users are likely to come across a large number of different molecular file formats. This can be a barrier, both because of the confusion it creates and because certain tools can only work with certain file formats. The motivation behind this document is to help users who are not experts in computational chemistry overcome that barrier.

Sources of Chemical Data

Here is a short list of sources of freely available molecular data. Many more resources are available on the Internet than are listed here. Email me to suggest another.


  1. The US National Institutes of Health PubChem database [1] is a huge source of chemical data. All of the structures are two-dimensional. Data is available in SDF, SMILES, PubChem XML, and PubChem ASN.1 formats.
  2. The Protein Data Bank [2] is an excellent source of protein molecular data. The data is three-dimensional and provided in Protein Data Bank (PDB) format.
  3. Chmoogle [3] is a commercial database for molecular data. The data includes a two-dimensional structure diagram and a SMILES string for each compound. Chmoogle supports substructure searching based on parts of the molecular structure.
  4. ChemExper [4] is a commercial database for molecular data. The search results include a two-dimensional structure diagram and a MOL file for many compounds.
  5. New York University Library of 3-D Molecular Structures [5].
  6. The US Environmental Protection Agency's Distributed Structure-Searchable Toxicity (DSSTox) Database Network [10] is a project of the EPA's Computational Toxicology Program. The database provides SDF molecular files with a focus on carcinogenic and otherwise toxic substances.

Chemical Markup Language

Chemical Markup Language (CML) [6] is an open standard for representing molecular and other chemical data. The open source project includes an XML Schema, source code for parsing and working with CML data, and an active community. The articles "Tools for Working with Chemical Markup Language" and "XML for Chemistry and Biosciences" discuss CML in more detail. CML data files are accepted by many tools, including JChemPaint, Jmol, and MarvinView.
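
As a rough illustration of what working with CML data looks like, the Python sketch below reads atoms and bonds from a small hand-written CML fragment using only the standard library. The fragment itself, the helper names local_name and read_cml, and the choice to ignore XML namespaces are assumptions made for this example; real CML files usually carry more attributes (coordinates, hydrogen counts, and so on) than are shown here.

```python
import xml.etree.ElementTree as ET

# Hand-written sample for illustration only: the two carbon atoms of ethylene,
# joined by a double bond (hydrogens left out to keep the sample short).
SAMPLE_CML = """<molecule xmlns="http://www.xml-cml.org/schema" id="ethylene_heavy_atoms">
  <atomArray>
    <atom id="a1" elementType="C"/>
    <atom id="a2" elementType="C"/>
  </atomArray>
  <bondArray>
    <bond atomRefs2="a1 a2" order="2"/>
  </bondArray>
</molecule>"""


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag name."""
    return tag.rsplit("}", 1)[-1]


def read_cml(text):
    """Return ({atom id: element symbol}, [(atom id, atom id, order)])."""
    root = ET.fromstring(text)
    atoms, bonds = {}, []
    for node in root.iter():
        name = local_name(node.tag)
        if name == "atom":
            atoms[node.get("id")] = node.get("elementType")
        elif name == "bond":
            first, second = node.get("atomRefs2").split()
            bonds.append((first, second, node.get("order")))
    return atoms, bonds


atoms, bonds = read_cml(SAMPLE_CML)
print(atoms)
print(bonds)
```

Matching on the local tag name keeps the sketch working whether or not a file declares the CML namespace, at the cost of ignoring namespaces altogether.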

Protein Data Bank Format

The Protein Data Bank (PDB) [2] format is commonly used for proteins, but it can be used for other types of molecules as well. Because some of these files are large, they are often compressed, and some tools, such as Jmol, can read the files in gzipped form.
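
For readers who want to inspect a PDB file directly, here is a minimal Python sketch that opens a plain or gzipped file and pulls the atom name, element, and coordinates out of the ATOM and HETATM records. It relies on the fixed column layout of those records; the function names open_pdb and read_atoms and the file name in the usage note are made up for this example, and the sketch ignores everything else in the file (chains, models, connectivity).

```python
import gzip


def open_pdb(path):
    """Open a PDB file as text; names ending in .gz are decompressed on the fly."""
    path = str(path)
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def read_atoms(lines):
    """Collect name, element, and coordinates from ATOM/HETATM records."""
    atoms = []
    for line in lines:
        if line.startswith(("ATOM", "HETATM")):
            atoms.append({
                "name": line[12:16].strip(),      # columns 13-16
                "element": line[76:78].strip(),   # columns 77-78
                "xyz": (
                    float(line[30:38]),           # x, columns 31-38
                    float(line[38:46]),           # y, columns 39-46
                    float(line[46:54]),           # z, columns 47-54
                ),
            })
    return atoms
```

With a downloaded file (the name here is hypothetical), usage would look like: with open_pdb("structure.pdb.gz") as handle: atoms = read_atoms(handle).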

SMILES

SMILES [7] is a simple yet descriptive chemical representation. SMILES strings include connectivity but do not include coordinate data. They can be used to generate three-dimensional coordinates, as discussed below.

Atoms are represented by their element symbols B, C, N, O, F, P, S, Cl, Br, and I. Hydrogen atoms are usually left implicit. Single bonds between adjacent atoms are implied, double bonds are represented by '=', and triple bonds are represented by '#'. Branching is indicated by parentheses. Ring closures are indicated by pairs of matching digits.

Some examples are

Name       Formula   SMILES String
Methane    CH4       C
Ethanol    C2H6O     CCO
Benzene    C6H6      C1=CC=CC=C1
Ethylene   C2H4      C=C
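
To make the notation rules concrete, here is a minimal Python sketch that reads a simple SMILES string and derives the molecular formula by filling in the implicit hydrogens. It is an illustration under stated assumptions, not a full SMILES parser: it handles only the unbracketed atoms listed above, the bond symbols '-', '=' and '#', parentheses, and single-digit ring closures, and it assumes each atom takes its lowest common valence. Aromatic (lowercase) atoms, charges, isotopes, and stereochemistry are not supported. The names VALENCE, parse_smiles, and formula are invented for this example.

```python
import re
from collections import Counter

# Lowest common valence for each unbracketed atom (an assumption of this sketch).
VALENCE = {"B": 3, "C": 4, "N": 3, "O": 2, "F": 1,
           "P": 3, "S": 2, "Cl": 1, "Br": 1, "I": 1}
BOND_ORDER = {"-": 1, "=": 2, "#": 3}
TOKEN = re.compile(r"Cl|Br|[BCNOFPSI]|[-=#()]|\d")


def parse_smiles(smiles):
    """Return (atoms, bonds) where bonds are (index, index, order) tuples."""
    tokens = TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"unsupported SMILES syntax: {smiles!r}")
    atoms, bonds = [], []
    prev = None         # atom that the next atom will bond to
    order = 1           # order of the pending bond
    branch_stack = []   # atoms to return to when a branch closes
    open_rings = {}     # ring digit -> (atom index, bond order)
    for tok in tokens:
        if tok in VALENCE:
            atoms.append(tok)
            idx = len(atoms) - 1
            if prev is not None:
                bonds.append((prev, idx, order))
            prev, order = idx, 1
        elif tok in BOND_ORDER:
            order = BOND_ORDER[tok]
        elif tok == "(":
            branch_stack.append(prev)
        elif tok == ")":
            prev = branch_stack.pop()
        else:  # ring-closure digit: open on first sight, close on second
            if tok in open_rings:
                other, other_order = open_rings.pop(tok)
                bonds.append((other, prev, max(order, other_order)))
            else:
                open_rings[tok] = (prev, order)
            order = 1
    if open_rings or branch_stack:
        raise ValueError("unclosed ring or branch")
    return atoms, bonds


def formula(smiles):
    """Molecular formula with implicit hydrogens: C, then H, then the rest A-Z."""
    atoms, bonds = parse_smiles(smiles)
    used = [0] * len(atoms)
    for a, b, bond_order in bonds:
        used[a] += bond_order
        used[b] += bond_order
    counts = Counter(atoms)
    counts["H"] = sum(max(VALENCE[el] - u, 0) for el, u in zip(atoms, used))
    keys = [k for k in ("C", "H") if counts[k]]
    keys += sorted(k for k in counts if k not in ("C", "H") and counts[k])
    return "".join(k + (str(counts[k]) if counts[k] > 1 else "") for k in keys)


for s in ["C", "CCO", "C=C", "C1=CC=CC=C1"]:
    print(s, formula(s))
```

Counting the bond orders on each atom and subtracting from its valence is what recovers the hydrogens that the SMILES string leaves out. The same ring written with single bonds only, C1CCCCC1, comes out as C6H12, which is cyclohexane rather than benzene.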

Other Common Formats

Structure Data Format (SDF) files are text files that adhere to a strict format for representing multiple chemical structure records and associated data fields. The format was originally developed and published by Molecular Design Limited (MDL) [8]. MOL is another file format from MDL. It is documented in Chapter 4 of the white paper MDL® CTfile Formats.
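
The record structure of an SDF file can be sketched in a few lines of Python. The example below assumes the common V2000 layout: each record starts with a three-line header followed by a counts line, data fields are introduced by lines such as "> <NAME>", and records end with a "$$$$" line. The sample record is hand-written for illustration, and the function names split_sdf and parse_record are invented here; the sketch does not read the atom and bond blocks themselves.

```python
SAMPLE_SDF = "\n".join([
    "methane",                       # header line 1: title
    "  hand-written example",        # header line 2: program/timestamp
    "",                              # header line 3: comment
    "  1  0  0  0  0  0  0  0  0  0999 V2000",   # counts line
    "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "M  END",
    "> <NAME>",
    "methane",
    "",
    "$$$$",
])


def split_sdf(text):
    """Yield one record (a list of lines) per structure in an SDF string."""
    record = []
    for line in text.splitlines():
        if line.strip() == "$$$$":
            if record:
                yield record
            record = []
        else:
            record.append(line)


def parse_record(lines):
    """Return the title, atom/bond counts, and data fields of one record."""
    counts = lines[3]
    fields, name = {}, None
    for line in lines:
        if line.startswith(">") and "<" in line:
            # Field header such as "> <NAME>"
            name = line[line.index("<") + 1:line.rindex(">")]
            fields[name] = []
        elif name is not None:
            if line.strip():
                fields[name].append(line)
            else:
                name = None  # a blank line ends the field value
    return {
        "title": lines[0].strip(),
        "n_atoms": int(counts[0:3]),
        "n_bonds": int(counts[3:6]),
        "fields": {key: "\n".join(value) for key, value in fields.items()},
    }


for record in split_sdf(SAMPLE_SDF):
    print(parse_record(record))
```

Splitting on the "$$$$" delimiter first keeps each structure independent, so one malformed record does not prevent the others from being read.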

PubChem [1] also has XML and ASN.1 file formats, which are export options from the PubChem online database. Both exports are text based, although ASN.1 data is more often stored in a binary encoding.

Converting Between Formats

Open Babel [9] is a freely available open source tool specifically designed for converting between file formats. It supports a large number of file formats. Simple usage is

babel -i input_format input_file -o output_format output_file

For example, to convert the file epinephrine.sdf from SDF to CML, use the command

babel -i sdf epinephrine.sdf -o cml epinephrine.cml

The resulting file is epinephrine.cml.
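
When many files need converting, the command above can be driven from a script. The Python sketch below builds the same argument list and runs it with the standard subprocess module. It assumes a babel executable is on the PATH and accepts the option layout shown in this article; newer Open Babel releases may ship the tool under a different name (obabel) with slightly different options, so check your installed version. The function names babel_command and convert, and the fallback of guessing the format from the file extension, are choices made for this example.

```python
import shutil
import subprocess
from pathlib import Path


def babel_command(input_file, output_file, input_format=None, output_format=None):
    """Build the argument list for the babel command line shown above."""
    in_path, out_path = Path(input_file), Path(output_file)
    # Assumption of this sketch: fall back to the file extension as the format.
    input_format = input_format or in_path.suffix.lstrip(".").lower()
    output_format = output_format or out_path.suffix.lstrip(".").lower()
    if not input_format or not output_format:
        raise ValueError("could not determine the input or output format")
    return ["babel", "-i", input_format, str(in_path),
            "-o", output_format, str(out_path)]


def convert(input_file, output_file, **formats):
    """Run the conversion; raises if babel is missing or exits with an error."""
    command = babel_command(input_file, output_file, **formats)
    if shutil.which(command[0]) is None:
        raise FileNotFoundError("babel executable not found on PATH")
    subprocess.run(command, check=True)
    return Path(output_file)


print(" ".join(babel_command("epinephrine.sdf", "epinephrine.cml")))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with file names that contain spaces.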

Several tools intended for viewing and editing molecular structures can read files in a variety of formats and write them out in other formats. The tools JChemPaint [11], Chime [12], and Jmol [13] fit into this category.

Generating Three Dimensional Coordinates

Three-dimensional coordinates can be generated from a SMILES string using the tool Corina from Molecular Networks [14].

Thanks to Angel Herraez of Universidad de Alcala, Madrid for suggesting this. One procedure to generate three-dimensional coordinates is

  1. Download and install Chime. Chime is a Microsoft Internet Explorer plug-in. You will not see it in the Start menu after installing it.
  2. Look up the compound on PubChem by its name.
  3. From the result returned, click the 'Exports' button.
  4. Still within PubChem, copy the SMILES string.
  5. Paste the SMILES string into Corina. You may use the online demo at the Molecular Networks web site (I had trouble with this) or the 3D Structures web page [15].
  6. From within Chime, right-click anywhere on the canvas and select File | Save. Save the file as a PDB.

Acknowledgements

  1. Thanks to Angel Herraez, Dep. Bioquimica y Biologia Molecular, Universidad de Alcala, Madrid for suggesting the procedure for generating three-dimensional coordinates.

About the Author

Alex Amies is a senior software engineer at IBM.  He can be contacted at alexamies@gmail.com.

References

  1. National Library of Medicine, PubChem online database at pubchem.ncbi.nlm.nih.gov/.  The specification for the PubChem ASN.1 molecular file format is at ftp://ftp.ncbi.nlm.nih.gov/pubchem/specifications/pubchem.asn and the XML Schema for the PubChem XML format is at ftp://ftp.ncbi.nlm.nih.gov/pubchem/specifications/pubchem.xsd.
  2. Research Collaboratory for Structural Bioinformatics, Protein Data Bank at www.rcsb.org/pdb/Welcome.do.  
  3. eMolecules, Inc., Chmoogle is a search engine for chemical structures and properties at www.chmoogle.com/index.htm.  There is information about chem-informatics and structure searching at http://www.chmoogle.com/doc/cheminformatics-101.htm.
  4. ChemExper has a database including thousands of chemicals and their structural diagrams and properties.  
  5. New York University Library of 3-D Molecular Structures http://www.nyu.edu/pages/mathmol/library/.
  6. Chemical Markup Language is a SourceForge project hosted at  cml.sourceforge.net.  This includes the CML Schema, links to tools, documentation, and source code.  There is a discussion list at cml.sourceforge.net/list/index.html.  The CML Wiki is at cml.sourceforge.net/wiki/index.php/Main_Page.
  7. SMILES home page at www.daylight.com/smiles/index.html.
  8. Molecular Design Limited, June 2005.  MDL® CTfile Formats White Paper at www.mdl.com/solutions/white_papers/ctfile_formats.jsp. The MDL company web site is at www.mdl.com.
  9. Open Babel SourceForge project at openbabel.sourceforge.net/wiki/Main_Page.
  10. US Environmental Protection Agency Distributed Structure-Searchable Toxicity (DSSTox) Database Network at www.epa.gov/nheerl/dsstox/index.html.
  11. JChemPaint is a SourceForge project hosted at sourceforge.net/projects/jchempaint.
  12. MDL Chime can be downloaded at www.mdl.com/products/framework/chime/ after registration. Chime tutorials are at www.mdlchime.com/support/developer/chime/index.jsp and www.chem.uwec.edu/ChimeTutDemos.
  13. Jmol is an open source project hosted on SourceForge at jmol.sourceforge.net.  It also has a wiki at wiki.jmol.org/WebsitesUsingJmol.
  14. Molecular Networks home page at www.mol-net.de and Corina demo page at www.mol-net.de/online_demos/corina_demo.html.
  15. Professor Gasteiger's research team at Computer-Chemie-Centrum and Institute for Organic Chemistry, University of Erlangen-Nürnberg, Germany - 3D Structures web page at www2.chemie.uni-erlangen.de/software/corina/free_struct.html.

Please send me ideas and opinions by email at webmaster@medicalcomputing.net or add comments to my blog.  The content may become part of the web site.