Working with Common Molecular File Formats
Abstract
This document discusses some common molecular file formats,
including their usage and how to convert between them. It also lists a
few sources of freely available chemical data on the Internet. The
intended audience is anyone who wants to work with molecular data and
has a basic understanding of chemistry and computer usage.
Introduction
When working with different software tools in chemistry, users are
likely to come across a large number of different molecular file
formats. This can be a barrier, both because of the confusion it
creates and because certain tools can only work with certain file
formats. The motivation behind this document is to help users who
are not experts in computational chemistry overcome that barrier.
Sources of Chemical Data
Here is a short list of sources of freely available molecular data.
There are many more resources on the Internet than are listed here.
Email me to suggest another.
- The US National Institutes of Health PubChem database [1] is a huge
source of chemical data. All of the data is two-dimensional. Available
formats include SDF, SMILES, PubChem XML, and PubChem ASN1.
- The Protein Data Bank [2] is an excellent source of protein molecular
data. The data is three-dimensional and provided in Protein Data Bank
(PDB) format.
- Chmoogle [3] is a commercial database for molecular data. The data
includes a two-dimensional structure diagram and a SMILES string for
each compound. Chmoogle supports substructure searching based on parts
of the molecular structure.
- ChemExper [4] is a commercial database for molecular data. The search
results include a two-dimensional structure diagram and a MOL file for
many compounds.
- New York University Library of 3-D Molecular Structures [5].
- The US Environmental Protection Agency's Distributed
Structure-Searchable Toxicity (DSSTox) Database Network [10] is a
project of EPA's Computational Toxicology Program. The database
provides SDF molecular files with a focus on carcinogenic and
otherwise toxic substances.
Chemical Markup Language
Chemical Markup Language (CML) [6] is an open standard for representing
molecular and other chemical data. The open source project includes an
XML Schema, source code for parsing and working with CML data, and an
active community. The articles Tools for Working with Chemical Markup
Language and XML for Chemistry and Biosciences discuss CML in more
detail. CML data files are accepted by many tools, including
JChemPaint, Jmol, and MarvinView.
Protein Data Bank Format
The Protein Data Bank (PDB) [2] format is commonly used for proteins,
but it can be used for other types of molecules as well. Because some
of these files are large, they are often compressed, and some tools,
such as Jmol, can accept the files in gzipped format.
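As a rough illustration of both points, the sketch below reads atom records from a PDB file that may or may not be gzipped. It relies only on the fixed column layout of ATOM and HETATM records. The function names are made up for this example, and a dedicated PDB library would be a sounder choice for real work.

```python
import gzip


def open_pdb(path):
    """Open a PDB file as text, transparently handling a .gz suffix."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def parse_atom_line(line):
    """Pull a few fields out of one ATOM/HETATM record by column position."""
    return {
        "serial": int(line[6:11]),         # columns 7-11: atom serial number
        "name": line[12:16].strip(),       # columns 13-16: atom name
        "residue": line[17:20].strip(),    # columns 18-20: residue name
        "chain": line[21:22].strip(),      # column 22: chain identifier
        "x": float(line[30:38]),           # columns 31-38: x coordinate
        "y": float(line[38:46]),           # columns 39-46: y coordinate
        "z": float(line[46:54]),           # columns 47-54: z coordinate
    }


def read_atoms(path):
    """Return parsed atom records from a plain or gzipped PDB file."""
    with open_pdb(path) as handle:
        return [parse_atom_line(line) for line in handle
                if line.startswith(("ATOM  ", "HETATM"))]
```

Calling read_atoms on a hypothetical file such as structure.pdb.gz returns one dictionary per atom. All other record types (HEADER, REMARK, CONECT and so on) are skipped.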
SMILES
SMILES [7] is a simple yet descriptive chemical representation. SMILES
strings include connectivity but do not include coordinate data. They
can be used to generate three-dimensional coordinates as discussed
below.
Atoms in the common organic subset are represented by their element
symbols: B, C, N, O, F, P, S, Cl, Br, and I. Other elements are written
in square brackets. Single bonds between adjacent atoms are implied.
Double bonds are represented by `=' and triple bonds are represented
by `#'. Branching is indicated by parentheses. Ring closures are
indicated by pairs of matching digits.
Some examples are
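For illustration, ethanol is CCO, ethene is C=C, hydrogen cyanide is C#N, isobutane is CC(C)C, cyclohexane is C1CCCCC1, and acetic acid is CC(=O)O. The short sketch below tallies the features described above for each string. The choice of molecules and the summarize function are assumptions made for this example. The function only handles the simple notation covered in this section, with no bracket atoms, aromatic lowercase symbols, or two-digit ring closures.

```python
import re

# Illustrative SMILES strings for some small, familiar molecules.
EXAMPLES = {
    "ethanol": "CCO",
    "ethene": "C=C",
    "hydrogen cyanide": "C#N",
    "isobutane": "CC(C)C",
    "cyclohexane": "C1CCCCC1",
    "acetic acid": "CC(=O)O",
}

# Two-letter symbols come first so that "Cl" is not read as "C".
ATOM_PATTERN = re.compile(r"Cl|Br|[BCNOFPSI]")


def summarize(smiles):
    """Count atoms, multiple bonds, branches, and ring closures in a simple SMILES."""
    return {
        "atoms": len(ATOM_PATTERN.findall(smiles)),
        "double_bonds": smiles.count("="),
        "triple_bonds": smiles.count("#"),
        "branches": smiles.count("("),
        # Each ring closure uses a pair of matching digits.
        "rings": sum(ch.isdigit() for ch in smiles) // 2,
    }


if __name__ == "__main__":
    for name, smiles in EXAMPLES.items():
        print(name, smiles, summarize(smiles))
```

Hydrogen atoms are implicit in these strings, so the atom count covers only the atoms that are written out.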
Other Common Formats
Structure Data Format (SDF) files are text files that adhere to a
strict format for representing multiple chemical structure records and
associated data fields. The format was originally developed and
published by Molecular Design Limited (MDL) [8]. MOL is another file
format from MDL. It is documented in Chapter 4 of the white paper
MDL® CTfile Formats.
PubChem [1] also has XML and ASN1 file formats, which are export
options from the PubChem online database. Both exports are text based,
although ASN1 is most often encountered as a binary format.
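To make the record structure of an SDF file concrete, the sketch below splits SDF text into records at the `$$$$` delimiter lines, reads the atom and bond counts from the fourth line of each molfile block, and collects the data fields that follow the block. The sample record and the function names are invented for illustration. The code assumes V2000 molfile blocks and is not a full implementation of the CTfile specification.

```python
import re

# A data field header looks like "> <FIELD_NAME>".
FIELD_HEADER = re.compile(r"^>.*<([^>]+)>")

# Hand-written single-record sample (water), for demonstration only.
SAMPLE = "\n".join([
    "water",
    "  hand-written sample",
    "",
    "  3  2  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0",
    "    0.9572    0.0000    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0",
    "   -0.2400    0.9266    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0",
    "  1  2  1  0",
    "  1  3  1  0",
    "M  END",
    "> <NAME>",
    "water",
    "",
    "$$$$",
    "",
])


def split_sdf(text):
    """Return each record as a list of lines, without the '$$$$' delimiters."""
    records, current = [], []
    for line in text.splitlines():
        if line.strip() == "$$$$":
            records.append(current)
            current = []
        else:
            current.append(line)
    return records


def counts(record_lines):
    """Atom and bond counts from the counts line (fourth line) of the block."""
    counts_line = record_lines[3]
    return int(counts_line[0:3]), int(counts_line[3:6])


def data_fields(record_lines):
    """Collect data fields; each value ends at the next blank line."""
    fields, name = {}, None
    for line in record_lines:
        match = FIELD_HEADER.match(line)
        if match:
            name = match.group(1)
            fields[name] = []
        elif name is not None:
            if line.strip() == "":
                name = None
            else:
                fields[name].append(line)
    return {key: "\n".join(value) for key, value in fields.items()}
```

For the sample above, split_sdf(SAMPLE) yields one record, and counts and data_fields can then be applied to that record's lines.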
Converting Between Formats
Open Babel [9] is a freely available open source tool specifically
designed for converting between file formats. It supports a large
number of file formats. Simple usage is
babel -i input_format input_file -o output_format output_file
For example, to convert the file epinephrine.sdf from SDF to CML, use
the command
babel -i sdf epinephrine.sdf -o cml epinephrine.cml
The resulting file is epinephrine.cml.
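When many files need converting, the same command can be driven from a script. The sketch below builds the argument list in the form shown above and runs it. The helper names are hypothetical, and the code assumes that each file extension matches the Open Babel format code (as it does for sdf and cml) and that the executable is called babel on your system. Some installations name it differently.

```python
import os
import shutil
import subprocess


def build_babel_command(input_path, output_path,
                        input_format=None, output_format=None):
    """Build the babel argument list, guessing formats from file extensions."""
    def guess(path):
        return os.path.splitext(path)[1].lstrip(".").lower()

    in_fmt = input_format or guess(input_path)
    out_fmt = output_format or guess(output_path)
    if not in_fmt or not out_fmt:
        raise ValueError("could not determine file formats; pass them explicitly")
    return ["babel", "-i", in_fmt, input_path, "-o", out_fmt, output_path]


def convert(input_path, output_path, **formats):
    """Run the conversion and return the output path."""
    command = build_babel_command(input_path, output_path, **formats)
    if shutil.which(command[0]) is None:
        raise FileNotFoundError("babel executable not found on PATH")
    subprocess.run(command, check=True)  # raises if babel reports failure
    return output_path


if __name__ == "__main__":
    # Prints the command for the epinephrine example without running it.
    print(" ".join(build_babel_command("epinephrine.sdf", "epinephrine.cml")))
```

Keeping command construction separate from execution makes it easy to inspect or log the exact command before anything is run.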
A number of tools intended for viewing and editing molecular
structures can read files in a number of formats and write them out
in other formats. The tools JChemPaint [11], Chime [12], and Jmol [13]
fit into this category.
Generating Three-Dimensional Coordinates
Three-dimensional coordinates can be generated from a SMILES string
using the tool Corina from Molecular Networks [14].
Thanks to Angel Herraez of Universidad de Alcala, Madrid for
suggesting this. One procedure to generate three-dimensional
coordinates is
- Download and install Chime. Chime is a Microsoft Internet
Explorer plug-in, so you will not see it in the Start menu after
installing it.
- Look the compound up on PubChem using its name.
- From the result returned, click the 'Exports' button.
- Still within PubChem, copy the SMILES string.
- Paste the SMILES string into Corina. You may use the online demo
at the Molecular Networks web site
(I had trouble with this) or the 3D Structures web page [15].
- From within Chime, right click anywhere on the canvas and select
File | Save. Save the file as a PDB.
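The procedure above is manual. As an alternative sketch, and not the Corina and Chime route described here, Open Babel can also attempt to generate three-dimensional coordinates from a SMILES string. The example assumes a newer Open Babel installation that provides the obabel executable and its --gen3d option. The function names and the output file name are invented for illustration, and the resulting geometry should be treated as a rough starting structure.

```python
import shutil
import subprocess


def build_gen3d_command(smiles, output_path, output_format="pdb"):
    """Build an obabel argument list that reads a SMILES string and writes 3D output."""
    return [
        "obabel",
        "-:" + smiles,          # SMILES given directly on the command line
        "-o" + output_format,   # output format code, e.g. pdb
        "-O", output_path,      # output file
        "--gen3d",              # ask Open Babel to generate 3D coordinates
    ]


def smiles_to_3d(smiles, output_path, output_format="pdb"):
    """Run obabel if it is installed and return the output path."""
    command = build_gen3d_command(smiles, output_path, output_format)
    if shutil.which(command[0]) is None:
        raise FileNotFoundError("obabel executable not found on PATH")
    subprocess.run(command, check=True)
    return output_path


if __name__ == "__main__":
    # Ethanol, written to a PDB file in the current directory.
    print(" ".join(build_gen3d_command("CCO", "ethanol.pdb")))
```

The SMILES string itself can still be copied from PubChem as in the steps above. Only the coordinate generation and the save to PDB are replaced.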
Acknowledgements
- Thanks to Angel Herraez, Dep. Bioquimica y Biologia
Molecular, Universidad de Alcala, Madrid for suggesting the procedure
for generating three-dimensional coordinates.
About Author
Alex Amies is a senior software engineer at IBM. He can be contacted
at alexamies@gmail.com.
References
1. National Library of Medicine, PubChem online database at pubchem.ncbi.nlm.nih.gov/.
The specification for the PubChem ASN1 molecular file format is at ftp://ftp.ncbi.nlm.nih.gov/pubchem/specifications/pubchem.asn
and the XML Schema for the PubChem XML format is at ftp://ftp.ncbi.nlm.nih.gov/pubchem/specifications/pubchem.xsd.
2. Research Collaboratory for Structural Bioinformatics, Protein
Data Bank at www.rcsb.org/pdb/Welcome.do.
3. eMolecules, Inc., Chmoogle is a search engine for chemical
structures and properties at www.chmoogle.com/index.htm.
There is information about chem-informatics and structure searching at
http://www.chmoogle.com/doc/cheminformatics-101.htm.
4. ChemExper has a database including thousands of chemicals and
their structural diagrams and properties.
5. New York University Library of 3-D Molecular Structures
http://www.nyu.edu/pages/mathmol/library/.
6. Chemical Markup Language is a SourceForge project hosted at
cml.sourceforge.net.
This includes the CML Schema, links to tools, documentation, and source
code. There is a discussion list at cml.sourceforge.net/list/index.html.
The CML Wiki is at cml.sourceforge.net/wiki/index.php/Main_Page.
7. SMILES home page at www.daylight.com/smiles/index.html.
8. Molecular Design Limited, June 2005. MDL® CTfile Formats
White Paper at www.mdl.com/solutions/white_papers/ctfile_formats.jsp.
The MDL company web site is at www.mdl.com.
9. Open Babel SourceForge project at openbabel.sourceforge.net/wiki/Main_Page.
10. US Environmental Protection Agency Distributed
Structure-Searchable Toxicity (DSSTox) Database Network at www.epa.gov/nheerl/dsstox/index.html.
11. JChemPaint is a SourceForge project hosted at sourceforge.net/projects/jchempaint.
12. MDL Chime can be downloaded at www.mdl.com/products/framework/chime/
after registration. Chime
tutorials are at www.mdlchime.com/support/developer/chime/index.jsp
and www.chem.uwec.edu/ChimeTutDemos.
13. Jmol is an open source project hosted on SourceForge at jmol.sourceforge.net.
It also has a wiki at wiki.jmol.org/WebsitesUsingJmol.
14. Molecular Networks home page at www.mol-net.de
and Corina demo page at www.mol-net.de/online_demos/corina_demo.html.
15. Professor Gasteiger's research team at Computer-Chemie-Centrum
and Institute for Organic Chemistry, University of
Erlangen-Nürnberg, Germany. 3D Structures web page at www2.chemie.uni-erlangen.de/software/corina/free_struct.html.
Please send me ideas and opinions by email at
webmaster@medicalcomputing.net or add comments to my blog.
The content may become part of the web site.