Resources for Research in Medical Computing

April 30, 2006


Chemical, Biological, and Medical Databases

Literature

PubMed is a database of medical and biological literature developed by the National Center for Biotechnology Information (NCBI) at the National Library of Medicine (NLM).  It contains abstracts from more than 4,800 biomedical journals.

Chemistry

The National Library of Medicine's PubChem web site at pubchem.ncbi.nlm.nih.gov provides information on the biological properties of small molecules.  You can use the structure search facility to find the properties and structures of many chemicals.  It is a component of the US National Institutes of Health's (NIH) Molecular Libraries Roadmap Initiative.

Argonne National Laboratory hosts the WIT (What Is There?) database, which contains comparative analyses of sequenced genomes.  Argonne also hosts EMP (Enzymes and Metabolic Pathways) and other databases.

Protein Structure and Sequence

The US National Center for Biotechnology Information (NCBI) Entrez Protein Database is compiled from a variety of sources, including SwissProt, PIR, PRF, and the Protein Data Bank (PDB).  A copy of the PDB hosted by the Research Collaboratory for Structural Bioinformatics (RCSB) can be found at www.rcsb.org/pdb.  The latest additions to the PDB can also be browsed and downloaded at www.rcsb.org/pdb/smartSubquery.do?smartSearchSubtype=LastLoadQuery.  The Basic Local Alignment Search Tool (BLAST) is a program that compares nucleotide or protein sequences to sequences in these and other similar databases.
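To give a feel for what comparing sequences by local alignment means, here is a minimal sketch in Python.  It is not BLAST itself: BLAST uses fast heuristics to search whole databases, whereas this computes an exact Smith-Waterman local alignment score for a single pair of sequences.  The function name and the scoring values (match, mismatch, gap) are assumptions chosen for illustration only.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best Smith-Waterman local alignment score of a and b.

    Illustrative only: the scoring values are arbitrary assumptions,
    and real tools use substitution matrices and affine gap penalties.
    """
    prev = [0] * (len(b) + 1)  # scores for the previous row of the table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # Extend the alignment diagonally with a match or mismatch.
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, so never go below 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    # The shared core "ACGT" is found even though the flanks differ.
    print(local_alignment_score("TTACGTTT", "GGACGTGG"))
```

With these scoring values, two sequences that share only the stretch ACGT score the same as ACGT aligned against itself, which is the sense in which the alignment is local.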

The Swiss Institute of Bioinformatics (SIB) hosts the ExPASy (Expert Protein Analysis System) proteomics server, which includes a protein knowledge base and tools.

The Sanger Institute maintains the Protein Family (Pfam) database at www.sanger.ac.uk/Software/Pfam.  Seventy-four percent of protein sequences match at least one entry in Pfam.

Genetics

The Human Genome Project, completed in 2003, required many advances in computing.  Among the goals of the project that relate directly to computing are storing the DNA sequence information in databases openly accessible from the Internet and improving tools for data analysis.  The human genome can be browsed with the Human Chromosome Launchpad.  The GenBank database hosted by the NCBI is a DNA sequence database with genetic data from humans and other organisms.  GenBank can be most easily browsed with MapViewer.  Entrez is the NCBI's search engine, which searches GenBank and other NCBI databases.  The NCBI has a number of tools available for analysis, including BLAST.  The European Bioinformatics Institute has similar genetic databases and tools for working with the data.

The Institute for Genomic Research (TIGR) also hosts several genome databases, including the Expressed Gene Anatomy Database (EGAD) at www.tigr.org/tdb/egad/egad.shtml.

The Kyoto Encyclopedia of Genes and Genomes (KEGG) is another genetic information database.  It also has pathway, ligand, and drug information, and it offers a web services API for accessing the database.  Genome.net is a bioinformatics gateway hosted by the Bioinformatics Center at the Institute of Chemical Research, Kyoto University.  It includes KEGG and other bioinformatics databases.

The Weizmann Institute of Science's GeneCards database at www.genecards.org is an integrated database of human genes that includes genomic, proteomic, and transcriptomic information.
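Entrez searches can also be scripted.  The sketch below only builds a search URL and does not contact the NCBI.  It assumes the NCBI E-utilities "esearch" endpoint and its db, term, and retmax parameters, none of which are described on this page, so check the NCBI documentation before relying on it.  The helper name esearch_url and the example query are hypothetical.

```python
from urllib.parse import urlencode

# Assumed base address of the NCBI E-utilities (not taken from this page).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db, term, retmax=20):
    """Build an Entrez search URL for the given database and query term.

    db     -- Entrez database name, e.g. "nucleotide" or "pubmed"
    term   -- query text in Entrez syntax
    retmax -- maximum number of record identifiers to return
    """
    params = {"db": db, "term": term, "retmax": retmax}
    # urlencode escapes spaces and brackets so the query is URL-safe.
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical query: nucleotide records for the human BRCA1 gene.
    print(esearch_url("nucleotide", "BRCA1[Gene Name] AND human[Organism]", retmax=5))
```

Keeping URL construction separate from the network call makes the query easy to inspect in a browser before it is automated.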

Tools

EMBOSS is an open source tool for working with data from GenBank.

GrailEXP is an experimental gene discovery suite that can be used to predict gene locations in genome data.  It was developed by the Genome Analysis and System Modeling Group of the Life Sciences Division of Oak Ridge National Laboratory.  Oak Ridge National Laboratory has also developed PROSPECT (PROtein Structure Prediction and Evaluation Computer Toolkit), a protein structure prediction system.

GENSCAN is a gene finding program developed by Chris Burge and Samuel Karlin.  A web interface to the program is available at genes.mit.edu/GENSCAN.html.

PROCRUSTES is a gene recognition program that uses spliced alignment to explore all possible exon assemblies within DNA sequences.

GeneWise is a program that searches for genes by comparing a protein sequence to a genomic DNA sequence, allowing for introns and frameshifting errors.  A web interface for the program is hosted by the European Bioinformatics Institute at www.ebi.ac.uk/Wise2/.

BioJava is an open source project initiated by Great Britain's Sanger Institute and hosted by the Open Bioinformatics Foundation.  It focusses on genetic analysis.  The introductory article BioJava -- Java Technology Powers Toolkit for Deciphering Genomic Codes describes the project.

Interesting Projects and Organizations in Medical Computing Research

There are computer systems to aid in medical research and there is also research into medical computer systems themselves.  Since the goal of both is the same, to advance our ability to help people live longer and healthier lives, I will discuss both.  Researchers are trying to achieve many different things, and these are two of the hottest areas in science and technology at present.  There is some exciting work in this area, but the best I can hope to do is to sample what is out there and to get people's opinions on it and on the barriers that researchers face.

World Community Grid

World Community Grid (www.worldcommunitygrid.org) is a software system with a lofty goal.  I have it installed on all three of my systems (two at work and one at home) and run it night and day.  It uses a grid of computers to solve various problems that are very computationally expensive.  The problem that it is working on now is FightAIDS@Home, which analyses how potential drug molecules fit into the HIV protease.  The project notes that the best candidates will be tested in the lab.  The software executing in the grid at present was written by the Molecular Graphics Laboratory at The Scripps Research Institute.  The grid software was made and is operated by IBM along with a very worthy collection of partners listed on the site.  The World Community Grid has also run the computations for the Human Proteome Folding Project.


BrainMaps.org

BrainMaps.org is an interactive high-resolution digital brain atlas and virtual microscope.  It features scanned images of serial sections of both primate and non-primate brains, which are integrated with a database and a Flash interactive user interface.

Allen Brain Atlas

The Allen Brain Atlas is another very worthy project.  Its goal is to map the gene expression in the brain at the cellular level.  It is funded by The Allen Institute for Brain Science, a non-profit organization.  They are using an automated procedure with a custom built laboratory information system to crunch through the huge amount of data.

BrainInfo

BrainInfo is a web site / application designed to help users identify structures in the brain.

Neuroscientific.net

Neuroscientific.net is a portal for neurosciences, especially those relating to bioinformatics.

Genomes to Life

The U.S. Department of Energy Office of Science Genomes to Life project studies the proteins encoded by the genomes of different organisms to explore natural capabilities in microbes.

The Genographic Project

The Genographic Project is a truly multidisciplinary anthropology / genetics / computing project based on collaboration between National Geographic, IBM, and other partners.  It uses the Internet to generate interest from the public.  Public participants send in samples from their inner cheek for DNA analysis.  The Y chromosome is scanned for genetic markers that are correlated with geographic movements.  The real research is with indigenous groups, whose DNA is analysed in more detail.  The Genographic site gives a description of the genetic science involved and IBM Research gives an overview of the computing technology involved.

Human Physiology in Space Outline

The Human Physiology in Space Outline project is managed by the National Space Biomedical Research Institute to measure the effects of life in space on physiology.

Extensible Markup Language and Data Modelling Projects and Groups

W3C Semantic Web Health Care and Life Sciences Interest Group aims to improve collaboration, research and development, and innovation adoption in the health care and life science industries.

There are a number of eXtensible Markup Language (XML) projects underway.  These are essential for interchange of data between groups of people and software systems.  Some of these are listed at xml.com.  There are a number of protein databases in existence, some of which are listed at the European Bioinformatics Institute's web site.  One project is the Protein eXtensible Markup Language (PROXIML).  Another project is the Molecular Interaction XML format, run by the Proteomics Standards Initiative.

The National Center for Biomedical Ontology is a consortium of leading biologists, clinicians, informaticians, and ontologists who develop innovative technology and methods that allow scientists to create, disseminate, and manage biomedical information and knowledge in machine-processable form.  They sponsor the Open Biomedical Ontologies SourceForge project, which is a focal point for modelling of information for shared use across different biological and medical domains.  This includes the OBO Ontology Browser, which lists a number of different vocabularies.  There is a vocabulary for human developmental anatomy and another for human disease.

Systems Biology Markup Language (SBML) is a computer-readable format for representing models of biochemical reaction networks.  The current stable version, SBML Level 2, describes structures and facilities for model definitions using XML Schema.
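As a rough illustration of what a machine-readable reaction network looks like, the sketch below reads a tiny SBML-style document with Python's standard library.  The embedded model is a made-up, simplified fragment (it omits parts a complete SBML model would normally include, such as compartment definitions), the namespace shown is my assumption for SBML Level 2, and summarize_reactions is a hypothetical helper, not part of any SBML toolkit.

```python
import xml.etree.ElementTree as ET

# Assumed SBML Level 2 namespace; verify against the SBML specification.
SBML_NS = {"sbml": "http://www.sbml.org/sbml/level2"}

# Invented toy model: one reaction R1 that converts species S1 into S2.
EXAMPLE = """
<sbml xmlns="http://www.sbml.org/sbml/level2" level="2" version="1">
  <model id="toy_model">
    <listOfSpecies>
      <species id="S1" compartment="cell"/>
      <species id="S2" compartment="cell"/>
    </listOfSpecies>
    <listOfReactions>
      <reaction id="R1">
        <listOfReactants><speciesReference species="S1"/></listOfReactants>
        <listOfProducts><speciesReference species="S2"/></listOfProducts>
      </reaction>
    </listOfReactions>
  </model>
</sbml>
"""


def summarize_reactions(sbml_text):
    """Map each reaction id to a (reactant ids, product ids) pair."""
    root = ET.fromstring(sbml_text.strip())
    summary = {}
    for reaction in root.iterfind(".//sbml:reaction", SBML_NS):
        reactants = [ref.get("species") for ref in reaction.iterfind(
            "sbml:listOfReactants/sbml:speciesReference", SBML_NS)]
        products = [ref.get("species") for ref in reaction.iterfind(
            "sbml:listOfProducts/sbml:speciesReference", SBML_NS)]
        summary[reaction.get("id")] = (reactants, products)
    return summary


if __name__ == "__main__":
    for rid, (reactants, products) in summarize_reactions(EXAMPLE).items():
        print(rid, ":", " + ".join(reactants), "->", " + ".join(products))
```

For real models a dedicated SBML library is the better choice, since it validates the document and handles kinetic laws and units, which this sketch ignores.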

The Physiome Project seeks to describe the human organism quantitatively to understand its physiology and pathophysiology using a collection of models.  Much of this work focusses on biophysics.

BioPAX is a collaborative effort to create a data exchange format for biological pathway data.  The BioPAX group first met in 2002.  The project uses the Resource Description Framework (RDF), which builds on URI and XML technologies.  RDF specifications are developed by the World Wide Web Consortium (W3C) Semantic Web Group.  The W3C's RDF home is at www.w3.org/RDF.

Cell Markup Language (CellML) is an XML language for modelling cells, in particular for storing and exchanging computer-based mathematical models.  CellML is being developed by the Bioengineering Institute at the University of Auckland and affiliated research groups.

Information Hubs

The University of Washington Genome Center provides links to projects within the University, in addition to publications, technology, and other resources.  There is also a short tutorial on Analyzing Genome Sequences.

Bioinformatics.net is a hub that includes links to bioinformatics web sites, companies, tools, news articles, and forums.

News, Blogs, and Feeds

The site www.bio.net/bionet/ is a gateway for biology-related newsgroups.  Many of these include RSS feeds.  There is also a BIO-SOFTWARE list with information about software for biology at www.bio.net/bionet/mm/bio-soft/.



Please send me ideas and opinions by email at webmaster@medicalcomputing.net or add comments to my blog.  The content may become part of the web site.

© 2006 Alex Amies