Handbook of Chemoinformatics
Edited by J. Gasteiger
Chemoinformatics is a generic term that encompasses the design, creation, organization,
management, retrieval, analysis, dissemination, visualization and use of
chemical information. The basic goal of this field is to transform data into
knowledge through information processing for the intended purpose of making
better decisions faster. The new discipline of chemoinformatics covers the
application of computer-assisted methods to chemical problems such as
information storage and retrieval, the prediction of physical, chemical or
biological properties of compounds, spectra simulation, structure elucidation,
reaction modeling, synthesis planning and drug design.
Although this field is recognized as an important research
area, its overall objectives and classification had not been set out until the
recent release of a comprehensive four-volume handbook on chemoinformatics,
edited by Prof. Gasteiger. It contains in-depth contributions from leading
authors around the world, with the content organized into chapters dealing with
the representation of molecular structures and reactions, data types and
databases/data sources, search methods, methods for data analysis, and
applications. The
Handbook of Chemoinformatics is the first reference work to be exclusively
devoted to this developing field from data to knowledge, and will set the
standard as the premier information source for the next decade. This handbook is
a must read for experts as well as students of chemistry and biology.
The handbook provides a comprehensive and coherent overview of
the state of the art of chemoinformatics. The first volume of the handbook
begins with the history of chemoinformatics, aptly written by Peter Willett. The
next few chapters deal with chemical nomenclature and the representation
of chemical structures using graph theory and SMILES. Chemoinformatics uses a
wide variety of algorithms for indexing and retrieving chemical compounds in
databases. Four chapters are devoted to processing constitutional information of
molecules. Computational methods for 3D structure generation, and for the
ligand-based and structure-based derivation of the so-called bioactive
conformation of a potential drug, are described. Shape analysis is a powerful
tool in chemistry, because molecular recognition in receptor-ligand interactions
is likely to be described more precisely near the molecular surface than
elsewhere around the molecule. The large amount of data generated by computation
and experiment needs to be visualized to identify trends and structures, and to
recognize shapes and patterns. In this
context, the strategic basis of molecular graphics for the optimization of
information transfer between human activity and computational processes assumes
great importance.
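To make the graph-theoretical representation mentioned above concrete, here is a minimal Python sketch (not taken from the handbook) that stores a molecule as a graph, with atoms as nodes and bonds as edges. The `atoms` and `bonds` layout and the function names are assumptions made purely for illustration; the heavy atoms of ethanol serve as the example.

```python
from collections import deque

# Hypothetical connection table for the heavy atoms of ethanol (SMILES: CCO).
atoms = ["C", "C", "O"]
bonds = [(0, 1, 1), (1, 2, 1)]  # (atom index, atom index, bond order)


def adjacency(n_atoms, bond_list):
    """Build an adjacency list: atom index -> sorted list of bonded atom indices."""
    neighbors = {i: [] for i in range(n_atoms)}
    for a, b, _order in bond_list:
        neighbors[a].append(b)
        neighbors[b].append(a)
    return {i: sorted(nbrs) for i, nbrs in neighbors.items()}


def is_connected(neighbors):
    """Breadth-first search: True if every atom is reachable from atom 0."""
    if not neighbors:
        return True
    seen = {0}
    queue = deque([0])
    while queue:
        current = queue.popleft()
        for nxt in neighbors[current]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return len(seen) == len(neighbors)


graph = adjacency(len(atoms), bonds)
print(graph)                # {0: [1], 1: [0, 2], 2: [1]}
print(is_connected(graph))  # True
```

Indexing and retrieval algorithms of the kind the handbook covers typically operate on a structure of this sort rather than on the SMILES string itself.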
The chemoinformatics of chemical reactions is not as far
developed as that of chemical structures. The two fundamental tasks in the
chemoinformatics of reactions are the prediction of the outcome of a reaction and
the design of a synthesis. Both data-driven and model-driven reaction
classification methods used for knowledge extraction are described. However, the
automatic assignment of a reaction center is not covered.
Data acquisition and data analysis are important tools for
building up knowledge in chemistry and for ensuring
that outgoing products meet all customer requirements. The next topic,
Experimental Design (ED), familiarizes readers with this mathematical
technique for planning and carrying out experiments so that the maximum possible
information is gained from experimental data. Standard data formats are essential
for facilitating the exchange of data between scientists. XML (eXtensible Markup
Language) supports the electronic exchange of information and documents in every
discipline. This standardized language has a specific extension for handling
chemical information, and many of its features are under review or development.
There are various types of databases available in the field of
chemistry, which store a wealth of information. Abstracting and indexing in
bibliographic databases are described in detail. As the CAS information system
has been the major provider of chemical information since the start of the
computer age, a complete section is devoted to the CAS databases (CAplus and
Registry) and their online access tools (STN Express and SciFinder). The largest
database of information on organic compounds is the Beilstein database, whose
potential has now been realized through additional features such as CrossFire.
Databases for retrieving inorganic and organometallic compounds are also included
in this chapter. The Cambridge Structural Database (CSD) provides information on
3D structures of small organic and organometallic molecules. Databases covering
spectroscopy, patents, environmental information, molecular topology and
biochemistry are also mentioned here. The Internet is the largest repository of
data, and the next section naturally leads to chemistry on the Internet, with an
overview of the Internet technologies used to harness chemical knowledge.
Laboratories generate a lot of data that needs to be organized and managed. The
chapter concludes with the basic structure, modules and functioning of
Laboratory Information Management Systems (LIMS).
Chemical structure search is the most important method of
accessing chemical information. This section begins with the methods available
for 2D structure and substructure search. Markush
structures are generic structures used in patents, and their retrieval poses a
problem in chemical structure searching. This article sheds light on the current
state of the art of Markush topological search systems. Computable structure
similarities are strongly correlated with biological similarities (the similar
property principle); similarity searching is now widely used for virtual
screening as a precursor to substructural analysis.
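As an illustration of how similarity searching ranks a library, the sketch below scores fingerprints with the Tanimoto coefficient, a commonly used similarity measure. The review does not say which coefficient the handbook favors, so this choice is an assumption, and the fingerprints and compound names are made up.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints given as sets of 'on' bit positions."""
    if not fp_a and not fp_b:
        return 1.0  # convention assumed here: two empty fingerprints count as identical
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)


def rank_by_similarity(query, library):
    """Return (name, score) pairs sorted by decreasing similarity to the query."""
    scored = [(name, tanimoto(query, fp)) for name, fp in library.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical fingerprints: each set lists the bit positions that are switched on.
query = {1, 4, 7, 9}
library = {
    "cmpd_A": {1, 4, 7, 9},
    "cmpd_B": {1, 4, 8},
    "cmpd_C": {2, 3, 5},
}

ranking = rank_by_similarity(query, library)
print(ranking)  # [('cmpd_A', 1.0), ('cmpd_B', 0.4), ('cmpd_C', 0.0)]
```

In a virtual screening setting, the top of such a ranking would be passed on for closer inspection; real fingerprints are far longer than the toy sets used here.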
Chemical structure information can be correlated with
physical, chemical or biological data to make a model, which can be used to
predict new data. The third volume of the handbook focuses on calculation of
physical and chemical data through direct computational methods. Molecular
mechanics or force field methods are used often as they are rapid and can be
applied to a large number of molecules with many atoms. Force field methods
used mainly for small molecules include MM2, MM3, Tinker, UFF, MOMEC and Osmos;
those for biological molecules include AMBER, CHARMM, GROMOS, OPLS, ECEPP, CVFF
and MMFF. Quantum mechanical methods are computationally more demanding than
molecular mechanics methods, but they can now also be applied to large molecules
or large data sets. The molecular orbital theories are described
first, followed by a detailed account of the properties from quantum mechanical
calculations that are of interest to chemoinformatics, for instance net atomic
charges, dipole moments, polarizabilities and orbital energies. The extra
information and detail provided by quantum mechanics are important for accurate
work involving specific interactions, such as docking studies.
The eighth chapter of this volume provides detailed information on
descriptors for chemical compounds. As more than 1500 descriptors are known, care
must be taken to choose the correct set. The first section covers topological
descriptors, which have now been superseded by more sophisticated descriptors.
Relationships between molecular structure and biological activity
can be searched for efficiently using geometric descriptors, with their large
information content. The next section, by Gasteiger, is on a series of structure
coding methods, that is, different ways of encoding a molecular structure into a
vector of numerical values. He suggests a hierarchy of structure representation:
constitution, 3D structure and molecular surface. The section also touches upon
descriptors of molecular chirality, mainly developed in his group. The last
section in this series deals with the representation of molecular chirality, as a
qualitative representation of chiral structure is necessary for QSAR studies.
Even though many approaches have been devised for the computer detection,
specification and representation of chirality, correlation with observable
properties has been limited, and the data sets are smaller than those for
non-chiral structure-property relationships.
The succeeding chapter delves into the methods for data
analysis, collectively referred to as “inductive learning methods”. Machine
learning is a common term used by computer scientists for the classification and
generalization of data, essentially to extract regularities from data or to
harvest latent knowledge from databases. Another method of data analysis is
multivariate data analysis, a tool commonly used in chemometrics, as more than
one variable is required to describe chemically relevant objects. Yet another
method is Partial Least Squares (PLS), which can be used to analyze data with
strongly collinear, noisy and numerous X-variables and also to model several
response variables Y.
A chapter on Artificial Neural Networks (ANN) and their
applications, viz., classification, mapping, modeling, prediction of missing
data, reduction of representation, etc., is followed by a section on the concept
of fuzzy logic. Fuzzy logic is viewed as a system of concepts, principles and
methods for modes of reasoning that are approximate rather than exact and
expressed in natural language. The authors demonstrate that pattern recognition
strategies, which are related to the application of human sense, can be
transferred to an algorithmic process applicable in the field of molecular
recognition.
Evolutionary algorithms (EAs), or evolutionary computations, are
stochastic search methods that are inspired by the basic principles of Darwinian
evolution and by DNA-like genetics, and contain a component of randomness in
their algorithmic procedure. The main algorithms covered by this term are
genetic algorithms (GA), evolutionary programming (EP), evolutionary strategies
(ES), genetic programming (GP) and classifier systems (CFS). Their wide-ranging
applications in chemistry include conformational search and structure
optimization, protein-ligand docking, de novo molecular design, pharmacophore
perception, pseudo-receptor modeling, chemical structure handling, QSAR,
chemometrics, combinatorial libraries, crystallography, spectroscopy, structure
prediction of biological macromolecules, force field parameterization, chemical
reaction handling and sequence alignment; in fact, the entire world of chemistry.
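For readers new to the idea, the following is a minimal sketch of a genetic algorithm, assuming a toy problem (maximize the number of 1 bits in a bit string) in place of a real chemical objective such as a conformational energy. The parameter values, the tournament selection scheme and the function names are illustrative assumptions, not the handbook's method.

```python
import random


def fitness(bits):
    """Toy objective: count the 1 bits (a stand-in for a real chemical score)."""
    return sum(bits)


def evolve(n_bits=20, pop_size=30, generations=40, mutation_rate=0.02, seed=0):
    """Run a simple GA; return the best individual and the best fitness per generation."""
    rng = random.Random(seed)  # seeded so that runs are reproducible
    population = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    history = [max(fitness(ind) for ind in population)]
    for _ in range(generations):
        elite = max(population, key=fitness)
        next_gen = [elite[:]]  # elitism: the current best survives unchanged
        while len(next_gen) < pop_size:
            # Selection: each parent is the fittest of three random individuals.
            parent_1 = max(rng.sample(population, 3), key=fitness)
            parent_2 = max(rng.sample(population, 3), key=fitness)
            # One-point crossover combines the two parents.
            cut = rng.randrange(1, n_bits)
            child = parent_1[:cut] + parent_2[cut:]
            # Mutation: flip each bit with a small probability.
            child = [1 - b if rng.random() < mutation_rate else b for b in child]
            next_gen.append(child)
        population = next_gen
        history.append(max(fitness(ind) for ind in population))
    return max(population, key=fitness), history


best, history = evolve()
print(fitness(best), history[0], history[-1])
```

Because the best individual is always carried over, the best fitness never decreases from one generation to the next; the random component enters through initialization, selection, crossover and mutation.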
Expert systems are computer programs derived from artificial
intelligence research which aid experts in making decisions. The next section, on
expert systems, defines the various terms used under this concept and describes
the development of expert systems using rule-based programming, inference
engines, fuzzy logic, etc. The last chapter in the third volume delves into the
applications of chemoinformatics methods, though only selected ones are described
in detail. The first section, on the prediction of physical and chemical
properties, elaborates on lipophilicity, a property widely used with large
databases and quantified by the partition coefficient P or its logarithm, log P.
The amount of existing log P data is negligible compared with the number of
known compounds of interest, hence the need to develop methods to derive log P
from molecular structure. Both substructure-based and whole-molecule approaches
for quantifying log P exist, each with its intrinsic advantages and drawbacks.
QSPR, the computer-assisted prediction of chemical, physical and biological
properties directly from molecular structure, is of great relevance. QSPR methods
can be used to predict properties such as normal boiling points, critical
temperatures, surface tension, Henry’s law constants, gas chromatographic
retention times, ion mobility, etc. The three major parts of QSPR studies
(representation, feature selection and mapping) are covered. This chapter gives
insight into various descriptors, the design and implementation of which is a
current research area in QSPR.
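The two routes to log P described above can be sketched as follows. The first function assumes the usual n-octanol/water system (the review does not name the solvents); the second illustrates a substructure-based estimate as a sum of fragment contributions. The fragment names and contribution values are invented for illustration and are not real parameters from any published scheme.

```python
import math


def log_p_from_concentrations(c_octanol, c_water):
    """log P = log10(P), with P the ratio of equilibrium concentrations
    (assumed here to be n-octanol over water)."""
    if c_octanol <= 0 or c_water <= 0:
        raise ValueError("concentrations must be positive")
    return math.log10(c_octanol / c_water)


def log_p_from_fragments(fragment_counts, contributions):
    """Substructure-based estimate: sum of (count x contribution) over fragments."""
    return sum(count * contributions[name] for name, count in fragment_counts.items())


# A compound 100 times more concentrated in octanol than in water has log P = 2.
print(log_p_from_concentrations(100.0, 1.0))  # 2.0

# Made-up fragment contributions, for illustration only.
toy_contributions = {"CH3": 0.5, "CH2": 0.25}
toy_molecule = {"CH3": 2, "CH2": 1}
print(log_p_from_fragments(toy_molecule, toy_contributions))  # 1.25
```

A whole-molecule approach would replace the fragment sum with a model built on descriptors of the complete structure.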
Web technology, due to its ease of use and high interactivity,
offers many advantages for processing chemical information, and the next section
is accordingly on the web-based calculation of molecular properties. The
development of Java programs and other new technologies, such as servlets, VRML,
XML and CML, is making the web an ideal environment for processing chemical
information. Some representative examples of web tools and of the in-silico
profiling of molecules at Novartis are described by the authors; however, not
all the commercially available software packages are mentioned, which would have
been useful for readers. Correlating structural and spectroscopic information,
for IR and NMR in particular, is an important aspect of chemoinformatics. The
digital encoding of IR spectra and of chemical structures, and the computational
correlation between NMR spectra and molecular structure, are described in two
sections. The SpecInfo, CSEARCH, NMRShiftDB and CNMR databases form the basis
for shift prediction tools. From these, compressed representations of the data,
such as HOSE code tables, can be generated, which aid in chemical shift
prediction for new structures. Structure validation by ab initio quantum
mechanical computations is now feasible with PCs and workstations. The
simultaneous use of various spectral data leads to the exact structure
elucidation of a molecule. The next section sheds light on the development of
automatic systems for computer-assisted structure elucidation (CASE), so far only
for small organic molecules. A typical CASE process involves spectral database
searching and storage as a bit-string representation.
The last volume of the Handbook is on chemical reactions and
synthesis design. The analysis and processing of reaction data is
very important to chemists for solving any synthetic problem. Topology-based
reaction classification codes and Kohonen neural networks help in retrieving
reaction information from different sources by using algorithmically derived
hash codes. Computer-Assisted Synthesis Design (CASD) looks at technical ways of
organizing communication between computer and chemist for the description of
reactions. Molecules are described by a connectivity table, by matrices or by a
numerical linear notation. These three systems lead to three methods for coding
reactions in CASD programs: the transform approach, the BE-matrix approach and
the numerical approach. The next article features an interesting design system,
WODCA (Workbench for the Organization of Data for Chemical Applications). All
aspects of organic reactions (reaction planning, reaction prediction and
synthesis design) are dealt with. Specific examples are given to
explain the various disconnection strategies available for the perception of
strategic bonds within a target compound.
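The BE-matrix approach can be illustrated with a small sketch. It assumes the Dugundji-Ugi formalism, in which off-diagonal entries of a bond-electron (BE) matrix hold bond orders, diagonal entries hold free valence electrons, and a reaction is the difference R = E - B between the BE matrices of the products (E) and the reactants (B). The example reaction (H2 + Cl2 giving 2 HCl) and the function names are chosen here for illustration and do not come from the handbook.

```python
atoms = ["H1", "H2", "Cl1", "Cl2"]

# BE matrix of the reactants: H1-H2 and Cl1-Cl2 single bonds.
# Diagonal: free valence electrons (0 on each H, 6 on each Cl).
B = [
    [0, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 6, 1],
    [0, 0, 1, 6],
]

# BE matrix of the products: H1-Cl1 and H2-Cl2 single bonds.
E = [
    [0, 0, 1, 0],
    [0, 0, 0, 1],
    [1, 0, 6, 0],
    [0, 1, 0, 6],
]


def reaction_matrix(b, e):
    """R = E - B, computed entry by entry."""
    return [[e_ij - b_ij for b_ij, e_ij in zip(b_row, e_row)]
            for b_row, e_row in zip(b, e)]


def changed_bonds(r, labels):
    """List (atom, atom, change in bond order) for every bond made or broken."""
    changes = []
    for i in range(len(labels)):
        for j in range(i + 1, len(labels)):
            if r[i][j] != 0:
                changes.append((labels[i], labels[j], r[i][j]))
    return changes


R = reaction_matrix(B, E)
print(changed_bonds(R, atoms))
# [('H1', 'H2', -1), ('H1', 'Cl1', 1), ('H2', 'Cl2', 1), ('Cl1', 'Cl2', -1)]
```

The entries of R sum to zero because electrons are only redistributed, and the atoms that appear in the changed bonds make up the reaction center.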
Drug discovery is undoubtedly the most important application
of chemoinformatics. All chemoinformatics activities, viz., chemical libraries,
virtual screening, structure-activity relationships, high-throughput screening,
in-silico screening, de novo ligand design and data mining, are vital to the
process of drug discovery. In the drug discovery paradigm (HTS hits, HTS actives,
lead series, drug candidates, launched drug) the focus has shifted from good
quality drug candidates to good quality leads. The succeeding section deals with
QSAR contributions to drug design. QSAR applications in drug design include the
transport and distribution of drugs in biological systems, enzyme inhibition and
the correlation of different kinds of biological activities. Classical QSAR
studies do not consider the 3D structures of drugs or their chirality. CoMFA
(Comparative Molecular Field Analysis) was therefore developed for deriving 3D
QSAR models. It is mostly used in the field of ligand-protein interactions,
describing affinity or inhibition constants. Yet another section, on 3D and nD
QSAR methods, describes Comparative Molecular Moment Analysis (CoMMA), a rapid
method of determining 3D QSAR descriptors based on a molecule’s moments of shape
and charge distribution; these descriptors are then converted, using PLS, into a
QSAR model with better predictivity. The methodology of nD QSAR adds to the 3D
QSAR methodology by incorporating unique physical characteristics into the
available descriptor pool for the creation of models. Other types of QSAR
methods, such as 5D QSAR, RD QSAR, FEFF and MI QSAR, are briefly touched upon.
The implementation of these methodologies will add a wealth of information about
how small organic molecules interact with biological molecules and
macromolecules.
An overview of applications of combinatorial chemistry in drug
discovery is given in the next section, entitled “high throughput chemistry”.
Traditionally, the term high-throughput chemistry encompasses all the
technologies of combinatorial chemistry and the multiple parallel synthesis of
chemical entities by condensing a small number of reagents together in all
possible combinations, with the aim of expediting the drug discovery process.
Some of the techniques, such as matrix and split synthesis, encoding libraries
and deconvolution, are described schematically. The concepts of solid-phase
synthesis, solution-phase synthesis, dynamic combinatorial chemistry and
combinatorial biosynthesis are explained in detail. Advances in HTS
and combinatorial chemistry have led to large collections of compounds, which
require equally advanced methods for their property characterization. The field
of molecular diversity allows the selection of dissimilar compounds from a large
region of chemical space in order to discover new leads. The methods and
descriptors available to solve the problem of making a diverse selection are
summarized in this section.
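One simple way to make a diverse selection is a greedy MaxMin procedure: repeatedly pick the compound whose nearest already-selected neighbor is farthest away. The review does not say which selection methods the handbook covers, so MaxMin is an assumption here, and the two-dimensional descriptor vectors below are made up.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two descriptor vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def maxmin_select(descriptors, n_pick, start=0):
    """Greedy MaxMin: return the indices of n_pick mutually dissimilar compounds."""
    selected = [start]
    while len(selected) < min(n_pick, len(descriptors)):
        best_idx, best_dist = None, -1.0
        for idx, vec in enumerate(descriptors):
            if idx in selected:
                continue
            # Distance from this candidate to its closest already-selected compound.
            nearest = min(euclidean(vec, descriptors[s]) for s in selected)
            if nearest > best_dist:
                best_idx, best_dist = idx, nearest
        selected.append(best_idx)
    return selected


# Hypothetical descriptor vectors: two tight pairs and one outlier.
compounds = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (10.0, 0.0)]
print(maxmin_select(compounds, 3))  # [0, 4, 2]
```

The near-duplicates at indices 1 and 3 are skipped, which is the behavior wanted when choosing a small, diverse subset from a large collection.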
The pharmacophore approach is intermediate between 3D QSAR, as a
strictly ligand-based approach, and full computation at the quantum mechanical
level of the dynamic interaction between the ligand and the receptor site.
Applications of the pharmacophore are in de novo drug design, guidance for the
design of targeted combinatorial libraries, interpretation of data from
high-throughput screening and, mostly, database searches of 3D structures of
small molecules. The current trends in pharmacophore development include 3D
substructure perception, electron-conformational methods and property-based
pharmacophores.
There are different approaches to structure generation,
also known as de novo design, of potential ligands that can bind to the receptor
site of an enzyme whose 3D structure is known. The de novo design process
involves steps such as analyzing the structural information on the receptor to
determine the active site, meeting the requirements of the active site by placing
appropriate chemical functionality in the required locations, constructing a
molecular scaffold to hold these groups in place, and finally sorting and
selecting the designed molecules by estimating their chemical and biological
properties. In practice, de novo systems are generally used in combination with
other modeling tools, and the initially designed structures are modified by
medicinal chemists before any synthesis is carried out. Some of the computer
programs reported in the literature are SPROUT, TOPAS, LEGEND and SEEDS; however,
most of the work in this area is not published. The limitation of de novo design
systems is that they do not take into account factors such as transport
properties, toxicity and stability.
The next section introduces the reader to the basic concept of
docking, that is, the formation of non-covalent ligand-receptor complexes, and to
the docking problem, i.e., the task of predicting the structure of the resulting
complex. There are two opposing approaches to this: either to reformulate it as
a discrete problem that can be solved with combinatorial algorithms, or to use
stochastic search algorithms. Basically, docking is an energy minimization
problem concerned with the search for the lowest free energy binding mode of a
ligand within a protein binding site. After the search, the next step in docking
is to rank the different configurations generated for one ligand with respect to
their binding affinity. Special aspects of the docking problem, such as protein
flexibility, water molecules, protein homology and combinatorial docking, are
described briefly.
The increase in structural information on proteins and the
systematic evaluation of the geometries of protein-ligand complexes using protein
crystallography or multidimensional NMR will expedite the process of lead
discovery. However, raw information alone is not enough; it has to be evaluated,
distilled and transformed into a single data format for storage. In structural
biology the central database system is the PDB (Protein Data Bank), which is
accessible to the public. This section describes an object-oriented database
tool, Relibase, developed by the authors to handle protein-ligand information.
Relibase operates on intramolecular geometries and correlated intermolecular
interaction patterns, and also has tools for protein information such as sequence
similarity, secondary structural elements or solvent accessibility. A water
module in Relibase can detect surface-exposed as well as deeply buried water
molecules in the protein-ligand interface. Specialized topics, such as the
comparative analysis of ligand binding pockets and of secondary structural
elements that provide special binding motifs in proteins, are also dealt with.
The last chapter of the handbook consists of two sections that
deal with the interface of chemoinformatics and bioinformatics: protein
structure, sequence and genome. The first section deals with the prediction of 3D
protein structure from amino acid sequence. The databases of known protein
sequences (1,000,000) are expanding due to the implementation of large-scale
genome projects, but the proteins whose structures are known (PDB, 20,000) are
considerably fewer in comparison. In practice, the prediction of 3D structure
from sequence is challenging, first because the energy difference between native
and unfolded proteins is extremely small, and second because the high complexity
of protein folding requires more computing time.
There are three prediction methods that try to bridge the
sequence-structure gap: homology modeling, threading and 1D prediction. For
proteins to perform their function, they need to maintain their specific 3D
structure, so structure is conserved during evolution. This evolutionary history
is used successfully for aligning protein (or nucleotide) sequences. Generally,
advanced alignment methods first use programs such as BLAST and FASTA and then
apply a dynamic programming algorithm. 1D prediction can be a useful precursor
to 3D prediction; the 1D features predicted are solvent accessibility,
transmembrane strands and helices, and regions of structural switches.
Predictions in two or three dimensions have met with limited success so far. The
section on genome bioinformatics explores the vast information encoded in DNA to
identify all the genetic elements that perform any biological function. The
comprehensive analysis of a genome starts with the identification of coding
regions, regulatory sites, tRNAs and rRNAs. The two major branches of
high-throughput analysis are expression analysis and ‘proteomics’, i.e., the
study of the protein products of the genome and their interactions and
functions. Though major advances have been made in these areas, topics such as
tertiary structure prediction and protein-protein interaction remain unsolved to
this day.
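The dynamic programming step used in sequence alignment can be sketched with the classic global alignment recurrence (the Needleman-Wunsch scheme). This is offered as an illustration only: the scoring values (match +1, mismatch -1, gap -1) are assumptions, real protein alignments use substitution matrices, and the heuristics of BLAST and FASTA are not reproduced here.

```python
def global_alignment_score(seq_a, seq_b, match=1, mismatch=-1, gap=-1):
    """Best global alignment score of two sequences by dynamic programming."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against nothing costs one gap per residue.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align the two residues
                score[i - 1][j] + gap,       # gap in seq_b
                score[i][j - 1] + gap,       # gap in seq_a
            )
    return score[-1][-1]


print(global_alignment_score("AC", "AGC"))           # 1
print(global_alignment_score("GATTACA", "GATTACA"))  # 7
```

Each cell depends only on its three neighbors, so the table is filled in time proportional to the product of the two sequence lengths.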
The volume concludes with a brief chapter on future directions
in the field of chemoinformatics by Gasteiger. He foresees chemoinformatics
gaining importance in chemistry and being incorporated into regular chemistry
curricula. Computer-Assisted Structure Elucidation (CASE) and Computer-Assisted
Synthesis Design (CASD) would be integrated into the daily work of bench
chemists. The extension of chemoinformatics methods to theoretical chemistry,
the simulation of reactions, the modeling of biochemical and metabolic
reactions, and the study of proteins will be the future areas of thrust for
chemoinformatics. Another field of great activity will be the merging of
bioinformatics and chemoinformatics; their common problems can be solved using
methods developed in both fields. Drug design will no longer be the major
domain of chemoinformatics; other fields such as material science, non-linear
optical properties, adhesives, electrical energy, hair coloring chemicals,
detergents, etc., will also be part of chemoinformatics. Another challenge
before chemoinformatics is multivariate optimization, i.e., the simultaneous
optimization of several properties: for example, methods should predict not just
the activity of a drug but also its toxicity, solubility, penetration, etc.
Gasteiger argues for chemists to use electronic lab notebooks to record data,
which can then be used to feed other information sources such as manuscripts,
journals, books and databases. Finally, chemoinformatics should speak the
language of chemists and provide them with just the desired information and not
heaps of unnecessary data.
Given its rich content and original presentation, my
personal opinion is that this is the first set of books that
should find a place on every chemoinformatician’s desk and in every university
library in the world.
M.Karthikeyan
Scientist
Pune - 411 008
m.karthikeyan@ncl.res.in
Handbook of Chemoinformatics: From Data to Knowledge, Volumes 1-4. Edited by Johann
Gasteiger (University of Erlangen-Nürnberg). Wiley-VCH Verlag GmbH & Co. KGaA:
Weinheim. 2003. xlvii + 1870 pp. $750.00. ISBN 3-527-30680-3.