A Handbook of Chemoinformatics

 Edited by J. Gasteiger

 Chemoinformatics is a generic term that encompasses the design, creation, organization, management, retrieval, analysis, dissemination, visualization and use of chemical information. The basic goal of this field is to transform data into knowledge through information processing for the intended purpose of making better decisions faster. The new discipline of chemoinformatics covers the application of computer-assisted methods to chemical problems such as information storage and retrieval, the prediction of physical, chemical or biological properties of compounds, spectra simulation, structure elucidation, reaction modeling, synthesis planning and drug design.

  Although this field is recognized as an important research area, its overall objectives and classifications had not been set out until the recent release of a comprehensive four-volume handbook on chemoinformatics, edited by Prof. Gasteiger. The handbook contains in-depth contributions from leading authors around the world, with the content organized into chapters dealing with the representation of molecular structures and reactions, data types and databases/data sources, search methods, methods for data analysis, and applications. The Handbook of Chemoinformatics is the first reference work to be exclusively devoted to this developing field, from data to knowledge, and will set the standard as the premier information source for the next decade. This handbook is a must-read for experts as well as students of chemistry and biology.

  The handbook provides a comprehensive and coherent overview of the state of the art of chemoinformatics. The first volume begins with the history of chemoinformatics, aptly written by Peter Willett. The subsequent few chapters deal with chemical nomenclature and the representation of chemical structures using graph theory and SMILES. Chemoinformatics uses a wide variety of algorithms for indexing and retrieving chemical compounds in databases, and four chapters are devoted to processing the constitutional information of molecules. Computational methods for 3D structure generation, and for ligand- and structure-based design of the so-called bioactive conformation of a potential drug, are described. Shape analysis is a powerful tool in chemistry, since investigations of molecular recognition in receptor-ligand interactions are likely to be more precise near the molecular surface than elsewhere around the molecule. The large amount of data generated by computation and experiment needs to be visualized in order to identify trends and structures and to recognize shapes and patterns. In this context, the strategic basis of molecular graphics for optimizing information transfer between human activity and computational processes assumes great importance.

  The chemoinformatics of chemical reactions is not as far developed as that of chemical structures. The two fundamental tasks for chemical reaction representation are the prediction of the outcome of a reaction and the design of a synthesis. Both data-driven and model-driven reaction classification methods used for knowledge extraction are described. However, the automatic assignment of a reaction center is not presented.

  Data acquisition and data analysis are important tools for building up knowledge in chemistry and for ensuring that an outgoing product meets all customer requirements. The next topic, Experimental Design (ED), familiarizes readers with this mathematical technique for planning and carrying out experiments so that the maximum possible information is gained from experimental data. Standard data formats are essential for facilitating the exchange of data between scientists. XML (eXtensible Markup Language) supports the electronic exchange of information and documents in every discipline. This standardized language has a specific extension for handling chemical information, and many of its features are under review or development.

  There are various types of databases available in the field of chemistry, which store treasures of information. Abstracting and indexing in bibliographic databases is described in detail. As the CAS information system has been the major provider of chemical information since the start of the computer age, a complete section is devoted to the CAS databases (CAplus and Registry) and their online access tools (STN Express and SciFinder). The largest information database on organic compounds is the Beilstein database, and with additional features such as CrossFire its potential has now been realized. Databases for retrieving inorganic and organometallic compounds are also included in this chapter. The Cambridge Structural Database (CSD) provides information on the 3D structures of small organic and organometallic molecules. Databases for spectroscopy, patents, environmental information, molecular topology and biochemistry are mentioned here as well. The Internet is the largest repository of data, and the next section leads naturally to chemistry on the Internet, with an overview of the Internet technologies used to harness chemical knowledge. Laboratories generate a lot of data that needs to be organized and managed. The chapter concludes with the basic structure, modules and functioning of Laboratory Information Management Systems (LIMS).

  Chemical structure search is the most important method of accessing chemical information. This section begins with the methods available for 2D structure and substructure search. Markush structures are the generic chemical structures used in patents, and their retrieval poses a problem in chemical structure searching. This article throws light on the current state of the art of Markush topological search systems. Computable structure similarities are strongly correlated with biological similarities (the structure-property principle); similarity searching is now widely used for virtual screening as a precursor to substructural analysis.
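  As an illustration of what a computable structure similarity can look like, here is a minimal sketch in Python. It assumes each molecule has already been reduced to a fingerprint, represented as the set of bit positions that are switched on, and it uses the Tanimoto coefficient, a common similarity measure that this review does not itself name. The function names and the toy fingerprints are hypothetical and not taken from the handbook.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints given as sets of 'on' bit positions."""
    if not fp_a and not fp_b:
        return 1.0  # convention chosen here: two empty fingerprints count as identical
    common = len(fp_a & fp_b)
    return common / (len(fp_a) + len(fp_b) - common)


def rank_by_similarity(query_fp, library):
    """Return (name, similarity) pairs for a library dict, most similar first."""
    scored = [(name, tanimoto(query_fp, fp)) for name, fp in library.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Toy fingerprints, invented purely for illustration.
    library = {"mol_1": {1, 2, 3, 8}, "mol_2": {2, 3, 4}, "mol_3": {10, 11}}
    for name, score in rank_by_similarity({1, 2, 3}, library):
        print(name, round(score, 2))
```

  In a virtual screening setting, the top of such a ranked list would be the compounds passed on for closer inspection.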

  Chemical structure information can be correlated with physical, chemical or biological data to build a model, which can then be used to predict new data. The third volume of the handbook focuses on the calculation of physical and chemical data through direct computational methods. Molecular mechanics or force field methods are often used, as they are rapid and can be applied to a large number of molecules with many atoms. Force field methods aimed mainly at small molecules include MM2, MM3, Tinker, UFF, Momec and Osmos, and those for biological molecules include AMBER, CHARMM, GROMOS, OPLS, ECEPP, CVFF and MMFF. Unlike molecular mechanics methods, quantum mechanical methods cannot readily be applied to large molecules or large data sets. The molecular orbital theories are described first, and the properties from quantum mechanical calculations that are of interest to chemoinformatics, for instance net atomic charges, dipole moments, polarizabilities and orbital energies, are described in detail. The extra information and detail provided by quantum mechanics are important for accurate work involving specific interactions, such as docking studies.

  The eighth chapter of this volume provides detailed information on descriptors for chemical compounds. As more than 1500 descriptors are known, care must be taken to choose the correct set. The first section covers topological descriptors, which have now been superseded by more sophisticated descriptors. The search for relationships between molecular structure and biological activity can be carried out efficiently using geometric descriptors, with their large information content. The next section, by Gasteiger, covers a series of structure coding methods, that is, different ways of encoding a molecular structure into a vector of numerical values. He suggests a hierarchy of structure representation: constitution, 3D structure and molecular surface. The section also touches upon descriptors of molecular chirality, mainly developed in his group. The last section in this series deals with the representation of molecular chirality, as a qualitative representation of chiral structure is necessary for QSAR studies. Even though many approaches have been devised for the computer detection, specification and representation of chirality, correlation with observable properties has been limited, and the data sets are smaller than those for non-chiral structure-property relationships.

  The succeeding chapter delves into the methods for data analysis, collectively referred to as “inductive learning methods”. Machine learning is a common term used by computer scientists for the classification and generalization of data, basically to extract regularity from data or to harvest latent knowledge from databases. Another method of data analysis is multivariate data analysis, a tool commonly used in chemometrics, as more than one variable is required to describe chemistry-relevant objects. Yet another method is Partial Least Squares (PLS), which can be used to analyze data with strongly collinear, noisy and numerous X-variables and also to model several response variables Y.

  A chapter on Artificial Neural Networks (ANNs) and their applications (classification, mapping, modeling, prediction of missing data, reduction of representation, etc.) is followed by a section on the concept of fuzzy logic. Fuzzy logic is viewed as a system of concepts, principles and methods for modes of reasoning that are approximate rather than exact and are expressed in natural language. The authors demonstrate that pattern recognition strategies, which are related to the application of the human senses, can be transferred to an algorithmic process applicable in the field of molecular recognition.

  Evolutionary algorithms (EAs), or evolutionary computations, are stochastic search methods that are inspired by the basic principles of Darwinian evolution and by DNA-like genetics, and that contain a component of randomness in their algorithmic procedure. The main algorithms grouped under this term are genetic algorithms (GA), evolutionary programming (EP), evolutionary strategies (ES), genetic programming (GP) and classifier systems (CFS). Their vast applications in chemistry include conformational search and structure optimization, protein-ligand docking, de novo molecular design, pharmacophore perception, pseudo-receptor modeling, chemical structure handling, QSAR, chemometrics, combinatorial libraries, crystallography, spectroscopy, structure prediction of biological macromolecules, force field parameterization, chemical reaction handling and sequence alignment; in fact, the entire world of chemistry.
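  To make the idea of a stochastic, evolution-inspired search concrete, the following is a minimal genetic algorithm sketch in Python. It is not taken from the handbook: the toy objective (counting the 1-bits in a bit string), the tournament selection, the one-point crossover and all parameter values are assumptions chosen only for illustration. A chemical application would replace the bit string with an encoding of, say, torsion angles and the fitness with an energy or score.

```python
import random


def one_max(bits):
    """Toy fitness (an assumption for this sketch): the number of 1-bits."""
    return sum(bits)


def genetic_algorithm(fitness, n_bits=20, pop_size=30, generations=40,
                      mutation_rate=0.02, seed=0):
    """Evolve bit strings; return the best individual and the best fitness per generation."""
    rng = random.Random(seed)  # seeded so that a run is reproducible
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        history.append(fitness(pop[0]))
        next_pop = [pop[0][:]]  # elitism: the current best survives unchanged
        while len(next_pop) < pop_size:
            # Tournament selection: the fitter of three random individuals becomes a parent.
            parent_a = max(rng.sample(pop, 3), key=fitness)
            parent_b = max(rng.sample(pop, 3), key=fitness)
            cut = rng.randrange(1, n_bits)  # one-point crossover
            child = parent_a[:cut] + parent_b[cut:]
            # Mutation: flip each bit with a small probability.
            child = [1 - b if rng.random() < mutation_rate else b for b in child]
            next_pop.append(child)
        pop = next_pop
    best = max(pop, key=fitness)
    history.append(fitness(best))
    return best, history


if __name__ == "__main__":
    best, history = genetic_algorithm(one_max)
    print("best fitness:", one_max(best))
```

  Because the best individual is always carried over, the recorded best fitness never decreases from one generation to the next; the randomness enters through initialization, selection, crossover and mutation.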

  Expert systems are computer programs, derived from artificial intelligence research, that aid experts in making decisions. The next section, on expert systems, defines the various terms used under this concept and describes the development of expert systems using rule-based programming, inference engines, fuzzy logic and so on. The last chapter in the third volume delves into the application of chemoinformatics methods, though only selected ones are described in detail. The first section, on the prediction of physical and chemical properties, elaborates on lipophilicity, a property widely applied to large databases and quantified by the partition coefficient P or its logarithm, log P. The existing log P data are negligible compared with the number of known compounds of interest, hence the need to develop methods to derive log P from molecular structure. Both substructural and whole-molecule approaches for quantifying log P exist, each with its intrinsic advantages and drawbacks. QSPR, the computer-assisted prediction of chemical, physical and biological properties directly from molecular structure, is of great relevance. QSPR methods can be used to predict properties such as normal boiling points, critical temperatures, surface tension, Henry’s law constants, gas chromatographic retention times and ion mobility. The three major parts of a QSPR study (representation, feature selection and mapping) are covered. This chapter gives insight into various descriptors, the design and implementation of which is a current research area in QSPR.
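  The two log P ideas mentioned above can be sketched in a few lines of Python. The first function simply applies the definition, assuming the conventional 1-octanol/water system (the review does not specify the solvent pair). The second illustrates the substructural approach in its simplest additive form; the function names are hypothetical, and the fragment contributions must be supplied by the caller, since no real parameter set is reproduced here.

```python
import math


def log_p(conc_octanol, conc_water):
    """log P from equilibrium concentrations in the two phases (same units for both)."""
    if conc_octanol <= 0 or conc_water <= 0:
        raise ValueError("concentrations must be positive")
    return math.log10(conc_octanol / conc_water)


def estimate_log_p(fragment_counts, contributions):
    """Additive substructural estimate: sum of count * contribution over all fragments.

    Both arguments are dicts keyed by fragment name. The contribution values are
    supplied by the caller and are placeholders in this sketch, not published parameters.
    """
    return sum(count * contributions[name] for name, count in fragment_counts.items())


if __name__ == "__main__":
    # A compound 100 times more concentrated in octanol than in water.
    print(log_p(100.0, 1.0))
    # Placeholder fragment values, invented for illustration only.
    print(estimate_log_p({"frag_a": 2, "frag_b": 1}, {"frag_a": 0.5, "frag_b": -1.0}))
```

  Practical fragment methods add correction terms for interactions between fragments, which is one of the drawbacks of the purely additive form shown here.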

  Web technology, owing to its ease of use and high interactivity, offers many advantages for processing chemical information, and the next section is accordingly on the web-based calculation of molecular properties. The development of Java programs and other new technologies such as servlets, VRML, XML and CML is making the web an ideal environment for processing chemical information. Some representative examples of web tools and of the in-silico profiling of molecules at Novartis are described by the authors; however, not all the commercially available software packages are mentioned, which would have been useful for readers. Correlating structural and spectroscopic information, IR and NMR in particular, is an important aspect of chemoinformatics. The digital encoding of IR spectra and of the chemical structure, and the computational correlation between NMR spectra and molecular structure, are described in two sections. The SpecInfo, CSEARCH, NMRShiftDB and CNMR databases form the basis for shift prediction tools. From these, compressed representations of the data, such as HOSE code tables, can be generated to aid chemical shift prediction for new structures. Structure validation by ab initio quantum mechanical computations is now feasible with PCs and workstations. The simultaneous use of various spectral data provides leads toward the exact structure elucidation of a molecule. The next section throws light on the development of automatic systems for structure elucidation, known as CASE (Computer Assisted Structure Elucidation), so far only for small organic molecules. A typical CASE process involves spectral database searching and storage as a bit string representation.

  The last volume of the Handbook is on chemical reactions and synthesis design. The analysis and processing of reaction data is very important to chemists for solving any synthetic problem. Topology-based reaction classification codes and Kohonen neural networks help in retrieving reaction information from different sources by using algorithmically derived hash codes. Computer Assisted Synthesis Design (CASD) looks at technical ways of organizing the communication between computer and chemist for the description of reactions. Molecules are described by a connectivity table, by matrices or by a numerical linear notation. These three systems lead to three methods for coding reactions in CASD programs: the transform approach, the BE-matrices approach and the numerical approach. The next article features an interesting design system, WODCA (Workbench for the Organization of Data for Chemical Applications). All the aspects of organic reactions (reaction planning, reaction prediction and synthesis design) are dealt with. Specific examples are given to explain the various disconnection strategies available for the perception of strategic bonds within a target compound.

  Drug discovery is undoubtedly the most important application of chemoinformatics. All chemoinformatics activities (chemical libraries, virtual screening, structure-activity relationships, high throughput screening, in-silico screening, de novo ligand design and data mining) are vital to the process of drug discovery. The drug discovery paradigm (HTS hits, HTS actives, lead series, drug candidates, launched drug) has shifted its focus from good-quality drug candidates to good-quality leads. The succeeding section deals with QSAR contributions to drug design. QSAR applications in drug design include the transport and distribution of drugs in biological systems, enzyme inhibition and the correlation of different kinds of biological activities. Classical QSAR studies do not consider the 3D structures of drugs or their chirality. CoMFA (Comparative Molecular Field Analysis) was therefore developed for deriving 3D QSAR models. It is mostly used in the field of ligand-protein interaction, describing affinity and inhibition constants. Yet another section, on 3D and nD QSAR methods, describes Comparative Molecular Moment Analysis (CoMMA), a rapid method of determining 3D QSAR descriptors based on the moments of a molecule’s shape and charge distribution; these descriptors are then converted by PLS into a QSAR model with better predictivity. The nD QSAR methodology adds to the 3D QSAR methodology by incorporating unique physical characteristics into the available descriptor pool for the creation of models. Other types of QSAR methods, such as 5D QSAR, RD QSAR, FEFF and MI QSAR, are briefly touched upon. The implementation of these methodologies will add a wealth of information about how small organic molecules interact with biological molecules and macromolecules.

  An overview of applications of combinatorial chemistry in drug discovery is given in the next section, entitled “high throughput chemistry”. Traditionally, the term high throughput chemistry encompasses all the technologies of combinatorial chemistry and multiple parallel synthesis, in which chemical entities are prepared by condensing a small number of reagents together in all possible combinations with the aim of expediting the drug discovery process. Some of the techniques, such as matrix and split synthesis, library encoding and deconvolution, are described schematically. The concepts of solid phase synthesis, solution phase synthesis, dynamic combinatorial chemistry and combinatorial biosynthesis are explained in detail. Advances in HTS and combinatorial chemistry have led to large collections of compounds, which require equally advanced methods for their property characterization. The field of molecular diversity allows the selection of dissimilar compounds from a large range of chemical space in order to discover new leads. The methods and descriptors available for solving the problem of making a diverse selection are summarized in this section.
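  The phrase “all possible combinations” can be made concrete with a short Python sketch that enumerates a virtual combinatorial library as the Cartesian product of reagent sets. The reagent labels and the function name are hypothetical placeholders; a real workflow would join actual building blocks through a defined reaction rather than pair up labels.

```python
from itertools import product


def enumerate_library(*reagent_sets):
    """Return every combination that takes exactly one reagent from each set."""
    return list(product(*reagent_sets))


if __name__ == "__main__":
    # Placeholder reagent labels, invented for illustration.
    amines = ["amine_1", "amine_2", "amine_3"]
    acids = ["acid_1", "acid_2"]
    library = enumerate_library(amines, acids)
    print(len(library))  # 3 amines x 2 acids
    for combination in library:
        print(combination)
```

  The library size is the product of the sizes of the reagent sets, which is why even modest reagent lists quickly produce more compounds than can be characterized individually and why diversity-based selection becomes necessary.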

  The pharmacophore approach is intermediate between 3D QSAR, a strictly ligand-based approach, and full quantum mechanical computation of the dynamic interaction between the ligand and the receptor site. The pharmacophore is applied in de novo drug design, in guiding the design of targeted combinatorial libraries, in interpreting data from high throughput screening and, most of all, in database searches of the 3D structures of small molecules. Current trends in pharmacophore development include 3D substructure perception, electron-conformational methods and property-based pharmacophores.

  Different approaches are used for structure generation, also known as de novo design, of potential ligands that can bind to the receptor site of an enzyme whose 3D structure is known. The de novo design process involves several steps: analyzing the structural information on the receptor to determine the active site; meeting the requirements of the active site by placing appropriate chemical functionality in the required locations and constructing a molecular scaffold to hold it in place; and finally sorting and selecting the designed molecules by estimating their chemical and biological properties. In practice, de novo systems are generally used in combination with other modeling tools, and the initially designed structures are modified by medicinal chemists before any synthesis is carried out. Some of the computer programs reported in the literature are SPROUT, TOPAS, LEGEND and SEEDS; however, most of the work in this area is not published. The limitation of de novo design systems is that they do not take into account factors such as transport properties, toxicity and stability.

  The next section introduces the reader to the basic concept of docking, that is, the formation of non-covalent ligand-receptor complexes, and to the docking problem, i.e., the task of predicting the structure of the resulting complex. There are two opposing approaches to this problem: either to reformulate it as a discrete problem that can be solved with combinatorial algorithms, or to use stochastic search algorithms. Basically, docking is an energy minimization problem concerned with the search for the lowest free energy binding mode of a ligand within a protein binding site. After the search, the next step in docking is to rank the different configurations generated with respect to their binding affinity. Special aspects of the docking problem, such as protein flexibility, water molecules, protein homology and combinatorial docking, are described briefly.

  The increase in structural information on proteins and the systematic evaluation of the geometries of protein-ligand complexes using protein crystallography or multidimensional NMR will expedite the process of lead discovery. However, mere raw information is not enough; it has to be evaluated, distilled and transformed into a unique data format for storage. In structural biology the central database system is the PDB (Protein Data Bank), which is accessible to the public. This section describes an object-oriented database tool, Relibase, developed by the authors to handle protein-ligand information. Relibase operates on intramolecular geometries and correlated intermolecular interaction patterns, and also has tools for protein information such as sequence similarity, secondary structural elements and solvent accessibility. The water module in Relibase can detect surface-exposed as well as deeply buried water molecules in the protein-ligand interface. Specialized topics, such as the comparative analysis of ligand binding pockets and of secondary structural elements that provide special binding motifs in proteins, are also dealt with.

  The last chapter of the handbook consists of two sections that deal with the interface of chemoinformatics and bioinformatics: protein structure, sequence and genome. The first section deals with the prediction of 3D protein structure from the amino acid sequence. The databases of known protein sequences (1,000,000) are expanding owing to large-scale genome projects, but the proteins whose structures are known (PDB, 20,000) are considerably fewer in comparison. In practice, the prediction of 3D structure from sequence is challenging, first because the energy difference between native and unfolded proteins is extremely small, and second because the high complexity of protein folding requires more computing time.

  Three prediction methods try to bridge the sequence-structure gap: homology modeling, threading and 1D prediction. For proteins to perform their function, they need to maintain a specific 3D structure. The resulting evolutionary history is used successfully for aligning protein (or nucleotide) sequences. Generally, advanced alignment methods first use programs such as BLAST and FASTA and then apply a dynamic programming algorithm. 1D prediction can be a useful precursor to 3D prediction, and the 1D features predicted are solvent accessibility, transmembrane strands, helices and regions of structural switches. Predictions in two or three dimensions have met with limited success so far. The section on genome bioinformatics explores the vast information encoded in DNA in order to identify all the genetic elements that perform any biological function. The comprehensive analysis of a genome starts with the identification of coding regions, regulatory sites, tRNAs and rRNAs. The two major branches of high throughput analysis are expression analysis and “proteomics”, i.e., the study of the protein products of the genome and their interactions and functions. Though major advances have been made in these areas, problems such as tertiary structure prediction and protein-protein interaction remain unsolved to this day.
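  The dynamic programming step mentioned above can be illustrated with a short Python sketch that computes the optimal global alignment score of two sequences. It follows the classic Needleman-Wunsch recurrence, which the review does not name explicitly, and the scoring values (match +1, mismatch -1, gap -1) are illustrative assumptions rather than a recommended substitution scheme.

```python
def global_alignment_score(seq_a, seq_b, match=1, mismatch=-1, gap=-1):
    """Optimal global alignment score of two sequences via dynamic programming."""
    # prev[j] holds the best score for aligning the processed prefix of seq_a with seq_b[:j].
    prev = [j * gap for j in range(len(seq_b) + 1)]
    for i in range(1, len(seq_a) + 1):
        curr = [i * gap]  # aligning seq_a[:i] against an empty prefix costs i gaps
        for j in range(1, len(seq_b) + 1):
            substitution = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            curr.append(max(
                prev[j - 1] + substitution,  # align the two residues
                prev[j] + gap,               # gap in seq_b
                curr[j - 1] + gap,           # gap in seq_a
            ))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(global_alignment_score("ACGT", "ACGT"))
    print(global_alignment_score("AC", "AGC"))
```

  Only two rows of the score table are kept, so memory grows with the length of one sequence, while the running time is proportional to the product of the two lengths. That cost is the reason fast heuristic programs are typically run first to narrow down the candidates.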

  The volume concludes with a brief chapter by Gasteiger on future directions in the field of chemoinformatics. He foresees chemoinformatics gaining importance in chemistry and being incorporated into regular chemistry curricula. The Computer Assisted Structure Elucidation (CASE) process and Computer Assisted Synthesis Design (CASD) would be integrated into the daily work of bench chemists. Chemoinformatics methods will be extended to theoretical chemistry, and the simulation of reactions, the modeling of biochemical and metabolic reactions and the study of proteins will be future areas of thrust. Another field of great activity will be the merging of bioinformatics and chemoinformatics; their common problems can be solved using methods developed in both fields. Drug design will no longer be the major domain of chemoinformatics; other fields such as materials science, non-linear optical properties, adhesives, electrical energy, hair coloring chemicals and detergents will also become part of chemoinformatics. Another challenge before chemoinformatics is multivariate optimization, i.e., the simultaneous optimization of several properties; for example, methods should predict not just the activity of a drug but also its toxicity, solubility, penetration and so on. Gasteiger argues for chemists to use electronic lab notebooks to record data, which can then be used to fill other information sources such as manuscripts, journals, books and databases. Finally, chemoinformatics should speak the language of chemists and provide them with just the desired information and not heaps of unnecessary data.

 Given its rich content and original presentation, my personal opinion is that this is the first set of books that should find a place on every chemoinformatician’s desk and in every university library in the world.

 M.Karthikeyan

 Scientist

National Chemical Laboratory

Pune - 411 008

m.karthikeyan@ncl.res.in

Handbook of Chemoinformatics: From Data to Knowledge,

Volumes 1-4 Edited by Johann Gasteiger (University of Erlangen-Nürnberg).

Wiley-VCH Verlag GmbH & Co. KGaA:

Weinheim. 2003. xlvii + 1870 pp.

$750.00.

ISBN 3-527-30680-3.