Archive for the ‘Cheminformatics’ Category

InChI generation mystery

Thursday, April 30th, 2009

I’ve been playing with InChIs these days quite a bit and have been quite suprised by the results that I got. The one thing that is very surprising indeed, is that the InChI - and therefore the InChIKey, should one want to use these to index a chemical compound database - one gets, seems to depend on the input format for the InChI generator.

After having seen some more InChI related things atthe BioIT World yesterday, I thought I’d have another go at some InChI stuff. And I was again quite surprised to what I found, and actually even more confused.

Here’s what I did:

I took two molecules from PubChem (diaisostereomers) cis- and trans-1,2-dihydroxy-cyclohexane (1,2-cyclohexanediol). I created InChIKeys for both of them, using different inputs. And not at all did I end up with two different InChIs for these two structures that were the same.

The SMILES and InChIKey from PubChem are the following:
cis-1,2-cyclohexanediol: C1CC[C@@H]([C@@H](C1)O)O - PFURGBBHAOXLIO-OLQVQODUSA-N
trans-1,2-cyclohexanediol: C1CC[C@H]([C@@H](C1)O)O - PFURGBBHAOXLIO-PHDIDXHHSA-N

I used the Standard InChI generator v1.2 released in January 2009 by the IUPAC, more specifically the Linux binary. Because this software doesn’t read SMILES, one has to convert the SMILES input molecules to SD files. This was done with OpenBabel and Pipeline Pilot. In addition, I used CML obtained from OpenBabel as input, too.

And here is what I obtained:

Source file format

Source file from

Coordinates

InChIKey

Same as PubChem

Distinct stereo

SDF

OpenBabel

0D

cis

PFURGBBHAOXLIO-UHFFFAOYSA-N

N

N

trans

PFURGBBHAOXLIO-UHFFFAOYSA-N

N

SDF

OpenBabel

3D (OB –gen3D)

cis

PFURGBBHAOXLIO-OLQVQODUSA-N

Y

N

trans

PFURGBBHAOXLIO-OLQVQODUSA-N

N

CML

OpenBabel

0D

cis

FWITZFBVZWAIRX-OLQVQODUSA-N

N

Y

trans

FWITZFBVZWAIRX-PHDIDXHHSA-N

N

CML

OpenBabel

3D (OB –gen3D)

cis

PFURGBBHAOXLIO-OLQVQODUSA-N

Y

N

trans

PFURGBBHAOXLIO-OLQVQODUSA-N

N

SDF

Scitegic PipelinePilot

0D

Cis

PFURGBBHAOXLIO-UHFFFAOYSA-N

Y

N

trans

PFURGBBHAOXLIO-UHFFFAOYSA-N

N

SDF

Scitegic PipelinePilot

2D

Cis

PFURGBBHAOXLIO-OLQVQODUSA-N

Y

Y

trans

PFURGBBHAOXLIO-PHDIDXHHSA-N

Y

SDF

Scitegic PipelinePilot

3D

Cis

PFURGBBHAOXLIO-OLQVQODUSA-N

Y

Y

trans

PFURGBBHAOXLIO-PHDIDXHHSA-N

Y

CML

OpenBabel conversion of Scitegic PP files

0D

Cis

DQQAEJAQEMHJBB-UHFFFAOYSA-N

N

N

trans

DQQAEJAQEMHJBB-UHFFFAOYSA-N

N

CML

OpenBabel conversion of Scitegic PP files

3D

Cis

DQQAEJAQEMHJBB-UHFFFAOYSA-N

N

N

trans

DQQAEJAQEMHJBB-UHFFFAOYSA-N

N

CML

OpenBabel conversion of Scitegic PP 0D SD file

3D (OB –gen3D)

Cis

PFURGBBHAOXLIO-OLQVQODUSA-N

Y

N

trans

PFURGBBHAOXLIO-OLQVQODUSA-N

N

Now, this raises the following questions:

  1. First of all, why do I not just get the same InChI for everything? Isn’t it supposed to canonicalize the structure to a fair extent (I know it does) and come up with the same InChI?
  2. If there are different ways of getting a presumably valid InChI for a compound, which one should one take?
  3. Why do I get the same InChI for different stereoisomers when the 3D coordinates come from OpenBabel, but not if the coordinates come from Pipeline Pilot? Is the 3D coordinate generation in OpenBabel the problem?
  4. Why do I get the correst stereoinformation in the InChIKey when I use CML without any coordinates?
  5. Why do I get a different hash for the connectivity part when I use CML without any coordinates? FWITZFBVZWAIRX instead of PFURGBBHAOXLIO?
  6. And then, if CML without 3D coordinates gives the right stereochemistry in the InChI, why doesn’t 3D CML with coordinates from OpenBabel?
  7. 3D CML from PipelinePilot coordinates also doesn’t get the stereochemistry part right; instead, there’s yet a different connectivity part DQQAEJAQEMHJBB.
  8. Why should the stereochemistry come out right when 2D coordinates are used, but not if 0D coordinates are used?
  9. Why not get InChIs directly from SMILES?

Any comments on this anybody? I seriously wonder.

Murcko fragments with python and pybel

Friday, April 25th, 2008

There are actually a few bugs in this version. Shortly I’ll put up a new version here.

Couldn’t find a program to get all ring systems out of an SDF file, along with how often they occur. So I wrote a script in python that does exactly that job. It gets the smallest set of smallest rings (SSSR) from pybel: once a molecule is read in (e.g., mol=pybel.readstring(”smi”, smiles)), then you have the SSSR in mol.sssr, which is a vector of OBRing objects (see the OpenBabel documentation for more info about that).

You can iterate over this vector in a standard pythonic fashion, e.g., for ring in mol.sssr: pass. The ring size is easily accessed by ring.PathSize(), the atoms in the ring are stored in the member variable _path, e.g., ring._path will give you the atoms in the ring.

The script checks for fused ring systems by identifying any shared atoms between any members of the SSSR. This is achieved by intersection of the sets of member atoms of any two ring systems. Two rings are considered to be a ring system if they share at least one atom, i.e., strictly speaking it is not fused but rather a spiro system. This behaviour can be changed by changing if len(intsec) in function GetFusedRingsMatrix(mol) to if len(intsec) > 1.

Should you want to get all individual elements of the SSSR, instead of the fused/linked rings as one ring system, then it should suffice to supply a manually crafted matrix as the one returned by GetFusedRingsMatrix(). That would be something like l=len(mol.sssr) and FusedRingsMatrix = [[0 for x in range(l)] for y in range(l)]. Then all rings are supposed to be unconnected.

The script also includes exocyclic double bonds as part of rings they may be linked to.

All fragments are written out to a file fragments.sdf, and the number how often that fragment was encountered in the structure file supplied is written into the SDF field COUNT. If you watch the output with something like mview fragments.sdf (MarvinView from ChemAxon) it will look similar to the picture displayed. Screenshot of Murcko fragments in Marvin 5

The script can be found here: Python script for extraction of Murcko fragments

Pybel on Mac OS X

Thursday, March 20th, 2008

I just tried to install openbabel and its python wrapper pybel on a Mac running OS X 10.4. OpenBabel compiled fine from source, no problems with that. But when I tried to run python setup.py build in the scripts/python subdirectory, the build failed at the linking stage with

/usr/bin/ld: for architecture ppc
/usr/bin/ld: can't locate file for: -lopenbabel
collect2: ld returned 1 exit status

The solution turned out to be that—at least for my computer—I needed to set the environment variable OPENBABEL_INSTALL to /usr/local explicitly. Normally, the library is looked for in the src directory, i.e., where openbabel was built, but that didn’t work, for whatever reason (I haven’t checked up on that).

After setting the appropriate environment variable with export OPENBABEL_INSTALL=/usr/local everything works fine, the install as root finished without any problems, and everything works:

host:~/models/ flo$ python
Python 2.5 (r25:51918, Sep 19 2006, 08:49:13)
[GCC 4.0.1 (Apple Computer, Inc. build 5341)] on darwin
Type “help”, “copyright”, “credits” or “license” for more information.
>>> import pybel
>>>

Winnow algorithm in C++ for multiclass classification

Friday, March 14th, 2008

In a recent paper (How To Winnow Actives from Inactives: Introducing Molecular Orthogonal Sparse Bigrams (MOSBs) and Multiclass Winnow) I applied the Winnow algorithm to the prediction of biological activities of molecules.

I thought that it would certainly be of much use to provide the code to anybody who may want to use this methodology as well. So here we go: the source code of the inplementation can be downloaded here: multiclass winnow in C++, with MySQL support (covered by the GNU General Public Licence; for what that means either read the copy enclosed or read it online).

This program is provided as is without any guarantees that it will work or anything else. For the structure of the underlying MySQL tables, should you wish to use that feature, there is no documentation yet. You can easily find the required information by reading through the source code in the file Winnow.h. If I get round to doing it, I will provide some more documentation in a while here on this page. If you have any questions, please feel free to contact me and I will try to help you out.

The program is capable of either reading the training and test data from text files, or otherwise from MySQL tables. It also includes the possibilities of bagging multiple classifiers, to use a thick threshold, as well as orthogonal sparse bigrams or an exhaustive enumeration of all features provided. Invoking the program without any arguments will display a list of all available options.

There are still a couple of very useful things that can be done with this algorithm in the area of cheminformatics. I won’t spend an excessive amount of time pushing that project any further, therefore I would be happy to share some ideas with people who may want to collaborate on a project.

I would be more than happy to receive any comments or suggestions. In particular, if there is anybody out there who would be willing to write a graphical interface, then I would be very happy to help out as much as I can. But I do not have the time to focus on that for the time being.

Laplacian modified Bayesian classifier

Tuesday, February 5th, 2008

I needed a Bayesian classifier with support for some custom add-ons. The implementation that is in the script provided here has been done in Python. It reads data from tab-separated textfiles or alternatively supports reading from MySQL databases. Additionally, it also has the possibility to include combinations of features as orthogonal sparse bigrams (OSBs). The classifier gives the scores and class labels for a user-definable number of classes. The source code is well documented with comments. There is no usage() function as of yet, so the command line arguments are only documented within the source.
(more…)

Dump molecules and info from an SDF file to a MySQL database

Monday, February 4th, 2008

This script can be used to dump all molecules from an SDF file with all or a specified subset of corresponding SDF tags into an SQL database, in this case MySQL.
SDF parsing: The script parses SDF files and initialises an instance of class Molecule for each parsed molecule. A dictionary of associated SDF tags is available as Molecule.SDFdict, all SDF tags found are in the list Molecule.SDFtags, the connection table in Molecule.CTAB. Either all or a few selected tags are then written into a database. MySQL: A database has to be created manually (create database ). The name of the table where to write the structural info (i.e. CTAB) has to be specified on the command line. At least one SDF tag has to be specified on the command line. Additional SDF tags to be written to the database are specified as a colon-separated list. (more…)

Add field(s) to SDF file

Monday, February 4th, 2008

Add information to each molecule in a SDF file. The information for each molecule is specified as a colon-separated list in an additional file. This allows multiple values to be added under the same tag for each molecule. Alternatively, this allows for an easy extension to be used to add multiple tags with different values (not implemented, but easy to do).

Python script to add fields to a SDF file
(more…)

Circular fingerprints in Python

Monday, February 4th, 2008

This script is basically a re-implementation of the MOLPRINT 2D fingerprint. It takes a Sybyl MOL2 file as input and analyses for each atom its neighbouring environment and encodes this as a string ot Sybyl atom types. The depth level to which the environment goes can be specified, as well as if to separate each depth-level from the others. And, being in python it is easily amenable to additions, improvements, etc. The script only converts valid molecules as defined by regular expressions for atom and bond lines in the corresponding block of the MOL2 files.

Update 16/10/2007: Modified regular expressions for CTAB, included possibility to dump everything to MySQL database - see source code.

Circular fingerprints in Python
(more…)

Kennard Stone algorithm in C++

Monday, February 4th, 2008

This algorithm can be used for the selection of a training set. It works in the following way: (see this PDF for a graphical depiction of how this algorithm works)

Find the two points most separated in the training set
For each candidate point, find the smallest distance to any object already selected
Select that point for the training set which has the largest of these smallest distances
As described above, this algorithm always gives the same result, due to the two starting points which are always the same. Alternatively, one could choose two random points to start with. In that way, different training sets can be obtained.

The source code can be downloaded here, it is provided as a gzipped tar file. The code is far from being perfect, but it does the job. If you make any modifications to the code please let me know so that I can include them on this page here and in the original code. There are some comments in the source files that should explain how the program works. If you have any questions don’t hesitate do drop me an email.

R script for classifications with Random Forest

Monday, February 4th, 2008

I wrote this script to read in a given file, that is then split up into training/test sets a given number of times. randomForest() is then called, the execution time is measured and all the output is saved into files. The script also calculates the Matthews Correlation Coefficient (MCC), number of true/false positives, true/false negatives, and recall and precision for each class.
The format of the input file is specified in the script itself as well: semicolon separated fields, the first contains the class label, all others the (binary) features. You will have to specify the total number of columns in the file.
To speed up the reading in of the file I used scan() to read a given number (specifyable as a parameter) of lines into a buffer which is then processed and appended onto a dataframe which then holds the totality of all data.
I tried that script on a datamatrix of around 100,000 rows containing around 150 columns, the original file size is approx. 33MB. Reading in is reasonably fast.
You may have to adjust the script slightly if you want to work with more than around 10 classes (which I wrote this script for), but that is straightforward.

The script is here.