Showing posts with label databases. Show all posts

Monday, May 05, 2025

InChI metal-reconnected layer

While I wasn’t looking, the InChI folks implemented the metal-reconnected layer. Isn’t it nice? I discovered it quite by chance thanks to the Beilstein-Institut ChemInfo Labs page. You can see how it works on InChI Web Demo.

Consider ferrocyanide (a):

(a)
  1. [Fe(CN)6]4−
    hexacyanidoferrate(4−) (additive)
    ferrocyanide (trivial)

Its standard InChI is:

InChI=1S/6CN.Fe/c6*1-2;/q;;;;;;-4 (1)

The main layer contains two types of entities: 6CN (i.e. six CN molecules) and Fe (one iron atom). If we try to convert (1) back to structure, using the same Web Demo tool, we get six free-floating CN radicals and a separate Fe4− anion. Ew. But if we tick the “Include Bonds to Metal” box in the Web Demo tool, we have

InChI=1/6CN.Fe/c6*1-2;/q;;;;;;-4/rC6FeN6/c8-1-7(2-9,3-10,4-11,5-12)6-13/q-4 (2)

where the metal-reconnected layer (/r) appears. It looks like an alternative InChI added directly after the standard one, with its own connectivity (/c) and charge (/q) sublayers. In this layer, there is only one entity: C6FeN6, i.e. [Fe(CN)6]. The string (2) is correctly converted back to the structure (a).
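To make that layout a bit more tangible, here is a minimal sketch that splits string (2) into its main part and its reconnected part. It is plain string handling, not a call into the InChI software; the function name split_inchi is made up for illustration, and the code assumes that “/r” occurs only as the prefix of the metal-reconnected layer.

```python
def split_inchi(inchi):
    """Split an InChI string into (version, main_layers, reconnected_layers).

    Assumption: '/r' appears only as the start of the reconnected layer.
    """
    body = inchi.split("=", 1)[1]                  # drop the 'InChI=' prefix
    main_part, sep, recon_part = body.partition("/r")
    version, *main_layers = main_part.split("/")   # first item is the version
    recon_layers = recon_part.split("/") if sep else []
    return version, main_layers, recon_layers


ferrocyanide = ("InChI=1/6CN.Fe/c6*1-2;/q;;;;;;-4"
                "/rC6FeN6/c8-1-7(2-9,3-10,4-11,5-12)6-13/q-4")
version, main_layers, recon_layers = split_inchi(ferrocyanide)
print(main_layers)    # formula, /c and /q of the disconnected structure
print(recon_layers)   # formula, /c and /q of the reconnected structure
```

Both lists start with a formula, followed by their own connectivity and charge sublayers, which is the sense in which the /r layer looks like a second InChI.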

Now let’s look at the structure of a salt known as Prussian Blue (b):

(b)
  1. Fe4[Fe(CN)6]3
    iron(3+) hexacyanidoferrate(4−) (additive)
    ferric ferrocyanide (trivial)
    Prussian Blue (trivial)

Its standard InChI is:

InChI=1S/18CN.7Fe/c18*1-2;;;;;;;/q;;;;;;;;;;;;;;;;;;3*-4;4*+3 (3)

The main layer contains two types of entities: 18CN (i.e. 18 CN molecules) and 7Fe (seven iron atoms). Converting (3) to structure brings about a horrible mess. With “bonds to metal”, however, we get

InChI=1/18CN.7Fe/c18*1-2;;;;;;;/q;;;;;;;;;;;;;;;;;;3*-4;4*+3/r3C6FeN6.4Fe/c3*8-1-7(2-9,3-10,4-11,5-12)6-13;;;;/q3*-4;4*+3 (4)

In the metal-reconnected layer (/r) we see two different types of entities: 3C6FeN6 (i.e. three [Fe(CN)6]) and 4Fe. The string (4) is correctly converted back to the structure (b).
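The entity counts quoted above can be read off the formula layers mechanically. The following is a small sketch under the assumption that a formula layer is a dot-separated list of components, each with an optional leading multiplier; formula_components is a hypothetical helper name, not part of any InChI toolkit.

```python
import re


def formula_components(formula_layer):
    """Split a formula layer such as '3C6FeN6.4Fe' into (count, formula) pairs."""
    components = []
    for part in formula_layer.split("."):
        # Optional multiplier, then a formula starting with an element symbol.
        match = re.fullmatch(r"(\d*)([A-Z].*)", part)
        if match is None:
            raise ValueError(f"unexpected component: {part!r}")
        count = int(match.group(1)) if match.group(1) else 1
        components.append((count, match.group(2)))
    return components


main = formula_components("18CN.7Fe")        # main layer of string (4)
recon = formula_components("3C6FeN6.4Fe")    # /r layer of string (4)
print(main, sum(count for count, _ in main))
print(recon, sum(count for count, _ in recon))
```

For Prussian Blue this gives 25 separate entities in the main layer against 7 in the reconnected one.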

Years ago, I was complaining (to the universe) about different InChIs for the same molecular entity, viz. chromate (c)–(e). I’ve revisited it with the new version of InChI.

[Cr(O)2(O-)2] [Cr(O)4]2- [Cr(2+)(O-)4]
(c) (d) (e)
    [CrO4]2−
    chromate (trivial)
    tetraoxidochromate(2−) (additive)
    tetraoxidochromate(VI) (additive)

Alas, the standard InChIs for the representations (c), (d) and (e) remain different. Try to convert them back to structures: they also are all different and all wrong (all have extra hydrons). Nevertheless, I see a sign of progress: the metal-reconnected layers for the corresponding strings (5), (6) and (7) are identical!

(c) InChI=1/Cr.4O/q;;;2*-1/rCrO4/c2-1(3,4)5/q-2 (5)
(d) InChI=1/Cr.4O/q-2;;;;/rCrO4/c2-1(3,4)5/q-2 (6)
(e) InChI=1/Cr.4O/q+2;4*-1/rCrO4/c2-1(3,4)5/q-2 (7)

Moreover, all three strings, (5), (6) and (7), are converted back to the structure (c).
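The comparison can also be done without squinting at the strings. The sketch below (plain Python, with made-up helper names) pulls out the reconnected part of (5), (6) and (7) and, for good measure, adds up the per-component charges in the main /q sublayer. It assumes that “/r” marks the reconnected layer, that empty items in /q mean zero charge, and that “n*” is a repeat count.

```python
def reconnected_part(inchi):
    """Return everything after '/r', or None if there is no reconnected layer."""
    _, sep, recon = inchi.partition("/r")
    return recon if sep else None


def main_charge_layer(inchi):
    """Return the text of the /q sublayer of the main (disconnected) part."""
    main = inchi.partition("/r")[0]
    for layer in main.split("/")[1:]:
        if layer.startswith("q"):
            return layer[1:]
    return ""


def expand_charges(q_layer):
    """Expand e.g. ';;;2*-1' into [0, 0, 0, -1, -1]."""
    if not q_layer:
        return []
    charges = []
    for item in q_layer.split(";"):
        if not item:
            charges.append(0)          # empty item: neutral component
            continue
        count, star, value = item.rpartition("*")
        charges.extend([int(value)] * (int(count) if star else 1))
    return charges


chromates = {
    "c": "InChI=1/Cr.4O/q;;;2*-1/rCrO4/c2-1(3,4)5/q-2",
    "d": "InChI=1/Cr.4O/q-2;;;;/rCrO4/c2-1(3,4)5/q-2",
    "e": "InChI=1/Cr.4O/q+2;4*-1/rCrO4/c2-1(3,4)5/q-2",
}
for label, inchi in chromates.items():
    total = sum(expand_charges(main_charge_layer(inchi)))
    print(label, reconnected_part(inchi), total)
```

All three lines show the same reconnected part and the same total charge of −2; only the way that charge is distributed over the disconnected atoms differs.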

What about our old friend ferrocene? Depends how you draw it. I’ll stick to ChEBI’s decacoordinate-iron representation (f):

ferrocene with 10-coordinate iron
(f)
  1. bis(η5-cyclopentadienyl)iron (additive)
    ferrocene (trivial)

The standard InChI for ferrocene is:

InChI=1S/2C5H5.Fe/c2*1-2-4-5-3-1;/h2*1-5H; (8)

Converting (8) to structure results in two standalone cyclopentadienyl radicals and a neutral iron atom. With “bonds to metal”:

InChI=1/2C5H5.Fe/c2*1-2-4-5-3-1;/h2*1-5H;/rC10H10Fe/c1-2-4-5-3(1)11(1,2,4,5)6-7(11)9(11)10(11)8(6)11/h1-10H (9)

In the /r layer we see a single entity, C10H10Fe, i.e. [Fe(C5H5)2]. The string (9) is correctly converted back to the structure (f).

Tuesday, August 26, 2014

Ogres are not like cakes

I was intrigued by the article in New Scientist which starts with the question, “Do you speak chemistry?” [1]. So much so that I asked my friend to send me the original paper [2] authored by the Bartosz Grzybowski group of Northwestern University in Evanston, Illinois. It makes for curious reading.

Don’t get me wrong. I have nothing against analogies. I love analogies. If the linguistic analogy works for chemistry, it’s fine by me. As long as everybody understands that it is just an analogy.

The authors try to “demonstrate that a natural language such as English and organic chemistry have the same structure in terms of the frequency of, respectively, text fragments and molecular fragments”. How do they do that? They start by looking at the maximum common substrings (MCS) found in 100 sentences randomly chosen from English Wikipedia.

Perhaps not surprisingly, the most common fragment of the sentences is “e”, followed by “a” and “o”.

That is surprising to me though, considering that only “a” is a word in English. I wouldn’t be surprised if it were Spanish Wikipedia. Are the authors talking about letter frequency perchance? But the “top three” letters in English (from most to least common) are known to be E, T, A, while in Spanish they are E, A, O. Anyway, they show that the distribution of the fragments, whatever they are, follows a power law. Then they show that the distribution of the common molecular fragments, derived from the corpus of organic molecules, also follows a power law. Big deal: so do earthquake magnitudes, populations of cities and stock market crashes [3]. Cadeddu et al. do not seem to be bothered by that at all:

We have just shown that there exists a set of molecular fragments with which organic molecules can be described akin to a language.

So far so bad; whether you are a linguist, a computational chemist or an organic chemist, both the methodology and the conclusions of this paper are bound to make you cringe. So, my immediate reaction was to dismiss it altogether. Ogres are not like cakes. Organic molecules are not like a language. End of story.

But could it be that I am missing something? On the one hand, the language of chemistry — whether we are talking trivial names, systematic names, or graphical diagrams — is very much like any other language: a system of communication. On the other hand, the molecules themselves are not. Unless they are the information macromolecules. The message encoded in a single DNA molecule can be very much abstracted from its chemical structure. Without any doubt, genetic code is a communication system, therefore it is a language, although not man-made.

It’s interesting that the authors view organic molecules as “sentences” rather than “words”; the latter would be the nomenclaturist’s approach. I guess it depends on your taste, or language preferences. Most systematic chemical names look alien in English but would fit rather nicely in German or Finnish. I personally view any chemical name as a noun phrase describing a corresponding molecular entity; a molecular entity itself is not a noun phrase. However, in natural languages, there is rarely any confusion regarding the boundaries of a word:

a word is the smallest element that may be uttered in isolation with semantic or pragmatic content (with literal or practical meaning).

By contrast, Grzybowski’s “words” are molecular fragments which do not exist in isolation. It is also worth noting that in the world of biopolymers, say nucleic acids, each monomer (as complex as any of Grzybowski’s “sentences”) is often represented as a letter, while an entire bacterial genome (still a single DNA molecule) could be considered a War and Peace (or Crime and Punishment).

Cadeddu et al. further claim that the linguistic approach identifies the symmetry/repeat units in molecules such as α-cyclodextrin and porphyrin:

We emphasize that this is not a small feat given we have not even considered any (x, y, z) coordinates of the atoms making up these molecules and performed no linear-algebra analyses to find symmetries which, incidentally, can be a computationally intensive procedure involving manipulation of matrices.

I find this modest remark regarding the size of the “feat” within the body of a scientific article in a respectable journal really cute. Are the authors even aware that there are chemical similarity/substructure search engines? You don’t need atomic coordinates to identify the fragments with the same connectivity.

Which brings me to the final point. What is “chemical linguistics” anyway? If the “words” of chemistry, as postulated in [2], are nothing else but molecular fragments, or substructures, then the chemoinformaticians have been doing substructure searches of chemical databases for donkey’s years without knowing that it is called chemical linguistics. I am aware of a completely different use of this term, in the sense of “mining of natural language texts for chemical information” [4, 5]. This latter use is well-established and I think applying the name “chemical linguistics” to an unrelated area will only confuse everybody.

  1. Aron, J. (2014) Language of chemistry is unveiled by molecular make-up. New Scientist no. 2981, p. 8.
  2. Cadeddu, A., Wylie, E.K., Jurczak, J., Wampler-Doty, M. and Grzybowski, B.A. (2014) Organic chemistry as a language and the implications of chemical linguistics for structural and retrosynthetic analyses. Angewandte Chemie 126, 8246–8250.
  3. Buchanan, M. (2000) Ubiquity, Weidenfeld & Nicolson, London.
  4. Goebels, L., Grotz, H., Lawson, A.L., Roller, S. and Wisniewski, J. (2005) Method and software for extracting chemical data. Patent DE 102005020083 A1.
  5. Day, N.E., Corbett, P.T. and Murray-Rust, P. (2007) Semantic chemical publishing. ACS National Meeting #233, Chicago.

Tuesday, December 06, 2011

Collaborative Computational Technologies for Biomedical Research

It’s been a while since I read a science/technology book from cover to cover. And was it worth it? Definitely.

The book is about collaboration and is a collaboration. Ironically, the best-written chapters almost invariably are those by single authors. Which confirms my own theory that writing (including scientific writing) is not exactly a collaborative activity. The contributions by Robert Porter Lynch [1], Robin W. Spencer [2], Victor J. Hruby [3], Edward D. Zanders [4], Brian Pratt [5] and Keith T. Taylor [6] are especially worth noting; I wish the whole book were written at the level of these chapters. Then again, collaboration is always a compromise. The material presented here is diverse and heterogeneous: what did you expect?

I am sure there are people who do all sorts of stuff using their smartphones, including scientific database browsing and chemical structure drawing [7]. This latter activity does not strike me as especially productive or convenient. (Also, makes me glad that the use of mobile phones while driving is outlawed in most of Europe.) In my view, for the purposes of computer graphics bigger is better: if I had a choice, I’d go for HIPerWall (25,600 × 8000 pixels) or, better still, HIPerSpace (35,840 × 8000 pixels) display walls [8]. Then I could draw some really large (in many senses) molecules.

As much as I enjoy reading the real (hardcopy) book, it could be nice to see it online, preferably in open access. For instance, Chapter 25 [9] has 196 references, all of them are URLs, and some of them are rather long ones. I’d love to be able to click on them rather than type!

Will the wikis, virtual communities and cloud computing replace the behemoth pharma companies and NCBI? A man can dream. Ekins et al. write [10]:

As a result of the recent recession there is a lot of drug discovery and development talent available now due to company lay-offs. If the software or other tools to enable this workforce to be productive and collaborate were available and they participated in the existing scientific collaboration networks, then there may be potential for enormous breakthroughs.

I wish I could share the authors’ optimism. Yes, there is potential, but it is highly unlikely that unemployed researchers are in the mood to collaborate. In case you wonder why: being unemployed is a full-time occupation, which leaves precious little spare time. I am rather inclined to agree with Robin W. Spencer [2]:

Especially for cutting-edge scientific challenges, the participants you need are probably well paid and not particularly enthused by another tee shirt, coffee cup, or $100 voucher.

More quotes from this book can be found here.

I use this opportunity to lament the decline of old-fashioned copy editing [11]. I have got used to the lack of any such luxury in open access publications: if the paper is accepted, the publisher tends to keep all your typos intact. But when you buy a book from John Wiley & Sons for a hundred something bucks, you’d expect some editorial intervention. (To be honest, I did not buy it. I can’t afford buying books at such prices anyway.) The major and minor irritations include:

  • Typos: “chpater” instead of “chapter” (p. 281) — I thought by now the text editing software should take care of these.
  • Tautologies: ‘The institutes of the national Institutes of Health’ (p. 496); ‘... we need to consider standards specifically for chemistry and biology. In chemistry specifically...’ (p. 202).
  • Impenetrable sentences, e.g. ‘Many aspects should be considered, such as a regulatory path for filing, potential market size, differentiability of the therapeutic and experience with and difficulty to carry out clinical trials in the disease of interest’ (p. 252) or ‘This will only be done by drawing from the mental resources of an extended scientific community in an innovative and complex, yet “daily practice”, manner that promises a profound impact on our ability to use existing data to generate new knowledge with the maximum conceivable serendipity’ (p. 454). You what?
  • Overabundance of acronyms (have a look at p. 497 and you’ll see what I mean).
  • Overabundance of buzz-words of yesteryear: crowdsourcing (see below), integration, leveraging, paradigm, stakeholder and so on. The worst offenders, however, are clear and clearly. Clearly, when these words are used too often, it is clear that something is not quite clear.

Now for “crowdsourcing”: I find the term not only ugly but offensive. As a scientist (once a scientist, always a scientist), I am open to collaboration. Also, as a scientist, I detest being part of a crowd. Period.

Don’t get me wrong: it is a good book. I wouldn’t hesitate to recommend it to any decent scientific library. But it could have been a great book.

  1. Lynch, R.P. Collaborative innovation: essential foundation of scientific discovery. In: Ekins, S., Hupcey, M.A.Z. and Williams, A.J. (eds.) Collaborative Computational Technologies for Biomedical Research. John Wiley & Sons, Hoboken, 2011, pp. 19–37.
  2. Spencer, R.W. Consistent patterns in large-scale collaboration. Ibid., pp. 99–111.
  3. Hruby, V.J. Collaborations between chemists and biologists. Ibid., pp. 113–120.
  4. Zanders, E.D. Scientific networking and collaborations. Ibid., pp. 149–160.
  5. Pratt, B. Collaborative systems biology: open source, open data, and cloud computing. Ibid., pp. 209–220.
  6. Taylor, K.T. Evolution of electronic laboratory notebooks. Ibid., pp. 303–320.
  7. Williams, A.J., Arnold, R.J.G., Neylon, C., Spencer, R.W., Schürer, S. and Ekins, S. Current and future challenges for collaborative computational technologies for the life sciences. Ibid., pp. 491–517.
  8. He, Z., Ponto, K. and Kuester, F. Collaborative visual analytics environment for imaging genetics. Ibid., pp. 467–490.
  9. Bradley, J.-C., Lang, A.S.I.D., Koch, S. and Neylon, C. Collaboration using open notebook science in academia. Ibid., pp. 425–452.
  10. Ekins, S., Williams, A.J. and Hupcey, M.A.Z. Standards for collaborative computational technologies for biomedical research. Ibid., pp. 201–208.
  11. Clark, A. The lost art of editing. The Guardian, 11 February 2011.

Wednesday, November 04, 2009

Some ChEBI news

Wow. Today’s ChEBI news informs us that

ChEBI release 62 is now available, containing 455,788 total entities, of which 19,236 are annotated entities and 607 were submitted via the ChEBI submission tool.

Looking back, I must say that ChEBI news lacks consistency and therefore the users are bound to be confused. Before, we never said how many entities ChEBI contained in total. For the previous release, only “annotated” entities were counted:

ChEBI release 61 is now available, containing 18933 annotated entities, with 413 of those submitted via the ChEBI submission tool.

And a few releases back, “annotated” was not mentioned at all:

ChEBI release 58 contains 18186 entities with 175 of those submitted via the ChEBI submission tool.

Does “annotated” matter? Most ChEBI entries are annotated in some way. That includes all those thousands of compounds which just came from ChEMBL. What is really meant here is “annotated and approved by ChEBI curators”. But wait:

With this release, we’ve incorporated the compound records from the ChEMBL dataset and introduced a starring system to identify core (3-star) annotated ChEBI entries from entries annotated by the ChEMBL project and ChEBI submitters.

It is unfortunate that ChEBI introduced stars to signify entry quality. I think I am not the only one who firmly associates stars with a user’s (external reviewer’s) rating. “Thou shalt not award no stars to thyself.” In addition, a star rating system usually implies that stars can be lost as well as gained, which is not the case in ChEBI: once the three-star status is reached, the entry stays as it is.

Oh well. Sure ChEBI is not perfect, but what is? Perhaps in a couple of releases we’ll see another change, stars replaced by other celestial bodies or flowers or traffic signs. Good night.

Monday, September 21, 2009

Sulfimide bond in collagen IV

The recent paper in Science describes the sulfilimine (sulfimide, in IUPACese) bond, “not previously found in biomolecules”, identified in collagen. The bond (>S=N–) cross-links methionine and hydroxylysine residues of adjoining protomers.

Thursday, August 13, 2009

What is a correct InChI for chromate?

During the IUPAC International Chemical Identifier (InChI) Subcommittee meeting in Glasgow last month, we touched upon the issue of normalisation of metal complexes. I did not realise before that even a simple entity such as chromate(2−), drawn in different ways (a)–(c), will give different InChIs. (And different standard InChIs as well; and InChIKeys too.) This is, I am told, because the current InChI algorithm involves “disconnection” of metals before “normalisation”, while it really should do normalisation first. Bother.

(a) [Cr(O)2(O-)2] InChI=1/Cr.4O/q;;;2*-1
(b) [Cr(O)4]2- InChI=1/Cr.4O/q-2;;;;
(c) [Cr(2+)(O-)4] InChI=1/Cr.4O/q+2;4*-1

Friday, March 27, 2009

Stories of chronomes and metallomes

I do not understand what principle is used by PubMed to indicate which papers are “related” to the one you are looking at. Take, for instance, the recent paper “Epigenetics: an important challenge for ICP-MS in metallomics studies”: among “Related Articles”, the top one is entitled “Chronoastrobiology: proposal, nine conferences, heliogeomagnetics, transyears, near-weeks, near-decades, phylogenetic and ontogenetic memories”. (Is that a real title? Yes it is.) True, the abstract, though truncated, makes intriguing reading, but has it anything to do with metallomics (or epigenetics, for that matter)? The only passage related to any ome or omics is the following:

Structures in time are called chronomes; their mapping in us and around us is called chronomics. The scientific study of chronomes is chronobiology.

Well, I don’t know, Webster’s definition of chronobiology makes more sense to me and it does not use the dodgy concept of “chronome”. As of today, 27 March 2009, PubMed citations for chronome (61) and chronomics (39) visibly outnumber metallome (8) and metallomics (20), while there is none that combines any of the first pair of terms with any of the second pair of terms.

Saturday, March 21, 2009

Rhea has hatched

I am pleased to announce that after years of hard work, the Rhea database finally went online. Rhea is a freely available, manually annotated database of chemical reactions created as a collaboration between the EBI and SIB. From the Rhea website:

In classical Greek mythology, Rhea (Greek Ρέα; /ˈriːə/) was the daughter of Uranus and Gaia, and was known as the mother of gods. Her name is often linked to the Greek word ρείν (“flow”) but has no relation to the word “reaction”. Rhea is the name of a genus of flightless birds, also known as ñandú. Rhea is also the name of the second-largest moon of Saturn, which contains up to 75% water and may have a tenuous ring system. The image of Rhea (moon) is used in Rhea (database) logo.

Sunday, March 01, 2009

PubChem takes liberties with hydrogens

The submitted structure (a) is C3H5O5P, but PubChem shows C3H4O5P+ (b). How did that happen? Why did the deposited molecule lose a hydride (H−)?

3-[hydroxy(oxido)phosphoranyl]pyruvic acid
(a)(b)

In the case of structure C16H36MoN6O4P2 (c), presumably submitted by NIST, it has acquired two hydrons in PubChem to become [C16H38MoN6O4P2]2+ (d).

(c)(d)
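The bookkeeping behind “lost hydride” and “acquired two hydrons” is simple enough to sketch in a few lines. The code below is only an illustration: the helper names are made up, formulas are assumed to be flat (no brackets), and charges are passed in separately rather than parsed.

```python
import re
from collections import Counter


def parse_formula(formula):
    """Count atoms in a flat formula such as 'C16H36MoN6O4P2'."""
    counts = Counter()
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] += int(number) if number else 1
    return counts


def formula_change(before, after):
    """Return {element: after - before} for elements whose count changed."""
    b, a = parse_formula(before), parse_formula(after)
    return {el: a[el] - b[el] for el in sorted(set(a) | set(b)) if a[el] != b[el]}


def describe(delta_h, delta_charge):
    """Interpret a change in H count and net charge as hydron or hydride transfer."""
    if delta_h == 0 or abs(delta_h) != abs(delta_charge):
        return "not a simple hydron/hydride change"
    verb = "gained" if delta_h > 0 else "lost"
    # Gaining H+ or losing H- raises the charge; the opposite lowers it.
    species = "hydron (H+)" if (delta_h > 0) == (delta_charge > 0) else "hydride (H-)"
    return f"{verb} {abs(delta_h)} x {species}"


# (a) -> (b): charge goes from 0 to +1
print(formula_change("C3H5O5P", "C3H4O5P"), describe(-1, +1))
# (c) -> (d): charge goes from 0 to +2
print(formula_change("C16H36MoN6O4P2", "C16H38MoN6O4P2"), describe(+2, +2))
```

In both cases only the hydrogen count changes, and the change in charge matches it one for one.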