Tuesday, August 26, 2014

Ogres are not like cakes

I was intrigued by the article in New Scientist which starts with the question, “Do you speak chemistry?” [1]. So much so that I asked my friend to send me the original paper [2], authored by the Bartosz Grzybowski group at Northwestern University in Evanston, Illinois. It makes for curious reading.

Don’t get me wrong. I have nothing against analogies. I love analogies. If the linguistic analogy works for chemistry, it’s fine by me. As long as everybody understands that it is just an analogy.

The authors try to “demonstrate that a natural language such as English and organic chemistry have the same structure in terms of the frequency of, respectively, text fragments and molecular fragments”. How do they do that? They start by looking at the maximum common substrings (MCS) found in 100 sentences randomly chosen from English Wikipedia.
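
To get a feel for what such a tally involves, here is a minimal Python sketch. It is a simplification and not the authors' procedure: it assumes a “common fragment” is any substring (up to a few characters) shared by a pair of sentences, and the function names `substrings` and `common_fragment_counts`, the `max_len` cutoff and the three toy sentences are all made up for illustration.

```python
from collections import Counter
from itertools import combinations


def substrings(text, max_len=4):
    """Return the set of all substrings of `text` up to `max_len` characters."""
    return {
        text[i:i + n]
        for n in range(1, max_len + 1)
        for i in range(len(text) - n + 1)
    }


def common_fragment_counts(sentences, max_len=4):
    """Count, over all sentence pairs, how many pairs share each fragment."""
    fragment_sets = [substrings(s.lower(), max_len) for s in sentences]
    counts = Counter()
    for a, b in combinations(fragment_sets, 2):
        counts.update(a & b)  # fragments present in both sentences
    return counts


# Toy corpus (hypothetical; the paper used 100 Wikipedia sentences)
sentences = [
    "The cat sat on the mat",
    "A dog ate the bone",
    "Chemistry is not a language",
]
counts = common_fragment_counts(sentences)
for fragment, n in counts.most_common(5):
    print(repr(fragment), n)
```

Under this particular counting scheme, a fragment can never be shared by more sentence pairs than any of its own substrings, so single characters end up at the top of the list almost by construction.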

“Perhaps not surprisingly, the most common fragment of the sentences is ‘e’, followed by ‘a’ and ‘o’.”
That is surprising to me though, considering that only “a” is a word in English. I wouldn’t have been surprised had it been Spanish Wikipedia. Are the authors talking about letter frequency, perchance? But the “top three” letters in English (from most to least common) are known to be E, T, A, while in Spanish they are E, A, O. Anyway, they show that the distribution of the fragments, whatever they are, follows a power law. Then they show that the distribution of the common molecular fragments, derived from a corpus of organic molecules, also follows a power law. Big deal: so do earthquake magnitudes, populations of cities and stock market crashes [3]. Cadeddu et al. do not seem to be bothered by that at all:
“We have just shown that there exists a set of molecular fragments with which organic molecules can be described akin to a language.”
So far so bad; whether you are a linguist, a computational chemist or an organic chemist, both the methodology and the conclusions of this paper are bound to make you cringe. So my immediate reaction was to dismiss it altogether. Ogres are not like cakes. Organic molecules are not like a language. End of story.

But could it be that I am missing something? On the one hand, the language of chemistry (whether we are talking about trivial names, systematic names, or graphical diagrams) is very much like any other language: a system of communication. On the other hand, the molecules themselves are not. Unless they are informational macromolecules. The message encoded in a single DNA molecule can be very much abstracted from its chemical structure. Without any doubt, the genetic code is a communication system and therefore a language, although not a man-made one.

It’s interesting that the authors view organic molecules as “sentences” rather than “words”; the latter would be the nomenclaturist’s approach. I guess it depends on your taste, or language preferences. Most systematic chemical names look alien in English but would fit rather nicely in German or Finnish. I personally view any chemical name as a noun phrase describing a corresponding molecular entity; a molecular entity itself is not a noun phrase. However, in natural languages, there is rarely any confusion regarding the boundaries of a word:

“a word is the smallest element that may be uttered in isolation with semantic or pragmatic content (with literal or practical meaning).”
By contrast, Grzybowski’s “words” are molecular fragments, which do not exist in isolation. It is also worth noting that in the world of biopolymers, say nucleic acids, each monomer (as complex as any of Grzybowski’s “sentences”) is often represented as a letter, while an entire bacterial genome (still a single DNA molecule) could be considered a War and Peace (or a Crime and Punishment).

Cadeddu et al. further claim that the linguistic approach identifies the symmetry/repeat units in molecules such as α-cyclodextrin and porphyrin:

“We emphasize that this is not a small feat given we have not even considered any (x, y, z) coordinates of the atoms making up these molecules and performed no linear-algebra analyses to find symmetries which, incidentally, can be a computationally intensive procedure involving manipulation of matrices.”
I find this modest remark regarding the size of the “feat”, within the body of a scientific article in a respectable journal, really cute. Are the authors even aware that there are chemical similarity/substructure search engines? You don’t need atomic coordinates to identify fragments with the same connectivity.
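
To illustrate the point about connectivity, here is a minimal Python sketch that groups atoms into equivalence classes from nothing but element symbols and a bond list. It uses iterative label refinement, in the spirit of the Morgan algorithm; it is not how any particular search engine works, and the function name `symmetry_classes` and the two toy skeletons (heavy atoms only, bond orders ignored) are assumptions made for the example. Refinement of this kind can fail to separate non-equivalent atoms in some highly regular graphs, so treat its classes as an upper bound on the true symmetry.

```python
from collections import Counter


def symmetry_classes(atoms, bonds):
    """Label atoms so that atoms with the same label look alike by connectivity.

    atoms: list of element symbols, indexed by atom number.
    bonds: list of (i, j) atom-index pairs. No coordinates are used.
    """
    neighbours = [[] for _ in atoms]
    for i, j in bonds:
        neighbours[i].append(j)
        neighbours[j].append(i)

    labels = list(atoms)
    for _ in range(len(atoms)):
        # An atom's signature is its own label plus its neighbours' labels
        signatures = [
            (labels[i], tuple(sorted(labels[j] for j in neighbours[i])))
            for i in range(len(atoms))
        ]
        # Map each distinct signature to a compact integer label
        ranking = {sig: k for k, sig in enumerate(sorted(set(signatures)))}
        new_labels = [ranking[sig] for sig in signatures]
        # Refinement only ever splits classes, so an unchanged count means stable
        stable = len(set(new_labels)) == len(set(labels))
        labels = new_labels
        if stable:
            break
    return labels


# Six-membered carbon ring, and the same ring with one extra carbon on atom 0
ring_bonds = [(i, (i + 1) % 6) for i in range(6)]
ring = symmetry_classes(["C"] * 6, ring_bonds)
toluene = symmetry_classes(["C"] * 7, ring_bonds + [(0, 6)])

print(len(set(ring)))                      # 1
print(sorted(Counter(toluene).values()))   # [1, 1, 1, 2, 2]
```

All six ring atoms fall into one class, while the substituted ring splits into five: the substituent, the atom bearing it, two pairs of mirror-related ring atoms and the atom opposite the substituent.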

Which brings me to the final point. What is “chemical linguistics” anyway? If the “words” of chemistry, as postulated in [2], are nothing but molecular fragments, or substructures, then chemoinformaticians have been doing substructure searches of chemical databases for donkey’s years without knowing that it is called chemical linguistics. I am aware of a completely different use of this term, in the sense of “mining of natural language texts for chemical information” [4, 5]. This latter use is well established, and I think applying the name “chemical linguistics” to an unrelated area will only confuse everybody.

  1. Aron, J. (2014) Language of chemistry is unveiled by molecular make-up. New Scientist no. 2981, p. 8.
  2. Cadeddu, A., Wylie, E.K., Jurczak, J., Wampler-Doty, M. and Grzybowski, B.A. (2014) Organic chemistry as a language and the implications of chemical linguistics for structural and retrosynthetic analyses. Angewandte Chemie 126, 8246–8250.
  3. Buchanan, M. (2000) Ubiquity, Weidenfeld & Nicolson, London.
  4. Goebels, L., Grotz, H., Lawson, A.L., Roller, S. and Wisniewski, J. (2005) Method and software for extracting chemical data. Patent DE 102005020083 A1.
  5. Day, N.E., Corbett, P.T. and Murray-Rust, P. (2007) Semantic chemical publishing. ACS National Meeting #233, Chicago.