Metallome: linguistics

Monday, August 12, 2024

Ants, apples, amber

Let’s turn our attention now to another kind of acid. You know what I’m talking about: carboxylic acids. Here’s the simplest one (a):

(a)

HCOOH
formic acid (common, PIN)
methanoic acid (substitutive)
hydridohydroxidooxidocarbon (additive)

If we compare the structure (a) with that of our old friend, carbonic acid (b), we’ll notice that the only difference between them amounts to one oxygen atom.

Oxoacids and their anions

Many of the chemical names referred to today as “common” or “trivial” (as opposed to “systematic”) were very much systematic at the time. Many of them, in fact, remain systematic because there is a system behind them.

Observe the structure (a):

(a)

H₂SO₄
[SO₂(OH)₂]
sulfuric acid (common)
dihydroxidodioxidosulfur (additive)

Its molecular formula, H₂SO₄, is probably the second best-known formula in the world after H₂O. We can rewrite it as [SO₂(OH)₂]. There’s nothing easier than to create a completely systematic additive name for (a): dihydroxidodioxidosulfur. However, almost nobody uses this name because there is a much more famous one: sulfuric acid.
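To make the “nothing easier” claim concrete, here is a minimal sketch of how such an additive name can be assembled, assuming the simplest case only: one central atom, ligand names cited in alphabetical order (ignoring multiplicative prefixes), each preceded by a simple multiplicative prefix. The function name `additive_name` and the short prefix table are made up for this illustration; they are not part of any IUPAC tool.

```python
# Simple multiplicative prefixes (a deliberately short, illustrative table).
MULTIPLIERS = {1: "", 2: "di", 3: "tri", 4: "tetra", 5: "penta", 6: "hexa"}


def additive_name(central_atom: str, ligands: dict[str, int]) -> str:
    """Assemble an additive name for a mononuclear entity.

    Ligand names are cited in alphabetical order (the multiplicative
    prefix does not take part in the ordering), each preceded by its
    multiplicative prefix, and followed by the name of the central atom.
    """
    parts = []
    for ligand in sorted(ligands):
        parts.append(MULTIPLIERS[ligands[ligand]] + ligand)
    return "".join(parts) + central_atom


# [SO2(OH)2]: two hydroxido and two oxido ligands around sulfur.
print(additive_name("sulfur", {"oxido": 2, "hydroxido": 2}))
# HCOOH: one hydrido, one hydroxido and one oxido ligand around carbon.
print(additive_name("carbon", {"oxido": 1, "hydroxido": 1, "hydrido": 1}))
```

The two calls reproduce the additive names quoted for sulfuric acid and formic acid. Anything beyond this simplest case (charges, polynuclear entities, ligands that need ‘bis’, ‘tris’ and enclosing marks) is outside the scope of the sketch.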

Irregularity and suppletion

Now that we’ve established that all chemical names consist of content words and each content word includes at least one base, we can rephrase our original statement (ix),

New chemical names are formed by combining existing content morphemes with functional morphemes or adding new content morphemes.

as

New chemical names are formed by combining existing bases with functional morphemes or adding new bases.

When we say “combining”, we mean that the parts of our construction set do not themselves change. Right? In this way, chemical name-building (out of standardised blocks, like names of atoms, groups, etc.) reflects actual molecule-building (out of standard blocks, like atoms, groups, etc.).

On the other hand, if we agree that chemical names form part of a natural language, we also have to accept that sometimes they behave in a not quite regular fashion. For example, we can figure out that the substituent group called ethenyl is derived from ethene because they share the base ‘ethen’. However, we cannot deduce in a similar fashion that the phenyl group is derived from benzene. What’s going on here?

Stems, roots, bases

In a number of IUPAC publications, the entities that are referred to as “stems” include

Latin stems such as ‘argent’, ‘aur’, ‘cupr’, ‘ferr’, etc. used before ‘ide’ or ‘ate’ in anion names [1];
Stem name ‘carotene’ in nomenclature of carotenoids [2, rule 2];
Stem ‘calci-’ in nomenclature of vitamin D [3];
Stem ‘retin-’ in nomenclature of retinoids [4];
In carbohydrate nomenclature, stem names that designate the chain length of the sugar, e.g. ‘pent-’, ‘hex-’, ‘hept-’ etc. [5];
Stems such as ‘irene’, ‘irane’, ‘epine’ etc. in Hantzsch-Widman (H-W) nomenclature [6];
Stem name ‘phosphatidic acid’ [7].

Before we go any further, we have to distinguish between the terms “root”, “stem” and “base”, which are often used interchangeably even in linguistic literature.

Suffixes — or combining forms?

In a number of IUPAC publications, the entities that are referred to as “suffixes” include

Suffix for the principal characteristic group, such as ‘-amine’, ‘-one’ or ‘-oic acid’ [1];
Suffixes indicating charge [2, p. 5];
Suffixes indicating loss or addition of one or more hydrogens from/to parent hydrides, e.g. ‘-ium’, ‘-ylium’, ‘-ide’ or ‘-uide’ [2, p. 105];
Suffixes ‘yl’, ‘ylidene’ or ‘ylidyne’ in the names of radicals and substituent groups [2, p. 108];
Subtractive suffixes ‘ene’ and ‘yne’ [3];
Composite suffixes [4, p. 82] aka combined suffixes [2, p. 251] that contain multiplicative prefixes, as in ‘diyl’ or ‘triylium’;
Suffixes ‘quinone’, ‘quinol’, ‘chromenol’ and ‘chromanol’ in names of quinones [5].

Prefixes — or combining forms?

With “endings” out of the way, shall we move on to “prefixes”?

In a number of IUPAC publications, the entities that are referred to as “prefixes” include

Numerical prefixes [1], aka multiplicative prefixes [2] ‘di’, ‘tri’, ‘tetra’, etc. and ‘bis’, ‘tris’, ‘tetrakis’, etc.;
Prefixes indicating atoms or groups, either substituents, e.g. ‘hydro’, ‘chloro’, ‘cyano’, or ligands, e.g. ‘hydrido’, ‘chlorido’, ‘cyanido’ [2];
Prefixes ‘de’ and ‘an’ in subtractive nomenclature as well as their combinations with the names of atoms or groups, e.g. ‘dehydro’, ‘anhydro’, ‘demethyl’, ‘deoxy’, etc.;
The ‘a’ prefixes for skeletal replacement and Hantzsch-Widman names, e.g. ‘aza’, ‘oxa’, ‘thia’, as well as their combinations with multiplicative prefixes, as in ‘dioxa’ [3];
Geometrical and structural prefixes such as catena-, arachno-, quadro-, etc. [3];
Configurational prefixes of inositols such as allo-, chiro-, cis-, epi-, muco-, myo-, neo- and scyllo- [4];
Prefixes retro- and ‘apo’ in nomenclature of carotenoids [5];
Configurational prefixes in nomenclature of carbohydrates [6];
Prefix sn- (for stereospecifically numbered) in nomenclature of glycerol derivatives [7];
Prefixes ‘abeo’, ‘cyclo’, ‘homo’, ‘nor’ and ‘seco’ in nomenclature of natural products [8];
Prefix ‘poly’ and qualifiers such as branch-, net-, or star- in polymer names [9].

I like “qualifiers”. I also don’t mind saying “multiplicative prefix” or “configurational prefix” as long as we understand that they might not actually be prefixes, just like vegetarian sausages are not sausages and white chocolate is not chocolate.

Content morphemes

At this point, it might be useful to mention that morphemes can be divided into two classes: content morphemes (i.e. those that have independent meaning) and functional morphemes. All content words contain at least one content morpheme. In English, content words include nouns, adjectives, adverbs and most verbs, while functional morphemes include conjunctions, prepositions, pronouns and articles as well as affixes. These two classes are sometimes referred to as “open class” and “closed class”, respectively. New morphemes are easily added to the former and hardly ever to the latter.

Now to continue with the list that I started earlier.

Since chemical names consist of content words (ii), they are open-class.
Every content word contains at least one root (iv) which is a content (open-class) morpheme.
Affixes are functional (closed-class) morphemes.
New chemical names are formed by combining existing content morphemes with functional morphemes or adding new content morphemes.

OK? Still no objections?

Endings

First of all, let’s have a look at endings, also known as inflectional suffixes. In highly inflected languages such as Latin or Russian, endings change depending on number, gender and case. In Russian, there are three noun declensions:

Case	feminine (I)		neuter (II)		masculine (II)		feminine (III)
Case	singular	plural	singular	plural	singular	plural	singular	plural
Nominative	кислота	кислоты	основание	основания	спирт	спирты	соль	соли
Genitive	кислоты	кислот	основания	оснований	спирта	спиртов	соли	солей
Dative	кислоте	кислотам	основанию	основаниям	спирту	спиртам	соли	солям
Accusative	кислоту	кислоты	основание	основания	спирт	спирты	соль	соли
Instrumental	кислотой	кислотами	основанием	основаниями	спиртом	спиртами	солью	солями
Prepositional	кислоте	кислотах	основании	основаниях	спирте	спиртах	соли	солях
	acid	acids	base	bases	alcohol	alcohols	salt	salts
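The table above can be read as “stem plus ending”. As a rough sketch, the two columns with hard stems (кислота and спирт) can be regenerated from a stem and a list of endings. The endings below are simply read off the table; the names `ENDINGS` and `decline` are made up for this example, and the sketch covers only these two patterns, not Russian declension in general.

```python
CASES = [
    "Nominative", "Genitive", "Dative",
    "Accusative", "Instrumental", "Prepositional",
]

# (singular ending, plural ending) for each case, read off the table above.
ENDINGS = {
    "feminine (I)": [
        ("а", "ы"), ("ы", ""), ("е", "ам"),
        ("у", "ы"), ("ой", "ами"), ("е", "ах"),
    ],
    "masculine (II)": [
        ("", "ы"), ("а", "ов"), ("у", "ам"),
        ("", "ы"), ("ом", "ами"), ("е", "ах"),
    ],
}


def decline(stem: str, pattern: str) -> dict[str, tuple[str, str]]:
    """Attach the endings of the given pattern to an unchanging stem."""
    return {
        case: (stem + singular, stem + plural)
        for case, (singular, plural) in zip(CASES, ENDINGS[pattern])
    }


for case, (singular, plural) in decline("кислот", "feminine (I)").items():
    print(f"{case:14}{singular:12}{plural}")
for case, (singular, plural) in decline("спирт", "masculine (II)").items():
    print(f"{case:14}{singular:12}{plural}")
```

The stem stays the same throughout; only the ending carries the information about case and number.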

Step back

Those of you who have been reading my blog this year might have noticed that words such as “prefix”, “suffix” or “ending” are used extensively in chemical nomenclature. And those of my readers who remember (from their school days perhaps) the basics of morphology might also have been wondering whether these terms have anything to do with their counterparts in linguistics. That’s what happens when you use terms without defining them first.

Before moving any further with nomenclature, it could be helpful to clarify our terminology.

Alas, it looks like the task is more complex than I thought.

There is no perfect language

From The Information: A History, A Theory, A Flood by James Gleick:

It was once thought that a perfect language should have an exact one-to-one correspondence between words and their meanings. There should be no ambiguity, no vagueness, no confusion. Our earthly Babel is a falling off from the lost speech of Eden: a catastrophe and a punishment. “I imagine,” writes the novelist Dexter Palmer, “that the entries of the dictionary that lies on the desk in God’s study must have one-to-one correspondences between the words and their definitions, so that when God sends directives to his angels, they are completely free from ambiguity. Each sentence that He speaks or writes must be perfect, and therefore a miracle.” We know better now. With or without God, there is no perfect language.

Leibniz thought that if natural language could not be perfect, at least the calculus could: a language of symbols rigorously assigned. “All human thoughts might be entirely resolvable into a small number of thoughts considered as primitive.” These could then be combined and dissected mechanically, as it were. “Once this had been done, whoever uses such characters would either never make an error, or, at least, would have the possibility of immediately recognizing his mistakes, by using the simplest of tests.” Gödel ended that dream.

On the contrary, the idea of perfection is contrary to the nature of language. Information theory has helped us understand that — or, if you are a pessimist, forced us to understand it.

Tuesday, August 26, 2014

Ogres are not like cakes

I was intrigued by the article in New Scientist which starts with the question, “Do you speak chemistry?” [1]. So much so that I asked my friend to send me the original paper [2] authored by the Bartosz Grzybowski group of Northwestern University in Evanston, Illinois. It is a curious read.

Don’t get me wrong. I have nothing against analogies. I love analogies. If the linguistic analogy works for chemistry, it’s fine by me. As long as everybody understands that it is just an analogy.

The authors try to “demonstrate that a natural language such as English and organic chemistry have the same structure in terms of the frequency of, respectively, text fragments and molecular fragments”. How do they do that? They start by looking at the maximum common substrings (MCS) found in 100 sentences randomly chosen from English Wikipedia.

Perhaps not surprisingly, the most common fragment of the sentences is “e”, followed by “a” and “o”.

That is surprising to me though, considering that only “a” is a word in English. I wouldn’t be surprised if it happened to be Spanish Wikipedia. Are the authors talking about letter frequency perchance? But the “top three” letters in English (from most to least common) are known to be E, T, A, while in Spanish they are E, A, O. Anyway, they show that the distribution of the fragments, whatever they are, follows a power law. Then they show that the distribution of the common molecular fragments, derived from a corpus of organic molecules, also follows a power law. Big deal: so do earthquake magnitudes, populations of cities and stock market crashes [3]. Cadeddu et al. do not seem to be bothered by that at all:

We have just shown that there exists a set of molecular fragments with which organic molecules can be described akin to a language.
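Whatever the fragments are, if they are substrings, single letters are bound to top the ranking. Here is a toy sketch of why. It is not the MCS procedure of the paper, just a plain count of the number of sentences in which each short substring occurs; the function name `substring_counts` and the three sample sentences are made up for this illustration.

```python
from collections import Counter


def substring_counts(sentences: list[str], max_len: int = 4) -> Counter:
    """For every alphabetic substring of up to max_len letters,
    count the number of sentences that contain it."""
    counts = Counter()
    for sentence in sentences:
        text = sentence.lower()
        seen = set()
        for start in range(len(text)):
            for length in range(1, max_len + 1):
                piece = text[start:start + length]
                # Skip truncated slices and anything with spaces or punctuation.
                if len(piece) == length and piece.isalpha():
                    seen.add(piece)
        counts.update(seen)  # each substring counted once per sentence
    return counts


sample = [
    "Ogres are not like cakes.",
    "Organic molecules are not like a language.",
    "The ants ate an apple.",
]
counts = substring_counts(sample)
print(counts.most_common(5))

# A substring can never occur in more sentences than any of its letters.
for piece, n in counts.items():
    assert n <= min(counts[letter] for letter in piece)
```

The final loop is the whole point: a longer fragment cannot be more common than the letters it is made of, so a list headed by single letters says very little about the structure of the language.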

So far so bad; whether you are a linguist, a computational chemist or an organic chemist, both methodology and conclusions of this paper are bound to make you cringe. So, my immediate reaction was to dismiss it altogether. Ogres are not like cakes. Organic molecules are not like a language. End of story.

But could it be that I am missing something? On the one hand, the language of chemistry (whether we are talking about trivial names, systematic names or graphical diagrams) is very much like any other language: a system of communication. On the other hand, the molecules themselves are not. Unless they are informational macromolecules. The message encoded in a single DNA molecule can be very much abstracted from its chemical structure. Without any doubt, the genetic code is a communication system, therefore it is a language, although not a man-made one.

It’s interesting that the authors view organic molecules as “sentences” rather than “words”; the latter would be the nomenclaturist’s approach. I guess it depends on your taste, or language preferences. Most systematic chemical names look alien in English but would fit rather nicely in German or Finnish. I personally view any chemical name as a noun phrase describing a corresponding molecular entity; a molecular entity itself is not a noun phrase. However, in natural languages, there is rarely any confusion regarding the boundaries of a word:

a word is the smallest element that may be uttered in isolation with semantic or pragmatic content (with literal or practical meaning).

In contrast, Grzybowski’s “words” are molecular fragments, which do not exist in isolation. It is also worth noting that in the world of biopolymers, say nucleic acids, each monomer (as complex as any of Grzybowski’s “sentences”) is often represented as a letter, while an entire bacterial genome (still a single DNA molecule) could be considered a War and Peace (or Crime and Punishment).

Cadeddu et al. further claim that the linguistic approach identifies the symmetry/repeat units in molecules such as α-cyclodextrin and porphyrin:

We emphasize that this is not a small feat given we have not even considered any (x, y, z) coordinates of the atoms making up these molecules and performed no linear-algebra analyses to find symmetries which, incidentally, can be a computationally intensive procedure involving manipulation of matrices.

I find this modest remark regarding the size of the “feat” within the body of a scientific article in a respectable journal really cute. Are the authors even aware that there are chemical similarity/substructure search engines? You don’t need atomic coordinates to identify the fragments with the same connectivity.
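To illustrate that connectivity alone is enough, here is a minimal sketch that groups the atoms of a molecular graph into candidate equivalence classes by iterative label refinement, the general idea behind Morgan-style atom numbering. It is an assumption-laden toy, not the algorithm of any particular search engine: bond orders and hydrogens are ignored, the function name `equivalence_classes` is made up, and this kind of refinement is known to miss distinctions in some highly regular graphs.

```python
def equivalence_classes(elements: list[str], bonds: list[tuple[int, int]]) -> list[list[int]]:
    """Group atoms by refining labels from connectivity only (no coordinates)."""
    neighbours = {atom: [] for atom in range(len(elements))}
    for a, b in bonds:
        neighbours[a].append(b)
        neighbours[b].append(a)

    labels = list(elements)  # start with element symbols as labels
    n_classes = len(set(labels))
    while True:
        # New signature: own label plus the sorted labels of the neighbours.
        signatures = [
            (labels[atom], tuple(sorted(labels[n] for n in neighbours[atom])))
            for atom in range(len(elements))
        ]
        ranking = {sig: rank for rank, sig in enumerate(sorted(set(signatures)))}
        new_labels = [ranking[sig] for sig in signatures]
        if len(set(new_labels)) == n_classes:
            break  # no class was split, so the partition is stable
        labels, n_classes = new_labels, len(set(new_labels))

    classes = {}
    for atom, label in enumerate(new_labels):
        classes.setdefault(label, []).append(atom)
    return sorted(classes.values())


ring = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)]
# A plain six-membered carbon ring: every atom sits in the same environment.
print(equivalence_classes(["C"] * 6, ring))
# The same ring with one extra carbon on atom 0 (methylcyclohexane skeleton).
print(equivalence_classes(["C"] * 7, ring + [(0, 6)]))
```

The only inputs are element symbols and a list of bonds; no (x, y, z) coordinates and no matrices are involved.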

Which brings me to the final point. What is “chemical linguistics” anyway? If the “words” of chemistry, as postulated in [2], are nothing else but molecular fragments, or substructures, then chemoinformaticians have been doing substructure searches of chemical databases for donkey’s years without knowing that it is called chemical linguistics. I am aware of a completely different use of this term, in the sense of “mining of natural language texts for chemical information” [4, 5]. This latter use is well established, and I think applying the name “chemical linguistics” to an unrelated area will only confuse everybody.

1. Aron, J. (2014) Language of chemistry is unveiled by molecular make-up. New Scientist no. 2981, p. 8.
2. Cadeddu, A., Wylie, E.K., Jurczak, J., Wampler-Doty, M. and Grzybowski, B.A. (2014) Organic chemistry as a language and the implications of chemical linguistics for structural and retrosynthetic analyses. Angewandte Chemie 126, 8246–8250.
3. Buchanan, M. (2000) Ubiquity, Weidenfeld & Nicolson, London.
4. Goebels, L., Grotz, H., Lawson, A.L., Roller, S. and Wisniewski, J. (2005) Method and software for extracting chemical data. Patent DE 102005020083 A1.
5. Day, N.E., Corbett, P.T. and Murray-Rust, P. (2007) Semantic chemical publishing. ACS National Meeting #233, Chicago.

Metallome

Monday, August 12, 2024

Ants, apples, amber

Sunday, June 23, 2024

Oxoacids and their anions

Saturday, January 09, 2021

Irregularity and suppletion

Tuesday, December 29, 2020

Stems, roots, bases

Saturday, December 12, 2020

Suffixes — or combining forms?

Thursday, October 29, 2020

Prefixes — or combining forms?

Sunday, October 18, 2020

Content morphemes

Thursday, October 08, 2020

Endings

Thursday, October 01, 2020

Step back

Wednesday, June 10, 2015

There is no perfect language

Tuesday, August 26, 2014

Ogres are not like cakes
