By MIKE MAGEE
Not surprisingly, my nominee for “word of the year” involves AI, and specifically “the language of human biology.”
As Eliezer Yudkowsky, the founder of the Machine Intelligence Research Institute and coiner of the term “friendly AI,” stated in Forbes:
“Anything that could give rise to smarter-than-human intelligence—in the form of Artificial Intelligence, brain-computer interfaces, or neuroscience-based human intelligence enhancement – wins hands down beyond contest as doing the most to change the world. Nothing else is even in the same league.”
Perhaps the simplest way to begin is to say that “missense” is a form of misspeak or expressing oneself in words “incorrectly or imperfectly.” But in the case of “missense”, the language is not made of words, where (for example) the meaning of a sentence would be disrupted by misspelling or choosing the wrong word.
With “missense”, we’re talking about a different language – the language of DNA and proteins. Specifically, the focus is on how the four base units or nucleotides that provide the skeleton of a strand of DNA communicate instructions for each of the 20 different amino acids in the form of 3-“letter” codes or “codons.”
In this language, there are four nucleotides. Each nucleotide (named for its base: adenine, guanine, cytosine or thymine) is a 3-part molecule that includes a nucleobase, a 5-carbon sugar and a phosphate group. The four nucleotides’ unique chemical structures create two “base-pairs.” Adenine links to Thymine through two hydrogen bonds, and Cytosine links to Guanine through three hydrogen bonds. A-T and C-G bonds effectively “reach across” two strands of DNA to connect them in the familiar “double-helix” structure. Each strand gains length as the sugar of one nucleotide joins to the phosphate group of the next.
The A’s and T’s and C’s and G’s are the starting points of a code. A string of three, for example A-T-G, is called a “codon”, which in this case stands for Methionine, one of the 20 amino acids common to all life forms. There are 64 different codons – 61 direct the addition of one of the 20 amino acids to the chain (most amino acids are specified by more than one codon), and the remaining 3 codons serve as “stop codons” to end a protein chain.
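For readers who like to check the arithmetic, the codon counts above can be reproduced in a few lines of Python. This is a minimal sketch that uses the standard genetic code in its usual compact form (one-letter amino acid symbols, with `*` marking a stop codon). The variable names are illustrative only and are not taken from any AlphaFold or AlphaMissense software.

```python
from collections import Counter
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G base order.
# Each character is a one-letter amino acid symbol; "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Pair every three-letter combination of bases with its amino acid symbol.
CODON_TABLE = {
    "".join(triplet): amino_acid
    for triplet, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

stop_codons = sorted(c for c, aa in CODON_TABLE.items() if aa == "*")
sense_codons = [c for c, aa in CODON_TABLE.items() if aa != "*"]
codons_per_amino_acid = Counter(aa for aa in CODON_TABLE.values() if aa != "*")

print(len(CODON_TABLE))            # 64 codons in total
print(len(sense_codons))           # 61 codons that add an amino acid
print(stop_codons)                 # ['TAA', 'TAG', 'TGA']
print(len(codons_per_amino_acid))  # 20 distinct amino acids
print(CODON_TABLE["ATG"])          # M (methionine)
```

The `codons_per_amino_acid` counter also shows the duplication: methionine (M) has a single codon, while leucine (L) has six.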
Messenger RNA (mRNA) carries a complementary copy of the coded nucleotide base string (with uracil in place of thymine) from the cell nucleus to ribosomes out in the cytoplasm of the cell. Codons then call up each amino acid, and the amino acids, linked together, form the protein. The protein’s structure is defined by the specific amino acids included and their order of appearance. Protein chains fold spontaneously, and in the process form a 3-dimensional structure that determines their biologic functions.
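The copying step can be sketched in Python as well. The example below is a toy under stated assumptions: the template strand is a short made-up sequence written in the 3'-to-5' direction, and the function names are hypothetical. It shows only how base pairing turns a DNA template into an mRNA message that is then read three letters at a time.

```python
# Base on the DNA template strand -> RNA base paired with it during copying.
# RNA uses uracil (U) where DNA would use thymine (T).
DNA_TO_RNA = {"A": "U", "T": "A", "C": "G", "G": "C"}


def transcribe(template_3_to_5):
    """Build the mRNA (read 5'->3') from a template strand written 3'->5'."""
    return "".join(DNA_TO_RNA[base] for base in template_3_to_5)


def split_into_codons(mrna):
    """Split an mRNA string into consecutive three-letter codons."""
    usable_length = len(mrna) - len(mrna) % 3  # ignore any trailing partial codon
    return [mrna[i:i + 3] for i in range(0, usable_length, 3)]


template = "TACCACGTAGACTGA"  # hypothetical template strand, written 3'->5'
mrna = transcribe(template)

print(mrna)                     # AUGGUGCAUCUGACU
print(split_into_codons(mrna))  # ['AUG', 'GUG', 'CAU', 'CUG', 'ACU']
```

The first codon, AUG, is the mRNA spelling of the A-T-G codon for Methionine.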
A mistake in a single letter of a codon can swap one amino acid for another, producing a mistaken message or “missense.” In 2018, DeepMind, the AI lab owned by Google’s parent company Alphabet, introduced AlphaFold, an artificial intelligence system able to predict protein structure from amino acid sequences, with the promise of accelerating drug discovery. Five years later, the company released AlphaMissense, which builds on AlphaFold to learn the new “protein language” much as large language models (LLMs) such as ChatGPT learn human language. The ultimate goal: to predict where “disease-causing mutations are likely to occur.”
A work in progress, AlphaMissense has already created a catalogue of possible human missense mutations, classifying 57% as likely harmless and 32% as likely linked to (still to be determined) human pathology, with the remainder left uncertain. The company has released much of its database publicly, and hopes it will accelerate the “analyses of the effects of DNA mutations and…the research into rare diseases.”
The numbers are not small. Believe it or not, the AI catalogue covers 71 million missense events that could theoretically occur in the 46-chromosome human genome. Up to now, only about 4 million have been observed in people. For humans today, the average genome includes roughly 9,000 of these mistakes, most of which have no bearing on life or limb.
But occasionally they do. Take, for example, sickle cell anemia. The painful and life-limiting condition is the result of a single-letter mistake in one codon (GTG instead of GAG) on the nucleotide chain that codes for the beta chain of the protein hemoglobin. That tiny error causes the 6th amino acid in the growing hemoglobin chain, glutamic acid, to be replaced by the amino acid valine. Knowing this, investigators have now used the gene-editing tool CRISPR (whose developers won the Nobel Prize in Chemistry in 2020) to treat the disease through autologous stem cell therapy.
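The sickle cell change can be labeled with a short Python sketch. It rests on a hand-picked subset of the standard genetic code (only the four codons needed here), and the function name is hypothetical. It is meant to illustrate the definition of “missense,” not to reproduce how AlphaMissense works.

```python
# Subset of the standard genetic code, limited to the codons used below.
CODON_TO_AMINO_ACID = {
    "GAG": "Glu",   # glutamic acid (normal 6th position in the example above)
    "GAA": "Glu",   # a second codon for glutamic acid
    "GTG": "Val",   # valine (the sickle cell substitution)
    "TAG": "Stop",  # one of the three stop codons
}


def classify_change(reference_codon, variant_codon):
    """Label a codon change by its effect on the encoded amino acid."""
    before = CODON_TO_AMINO_ACID[reference_codon]
    after = CODON_TO_AMINO_ACID[variant_codon]
    if before == after:
        return "synonymous"  # different spelling, same amino acid
    if after == "Stop":
        return "nonsense"    # the chain is cut short
    return "missense"        # one amino acid is swapped for another


for variant in ("GTG", "GAA", "TAG"):
    print("GAG ->", variant, classify_change("GAG", variant))
# GAG -> GTG missense
# GAG -> GAA synonymous
# GAG -> TAG nonsense
```

A label of “missense” says only that the amino acid changed. Whether that change is harmful is the much harder question that AlphaMissense tries to answer.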
As Michigan State University physicist Stephen Hsu said, “The goal here is, you give me a change to a protein, and instead of predicting the protein shape, I tell you: Is this bad for the human that has it? Most of these flips, we just have no idea whether they cause sickness.”
Patrick Malone, a physician researcher at KdT Ventures, sees AI on the march. He says this is “an example of one of the most important recent methodological developments in AI. The concept is that the fine-tuned AI is able to leverage prior learning. The pre-training framework is especially useful in computational biology, where we are often limited by access to data at sufficient scale.”
AlphaMissense creators believe their predictions may:
“Illuminate the molecular effects of variants on protein function.”
“Contribute to the identification of pathogenic missense mutations and previously unknown disease-causing genes.”
“Increase the diagnostic yield of rare genetic diseases.”
And of course, this cautionary note: The growing capacity to define and create life carries with it the potential to alter life. Which is to say, what we create will eventually change who we are, and how we behave toward each other.
Mike Magee MD is a Medical Historian and a regular THCB contributor. He is the author of CODE BLUE: Inside America’s Medical Industrial Complex (Grove/2020).