Tuesday, December 25, 2018

Split gene theory



The coding sequences of eukaryotic genes are split into exons, which are separated by introns. Because the split gene structure is central to eukaryotic biology, the question of how and why eukaryotic genes are split is an important one.

== Background ==
The genes of eukaryotic organisms, unlike those of bacteria, typically consist of short protein-coding regions (exons) interrupted by long intervening sequences (introns) [FIGURE - show split gene, →  Transcription (RNA Pol), → Splicing (Spliceosome), Translation (Ribosome) → Protein]. When a gene is expressed, its DNA sequence is copied into a “primary RNA” sequence. The “spliceosome” machinery then physically removes the introns from this RNA copy of the gene, leaving only a contiguous series of exons, which becomes the “messenger” RNA (mRNA). The mRNA is then “read” by another cellular machine, the “ribosome,” to produce the encoded protein. Thus, although introns are never physically removed from the gene itself, the gene’s sequence is read as if the introns did not exist.

Intron lengths vary widely within a genome, from about 10 bases to 500,000 bases (in the human genome, for example), whereas exon length has an upper limit of about 600 bases in most eukaryotic genes [REF]. Because exons code for protein sequences, they are very important for the cell, yet they constitute only ~2% of gene sequence. Introns, in contrast, constitute ~98% of gene sequence but appear to have few crucial functions, apart from rare instances in which they contain enhancer sequences or developmental regulators (3,4).

Until 1977, when Philip Sharp [REFs] of MIT and Richard Roberts [REFs], then at CSHL (currently at NEB), discovered that introns interrupt genes, it was believed that a gene contained its coding sequence in one stretch, bounded by a single Open Reading Frame (ORF) [FIGURE - contiguous coding gene - in the legend say one line - this type of genes are the norm in prokaryotic organisms]. The discovery that introns interrupt eukaryotic genes came as a profound surprise to scientists and immediately raised the questions of how, why and when introns came into being and gave rise to the split structure of genes. As more eukaryotic genes were sequenced, it became apparent that a typical gene is interrupted in many places by introns, which divide the coding sequence into many short exons. Also surprising was that introns can be very long, up to hundreds of thousands of bases (e.g., in the human genes NAME, NAME, NAME). These findings prompted the question not only of why introns arose in eukaryotic genes but also of why a single gene can contain so many introns (up to 200 in human genes, for example, in genes NAME, NAME, NAME), why introns are so long, and why exons are so short [ACTUAL FIGURE OF SYN1 OR ANOTHER GENE FROM EXORF].

The spliceosome machinery that splices the exons together and eliminates the introns from the primary RNA transcript was found to be very large and complex, with ~300 proteins and several snRNA molecules [REF], so the questions extended to the origin of the spliceosome as well. Soon after the discovery of introns, it also became apparent that the junctions between exons and introns exhibit specific sequences that direct the spliceosome machinery to the exact base position for splicing. How and why these splice junction signals came into being was another important question to be answered.

== Contrasting discussions ==
These questions prompted contrasting discussions in the literature almost immediately.

Were introns introduced when eukaryotic genes evolved from more ancient, intronless prokaryotic genes, or were eukaryotes the more ancient lineage, having evolved with introns from the start (5-9)?

'''Although he later retracted it, Dr. F. Doolittle’s view that the original structure of genes could have been the split version turned out to be correct. And James Darnell …'''

Apparently, none of these publications answered the questions of why and how introns and the split structure of genes originated, what splice junction sequences are, why exons are short and introns long, and why genomes are large.

== The Split-gene theory ==

=== '''The hypothesis''' ===
Around the same time introns were discovered, Dr. Senapathy was asking how genes themselves could have originated. He surmised that for any gene to come into being, there must have been genetic sequences (RNA or DNA) present in the prebiotic chemistry environment. A basic question he asked was how protein-coding sequences could have originated from primordial DNA sequences at the initial development of the very first cells.

To answer this, he made two basic assumptions: (i) before a self-replicating cell could come into existence, DNA molecules were synthesized in the primordial soup by random addition of the 4 nucleotides without the help of templates and (ii) the nucleotide sequences that code for proteins were selected from these preexisting DNA sequences in the primordial soup, and not by construction from shorter coding sequences. He also surmised that codons must have been established prior to the origin of the first genes. If primordial DNA did contain random nucleotide sequences, he asked: Was there an upper limit in the coding-sequence lengths, and, if so, did this limit play a crucial role in the formation of the structural features of genes at the very beginning of the origin of genes?

His logic was as follows. The average length of proteins in living organisms, both eukaryotic and bacterial, was ~400 amino acids (AAs). Much longer proteins, of 10,000 AAs or more, also existed in both groups. In bacterial genes, however, the coding sequence existed as a single stretch of 1,200 to 30,000 bases, whereas in eukaryotic genes it existed as short exon segments of approximately 120 bases each, regardless of the length of the protein. If ORFs in random DNA sequences could be as long as the contiguous genes of bacteria, then contiguous coding genes could have originated directly from random DNA. Because three of the 64 codons are stop codons, the average length of a coding sequence uninterrupted by a stop codon (defined as an ORF) would be very short, ~60 bases; the upper limit of ORF length, however, could conceivably be several thousand bases, matching the lengths of contiguous coding genes in bacteria. Whether this was so was not known, because the distribution of ORF lengths in a random DNA sequence had never been studied.
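The ~60-base average can be checked with a simple probability model. The sketch below is not taken from Senapathy's work; it assumes that the four bases are equally likely and independent, so that each codon in a reading frame is a stop codon with probability 3/64. The function names are illustrative only.

```python
STOP_CODONS = ("TAA", "TAG", "TGA")
P_STOP = len(STOP_CODONS) / 64  # 3 of the 64 possible codons are stops


def mean_orf_codons(p_stop=P_STOP):
    """Expected number of non-stop codons before the first stop codon."""
    return (1 - p_stop) / p_stop


def prob_orf_at_least(n_codons, p_stop=P_STOP):
    """Probability that a reading frame runs n_codons or more without a stop."""
    return (1 - p_stop) ** n_codons


if __name__ == "__main__":
    mean = mean_orf_codons()
    print(f"mean ORF length: {mean:.1f} codons ({3 * mean:.0f} bases)")
    for n in (50, 200, 400):
        print(f"P(ORF >= {n} codons) = {prob_orf_at_least(n):.2e}")
```

Under these assumptions the mean works out to 61/3 codons, or about 61 bases, and the probability of a stop-free run falls off geometrically with its length.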

=== '''Testing the hypothesis''' ===
Dr. Senapathy first analyzed the distribution of ORF lengths in computer-generated random DNA sequences. Surprisingly, this study revealed an upper limit of about 200 codons (600 bases) on ORF length (FIGURE 1). The shortest ORF (length zero) was the most frequent. As ORF length increased, frequency decreased exponentially, reaching almost zero at about 600 bases. A plot of the probability of ORF lengths in a random sequence (see FIGURE 2) showed the same pattern: the probability decreased exponentially with length and tailed off at a maximum of about 600 bases. This “negative exponential” distribution means that most ORFs are far shorter than even the 600-base maximum, lying close to zero length.
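A small simulation in the same spirit can be sketched as follows. It is not the original analysis: the sequence length, the random seed, the equal base frequencies and the choice to scan a single reading frame are all assumptions made for illustration, and the function names are hypothetical. An ORF is counted here as the run of non-stop codons between two consecutive in-frame stop codons, so adjacent stop codons give an ORF of length zero.

```python
import random
from collections import Counter

STOP_CODONS = {"TAA", "TAG", "TGA"}


def random_dna(n_bases, rng):
    """Random sequence in which the four bases are equally likely (an assumption)."""
    return "".join(rng.choices("ACGT", k=n_bases))


def orf_lengths(seq, frame=0):
    """Lengths, in codons, of stop-free runs between consecutive in-frame stops."""
    lengths = []
    run = 0
    seen_stop = False  # runs before the first stop are unbounded, so skip them
    for i in range(frame, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            if seen_stop:
                lengths.append(run)
            seen_stop = True
            run = 0
        else:
            run += 1
    return lengths


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    lengths = orf_lengths(random_dna(300_000, rng))
    counts = Counter(lengths)
    print("ORFs found:", len(lengths))
    print("mean length (codons):", sum(lengths) / len(lengths))
    print("longest ORF (codons):", max(lengths))
    for n in (0, 10, 50, 100):
        print(f"ORFs of exactly {n} codons:", counts[n])
```

Tabulating `counts` by length gives a histogram that can be compared with the negative exponential shape described above.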

This finding was surprising because it meant that the coding sequence for a protein of average length (400 AAs, ~1,200 bases), let alone for longer proteins of thousands of AAs (requiring >10,000 bases), would not occur as a single stretch in a random sequence. A typical gene with a contiguous coding sequence therefore could not originate in a random sequence. The only way a gene coding for a protein longer than 200 AAs could originate from a random sequence was for its coding sequence to be split into shorter segments, each selected from the short ORFs available in the random sequence. This would lead to a split structure of the gene.

If this hypothesis was true, eukaryotic DNA sequences should show evidence for it. When Senapathy plotted the distribution of ORF lengths in eukaryotic DNA sequences, the plot was remarkably similar to that from random DNA sequences. It was also a negative exponential distribution that tailed off at a maximum of about 600 bases. This finding was striking because the lengths of exons in eukaryotic genes also had a maximum of about 600 bases [REF], which coincided with the maximum ORF length observed in both random and eukaryotic DNA sequences. These findings indicated that split genes, with their exons and introns, likely originated from random DNA sequences as described above. The Nobel Laureate [[Marshall Warren Nirenberg|Dr. Marshall Nirenberg]], who deciphered the codons, stated that these findings strongly showed that the split gene theory for the origin of introns and the split structure of genes must be valid, and communicated the paper to PNAS.[[Shapiro - Senapathy Algorithm#cite%20note-%3A10-103|<sup>[103]</sup>]] New Scientist covered this publication in “A long explanation for introns”.[[Shapiro - Senapathy Algorithm#cite%20note-105|<sup>[105]</sup>]]

=== '''Origin of Splice junctions''' ===
The split gene theory thus suggested that genes with long coding sequences originated from random DNA sequences by selecting the best of the short coding segments (exons) and joining them by splicing. The intervening intron sequences were leftover vestiges of the random sequences and were therefore earmarked for removal by the spliceosome. Such a split-gene organization would have required a mechanism for recognizing ORFs. Because an ORF is defined as a contiguous coding sequence bounded by stop codons, this recognition system had to recognize the stop codons at its ends. The system could have defined exons by the presence of a stop codon at each end of an ORF. Introns should therefore contain a stop codon at their ends, which would be part of the splice junction sequences.

If this hypothesis was true, the split genes of today’s living organisms should contain stop codons exactly at the ends of introns. When Senapathy tested this in the splice junctions of eukaryotic genes, he found that almost all splice junctions did contain a stop codon at the ends of introns, immediately outside the exons. In fact, these stop codons were found to form part of the “canonical” AG:GT splicing sequence, with the three stop codons occurring within the strong consensus signals. Thus, the basic split gene theory led to the hypothesis that the splice junctions originated from the stop codons.[[Shapiro - Senapathy Algorithm#cite%20note-%3A11-104|<sup>[104]</sup>]]

Notably, all three stop codons (TGA, TAA and TAG) were found after one base (G) at the start of introns. These stop codons appear in the consensus canonical donor splice junction AG:GT(A/G)AGT, in which TAA and TGA are the stop codons in the consensus itself and TAG is additionally found at the same position. At the ends of introns, besides the codon CAG, only TAG, which is a stop codon, was found. The canonical acceptor splice junction is shown as (C/T)AG:GT, in which TAG is the stop codon. These consensus sequences clearly show the presence of stop codons at the ends of introns bordering the exons in all eukaryotic genes.  [[Marshall Warren Nirenberg|Dr. Marshall Nirenberg]], who was the referee for this paper, again stated that these observations fully supported the split gene theory for the origin of splice junction sequences from stop codons.[[Shapiro - Senapathy Algorithm#cite%20note-%3A11-104|<sup>[104]</sup>]] New Scientist covered this publication in “Exons, Introns and Evolution”.[[Shapiro - Senapathy Algorithm#cite%20note-106|<sup>[106]</sup>]]
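The positions of the stop codons within the consensus sequences can be checked mechanically. The sketch below is only an illustration: it writes the intronic side of the donor junction in its commonly cited form GT(A/G)AGT (encoded as `GTRAGT`) and the intronic end of the acceptor junction as (C/T)AG (encoded as `YAG`), and both spellings, like the function names, are assumptions of the example rather than data from the cited papers.

```python
from itertools import product

STOP_CODONS = {"TAA", "TAG", "TGA"}

# Ambiguity codes used here: R = A or G, Y = C or T.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "R": "AG", "Y": "CT"}


def expand(consensus):
    """All concrete sequences matched by a consensus with ambiguity codes."""
    return ["".join(bases) for bases in product(*(IUPAC[c] for c in consensus))]


def stop_codons_at(consensus, offset):
    """Stop codons found at a given offset across all expansions of a consensus."""
    triplets = {seq[offset:offset + 3] for seq in expand(consensus)}
    return sorted(triplets & STOP_CODONS)


DONOR_INTRON_START = "GTRAGT"  # assumed consensus: GT(A/G)AGT
ACCEPTOR_INTRON_END = "YAG"    # assumed consensus: (C/T)AG

if __name__ == "__main__":
    # One base (G) into the intron on the donor side.
    print("donor, offset 1:", stop_codons_at(DONOR_INTRON_START, 1))
    # Last three bases of the intron on the acceptor side.
    print("acceptor, offset 0:", stop_codons_at(ACCEPTOR_INTRON_END, 0))
```

With these inputs, the donor side yields TAA and TGA one base into the intron, and the acceptor side yields TAG alongside the non-stop codon CAG.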

=== '''Origin of Spliceosome''' ===
Senapathy proposes that the spliceosome originated at the same time as the split genes originated from random DNA sequences. His concept is that the genes for the spliceosomal proteins also originated from the random sequences.

This raises a chicken-and-egg problem, which the theory resolves by proposing that all of these genes already existed in random sequences. The first transcription and translation of these genes would have been carried out by enzymatic activities that arose in prebiotic chemistry in random polypeptides, RNA, and RNA-polypeptide complexes. There is much evidence for this.


from Wikipedia - New pages [en] http://bit.ly/2T8dIRj
via IFTTT
