A team of researchers from Germany, the USA, and Russia, including Dr. Mark Borodovsky, Chair of the Department of Bioinformatics at MIPT, has proposed an algorithm that automates the search for genes and makes it more efficient. The new development combines the advantages of the most advanced tools for working with genomic information.
The new technique will enable researchers to analyze DNA sequences faster and more accurately and to identify the full set of genes in a genome. Although the paper describing the algorithm appeared only recently in the journal Bioinformatics, which is published by Oxford Journals, the proposed technique has already proven very popular. The software has been downloaded by more than 1500 different centers and laboratories around the world. Tests of the algorithm have shown that it is considerably more accurate than other comparable algorithms.
The development belongs to the field of bioinformatics, a cross-disciplinary field of science. Bioinformatics combines mathematics, statistics, and computer science to study biological molecules such as DNA, RNA, and protein structures.
Bioinformatics is a very topical subject: each newly sequenced genome raises so many additional questions that researchers simply do not have enough time to answer them all. Specialists’ time, and the specialists themselves, are extremely valuable. This is why automating processes is key to the success of any bioinformatics project, and why such algorithms are essential for solving a wide variety of problems.
One of the most important areas of bioinformatics is genome annotation: determining which parts of the DNA molecule are used to synthesize RNA and proteins. These parts, the genes, are of great scientific interest. In many studies researchers do not need information about the entire DNA (which is around 2 meters long for a single human cell), but only about its most informative part, the genes. Gene regions are identified by searching for similarities between sequence fragments and known genes, or by detecting consistent patterns in the nucleotide sequence.
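The idea of comparing a sequence fragment with known genes can be illustrated with a very rough sketch. The function names `kmers` and `kmer_similarity` and the example sequences are hypothetical and chosen only for illustration. Practical annotation pipelines rely on proper sequence alignment, so this set-overlap measure is a crude stand-in, not the method used by any particular tool.

```python
def kmers(seq, k=3):
    """Return the set of all overlapping words of length k in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_similarity(fragment, known_gene, k=3):
    """Jaccard similarity between the k-mer sets of two sequences (0.0 to 1.0)."""
    a, b = kmers(fragment, k), kmers(known_gene, k)
    if not a or not b:
        return 0.0  # one of the sequences is shorter than k
    return len(a & b) / len(a | b)


# Hypothetical toy data: a "known gene" and two fragments to compare with it.
known = "ATGGCCGCCGGCTAA"
similar_fragment = "ATGGCCGCCGGC"
unrelated_fragment = "TTTTTTTTTTTT"

print(kmer_similarity(similar_fragment, known))
print(kmer_similarity(unrelated_fragment, known))
```

A fragment that shares most of its short words with a known gene scores close to 1, while an unrelated fragment scores close to 0.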
This is done using predictive algorithms. Locating gene regions is not a simple task, particularly in eukaryotic organisms, which include almost all widely known types of organism apart from bacteria. This is because in these cells the transfer of genetic information is complicated by “gaps” in the coding regions (introns), and because there are no definite indicators of whether a region is coding or not.
The algorithm proposed by the researchers determines which regions of the DNA are genes and which are not. A Markov chain (a sequence of random events in which the future depends on past events), with parameters learned from known genes, can be used for this. The states of the chain in this case are either nucleotides or nucleotide words (k-mers).
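As a minimal sketch of how a Markov chain over nucleotides can separate coding from non-coding sequence, the example below trains two first-order chains on toy examples and compares the log-likelihood each assigns to a new fragment. All names (`train_markov`, `log_likelihood`, `coding_score`) and the training sequences are assumptions made for illustration. This is far simpler than the models used in real gene finders and is not the BRAKER1 implementation.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"


def train_markov(sequences, order=1):
    """Estimate P(next nucleotide | previous `order` nucleotides).

    Returns a function prob(context, nucleotide) that uses add-one
    smoothing, so unseen transitions still get a non-zero probability.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for i in range(order, len(seq)):
            counts[seq[i - order:i]][seq[i]] += 1

    def prob(context, nucleotide):
        total = sum(counts[context].values()) + len(ALPHABET)
        return (counts[context][nucleotide] + 1) / total

    return prob


def log_likelihood(seq, prob, order=1):
    """Sum of log transition probabilities along the sequence."""
    return sum(math.log(prob(seq[i - order:i], seq[i]))
               for i in range(order, len(seq)))


def coding_score(seq, coding_prob, noncoding_prob, order=1):
    """Log-likelihood ratio: positive values favour the coding model."""
    return (log_likelihood(seq, coding_prob, order)
            - log_likelihood(seq, noncoding_prob, order))


# Hypothetical toy training data, not real genes.
coding_model = train_markov(["ATGGCCGCCGGC", "ATGGCGGCCGCC"])
noncoding_model = train_markov(["ATATATTTAATA", "TTTAAATATATA"])

for fragment in ("GCCGCC", "ATATAT"):
    score = coding_score(fragment, coding_model, noncoding_model)
    print(fragment, "coding-like" if score > 0 else "non-coding-like")
```

Raising `order` makes the chain condition on longer nucleotide words, which corresponds to using k-mers rather than single nucleotides as states.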
The aim is to classify genomic fragments in the best possible way according to their ability to encode proteins or RNA. Experimental data obtained from RNA provide additional useful information that can be used to train the model used in the algorithm. Certain gene prediction programs can use this information to improve the accuracy of gene finding. However, these algorithms require species-specific training of the model. For the AUGUSTUS software program, for instance, which is highly accurate, a training set of genes is required. This set can be obtained using another program, GeneMark-ET, which is a self-training algorithm. These two algorithms were combined in the BRAKER1 algorithm, which was proposed jointly by the developers of AUGUSTUS and GeneMark-ET.
BRAKER1 has shown a high level of efficiency. The program has already been downloaded by more than 1500 different centers and laboratories. Tests of the algorithm have shown that it is considerably more accurate than other similar algorithms.
An example running time of BRAKER1 on a single processor is ∼17.5 hours for training and gene prediction in a genome 120 megabases long. This is a good result, bearing in mind that this time could be reduced significantly by using parallel processors, which means that in the future the algorithm may be able to run faster and, on the whole, more efficiently.
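To put these figures in perspective, the short sketch below computes the single-processor throughput implied by the numbers above and a rough estimate of the running time on several processors using Amdahl's law. The parallel fraction of 0.9 and the processor count of 8 are purely hypothetical values chosen for illustration; they are not measurements of BRAKER1.

```python
def throughput_mb_per_hour(genome_mb, hours):
    """Megabases processed per hour on a single processor."""
    return genome_mb / hours


def amdahl_runtime(serial_hours, parallel_fraction, n_processors):
    """Estimated runtime under Amdahl's law.

    Only `parallel_fraction` of the work is sped up by adding processors;
    the remainder still runs serially.
    """
    return serial_hours * ((1 - parallel_fraction)
                           + parallel_fraction / n_processors)


# Figures from the text: 120 megabases in about 17.5 hours.
print(round(throughput_mb_per_hour(120, 17.5), 2), "Mb per hour")

# Hypothetical assumption: 90% of the work parallelizes, 8 processors.
print(round(amdahl_runtime(17.5, 0.9, 8), 2), "hours (estimate)")
```

Under these assumed values the estimate drops from about 17.5 hours to under 4 hours. The actual gain depends on how much of the pipeline can really run in parallel.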
Tools such as these help to solve a wide range of problems. Accurate annotation of genes in a genome is important; one example is the worldwide 1000 Genomes Project, the initial results of which have already been published. The project was launched in 2008 and involves scientists from 75 different laboratories and organizations. As a result, sequences of rare gene variants and gene substitutions were found, some of which can cause disease.
When diagnosing genetic diseases, it is very important to know which substitutions in gene regions cause the disease to develop. Under the project, the genomes of various individuals are mapped, particularly their coding regions, and rare nucleotide substitutions are identified. In the future, this will help specialists to analyze complex diseases such as diabetes, heart disease, and cancer. BRAKER1 enables researchers to work effectively with the genomes of new organisms, speeding up genome annotation and the acquisition of essential knowledge in the life sciences.