Statistical tool finds ‘gaps’ in DNA datasets that shouldn’t be ignored

A simple statistical test shows that, contrary to current practice, the “gaps” within the protein and DNA sequence alignments commonly used in evolutionary biology can provide important information about nucleotide and amino acid substitutions over time. The finding could be particularly relevant to those studying distantly related species. The work appears in Proceedings of the National Academy of Sciences.

Biologists who study evolution do so by observing how DNA and protein sequences change over time. These changes can be sequence length changes (when specific nucleotides are removed or added at certain positions) or substitutions, where one type of nucleotide is exchanged for a different type at a given point.

“Think of the DNA sequence and its evolution as a sentence copied by different people over time,” says Jeff Thorne, professor of biological sciences and statistics at NC State and co-author of the research. “Over time, a letter in a word will change; this is a substitution. Omitting or adding letters or words corresponds to deletions or insertions.”

The first step that analysts usually take when looking at evolutionary changes in DNA is to construct a sequence alignment. This means figuring out how all the sequences correspond to each other and then lining up the corresponding positions in columns to compare them. However, due to substitutions, insertions, and deletions, the types of nucleotides within columns may vary between sequences or be absent altogether. When a sequence does not have a corresponding nucleotide, a gap is placed in the alignment column for that sequence.
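As a rough illustration of what an alignment looks like, the short Python sketch below stores three already-aligned toy sequences and reads them column by column. The species names, the sequences, and the helper functions `columns` and `describe` are made up for this example; they do not come from the study.

```python
# Toy alignment of three hypothetical sequences (not data from the study).
# "-" marks a gap: that sequence has no nucleotide at that position.
alignment = {
    "species_A": "ACGTTGCA",
    "species_B": "ACG-TGCA",
    "species_C": "ACGACG-A",
}

def columns(aln):
    """Return the alignment as a list of columns, one string per position."""
    return ["".join(col) for col in zip(*aln.values())]

def describe(col):
    """Label a column as 'gap', 'conserved', or 'substitution'."""
    if "-" in col:
        return "gap"
    return "conserved" if len(set(col)) == 1 else "substitution"

cols = columns(alignment)
labels = [describe(c) for c in cols]

# Print each position with its column and label.
for position, (col, label) in enumerate(zip(cols, labels), start=1):
    print(position, col, label)
```

In this toy case, positions 4 and 7 contain gaps, position 5 shows a substitution, and the remaining columns are identical across all three sequences.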

“Conventionally, when using sequence alignments for analysis, gaps within alignment columns are treated as missing data that do not provide information about substitutions,” says Thorne. “Historically, the research community has assumed that gap locations are independent of the substitution process. But what if that assumption is wrong?”

Thorne and his colleagues created a simple statistical test to assess whether the locations of the gaps are independent of the amino acid substitution process. They tested 1,390 different sets of sequence alignments and found that in about two-thirds of the sets, the usual assumption of independence between gap locations and amino acid substitution was rejected.
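The article does not describe how the published test works, so the sketch below is not the authors' method. It is a generic permutation test on a single made-up pairwise alignment, meant only to show what "testing independence between gap locations and substitutions" can look like in code. The sequences, the one-column `window`, and every function name are assumptions chosen for illustration, and the sketch ignores the evolutionary tree that a real analysis would account for.

```python
import random

def site_features(seq_a, seq_b, window=1):
    """For each ungapped column, record (is_near_gap, is_mismatch).

    A column counts as near a gap if any column within `window`
    positions of it contains a gap in either sequence.
    """
    n = len(seq_a)
    gapped = [a == "-" or b == "-" for a, b in zip(seq_a, seq_b)]
    feats = []
    for i in range(n):
        if gapped[i]:
            continue  # gap columns carry no residue pair to compare
        lo, hi = max(0, i - window), min(n, i + window + 1)
        feats.append((any(gapped[lo:hi]), seq_a[i] != seq_b[i]))
    return feats

def rate_difference(feats):
    """Mismatch rate near gaps minus mismatch rate away from gaps."""
    near = [m for g, m in feats if g]
    far = [m for g, m in feats if not g]
    if not near or not far:
        return 0.0
    return sum(near) / len(near) - sum(far) / len(far)

def permutation_p_value(feats, n_perm=2000, seed=0):
    """Shuffle mismatch labels across sites to simulate independence."""
    rng = random.Random(seed)
    observed = rate_difference(feats)
    near_flags = [g for g, _ in feats]
    mismatches = [m for _, m in feats]
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(mismatches)
        stat = rate_difference(list(zip(near_flags, mismatches)))
        if abs(stat) >= abs(observed):
            hits += 1
    # Add-one correction keeps the estimated p-value above zero.
    return observed, (hits + 1) / (n_perm + 1)

# Hypothetical pairwise alignment, invented for this example.
seq_a = "ACGTAT--CTTAGCATGCAA"
seq_b = "ACGTACGAGTTAGC-TGCAA"

feats = site_features(seq_a, seq_b)
observed, p = permutation_p_value(feats)
print("difference in mismatch rate:", observed)
print("permutation p-value:", p)
```

A small p-value here would suggest that mismatches cluster near gaps more than chance alone would explain, which is the kind of dependence the conventional treatment of gaps assumes away.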

“One possibility is that the locations of the gaps provide useful information about the process of amino acid substitution,” says Thorne. “If so, evolutionary biologists should develop better techniques to extract this information.”

The research also illustrated how the usual approach of constructing a sequence alignment and then basing evolutionary conclusions on this single optimal alignment can be problematic. What if the alignment is wrong? Even worse, what if the alignment is skewed?

For example, if substitutions occur more frequently than gaps, researchers tend to repeatedly choose substitutions over gaps when constructing the sequence alignment, and the resulting alignment may contain too few gaps overall. While such small alignment errors probably won’t affect results for closely related species, the bias can compound over evolutionary time, especially in comparisons between distantly related species, and produce errors that affect subsequent analyses.
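The effect of favoring substitutions over gaps can be seen in a small experiment. The sketch below uses the textbook Needleman-Wunsch global alignment with a linear gap penalty, a standard technique chosen here for illustration and not necessarily what the researchers used. The scoring values and the two four-letter sequences are assumptions picked so the result is easy to follow.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment with a linear gap penalty."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                out_a.append(a[i - 1])
                out_b.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))

# Same two toy sequences, aligned with a mild and a harsh gap penalty.
mild = global_align("ACGT", "CGTA", gap=-2)
harsh = global_align("ACGT", "CGTA", gap=-4)
print(mild)   # ('ACGT-', '-CGTA'): two gaps, three matching columns
print(harsh)  # ('ACGT', 'CGTA'): no gaps, four mismatched columns
```

With the harsher penalty, the aligner explains the difference between the two sequences as four substitutions and no gaps. If the true history was one deletion and one insertion, an analysis built on that alignment would overcount substitutions, which is the kind of skew described above.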

“Sometimes our best guesses are biased,” says Tae-Kun Seo, a senior researcher at the Korea Polar Research Institute and co-author of the research. “There is no simple solution, but we hope this study will help us be aware of potential pitfalls. We need to be aware of the problems with conventional statistical methods and work to fix them.”

Ben Redelings, a research scientist at Duke University and the University of Kansas, also contributed to the work.

More information:
“Correlations between alignment gaps and nucleotide substitution or amino acid substitution.” Proceedings of the National Academy of Sciences (2022). DOI: 10.1073/pnas.2204435119

Provided by North Carolina State University

Citation: Statistical tool finds ‘gaps’ in DNA datasets that shouldn’t be ignored (2022, August 16). Retrieved August 17, 2022, from /2022-08-statistical-tool-gaps-dna-shouldnt .html
