**Brian Bushnell** · 03-07-2014, 09:59 PM

Long stretches of Ns can indicate things like the chromosome's centromere or telomeres, or other highly repetitive content that is impossible to assemble or even count accurately (like Alu elements). The exact number of Ns is not necessarily meaningful, but may be a best estimate based on various data like microscopy and linkage disequilibrium; I'm not sure how they decide for the huge ones. In our de novo assemblies with long mate-pair data (usually 8 kbp), scaffolds sometimes get a fairly accurate number of Ns in a gap based on the average insert size, but the bases themselves are either too repetitive or too low-coverage to assemble. The assembler will insert some specific number of Ns, rather than rounding to 1 significant figure, even though the estimate may only be accurate to 1 significant figure. Human was done with different technology, though.

I assume single Ns represent bases that could not be called in sequencing, since SNP locations in the human reference are usually given a specific base regardless of allele frequency, even if 99% of the population has the non-ref allele. NCBI does not allow degenerate IUPAC symbols, though, only Ns. So if you know a site is either A or G, you can't submit an assembly with the symbol 'R' there; it has to be 'N'.
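As a rough illustration of that last point, here is a minimal Python sketch that collapses IUPAC ambiguity codes to N before submission. The function name `mask_ambiguity_codes` is made up for this example; it is not part of any NCBI tool, and it only shows what "R has to become N" means in practice.

```python
# IUPAC ambiguity codes other than N; each stands for two or three possible bases.
AMBIGUITY_CODES = "RYSWKMBDHV"

# Translation table: uppercase codes become "N", lowercase codes become "n",
# so soft-masking (lowercase) in the input is preserved.
_TO_N = str.maketrans(
    AMBIGUITY_CODES + AMBIGUITY_CODES.lower(),
    "N" * len(AMBIGUITY_CODES) + "n" * len(AMBIGUITY_CODES),
)


def mask_ambiguity_codes(seq: str) -> str:
    """Replace every degenerate IUPAC symbol with N, leaving A, C, G, T, N as-is."""
    return seq.translate(_TO_N)
```

For example, `mask_ambiguity_codes("ACRTGY")` turns the A-or-G site (R) and the C-or-T site (Y) into plain Ns, which is exactly the information loss described above.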

**majulier** · 03-07-2014, 10:20 PM

Thanks Brian. Sounds like if I want to align samples against these sections I either need to say "matches anything" or push an ACGT into that "hole".

I've seen that there's actually a fairly smooth spectrum in terms of size of N regions. Any thoughts on how many N's before "matches anything" turns into "we don't know"? Given what I assume has been a massive investment in getting to hg19 it seems odd that NCBI wouldn't want to distinguish between what is "known" and "not known".
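For anyone who wants to look at that size spectrum themselves, a minimal Python sketch along these lines would tabulate the runs of Ns in a sequence. The helper names `n_runs` and `run_length_histogram` are invented for this example, and the sequence is assumed to be already loaded into memory as a string (one chromosome at a time for something the size of hg19).

```python
import re
from collections import Counter


def n_runs(seq: str) -> list[tuple[int, int]]:
    """Return (start, length) for every maximal run of N/n in seq (0-based start)."""
    return [(m.start(), m.end() - m.start()) for m in re.finditer(r"[Nn]+", seq)]


def run_length_histogram(seq: str) -> Counter:
    """Count how many N runs there are of each length."""
    return Counter(length for _, length in n_runs(seq))
```

Calling `n_runs("ACNNNGTNAC")` reports one run of three Ns starting at position 2 and a single N at position 7; the histogram over a whole chromosome shows how smoothly the run sizes are distributed.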

**Brian Bushnell** · 03-07-2014, 10:51 PM

Originally posted by majulier:

Thanks Brian. Sounds like if I want to align samples against these sections I either need to say "matches anything" or push an ACGT into that "hole".

Well... you can't align to the N-only regions, of course, but a single N won't cause a problem. It depends on the aligner, but BBMap, for example, penalizes a position where the ref is N less than a position where the ref is A, C, G, or T and does not match the read. So, leaving Ns as Ns is usually best. Also, sequences that map to the edge of a large poly-N region can extend into the Ns somewhat. Ns are more like "matches nothing" than "matches anything"; aligners won't place a sequence entirely on Ns.
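To make the idea concrete, here is a minimal sketch of an ungapped scoring function in Python in which a reference N costs less than a real mismatch. The constants `MATCH`, `MISMATCH`, and `REF_N` and the function `score_ungapped` are illustrative assumptions only; they are not BBMap's actual values or code, and uppercase input is assumed.

```python
from typing import Optional

MATCH = 1       # assumed reward for a matching base
MISMATCH = -3   # assumed penalty when the ref is A/C/G/T and differs from the read
REF_N = -1      # assumed penalty for a ref N: milder than a mismatch


def score_ungapped(read: str, ref: str, offset: int) -> Optional[int]:
    """Score read placed at ref[offset:offset + len(read)] with no gaps.

    Returns None if the read runs off the reference or the window is all Ns.
    """
    window = ref[offset:offset + len(read)]
    if offset < 0 or len(window) < len(read):
        return None
    if all(base == "N" for base in window):
        return None  # nothing is ever placed entirely on Ns
    score = 0
    for read_base, ref_base in zip(read, window):
        if ref_base == "N":
            score += REF_N
        elif read_base == ref_base:
            score += MATCH
        else:
            score += MISMATCH
    return score
```

With these assumed numbers, a read overlapping a single reference N scores higher than the same read with one true mismatch, and a placement that lies wholly on Ns is rejected, which is the "matches nothing" behavior.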

Any thoughts on how many N's before "matches anything" turns into "we don't know"?

I'm not quite sure what you're asking there...?

Given what I assume has been a massive investment in getting to hg19 it seems odd that NCBI wouldn't want to distinguish between what is "known" and "not known".

NCBI has some strange rules. I assume it is to make the files easier to process with some of the software they use, that perhaps doesn't understand symbols other than ACGTN.

**majulier** · 03-08-2014, 09:27 AM

Brian, thanks again. I'm trying to write an aligner from scratch. (Because the world really needs yet another one.) I work for a "CPU" company and am interested in understanding what the processing requirements are for genome sequencing, so that we could help improve the state of the art. Given that there is an explosion (coming) in the amount of sequencing that can be done, it seems like a good idea to try to improve the downstream ecosystem (aligning, variant calling...) as well. To be clear, I'm just trying to improve the processing efficiency.

From a computing perspective, the N's in hg19 and the samples are... problematic. I know quite clearly what to do with a region of 31M N's in hg19. I now have a better idea what to do with 1-10 N's. But, for now, I'm assuming I'm looking at paired-end reads where the ends are ~500bp apart and I've got ~100bp of nucleotides from each end. So, if I've got an end that maps perfectly to only one spot but its mate "should" (based on the 500bp parameter) sit in what is a region of Ns in hg19, should I go ahead and map it there? Or, should I go find a completely different alignment that fits best based on just that end? (I don't really like the latter option as it "throws away" the information that this read is supposed to be +/- 500bp from the other read.)

**Brian Bushnell** · 03-08-2014, 11:08 PM

majulier,

At UTSW with SOLiD 4 machines, at least half of our reads had 3+ errors; at the time, BioScope couldn't correctly interpret more than 2 errors, which inspired me to write a new aligner. And until aligning is perfect and perfectly efficient, there's always room for another new aligner!

Anyway, with regard to your question: I won't claim that this is ideal, but BBMap decides the optimal alignment of a read by (100%*score)+(25%*(mate score)), adjusted by mate-pair distance, such that the full +25% of mate score is added if the mate is at average distance, decreasing as the distance diverges from average. In a normal fragment library, one read must be on the plus strand, the other must be on the minus strand, and the plus read must be to the left of the minus read. Otherwise the mapping is considered unpaired and no bonus is added. Nothing ever maps to Ns, so if one read maps perfectly and its mate "should" land on Ns, it is considered unpaired. hg19 is so complete that this is fine virtually all of the time. You could develop heuristics that handle the cases where one read aligns and its mate should land on nearby Ns, but hg19 has so few gaps that it's probably a waste of time. And I'm not an authority on this, but my impression is that areas adjacent to large N-filled gaps are generally unimportant.
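The pairing rule described above could be sketched roughly as follows in Python. This is not BBMap's code: the `Hit` class, the function names, and especially the linear fall-off in `distance_factor` (with its `mean` and `tolerance` parameters) are assumptions made for illustration, since the post only says the bonus decreases as the distance diverges from average. Only the 25% mate weight and the strand/orientation rule come from the description.

```python
from dataclasses import dataclass
from typing import Optional

MATE_WEIGHT = 0.25  # the "+25% of mate score" from the description above


@dataclass
class Hit:
    start: int    # leftmost reference coordinate of the alignment
    strand: str   # "+" or "-"
    score: float  # alignment score of this read on its own


def is_proper_pair(a: Hit, b: Hit) -> bool:
    """One read on plus, one on minus, and the plus read to the left."""
    plus, minus = (a, b) if a.strand == "+" else (b, a)
    return plus.strand == "+" and minus.strand == "-" and plus.start < minus.start


def distance_factor(distance: int, mean: int, tolerance: int) -> float:
    """1.0 at the mean distance, falling linearly to 0.0 at mean +/- tolerance.

    The linear shape is an assumption; any decreasing function would fit the text.
    """
    return max(0.0, 1.0 - abs(distance - mean) / tolerance)


def paired_score(read: Hit, mate: Optional[Hit],
                 mean: int = 500, tolerance: int = 500) -> float:
    """Read score plus a distance-scaled fraction of the mate score, if paired."""
    if mate is None or not is_proper_pair(read, mate):
        return read.score  # unpaired (e.g. the mate would land on Ns): no bonus
    distance = abs(mate.start - read.start)
    return read.score + MATE_WEIGHT * mate.score * distance_factor(distance, mean, tolerance)
```

In this sketch a mate that has no placement, for instance because its expected position is inside a poly-N gap, is passed as `None`, so the anchored read keeps its own score and is simply reported as unpaired.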

Ns in hg19 have double meaning?