SEQanswers

Old 03-07-2014, 08:31 PM   #1
majulier
Junior Member
 
Location: Portland, Oregon, U.S.A.

Join Date: Mar 2014
Posts: 3
Ns in hg19 have double meaning?

This is my first post here. I tried to google this first but couldn't get anywhere with that, and am hoping someone will be kind enough to help me so that I don't have to read a dozen papers to maybe find the answer. Thanks in advance.

In hg19 there are many instances of several thousand Ns. I'm not asking about those. There also happen to be several instances of a single N sitting in the middle of several thousand A/C/G/T bases. For those instances, does "N" mean "highly variable" (as in each letter has a 25% chance of being there depending on the specific individual)? Or is it literally the case that there is one single "unknown" nucleotide sitting in a sea of thousands of "known" nucleotides? As a layperson, this latter case makes no sense to me. If it _is_ the case, could someone please explain?

Also, most larger regions of Ns seem to have 2-3 significant digits. But there is one chromosome where the N regions are indicated with 7-8 significant digits. Is this even possible? Science has no idea what the actual nucleotides are, but knows that there are _exactly_ 7,263,972 of them?

If there are good papers that deal with either of these issues I'd be happy to read those instead of having you type in a long explanation... I just can't find any existing discussion of these two issues.

Thank you!!!!!
Old 03-07-2014, 08:59 PM   #2
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707

Long stretches of Ns can indicate things like the chromosome's centromere or telomeres, or other highly repetitive content that is impossible to assemble or even count accurately (like Alu elements). The exact number is not necessarily important, but may be a best estimate based on various data like microscopy and linkage disequilibrium; I'm not sure how they decide for the huge ones. In our de novo assemblies with long mate pair data (usually 8 kbp), scaffolds sometimes get a fairly accurate number of Ns in a gap based on the average insert size, even though the bases themselves are either too repetitive or too low coverage to assemble. The assembler will insert some specific number of Ns, rather than rounding to 1 sig fig, even though it may only be accurate to 1 sig fig. Human was done with different technology, though.

I assume single Ns represent bases that could not be called in sequencing, since SNP locations are usually called in the human reference regardless of SNP frequency, even if 99% of the population has the non-ref allele. NCBI does not allow degenerate IUPAC symbols, though, only Ns. So if you know a site is either A or G, you can't submit an assembly with the symbol 'R' there; it has to be 'N'.
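
To make the 'R' versus 'N' point concrete, here is a minimal sketch of what collapsing degenerate symbols looks like. The IUPAC table is the standard nucleotide code; the function name `collapse_to_acgtn` is made up for illustration and is not part of any NCBI tool.

```python
# Standard IUPAC nucleotide codes and the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}

def collapse_to_acgtn(seq):
    """Hypothetical helper: keep A/C/G/T, turn every other symbol into N.

    This throws away information: 'R' (A or G) and a truly unknown base
    both end up as 'N', which is the ambiguity asked about in this thread.
    """
    return "".join(b if b in ("A", "C", "G", "T") else "N" for b in seq.upper())

print(collapse_to_acgtn("ACRGT"))  # the R (A or G) becomes N
```

After collapsing, a site known to be "A or G" can no longer be told apart from a base that simply could not be called.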
Old 03-07-2014, 09:20 PM   #3
majulier
Junior Member
 
Location: Portland, Oregon, U.S.A.

Join Date: Mar 2014
Posts: 3

Thanks Brian. Sounds like if I want to align samples against these sections I either need to say "matches anything" or push an ACGT into that "hole".

I've seen that there's actually a fairly smooth spectrum in terms of size of N regions. Any thoughts on how many N's before "matches anything" turns into "we don't know"? Given what I assume has been a massive investment in getting to hg19 it seems odd that NCBI wouldn't want to distinguish between what is "known" and "not known".
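
One way to look at that spectrum of N-region sizes is to list every maximal run of Ns in a sequence and tabulate the lengths. The following is only a sketch on a toy string; the function names are made up for illustration, and reading a real hg19 FASTA file (one chromosome at a time) is left out.

```python
import re
from collections import Counter

def n_runs(seq):
    """Yield (start, length) for each maximal run of N/n in seq (0-based start)."""
    for m in re.finditer(r"[Nn]+", seq):
        yield m.start(), m.end() - m.start()

def run_length_histogram(seq):
    """Count how many N runs there are of each length."""
    return Counter(length for _, length in n_runs(seq))

# Toy sequence: one single N, one run of five, one run of two.
toy = "ACGTNACGT" + "N" * 5 + "ACGNNT"
for start, length in n_runs(toy):
    print(start, length)
print(run_length_histogram(toy))
```

Run over each chromosome, the same idea would separate the isolated single Ns from the multi-megabase gaps.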
Old 03-07-2014, 09:51 PM   #4
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707

Quote:
Originally Posted by majulier
Thanks Brian. Sounds like if I want to align samples against these sections I either need to say "matches anything" or push an ACGT into that "hole".
Well... you can't align to the N-only regions, of course, but a single N won't cause a problem. It depends on the aligner, but BBMap, for example, penalizes a position where the ref is N less than a position where the ref is A, C, G, or T and does not match the read. So, leaving Ns as Ns is usually best. And sequences that map to the edge of a large poly-N region can extend into the Ns somewhat. Ns are more like "matches nothing" than "matches anything"; aligners won't place a sequence entirely on Ns.
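
As a rough illustration of "an N in the ref costs less than a mismatch", here is a toy ungapped scorer. The three score values are assumptions picked for the example, not BBMap's actual scoring constants, and real aligners also handle indels.

```python
# Hypothetical per-position scores, chosen only to illustrate the ordering
# MATCH > REF_N > MISMATCH; these are not BBMap's real values.
MATCH = 1
REF_N = -1
MISMATCH = -3

def ungapped_score(read, ref):
    """Score a read against an equally long reference window, no indels."""
    if len(read) != len(ref):
        raise ValueError("read and reference window must be the same length")
    score = 0
    for r, g in zip(read, ref):
        if g == "N":
            score += REF_N       # reference base unknown: mild penalty
        elif r == g:
            score += MATCH
        else:
            score += MISMATCH    # known reference base disagrees with the read
    return score

print(ungapped_score("ACGT", "ACGT"))  # perfect match
print(ungapped_score("ACGT", "ACNT"))  # one reference N
print(ungapped_score("ACGT", "ACTT"))  # one real mismatch
print(ungapped_score("ACGT", "NNNN"))  # all-N window scores below zero
```

With values like these, one N barely dents a good alignment, while a window made entirely of Ns always scores negative, which matches the "matches nothing" description.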
Quote:
Any thoughts on how many N's before "matches anything" turns into "we don't know"?
I'm not quite sure what you're asking there...?
Quote:
Given what I assume has been a massive investment in getting to hg19 it seems odd that NCBI wouldn't want to distinguish between what is "known" and "not known".
NCBI has some strange rules. I assume it is to make the files easier to process with some of the software they use, that perhaps doesn't understand symbols other than ACGTN.
Old 03-08-2014, 08:27 AM   #5
majulier
Junior Member
 
Location: Portland, Oregon, U.S.A.

Join Date: Mar 2014
Posts: 3

Brian, thanks again. I'm trying to write an aligner from scratch. (Because the world really needs yet another one.) I work for a "CPU" company and am interested in understanding what the processing requirements are for doing genome sequencing such that we could help improve the state of the art. Given that there is an explosion (coming) in the amount of sequencing that can be done it seems like a good idea to try to improve on the downstream ecosystem (aligning, variant calling...) as well. To be clear, I'm just trying to improve the processing efficiency.
From a computing perspective, the Ns in hg19 and the samples are... problematic. I know quite clearly what to do with a region of 31M Ns in hg19. I now have a better idea what to do with 1-10 Ns. But, for now, I'm assuming I'm looking at paired-end reads where the ends are ~500bp apart and I've got ~100bp of nucleotides from each end. So, if I've got an end that maps perfectly to only one spot but its mate "should" (based on the 500bp parameter) sit in what is a region of Ns in hg19, should I go ahead and map it there? Or should I go find a completely different alignment that fits best based on just that end? (I don't really like the latter option as it "throws away" the information that this read is supposed to be +/- 500bp from the other read.)
Old 03-08-2014, 10:08 PM   #6
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707

majulier,

At UTSW with SOLiD 4 machines, at least half of our reads had 3+ errors; at the time, BioScope couldn't correctly interpret more than 2 errors, which inspired me to write a new aligner. And until aligning is perfect and perfectly efficient, there's always room for another new aligner!

Anyway, with regard to your question: I won't claim that this is ideal, but BBMap decides the optimal alignment of a read by (100%*score)+(25%*(mate score)), adjusted by mate-pair distance, such that the full +25% of mate score is added if the mate is at average distance, decreasing as the distance diverges from average. In a normal fragment library, one read must be on the plus strand, the other must be on the minus strand, and the plus read must be to the left of the minus read. Otherwise the mapping is considered unpaired and no bonus is added. Nothing ever maps to Ns, so if one read maps perfectly and its mate "should" land on Ns, it is considered unpaired. HG19 is so complete that this is fine virtually all of the time. You could develop heuristics that handle the cases where one read aligns and its mate should land on nearby Ns, but HG19 has so few gaps that it's probably a waste of time. And I'm not an authority on this, but my impression is that areas adjacent to large N-filled gaps are generally unimportant.
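
The scoring rule described above could be sketched roughly as follows. This is not BBMap's code: the linear falloff, the `max_dev` cutoff, the default insert size of 500 (taken from the question), and all function names are assumptions made for illustration. Only the 100% + 25% weighting and the strand/order rule come from the description.

```python
def distance_weight(observed, mean_insert, max_dev):
    """Assumed linear falloff: 1.0 at the mean distance, 0.0 at max_dev away."""
    deviation = abs(observed - mean_insert)
    return max(0.0, 1.0 - deviation / max_dev)

def is_proper_pair(pos1, strand1, pos2, strand2):
    """One read on '+', the other on '-', and the '+' read to the left."""
    if strand1 == strand2:
        return False
    plus_pos, minus_pos = (pos1, pos2) if strand1 == "+" else (pos2, pos1)
    return plus_pos < minus_pos

def combined_score(score, mate_score, pos, strand, mate_pos, mate_strand,
                   mean_insert=500, max_dev=500):
    """Read score plus up to 25% of the mate's score.

    mate_score is None when the mate has no alignment, for example when it
    would have landed on Ns; the read is then scored as unpaired.
    """
    if mate_score is None or not is_proper_pair(pos, strand, mate_pos, mate_strand):
        return float(score)
    weight = distance_weight(abs(mate_pos - pos), mean_insert, max_dev)
    return score + 0.25 * mate_score * weight

print(combined_score(100, 100, 1000, "+", 1500, "-"))   # mate at average distance
print(combined_score(100, 100, 1000, "+", 1750, "-"))   # mate farther than average
print(combined_score(100, None, 1000, "+", None, None)) # mate would sit on Ns
```

In this sketch a read whose mate falls in an N gap simply keeps its own score, so a unique perfect hit is not discarded just because its mate is unplaceable.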