  • Ns in hg19 have double meaning?

    This is my first post here. I tried googling this first but got nowhere, and I'm hoping someone will be kind enough to help so that I don't have to read a dozen papers to maybe find the answer. Thanks in advance.

    In hg19 there are many runs of several thousand Ns. I'm not asking about those. There are also several instances of a single N sitting in the middle of several thousand A, C, G, or T bases. For those instances, does "N" mean "highly variable" (as in each letter has a 25% chance of being there, depending on the specific individual)? Or is it literally the case that there is one single "unknown" nucleotide sitting in a sea of thousands of "known" nucleotides? As a layperson, this latter case makes no sense to me. If it _is_ the case, could someone please explain?

    Also, most of the larger N regions have lengths with only 2-3 significant digits. But there is one chromosome where the lengths of the N regions are given to 7-8 significant digits. Is this even possible? Science has no idea what the actual nucleotides are, but knows that there are _exactly_ 7,263,972 of them?

    If there are good papers that deal with either of these issues I'd be happy to read those instead of having you type in a long explanation... I just can't find any existing discussion of these two issues.

    Thank you!!!!!

  • #2
    Long stretches of Ns can indicate things like the chromosome's centromere or telomeres, or other highly repetitive content that is impossible to assemble or even count accurately (like Alu elements). The exact number is not necessarily important, but may be a best estimate based on various data like microscopy and linkage disequilibrium; I'm not sure how they decide for the huge ones. In our de novo assemblies with long mate-pair data (usually 8 kbp), scaffolds sometimes get a fairly accurate number of Ns in a gap based on the average insert size, even though the bases themselves are either too repetitive or too low-coverage to assemble. The assembler will insert some specific number of Ns, rather than rounding to 1 significant figure, even though the estimate may only be accurate to 1 significant figure. The human reference was assembled with different technology, though.
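To see how the isolated Ns and the long gaps are distributed in a given sequence, the runs can simply be enumerated. The following is a minimal sketch, not part of any tool mentioned in this thread; the function names and the 1,000-base threshold for a "long" gap are arbitrary assumptions, and the toy string stands in for a real chromosome.

```python
import re
from typing import Iterator, List, Tuple

def n_runs(seq: str) -> Iterator[Tuple[int, int]]:
    """Yield (start, length) for every maximal run of N/n in seq (0-based start)."""
    for m in re.finditer(r"[Nn]+", seq):
        yield m.start(), m.end() - m.start()

def summarize(seq: str, long_gap: int = 1000) -> Tuple[List[Tuple[int, int]], List[Tuple[int, int]]]:
    """Split N runs into isolated single Ns and long gaps.

    long_gap is an arbitrary threshold chosen for illustration only.
    """
    runs = list(n_runs(seq))
    singles = [r for r in runs if r[1] == 1]
    gaps = [r for r in runs if r[1] >= long_gap]
    return singles, gaps

# Toy sequence: one isolated N, then a 1,500-base gap.
toy = "ACGT" * 5 + "N" + "ACGT" * 5 + "N" * 1500 + "ACGT"
singles, gaps = summarize(toy)
print(singles)  # [(20, 1)]
print(gaps)     # [(41, 1500)]
```

Run over each chromosome of a FASTA file, this gives the spectrum of N-run lengths directly.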

    I assume single Ns represent bases that could not be called during sequencing, since SNP locations in the human reference are usually given a called base regardless of SNP frequency, even if 99% of the population has the non-ref allele. NCBI does not allow degenerate IUPAC symbols, though, only Ns. So if you know a site is either A or G, you can't submit an assembly with the symbol 'R' there; it has to be 'N'.
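As an illustration of what collapsing degenerate symbols to N looks like in practice, here is a small sketch. It only shows the string transformation; the function name is hypothetical, and it says nothing about how any actual submission pipeline works.

```python
# IUPAC ambiguity codes other than N (R = A or G, Y = C or T, and so on).
IUPAC_AMBIGUOUS = set("RYSWKMBDHV")

def collapse_to_n(seq: str) -> str:
    """Replace IUPAC ambiguity codes with N, preserving case.

    A, C, G, T, and N are left untouched.
    """
    out = []
    for ch in seq:
        if ch.upper() in IUPAC_AMBIGUOUS:
            out.append("N" if ch.isupper() else "n")
        else:
            out.append(ch)
    return "".join(out)

print(collapse_to_n("ACGRTTyA"))  # ACGNTTnA
```

Note that the transformation loses information: after it, a site known to be A or G is indistinguishable from a site that could not be called at all.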



    • #3
      Thanks Brian. Sounds like if I want to align samples against these sections I either need to say "matches anything" or push an ACGT into that "hole".

      I've seen that there's actually a fairly smooth spectrum in terms of size of N regions. Any thoughts on how many N's before "matches anything" turns into "we don't know"? Given what I assume has been a massive investment in getting to hg19 it seems odd that NCBI wouldn't want to distinguish between what is "known" and "not known".



      • #4
        Originally posted by majulier
        Thanks Brian. Sounds like if I want to align samples against these sections I either need to say "matches anything" or push an ACGT into that "hole".

        Well... you can't align to the N-only regions, of course, but a single N won't cause a problem. It depends on the aligner, but BBMap, for example, penalizes a position where the ref contains an N less than a position where the ref is A, C, G, or T and does not match the read. So leaving Ns as Ns is usually best. Sequences that map to the edge of a large poly-N region can also extend into the Ns somewhat. Ns are more like "matches nothing" than "matches anything"; aligners won't place a sequence entirely on Ns.

        Originally posted by majulier
        Any thoughts on how many N's before "matches anything" turns into "we don't know"?

        I'm not quite sure what you're asking there...?

        Originally posted by majulier
        Given what I assume has been a massive investment in getting to hg19 it seems odd that NCBI wouldn't want to distinguish between what is "known" and "not known".

        NCBI has some strange rules. I assume they exist to make the files easier to process with some of the software NCBI uses, which perhaps doesn't understand symbols other than ACGTN.
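The idea that a reference N costs less than a true mismatch, yet is still worse than a match, can be sketched with a toy ungapped scorer. The score values below are made-up assumptions for illustration and are not BBMap's actual parameters; the function name is hypothetical as well.

```python
MATCH = 1      # assumed value
MISMATCH = -3  # assumed value
REF_N = -1     # assumption: milder than a mismatch, still worse than a match

def ungapped_score(read: str, ref: str, pos: int) -> int:
    """Score read against ref[pos:pos+len(read)] with no indels."""
    window = ref[pos:pos + len(read)]
    if len(window) < len(read):
        raise ValueError("read runs off the end of the reference")
    score = 0
    for r, g in zip(read.upper(), window.upper()):
        if g == "N":
            score += REF_N      # unknown reference base
        elif r == g:
            score += MATCH
        else:
            score += MISMATCH   # known reference base that disagrees
    return score

ref = "ACGTNCGTAA" + "N" * 8
print(ungapped_score("ACGTACGT", ref, 0))           # 6: seven matches, one ref N
print(ungapped_score("ACGTACGT", "ACGTTCGTAA", 0))  # 4: seven matches, one mismatch
print(ungapped_score("ACGTACGT", ref, 10))          # -8: placed entirely on Ns
```

With any scheme of this shape, a placement made up entirely of Ns scores negative, which matches the "matches nothing" description: a single N is tolerated, but a read is never pulled into a gap.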



        • #5
          Brian, thanks again. I'm trying to write an aligner from scratch. (Because the world really needs yet another one.) I work for a "CPU" company and am interested in understanding what the processing requirements are for doing genome sequencing such that we could help improve the state of the art. Given that there is an explosion (coming) in the amount of sequencing that can be done it seems like a good idea to try to improve on the downstream ecosystem (aligning, variant calling...) as well. To be clear, I'm just trying to improve the processing efficiency.
          From a computing perspective, the Ns in hg19 and in the samples are... problematic. I know quite clearly what to do with a region of 31M Ns in hg19. I now have a better idea what to do with 1-10 Ns. But, for now, I'm assuming I'm looking at paired-end reads where the ends are ~500 bp apart and I've got ~100 bp of nucleotides from each end. So, if I've got an end that maps perfectly to only one spot but its mate "should" (based on the 500 bp parameter) sit in what is a region of Ns in hg19, should I go ahead and map it there? Or should I go find a completely different alignment that fits best based on just that end? (I don't really like the latter option as it "throws away" the information that this read is supposed to be +/- 500 bp from the other read.)
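Detecting the situation described here, a mate whose expected location falls in an N region, comes down to an interval lookup. The sketch below assumes the gaps are already known as sorted, non-overlapping half-open intervals, that the mapped read is on the plus strand, and that "~500 bp apart" means the outer fragment length; the function names and the `slack` parameter are hypothetical.

```python
from bisect import bisect_right
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end)

def overlaps_gap(gaps: List[Interval], start: int, end: int) -> bool:
    """Return True if [start, end) overlaps any gap.

    gaps must be sorted by start and non-overlapping.
    """
    starts = [g[0] for g in gaps]  # a real aligner would precompute this once
    i = bisect_right(starts, start) - 1  # last gap starting at or before start
    if i >= 0 and gaps[i][1] > start:
        return True
    # Otherwise the next gap may begin inside the window.
    return i + 1 < len(gaps) and gaps[i + 1][0] < end

def expected_mate_window(pos: int, read_len: int = 100,
                         fragment: int = 500, slack: int = 50) -> Interval:
    """Window where the mate of a plus-strand read starting at pos should lie."""
    lo = pos + fragment - read_len - slack
    hi = pos + fragment + slack
    return max(lo, 0), hi

gaps = [(10_000, 60_000)]  # toy gap, not real hg19 coordinates
print(overlaps_gap(gaps, *expected_mate_window(9_700)))  # True
print(overlaps_gap(gaps, *expected_mate_window(5_000)))  # False
```

What to do once the check returns True (keep the read as a singleton, or search elsewhere) is the policy question being asked; the lookup itself is cheap either way.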



          • #6
            majulier,

            At UTSW with SOLiD 4 machines, at least half of our reads had 3+ errors; at the time, BioScope couldn't correctly interpret more than 2 errors, which inspired me to write a new aligner. And until aligning is perfect and perfectly efficient, there's always room for another new aligner!

            Anyway, with regard to your question: I won't claim that this is ideal, but BBMap decides the optimal alignment of a read by (100%*score)+(25%*(mate score)), adjusted by mate-pair distance, such that the full +25% of the mate score is added if the mate is at the average distance, and the bonus decreases as the distance diverges from the average. In a normal fragment library, one read must be on the plus strand, the other must be on the minus strand, and the plus read must be to the left of the minus read. Otherwise the mapping is considered unpaired and no bonus is added. Nothing ever maps to Ns, so if one read maps perfectly and its mate "should" land on Ns, it is considered unpaired. HG19 is so complete that this is fine virtually all of the time. You could develop heuristics that handle the cases where one read aligns and its mate should land on nearby Ns, but HG19 has so few gaps that it's probably a waste of time. And I'm not an authority on this, but my impression is that areas adjacent to large N-filled gaps are generally unimportant.
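The scoring rule described above can be sketched as follows. This is not BBMap's code: the post only says the bonus decreases as the distance diverges from the average, so the linear decay, the 500 bp mean, the 250 bp tolerance, and all names here are assumptions made for illustration.

```python
from typing import Optional, Tuple

Placement = Tuple[int, str, float]  # (position, strand "+" or "-", alignment score)

def distance_weight(distance: float, mean: float = 500.0,
                    tolerance: float = 250.0) -> float:
    """1.0 at the mean distance, falling linearly to 0.0 at mean +/- tolerance.

    The linear shape is an assumption; only "decreasing" is stated in the post.
    """
    return max(0.0, 1.0 - abs(distance - mean) / tolerance)

def is_proper_pair(read: Placement, mate: Placement) -> bool:
    """Opposite strands, with the plus-strand read to the left of the minus-strand read."""
    if read[1] == mate[1]:
        return False
    plus, minus = (read, mate) if read[1] == "+" else (mate, read)
    return plus[0] < minus[0]

def combined_score(read: Placement, mate: Optional[Placement]) -> float:
    """100% of the read's score plus up to 25% of the mate's score.

    mate is None when the mate has no placement, e.g. it would land on Ns.
    """
    if mate is None or not is_proper_pair(read, mate):
        return float(read[2])  # unpaired: no bonus
    weight = distance_weight(abs(mate[0] - read[0]))
    return read[2] + 0.25 * mate[2] * weight

print(combined_score((1000, "+", 100.0), (1500, "-", 80.0)))  # 120.0
print(combined_score((1000, "+", 100.0), (1625, "-", 80.0)))  # 110.0
print(combined_score((1000, "+", 100.0), None))               # 100.0
```

Because the mate bonus is capped at 25%, a read that maps perfectly on its own keeps its placement even when its mate cannot be placed, which is the behavior described for mates that fall on Ns.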
