  • Sawtooth base frequency, wavy insert size histograms.

    I am analyzing some NextSeq data and see odd patterns in the insert size and base composition histograms that I can't explain. The library is from a bacterium (M. ruber) and was fragmented by sonication to a target insert size of 270bp. The run was 2x151bp.

    The base composition graph concatenates read 1 and read 2, so positions 0-150 are read 1 and positions 151-301 are read 2. Each read has a sawtooth pattern for all bases, with a period of exactly 3bp.
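For anyone who wants to check their own data for the same 3bp sawtooth, here is a minimal sketch (not the tool that produced the graph above). It tallies per-position base fractions and then averages them by phase (position mod 3). The function names `base_composition` and `phase_means` are made up for illustration, and reads are assumed to be plain strings of A/C/G/T.

```python
from collections import Counter

def base_composition(reads):
    """Return one dict per read position mapping each base to its fraction."""
    length = max(len(r) for r in reads)
    counts = [Counter() for _ in range(length)]
    for read in reads:
        for i, base in enumerate(read):
            counts[i][base] += 1
    # Fractions are taken over all calls at that position (including any N).
    return [{b: c[b] / sum(c.values()) for b in "ACGT"} for c in counts]

def phase_means(values, period=3):
    """Average a per-position series separately for each phase (i mod period).

    Assumes the series has at least `period` entries.
    """
    sums = [0.0] * period
    ns = [0] * period
    for i, v in enumerate(values):
        sums[i % period] += v
        ns[i % period] += 1
    return [s / n for s, n in zip(sums, ns)]

# Toy input: both reads carry G at every third position.
reads = ["GCAGCAGCA", "GCTGCAGCC"]
comp = base_composition(reads)
g_by_phase = phase_means([p["G"] for p in comp])
print(g_by_phase)
```

If the three phase means for a base differ much more than they do in a library without the pattern, that is consistent with a period-3 signal in the reads themselves.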



    There's obviously a major problem with base-calling, as the A/T ratio is quite skewed, but putting that aside for now, has anyone seen the sawtooth pattern before? I saw it once on some MiSeq Nextera data as well, and could not explain it then, either. A second run on the NextSeq (on a fungus) does NOT have the sawtooth pattern, but still has the distorted A/T ratio. Bacterial genomes are mostly coding and the fungal genome is mostly noncoding, so I'm speculating that it could be a real artifact related to codon frequencies and nonrandom fragmentation sites rather than a software bug, but I'm not sure.

    Next, the insert size distribution also has a regular pattern, this one with a 10bp period.



    This pattern exists when the insert size is calculated using two independent methods, by mapping and by overlap (overlap is of course restricted to under 300bp). So I am confident that it's actually in the data and not a software problem; and furthermore, it's present in genomic reads, or else it would not show up on the mapping histogram. Has anyone seen that before?

  • #2
    I wonder what the read duplication rate and the number of reads are.


    • #3
      The duplication rate appears very low (considering it's only a ~3Mbp organism). Here's a plot of read uniqueness for the first 10M read pairs (out of 124M total pairs):



      To interpret this: each read is examined for its first 31-mer and a random 31-mer. These are added to a hashtable. If they were already present, the read is considered non-unique; otherwise, it is considered unique. Errors will inflate the apparent uniqueness. The cumulative ratio of unique vs non-unique reads is reported every 25k reads. The more nonuniform the library, the faster the value drops. There are multiple lines because I track "first" and "random" separately, and I also track read 1 and read 2 both separately and combined.
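The bookkeeping described above can be sketched roughly as follows. This is an illustration only, not the actual implementation: the function name `uniqueness_curve` is hypothetical, and it assumes separate tables for the "first" and "random" 31-mers, reads as plain strings, and no canonicalization of k-mers by strand.

```python
import random

def uniqueness_curve(reads, k=31, interval=25_000, seed=0):
    """Return (reads_seen, first_unique_fraction, random_unique_fraction) tuples.

    A read counts as unique for a given track if its k-mer has not been
    seen before on that track. Fractions are cumulative.
    """
    rng = random.Random(seed)
    seen_first, seen_random = set(), set()
    unique_first = unique_random = total = 0
    curve = []
    for read in reads:
        if len(read) < k:
            continue  # too short to hold a full k-mer
        total += 1
        first = read[:k]
        start = rng.randrange(len(read) - k + 1)
        rand = read[start:start + k]
        if first not in seen_first:
            seen_first.add(first)
            unique_first += 1
        if rand not in seen_random:
            seen_random.add(rand)
            unique_random += 1
        if total % interval == 0:
            curve.append((total, unique_first / total, unique_random / total))
    return curve

# Toy input: three of the four reads are identical.
toy = ["A" * 31, "A" * 31, "C" * 31, "A" * 31]
print(uniqueness_curve(toy, interval=2))
```

Running the same function on read 1 only, read 2 only, and both together would give the separate lines mentioned above.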

      The waviness here is probably due to some problem with the optics, correlating with individual image frames.


      • #4
        I would suggest first checking for sequencer faults, which the person running the machine should be able to do. If that is ruled out as a possible cause, I would look next at the library prep and its diversity. The waviness in base frequency looks similar to what I have seen with low-diversity mate pair libraries, where a library with fewer than 10M unique fragments was sequenced to 100s of millions of reads (though the frequency was larger than 3), and also with low-diversity amplicon libraries. Out of curiosity, how could the duplication rate be low? In a 3 Mb genome there are only 3M possible unique fragments (at least in this case, judging by the initial 100 bp). If this library is sequenced to a depth of 124M reads, there would be a high level of duplication.



        • #5
          Originally posted by nucacidhunter
          Out of curiosity, how could the duplication rate be low? In a 3 Mb genome there are only 3M possible unique fragments (at least in this case, judging by the initial 100 bp). If this library is sequenced to a depth of 124M reads, there would be a high level of duplication.
          So, this is a 2x151bp library; as expected, after 10M read pairs, the fraction of read 1 with a unique first 31-mer drops to around 35%. This is consistent with high uniqueness: if every starting location on the genome were used, you could only get up to around 31% uniqueness (the genome is actually about 3.09 Mbp). The fact that some reads have errors pushes it higher, to 35%, but it's still good.
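The ~31% ceiling is just genome size divided by read count. Here is a small sketch of that arithmetic, with function names invented for illustration. The second function is an addition, not something from the post above: it is the standard expectation if read starts were drawn uniformly at random, and like the ceiling it counts one distinct first 31-mer per genome position and ignores sequencing errors.

```python
import math

def max_unique_fraction(genome_size, n_reads):
    """Upper bound on the fraction of reads with a never-before-seen start."""
    return min(1.0, genome_size / n_reads)

def expected_unique_fraction(genome_size, n_reads):
    """Expected fraction if starts are sampled uniformly at random.

    Expected distinct starts after n draws from G sites: G * (1 - (1 - 1/G)**n).
    """
    distinct = genome_size * -math.expm1(n_reads * math.log1p(-1.0 / genome_size))
    return distinct / n_reads

G = 3.09e6   # genome size from the post, in bp
N = 10e6     # reads examined
print(round(max_unique_fraction(G, N), 3))       # 0.309
print(round(expected_unique_fraction(G, N), 3))  # 0.297
```

Under these assumptions the random-sampling expectation sits only slightly below the ceiling.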

          But there's also pair uniqueness, for which I use a hash of the middle 31-mer in read 1 and read 2. This represents the fraction of read pairs with a unique start+stop combination, and thus is a much better measure of library duplication rate. By that metric, of the first 10 million read pairs, 99% of them are unique, which indicates the library has a very low duplication rate. Though certainly if I extended the graph all the way to 124 million pairs I would expect that to drop a bit.
