  • Read stacks in RNA-seq

    Hi all,

    In RNA-seq, after mapping (BOWTIE, unique hits only, max 2 mismatches), and annotating, we discovered that our reads generally fall in tall "stacks", as illustrated below:



    Does anyone have an explanation for this? Could it be the sequencing step (or PCR bias), or could it be an artefact of the bioinformatic analysis?

    Thanks in advance,

    JW, Uni of Copenhagen

  • #2
    Maybe I haven't had enough coffee yet, but it looks like they are mapping to exons, which is what you'd expect?


    • #3
      Yes, but wouldn't you expect an even distribution along exons instead of reads mapping in "towers"?


      • #4
        Perhaps it has to do with the 200 bp strands on the Solexa flowcell, of which only the ends get sequenced. So for exons in that size range, you would expect peaks near the exon ends?
        --
        bioinfosm


        • #5
          We also see this problem in our sequencing reads, though only a small proportion of the reads fall in stacks. I guess these are technical problems.


          • #6
            I am also seeing this in RNA-seq data sets. Is this due to the PCR amplification step?


            • #7
              Originally posted by bioinfosm:
              Perhaps it has to do with the 200 bp strands on the Solexa flowcell, of which only the ends get sequenced. So for exons in that size range, you would expect peaks near the exon ends?
              I don't agree. The RNA-seq protocol fragments full-length mRNA, so you wouldn't expect to see peaks at either end of the exon.

              It may be due to PCR bias. Do you see multiple copies of reads with the same start and stop coordinates? You'd have to give more info about your sample prep, though, e.g. input amount, number of PCR enrichment cycles used, etc.
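One way to check for reads sharing the same start and stop coordinates is to tally identical alignments. The following is only a minimal sketch: it assumes the mapped reads have already been parsed into `(chrom, start, end, strand)` tuples, and the function name `duplicate_summary` and the toy reads are hypothetical, not taken from any dataset in this thread.

```python
from collections import Counter


def duplicate_summary(reads):
    """Count identical alignments.

    reads: iterable of (chrom, start, end, strand) tuples (assumed input format).
    Returns the per-alignment counts and the fraction of reads that are
    extra copies of an alignment already seen.
    """
    counts = Counter(reads)
    total = sum(counts.values())
    distinct = len(counts)
    dup_fraction = 1 - distinct / total if total else 0.0
    return counts, dup_fraction


# Toy data for illustration only.
reads = [
    ("chr1", 100, 136, "+"),
    ("chr1", 100, 136, "+"),
    ("chr1", 100, 136, "+"),
    ("chr1", 101, 137, "+"),
    ("chr1", 250, 286, "-"),
]

counts, dup_fraction = duplicate_summary(reads)
for alignment, n in counts.most_common(3):
    print(alignment, n)
print(f"duplicate fraction: {dup_fraction:.2f}")
```

A high duplicate fraction on its own does not prove PCR bias, since a very highly expressed transcript can saturate the available start positions, so the number is best read alongside the sample prep details.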


              • #8
                I am looking at a public Illumina dataset. I didn't generate the data, but the methods say they do 20 cycles of PCR on the reverse-transcribed, size-selected, ligation products before doing the actual sequencing reactions.

                There seem to be too many duplicate reads (same start, same end). The reads also "clump": there are many reads whose start sites differ by just a few bases, each with many repetitions, followed by exonic regions with far less coverage. It seems to happen too often to be just misannotation of the gene model. So is it possible there are also "favored" start sites within a stretch of nucleotides? Or are these duplicates that somehow "lost" a base or two (either physically or informatically)?

                Here is an example. The height of the curve is coverage in number of reads.

                • #9
                  PCR bias found in the sequencing library can be attributed in part to the GC content of the sample; see Kozarewa et al., Nat Methods. 2009 Apr;6(4):291-5 for more info. I think Illumina are developing PCR-free library prep protocols, but it might be worth checking the GC content of the regions with lower coverage. If lower coverage correlates with AT-rich regions, it might be part of the answer you're looking for.
                  Have you contacted the owners of the dataset for their opinion? That would probably be the most informative.
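The GC check suggested above could be sketched roughly as follows. This assumes you already have the reference sequence of the region as a string and a per-base coverage list of the same length; the window size, the function names, and the toy data are all illustrative assumptions.

```python
def gc_fraction(seq):
    """Fraction of G/C among the A/C/G/T bases in seq (N and others ignored)."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    return (seq.count("G") + seq.count("C")) / acgt if acgt else 0.0


def windowed(values, size):
    """Split a string or list into consecutive non-overlapping windows."""
    return [values[i:i + size] for i in range(0, len(values) - size + 1, size)]


def pearson(xs, ys):
    """Plain Pearson correlation; returns NaN if either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / (var_x * var_y) ** 0.5


def gc_vs_coverage(seq, coverage, size=50):
    """Correlate per-window GC fraction with per-window mean coverage."""
    gc = [gc_fraction(w) for w in windowed(seq, size)]
    mean_cov = [sum(w) / size for w in windowed(coverage, size)]
    return pearson(gc, mean_cov)


# Toy region: GC-rich windows carry high coverage, AT-rich windows low coverage.
seq = "GC" * 10 + "AT" * 10 + "GC" * 10 + "AT" * 10
coverage = [30] * 20 + [5] * 20 + [28] * 20 + [4] * 20
r = gc_vs_coverage(seq, coverage, size=20)
print(f"GC vs coverage correlation: {r:.3f}")
```

A correlation near zero, as reported later in the thread for one gene, would argue against GC content being the main driver for that region, although a single gene gives only a handful of windows to work with.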


                  • #10
                    We are just getting started with the RNA-seq kits on Illumina GA, but I have seen stacks like these in ChIP-seq experiments. I think they are due to PCR amplification bias (some templates amplify many fold more than others). It can't be related to exon ends, since your starting material is fragments of cDNA (transcripts with exons spliced together).

                    We see this much more with less complex starting material (poor yield on the IP). We have also seen the slight offset of 1, 2, or 3 bases within the stacks, but there are always 2 parallel stacks ~200 bp apart (the length of the DNA fragments). 20 cycles of PCR in the sample prep would seem to encourage bias.


                    • #11
                      @elaney_k: Thanks for your suggestion! That paper was very interesting. I looked at the data in detail, at least for the reads aligning to the gene shown in my image, but, unfortunately, there was no noticeable correlation between GC content and the peaks visible in the figure.

                      On the other hand, everyone seems to be in agreement that PCR amplification is probably part of the cause. I wonder if the behavior I am seeing is a combination of the fragmentation procedure and the amplification. Perhaps certain breakage points are more common in a given transcript? If this were the case, then maybe we wouldn't see this if we did the PCR step before fragmentation. In the dataset I am looking at, metal hydrolysis was used for fragmentation, and amplification was done afterwards. The transcript in question is relatively highly expressed.
                      Last edited by behoward; 08-24-2009, 01:15 PM.


                      • #12
                        Though I also think PCR bias is likely the primary culprit here, you might also consider problems caused during read mapping. Depending on which mapping/alignment algorithm you are using (and the parameters set), low-complexity or highly orthologous sequences might be underrepresented in your final dataset. If your mapping algorithm only maps 'uniquely mappable reads' and/or suppresses alignments that map to multiple locations, you might see gaps in low-complexity or highly orthologous regions.

                        FWIW, I see similar 'towers' in my RNA-Seq data (prepped and sequenced with the Illumina protocol). The patterns are extremely reproducible, even across different biological replicates. Therefore, they don't seem to negatively affect RPKM values across different samples.
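For reference, RPKM as usually defined (reads per kilobase of transcript per million mapped reads) depends only on the total read count for the gene, not on how those reads are distributed along it. A small sketch, with the function name and the numbers chosen purely for illustration:

```python
def rpkm(gene_reads, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads.

    RPKM = 1e9 * gene_reads / (gene_length_bp * total_mapped_reads)
    """
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and total mapped reads must be positive")
    return 1e9 * gene_reads / (gene_length_bp * total_mapped_reads)


# Illustrative numbers only: 500 reads on a 2 kb transcript, 10 million mapped reads.
value = rpkm(500, 2000, 10_000_000)
print(value)
```

Because only the per-gene total enters the formula, a tower pattern that is the same in every sample would scale the counts for that gene consistently, which fits the observation that comparisons across samples still look sensible.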


                        • #13
                          I agree with stubrown. I have seen similar reads
                          Originally posted by stubrown:
                          the slight offset of 1, 2, or 3 bases within the stacks, but there are always 2 parallel stacks ~200 bp apart (length of DNA fragments).
                          with ChIP-seq where the input is low, but not with RNA-seq. Perhaps the yield of mRNA (assuming 2 rounds of polyA selection) was low going into the prep, which encouraged the PCR bias?


                          • #14
                            The genome has repeated sequences. So if you plot the "uniqueness" of the genome in an n-sized window, you'll find how mappable that region of the genome is for a set read length. It is important to keep in mind that this is a weakness of second-generation sequencing techniques. Try comparing your biased peaks with this track so you can determine whether the trend found in your data is due to the library and how it was amplified, or if it is something you would expect from the uniqueness of that particular region of the genome.
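A uniqueness track of this kind can be approximated very simply. The sketch below is a deliberate simplification: it only considers exact matches on the forward strand, whereas a real mappability track would also account for the reverse complement and the mismatches allowed by the aligner. The function names and the toy sequence are assumptions for illustration.

```python
from collections import Counter


def kmer_uniqueness(genome, k):
    """Return a 0/1 list: 1 where the k-mer starting at that position
    occurs exactly once in the sequence (forward strand, exact match only)."""
    genome = genome.upper()
    kmers = [genome[i:i + k] for i in range(len(genome) - k + 1)]
    counts = Counter(kmers)
    return [1 if counts[kmer] == 1 else 0 for kmer in kmers]


def window_mappability(track, window):
    """Mean uniqueness in consecutive non-overlapping windows."""
    return [
        sum(track[i:i + window]) / window
        for i in range(0, len(track) - window + 1, window)
    ]


# Toy sequence in which "ACGT" occurs twice, so reads of length 4
# starting at either copy cannot be placed uniquely.
genome = "ACGTACGTTTGCA"
track = kmer_uniqueness(genome, k=4)
print(track)
print(window_mappability(track, window=5))
```

Setting `k` to the read length and laying the windowed values next to the coverage plot shows whether the valleys between towers coincide with regions where reads cannot be placed uniquely.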


                            • #15
                              @griffon42: Thanks for your reply! It is very interesting that the tower patterns you see are highly reproducible. This seems to imply there is some consistent bias (at least, given the experimental protocol), rather than some random drift that gets amplified. This could perhaps be exploited by quantification algorithms.

                              @griffon42 and polivares: That is a good point about uniquely mappable reads, and that is, in general, an important issue to consider. However, in my case, I did not exclude reads that get mapped to more than one genomic location. All reads that align to a given gene are considered, even if those reads also map to other homologous genes. But I still see the towers.

                              I don't have a lot of experience with PCR bias. I mostly work on the bioinformatics end of things. If the observed towers are due to biased amplification, would we see the smooth transition from peaks to valleys? In other words, in the data, if there is an abundance of reads at position x, there generally also appears to be an abundance of reads starting at positions x+1, x+2, gradually tapering off...
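One rough way to look at this is to compare coverage before and after collapsing reads that share a start position. The sketch assumes 0-based start positions on a single strand and a fixed read length; the function name `coverage` and the toy starts are hypothetical.

```python
from collections import Counter


def coverage(starts, read_len, region_len):
    """Per-base coverage from read start positions (0-based, single strand)."""
    cov = [0] * region_len
    for start in starts:
        for pos in range(start, min(start + read_len, region_len)):
            cov[pos] += 1
    return cov


# Toy tower: many reads at position 10, tapering off at 11, 12, 13,
# plus one isolated read at position 40.
starts = [10] * 6 + [11] * 4 + [12] * 2 + [13] + [40]
start_counts = Counter(starts)

raw = coverage(starts, read_len=5, region_len=50)
collapsed = coverage(sorted(start_counts), read_len=5, region_len=50)

print("reads per start:", dict(start_counts))
print("peak coverage, all reads:", max(raw))
print("peak coverage, one read per start:", max(collapsed))
```

If a tower mostly disappears after collapsing, it is dominated by exact duplicates; if a smaller tower remains, there are genuinely several neighbouring start sites, which is the gradual tapering described above.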

                              Also, another (perhaps obvious) point: whatever bias there is, I don't think it's due to the primer, because the same primer/adaptor is used for all fragments.
