  • pair-end sequencing produces single-end read artifact

    Dear all,
    I mapped paired-end RNA-seq reads to RefSeq transcripts. Looking at the pairs that mapped but were not properly (sensibly) paired, I found that about 75,000 of the 195,000 read pairs in this group have an insert length of 30-36 bp. A decent number of pairs have insert lengths down to 20 bp, but essentially no read pairs have an insert length >36 bp. Given that the read length is 36 bp and my expected insert length is 250 bp, these statistics suggest that the read pairs with ~36 bp inserts are generated by some artifact of paired-end sequencing. The first round of sequencing (mate 1) is fine, but the second round seems to take the 36-base sequence synthesized in the first round as its template and sequences by synthesizing the complementary ~36 bases. I am not sure why or how this would happen. Does anyone have any ideas? Note that my short-read data are of high quality, so this may just be a common artifact of Illumina paired-end sequencing.
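A minimal sketch of how such an insert-length tally could be produced, assuming each pair is available as the mapped start/end of both mates in 0-based, half-open transcript coordinates. The function names and the example coordinates are hypothetical, and "insert length" is taken here to mean the outer span of the pair.

```python
from collections import Counter


def outer_insert_length(start1, end1, start2, end2):
    """Outer span of a read pair in bp (0-based, half-open coordinates)."""
    return max(end1, end2) - min(start1, start2)


def insert_length_histogram(pairs):
    """Count how many pairs fall on each outer insert length."""
    return Counter(outer_insert_length(*pair) for pair in pairs)


# Made-up example pairs: (start1, end1, start2, end2)
pairs = [
    (100, 136, 100, 136),  # mates exactly on top of each other
    (100, 136, 104, 140),  # mates overlapping almost completely
    (100, 136, 314, 350),  # mates spanning the expected 250 bp
]

histogram = insert_length_histogram(pairs)
for length in sorted(histogram):
    print(length, histogram[length])
```

Plotting or printing this histogram separately for the properly paired and the anomalous pairs would make a pile-up near the read length easy to spot.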
    Pparg

  • #2
    Hello,
    Has anyone seen this before? Are there any thoughts? Thanks!


    • #3
      I've done PE vs RefSeq also, and my impression (not backed by numbers) is that the majority of anomalies are due to positional issues. I did not pursue it, since I was looking for something completely different in the data set, and because it was more or less what I was expecting.

      There are two parts to your question: why are the anomalous pairs restricted to inserts of 36 bp or less, and what is causing the small inserts?

      The first is caused by the algorithm. The pairing step estimates the average insert size, then places cut-offs a couple of standard deviations out. In your case, the low-end cut-off seems to be 36, which happens to coincide with the read length.
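The cut-off idea described above can be sketched roughly as follows. This is only an illustration of "mean plus or minus a couple of standard deviations"; the function names, the default of two standard deviations, and the sample insert lengths are assumptions, and a real aligner may estimate its window differently.

```python
from statistics import mean, stdev


def pairing_window(insert_lengths, n_sd=2.0):
    """Return (low, high) cut-offs at n_sd standard deviations around the mean."""
    centre = mean(insert_lengths)
    spread = stdev(insert_lengths)
    return centre - n_sd * spread, centre + n_sd * spread


def is_properly_paired(insert_length, window):
    """A pair counts as consistent if its insert length lies inside the window."""
    low, high = window
    return low <= insert_length <= high


# Made-up insert lengths centred on the expected 250 bp
observed = [230, 240, 250, 260, 270]
window = pairing_window(observed)

print(is_properly_paired(250, window))
print(is_properly_paired(36, window))
```

Any pair outside the window is reported as anomalous regardless of why it is there, so the window alone says nothing about the cause of the short inserts.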

      The "short" inserts are probably caused by a difference between RefSeq and the biological realities. For example, RefSeq does not contain all the possible isoforms generated by alternative splicing. I have also noticed that the amount of UTR reported for various isoforms varies, which can lead to odd insert sizes.

      I am pretty sure the sequencing hypothesis you're describing isn't very likely, since it A) would require the fragments to hang around through the cluster regen steps, which involve several washes; B) would also be present in genomic PE sequencing; and C) would result in read 2 being the read 1 adaptor.

      My suggestion is to put the reads into two bed files (translating the coordinates from RefSeq to genomic) and load them as separate tracks into the UCSC browser: one colored blue with only consistent pairs, and one colored red with only the anomalous pairs. I predict the anomalous pairs will cluster to a limited number of places, and that by examining them in relation to the gene model you will be able to figure out what is going on.
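A minimal sketch of the two-track idea, assuming the pairs have already been translated from RefSeq to genomic coordinates (that translation is not shown here). The helper name, track names, and intervals are hypothetical; the `track` line with a `color=R,G,B` attribute is the usual way to color a custom track in the UCSC browser.

```python
def bed_track(name, rgb, intervals):
    """Build a custom-track block: one track line plus one BED line per interval."""
    lines = [f'track name={name} description="{name}" color={rgb}']
    for chrom, start, end, label in intervals:
        # BED columns: chrom, 0-based start, end, feature name
        lines.append(f"{chrom}\t{start}\t{end}\t{label}")
    return "\n".join(lines) + "\n"


# Made-up genomic intervals spanning each read pair
consistent = [("chr1", 1000, 1250, "pair_0001")]
anomalous = [("chr1", 5000, 5036, "pair_0002")]

blue_track = bed_track("consistent_pairs", "0,0,255", consistent)
red_track = bed_track("anomalous_pairs", "255,0,0", anomalous)

print(blue_track + red_track, end="")
```

Each block can be written to its own file and uploaded as a separate custom track, or both can be concatenated into a single upload.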


      • #4
        Thanks a lot, dcjamison, for all the thoughts and suggestions.
        I don't think it is due to the cut-off setting. Although over 99% of the abnormal read pairs have an insert length <=36, a very small number of read pairs still have an insert length >36.
        It is also unlikely to be due to incomplete RefSeq data, because the insert length here covers the whole segment between the outer boundaries of the read pair. With a read length of 36, an insert length <=36 suggests that only the first mate has been re-sequenced from the other direction in the second round. I know my hypothesis is unlikely to be the true cause, but I can't think of any other reason that explains the data.
        Your last suggestion seems promising. I will try it out later.


        • #5
          By insert length, do you mean the distance from the rightmost position of mate 1 to the leftmost position of mate 2?
          Xi Wang


          • #6
            Yes, any thoughts?


            • #7
              Hi,

              If I understand your description, we have seen the same thing in paired-end DNA sequencing.

              We see a small percentage of pairs where the two reads are either on top of each other or have very short inserts (paired-end distances). Whether you can see or extract these pairs depends on the assembler you use. In our case we suspect they are simply an artifact of the gel extraction step, where a small percentage of short fragments end up in the extraction. I don't know if the same step is done in RNA-seq, so I may be totally wrong. For SNP/indel detection we now remove all pairs like this, as the overlapping pairs double the counts for any variants they carry.
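A rough sketch of the filtering step described above, assuming each pair is given as the mapped start/end of both mates on the same reference in 0-based, half-open coordinates. The function names and example pairs are hypothetical, and this is not the poster's actual pipeline.

```python
def mates_overlap(start1, end1, start2, end2):
    """True if the two mates share at least one reference base."""
    return start1 < end2 and start2 < end1


def drop_overlapping_pairs(pairs):
    """Keep only pairs whose mates do not overlap, so no base is counted twice."""
    return [pair for pair in pairs if not mates_overlap(*pair)]


# Made-up example pairs: (start1, end1, start2, end2)
pairs = [
    (100, 136, 100, 136),  # mates on top of each other
    (100, 136, 120, 156),  # partial overlap
    (100, 136, 314, 350),  # well separated
]

print(drop_overlapping_pairs(pairs))
```

Dropping the whole pair is the simplest choice; keeping one mate, or counting the overlapping bases once, would preserve more coverage at the cost of extra bookkeeping.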

              We also see low-frequency reads where segments of the read align inverted relative to each other. As with the above, these show a random and even, coverage-dependent distribution across our reference sequences, and we put these down to sequencing errors (internal priming?). I have seen these described in an RNA-seq paper as evidence for unique transcripts, but as they are inverted and seen all over the place I am personally doubtful about this. Again, I may be way off.

              I guess while we are on the subject of PE errors: when we map rearrangements by looking for extended inserts/read distances or backward-forward aligned pairs, we have found some libraries where we see more than usual, but again they are distributed evenly. Because in these libraries we see equal proportions of B-F and F-B pairs for each rearrangement locus, we put these down to ligation errors during library prep, i.e. some fragments are ligating to each other and creating false pairs. Does anyone have any idea whether this is a possibility? I'm pretty certain they are non-biological, whatever the cause.

              EDIT: We did some BLAST alignments of single reads and found the same B-F, F-B "mirrored" breakpoints, so I think we can rule out ligation errors. I have no idea what could be causing the problem, whether biological or experimental. Any ideas welcome.

              cheers,
              The_Roads
              Last edited by The_Roads; 06-08-2010, 11:38 AM. Reason: new evidence


              • #8
                Hello, The_Roads,
                Your idea of internal priming is very interesting. It seems to explain read pairs with insert length <36 very well. If this is the case, such internal priming should occur at the PCR step of library preparation rather than at the sequencing stage. One thing I am not sure about is whether such internal priming is feasible or common in reality.
                Ligation errors, if they occur, should lead to false 'rearrangement' calls. But if I understand correctly, the B-F and F-B pairs you mentioned are not likely due to ligation errors. I don't think a ligation error can generate a combined sequence where the two ends are B-F or F-B complementary to each other.
                Anyway, thanks a lot for your informative input!


                • #9
                  EST sequences, which are created by sequencing the ends of cDNA that is in turn generated using oligo-dT primers, show evidence of as much as 15-20% internal priming (http://bioinformatics.oxfordjournals...ull/21/18/3691). If your sequencing was done using cDNA generated with oligo-dT primers, I don't see why it couldn't be as prevalent.

                  One way to check for it is to look for the stretches of A's in the genome where the primer attached itself (instead of to the polyA tail). I wonder if you can use this to check for errors in your paired-end data.
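A minimal sketch of such a check, assuming a forward-strand transcript whose genomic sequence is available as a plain string. The function name, the 20 bp window, and the 80% threshold are illustrative assumptions, not values from the cited paper; for minus-strand transcripts one would look for T's upstream instead.

```python
def looks_internally_primed(genome, alignment_end, window=20, min_a_fraction=0.8):
    """Flag an alignment whose downstream genomic window is A-rich.

    genome: reference sequence as a string
    alignment_end: 0-based position just past the last aligned base
    """
    downstream = genome[alignment_end:alignment_end + window].upper()
    if not downstream:
        return False  # nothing to inspect at the end of the reference
    return downstream.count("A") / len(downstream) >= min_a_fraction


# Made-up reference: 20 mixed bases, a genomic A stretch, then more sequence
genome = "ACGT" * 5 + "A" * 20 + "CGCG"

print(looks_internally_primed(genome, 20))  # window falls on the A stretch
print(looks_internally_primed(genome, 0))   # window falls on mixed sequence
```

A genomic A stretch next to a read end only suggests internal priming; it does not prove it, so flagged pairs are best inspected rather than discarded blindly.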
                  Last edited by thinkRNA; 03-29-2010, 10:44 AM.


                  • #10
                    Hi PPARG,

                    Thanks for the reply. The internal priming I mentioned was meant as internal priming during sequencing; it was something suggested by an Illumina tech when we were discussing the problem reads. These errors turn up in our data at very low frequency (<1000 per 1.5M aligned reads) and are evenly distributed in all samples and across all reference sequences we've looked at while doing DNA sequencing. Hence I think they are sequencing errors.

                    The rearranged reads occur at similar frequencies, but they are not overlapping pairs. What I meant was that for the sequence ABCDEFG we see similar frequencies of pairs that align, for instance, C-E as E-C. In our real biological rearrangements we only ever see one form, i.e. only C-E. These suspected ligation errors seem to have happened in only a few DNA sequencing library preps and are always present at low frequency (~3/10,000 reads). I hope this makes things clearer.
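The C-E versus E-C comparison can be sketched as a simple tally per rearrangement locus. The segment labels follow the ABCDEFG example above; the function names, the example junctions, and the 25% balance threshold are assumptions made for illustration.

```python
from collections import defaultdict


def orientation_counts(junctions):
    """Count each junction in both orders, keyed by the sorted segment pair."""
    counts = defaultdict(lambda: [0, 0])
    for first, second in junctions:
        key = tuple(sorted((first, second)))
        index = 0 if (first, second) == key else 1
        counts[key][index] += 1
    return {key: tuple(value) for key, value in counts.items()}


def is_mirrored(count_pair, min_fraction=0.25):
    """True if both orders are well represented, as in the suspect libraries."""
    forward, reverse = count_pair
    total = forward + reverse
    return total > 0 and min(forward, reverse) / total >= min_fraction


# Made-up junctions: C-E seen in both orders, A-D seen in one order only
junctions = [("C", "E"), ("E", "C"), ("C", "E"), ("E", "C"),
             ("A", "D"), ("A", "D"), ("A", "D")]

for locus, pair in orientation_counts(junctions).items():
    print(locus, pair, is_mirrored(pair))
```

Under the reasoning in this post, a locus seen in both orders at similar frequency would be treated as suspect, while a locus seen in only one order would be a candidate for a real rearrangement.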

                    see edit above
                    Last edited by The_Roads; 06-08-2010, 11:40 AM.
