  • Can paired-end mapping produce more mapped reads than single-end?

    Hi,

    I am currently mapping 75 bp paired-end data to a cDNA library. I have heard that a large portion of the reads may be chimeras, so I have had to switch to single-end analysis.

    My results seem to show fewer hits in the single-end alignment than in the paired-end one (using bowtie; I counted mapped reads in both cases).

    Does anyone know if this is in fact possible, or is it more likely an error in the code?
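
When comparing the two runs, it helps to count mapped reads the same way in both. The following is a minimal sketch, assuming the aligner output is in SAM format and that each mate should be counted as its own read. The function name `count_mapped_reads` and the example records are hypothetical; only the FLAG bits (0x4 unmapped, 0x40/0x80 first/second mate, 0x100 secondary, 0x800 supplementary) come from the SAM format.

```python
def count_mapped_reads(sam_lines):
    """Count distinct mapped reads in SAM text, treating each mate as one read."""
    mapped = set()
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        name, flag = fields[0], int(fields[1])
        if flag & 0x4:
            continue  # read is unmapped
        if flag & 0x100 or flag & 0x800:
            continue  # secondary or supplementary alignment: not a new read
        # Distinguish mate 1 / mate 2 so a pair counts as two reads
        mate = 1 if flag & 0x40 else 2 if flag & 0x80 else 0
        mapped.add((name, mate))
    return len(mapped)


# Hypothetical records: one mapped pair, one unmapped pair, one single-end hit
example_sam = [
    "@HD\tVN:1.0",
    "frag1\t99\tchr1\t100\t255\t75M\t=\t250\t225\t*\t*",
    "frag1\t147\tchr1\t250\t255\t75M\t=\t100\t-225\t*\t*",
    "frag2\t77\t*\t0\t0\t*\t*\t0\t0\t*\t*",
    "frag2\t141\t*\t0\t0\t*\t*\t0\t0\t*\t*",
    "frag3\t0\tchr2\t500\t255\t75M\t*\t0\t0\t*\t*",
]
print(count_mapped_reads(example_sam))
```

Counting pairs in one run and individual reads in the other is one way to end up with numbers that cannot be compared directly.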

  • #2
    It's certainly possible. Most paired end mapping algorithms are able to use mapping information about one end to infer a likely position for the other, meaning that a read which in isolation couldn't be mapped uniquely can be positioned if the position of its other end is known.

    If you see a big difference in efficiency I'd also double check that you were using the same mapping parameters in both runs just to be sure. Using 75bp reads I'd be surprised if there was a big advantage to using paired end over single end mapping for a cDNA library.
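
The idea of using one end to place the other can be sketched roughly as follows. This is an illustration only, not how bowtie or any particular mapper implements it: it assumes each end already has a list of candidate positions, ignores strand and orientation, and the function name and separation limits are hypothetical.

```python
def consistent_pairs(cands_a, cands_b, min_sep, max_sep):
    """Return candidate placements where both ends lie on the same
    chromosome and their separation falls inside the expected range."""
    pairs = []
    for chrom_a, pos_a in cands_a:
        for chrom_b, pos_b in cands_b:
            if chrom_a == chrom_b and min_sep <= abs(pos_b - pos_a) <= max_sep:
                pairs.append(((chrom_a, pos_a), (chrom_b, pos_b)))
    return pairs


# End A maps uniquely; end B on its own is ambiguous (three candidates)
end_a = [("chr1", 10_000)]
end_b = [("chr1", 10_200), ("chr1", 500_000), ("chr5", 10_200)]

placements = consistent_pairs(end_a, end_b, min_sep=100, max_sep=400)
print(placements)
```

Here only one candidate for end B is compatible with end A, so the pair can be placed even though end B alone could not.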

    • #3
      Thanks for the reply! I had the same suspicion, and after more searching I have found the bug.

      Thanks again for the help!

      • #4
        Originally posted by simonandrews
        It's certainly possible. Most paired end mapping algorithms are able to use mapping information about one end to infer a likely position for the other, meaning that a read which in isolation couldn't be mapped uniquely can be positioned if the position of its other end is known.

        If you see a big difference in efficiency I'd also double check that you were using the same mapping parameters in both runs just to be sure. Using 75bp reads I'd be surprised if there was a big advantage to using paired end over single end mapping for a cDNA library.
        I would argue that the strategy above shows a common misconception about paired end data (or mate-end). For the human genome, inferring one end from the other does not return much, due to the local nature of repeats (think of insert distributions that are >500bp wide and how local repeats would confound placing the unaligned end). I have seen no data to show what inferring one end from the other does to false mappings, especially around large-scale insertion, deletion, and translocation events. At my current state of thinking, using paired end constraints during mapping is a heuristic to make up for the fact you did not map each end sensitively enough.

        • #5
          Originally posted by nilshomer
          At my current state of thinking, using paired end constraints during mapping is a heuristic to make up for the fact you did not map each end sensitively enough.
          I think that's an overgeneralisation. I don't believe that paired end mapping is a panacea, but there are certainly cases where it offers benefits in sensitivity over single end mapping.

          You say that using paired end is a way to make up for inadequate initial mapping, but with shorter read lengths there are plenty of reads which could map exactly, with no mismatches, at multiple locations in the genome. Even where these are within repeat regions, you can find that there are only a small number of locations where the read could map; using the paired end will then give you a mapped position where a single end would not.

          Having said that, I'd actually argue that for mapping-type applications (e.g. ChIP-Seq) the benefit of paired end comes from the separation between ends. Repeated regions can stretch over many tens of bases, so increasing the length of a single-end read provides diminishing returns in terms of mapping efficiency. Using shorter paired-end reads with a greater separation between the ends will in many cases offer a greater chance of positioning a read, since one end may have escaped the repetitive region.

          Having said all this, I personally would stick with single-end reads for ChIP-Seq, mRNA-Seq and similar applications from a cost point of view. We do a lot of paired-end sequencing at our site, but normally it's for applications which absolutely require it, such as 4C.

          • #6
            Can anybody help me define single-end and paired-end alignments? Thanks a lot.

            • #7
              Originally posted by simonandrews
              I think that's an overgeneralisation. I don't believe that paired end mapping is a panacea, but there are certainly cases where it offers benefits in sensitivity over single end mapping.

              You say that using paired end is a way to make up for inadequate initial mapping, but with shorter read lengths there are plenty of reads which could map exactly, with no mismatches, at multiple locations in the genome. Even where these are within repeat regions, you can find that there are only a small number of locations where the read could map; using the paired end will then give you a mapped position where a single end would not.

              Having said that, I'd actually argue that for mapping-type applications (e.g. ChIP-Seq) the benefit of paired end comes from the separation between ends. Repeated regions can stretch over many tens of bases, so increasing the length of a single-end read provides diminishing returns in terms of mapping efficiency. Using shorter paired-end reads with a greater separation between the ends will in many cases offer a greater chance of positioning a read, since one end may have escaped the repetitive region.

              Having said all this, I personally would stick with single-end reads for ChIP-Seq, mRNA-Seq and similar applications from a cost point of view. We do a lot of paired-end sequencing at our site, but normally it's for applications which absolutely require it, such as 4C.
              Your idea of using paired-end information is not flawed. My simple point is that a large fraction of repeats in humans occur locally, so that with 1 kb variability in insert sizes (see ABI SOLiD) the other end won't help. With most new technologies moving to longer reads (>50 bp), I don't foresee short reads for whole-genome sequencing remaining for long.

              • #8
                I'm a bit confused by the comment "with 1kb variability in insert sizes (see ABI SOLiD) the other end wont help". Perhaps it varies by who is preparing the library and how, but in the MoDIL paper their Illumina library had a fragment size distribution with a mean of 208 bp and a standard deviation of 13 bp, which is quite a tight distribution.

                While I would agree that a lot of a human-like genome is unlikely to resolve with paired-end mapping and current read lengths, for specific genes which may be of high interest (especially in an array/solution capture approach) this information can be critical. For example, if there are retrotransposed duplicates of your gene of interest, paired reads may enable distinguishing the two. This would happen when either (a) one read in the pair maps into unique sequence (an intron) of the original copy, or (b) the distance between the reads in a pair is distinctive because it implies either crossing or not crossing an intron.
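
Case (b) can be illustrated with a small sketch that scores each candidate placement of a pair by how far its implied fragment size lies from the library distribution. The mean of 208 and standard deviation of 13 are the MoDIL figures quoted above; the function names, the placement labels, and the implied sizes are made up for illustration.

```python
def insert_zscore(observed_size, mean=208.0, sd=13.0):
    """Number of standard deviations between an implied fragment size
    and the library mean."""
    return (observed_size - mean) / sd


def most_plausible(placements, mean=208.0, sd=13.0):
    """Pick the (label, implied_size) placement closest to the library mean."""
    return min(placements, key=lambda p: abs(insert_zscore(p[1], mean, sd)))


# Hypothetical pair: on the original gene copy the two reads would sit on
# either side of an intron; on the retrotransposed copy there is no intron.
candidates = [
    ("original copy (spans intron)", 1210),
    ("retrotransposed copy", 212),
]
best = most_plausible(candidates)
print(best, round(insert_zscore(best[1]), 2))
```

With a tight library like this one, an implied size of 1210 is dozens of standard deviations from the mean, so the intron-free copy is by far the more plausible origin of the pair.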

                • #9
                  Originally posted by montera
                  Can anybody help me define single-end and paired-end alignments? Thanks a lot.
                  When you create a library of fragments to sequence, some sequencing technologies offer the ability to sequence either from just one end of each fragment, or to get a pair of reads, one from each end. In many cases these paired-end reads won't meet up, so you won't have the complete fragment sequence, but you have two sequences which you know should be separated by a fairly small distance in your reference sequence.

                  Some mapping tools can use the connection between the two tags from the same fragment to aid them in mapping the sequences to a reference.

                  • #10
                    Although there are definitely many more reads mapped for single-end, when I look at the RPKM values, the paired-end data produces values 100-300 higher than single-end (I just browsed through the top few genes).

                    Can anyone take a stab at what the reasons for this are? Although fewer reads are mapped, does paired-end manage to give a more specific result in this case?
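
For reference, RPKM divides a gene's read count by both the gene length (in kb) and the total number of mapped reads (in millions), so the total mapped in each run directly changes the value. The sketch below uses made-up numbers purely to show that effect; it is not a diagnosis of the data in this thread.

```python
def rpkm(gene_reads, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    return gene_reads * 1e9 / (gene_length_bp * total_mapped_reads)


# Hypothetical gene: the same 500 reads on a 2 kb transcript,
# normalised against two different totals of mapped reads.
many_mapped = rpkm(500, 2000, 1_000_000)
fewer_mapped = rpkm(500, 2000, 500_000)
print(many_mapped, fewer_mapped)
```

If a gene keeps roughly the same count while the run as a whole maps fewer reads, its RPKM rises. Counting both mates of a pair as separate reads for the gene but not for the total would inflate it in the same direction.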

                    • #11
                      Originally posted by krobison
                      I'm a bit confused by the comment "with 1kb variability in insert sizes (see ABI SOLiD) the other end wont help". Perhaps it varies by who is preparing the library and how, but in the MoDIL paper their Illumina library had a fragment size distribution with a mean of 208 bp and a standard deviation of 13 bp, which is quite a tight distribution.

                      While I would agree that a lot of a human-like genome is unlikely to resolve with paired-end mapping and current read lengths, for specific genes which may be of high interest (especially in an array/solution capture approach) this information can be critical. For example, if there are retrotransposed duplicates of your gene of interest, paired reads may enable distinguishing the two. This would happen when either (a) one read in the pair maps into unique sequence (an intron) of the original copy, or (b) the distance between the reads in a pair is distinctive because it implies either crossing or not crossing an intron.
                      For a tight insert-size Illumina library, I would agree that paired ends could help. But for large insert-size ABI libraries, it is more ambiguous.

                      Nils

                      • #12
                        Nils, do you have any results showing this or are you just guessing? If you have one end derived from an Alu repeat, you would be comparing it to ~1 M copies for single end and perhaps 2-3 for paired ends with a 1 kb variability, so you should be able to find many more unique reads with paired ends.
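
The scale of that reduction can be checked with back-of-the-envelope arithmetic. This sketch assumes a genome of about 3.1 Gb and, unrealistically, that Alu copies are spread uniformly; the second assumption is exactly what Nils disputes, since local clustering of repeats would raise the count inside the window. The function name is hypothetical.

```python
def expected_copies_in_window(total_copies, window_bp, genome_bp):
    """Expected number of repeat copies falling inside a window,
    assuming copies are spread uniformly over the genome."""
    return total_copies * window_bp / genome_bp


GENOME_BP = 3.1e9      # assumed approximate human genome size
ALU_COPIES = 1_000_000 # "~1 M copies" from the post above
WINDOW_BP = 1_000      # "1 kb variability" from the post above

expected = expected_copies_in_window(ALU_COPIES, WINDOW_BP, GENOME_BP)
print(f"{expected:.2f} copies expected per 1 kb window")
```

Under uniformity the expectation is below one copy per window, so any figure of a few copies per window already reflects some clustering. The disagreement in this thread is about how strong that clustering is in practice.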

                        • #13
                          Originally posted by Chipper
                          Nils, do you have any results showing this or are you just guessing? If you have one end derived from an Alu repeat, you would be comparing it to ~1 M copies for single end and perhaps 2-3 for paired ends with a 1 kb variability, so you should be able to find many more unique reads with paired ends.
                          The data on which I am basing my conclusions come from actually trying this strategy in a version of my own mapping tool, BFAST. The discordance between my results and your expectation may come from the sensitive settings I use for mapping (up to 10% raw error). I am always open to incorporating this strategy, as it is trivial to implement. Nonetheless, I have neither measured myself nor seen reported how much this strategy increases false mappings in the cases where it is used (I assessed only sensitivity).

                          This might actually be a good time to rigorously put this debate to rest with some simulations. What if I create some paired-end data (from human) with error rates coming from our latest Illumina runs and check the false-mapping rates if this strategy is used? I can take those reads for which one end does not map and see how many I can recover, assessing both the sensitivity and the false-mapping rate. What do you think?
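
The evaluation step of such a simulation could look something like the sketch below. It is not BFAST code; it assumes the simulator records each read's true position and that the mapper's calls are available as a dictionary, with `None` for unmapped reads. All names and the example reads are hypothetical.

```python
def evaluate(truth, calls, tolerance=0):
    """Compare mapper calls against simulated truth.

    truth: read_id -> (chrom, pos) where the read was simulated from
    calls: read_id -> (chrom, pos), or None / missing if unmapped
    Returns (sensitivity, false_mapping_rate).
    """
    mapped = 0
    correct = 0
    for read_id, (true_chrom, true_pos) in truth.items():
        call = calls.get(read_id)
        if call is None:
            continue  # unmapped: lowers sensitivity, not a false mapping
        mapped += 1
        chrom, pos = call
        if chrom == true_chrom and abs(pos - true_pos) <= tolerance:
            correct += 1
    total = len(truth)
    sensitivity = correct / total if total else 0.0
    false_rate = (mapped - correct) / mapped if mapped else 0.0
    return sensitivity, false_rate


# Four simulated reads: two placed correctly, one misplaced, one unmapped
truth = {"r1": ("chr1", 100), "r2": ("chr1", 900),
         "r3": ("chr2", 50), "r4": ("chr3", 7)}
calls = {"r1": ("chr1", 100), "r2": ("chr1", 900),
         "r3": ("chr5", 50), "r4": None}

sens, fmr = evaluate(truth, calls)
print(sens, fmr)
```

Running this once with single-end calls and once with calls rescued through the pairing constraint would give the two numbers side by side.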

                          • #14
                            Hi,
                            I tried to map ~2 million Illumina reads to the Oryza sativa indica reference genome with its reference GTF file, using different versions of TopHat: 1.1.4, 1.3.0, 1.3.1, 1.3.2, 1.3.3 and the current one, 1.4.1.
                            I used the default options just to check whether the mapping statistics really get affected. As a result, I got the following stats:

                            Version         Reads used    Reads mapped
                            TopHat 1.1.4    2,000,000     227,554
                            TopHat 1.3.0    2,000,000     230,817
                            TopHat 1.3.1    2,000,000     231,935
                            TopHat 1.3.2    2,000,000     4,517
                            TopHat 1.3.3    2,000,000     231,935
                            TopHat 1.4.1    2,000,000     137,724

                            I wanted to know why the number of reads mapped varies between versions even though the same data is used. Secondly, why is there a drastic change in the mapping stats with versions 1.3.2 and 1.4.1 compared with the other versions? Can anybody please throw some light on this matter?
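
Expressed as a fraction of the input, the differences are easier to compare. This small sketch just converts the counts reported above into percentages; it assumes the original figures used Indian-style digit grouping (so "2,27,554" is read as 227554).

```python
# Reads mapped per TopHat version, as reported in the post above
READS_USED = 2_000_000
reads_mapped = {
    "1.1.4": 227_554,
    "1.3.0": 230_817,
    "1.3.1": 231_935,
    "1.3.2": 4_517,
    "1.3.3": 231_935,
    "1.4.1": 137_724,
}


def mapping_rates(mapped_by_version, total):
    """Percentage of input reads mapped, per version."""
    return {v: 100.0 * n / total for v, n in mapped_by_version.items()}


rates = mapping_rates(reads_mapped, READS_USED)
for version, pct in rates.items():
    print(f"TopHat {version}: {pct:.2f}% mapped")
```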
