  • senpeng
    Member
    • Sep 2011
    • 10

    Overlapping and non-Overlapping pair-end reads with Tophat

    We recently tried to use TopHat to align our Illumina 2x100 bp paired-end data.

    It seems that in part of our data the mates overlap by 30 bp to 80 bp, while in the rest they do not overlap at all.

    Given a mix of paired-end reads in which some pairs overlap and some do not:
    ---> Can I set the inner distance (-r option) to -10 and the standard deviation of the inner distance (--mate-std-dev) to, for example, 70?
    ---> In that case, will TopHat look for inner distances from -80 to +60 (-80 meaning an overlap of 80 bp)?

    Thanks so much for your answer.
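
The arithmetic in the question can be checked with a short Python sketch. It assumes both mates have the same length and that the inner distance is the fragment length minus the two read lengths; the function names are made up for illustration. It only reproduces the mean plus or minus one standard deviation window from the question and says nothing about whether TopHat treats that window as a hard limit.

```python
from typing import Tuple


def inner_distance(fragment_len: int, read_len: int) -> int:
    """Gap between the inner ends of two mates; negative means they overlap."""
    return fragment_len - 2 * read_len


def one_sd_window(mean_inner: float, sd: float) -> Tuple[float, float]:
    """Interval covered by the mean plus or minus one standard deviation."""
    return (mean_inner - sd, mean_inner + sd)


# 2x100 bp reads that overlap by 80 bp come from a 120 bp fragment.
print(inner_distance(120, 100))   # -80
# 2x100 bp reads that overlap by 30 bp come from a 170 bp fragment.
print(inner_distance(170, 100))   # -30
# The values from the question: -r -10 and --mate-std-dev 70.
print(one_sd_window(-10, 70))     # (-80, 60)
```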
  • cedance
    Senior Member
    • Feb 2011
    • 108

    #2
    The way I go about figuring these parameters is this:
    I use BWA to align the paired-end reads and obtain a PE.sam file. From that, I use Picard tools to keep only the reads that aligned to the reference genome. From this SAM file, I calculate the inner distance of every pair. From the vector of all such inner distances, I compute the first and third quartiles, Q1 and Q3, and then the interquartile range IQ = Q3 - Q1. I then keep all the values that fall between Q1 - 2*IQ and Q3 + 2*IQ, compute the mean and standard deviation of those values (this is similar to what BWA does), and provide them to TopHat. It seems to work well: most of my reads map within the given inner distance and standard deviation, and only a few pairs have a larger inner distance. There will always be a few such pairs; I consider them outliers and exclude them when computing the mean and SD.

    Hope this helps.
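
A minimal Python sketch of the outlier-trimmed mean and SD described above, for anyone who wants to try it. It assumes the inner distances are already in a list, and it puts the upper fence at Q3 + k*IQR, the conventional form of this filter; the function name and the toy numbers are invented for illustration and are not real data.

```python
import statistics


def trimmed_mean_sd(inner_distances, k=2.0):
    """Mean and SD of the values inside [Q1 - k*IQR, Q3 + k*IQR]."""
    # statistics.quantiles with n=4 returns the three quartile cut points.
    q1, _, q3 = statistics.quantiles(inner_distances, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    # Drop anything outside the fences before summarising.
    kept = [d for d in inner_distances if low <= d <= high]
    return statistics.mean(kept), statistics.stdev(kept)


# Toy values only: nine plausible inner distances and one obvious outlier.
toy = [-60, -40, -30, -20, -10, 0, 10, 20, 30, 1000]
mean, sd = trimmed_mean_sd(toy)
print(round(mean), round(sd))  # candidate values for -r and --mate-std-dev
```

Rounding at the end is a convenience for passing whole numbers on the command line.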


    • upendra_35
      Senior Member
      • Apr 2010
      • 102

      #3
      I normally do it this way:

      I map the paired-end reads using Bowtie to obtain SAM files (Bowtie was chosen because of its speed). From the SAM files I extract the insert_length and insert_length_sd. To obtain the insert_length information from the SAM file I use the following Perl script (attached; the usage is quite simple). The output from this script is then read into R to calculate the mean and SD of the insert_length. You can then use this information for TopHat.

      perl:
      get_insert_sizes_from_sam.pl 300bp_pe_def.filt.sam > 300bp_pe_def.sizes

      R:
      sizes <- as.numeric(readLines("300bp_pe_def.sizes"))
      mean(sizes, na.rm=TRUE)
      median(sizes, na.rm=TRUE)
      sd(sizes, na.rm=TRUE)

      Hope this helps
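
The attached Perl script is not reproduced in the thread, so the following Python sketch is only a guess at what such an extraction step might look like. It assumes the insert size is taken from the TLEN column (field 9) of the SAM file and that only properly paired reads (flag bit 0x2) are wanted; the function name and the three in-memory records are made up for illustration.

```python
def template_lengths(sam_lines):
    """Yield TLEN once per properly paired read pair from SAM text lines."""
    for line in sam_lines:
        if line.startswith("@"):          # header line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:              # not a complete alignment record
            continue
        flag, tlen = int(fields[1]), int(fields[8])
        # 0x2 = properly paired; TLEN > 0 selects the leftmost mate only,
        # so each pair is counted once.
        if flag & 0x2 and tlen > 0:
            yield tlen


# Illustrative records: one proper pair (flags 99/147) and one unmapped read.
sample = [
    "@SQ\tSN:chr1\tLN:1000",
    "r1\t99\tchr1\t100\t60\t100M\t=\t170\t170\t*\t*",
    "r1\t147\tchr1\t170\t60\t100M\t=\t100\t-170\t*\t*",
    "r2\t77\t*\t0\t0\t*\t*\t0\t0\t*\t*",
]
print(list(template_lengths(sample)))  # [170]
```

On a real file the same generator can be fed an open file handle, for example `template_lengths(open("300bp_pe_def.filt.sam"))`. Note that TLEN spans both reads, so the two read lengths still have to be subtracted to get the inner distance that the -r option expects.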

      • senpeng
        Member
        • Sep 2011
        • 10

        #4
        Thanks for your answers, cedance & upendra_35. Does that mean I have to run Bowtie or BWA first to get these numbers? I tried different inner distances on a (small, of course) test data set and didn't see much difference. How exactly does this parameter affect the final result?
        For example, if I set -r to 50 and the standard deviation to 0, does that mean the program will not consider pairs whose inner distance is greater than 50?
        Thanks again for your attention.


        • upendra_35
          Senior Member
          • Apr 2010
          • 102

          #5
          Unfortunately, you have to run either BWA or Bowtie to get the insert_length and insert_length SD for TopHat. If I am correct, these parameters are quite important for mapping the reads correctly onto the reference. If you had tried a few more insert_lengths, you would have seen the difference. The best way to compare the different lengths is to run samtools flagstat on the BAM files and see how many reads mapped versus how many did not.
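
One way to make that comparison less manual is to parse the flagstat text, as in the Python sketch below. It assumes the usual flagstat layout, with one line ending in "in total" and one "mapped" line, each starting with the QC-passed count; the function name is hypothetical and the sample text uses made-up numbers rather than output from a real run.

```python
import re


def mapped_and_total(flagstat_text):
    """Return (mapped, total) QC-passed read counts from flagstat output."""
    total = re.search(r"^(\d+) \+ \d+ in total", flagstat_text, re.M)
    # Anchoring at the line start skips lines such as "primary mapped".
    mapped = re.search(r"^(\d+) \+ \d+ mapped", flagstat_text, re.M)
    if total is None or mapped is None:
        raise ValueError("unrecognised flagstat output")
    return int(mapped.group(1)), int(total.group(1))


# Made-up example text in the flagstat layout, not real results.
example = (
    "1000 + 0 in total (QC-passed reads + QC-failed reads)\n"
    "950 + 0 mapped (95.00% : N/A)\n"
)
mapped, total = mapped_and_total(example)
print(mapped / total)  # 0.95
```

In practice the text would come from running `samtools flagstat` on the BAM file of each TopHat run, and the fractions for the different -r settings can then be compared side by side.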
