  • Estimating Paired insert size with Picard

    Hi all, I have paired-end reads (forward and reverse), 150 bp each.
    I used Picard tools to estimate the insert size.
    The output file reports a different median insert size for each pair orientation (FR and RF):

    MEDIAN_INSERT_SIZE
    185 ... FR
    133 ... RF

    My understanding is that the Mean Inner Distance between Mate Pairs is: mode - 2*read length.

    In this case, however, which number should I use, the FR or the RF value? They give very different results.
    I first used tophat2 to generate the BAM file, then ran Picard with the following command:

    java -Xmx2g -jar pathtopicard\CollectInsertSizeMetrics.jar INPUT=accepted_hits.bam OUTPUT=size.txt

    Thanks.
    Last edited by Alex Lee; 08-30-2014, 08:30 PM.

  • #2
    Unfortunately, both of those are wrong: Picard is reporting one median for pairs with an insert size over 150bp and another for pairs under 150bp. It looks like your insert size may be pretty close to 150bp, but it's hard to say for sure.

    You can get the correct insert size, and an insert size histogram, with BBMerge (which uses overlap, and thus does not require a reference) or BBMap (which uses a reference). It only takes a few hundred thousand reads to get a good estimate; for example (in Linux/bash):

    bbmerge.sh in1=r1.fq in2=r2.fq ihist=ihist_merge.txt reads=400000

    or
    bbmap.sh -Xmx24g in1=r1.fq in2=r2.fq ihist=ihist_map.txt reads=400000 ref=ref.fa

    The command would be different on another OS such as Windows, though, so let me know if you encounter any trouble. As for the equation "median_insert_size - 2*read length", that calculates the unsequenced portion in the middle of the fragment, which is not the insert size. The insert size includes both reads and the unsequenced part, if any.
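To make the distinction concrete, here is a minimal Python sketch of the relationship described above. The function name and the example insert sizes are hypothetical and chosen only for illustration; they are not values from this thread.

```python
def inner_distance(insert_size: int, read_length: int) -> int:
    """Unsequenced gap between the two mates of a pair.

    insert size = read 1 + unsequenced middle + read 2, so the inner
    distance is the insert size minus both read lengths. A negative
    result means the two reads overlap by that many bases.
    """
    return insert_size - 2 * read_length


# Hypothetical insert sizes for 150 bp reads (illustration only).
for insert in (500, 300, 200):
    print(insert, inner_distance(insert, read_length=150))
```

With 150 bp reads, any insert shorter than 300 bp gives a negative inner distance, which is simply another way of saying the mates overlap.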


    • #3
      Wow, thanks Brian. BBMerge is more than that. Sorry for my mistake; what I meant was to calculate the "Mean Inner Distance between Mate Pairs". A benefit right off is that I see it's written in Java, so it can possibly run on Windows. I tried it on Windows 7 but got an error, so I had to do this on Linux. The result was mode: 127; I'm going to realign with tophat using this setting. Oh, and you were right about it being close to 150. Not sure how you figured that out, but awesome all around. Thanks again.
      Last edited by Alex Lee; 08-30-2014, 08:35 PM.


      • #4
        Alex,

        Since most of your inserts appear to be shorter than the read length, you will have substantial adapter contamination. I recommend removing the adapters before mapping (with BBDuk, for example), which will greatly increase the mapping rate and accuracy. The BBTools package includes the TruSeq adapters (bbmap/resources/truseq.fa), but it's possible some other kind of adapters were used, so I suggest you find out first.

        Also, it's possible that the mode at 127bp was an artifact peak; you may want to use the median (reported as 50th percentile) or average instead of the mode. This will be obvious if you graph the data as a scatterplot in Excel - either the peak at 127bp will be super-sharp, or the middle of a broad peak. If it is super-sharp, you should find out what the 127bp reads are and remove them. The difference in Tophat results from a 20bp difference in estimated insert size will probably be very small, though.
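As an alternative to graphing in Excel, the same check can be sketched in Python. This is only a rough sketch under stated assumptions: it assumes the histogram file has one "insert_size<TAB>count" row per line with header lines starting with "#", and the `spike_ratio` measure (mode count divided by the average count of its two neighbouring sizes) is a made-up heuristic for illustration, not something BBTools reports. The toy data below is invented.

```python
from typing import Dict, Iterable


def parse_ihist(lines: Iterable[str]) -> Dict[int, int]:
    """Parse 'insert_size<TAB>count' rows, skipping blank and '#' header lines."""
    hist: Dict[int, int] = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        size, count = line.split()[:2]
        hist[int(size)] = int(count)
    return hist


def summarize(hist: Dict[int, int]) -> Dict[str, float]:
    """Return mode, median, mean, and a crude sharpness measure for the mode."""
    total = sum(hist.values())
    if total == 0:
        raise ValueError("histogram is empty")
    mean = sum(size * count for size, count in hist.items()) / total
    mode = max(hist, key=hist.get)
    # Median: smallest size whose cumulative count reaches half of all pairs.
    running = 0
    median = mode
    for size in sorted(hist):
        running += hist[size]
        if running * 2 >= total:
            median = size
            break
    # Hypothetical heuristic: compare the mode with its immediate neighbours.
    neighbour_avg = (hist.get(mode - 1, 0) + hist.get(mode + 1, 0)) / 2
    spike_ratio = hist[mode] / neighbour_avg if neighbour_avg else float("inf")
    return {"mode": mode, "median": median, "mean": mean, "spike_ratio": spike_ratio}


# Invented toy data: a narrow spike at 127 on top of a broad peak near 150.
toy = """#InsertSize\tCount
126\t5
127\t50
128\t5
140\t20
145\t30
150\t40
155\t30
160\t20
"""
stats = summarize(parse_ihist(toy.splitlines()))
print(stats)
```

In this toy data the mode sits on the narrow spike while the median and mean sit in the broad peak, which is the situation where the median or average is the safer estimate. To use it on a real file, pass an open file handle to `parse_ihist` instead of the toy string.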

        If you want to run these tools on Windows, you can replace "bbmerge.sh" with "java -ea -Xmx2g -cp path_to_bbmap/current/ jgi.BBMerge", or "bbduk.sh" with "java -ea -Xmx2g -cp path_to_bbmap/current/ jgi.BBDukF".
