SEQanswers

08-30-2014, 05:59 PM   #1
Alex Lee
Junior Member
 
Location: N. Cal

Join Date: Apr 2014
Posts: 10
Estimating Paired insert size with Picard

Hi all, I have paired-end reads, forward and reverse, of 150 bp each.
I used the Picard tools to estimate the insert size.
The output file reports a different median insert size for each pair orientation:

MEDIAN_INSERT_SIZE
185 ... FR
133 ... RF

My understanding is that the Mean Inner Distance between Mate Pairs is: mode - 2*read length.

In this case, however, which number should I use, the FR or the RF? They give very different results.
- I used tophat2 to generate the BAM file first, then ran Picard with the following command:

java -Xmx2g -jar pathtopicard\CollectInsertSizeMetrics.jar INPUT=accepted_hits.bam OUTPUT=size.txt

Thanks.

Last edited by Alex Lee; 08-30-2014 at 09:30 PM.
08-30-2014, 08:24 PM   #2
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707

Unfortunately, both of those are wrong, since Picard is providing one median for pairs with insert size over 150bp and one for pairs with insert size under 150bp. It looks like your insert size may be pretty close to 150bp, but it's hard to say for sure.
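To illustrate why a median taken on each side of the read length is misleading, here is a minimal Python sketch. The insert sizes are made-up values, not data from this thread, and the split at 150bp only mimics the behavior described above; it is not Picard's actual code.

```python
from statistics import median

READ_LENGTH = 150  # read length from the original post

# Hypothetical insert sizes (bp), invented purely for illustration.
insert_sizes = [110, 120, 125, 130, 135, 140, 145, 160, 180, 190, 210]

# Split the pairs at the read length, as the two reported medians appear to be.
shorter = [s for s in insert_sizes if s < READ_LENGTH]
longer = [s for s in insert_sizes if s >= READ_LENGTH]

median_short = median(shorter)
median_long = median(longer)
median_all = median(insert_sizes)

print("median below read length:", median_short)
print("median at/above read length:", median_long)
print("median of all pairs:", median_all)
```

Neither per-group median matches the median of the whole library, which is the number you actually want.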

You can get the correct insert size, and an insert size histogram, with BBMerge (which uses overlap, and thus does not require a reference) or BBMap (which uses a reference). It only takes a few hundred thousand reads to get a good estimate; for example (in Linux/bash):

bbmerge.sh in1=r1.fq in2=r2.fq ihist=ihist_merge.txt reads=400000

or
bbmap.sh -Xmx24g in1=r1.fq in2=r2.fq ihist=ihist_map.txt reads=400000 ref=ref.fa

The command would be different on a different OS like Windows, though, so let me know if you encounter any trouble. As for the equation "median_insert_size - 2*read length", that's for calculating the length of the unsequenced portion in the middle (the inner distance), which is not the insert size. The insert size includes both reads and the unsequenced part, if any.
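As a small worked example of that relationship, the sketch below computes the inner distance from an insert size and a read length. The function name `inner_distance` and the example insert sizes are hypothetical, chosen only for illustration; the 150bp read length is from the original post.

```python
def inner_distance(insert_size: int, read_length: int) -> int:
    """Length of the unsequenced middle; a negative value means the mates overlap."""
    return insert_size - 2 * read_length


READ_LENGTH = 150  # read length from the original post

# Illustrative insert sizes: longer than, equal to, and shorter than two reads.
for size in (400, 300, 130):
    gap = inner_distance(size, READ_LENGTH)
    status = "mates overlap" if gap < 0 else "no overlap"
    print(f"insert={size}bp inner_distance={gap}bp ({status})")
```

When the insert is shorter than twice the read length, the result is negative, which simply means the two reads overlap.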
08-30-2014, 09:27 PM   #3
Alex Lee
Junior Member
 
Location: N. Cal

Join Date: Apr 2014
Posts: 10

Wow, thanks Brian - BBMerge is more than that. Sorry for my mistake; what I meant was to calculate the "Mean Inner Distance between Mate Pairs". One benefit right off is that I see it's written in Java, so it can possibly run on Windows. I tried it on Windows 7 but got an error, so I had to do this on Linux. The result was mode: 127; I'm going to realign with tophat using this setting. Oh, and you were right about it being close to 150 - not sure how you figured that out, but awesome all around. Thanks again.

Last edited by Alex Lee; 08-30-2014 at 09:35 PM.
08-31-2014, 09:51 AM   #4
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707

Alex,

Since most of your inserts appear to be shorter than read length, you will have substantial adapter contamination. I recommend removing the adapters before mapping (with, for example, BBDuk), which will greatly increase the mapping rate and accuracy. The BBTools package includes the TruSeq adapters (bbmap/resources/truseq.fa), but it's possible some other kind of adapters were used, so I suggest you find out first.

Also, it's possible that the mode at 127bp was an artifact peak; you may want to use the median (reported as 50th percentile) or average instead of the mode. This will be obvious if you graph the data as a scatterplot in Excel - either the peak at 127bp will be super-sharp, or the middle of a broad peak. If it is super-sharp, you should find out what the 127bp reads are and remove them. The difference in Tophat results from a 20bp difference in estimated insert size will probably be very small, though.
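As an alternative to eyeballing the plot, here is a rough Python sketch that reads a histogram and reports the mode, median and mean, plus a crude check for a sharp spike. It assumes a two-column layout (insert size, count) with '#' header lines; the sample counts are invented, and `read_histogram`, `summarize` and the `spike_ratio` heuristic are hypothetical names, not part of BBTools.

```python
import io

# Hypothetical histogram in an assumed layout: '#' header lines, then
# tab-separated insert size and count. The counts are invented.
SAMPLE = """#InsertSize\tCount
120\t5
126\t10
127\t80
128\t10
140\t20
141\t25
142\t30
143\t25
144\t20
"""


def read_histogram(handle):
    """Return a sorted list of (insert_size, count), skipping '#' and blank lines."""
    rows = []
    for line in handle:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        size, count = line.split()[:2]
        rows.append((int(size), int(count)))
    return sorted(rows)


def summarize(rows):
    """Compute mode, median, mean, and how far the mode towers over its neighbors."""
    total = sum(count for _, count in rows)
    mean = sum(size * count for size, count in rows) / total
    mode_size, mode_count = max(rows, key=lambda row: row[1])

    # Median: the first insert size at which the running count reaches half the pairs.
    running = 0
    median = None
    for size, count in rows:
        running += count
        if running * 2 >= total:
            median = size
            break

    # Crude sharpness heuristic: mode count divided by the mean count of the
    # two adjacent insert sizes (a missing neighbor counts as zero).
    counts = dict(rows)
    neighbor_mean = (counts.get(mode_size - 1, 0) + counts.get(mode_size + 1, 0)) / 2
    spike_ratio = mode_count / neighbor_mean if neighbor_mean else float("inf")

    return {"mode": mode_size, "median": median, "mean": mean, "spike_ratio": spike_ratio}


stats = summarize(read_histogram(io.StringIO(SAMPLE)))
print(stats)
```

To use it on a real file, replace `io.StringIO(SAMPLE)` with `open("ihist_merge.txt")`. A large `spike_ratio` suggests the mode is a narrow artifact peak, in which case the median or mean is the safer estimate.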

If you want to run these tools in Windows, you can replace "bbmerge.sh" with "java -ea -Xmx2g -cp path_to_bbmap/current/ jgi.BBMerge" or "bbduk.sh" with "java -ea -Xmx2g -cp path_to_bbmap/current/ jgi.BBDukF".