SEQanswers

10-01-2010, 08:26 AM   #1
epigen (Senior Member, Germany)
BFAST for SOLiD paired end reads

I'm using BFAST with BWA (version 0.6.4e) on SOLiD paired-end reads (read1: 50 bp, read2: 35 bp). So far, these steps work nicely (thanks to help from the BFAST team) and are quite fast:
find CALs separately for the 50 bp ends:
bfast match -n 4 -t -f $REF -A 1 -z -T $TEMPDIR -r reads_r1.fastq > reads_r1.bmf

and for the 35 bp ends:
bfast bwaaln -c $REF reads_r2.fastq > reads_r2.bmf

bring them together:
bfast localalign -f $REF -1 reads_r1.bmf -2 reads_r2.bmf -A 1 -t -U -n 8 > reads.baf

But the postprocess step, which finished in a few minutes for single-end data, can take > 100 hours on 16 CPUs for 50 Mio read pairs:
bfast postprocess -f $REF -i reads.baf -a 3 -A 1 -R -z -t -n 16 > reads.sam
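For completeness, the whole chain stitched together looks roughly like this (only a sketch; $REF, $TEMPDIR and the fastq names are placeholders for my actual reference, scratch directory and input files):

#!/bin/bash
set -e
REF=/path/to/reference.fa      # reference prepared beforehand with bfast fasta2brg and bfast index
TEMPDIR=/path/to/scratch
# 1) CALs for the 50 bp ends
bfast match -n 4 -t -f $REF -A 1 -z -T $TEMPDIR -r reads_r1.fastq > reads_r1.bmf
# 2) CALs for the 35 bp ends via the bundled bwa
bfast bwaaln -c $REF reads_r2.fastq > reads_r2.bmf
# 3) pair the ends and run the local alignment
bfast localalign -f $REF -1 reads_r1.bmf -2 reads_r2.bmf -A 1 -t -U -n 8 > reads.baf
# 4) pick the alignments to report and write SAM (the slow step)
bfast postprocess -f $REF -i reads.baf -a 3 -A 1 -R -z -t -n 16 > reads.sam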

Also, the reported values seem quite strange to me. Often there are negative means and large standard deviations, e.g.
*********************************************************
Estimating paired end distance...
Used 7438 paired end distances to infer the insert size distribution.
The paired end distance range was from -10240 to 2847.
The paired end distance mean and standard deviation was -5413.80 and 4928.88.
The inversion ratio was 0.999866 (7437 / 7438).
Reads processed: 2700000
*********************************************************
Estimating paired end distance...
Used 9477 paired end distances to infer the insert size distribution.
The paired end distance range was from -4985 to 1899.
The paired end distance mean and standard deviation was -1038.24 and 2097.46.
The inversion ratio was 0.999894 (9476 / 9477).
Reads processed: 2750000
*********************************************************
*********************************************************
Estimating paired end distance...
Used 9925 paired end distances to infer the insert size distribution.
The paired end distance range was from -17 to 7508.
The paired end distance mean and standard deviation was 110.48 and 79.03.
The inversion ratio was 1.000000 (9925 / 9925).
Reads processed: 2800000
*********************************************************

If I use -g for gapped rescue, it is even slower. (By the way, where can I find documentation on how gapped rescue works?)
For the whole data set, ABI BioScope could map 34% as proper pairs, 42% of the reads were unmapped, and it reported an insert range of 94-206. I split the data set for BFAST, and of the two parts that have finished, at least 60% of reads mapped but <20% as proper pairs.
Am I doing something wrong? Or might it be because of bad read quality? Any help will be very much appreciated.

Barbara
epigen is offline   Reply With Quote
Old 10-01-2010, 09:02 AM   #2
nilshomer
Nils Homer
 
nilshomer's Avatar
 
Location: Boston, MA, USA

Join Date: Nov 2008
Posts: 1,285
Default

Deep breath, you're doing nothing wrong.

I am not sure (yet) how much read-rescue helps on this type of data. I released this feature early, even though it might not add value and can considerably slow down the process. You can turn it off if it is not helping.
10-07-2010, 05:39 AM   #3
epigen (Senior Member, Germany)
negative insert sizes

Thanks Nils.
Now I have found the bfast git version with additional parameters and run:
bfast postprocess -f $REF -i reads.baf -a 3 -A 1 -R -z -v 160 -s 20 -S 4.0 > reads.sam
This results in a tremendous speedup compared to version 0.6.4e with options -a 3 -A 1 -R -z (more than 20 times faster) and even finds more pairs - definitely an improvement! However, I'm confused that the reported insert sizes are all negative.

Example:
2221_1132_511 99 chrX 141904000 255 50M = 141904140 -106 ATTTATCATGATTAACACCATTGTCTTCATTGTATATTTTCTAAGCTGCT ````````````````````````````````````````````````^; PG:Z:bfast AS:i:2500 MQ:i:255 NM:i:0 NH:i:1 IH:i:1 HI:i:1 MD:Z:50 CS:Z:T33003321312303011101301122021301133330002230232132 CQ:Z:BBBBBB@A?B5BBABA@@@BA@A>BBB?;BB=B=BAB=??@?B9@@A?:; CM:i:0 XA:i:3 XE:Z:--------------------------------------------------
2221_1132_511 147 chrX 141904140 255 35M = 141904000 -106 ACCAGAAGCGTCTCTGATTCTGGGTGAGCAGTGAC 9W0")""^]`)"XX_()`_``^UZ`53```````` PG:Z:bfast AS:i:1000 MQ:i:255 NM:i:0 NH:i:0 IH:i:1 HI:i:1 MD:Z:35 CS:Z:T11211213211100122031122211332121301 CQ:Z:BB?A?BBB@.B?2:;B9=@8?7879A7=8>1)&59 CM:i:6 XA:i:3 XE:Z:--31-1----1----1---------1---------

Result from version 0.6.4e:
2221_1132_511 99 chrX 141904000 255 50M = 141904140 140 ATTTATCATGATTAACACCATTGTCTTCATTGTATATTTTCTAAGCTGCT ````````````````````````````````````````````````^; PG:Z:bfast AS:i:2500 MQ:i:255 NM:i:0 NH:i:1 IH:i:1 HI:i:1 MD:Z:50 CS:Z:T33003321312303011101301122021301133330002230232132 CQ:Z:BBBBBB@A?B5BBABA@@@BA@A>BBB?;BB=B=BAB=??@?B9@@A?:; CM:i:0 XA:i:3 XE:Z:--------------------------------------------------
2221_1132_511 147 chrX 141904140 255 35M = 141904000 -140 ACCAGAAGCGTCTCTGATTCTGGGTGAGCAGTGAC 9W0")""^]`)"XX_()`_``^UZ`53```````` PG:Z:bfast AS:i:1000 MQ:i:255 NM:i:0 NH:i:0 IH:i:1 HI:i:1 MD:Z:35 CS:Z:T11211213211100122031122211332121301 CQ:Z:BB?A?BBB@.B?2:;B9=@8?7879A7=8>1)&59 CM:i:6 XA:i:3 XE:Z:--31-1----1----1---------1---------

Obviously postprocess now calculates the isize differently than before, but for consistency one read of the pair should still get a positive value. samtools flagstat does not complain, but other tools might.
Edit: I just noticed that some reads labeled as "proper pairs" (flag 67) have insert sizes of up to 200 Mio bp. How come?!
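A quick way to quantify both issues on the postprocess output (a rough grep/awk sketch, nothing official; reads.sam is the file from above, the column numbers follow the SAM spec, and the 10 kb cutoff is arbitrary):

# count records whose TLEN sign disagrees with the SAM convention
# (the leftmost mate of a pair should carry the positive value),
# plus records with an absurdly large |TLEN|
grep -v '^@' reads.sam | awk '
  $7 == "=" && $9 != 0 {
    n++
    if (($4 < $8 && $9 < 0) || ($4 > $8 && $9 > 0)) badsign++
    if ($9 > 10000 || $9 < -10000) huge++
  }
  END { printf "checked: %d  wrong TLEN sign: %d  |TLEN| > 10 kb: %d\n", n, badsign, huge }'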

Looking forward to bfast 0.6.4f (or will there be a bfast v1.0?)

Barbara

Last edited by epigen; 10-07-2010 at 08:58 AM. Reason: detected weird proper pairs
10-07-2010, 10:47 PM   #4
nilshomer (Nils Homer, Boston, MA, USA)

Thank you for reporting it; I will take a look, since it is probably a bug. I am on vacation for a few days for [Canadian] Thanksgiving!

As for the 1.0 release, it's like Google software (for the most part): always in beta.
11-02-2010, 02:06 AM   #5
epigen (Senior Member, Germany)
insert size bug

Sorry to ask again, but have you found the bug(s)? Especially the one causing "properly paired" reads with insert sizes of several hundred Mio bp seems odd, considering the lengths of the human chromosomes ...
11-02-2010, 10:09 AM   #6
nilshomer (Nils Homer, Boston, MA, USA)

Quote:
Originally Posted by epigen
Sorry to ask again, but have you found the bug(s)? Especially the one causing "properly paired" reads with insert sizes of several hundred Mio bp seems odd, considering the lengths of the human chromosomes ...
Sorry, I have not had the chance to take a look at it. Could you send the report to bfast-help@lists.sourceforge.net? Sorry again.
12-15-2010, 11:21 PM   #7
Jerry-cs (Junior Member, Vancouver)

I have the same problem. Below is the alignment result:

20000000 in total
0 QC failure
0 duplicates
8331480 mapped (41.66%)
20000000 paired in sequencing
10000000 read1
10000000 read2
3581888 properly paired (17.91%)
3833680 with itself and mate mapped
4497800 singletons (22.49%)
241894 with mate mapped to a different chr
221791 with mate mapped to a different chr (mapQ>=5)

Because I'm new to NGS, I'm wondering if the above results are reasonable. Thank you.
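(These numbers look like samtools flagstat output; if so, they can be regenerated from the BAM with the command below, where aligned.bam is just a placeholder name.)

samtools flagstat aligned.bam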


Quote:
Originally Posted by epigen
For the whole data set, ABI BioScope could map 34% as proper pairs, 42% of the reads were unmapped, and it reported an insert range of 94-206. I split the data set for BFAST, and of the two parts that have finished, at least 60% of reads mapped but <20% as proper pairs.
Am I doing something wrong? Or might it be because of bad read quality? Any help will be very much appreciated.

Barbara
12-16-2010, 05:27 AM   #8
drio (Senior Member, 41°17'49"N / 2°4'42"E)

@jerry-cs Is this data from a 50+25 PE run? These are only 5 million reads; do you see those numbers across the whole run? bfast does not do very well with 25 bp reads. That's what the bfast+bwa branch (git) is for.
-drd

12-16-2010, 07:15 AM   #9
Jerry-cs (Junior Member, Vancouver)

Hi Drio, thank you. The reads are from a 50+35 PE run (the F3 reads are 50 bases long, while the F5-P2 reads are 30 bases long). In addition, the localalign step also takes a lot of time.

Quote:
Originally Posted by drio
@jerry-cs Is this data from a 50+25 PE run? These are only 5 million reads; do you see those numbers across the whole run? bfast does not do very well with 25 bp reads. That's what the bfast+bwa branch (git) is for.
03-17-2011, 05:52 AM   #10
sdvie (Member, Spain)
a related question

Dear all using bfast bwaaln,

I have a related question; in fact, I am running an experiment similar to the one that epigen described, that is, paired-end SOLiD reads of 50+25 length.

For bfast match, I split the 50 bp reads into several files containing 10 Mio reads each, as recommended.
However, I am not sure whether I can do the same with the 25 bp reads for bfast bwaaln: run bfast bwaaln separately for each chunk of reads and put them all together again in the localalign step?
When using one single fastq file containing all the reads (105 Mio), this step seems to take a long time (2 days and still running).

Any comments will be appreciated.

Thanks,
Sophia
03-17-2011, 07:07 AM   #11
epigen (Senior Member, Germany)

Quote:
Originally Posted by sdvie
Dear all using bfast bwaaln,

I have a related question; in fact, I am running an experiment similar to the one that epigen described, that is, paired-end SOLiD reads of 50+25 length.

For bfast match, I split the 50 bp reads into several files containing 10 Mio reads each, as recommended.
However, I am not sure whether I can do the same with the 25 bp reads for bfast bwaaln: run bfast bwaaln separately for each chunk of reads and put them all together again in the localalign step?
When using one single fastq file containing all the reads (105 Mio), this step seems to take a long time (2 days and still running).

Any comments will be appreciated.

Thanks,
Sophia
I also split the short reads for bwaaln, then feed the resulting shortreads.<nr>.bmf file and the longreads.<nr>.bmf from match into localalign, like this:
bfast localalign -f $REF -1 longreads.1.bmf -2 shortreads.1.bmf -A 1 -t -U -n 8 > local.1.baf

Then I run postprocess, convert the SAM into a sorted BAM, and at the end merge all BAM files.

I also see that localalign is the most time-consuming step; it takes up to 4 days on 8 CPUs for 50 Mio read pairs. (I have >500 read pairs for each slide.) Next I'll try using -M 200 to see if that results in a speedup.
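In shell terms, the per-chunk part described above looks roughly like this (a sketch rather than my literal scripts; the chunk naming, thread counts and the current samtools syntax are assumptions):

# split each fastq into chunks of 10 Mio reads (4 lines per read = 40 Mio lines per chunk)
split -l 40000000 -d longreads.fastq  longreads.chunk.
split -l 40000000 -d shortreads.fastq shortreads.chunk.

for chunk in longreads.chunk.*; do
    nr=${chunk##*.}
    bfast match -n 8 -t -f $REF -A 1 -z -T $TEMPDIR -r longreads.chunk.$nr > longreads.$nr.bmf
    bfast bwaaln -c $REF shortreads.chunk.$nr > shortreads.$nr.bmf
    bfast localalign -f $REF -1 longreads.$nr.bmf -2 shortreads.$nr.bmf -A 1 -t -U -n 8 > local.$nr.baf
    bfast postprocess -f $REF -i local.$nr.baf -a 3 -A 1 -t -n 8 > local.$nr.sam
    samtools sort -o local.$nr.sorted.bam local.$nr.sam   # older samtools: view -bS ... | samtools sort
done

samtools merge merged.bam local.*.sorted.bam
samtools index merged.bam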
03-17-2011, 07:10 AM   #12
sdvie (Member, Spain)

Quote:
Originally Posted by epigen
I also split the short reads for bwaaln, then feed the resulting shortreads.<nr>.bmf file and the longreads.<nr>.bmf from match into localalign, like this:
bfast localalign -f $REF -1 longreads.1.bmf -2 shortreads.1.bmf -A 1 -t -U -n 8 > local.1.baf

Then I run postprocess, convert the SAM into a sorted BAM, and at the end merge all BAM files.

I also see that localalign is the most time-consuming step; it takes up to 4 days on 8 CPUs for 50 Mio read pairs. (I have >500 read pairs for each slide.) Next I'll try using -M 200 to see if that results in a speedup.
good to know, thanks!

cheers,
Sophia
05-06-2011, 12:40 PM   #13
NanYu (Member, USA)

I'm having the same problem of very slow speed in the postprocess step (SOLiD mate-pair reads of 50 bp each). My test data set is only 5000 pairs of sequences, but it took a long time to run the postprocess step. Is there a bug for mate-pair data?

Also, post #3 above uses the -v option, but those options are not mentioned in the manual. Can anyone let me know what it means?

Thanks!
05-06-2011, 02:30 PM   #14
nilshomer (Nils Homer, Boston, MA, USA)

What options are you using and what version of BFAST?
05-07-2011, 03:56 AM   #15
NanYu (Member, USA)

Thanks for replying.

I'm using "Version: 0.6.5a git:Revision: undefined$"

Here is the command line I used for postprocess:

bfast postprocess -f reference.fa -A 1 -O 1 -i bfast.aligned_subset.baf >bfast_subset.sam

It took about 1 hour to finish this (using 1 CPU) with 5000 pairs. Memory usage is <2 GB, while the system has 48 GB of memory.


Some output from BFAST:

************************************************************
Postprocessing...
************************************************************
Estimating paired end distance...
Used 2478 paired end distances to infer the insert size distribution.
The paired end distance range was from -12772 to 8032.
The paired end distance mean and standard deviation were -2466.06 and 3543.15.
The inversion ratio was 0.004036 (10 / 2478).
Reads processed: 5000
************************************************************
Reads processed: 5000
Alignment complete.
************************************************************
Found 178 reads with no ends mapped.
Found 491 reads with 1 end mapped.
Found 4331 reads with 2 ends mapped.
Found 4822 reads with at least one end mapping.
************************************************************
Terminating successfully!
************************************************************

Last edited by NanYu; 05-07-2011 at 04:20 AM.
05-07-2011, 07:52 AM   #16
nilshomer (Nils Homer, Boston, MA, USA)

Try using the "-U" option.
05-07-2011, 11:32 AM   #17
NanYu (Member, USA)

With the -U option added, it finished within 1 min!
But it means BFAST will not consider the results as mate pairs. How much will it affect the final result (compared with not using the -U option)?
If I use the -U option, can I use SAMtools or other software to further filter the results before SNP discovery?
Thanks!
05-07-2011, 11:47 AM   #18
nilshomer (Nils Homer, Boston, MA, USA)

Quote:
Originally Posted by NanYu
With the -U option added, it finished within 1 min!
But it means BFAST will not consider the results as mate pairs.
It means that it will not try the mate-pair rescue strategy, but the reads will still be annotated as mate pairs in the SAM file (check the flag field).
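For example, something along these lines will confirm the flags are set (out.bam is a placeholder name):

samtools view -c -f 1 out.bam   # records with the "paired in sequencing" flag set
samtools view -c -f 2 out.bam   # records additionally flagged as properly paired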

Quote:
Originally Posted by NanYu
How much will it affect the final result (compared with not using the -U option)?
It should affect it by 42. On a serious note, I don't know how much it will affect your results. Please post back if you notice an interesting difference.

Quote:
Originally Posted by NanYu
If I use the -U option, can I use SAMtools or other software to further filter the results before SNP discovery?
Thanks!
I am not sure why you are concerned. Did you look at the data and see something odd?
05-07-2011, 11:50 AM   #19
Chipper (Senior Member, Sweden)

The problem seems to be the estimation of distances and standard deviations. If you set these parameters, it will be much faster.
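For example, along the lines of the git-version flags epigen used in post #3 (I am only guessing that they set the expected insert-size mean, standard deviation and a cutoff; check the usage output of your version, and treat the values as placeholders for your own library):

bfast postprocess -f $REF -i reads.baf -a 3 -A 1 -v 160 -s 20 -S 4.0 -n 8 > reads.sam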
06-07-2011, 06:01 AM   #20
NanYu (Member, USA)

Quote:
Originally Posted by nilshomer
I am not sure why you are concerned. Did you look at the data and see something odd?

I'm wondering if, in the postprocess step, BFAST will do something like this: choose the best hit for F and the second-best hit for R within a pair because that combination is within the expected distance, even though the best hit for R is on another chromosome.
I know it is a rare scenario and maybe I should not worry about it.