SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
Notes from trying BreakDancer wjeck Bioinformatics 26 09-09-2015 05:37 AM
breakdancer compilation help xxb0316 Bioinformatics 13 08-11-2015 11:05 PM
BreakDancer problems totalnew Bioinformatics 5 08-04-2015 01:38 AM
SRA - SRR*.lite.sra adrian Bioinformatics 2 03-19-2012 09:43 AM
Breakdancer options bawee Bioinformatics 3 03-23-2010 06:01 PM

Reply
 
Thread Tools
Old 07-13-2010, 07:18 AM   #1
stubrown
Junior Member
 
Location: NYC

Join Date: Nov 2008
Posts: 4
Default Need workflow from SRA to BreakDancer

I am flummoxed by "fastq" formats. My goal for the moment is to test out BreakDancer to find translocations at low coverage. I don't have our own data yet, so want to use data from SRA. I have downloaded a paired-end data set (two 'fastq' files called SRR031156_1.fastq and SRR031156_2.fastq) from SRA but I cant get the SRA file format to be accepted by anything.


@SRR031156.3 HWI-EAS367_fcID30810AAXXr1:6:1:1564:603 length=31
GTTAGGGTTAGGGTTAGGGTTAGGGTTAGGG
+SRR031156.3 HWI-EAS367_fcID30810AAXXr1:6:1:1564:603 length=31
<<<<<<<<<<<<<<<<<<<<<<4<<<<<<<<
@SRR031156.4 HWI-EAS367_fcID30810AAXXr1:6:1:1183:938 length=31
GATCGGAAGAGCGGTTCAGCAGGAATGCCGA
+SRR031156.4 HWI-EAS367_fcID30810AAXXr1:6:1:1183:938 length=31
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<


My first effort is to align these nice fastq files with BWA, but I get a segmentation fault. I have checked some other threads on this forum, and it seems clear that the BWA segfault is due to problems with fastq format. This is just bad. The most popular aligner should work with the official standard file archive format in the public data repository.

[genomes@hactar BreakDancer]$ ./bwa-0.5.8a/bwa aln Mouse_mm9_fa/MUS SRX014041/SRR031156_1.fastq > MUS_aln.sai
[bwa_aln] 17bp reads: max_diff = 2
[bwa_aln] 38bp reads: max_diff = 3
[bwa_aln] 64bp reads: max_diff = 4
[bwa_aln] 93bp reads: max_diff = 5
[bwa_aln] 124bp reads: max_diff = 6
[bwa_aln] 157bp reads: max_diff = 7
[bwa_aln] 190bp reads: max_diff = 8
[bwa_aln] 225bp reads: max_diff = 9
[bwa_aln_core] calculate SA coordinate... Segmentation fault


My second effort is to try to convert to SAM or BAM format using FastqToSam.jar from the SAMTOOLS/Picard toolkit. This fails because there is not a /1 or /2 at the end of the header line for each read.

I really do not want to write some new Perl script for such an obvious task. There must be a well validated workflow out there from SRA to BWA.

... And from BWA output to BreakDancer (to avoid my next post when that also turns out to be problematic)
stubrown is offline   Reply With Quote
Old 07-13-2010, 08:17 AM   #2
nickloman
Senior Member
 
Location: Birmingham, UK

Join Date: Jul 2009
Posts: 356
Default

Isn't BWA called as follows:

bwa aln <database> <FASTQ file>

a la

http://bio-bwa.sourceforge.net/bwa.shtml

Edit: Actually I can see you did that already. Did you run bwa index on the mouse database first?
nickloman is offline   Reply With Quote
Old 07-13-2010, 08:33 AM   #3
raela
Member
 
Location: Ithaca, NY

Join Date: Apr 2010
Posts: 39
Default

Also, if you did index it, are you sure you indexed correctly? I had the same issue until I made another index for the genome, then it ran fine. If you want to test if it's the reads, only grab one chromosome, index it, and see if it aligns then.
raela is offline   Reply With Quote
Old 07-13-2010, 09:50 AM   #4
stubrown
Junior Member
 
Location: NYC

Join Date: Nov 2008
Posts: 4
Default

OK, that was a very helpful idea to work with the index of just one chromosome. That worked fine. So I guess my problem is really about how do I index an entire genome of chromosome *.fa files? (i.e. the Mouse mm9 genome from UCSC); or once indexed, how do I point to all of them as a database for the 'bwa aln' command?

I did this by running the 'bwa index' command inside of a 'foreach' loop for each of my *.fa files. I got a lot of index files (.pac, .rpac, etc), but perhaps this is the source of my segmentation fault error.
stubrown is offline   Reply With Quote
Old 07-13-2010, 10:19 AM   #5
raela
Member
 
Location: Ithaca, NY

Join Date: Apr 2010
Posts: 39
Default

That would most likely be the issue! You need to combine the chromosomes into one file. On a *nix machine, say each is named chr##.fa (I know it's this way for the horse.. chr1, chr2, chr3, ... chrX) - you would do
cat chr*.fa > genome.fa
This tells it to put the contents of all files in the new file genome.fa. Then you index, but, you probably want to use -a bwtsw.. I believe not including that flag was my error. So, you'd do
bwa index -p prefix -a bwtsw genome.fa
raela is offline   Reply With Quote
Old 07-03-2011, 07:39 PM   #6
history_of_robots
Junior Member
 
Location: Caltech

Join Date: May 2011
Posts: 9
Default

Quote:
Originally Posted by raela View Post
That would most likely be the issue! You need to combine the chromosomes into one file. On a *nix machine, say each is named chr##.fa (I know it's this way for the horse.. chr1, chr2, chr3, ... chrX) - you would do
cat chr*.fa > genome.fa
This tells it to put the contents of all files in the new file genome.fa. Then you index, but, you probably want to use -a bwtsw.. I believe not including that flag was my error. So, you'd do
bwa index -p prefix -a bwtsw genome.fa
You are suggesting to use '-a bwtsw'. On BWA manual (http://bio-bwa.sourceforge.net/bwa.shtml) it says: "BWA-SW can also be used to align ~100bp reads, but it is slower than the short-read algorithm." and "On low-error short queries, BWA-SW is slower and less accurate than the first algorithm [IS], but on long queries, it is better". So it is '-a is' that is seemed to be required for BWA indexing a genome for subsequent short read alignment. However, in the same manual page it says: 'IS is moderately fast, but does not work with database larger than 2GB'. A complete genome can be larger than that (for example mm9.fa is 2.5GB). So I am wondering if chromosomes should be indexed separately. Unfortunately, it seems that in this case BWA will have to be run on each chromosome separately it seems. Or is there another way to use IS on the whole genome?
__________________
"Letís start with the three fundamental Rules of Robotics...."
history_of_robots is offline   Reply With Quote
Old 07-03-2011, 08:24 PM   #7
raela
Member
 
Location: Ithaca, NY

Join Date: Apr 2010
Posts: 39
Default

Quote:
Originally Posted by history_of_robots View Post
You are suggesting to use '-a bwtsw'. On BWA manual (http://bio-bwa.sourceforge.net/bwa.shtml) it says: "BWA-SW can also be used to align ~100bp reads, but it is slower than the short-read algorithm." and "On low-error short queries, BWA-SW is slower and less accurate than the first algorithm [IS], but on long queries, it is better". So it is '-a is' that is seemed to be required for BWA indexing a genome for subsequent short read alignment. However, in the same manual page it says: 'IS is moderately fast, but does not work with database larger than 2GB'. A complete genome can be larger than that (for example mm9.fa is 2.5GB). So I am wondering if chromosomes should be indexed separately. Unfortunately, it seems that in this case BWA will have to be run on each chromosome separately it seems. Or is there another way to use IS on the whole genome?
No, what I said is correct - you are thinking of "bwa bwasw" as an alternate to aln + samse/sampe. The -a bwtsw uses the algorithm for indexing a large genome. BWTSW vs. IS has nothing to do with read size - just genome size.
raela is offline   Reply With Quote
Old 07-03-2011, 09:00 PM   #8
history_of_robots
Junior Member
 
Location: Caltech

Join Date: May 2011
Posts: 9
Default

Oh cool, you are absolutely right! That's the right option to index a genome for short read alignment. Thanks very much.
__________________
"Letís start with the three fundamental Rules of Robotics...."
history_of_robots is offline   Reply With Quote
Old 12-28-2011, 11:07 AM   #9
niti217
Member
 
Location: USA

Join Date: Dec 2011
Posts: 10
Default

I am having similar problem - any help would be greatly appreciated.

I am trying to index Homo_sapiens.GRCh37 ...fa file using the command

bwa index -p myGenome -a bwtsw /directory/myGenome.fa

but it keeps giving me the following error

[bwa_index] Pack FASTA... 56.76 sec
[bwa_index] Reverse the packed sequence... Segmentation fault

Can someone please help me with possible suggestion to fix this. Thank you.
niti217 is offline   Reply With Quote
Old 12-31-2011, 02:11 PM   #10
houkto
Junior Member
 
Location: UK

Join Date: Sep 2011
Posts: 5
Default

Quote:
Originally Posted by niti217 View Post
I am having similar problem - any help would be greatly appreciated.

I am trying to index Homo_sapiens.GRCh37 ...fa file using the command

bwa index -p myGenome -a bwtsw /directory/myGenome.fa

but it keeps giving me the following error

[bwa_index] Pack FASTA... 56.76 sec
[bwa_index] Reverse the packed sequence... Segmentation fault

Can someone please help me with possible suggestion to fix this. Thank you.
I saw a similar error in another forum http://biostar.stackexchange.com/que...entation-fault it turns out to be a hardware issue i.e memory less than 2G


N.B happy new year
houkto is offline   Reply With Quote
Old 01-03-2012, 09:13 AM   #11
niti217
Member
 
Location: USA

Join Date: Dec 2011
Posts: 10
Default

Quote:
Originally Posted by houkto View Post
I saw a similar error in another forum http://biostar.stackexchange.com/que...entation-fault it turns out to be a hardware issue i.e memory less than 2G


N.B happy new year
Thanks so much for your help. I really appreciate it.
niti217 is offline   Reply With Quote
Reply

Tags
breakdancer, bwa, fastq, sra format

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 03:39 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO