SEQanswers

02-02-2010, 04:33 AM   #21
javijevi
Member
Location: Spain
Join Date: Jan 2010
Posts: 38

When I checked the file produced by csfastaToFastq, I noticed that it trims the last 4 characters from the read names. For example, the reads named '2_6_241_F3' and '2_6_242_F3' in the csfasta and qual files both get the same name, '2_6_24', in the resulting fastq file.

Looking at the cpp source code of csfastaToFastq, I saw that it removes the _F3 and _R3 suffixes, using the underscore to locate the suffix, so read names that contain underscores produce strange results.

The relevant portion of the code is:

// this construction removes '_F3' or '_R3' from the sequence name
while (csFastaLine[c] != '\n' && csFastaLine[c] != 'F' && csFastaLine[c] != 'R') {
    if (csFastaLine[c] != '_') {
        underscorePosition = c;
    }
    seqName[c] = csFastaLine[c];
    c++;
}
seqName[underscorePosition] = '\n';

Changing them to the following lines seems to work fine for me:

// this construction does NOT remove '_F3' or '_R3' from the sequence name
while (csFastaLine[c] != '\n') {
    seqName[c] = csFastaLine[c];
    c++;
}
seqName[c] = '\n';

So my question is: is it necessary to get rid of the _F3 and _R3 suffixes for downstream analysis?

Thanks in advance.
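
To see why the quoted loop yields '2_6_24', it can help to replay it outside the converter. The sketch below is a Python transcription of the quoted C++ loop, not the tool's actual code; `quoted_loop` and `strip_tag_suffix` are hypothetical names, and the header line is assumed to start with '>' as in a csfasta file. It suggests that the truncation may come from the `!= '_'` test, which records the last non-underscore position rather than the last underscore. The loop would also stop early at any 'F' or 'R' elsewhere in a name.

```python
def quoted_loop(header_line: str) -> str:
    """Replay the quoted loop: copy until newline, 'F' or 'R', then cut."""
    c = 0
    cut = 0  # plays the role of underscorePosition
    name = []
    while c < len(header_line) and header_line[c] not in ("\n", "F", "R"):
        if header_line[c] != "_":  # as quoted; '==' would record the underscore
            cut = c
        name.append(header_line[c])
        c += 1
    # seqName[underscorePosition] = '\n' ends the name just before 'cut'
    return "".join(name[:cut])


def strip_tag_suffix(header_line: str) -> str:
    """Alternative (an assumption, not the tool's code): drop only a trailing tag."""
    name = header_line.rstrip("\n")
    for tag in ("_F3", "_R3"):
        if name.endswith(tag):
            return name[: -len(tag)]
    return name
```

With the two read names above, `quoted_loop` returns the same string for both, while `strip_tag_suffix` keeps them distinct.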

02-02-2010, 07:42 AM   #22
nilshomer
Nils Homer
Location: Boston, MA, USA
Join Date: Nov 2008
Posts: 1,285

It depends. For aligners like BFAST to recognize reads that come from the same DNA fragment (mates or pairs), the read names must be the same. Other aligners expect the mates in two separate files.

Why not just use the "solid2fastq" program included in BFAST (I am the author)?
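
As a rough illustration of why shared names matter for pairing, the sketch below groups reads by the name that remains after a trailing `_F3` or `_R3` tag is removed. It is only a sketch: `fragment_key` and `pair_mates` are hypothetical helpers, not BFAST functions, and the tag convention is assumed from the read names quoted earlier in the thread.

```python
from collections import defaultdict


def fragment_key(read_name: str) -> str:
    """Return the read name without a trailing '_F3' or '_R3' tag."""
    for tag in ("_F3", "_R3"):
        if read_name.endswith(tag):
            return read_name[: -len(tag)]
    return read_name


def pair_mates(read_names):
    """Group read names that share the same fragment key."""
    groups = defaultdict(list)
    for name in read_names:
        groups[fragment_key(name)].append(name)
    return dict(groups)
```

If the converter truncates two different names to the same string, reads from different fragments would end up in the same group, which is the failure described in the earlier post.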

02-02-2010, 09:57 AM   #23
javijevi
Member
Location: Spain
Join Date: Jan 2010
Posts: 38

Quote:
Originally Posted by nilshomer View Post
Why not just use the "solid2fastq" program included in BFAST (I am the author)?

Of course, I tried it before, but I got segmentation faults (as other users also reported earlier in this thread), so I switched to csfastaToFastq, which seemed to fix the problem.

In a post from November 2008 in this thread, you said that you were going to fix the problem in solid2fastq and ship the fix in a later release. I downloaded BFAST in January 2010, so I think the problem is not completely fixed. Can you confirm that?

02-02-2010, 10:40 AM   #24
nilshomer
Nils Homer
Location: Boston, MA, USA
Join Date: Nov 2008
Posts: 1,285

I apologize, I did not read the context. Could you try the latest release of BFAST to see if solid2fastq (the C version) works for you? I have not had any problems converting SOLiD data for BFAST (over 1 trillion bases and counting).

Nils

02-03-2010, 02:40 AM   #25
javijevi
Member
Location: Spain
Join Date: Jan 2010
Posts: 38

I've tried the C version of solid2fastq from BFAST 0.6.3a. It apparently worked fine, since read names are not truncated as they are by the original csfastaToFastq script. However, 'bwa aln' still raises the segmentation fault mentioned in this thread when run on the fastq produced by solid2fastq, while it runs fine on the fastq from the modified csfastaToFastq.

Please note that the error occurs when running bwa (not BFAST) on the fastq file produced by BFAST's solid2fastq.

Any ideas?

02-03-2010, 07:54 AM   #26
nilshomer
Nils Homer
Location: Boston, MA, USA
Join Date: Nov 2008
Posts: 1,285

Have you tried the solid2fastq.pl included in BWA? I apologize if I am repeating myself.

02-05-2010, 07:21 AM   #27
javijevi
Member
Location: Spain
Join Date: Jan 2010
Posts: 38

I didn't realize that bwa includes its own solid2fastq.pl...

I've just tried it and it seems to work fine: running 'bwa aln' with the fastq file produced this way does not raise the segmentation fault error.

By the way, I've noticed that the sequences in the fastq file produced by BFAST's solid2fastq are in standard color space (0123.), while the ones produced by either bwa's solid2fastq.pl or csfastaToFastq are in double-encoded color space (ACGTN). Could that be the problem? I cannot see any parameter in the 'bwa aln' command to specify the encoding expected in the fastq file, other than '-c' to work in color space.
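
For reference, double encoding simply rewrites each color as a nucleotide letter so that tools expecting letters can read the file. The sketch below uses the commonly used mapping 0 to A, 1 to C, 2 to G, 3 to T, with '.' becoming N. That mapping is stated here as an assumption and should be checked against the converter actually in use. `double_encode` is a hypothetical helper, and the sketch deliberately does not trim the primer base or the first color, which converters may also do.

```python
# Assumed mapping: 0->A, 1->C, 2->G, 3->T, '.'->N
_COLOR_TO_LETTER = str.maketrans("0123.", "ACGTN")


def double_encode(colors: str) -> str:
    """Rewrite a color-space string (digits 0-3 and '.') as letters."""
    unexpected = set(colors) - set("0123.")
    if unexpected:
        raise ValueError(f"not a color-space string: {sorted(unexpected)}")
    return colors.translate(_COLOR_TO_LETTER)
```

A file that still contains digits would fail this kind of check in reverse, which is one quick way to tell the two encodings apart before running the aligner.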

02-05-2010, 03:52 PM   #28
javijevi
Member
Location: Spain
Join Date: Jan 2010
Posts: 38

Quote:
Originally Posted by javijevi View Post
I've just tried it and it seems to work fine: running 'bwa aln' with the fastq file produced this way does not raise the segmentation fault error.

I'm sorry for the previous message; it is not true. The fastq file produced by the solid2fastq.pl script shipped with the bwa distribution also causes a segmentation fault on my computer.

02-05-2010, 07:29 PM   #29
lh3
Senior Member
Location: Boston
Join Date: Feb 2008
Posts: 693

You may try solid2fastq.pl here. The "-1" issue should be resolved, although I have not tested this on real data and I do not know if segfault is caused by other issues.

http://bio-bwa.svn.sourceforge.net/v...pl?revision=29

In addition, there are bugs in bwa-0.5.5. You'd better use the SVN version, which will become 0.5.6 in the near future.

02-08-2010, 03:47 AM   #30
javijevi
Member
Location: Spain
Join Date: Jan 2010
Posts: 38
solved

In my case, using the solid2fastq.pl from the SVN revision linked above solved the problems: the fastq file is produced correctly (read names are properly trimmed), and using that fastq file no longer raises segmentation faults.

Thanks a lot to everybody for the good work.

02-10-2010, 04:13 AM   #31
francois.sabot
Member
Location: France
Join Date: Dec 2009
Posts: 41

Hi guys,
I also get the same error. However, my data are pure Solexa, not SOLiD, so no colour-space to base-space conversion is needed.

Maybe another problem?
__________________
Francois Sabot, PhD

Be realistic. Demand the Impossible.
www.wikiposon.org

02-10-2010, 04:40 AM   #32
javijevi
Member
Location: Spain
Join Date: Jan 2010
Posts: 38

As far as I know, if you use fastq files the problem is the same: you've got a wrong fastq file.

02-10-2010, 04:53 AM   #33
francois.sabot
Member
Location: France
Join Date: Dec 2009
Posts: 41

OK... meaning that the Illumina pipeline (GERALD...) doesn't produce correctly formatted fastq data :s

02-10-2010, 06:44 AM   #34
javijevi
Member
Location: Spain
Join Date: Jan 2010
Posts: 38

At least not correct enough for the bwa pipeline, it seems. If you have the corresponding FASTA and QUAL files, you can try a script that produces fastq files from them. For SOLiD data, the latest solid2fastq.pl script provided in this thread worked perfectly for me, but it is for color space reads. I do not know of similar tools for nucleotide space (non-color-space) reads.
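
Before converting anything, a quick structural check of the fastq file may show whether it is malformed at all. The following is a minimal sketch, assuming unwrapped 4-line records; it checks only the layout (header, separator, matching lengths), not the quality encoding, and `fastq_problems` is a hypothetical helper rather than part of bwa.

```python
def fastq_problems(lines):
    """Return (record_index, message) tuples for malformed 4-line records."""
    lines = [ln.rstrip("\n") for ln in lines]
    problems = []
    remainder = len(lines) % 4
    if remainder:
        problems.append((len(lines) // 4, "incomplete final record"))
    for i in range(0, len(lines) - remainder, 4):
        header, seq, plus, qual = lines[i:i + 4]
        rec = i // 4
        if not header.startswith("@") or len(header) == 1:
            problems.append((rec, "missing '@' header or empty read name"))
        if not seq:
            problems.append((rec, "empty sequence"))
        if not plus.startswith("+"):
            problems.append((rec, "third line does not start with '+'"))
        if len(seq) != len(qual):
            problems.append((rec, "sequence and quality lengths differ"))
    return problems
```

It can be called on an open file handle, for example `fastq_problems(open("reads.fastq"))`, where the file name is only a placeholder.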

02-12-2010, 02:03 PM   #35
Lisa
Member
Location: Ohio
Join Date: Jan 2010
Posts: 10

Hi Everyone,
I am new to Illumina data analysis. I got a segmentation fault too. Here is how I got this error.

I first converted the Illumina sequence file to Sanger-quality fastq and ran bwa aln to create the .sai files without any problem.
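
For readers unfamiliar with this step: assuming the input uses the Illumina 1.3+ encoding (Phred scores stored with an ASCII offset of 64), converting to Sanger quality means re-encoding each character with an offset of 33. The sketch below shows only that re-encoding. It does not handle the older Solexa-scaled scores, `illumina64_to_sanger` is a hypothetical name, and which encoding the original file used is not stated in this post.

```python
def illumina64_to_sanger(qual: str) -> str:
    """Re-encode a Phred+64 quality string as Phred+33 (Sanger)."""
    out = []
    for ch in qual:
        score = ord(ch) - 64
        if score < 0:
            raise ValueError(f"{ch!r} is below the Phred+64 range")
        out.append(chr(score + 33))
    return "".join(out)
```

The length of the string never changes in this conversion, so the result should still match the sequence length of its record.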

Then I ran bwa sampe to get the following error:

[bwa_read_seq] 0.0% bases are trimmed.
[bwa_read_seq] 0.0% bases are trimmed.
[bwa_sai2sam_pe_core] convert to sequence coordinate...
Segmentation fault

My server has 16G of RAM and had 12G available at that moment.

What could cause this kind of segmentation fault? Any suggestions or comments would be greatly appreciated!

Lisa

02-13-2010, 04:06 PM   #36
javijevi
Member
Location: Spain
Join Date: Jan 2010
Posts: 38

Hi, Lisa. As you can see in previous posts, it seems to be related to a malformed fastq file produced by the solid2fastq (or similar) script. At least in my case, using the latest solid2fastq.pl, provided in the last post by user 'lh3' in this thread, solved the problem.

Cheers,

02-14-2010, 05:10 PM   #37
Lisa
Member
Location: Ohio
Join Date: Jan 2010
Posts: 10

Thanks a lot for your reply. Do you or anyone else know whether solid2fastq works for Illumina sequence files? Or are there Illumina-to-fastq scripts?

Thanks.

Lisa

02-14-2010, 05:13 PM   #38
Lisa
Member
Location: Ohio
Join Date: Jan 2010
Posts: 10

One more question. I have no problem running bwa aln, but I do have a problem running bwa sampe. Could there be causes other than the fastq format?
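
One thing that may be worth ruling out before suspecting bwa itself is that the two fastq files given to sampe do not list the same reads in the same order. This is only a guess at a possible cause, not a diagnosis. The sketch assumes unwrapped 4-line records and mate names that differ at most by a trailing '/1' or '/2'; `read_names` and `first_mismatch` are hypothetical helpers.

```python
from itertools import zip_longest


def read_names(lines):
    """Yield read names from 4-line FASTQ records, without '@' and '/1' or '/2'."""
    for i, line in enumerate(lines):
        if i % 4 == 0:  # header line of each record
            fields = line.strip()[1:].split()
            name = fields[0] if fields else ""
            if name.endswith(("/1", "/2")):
                name = name[:-2]
            yield name


def first_mismatch(lines1, lines2):
    """Return (record_index, name1, name2) for the first disagreement, else None."""
    pairs = zip_longest(read_names(lines1), read_names(lines2))
    for idx, (a, b) in enumerate(pairs):
        if a != b:
            return idx, a, b
    return None
```

A result of `None` means the names line up; a tuple with a missing name on one side means one file has more records than the other.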

Thanks in advance!

Lisa

02-15-2010, 01:56 AM   #39
francois.sabot
Member
Location: France
Join Date: Dec 2009
Posts: 41

After tests with version 0.5.6, I still have the same problem... On the bio-bwa mailing list, Heng Li (user lh3 here) offered to try my data (reference and reads) to find where the problem is... As the fastq file was produced with the official Illumina software (by the company IntegraGen), it would be nice either to adapt bwa somehow or to know what to do before running bwa to ensure correct mapping with this type of data. Most people who use Illumina data do not know how to reformat the fastq file themselves...

EDIT: after some research on my side: as the dmesg output is 'bwa[21956]: segfault at 0 ip 00007fc4cf69bc63 sp 00007fff11af84a8 error 6 in libc-2.10.1.so[7fc4cf61a000+166000]', it is possible that the error is linked to the libc6 version too... How can this be fixed?
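
The dmesg line can be read a little further. On x86 Linux, the low bits of the `error` field are commonly documented as: bit 0 set for a protection violation (clear for a page that is not present), bit 1 set for a write, bit 2 set for user mode. Under that reading, `segfault at 0 ... error 6` would be a user-mode write to address 0, which usually points to a NULL pointer being handed to a libc routine rather than to a fault in libc itself. A small sketch of that decoding, with `decode_segfault` as a hypothetical helper:

```python
import re

# Address, instruction pointer, stack pointer and error code are printed in hex.
_SEGFAULT_RE = re.compile(
    r"segfault at ([0-9a-f]+) ip ([0-9a-f]+) sp ([0-9a-f]+) error ([0-9a-f]+)"
)


def decode_segfault(line: str):
    """Decode the address and x86 page-fault bits from a dmesg segfault line."""
    match = _SEGFAULT_RE.search(line)
    if match is None:
        return None
    code = int(match.group(4), 16)
    return {
        "address": int(match.group(1), 16),
        "page_present": bool(code & 1),
        "write": bool(code & 2),
        "user_mode": bool(code & 4),
    }
```

If that reading is right, the libc version would matter less than finding which buffer bwa left unallocated for this input.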

Last edited by francois.sabot; 02-15-2010 at 02:12 AM.

02-22-2010, 07:48 PM   #40
yenhuahuang1
Junior Member
Location: Taiwan
Join Date: Jan 2010
Posts: 7
colour space alignment, reverse complement, and deletions in alignments

Quote:
Originally Posted by lh3 View Post
You may try solid2fastq.pl here. The "-1" issue should be resolved, although I have not tested this on real data and I do not know if segfault is caused by other issues.

http://bio-bwa.svn.sourceforge.net/v...pl?revision=29

In addition, there are bugs in bwa-0.5.5. You'd better use the SVN version, which will become 0.5.6 in the near future.

In my case I found two issues when using BWA:

- the quality string length is inconsistent with that of the sequence string when there are deletions in the alignment.

e.g.
solid_20091208_XX_SureSelect_6505all_exon_:879_1653_1194 16 11 56020361 0 2S6M1D40M * 0 0 CCAGAAGAGAAACACCCAAGATAACTCTATCAGTGATAGGACTAACAG : XT:A:R CM:i:1 X0:i:2 X1:i:0 XM:i:2 XO:i:1 XG:i:1 MD:Z:6^A40
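
A check like the following can flag such records automatically. It relies on the SAM convention that the M, I, S, = and X operations consume read bases, so their lengths should sum to the length of both SEQ and QUAL (unless QUAL is '*'). The helper names are hypothetical. For the record above, the CIGAR 2S6M1D40M implies 48 bases, which matches the 48-base sequence but not the single ':' shown as its quality string.

```python
import re

_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
_CONSUMES_QUERY = set("MIS=X")


def cigar_query_length(cigar: str) -> int:
    """Sum the lengths of CIGAR operations that consume read bases."""
    ops = _CIGAR_OP.findall(cigar)
    if "".join(n + op for n, op in ops) != cigar:
        raise ValueError(f"cannot parse CIGAR: {cigar!r}")
    return sum(int(n) for n, op in ops if op in _CONSUMES_QUERY)


def record_lengths_consistent(cigar: str, seq: str, qual: str) -> bool:
    """True if SEQ and QUAL both match the read length implied by the CIGAR."""
    expected = cigar_query_length(cigar)
    if len(seq) != expected:
        return False
    return qual == "*" or len(qual) == expected
```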


- In the colour space alignment of a reverse-complemented read, the sequence BWA writes to the SAM file is the complement (but not the reverse) of the reference sequence (the human genome GRCh37).

e.g.
solid_20091208_XX_SureSelect_6505all_exon_:853_23_233 16 7 107877340 25 48M * 0 0 CACGTTTTGACCACCGAAGTCTCCGTCTCGAAGATAAGTGTCCAGACG R;8J]]L<>K]NCO]]O,(Q]TJ@N]TPQW]L<R]]MC@O]]Y]YHPR XT:A:U CM:i:1 X0:i:1 X1:i:0 XM:i:3 XO:i:0 XG:i:0 MD:Z:7T40
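
To make the distinction concrete: for a read mapped to the reverse strand (flag 16), the SAM convention is that SEQ holds the reverse complement of the original read, so that it reads in the same direction as the forward reference strand. The sketch below contrasts complementing with reverse-complementing; `complement` and `reverse_complement` are hypothetical helpers, and only A, C, G, T and N are handled.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def complement(seq: str) -> str:
    """Swap each base for its partner, keeping the order."""
    return seq.translate(_COMPLEMENT)


def reverse_complement(seq: str) -> str:
    """Swap each base for its partner and reverse the order."""
    return complement(seq)[::-1]
```

Comparing a reported SEQ against both `complement` and `reverse_complement` of the reference window would show which of the two the output actually corresponds to.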

Have these issues been resolved in the SVN version? Many thanks.

Last edited by yenhuahuang1; 02-22-2010 at 09:25 PM.