**jperin** · 12-18-2009, 01:38 PM

Has anyone found a solution to this problem? I've just tried this C program, which seems to work well, but I am still getting the segmentation fault. I have 32 GB of RAM on my system, so again it's not memory.

[p@c0-0]$ bwa samse /share/apps/genome/human/bowtie/hg18/hg18.fa /data/Mk/FpMb.sai /data/Mk/FpMb.part1.fastq > /data/Mk/FpMb.2.sam
[bwa_aln_core] convert to sequence coordinate... 5.31 sec
[bwa_aln_core] refine gapped alignments... Segmentation fault

**luisczul** · 12-18-2009, 01:50 PM

solution

The fastq file is the problem.

You need to use a third-party script or program to convert your reads to a fastq file. For example, when processing reads from the SOLiD machine, in my case, the MAQ to fastq command didn't work. I had to use a third-party program or script, Tofasta in this case.

Hope this works.

**jperin** · 12-18-2009, 02:14 PM

I tried the provided solid2fastq.pl script with both bwa and maq (they're the same). I diff'd various versions, but they're all the same. They all threw segmentation faults during the bwa samse step. I saw the last response posted about using the attached C program. That was my last failed attempt. The fastq file is in a completely different order, so I can't quite tell whether they are much different. The file sizes, however, are quite different: the BWA version gives me roughly 6.4 GB and the C version 6.8 GB of data. I don't see how QValues alone could make such a difference...

What other third party tools are there that convert csfasta and qval files to fastq? The BWA tool and the C version posted on this thread are the only ones I have been able to find... Thanks!

**fpruzius** · 12-20-2009, 09:17 AM

C script fixed

I made a 'mistake' in the C script. Every 3rd line in a FASTQ file begins with a '+', and the rest of that line is an optional comment. However, I put the name of the read there, but shorter than on the first ('@') line. During alignment this is no problem with BWA, but with MAQ and the postprocessing with BWA/SAMtools it gives segmentation errors.

I fixed this in the script. The '+' line now contains only the '+'. The read name on the '+' line is also the reason why the FASTQ file from this script was that much bigger than the one from the Perl script. I have been using this script for weeks now and so far it has worked every time.
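For anyone writing their own converter, here is a minimal sketch of the idea. It is not the code from the attachment, and the function names `format_fastq_record` and `plus_line_is_consistent` are hypothetical. The first function writes a record with a bare '+' line; the second checks whether an existing '+' line is either bare or repeats the '@' name exactly, which I assume is what the stricter parsers expect.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Format one FASTQ record into `out` with a bare '+' separator line.
 * Returns the number of characters written, or -1 if the buffer is too
 * small or the sequence and quality strings differ in length. */
int format_fastq_record(char *out, size_t out_size, const char *name,
                        const char *seq, const char *qual)
{
    assert(out != NULL && name != NULL && seq != NULL && qual != NULL);
    if (strlen(seq) != strlen(qual))
        return -1;
    int n = snprintf(out, out_size, "@%s\n%s\n+\n%s\n", name, seq, qual);
    if (n < 0 || (size_t)n >= out_size)
        return -1;
    return n;
}

/* Check that a '+' line is either bare or repeats the '@' name exactly.
 * Both lines are passed without their trailing newline.
 * Returns 1 if the pair is consistent, 0 otherwise. */
int plus_line_is_consistent(const char *header_line, const char *plus_line)
{
    assert(header_line != NULL && plus_line != NULL);
    if (header_line[0] != '@' || plus_line[0] != '+')
        return 0;
    if (plus_line[1] == '\0')
        return 1; /* bare '+': nothing to mismatch */
    return strcmp(header_line + 1, plus_line + 1) == 0;
}
```

A '+' line such as `+2_6_241` under a header `@2_6_241_F3` would be flagged by the second function, which is the kind of shortened name my first version wrote.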

I changed the attachment above. However I'll repost it below too:

csfastaToFastq.tar.gz

And no, so far I also have not found any other means to convert (cs)fasta to Fastq.

**javijevi** · 01-28-2010, 07:45 AM

solved for me

This is just to share that I was having the same segfault error when running 'bwa aln' using a fastq file produced by the solid2fastq (C version) script (BFAST 0.6.2a, downloaded Jan/2010), on a 32 GB RAM machine with 2 quad-core Intel Xeon processors, for an 800 MB reference genome and 300,000,000 25 bp-long SOLiD reads.

Using the latest csfastaToFastq script provided by fpruzius to produce the fastq file solved the problem.

(Please, nilshomer, fix it in your great tool package distribution; it does not deserve such a disturbing, although minor, trouble).

**javijevi** · 02-02-2010, 04:33 AM

When checking the file produced by csfastaToFastq in my case, I realized that it trims off the last 4 characters of the read names, so that, for example, reads named '2_6_241_F3' and '2_6_242_F3' in the csfasta and qual files both get the same name in the fastq file produced, that is, '2_6_24'.

Checking the cpp source code of csfastaToFastq I have seen that it removes _F3 and _R3 suffixes, using the underscore to locate that suffix, so that, in the case of having underscores in the read names, the output gets strange results.

The portion of the code doing that stuff can be seen at the following lines:

    // this construction removes '_F3' or '_R3' from the sequence name
    while (csFastaLine[c] != '\n' && csFastaLine[c] != 'F' && csFastaLine[c] != 'R'){
        if (csFastaLine[c] != '_'){
            underscorePosition = c;
        }
        seqName[c] = csFastaLine[c];
        c++;
    }
    seqName[underscorePosition] = '\n';

Changing them to the following lines seems to work fine for me:

    // this construction does NOT remove '_F3' or '_R3' from the sequence name
    while (csFastaLine[c] != '\n'){
        seqName[c] = csFastaLine[c];
        c++;
    }
    seqName[c] = '\n';
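If the suffix does have to go, a third option is to strip only a trailing `_F3` or `_R3` and leave every other underscore, 'F' or 'R' in the name alone. The following is a minimal sketch of that idea, not the csfastaToFastq code: the function name `copy_read_name` is hypothetical, it expects the name without the leading '>', and unlike the snippets above it terminates the name with '\0' rather than '\n'.

```c
#include <assert.h>
#include <string.h>

/* Copy a read name from `line` (up to the first newline or NUL) into `name`
 * and drop one trailing "_F3" or "_R3" tag if present.
 * Returns the length of the resulting name, or -1 if it does not fit. */
int copy_read_name(const char *line, char *name, size_t name_size)
{
    assert(line != NULL && name != NULL);
    size_t len = strcspn(line, "\r\n"); /* length without the line ending */
    if (len >= 3 && (strncmp(line + len - 3, "_F3", 3) == 0 ||
                     strncmp(line + len - 3, "_R3", 3) == 0))
        len -= 3; /* cut only the tag at the very end of the name */
    if (len + 1 > name_size)
        return -1;
    memcpy(name, line, len);
    name[len] = '\0';
    return (int)len;
}
```

With this, '2_6_241_F3' and '2_6_242_F3' stay distinct ('2_6_241' and '2_6_242'), while '2_6_241_F3' and '2_6_241_R3' end up with the same name.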

So my question is: is it necessary to get rid of the _F3 and _R3 suffixes for downstream analysis?

Thanks in advance.

**nilshomer** · 02-02-2010, 07:42 AM

It depends. For aligners like BFAST, the read names must be the same so that reads from the same DNA fragment (mates or pairs) can be recognized. Other aligners separate the mates into two different files.

Why not just use the "solid2fastq" program included in BFAST (I am the author)?

**javijevi** · 02-02-2010, 09:57 AM

Of course, I tried it before, but I got segmentation faults (as indicated also by other users previously in this thread), so I shifted to csfastaToFastq, which seemed to fix the problem.

In a previous post from November 2008 in this thread, you said that you were going to fix the problem in solid2fastq and provide the fix in a later release. I downloaded BFAST in January 2010, so I think the problem is not completely fixed. Can you confirm that?

**nilshomer** · 02-02-2010, 10:40 AM

I apologize, I did not read the context. Could you try the latest release of BFAST to see if solid2fastq (the C-version) works for you? I have not had any problems with converting SOLiD data for BFAST (over 1 Trillion bases and counting).

Nils

**javijevi** · 02-03-2010, 02:40 AM

I've tried the solid2fastq C-version of BFAST 0.6.3a. It apparently worked fine, since read names are not truncated as the original csfastaToFastq script does. However, using the fastq produced by solid2fastq still raises the segmentation fault error mentioned in this thread when running 'bwa aln', while using the modified csfastaToFastq is fine.

Please note that the error occurs when running bwa (not BFAST) using the fastq file produced by BFAST's solid2fastq script.

Any idea?

**nilshomer** · 02-03-2010, 07:54 AM

Have you tried the solid2fastq.pl included in BWA? I apologize if I am repeating myself.

**javijevi** · 02-05-2010, 07:21 AM

I didn't realize that bwa includes its own solid2fastq.pl...

I've just tried it and it seems to work fine: running 'bwa aln' with the fastq file produced in this way does not raise the segmentation fault error.

By the way, I've realized that sequences in the fastq file produced by BFAST's solid2fastq script are in standard color space (0123.), while the ones produced by either bwa's solid2fastq or the csfastaToFastq script are in double-encoded color space (ACTGN). Could that be the problem? I cannot see any parameter in the 'bwa aln' command to specify the encoding expected in the fastq file, other than '-c' to work in color space.
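To make the difference concrete, here is a minimal sketch of double encoding, assuming the usual mapping of color calls to letters (0 to A, 1 to C, 2 to G, 3 to T, and '.' to N). The function name `double_encode_colors` is hypothetical and this is not code taken from any of the converters discussed here; handling of the leading primer base is also not covered.

```c
#include <assert.h>
#include <stddef.h>

/* Double-encode a color-space string in place: 0->A, 1->C, 2->G, 3->T and
 * '.'->N. Any other character is left untouched.
 * Returns the number of characters that were converted. */
size_t double_encode_colors(char *s)
{
    assert(s != NULL);
    size_t converted = 0;
    for (; *s != '\0'; ++s) {
        switch (*s) {
        case '0': *s = 'A'; break;
        case '1': *s = 'C'; break;
        case '2': *s = 'G'; break;
        case '3': *s = 'T'; break;
        case '.': *s = 'N'; break;
        default:  continue; /* not a color call: do not count it */
        }
        ++converted;
    }
    return converted;
}
```

Running it on the five symbols "0123." gives "ACGTN", so a file that still contains digits after conversion has not been double-encoded.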

**javijevi** · 02-05-2010, 03:52 PM

I'm sorry for the previous message. It is not true. The fastq file produced by the solid2fastq.pl script shipped with the bwa distribution also causes a segmentation fault on my computer.

**lh3** · 02-05-2010, 07:29 PM

You may try solid2fastq.pl here. The "-1" issue should be resolved, although I have not tested this on real data and I do not know if segfault is caused by other issues.

http://bio-bwa.svn.sourceforge.net/viewvc/bio-bwa/trunk/bwa/solid2fastq.pl?revision=29

In addition, there are bugs in bwa-0.5.5. You'd better use the SVN version, which will become 0.5.6 in the near future.

**javijevi** · 02-08-2010, 03:47 AM

solved

In my case, using the solid2fastq.pl from the SVN revision indicated above solved the problems: the fastq file is correctly produced (read names are properly trimmed), and using that fastq file does not raise segmentation fault errors.

Thanks a lot to everybody for the good work.
