Hi everyone! This is my first post and I would first like to thank everyone on this forum for their expertise, which I've consulted many many times over the past few weeks. I have not, however, been able to find an answer to my latest problem, so I figured I'd see if anyone else has some ideas.
I am trying to map SOLiD exome sequencing data to the human genome. I am new to the project and bioinformatics in general, and from what I understand, a portion of the data has already been mapped using the SOLiD software (which doesn't map indels), and my task is to use BWA to map indels for the remaining reads.
The files were provided to me in BAM format, with the sequence in color space and the qualities in what I believe to be the standard (not SOLiD) encoding.
bam file:
110_857_1962 4 * 0 0 * * 0 0 * * RG:Z:20100728125836730 CS:Z:T00220012130000301222223222122120232021122012121222 CQ:Z:%)''%')%%''&%'&&(((&&%%'&%&%)&)%%%(&&(&%(%%'%''()&
110_857_2034 4 * 0 0 * * 0 0 * * RG:Z:20100728125836730 CS:Z:T030.12130..311.0330.022330000003322010022213030022 CQ:Z:<@?!86:><!!7.2!<<0,!30>0719?10-8:>9%8')%824/+%,&<0
110_858_72 4 * 0 0 * * 0 0 * * RG:Z:20100728125836730 CS:Z:T32022000230003012211032210202202201203100122102222 CQ:Z?>36@&<;.%<)7'(4)1(&2+81,:&%+&1.1;.9<&.+)%=%4'93;
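For readers unfamiliar with the layout above: each record leaves the usual sequence and quality columns empty (`*`) and carries the color-space read and its qualities in optional `CS` and `CQ` tags. A minimal sketch of pulling those tags out of one record, assuming tab-separated SAM text (the forum display shows spaces); `parse_tags` is a hypothetical helper name, not part of any tool mentioned here:

```python
def parse_tags(sam_line):
    """Return the optional TAG:TYPE:VALUE fields of one SAM record as a dict."""
    fields = sam_line.rstrip("\n").split("\t")
    tags = {}
    # The first 11 columns are mandatory; everything after them is optional.
    for field in fields[11:]:
        # Split at most twice: quality strings may themselves contain ':'.
        parts = field.split(":", 2)
        if len(parts) == 3:
            tag, _type, value = parts
            tags[tag] = value
    return tags


# First record from the post, rebuilt with tab separators (an assumption).
record = "\t".join([
    "110_857_1962", "4", "*", "0", "0", "*", "*", "0", "0", "*", "*",
    "RG:Z:20100728125836730",
    "CS:Z:T00220012130000301222223222122120232021122012121222",
    "CQ:Z:%)''%')%%''&%'&&(((&&%%'&%&%)&)%%%(&&(&%(%%'%''()&",
])
tags = parse_tags(record)
```

Splitting with a limit of two keeps a `:` inside a quality string from truncating the value.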
Because the sequences and qualities were stored in the optional CS and CQ tags rather than in the sequence and quality fields, the file was incompatible with BWA, so I converted it to FASTQ format. I tried to mimic the process used in the solid2fastq.pl script that comes with BWA: I converted each color space sequence from 0123. to ACGTN, clipped off the first two characters of each read (adapter base + first color), and clipped off the first quality score so that the read length would match the quality length.
fastq file:
@110_857_1962
AGGAACGCTAAAATACGGGGGTGGGCGGCGAGTGAGCCGGACGCGCGGG
+
)''%')%%''&%'&&(((&&%%'&%&%)&)%%%(&&(&%(%%'%''()&
@110_857_2034
TANCGCTANNTCCNATTANAGGTTAAAAAATTGGACAAGGGCTATAAGG
+
@?!86:><!!7.2!<<0,!30>0719?10-8:>9%8')%824/+%,&<0
@110_858_72
GAGGAAAGTAAATACGGCCATGGCAGAGGAGGACGATCAACGGCAGGGG
+
?>36@&<;.%<)7'(4)1(&2+81,:&%+&1.1;.9<&.+)%=%4'93;
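The conversion described above can be sketched roughly as follows. This is only an illustration of the steps as stated in the post (translate 0123. to ACGTN, drop the first two characters of the read, drop the first quality value); `cs_to_fastq` is a hypothetical name, and the sketch does not claim to reproduce solid2fastq.pl itself.

```python
# Color digits are re-labelled as letters exactly as the post describes.
COLOR_TO_LETTER = str.maketrans("0123.", "ACGTN")


def cs_to_fastq(name, cs, cq):
    """Build one FASTQ record from the values of the CS and CQ tags."""
    # Drop the leading adapter base and the first color, then re-label.
    seq = cs[2:].translate(COLOR_TO_LETTER)
    # Drop the first quality value so both strings have the same length.
    qual = cq[1:]
    if len(seq) != len(qual):
        raise ValueError(f"{name}: {len(seq)} colors but {len(qual)} qualities")
    return f"@{name}\n{seq}\n+\n{qual}\n"


# First read from the post.
fastq_record = cs_to_fastq(
    "110_857_1962",
    "T00220012130000301222223222122120232021122012121222",
    "%)''%')%%''&%'&&(((&&%%'&%&%)&)%%%(&&(&%(%%'%''()&",
)
```

Note that the letters produced this way are still colors in disguise, not nucleotides; the length check is there only to catch records where the two tags disagree.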
However, when I ran BWA v0.5.9 with default values and the color space option, only 30 of 1,000,000 reads mapped. I'm stumped -- is there something I'm blatantly doing wrong, or does anyone have any idea where the problem might be coming from?
My command line for BWA:
bwa index -a bwtsw -c hg19.fasta
bwa aln -c hg19.fasta sample.fastq > output.sai
bwa samse hg19.fasta output.sai sample.fastq > output.sam
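For anyone wanting to reproduce the mapped-read count from output.sam, here is a minimal sketch that tallies records whose FLAG does not have the unmapped bit (0x4) set. It assumes plain tab-separated SAM text; `count_mapped` is a hypothetical helper, and the example lines are made up for illustration rather than taken from real output.

```python
def count_mapped(sam_lines):
    """Return (mapped, total) read counts for an iterable of SAM text lines."""
    mapped = total = 0
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line, not a read
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # blank or malformed line
        total += 1
        # FLAG bit 0x4 means "segment unmapped".
        if not (int(fields[1]) & 0x4):
            mapped += 1
    return mapped, total


# Made-up example: one header line, one unmapped read, one mapped read.
example = [
    "@SQ\tSN:chr1\tLN:1000\n",
    "read1\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*\n",
    "read2\t0\tchr1\t10\t37\t5M\t*\t0\t0\tACGTA\tIIIII\n",
]
mapped, total = count_mapped(example)
```

On a real file this could be called as `count_mapped(open("output.sam"))`, since a file object iterates line by line.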
Thanks,
Jason