Hello,
I'm working with RNA-Seq data (100 bp, Single End, Illumina) from Drosophila melanogaster. I'm hoping to map the reads to the transcriptome using bwa, obtain feature (transcript) counts using HTSeq, and perform the differential expression analysis in EdgeR. I'm aware that HTSeq was designed for splicing-aware aligners and I also plan to align my reads to the Genome using Tophat so that I can compare the two approaches. While aligning to the transcriptome with bwa, ~70% of my reads have mapping quality scores of 0. I did not realize this until I used HTSeq got the following output (in addition to 0 reads mapping to any feature):
__no_feature 6122078
__ambiguous 0
__too_low_aQual 16002186
__not_aligned 1410838
__alignment_not_unique
It seems very odd that the majority of my reads (and indeed all of the reads that either didn't have alignments or aligned to no feature) would have a low quality map score. If I understand a quality score of 0 correctly, this means that a read maps to two or more places equally. This occurs regardless if I use bwa mem or bwa aln or if I use different versions of the transcriptome (contiguous protein coding sequence versus mRNA from FlyBase) . Why do all of my reads that align to transcripts map to two or more locations?
My pipeline is listed below. I used loops for the trim and alignment steps since I have numerous fastq files. Full disclosure, this is the first time I've done this and thus my loops could be erroneous, so I have included them below. The HTSeq output above is for one sam file that was generated via bwa mem, but I get similar results for all other alignment files.
#Trim and Clean
#!/bin/bash
FILES=~/mouchka/raw_reads/fly_raw_reads/*.fastq
for f in $FILES
do
NAME=${f%.*}_CT.fastq
java -jar /opt/modules/biology/trimmomatic/0.33/bin/trimmomatic-0.33.jar SE -phred33 $f $NAME ILLUMINACLIP:TruSeq3-SE.fa:2:30:10 LEADING:13 TRAILING:13 SLIDINGWINDOW:4:20 MINLEN:25
echo $f
done
#Align
bwa index dmel-all-transcript-r6.05.fasta
#!/bin/bash
FILES=~/mouchka/trimmed/fly_trim_reads/test/*.fastq
for f in $FILES
do
NAME=${f%.*}_P.sam
bwa mem -t 4 dmel-all-transcript-r6.05.fasta $f > $NAME
echo $f
done
Sample samtools flagstat ouput:
23535102 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
22124264 + 0 mapped (94.01% : N/A)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (N/A : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (N/A : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
HTSeq
htseq-count -i transcript_id alignment.sam dmel-all-r6.05.gtf
Any advice would be much appreciated. Thanks so much.
I'm working with RNA-Seq data (100 bp, Single End, Illumina) from Drosophila melanogaster. I'm hoping to map the reads to the transcriptome using bwa, obtain feature (transcript) counts using HTSeq, and perform the differential expression analysis in EdgeR. I'm aware that HTSeq was designed for splicing-aware aligners and I also plan to align my reads to the Genome using Tophat so that I can compare the two approaches. While aligning to the transcriptome with bwa, ~70% of my reads have mapping quality scores of 0. I did not realize this until I used HTSeq got the following output (in addition to 0 reads mapping to any feature):
__no_feature 6122078
__ambiguous 0
__too_low_aQual 16002186
__not_aligned 1410838
__alignment_not_unique
It seems very odd that the majority of my reads (and indeed all of the reads that either didn't have alignments or aligned to no feature) would have a low quality map score. If I understand a quality score of 0 correctly, this means that a read maps to two or more places equally. This occurs regardless if I use bwa mem or bwa aln or if I use different versions of the transcriptome (contiguous protein coding sequence versus mRNA from FlyBase) . Why do all of my reads that align to transcripts map to two or more locations?
My pipeline is listed below. I used loops for the trim and alignment steps since I have numerous fastq files. Full disclosure, this is the first time I've done this and thus my loops could be erroneous, so I have included them below. The HTSeq output above is for one sam file that was generated via bwa mem, but I get similar results for all other alignment files.
#Trim and Clean
#!/bin/bash
FILES=~/mouchka/raw_reads/fly_raw_reads/*.fastq
for f in $FILES
do
NAME=${f%.*}_CT.fastq
java -jar /opt/modules/biology/trimmomatic/0.33/bin/trimmomatic-0.33.jar SE -phred33 $f $NAME ILLUMINACLIP:TruSeq3-SE.fa:2:30:10 LEADING:13 TRAILING:13 SLIDINGWINDOW:4:20 MINLEN:25
echo $f
done
#Align
bwa index dmel-all-transcript-r6.05.fasta
#!/bin/bash
FILES=~/mouchka/trimmed/fly_trim_reads/test/*.fastq
for f in $FILES
do
NAME=${f%.*}_P.sam
bwa mem -t 4 dmel-all-transcript-r6.05.fasta $f > $NAME
echo $f
done
Sample samtools flagstat ouput:
23535102 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
22124264 + 0 mapped (94.01% : N/A)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (N/A : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (N/A : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
HTSeq
htseq-count -i transcript_id alignment.sam dmel-all-r6.05.gtf
Any advice would be much appreciated. Thanks so much.
Comment