I have a bit of a mystery on my hands about an RNA-Seq Project.
I am working with some mice lines, standard RNA-seq prep, paired end. Initial adapter trimming is done with cutadapt. At this point fastqc looks good across the board, so on to alignment.
I'm aligning my reads to the pre-build mouse MM37(mm9) build from ensemble. I'm using tophat2 for alignment and am getting pretty good mapping. Here is an example.
I ran tophat with the following options:
Here are some stats on the alignment from samtools:
Next, I ran featureCounts as following:
Note that I included the -M option to include multimapping reads. I did that because I was already losing a large amount of reads.
Here is the problem I'm having, I am getting a very large number of "Unassigned_NoFeatures" when I run featureCounts. For example:
Sometimes it's even around 50% of the reads are unassigned_nofeatures.
I used the ensemble prebuilt bowtie indexes and accompanying .gtf annotation, so I can't imagine its an annotation issue. When I look at the reads in a genome browser they seem to largely (not completely) map to exons, and there aren't any super-over-expressed genes in the count list generated by featurecounts.
Does anyone know what might be the cause of this issue? I am currently attempting to get counts with cufflinks to see if it's any better, or I might try htseq, but I am concerned the results will be similar.
I read here that MAPQ scores are generated differently for all the different alignment tools and wondered if featureCounts might take that into account, but I couldn't find documentation about how featurecounts handles MAPQ scores from tophat2.
Why am I losing so many reads at the count stage!?
I would appreciate any help or insight you can offer, thanks very much in advance.
I am working with some mice lines, standard RNA-seq prep, paired end. Initial adapter trimming is done with cutadapt. At this point fastqc looks good across the board, so on to alignment.
I'm aligning my reads to the pre-build mouse MM37(mm9) build from ensemble. I'm using tophat2 for alignment and am getting pretty good mapping. Here is an example.
Left reads:
Input : 27528769
Mapped : 24331602 (88.4% of input)
these: 3339296 (13.7%) have multiple alignments (92032 have >20)
Right reads:
Input : 27528769
Mapped : 24655147 (89.6% of input)
of these: 3347668 (13.6%) have multiple alignments (91524 have >20)
89.0% overall read mapping rate.
Aligned pairs: 23141474
of these: 3172252 (13.7%) have multiple alignments
760681 ( 3.3%) are discordant alignments
81.3% concordant pair alignment rate.
Input : 27528769
Mapped : 24331602 (88.4% of input)
these: 3339296 (13.7%) have multiple alignments (92032 have >20)
Right reads:
Input : 27528769
Mapped : 24655147 (89.6% of input)
of these: 3347668 (13.6%) have multiple alignments (91524 have >20)
89.0% overall read mapping rate.
Aligned pairs: 23141474
of these: 3172252 (13.7%) have multiple alignments
760681 ( 3.3%) are discordant alignments
81.3% concordant pair alignment rate.
tophat2 -p 6 -N 5 --read-edit-dist 5 -o $OUTPUT $INDEX_BASE $R1_FILE_PATH $R2_FILE_PATH
SN raw total sequences: 48986749
SN filtered sequences: 0
SN sequences: 48986749
SN is sorted: 1
SN 1st fragments: 24331602
SN last fragments: 24655147
SN reads mapped: 48986749
SN reads mapped and paired: 46282948 # paired-end technology bit set + both mates mapped
SN reads unmapped: 0
SN reads properly paired: 41476118 # proper-pair bit set
SN reads paired: 48986749 # paired-end technology bit set
SN reads duplicated: 0 # PCR or optical duplicate bit set
SN reads MQ0: 720045 # mapped and MQ=0
SN reads QC failed: 0
SN non-primary alignments: 16364564
SN total length: 4947661649 # ignores clipping
SN bases mapped: 4947661649 # ignores clipping
SN bases mapped (cigar): 4947661649 # more accurate
SN bases trimmed: 0
SN bases duplicated: 0
SN mismatches: 58293637 # from NM fields
SN error rate: 1.178206e-02 # mismatches / bases mapped (cigar)
SN average length: 101
SN maximum length: 101
SN average quality: 34.4
SN insert size average: 730.1
SN insert size standard deviation: 1723.8
SN inward oriented pairs: 22043384
SN outward oriented pairs: 262611
SN pairs with other orientation: 40818
SN pairs on different chromosomes: 727628
SN filtered sequences: 0
SN sequences: 48986749
SN is sorted: 1
SN 1st fragments: 24331602
SN last fragments: 24655147
SN reads mapped: 48986749
SN reads mapped and paired: 46282948 # paired-end technology bit set + both mates mapped
SN reads unmapped: 0
SN reads properly paired: 41476118 # proper-pair bit set
SN reads paired: 48986749 # paired-end technology bit set
SN reads duplicated: 0 # PCR or optical duplicate bit set
SN reads MQ0: 720045 # mapped and MQ=0
SN reads QC failed: 0
SN non-primary alignments: 16364564
SN total length: 4947661649 # ignores clipping
SN bases mapped: 4947661649 # ignores clipping
SN bases mapped (cigar): 4947661649 # more accurate
SN bases trimmed: 0
SN bases duplicated: 0
SN mismatches: 58293637 # from NM fields
SN error rate: 1.178206e-02 # mismatches / bases mapped (cigar)
SN average length: 101
SN maximum length: 101
SN average quality: 34.4
SN insert size average: 730.1
SN insert size standard deviation: 1723.8
SN inward oriented pairs: 22043384
SN outward oriented pairs: 262611
SN pairs with other orientation: 40818
SN pairs on different chromosomes: 727628
featureCounts -T $NUM_THREADS -M -a $GTF_FILE -o $out_file $TOPHAT_BAMFILE
Here is the problem I'm having, I am getting a very large number of "Unassigned_NoFeatures" when I run featureCounts. For example:
Assigned 32997164
Unassigned_Ambiguity 4502244
Unassigned_MultiMapping 0
Unassigned_NoFeatures 14503113
Unassigned_Unmapped 0
Unassigned_MappingQuality 0
Unassigned_FragmentLength 0
Unassigned_Chimera 0
Unassigned_Secondary 0
Unassigned_Nonjunction 0
Unassigned_Duplicate 0
Unassigned_Ambiguity 4502244
Unassigned_MultiMapping 0
Unassigned_NoFeatures 14503113
Unassigned_Unmapped 0
Unassigned_MappingQuality 0
Unassigned_FragmentLength 0
Unassigned_Chimera 0
Unassigned_Secondary 0
Unassigned_Nonjunction 0
Unassigned_Duplicate 0
I used the ensemble prebuilt bowtie indexes and accompanying .gtf annotation, so I can't imagine its an annotation issue. When I look at the reads in a genome browser they seem to largely (not completely) map to exons, and there aren't any super-over-expressed genes in the count list generated by featurecounts.
Does anyone know what might be the cause of this issue? I am currently attempting to get counts with cufflinks to see if it's any better, or I might try htseq, but I am concerned the results will be similar.
I read here that MAPQ scores are generated differently for all the different alignment tools and wondered if featureCounts might take that into account, but I couldn't find documentation about how featurecounts handles MAPQ scores from tophat2.
Why am I losing so many reads at the count stage!?
I would appreciate any help or insight you can offer, thanks very much in advance.
Comment