We have questions regarding the TopHat "-M/--prefilter-multihits" option and the Unmapped.bam file.
First, adapter sequences and low-quality bases were removed from the FASTQ files, so we expected all of the reads to map to the mouse genome.
The versions of TopHat2 and Bowtie2 used in this test were v2.0.3 and v2.0.0-beta6, respectively.
We ran TopHat with the following options.
# tophat2 -o $output_dir -G $annotation_gtf -p 2 $bowtie2_index $fastq1 $fastq2
In the result, only 60% of the total reads lacked the "secondary alignment" flag in the BAM file, which we take to mean that 60% of the reads were correctly mapped to the mouse genome.
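For reference, the kind of flag check described above can be sketched in Python. The bit values come from the SAM specification (0x4 = read unmapped, 0x100 = secondary alignment); the function names and the sample FLAG values are made up for illustration, and this is not necessarily how the percentages in this post were computed.

```python
# Bit masks defined by the SAM specification.
FLAG_UNMAPPED = 0x4     # read has no reported alignment
FLAG_SECONDARY = 0x100  # record is a secondary (non-primary) alignment


def classify_flag(flag: int) -> str:
    """Return a coarse label for one SAM FLAG value (hypothetical helper)."""
    if flag & FLAG_UNMAPPED:
        return "unmapped"
    if flag & FLAG_SECONDARY:
        return "secondary"
    return "primary"


def summarize(flags):
    """Count how many records fall into each label."""
    counts = {"primary": 0, "secondary": 0, "unmapped": 0}
    for flag in flags:
        counts[classify_flag(flag)] += 1
    return counts


if __name__ == "__main__":
    # Made-up FLAG values, not taken from the run described in this post.
    example_flags = [99, 147, 355, 403, 77, 141]
    print(summarize(example_flags))
```

Note that this counts alignment records, not reads: a read with several reported alignments contributes one record per alignment.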
On the other hand, many reads (30% of the total) were saved into Unmapped.bam.
Since we did not know why so many reads ended up in Unmapped.bam, we investigated what kinds of reads it contained.
We found many kinds of mouse repetitive sequences in Unmapped.bam, such as transposable elements, ribosomal proteins, and ribosomal RNAs (rRNAs).
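One way such an inspection could be done is to tally the most frequent read sequences and then look them up (for example with BLAST). The following is only a minimal sketch, assuming the records have first been exported as SAM text (for example with `samtools view`); the function name and the sample records are hypothetical.

```python
from collections import Counter


def top_sequences(sam_lines, n=3):
    """Return the n most common read sequences in SAM-formatted text lines."""
    counts = Counter()
    for line in sam_lines:
        # Skip blank lines and SAM header lines, which start with "@".
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:  # a SAM record has 11 mandatory fields
            continue
        counts[fields[9]] += 1  # column 10 (index 9) is SEQ
    return counts.most_common(n)


if __name__ == "__main__":
    # Made-up records for illustration only.
    records = [
        "@HD\tVN:1.0\tSO:unsorted",
        "r1\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
        "r2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
        "r3\t4\t*\t0\t0\t*\t*\t0\t0\tGGCC\tIIII",
    ]
    for seq, count in top_sequences(records):
        print(seq, count)
```

The most abundant sequences can then be compared against repeat, rRNA, or bacterial databases to see where they come from.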
Based on this result, we have three questions regarding the "-M" option and the Unmapped.bam file.
---------------------------------------
Q1. Is the -M option automatically enabled when the -G option is used?
The TopHat manual says that the "-M/--prefilter-multihits" option only applies together with the -G/--GTF option, as follows.
------------
(The following options in this section are only used when the transcriptome search was activated with -G/--GTF and/or --transcriptome-index)
------------
However, we used the -G option without the -M option.
Even so, repetitive sequences (i.e., multihit reads) were saved into the Unmapped.bam file.
Is the "-M" option automatically enabled when the -G option is used?
---------------------------------------
Q2. Are the filtered reads dumped into the Unmapped.bam file when the -M option is used?
We are wondering why so many reads were dumped into the Unmapped.bam file.
Are multihit reads saved into the Unmapped.bam file if the "-M" option is used?
---------------------------------------
Q3. How can we distinguish between multihit reads and unmapped reads in the Unmapped.bam file?
What kinds of reads are dumped into the Unmapped.bam file?
If the sample were contaminated with bacteria, would the unmapped bacterial reads be saved in the same Unmapped.bam file?
If so, how can we distinguish between multihit reads and unmapped bacterial reads in the Unmapped.bam file?
------------
Thank you for your cooperation.