**Brian Bushnell** · 06-16-2015, 10:23 PM

Transcriptomes of organisms with differential splicing will have a lot of multi-mapped reads because the same exons occur in multiple transcripts. If you want to map the reads unambiguously, use a splice-aware aligner such as BBMap and map to the genome.

**dariober** · 06-16-2015, 11:13 PM

My suggestion is to align to the genome using the known transcriptome as a guide (something like tophat2 [options] --GTF myannotation.gtf bwtindex myreads.fq).

Comparing this result to the transcriptome-only alignment you tried might still be a good exercise. As for why you see so many reads mapping with quality 0 (ambiguously), my bet is that your reference transcriptome is redundant and includes all the splice variants. For example, if a gene has exons A, B, C and can produce transcripts with exons AB, AC, or CB, then all the reads will be ambiguously mapped, since each exon appears twice in your reference FASTA.
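The A/B/C example can be counted out with standard Unix tools. This is only a toy sketch: each transcript is written as a string with one letter per exon, and `exon_copies` is a hypothetical helper name, not part of any aligner.

```shell
# Count how many reference sequences contain each exon.
# Input on stdin: one transcript per line, one letter per exon (toy notation).
exon_copies() {
    # Split every transcript into single letters, then tally identical letters.
    fold -w1 | sort | uniq -c | awk '{ print $2 "\t" $1 }'
}

# The three splice variants from the example: AB, AC and CB.
printf 'AB\nAC\nCB\n' | exon_copies
```

Each exon is reported with a count of 2, so a read falling entirely within one exon has two equally good placements in such a reference.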

Edit: Oops, just realized that Brian has already answered...!

**mouchkam** · 06-18-2015, 08:11 AM

Thanks for your timely feedback Brian and dariober. I'm certainly planning to align to the genome as well, although I had hoped to do both. I have seen others align reads to the Drosophila transcriptome in the literature, but their bioinformatic methods are sparse. Perhaps they use a modified transcriptome that is less redundant?

dariober, would you mind clarifying what you mean by using the known transcriptome as a guide? Is the GTF file referenced above the genome or transcriptome? I apologize for my ignorance concerning the various approaches...my previous work was in a non-model organism and although not ideal, the only option was to map the reads to a de novo transcriptome.

Thanks again!

**dariober** · 06-18-2015, 11:44 PM

Originally posted by mouchkam:

dariober, would you mind clarifying what you mean by using the known transcriptome as a guide?

The GTF file you have, dmel-all-r6.05.gtf, has the coordinates of the gene features (exons, CDS, etc.). You pass it to tophat so that the aligner "knows" where reads are expected to map. Have a look at the tophat documentation, which is quite extensive. So if your indexed genome files have the prefix (for example) dmel.genome, you would run tophat like this:

Code:

tophat2 [more options] -G dmel-all-r6.05.gtf dmel.genome reads.fq

**mouchkam** · 06-19-2015, 09:48 AM

Thanks dariober...this is actually what I did yesterday and I ran into yet another stumbling block. Perhaps you or Brian or others could shed some light on this problem?

Here is my code for the alignment:

bowtie2-build dmel-all-chromosome-r6.05.fa dmel-all-chromosome-r6.05

tophat -p 4 -i 50 -I 5000 -o C8B1_R1_CT.tophat -G dmel-all-r6.05.gtf dmel-all-chromosome-r6.05 C8B1_R1_CT.fastq

The align summary stats are as follows:
Input : 23535102
Mapped : 21714703 (92.3% of input)
of these: 365295 ( 1.7%) have multiple alignments (5959 have >20)
92.3% overall read mapping rate.

HTSeq code:
htseq-count C8B1_accepted_hits_sorted.sam dmel-all-r6.05.gtf

But the report said that 20 million of my 23 million hits aligned to no feature! I know this is not possible because the sequencing facility provided a list of count data (via their generic pipeline) and it shows thousands of genes that have counts. I noticed that the features that do have counts are all non-protein coding mRNA. I have a feeling that my coordinates are not lining up between the SAM file and the GTF file in HTSeq, but I'm not entirely sure how to deal with this. Do I need to realign with tophat? Or specify a different feature id when using HTSeq? I don't get an error message from tophat when running the alignment, so I think the 1st column of my GTF file matches the name of the reference sequence in the bowtie index. I tried to confirm this by running bowtie2-inspect on my base index, but to be honest, I'm having a really hard time finding the chromosome/contig name in the index names. Here is the code and a few lines of the output.

bowtie2-inspect -n dmel-all-chromosome-r6.05

dmel_mitochondrion_genome type=chromosome; loc=dmel_mitochondrion_genome:1..19517; ID=dmel_mitochondrion_genome; dbxref=GB:NC_001709; MD5=61af8db53361cd5744f41f773d21c3d4; length=19517; release=r6.05; species=Dmel;
211000022279114 type=golden_path_region; loc=211000022279114:1..14983; ID=211000022279114; dbxref=GB:DS483726,GB:DS483726,REFSEQ:NW_001845015; MD5=cf55405ff9a66546f5d6dad8cf539944; length=14983; release=r6.05; species=Dmel;
211000022280270 type=golden_path_region; loc=211000022280270:1..13108; ID=211000022280270; dbxref=GB:DS483666,GB:DS483666,REFSEQ:NW_001844955; MD5=366559f8dc9b57982af33ec0aac92804; length=13108; release=r6.05; species=Dmel;
211000022280187 type=golden_path_region; loc=211000022280187:1..13079; ID=211000022280187; dbxref=GB:DS483725,GB:DS483725,REFSEQ:NW_001845014; MD5=aaf75285222c30583ffafc89dbf37b0c; length=13079; release=r6.05; species=Dmel;
211000022280742 type=golden_path_region; loc=211000022280742:1..12513; ID=211000022280742; dbxref=GB:DS483677,GB:DS483677,REFSEQ:NW_001844966; MD5=10425aa97972c5f2aa93c1e01c885505; length=12513; release=r6.05; species=Dmel;
211000022280763 type=golden_path_region; loc=211000022280763:1..12001; ID=211000022280763; dbxref=GB:DS483690,GB:DS483690,REFSEQ:NW_001844979; MD5=e2bbfb2660294c610bf16fb4587f1a2b; length=12001; release=r6.05; species=Dmel;

Thanks in advance!

**mouchkam** · 06-19-2015, 01:37 PM

As an addition to my last post: after a more thorough investigation, I noticed that for the Drosophila chromosomes, the 1st column of the GTF lists "3R", whereas the index has a much longer name (e.g. 3R type=golden_path_region; loc=3R:1..32079331; ID=3R; dbxref=GB:AE014297,GB:AE014297,REFSEQ:NT_033777; MD5=420540d26d86fbf14185d2f2d68af9c4; length=32079331; release=r6.05; species=Dmel). The tophat manual states: "Please note that the values in the first column of the provided GTF/GFF file (column which indicates the chromosome or contig on which the feature is located), must match the name of the reference sequence in the Bowtie index you are using with TopHat." So does this mean that I need to go into the GTF and replace the 1st column with the much longer name of the reference sequence from the reference genome?
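A mismatch like this can be checked directly by comparing column 1 of the GTF with the header lines of the reference FASTA. The following is a minimal sketch using only standard Unix tools; `gtf_names_missing_from_fasta` is a hypothetical helper name, and it assumes the full FASTA header text (everything after `>`) is what ends up as the sequence name, as the bowtie2-inspect output in this thread suggests.

```shell
# Print sequence names found in column 1 of a GTF that do not occur
# as a complete header line in a FASTA file.
# Hypothetical helper: $1 = GTF file, $2 = FASTA file.
gtf_names_missing_from_fasta() {
    gtf_names=$(mktemp)
    fasta_names=$(mktemp)
    # Column 1 of every non-comment GTF line, de-duplicated and sorted.
    grep -v '^#' "$1" | cut -f1 | sort -u > "$gtf_names"
    # Full header text after '>', de-duplicated and sorted.
    grep '^>' "$2" | sed 's/^>//' | sort -u > "$fasta_names"
    # comm -23 keeps lines that appear only in the first (GTF) list.
    comm -23 "$gtf_names" "$fasta_names"
    rm -f "$gtf_names" "$fasta_names"
}
```

With the files from this thread it would be called as `gtf_names_missing_from_fasta dmel-all-r6.05.gtf dmel-all-chromosome-r6.05.fa`; any name it prints (such as 3R here) has no exactly matching header in the reference.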

**Brian Bushnell** · 06-19-2015, 01:44 PM

You need to rename them so that they match. The easiest way is to rename it in the reference, like this:

reformat.sh in=reference.fa out=renamed.fa trd

"trd" means "trimreaddescription" and will trim everything after the first whitespace.

**mouchkam** · 06-19-2015, 01:52 PM

Got it! Will try this. Thanks!
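For readers without BBTools installed, the header trimming suggested above can be approximated with awk. This is a sketch of the same effect (keep everything up to the first whitespace in each header), not BBTools' own implementation, and `trim_fasta_headers` is a hypothetical helper name.

```shell
# Keep only the first whitespace-delimited word of each FASTA header line;
# sequence lines are passed through unchanged.
# Hypothetical helper: $1 = input FASTA, output goes to stdout.
trim_fasta_headers() {
    awk '/^>/ { print $1; next } { print }' "$1"
}
```

For example, `trim_fasta_headers dmel-all-chromosome-r6.05.fa > renamed.fa` would turn the long 3R header into `>3R`. The bowtie2 index would then need to be rebuilt from the renamed file so that the index names match the GTF.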

Large proportion of mapping quality scores of 0 with bwa
