**Studentlost** · 10-28-2014, 05:14 PM

I'm not sure I follow.

Here is my prep_reads.info file for one of my samples. This one has 396,555 read pairs:
left_min_read_len=251
left_max_read_len=251
left_reads_in =396555
left_reads_out=396555
right_min_read_len=251
right_max_read_len=251
right_reads_in =396555
right_reads_out=396555

Now here is a prep_reads.info file for a sample from another study that actually produces near perfect alignment:

left_min_read_len=20
left_max_read_len=50
left_reads_in =33601862
left_reads_out=33599795
right_min_read_len=20
right_max_read_len=50
right_reads_in =33601862
right_reads_out=33599859

The only difference seems to be the read length?
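For anyone who wants to compare the two files programmatically, here is a minimal Python sketch. It assumes prep_reads.info holds only simple `key=value` lines like the ones posted above; the function names are made up for illustration and are not part of Tophat.

```python
def parse_prep_reads_info(text):
    """Parse the key=value lines of a prep_reads.info file into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        if "=" not in line:
            continue  # skip blank or unrelated lines
        key, value = line.split("=", 1)
        try:
            stats[key.strip()] = int(value.strip())
        except ValueError:
            continue  # assumption: non-integer values are not needed here
    return stats


def fraction_kept(stats, side):
    """Fraction of reads on one side ('left' or 'right') that survived read prep."""
    return stats[f"{side}_reads_out"] / stats[f"{side}_reads_in"]


# Values copied from the two files posted above.
my_sample = parse_prep_reads_info("""\
left_min_read_len=251
left_max_read_len=251
left_reads_in =396555
left_reads_out=396555
right_min_read_len=251
right_max_read_len=251
right_reads_in =396555
right_reads_out=396555
""")

other_study = parse_prep_reads_info("""\
left_min_read_len=20
left_max_read_len=50
left_reads_in =33601862
left_reads_out=33599795
right_min_read_len=20
right_max_read_len=50
right_reads_in =33601862
right_reads_out=33599859
""")

for name, stats in (("my sample", my_sample), ("other study", other_study)):
    print(name,
          "read length:", stats["left_min_read_len"], "-", stats["left_max_read_len"],
          "| left kept:", round(fraction_kept(stats, "left"), 5),
          "| right kept:", round(fraction_kept(stats, "right"), 5))
```

Both samples keep essentially all of their reads through read prep, so the visible differences in these files are the read length and the number of reads.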

**danwiththeplan** · 10-28-2014, 05:26 PM

Can I ask for some context here? Reads of what? Derived from mRNA, total RNA, DNA? Is it a 250bp PE MiSeq run? What is your genome and is it eukaryotic? Does it have known annotated genes and did you use the annotation track in your tophat run?

**Studentlost** · 10-28-2014, 05:33 PM

The reads are of mRNA sequenced from single cells of primates. I'm not sure how it was run, my knowledge starts off at the point of raw reads given to me. The genome is the Mmul_1 build by Ensembl, of the rhesus monkey. It has annotated genes and I used a reference GTF in tophat. I also assembled a transcriptome index and tried that. None of this made a difference.

**danwiththeplan** · 10-28-2014, 05:42 PM

When you used the reference GTF, did you also set the option to only map to annotated genes (-T/--transcriptome-only)?
I'm a bit confused as to why you are expecting Tophat and Bowtie to behave identically. Tophat is splice-aware, bowtie is not. I don't understand why you would ever use Bowtie (or any other non-splice-aware mapper) to map RNA-derived sequence to a genome.

**danwiththeplan** · 10-28-2014, 05:44 PM

Also:

I'm not sure how it was run, my knowledge starts off at the point of raw reads given to me

While I understand that this happens occasionally and is sometimes not under your control, it's a terrible situation. If you can get more information on the context of the run (Platform? Run type? Library prep kit used? Size selection? Method of size selection?), you should.

**Studentlost** · 10-28-2014, 06:55 PM

I used bowtie2 (since tophat uses bowtie2 for alignment) as a test to see what was going on. I shouldn't be getting 0.4% alignment with tophat and 46% alignment with bowtie2. That's astronomically different. I am told that GSNAP produces roughly 46% alignment for the same data. So tophat is off...

**danwiththeplan** · 10-28-2014, 07:10 PM

I still don't understand why you want to use a non-splice-aware mapper at all, since every mRNA-derived read that spans a splice junction will fail to map properly. Why use Bowtie or SNAP? It makes no sense to compare mapping rates between Bowtie and Tophat for mRNA mapped onto a eukaryotic genome.

Neither SNAP nor Bowtie (nor BWA) is splice-aware, so yes, they should all produce more or less similar alignment rates.

Tophat is splice-aware. Your data is mRNA-derived. So, you should actually expect Tophat to map at a higher rate than SNAP/Bowtie/BWA since it would correctly map reads that span splice junctions.

One situation in which Tophat may have a low mapping rate is if you're telling it to only map to known genes (as defined in your GFF/GTF file) and there aren't that many known genes in the monkey genome you're using (or in the GFF/GTF file you have).

Could you post the code you used?

**Studentlost** · 10-28-2014, 07:30 PM

I tried many variations. They all gave me the same results. I've run it with and without a library type (first strand and unstranded), with and without a reference transcriptome index, with and without a GTF file, and I always get low values. I also tried it with coverage search, and with setting -r 150.
$tophat -p12 --no-coverage-search -o $tophat_dir $reference_genome $refined_reads/R1.atqt.fq $refined_reads/R2.atqt.fq

**danwiththeplan** · 10-28-2014, 07:53 PM

I've run it with and without a library type

and with setting -r 150

You should know both the library type and the size of the fragments that were selected and sequenced. This should be info that the sequencing provider gives you, and if they don't, they are doing you a disservice & you probably need to hassle them or check any documentation they've sent.

with and without a reference transcriptome index

Given a GTF, a prebuilt transcriptome index should make no difference except to the speed of the run. If none is supplied, Tophat builds one from the GTF, which slows the run down but shouldn't change the results. (The Bowtie index for the reference genome is a separate thing, and Tophat needs that to exist already.)

Otherwise I'm at a bit of a loss to help. The situation you describe doesn't make any sense to me, so I think I'm missing something.

**Studentlost** · 10-28-2014, 08:27 PM

If I knew the size of the fragments selected and sequenced, how would I implement that into my tophat code? Could you give me an example?

**sdriscoll** · 10-28-2014, 09:19 PM

I still think it would be useful to try running tophat with only a single end of the reads; however, this is sounding more and more like something Tophat may not be able to overcome. then you have to ask whether your goal is to use that data or to get Tophat to work. it sounds like GSNAP is a usable option for you. I also recommend STAR. tools don't always work every time on all data...sometimes they just glitch out and you have to change the pipeline and use a different tool.

ideally the way to solve this issue is to provide us a sample set of your reads so that someone can try running the alignment themselves and possibly figure out what is going on.

**danwiththeplan** · 10-29-2014, 01:28 PM

Originally posted by Studentlost

If I knew the size of the fragments selected and sequenced, how would I implement that into my tophat code? Could you give me an example?

You can use the -r option to set the expected inner distance between the mates (the fragment size minus both read lengths), and there is also an option to set the standard deviation of that distance. Size selection is done on a Covaris or simply on an agarose gel, and it's not precise, so not all fragments are exactly the defined size, hence the ability to set a standard deviation.
Of course there's always the possibility that whoever got this data didn't bother to get this info from the sequencing provider. You could always try setting a gigantic --mate-std-dev (so that paired reads map even if they are not that close to the expected inner distance) and then look at the resulting SAM file to see how far apart the reads are typically mapping. There are various tools to do this.
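As one rough way to "look at the resulting SAM file", here is a minimal Python sketch that reads the TLEN column (field 9) of uncompressed SAM text and reports the median inner distance. It is only a sketch under stated assumptions: both mates have the same length, pairs with a very large TLEN (probably spanning introns) are dropped with an arbitrary cutoff, and the function name and the `max_tlen` parameter are made up for illustration.

```python
import statistics


def inner_distances(sam_lines, max_tlen=1000):
    """Collect inner distances (TLEN minus both read lengths) from SAM text lines.

    Assumptions: both mates have the same length as the record's SEQ, and
    pairs with TLEN above max_tlen are skipped as likely intron-spanning.
    """
    dists = []
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        flag, rnext, tlen, seq = int(fields[1]), fields[6], int(fields[8]), fields[9]
        if not (flag & 0x1):
            continue  # read is not paired
        if flag & (0x4 | 0x8 | 0x100):
            continue  # read unmapped, mate unmapped, or secondary alignment
        if rnext != "=" or seq == "*":
            continue  # mate on another reference, or sequence not stored
        if tlen <= 0 or tlen > max_tlen:
            continue  # count each pair once (leftmost mate) and drop huge spans
        dists.append(tlen - 2 * len(seq))
    return dists


def median_inner_distance(sam_lines, max_tlen=1000):
    """Median inner distance, or None if no usable pairs were found."""
    dists = inner_distances(sam_lines, max_tlen)
    return statistics.median(dists) if dists else None
```

A BAM file would first need converting to text (for example with `samtools view`), and the lines can then be passed in from an open file handle. A negative result means the mates overlap.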

From the manual:

Code:

-r/--mate-inner-dist <int>    This is the expected (mean) inner distance between mate pairs. For example, for paired end runs with fragments selected at 300bp, where each end is 50bp, you should set -r to be 200. The default is 50bp.
--mate-std-dev <int>    The standard deviation for the distribution on inner distances between mate pairs. The default is 20bp.
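
The arithmetic in the manual's example can be written out as a short Python sketch. The function names are made up for illustration; the 300bp/50bp case is the manual's own example, and the second function simply inverts the same formula to show what fragment size a given -r implies for the 251bp reads posted above.

```python
def mate_inner_dist(fragment_len, left_read_len, right_read_len):
    """Expected inner distance: fragment length minus the length of both reads."""
    return fragment_len - left_read_len - right_read_len


def implied_fragment_len(inner_dist, left_read_len, right_read_len):
    """Fragment length implied by a given inner distance and read lengths."""
    return inner_dist + left_read_len + right_read_len


# The manual's example: 300bp fragments, 50bp reads at each end.
print(mate_inner_dist(300, 50, 50))        # 200

# What -r 150 would imply for 2 x 251bp reads.
print(implied_fragment_len(150, 251, 251)) # 652
```

If the fragments are shorter than the two reads combined, the result is negative, which means the mates overlap.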

But I do want to reiterate that a side-by-side comparison of tophat and bowtie simply does not make sense for mapping mRNA onto a genome. It's apples and oranges, they are doing quite different things.

tools don't always work every time on all data...sometimes they just glitch out and you have to change the pipeline and use a different tool.

Respectfully disagree with this. Nothing "glitches out"; something specific has happened, and you need to understand what, or you may miss something important about your data and end up analysing things in a totally inappropriate way.

**sdriscoll** · 10-29-2014, 01:49 PM

the comparison to bowtie2 makes sense if you understand the comparison. anytime you map RNA to a genome with a splice aware aligner versus a DNA aligner the splice aware one should produce higher mapping rates. the fact that OP tried that and got a higher mapping rate with the DNA mapper excludes the possibility that there is something specifically wrong with the reads and isolates the issue to Tophat. he also has additional knowledge that GSNAP can align the PE reads just fine (or at least as well as bowtie2 could).

the comparison to bowtie2 makes even more sense when you realize the steps Tophat takes in alignment. step one (without a transcriptome reference) is unspliced alignment to the genome with bowtie2, so AT MINIMUM you'd expect Tophat to replicate that mapping rate. step two onward is all about fragmenting reads into 25 base pieces, mapping them to find potential splice sites, and then refining down to final alignments.

I do have one question for OP. how did you measure the alignment percentage from Tophat? any chance you could post the output of 'samtools flagstat' for both the accepted_hits.bam from Tophat and the bowtie2 aligned bam file?
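
On how the percentage is measured, here is a minimal Python sketch that pulls the mapped count out of saved `samtools flagstat` text and divides it by a read total you supply. It assumes the usual `N + M mapped (...)` line is present. The function names are made up, and the choice of denominator is left to the caller because, as far as I know, Tophat's accepted_hits.bam holds only aligned records, so flagstat's own percentage for that file would not reflect the input reads.

```python
import re


def mapped_records(flagstat_text):
    """Return the QC-passed count from the 'N + M mapped (' line of flagstat output."""
    match = re.search(r"^(\d+) \+ (\d+) mapped \(", flagstat_text, flags=re.MULTILINE)
    if match is None:
        raise ValueError("no 'mapped' line found in flagstat output")
    return int(match.group(1))


def mapping_rate(mapped, total_input_reads):
    """Mapped records as a percentage of the reads that went into the aligner."""
    if total_input_reads <= 0:
        raise ValueError("total_input_reads must be positive")
    return 100.0 * mapped / total_input_reads
```

For a paired-end Tophat run the total could be taken as left_reads_in plus right_reads_in from prep_reads.info. Note that the flagstat "mapped" line counts alignment records, so reads reported at several locations can inflate the figure.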

**danwiththeplan** · 10-29-2014, 02:12 PM

Originally posted by sdriscoll

anytime you map RNA to a genome with a splice aware aligner versus a DNA aligner the splice aware one should produce higher mapping rates.

Agree totally, and I made this point upthread. This is why I thought it might be an issue of Tophat being restricted to mapping to known transcripts in the GTF, but apparently that's not the case. As you say, I struggle to see any situation in which Tophat would give a lower alignment rate than Bowtie on mRNA data mapped to a genome. It's odd.

he also has additional knowledge that GSNAP can align the PE reads just fine (or at least as well as bowtie2 could).

Not really. According to my reading of the OP, (s)he has second-hand knowledge that SNAP (not GSNAP) aligns OK.

OP: can you run GSNAP (or another splice-aware aligner) on this data yourself? Posting samtools flagstat outputs is also a very good idea as sdriscoll suggested.

Another question for OP: Did you use Bowtie or Bowtie2 for the 46% mapping? Not that it should make that much difference, but Tophat defaults to using Bowtie2 unless you tell it not to.

**sdriscoll** · 10-29-2014, 02:42 PM

the funny part in all of this is that Tophat bothered me enough for me to stop using it a couple years ago, so my best solution is to use something else, but OP seems pretty committed to it. I checked back in the thread and OP mentioned gsnap in post #12, so maybe I missed it if that was redacted at some point.
