#1
Member
Location: china Join Date: Jun 2011
Posts: 14
Hi, we ran RNA-seq mapping and alignment with TopHat. While TopHat was still running, we monitored the Bowtie logs, which showed this mapping rate:

# reads processed: 20788750
# reads with at least one reported alignment: 6392838 (30.75%)
# reads that failed to align: 14247986 (68.54%)
# reads with alignments suppressed due to -m: 147926 (0.71%)
Reported 14363386 alignments to 1 output stream(s)

However, after TopHat finished, we rechecked the Bowtie logs, and the fraction of aligned reads had dropped dramatically:

# reads processed: 20814986
# reads with at least one reported alignment: 862668 (4.14%)
# reads that failed to align: 19946494 (95.83%)
# reads with alignments suppressed due to -m: 5824 (0.03%)
Reported 1347297 alignments to 1 output stream(s)

It seems some of our alignments have been filtered out. Could anyone tell us what happened to them?
#2
Senior Member
Location: Boston Join Date: Nov 2009
Posts: 224
Could you provide more information about the reads, such as their length, their quality, and whether they are paired- or single-end?
What command did you use for TopHat?
#3
Member
Location: china Join Date: Jun 2011
Posts: 14
We used paired-end RNA-seq reads with a read length of 90 bp. I don't quite understand what 'quality' you referred to. The command line was:

tophat -r 200 -G /path/to/gtf/dog.gtf -o tophat_dog /path/to/index/dog /path/to/read1/read1/read1.fq /path/to/read2/read2.fq

Dog is not the species we sequenced. Because the species we sequenced does not have a high-quality genome available, we are trying the dog genome for mapping and alignment. Thanks for the help!

Last edited by asling; 07-24-2011 at 12:45 AM.
#4
Senior Member
Location: Boston Join Date: Nov 2009
Posts: 224
By quality, I mean the average quality of each base along the length of the read. If the quality scores drop to a low level after a certain point, the low-quality bases can make mapping difficult.

Also, you might consider supplying a value for --mate-std-dev, depending on the size distribution of your library.

Mapping to a closely related species can also be a tricky problem. Do you have an idea how much variation exists between the two genomes? Have you tried assembling your reads first and then aligning to that assembly?
#5
Senior Member
Location: East Coast, US Join Date: Jun 2010
Posts: 177
Hi,

I agree with pbluescript. A low mapping rate is usually caused by two things (aside from running the program incorrectly): the quality of the reads and the choice of reference sequence(s).

FastQC is a good program for checking read quality. As for the choice of reference, I usually map the same set of reads with a few different mappers to see the general mapping percentage. De novo transcriptome assembly with ABySS or SOAPdenovo can be used when you do not have a good reference sequence.

Hope this helps.

Douglas
www.contigexpress.com
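Per-base quality can also be eyeballed without FastQC. Here is a minimal Python sketch (my own illustration, not part of the thread) that computes the mean Phred score at each read position; Phred+33 encoding is an assumption:

```python
# Sketch: mean Phred quality per read position, similar in spirit to
# FastQC's "Per base sequence quality" module. Assumes Phred+33 encoding;
# the example reads below are invented for illustration.

def per_base_mean_quality(fastq_lines):
    """Return a list of mean quality scores, one per read position."""
    totals, counts = [], []
    for i, line in enumerate(fastq_lines):
        if i % 4 != 3:                  # the quality string is every 4th line
            continue
        qual = line.rstrip("\n")
        for pos, ch in enumerate(qual):
            if pos >= len(totals):
                totals.append(0)
                counts.append(0)
            totals[pos] += ord(ch) - 33  # Phred+33 decoding
            counts[pos] += 1
    return [t / c for t, c in zip(totals, counts)]

if __name__ == "__main__":
    reads = [
        "@read1", "ACGT", "+", "IIII",   # quality 40 at every base
        "@read2", "ACGT", "+", "II!!",   # quality 40, 40, 0, 0
    ]
    print(per_base_mean_quality(reads))  # → [40.0, 40.0, 20.0, 20.0]
```

A steadily falling curve toward the 3' end is the pattern FastQC flags in its per-base quality plot.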
#6
Member
Location: china Join Date: Jun 2011
Posts: 14
pbluescript and Douglas, thanks for your advice!

Following your suggestion, I used FastQC to check our reads' quality. The results show that our reads have big problems on several tests: very unusual (red cross) for the per-base sequence quality, per-base sequence content, per-base GC content, and per-sequence GC content tests, and slightly abnormal (orange triangle) for the sequence duplication level, overrepresented sequences, and k-mer content tests. Does this mean our library was contaminated by overrepresented sequences?

We have used several closely related species' genomes as references for TopHat, but the same drop in alignments appeared every time. Our highest score for reads with at least one reported alignment is 20%. Do you think the BAM file we get from TopHat is reliable for further analysis in Cufflinks?

Our goal is to test differential gene expression between samples, so when we use SOAPdenovo to assemble contigs, we cannot trace how many reads were used in each contig. Do you have any suggestions about that? We are also trying Trinity to estimate differential gene expression; it is still running.
#7
Senior Member
Location: USA, Midwest Join Date: May 2008
Posts: 1,178
|
Quote:
#8
Member
Location: china Join Date: Jun 2011
Posts: 14
Hi kmcarr,

Thanks for your reply! Our TopHat run generated four different logs with names like "bowtie.left_kept_reads_seg1.fixmap.log". Did you set any argument that makes it run more times? Also, what software or command do you use to calculate the overall alignment % from the BAM?
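For the overall alignment percentage, `samtools flagstat` on the BAM file is the usual answer. The underlying flag arithmetic can be sketched in Python; the SAM records below are invented for illustration, and this is only a sketch of what flagstat computes, not its implementation:

```python
# Sketch: fraction of mapped reads from SAM-format records.
# In practice you would run `samtools flagstat file.bam`; this only
# illustrates the flag arithmetic. Records below are invented examples.

UNMAPPED  = 0x4    # SAM flag bit: read is unmapped
SECONDARY = 0x100  # SAM flag bit: secondary alignment (not a new read)

def mapped_fraction(sam_records):
    total = mapped = 0
    for rec in sam_records:
        if rec.startswith("@"):          # skip header lines
            continue
        flag = int(rec.split("\t")[1])
        if flag & SECONDARY:             # count each read only once
            continue
        total += 1
        if not flag & UNMAPPED:
            mapped += 1
    return mapped / total if total else 0.0

if __name__ == "__main__":
    records = [
        "@HD\tVN:1.0",
        "r1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII",  # mapped
        "r2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",         # unmapped
    ]
    print(f"{mapped_fraction(records):.1%}")  # → 50.0%
```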
#9
Senior Member
Location: Boston Join Date: Nov 2009
Posts: 224
If your quality scores are declining, you might have to do some trimming before mapping. I use FASTX for this:

http://hannonlab.cshl.edu/fastx_toolkit/

You can use a command line something like:

fastq_quality_trimmer -t 20 -l 20 -i sample.fq -o sample_trimmed.fq

This will trim bases with quality lower than 20 from the end of each read and discard reads shorter than 20 bp after trimming (TopHat requires reads to be a minimum of 20 bp).

But if you have several areas with bad results from FastQC, there might be other issues. If you can share some of the FastQC graphs, someone might be able to offer more help.
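As a rough sketch of what that trimming step does, here is my approximation of the tool's behavior in Python (the real fastq_quality_trimmer may differ in edge cases; Phred+33 encoding is assumed):

```python
# Sketch approximating `fastq_quality_trimmer -t 20 -l 20`: trim
# low-quality bases from the 3' end and drop reads that end up too short.
# Phred+33 encoding is assumed; this is an illustration, not the tool.

def trim_read(seq, qual, min_q=20, min_len=20):
    """Return (seq, qual) trimmed from the 3' end, or None if too short."""
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - 33 < min_q:
        end -= 1
    if end < min_len:
        return None              # discard: too short after trimming
    return seq[:end], qual[:end]

if __name__ == "__main__":
    seq  = "A" * 25
    qual = "I" * 20 + "#" * 5    # 20 good bases (Q40), then 5 bad (Q2)
    print(trim_read(seq, qual))  # the 5 low-quality 3' bases are removed
```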
#10
Member
Location: china Join Date: Jun 2011
Posts: 14
Your advice is quite helpful. Our FastQC graphs show that the quality scores decline dramatically, especially in the last 15-20 bp of our reads, so we used the FASTX-Toolkit to cut the low-quality region. The other problems, related to GC content, mainly appear in the first 10 bp; we did not trim that region.
#11
Senior Member
Location: Boston Join Date: Nov 2009
Posts: 224
Did you use a random hexamer to prime the RNA? That might explain the aberrant GC content at the beginning of the reads.

Here's a paper exploring this: http://nar.oxfordjournals.org/conten.../e131.abstract
#12
Member
Location: china Join Date: Jun 2011
Posts: 14
Hi, I read the paper. We did use a random hexamer to prime the RNA; it seems to be the most common way. The authors suggest that trimming the sequence at the beginning of the reads will not work, since the pattern will appear again in the trimmed reads, but I don't quite understand that point.

Also, I have a question about FASTX. When we use FASTX to trim paired-end reads, the fraction of properly paired reads in TopHat drops to 0%: the trimmed reads lose their pairing information. Have you encountered this problem?
#13
Senior Member
Location: USA, Midwest Join Date: May 2008
Posts: 1,178
|
Quote:

Search this site for "fastx" and "pair" to find threads discussing this problem and possible solutions.
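One common workaround discussed in those threads is to re-synchronize the two trimmed FASTQ files by read name before mapping, keeping only reads whose mate survived trimming. A minimal Python sketch of that idea (the /1 and /2 read-name suffixes are an assumption about the naming scheme; adapt `key` for other schemes):

```python
# Sketch: re-pair two independently trimmed FASTQ files by read name,
# keeping only reads present in both. Assumes read IDs end in /1 and /2;
# this naming convention is an assumption, not a fact from the thread.

def fastq_records(lines):
    """Yield (header, seq, plus, qual) tuples from FASTQ lines."""
    it = iter(lines)
    for header in it:
        yield (header.rstrip(), next(it).rstrip(),
               next(it).rstrip(), next(it).rstrip())

def key(header):
    """Read name with any /1 or /2 mate suffix removed."""
    name = header.split()[0]
    return name[:-2] if name.endswith(("/1", "/2")) else name

def repair(left_lines, right_lines):
    """Return a list of (left_record, right_record) surviving pairs."""
    left = {key(r[0]): r for r in fastq_records(left_lines)}
    paired = []
    for rec in fastq_records(right_lines):
        mate = left.get(key(rec[0]))
        if mate is not None:
            paired.append((mate, rec))
    return paired

if __name__ == "__main__":
    r1 = ["@a/1", "ACGT", "+", "IIII", "@b/1", "ACGT", "+", "IIII"]
    r2 = ["@b/2", "TTTT", "+", "IIII"]   # @a/2 was discarded by trimming
    pairs = repair(r1, r2)
    print([key(p[0][0]) for p in pairs])  # → ['@b']
```

The surviving pairs can then be written back out to two synchronized files for TopHat.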
#14
Senior Member
Location: Boston Join Date: Nov 2009
Posts: 224
The paper on the random hexamer bias mentions a method for correcting the reads, but in my experience, and that of a number of other people I have talked to, there is no significant difference in mapping between unaltered reads, corrected reads, and reads with the first few bases trimmed off.

One reason you are having trouble getting proper pairs might be an increased standard deviation of the distance between mates. If you have 2x100 bp reads with a 300 bp insert size, the distance between the reads would be 100 bp. If the reads are trimmed using the command I mentioned above, you could have reads with a wide range of lengths, some as short as 20 bp, but the insert size would not have changed. This can push the standard deviation of the inner distance above the TopHat default. You will need to add a --mate-std-dev option to the tophat command and supply a number large enough to cover the range of distances between reads you will have.
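The arithmetic above can be made concrete with a tiny sketch (numbers taken from the post; inner distance is simply fragment length minus the two read lengths):

```python
# Illustration of why trimming widens the inner-distance distribution.
# TopHat's -r is the expected inner distance between mates:
# fragment length minus the lengths of the two reads.

def inner_distance(fragment_len, read1_len, read2_len):
    return fragment_len - read1_len - read2_len

if __name__ == "__main__":
    # Untrimmed: 2 x 100 bp reads on a 300 bp fragment
    print(inner_distance(300, 100, 100))  # → 100
    # After trimming, a read can be as short as 20 bp:
    print(inner_distance(300, 20, 20))    # → 260
    # The inner distance now ranges from 100 to 260 bp, so
    # --mate-std-dev must be raised to cover that spread.
```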