SEQanswers

Old 07-23-2011, 08:12 PM   #1
asling
strange decreased alignment by tophat

Hi, we have been doing RNA-seq mapping and alignment with TopHat. Before TopHat finished, we monitored the Bowtie logs, which showed the mapping rates below:
# reads processed: 20788750
# reads with at least one reported alignment: 6392838 (30.75%)
# reads that failed to align: 14247986 (68.54%)
# reads with alignments suppressed due to -m: 147926 (0.71%)
Reported 14363386 alignments to 1 output stream(s)

However, after TopHat finished, we rechecked the Bowtie logs, and the fraction of reads with reported alignments had dropped to a very low level:
# reads processed: 20814986
# reads with at least one reported alignment: 862668 (4.14%)
# reads that failed to align: 19946494 (95.83%)
# reads with alignments suppressed due to -m: 5824 (0.03%)
Reported 1347297 alignments to 1 output stream(s)

It seems that some of our alignments have been filtered out.
Could anyone tell us what happened to our alignments?
Old 07-23-2011, 09:49 PM   #2
pbluescript

Could you provide more information about the reads, such as their length, quality, and whether they are paired- or single-end?
What command did you use for TopHat?
Old 07-24-2011, 12:37 AM   #3
asling

We used paired-end RNA-seq, and the read length is 90 bp. I don't quite understand what 'quality' you are referring to. The command line was:

tophat -r 200 -G /path/to/gtf/dog.gtf -o tophat_dog /path/to/index/dog /path/to/read1/read1.fq /path/to/read2/read2.fq

Dog is not the species we sequenced; because the species we sequenced does not have a high-quality genome available, we are trying to use the dog genome for mapping and alignment.
Thanks for the help!

Old 07-24-2011, 05:50 AM   #4
pbluescript

By quality, I mean the average quality of each base along the length of the read. If the quality scores drop to a low level after a certain point, the low-quality bases might make mapping difficult.
Also, you might consider supplying a number for --mate-std-dev, depending on the size distribution of your library.
Mapping to a closely related species can also be a tricky problem. Do you have an idea of how much variation exists between the two genomes? Have you tried assembling your reads first and then aligning to that assembly?
Old 07-24-2011, 07:08 AM   #5
DZhang

Hi,

I agree with pbluescript. A low mapping rate is usually caused by one of two things (aside from running the program incorrectly): the quality of the reads and the choice of reference sequence(s).

FastQC is a good program for checking read quality. As for the choice of reference, I usually map the same set of reads with a few different mappers to see the general mapping percentage. De novo transcriptome assembly with ABySS or SOAPdenovo can be used when you do not have a good reference sequence.
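
For example, a typical FastQC run looks something like this (the FASTQ file names are placeholders for your own reads; the output directory must already exist):

mkdir -p fastqc_out
fastqc reads_1.fq reads_2.fq --outdir fastqc_out

FastQC writes one HTML report per input file; the per-base quality, GC content, and overrepresented-sequence modules are the ones most relevant to mapping problems.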

Hope this helps.

Douglas
www.contigexpress.com
Old 07-25-2011, 12:59 AM   #6
asling

pbluescript and Douglas, thanks for your advice!
Following your suggestion, I used FastQC to check the quality of my reads. The results show that my reads have big problems on several tests: very unusual (red cross) for the per-base sequence quality, per-base sequence content, per-base GC content, and per-sequence GC content tests, and slightly abnormal (orange triangle) for the sequence duplication level, overrepresented sequences, and k-mer content tests. Does this mean that my library was contaminated with overrepresented sequences?

We have used several closely related species' genomes as references for TopHat, but the same decreasing-alignment problem appeared every time. Our highest percentage of reads with at least one reported alignment is 20%. Do you think the BAM file we get from TopHat is reliable for further analysis in Cufflinks?

Our purpose is to test differential gene expression between samples, so when we use SOAPdenovo to assemble contigs, we cannot trace how many reads were used in each contig. Do you have any suggestions about that?

We are also trying Trinity to estimate differential gene expression; it is still running now.
Old 07-25-2011, 06:34 AM   #7
kmcarr

Quote:
Originally Posted by asling
However, after TopHat finished, we rechecked the Bowtie logs, and the fraction of reads with reported alignments had dropped to a very low level ... could anyone tell us what happened to our alignments?
I believe what you are seeing is completely normal. TopHat runs Bowtie repeatedly with different subsets of reads, different target databases, and different parameters. Each of these runs produces a log file named something like fileXXXXXX.log (XXXXXX = some random characters). For example, a recent TopHat run I performed has 10 of these files in the log directory. Some of the files report a very low alignment %, but that only refers to that particular step. No single Bowtie run (or log file) gives the complete picture of the success or failure of the TopHat run. Look at your accepted_hits.bam file to determine the overall alignment %.
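
If samtools is installed, something like this gives quick overall numbers (the path assumes a default TopHat output directory; adjust it to yours):

samtools flagstat tophat_out/accepted_hits.bam

Keep in mind that accepted_hits.bam contains only the reads TopHat could align (including multiple alignments for multireads), so for an overall mapping percentage you should divide the number of distinct mapped reads by the number of input reads in your original FASTQ files.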
Old 07-25-2011, 10:39 PM   #8
asling

Hi kmcarr,
Thanks for your reply! Our TopHat run generated 4 different logs, with names like "bowtie.left_kept_reads_seg1.fixmap.log". Did you set any argument to make the number of repeated runs higher? And what software or command do you use to calculate the overall alignment % from the BAM file?
Old 07-26-2011, 06:41 AM   #9
pbluescript

If your quality scores are declining, you might have to do some trimming before mapping. I use FASTX for this:
http://hannonlab.cshl.edu/fastx_toolkit/
You can use a command line something like:

fastq_quality_trimmer -t 20 -l 20 -i sample.fq -o sample_trimmed.fq

This will trim bases with quality lower than 20 from the end of each read and discard reads that are shorter than 20 bp after trimming (TopHat requires reads to be a minimum of 20 bp).
But if you have bad results in several FastQC modules, there might be other issues. If you can share some of the FastQC graphs, someone might be able to offer more help.
Old 08-03-2011, 12:53 AM   #10
asling

Your advice was quite helpful. Our FastQC graphs show that the quality scores decline dramatically, especially in the last 15-20 bp of the reads, so we used the FASTX-Toolkit to cut away the low-quality region. The other problems, related to GC content, appear mainly in the first 10 bp; we did not process that part of the reads.
Old 08-03-2011, 03:26 PM   #11
pbluescript

Did you use a random hexamer to prime the RNA? That might explain the aberrant GC content at the beginning of the reads.
Here's a paper exploring this:
http://nar.oxfordjournals.org/conten.../e131.abstract
Old 08-10-2011, 07:09 PM   #12
asling

Hi, I read the paper. We did use random hexamers to prime the RNA; it seems to be the most common way. The authors suggest that trimming the sequence at the beginning of the reads will not work, since the pattern will appear again in the trimmed reads, but I don't quite understand that point. What's more, I have a question about FASTX: when we use FASTX to trim paired-end reads, the properly paired reads in TopHat drop to 0%, as if the trimmed reads lost their pairing information. Have you met this problem?
Old 08-11-2011, 06:18 AM   #13
kmcarr

Quote:
Originally Posted by asling
What's more, I have a question about FASTX: when we use FASTX to trim paired-end reads, the properly paired reads in TopHat drop to 0%, as if the trimmed reads lost their pairing information. Have you met this problem?
Most software packages dealing with NGS paired-end data (including TopHat) require that the two input FASTQ files have exactly the same number of reads and that the reads be in exactly the same order. This is the only method they use for pairing reads (read names are not considered). When you run your data through any of the FASTX tools, which may remove bad reads, it is inevitable that the two files will fall "out of sync". Yes, it would be wonderful if the FASTX-Toolkit were "pair aware", but it is not. It is up to you to fix up the filtered files and make sure the pairs are put back in sync.

Search this site for fastx and pair to find threads discussing this problem and possible solutions.
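
Purely as an illustration (not a tool from this thread), here is a rough bash sketch of one way to put two filtered FASTQ files back in sync. It assumes old-style read names ending in /1 and /2 with no spaces, unique names within each file, and placeholder file names:

# Flatten each 4-line FASTQ record onto one tab-separated line,
# keyed by the read name with its /1 or /2 suffix stripped.
flatten() {
    awk 'NR % 4 == 1 { key = $1; sub(/^@/, "", key); sub(/\/[12]$/, "", key); printf "%s", key }
         { printf "\t%s", $0 }
         NR % 4 == 0 { printf "\n" }' "$1" | LC_ALL=C sort
}

flatten reads_1.trimmed.fq > r1.tab
flatten reads_2.trimmed.fq > r2.tab

# Keep only the read names present in both files, then rebuild the two
# FASTQ files with the surviving mates in identical order.
LC_ALL=C join -t $'\t' r1.tab r2.tab > paired.tab
cut -f2-5 paired.tab | tr '\t' '\n' > reads_1.paired.fq
cut -f6-9 paired.tab | tr '\t' '\n' > reads_2.paired.fq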
Old 08-11-2011, 06:53 AM   #14
pbluescript

The paper on random hexamer bias mentions a method for correcting the reads, but in my experience, and in that of a number of other people I have talked to, there is no significant difference in mapping between unaltered reads, corrected reads, and reads with the first few bases trimmed off.
One reason you are having problems getting proper pairs might be an increased standard deviation of the distance between the reads. If you have 2x100 bp reads from a 300 bp insert, the inner distance between the mates is 100 bp. If the reads are trimmed using the command I mentioned above, you could end up with reads of widely varying length, some as short as 20 bp, while the insert size itself has not changed. This can push the standard deviation of the inner distance above the TopHat default. You will need to add a --mate-std-dev option to the TopHat command and supply a number large enough to cover the range of mate distances you will have.
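
For example, building on the command from earlier in the thread (the paths are placeholders and the numbers are only illustrative; estimate them from your own library):

# -r is the expected inner distance between mates; --mate-std-dev
# (default 20) widens the tolerated spread of that distance, which
# trimming reads to variable lengths effectively increases.
tophat -r 200 --mate-std-dev 100 -G /path/to/gtf/dog.gtf -o tophat_dog /path/to/index/dog reads_1.trimmed.fq reads_2.trimmed.fq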