SEQanswers

Old 07-23-2011, 08:12 PM   #1
asling
strange decreased alignment by tophat

Hi, we have been doing RNA-seq mapping and alignment with TopHat. Before TopHat finished, we monitored the Bowtie logs, which showed the mapping rates below:
# reads processed: 20788750
# reads with at least one reported alignment: 6392838 (30.75%)
# reads that failed to align: 14247986 (68.54%)
# reads with alignments suppressed due to -m: 147926 (0.71%)
Reported 14363386 alignments to 1 output stream(s)

However, after TopHat finished, we rechecked the Bowtie logs, and the fraction of reads with reported alignments had dropped to a very low level:
# reads processed: 20814986
# reads with at least one reported alignment: 862668 (4.14%)
# reads that failed to align: 19946494 (95.83%)
# reads with alignments suppressed due to -m: 5824 (0.03%)
Reported 1347297 alignments to 1 output stream(s)

It seems that some of our alignments have been filtered out.
Could anyone tell us what happened to our alignments?
Old 07-23-2011, 09:49 PM   #2
pbluescript

Could you provide more information about the reads, such as their length, quality, and whether they are paired- or single-end?
What command did you use for TopHat?
Old 07-24-2011, 12:37 AM   #3
asling

We used paired-end RNA-seq, and the read length is 90 bp. I don't quite understand what 'quality' you are referring to. The command line was:

tophat -r 200 -G /path/to/gtf/dog.gtf -o tophat_dog /path/to/index/dog /path/to/read1/read1.fq /path/to/read2/read2.fq

Dog is not the species we sequenced; because the species we sequenced does not have a high-quality genome available, we are trying to use the dog genome for mapping and alignment.
Thanks for the help!

Old 07-24-2011, 05:50 AM   #4
pbluescript

By quality, I mean the average quality of each base along the length of the read. If the quality scores drop to a low level after a certain point, the low-quality bases might make mapping difficult.
Also, you might consider supplying a number for --mate-std-dev, depending on the size distribution of your library.
Mapping to a closely related species can also be a tricky problem. Do you have an idea of how much variation exists between the two genomes? Have you tried assembling your reads first and then aligning to that assembly?
Old 07-24-2011, 07:08 AM   #5
DZhang

Hi,

I agree with pbluescript. A low mapping rate is usually caused by one of two things (aside from running the program incorrectly): the quality of the reads and the choice of reference sequence(s).

FastQC is a good program for checking read quality. As for the choice of reference, I usually map the same set of reads with a few different mappers to see the general mapping percentage. De novo transcriptome assembly with ABySS or SOAPdenovo can be used when you do not have a good reference sequence.
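
For example, a typical FastQC run looks something like this (the FASTQ file names are placeholders for your own reads; the output directory must already exist):

mkdir -p fastqc_out
fastqc reads_1.fq reads_2.fq --outdir fastqc_out

FastQC writes one HTML report per input file; the per-base quality, GC content, and overrepresented-sequence modules are the ones most relevant to mapping problems.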

Hope this helps.

Douglas
www.contigexpress.com
Old 07-25-2011, 12:59 AM   #6
asling

pbluescript and Douglas, thanks for your advice!
Following your suggestion, I used FastQC to check the quality of my reads. The results show that my reads have big problems on several tests: very unusual (red cross) for the per-base sequence quality, per-base sequence content, per-base GC content, and per-sequence GC content tests, and slightly abnormal (orange triangle) for the sequence duplication level, overrepresented sequences, and k-mer content tests. Does this mean that my library was contaminated with overrepresented sequences?

We have used several closely related species' genomes as references for TopHat, but the same decreasing-alignment problem appeared every time. Our highest percentage of reads with at least one reported alignment is 20%. Do you think the BAM file we get from TopHat is reliable for further analysis in Cufflinks?

Our purpose is to test differential gene expression between samples, so when we use SOAPdenovo to assemble contigs, we cannot trace how many reads were used in each contig. Do you have any suggestions about that?

We are also trying Trinity to estimate differential gene expression; it is still running now.
Old 07-25-2011, 06:34 AM   #7
kmcarr

Quote:
Originally Posted by asling
However, after TopHat finished, we rechecked the Bowtie logs, and the fraction of reads with reported alignments had dropped to a very low level ... could anyone tell us what happened to our alignments?
I believe what you are seeing is completely normal. TopHat runs Bowtie repeatedly with different subsets of reads, different target databases, and different parameters. Each of these runs produces a log file named something like fileXXXXXX.log (XXXXXX = some random characters). For example, a recent TopHat run I performed has 10 of these files in the log directory. Some of the files report a very low alignment %, but that only refers to that particular step. No single Bowtie run (or log file) gives the complete picture of the success or failure of the TopHat run. Look at your accepted_hits.bam file to determine the overall alignment %.
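
If samtools is installed, something like this gives quick overall numbers (the path assumes a default TopHat output directory; adjust it to yours):

samtools flagstat tophat_out/accepted_hits.bam

Keep in mind that accepted_hits.bam contains only the reads TopHat could align (including multiple alignments for multireads), so for an overall mapping percentage you should divide the number of distinct mapped reads by the number of input reads in your original FASTQ files.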
Old 07-25-2011, 10:39 PM   #8
asling

Hi kmcarr,
Thanks for your reply! Our TopHat run generated 4 different logs, with names like "bowtie.left_kept_reads_seg1.fixmap.log". Did you set any argument to make the number of repeated runs higher? And what software or command do you use to calculate the overall alignment % from the BAM file?
Old 07-26-2011, 06:41 AM   #9
pbluescript

If your quality scores are declining, you might have to do some trimming before mapping. I use FASTX for this:
http://hannonlab.cshl.edu/fastx_toolkit/
You can use a command line something like:

fastq_quality_trimmer -t 20 -l 20 -i sample.fq -o sample_trimmed.fq

This will trim bases with quality lower than 20 from the end of each read and discard reads that are shorter than 20 bp after trimming (TopHat requires reads to be a minimum of 20 bp).
But if you have bad results in several FastQC modules, there might be other issues. If you can share some of the FastQC graphs, someone might be able to offer more help.
Old 08-03-2011, 12:53 AM   #10
asling

Your advice was quite helpful. Our FastQC graphs show that the quality scores decline dramatically, especially in the last 15-20 bp of the reads, so we used the FASTX-Toolkit to cut away the low-quality region. The other problems, related to GC content, appear mainly in the first 10 bp; we did not process that part of the reads.
Old 08-03-2011, 03:26 PM   #11
pbluescript

Did you use a random hexamer to prime the RNA? That might explain the aberrant GC content at the beginning of the reads.
Here's a paper exploring this:
http://nar.oxfordjournals.org/conten.../e131.abstract
Old 08-10-2011, 07:09 PM   #12
asling

Hi, I read the paper. We did use random hexamers to prime the RNA; it seems to be the most common way. The authors suggest that trimming the sequence at the beginning of the reads will not work, since the pattern will appear again in the trimmed reads, but I don't quite understand that point. What's more, I have a question about FASTX: when we use FASTX to trim paired-end reads, the properly paired reads in TopHat drop to 0%, as if the trimmed reads lost their pairing information. Have you met this problem?
Old 08-11-2011, 06:18 AM   #13
kmcarr

Quote:
Originally Posted by asling
What's more, I have a question about FASTX: when we use FASTX to trim paired-end reads, the properly paired reads in TopHat drop to 0%, as if the trimmed reads lost their pairing information. Have you met this problem?
Most software packages dealing with NGS paired-end data (including TopHat) require that the two input FASTQ files have exactly the same number of reads and that the reads be in exactly the same order. This is the only method they use for pairing reads (read names are not considered). When you run your data through any of the FASTX tools, which may remove bad reads, it is inevitable that the two files will fall "out of sync". Yes, it would be wonderful if the FASTX-Toolkit were "pair aware", but it is not. It is up to you to fix up the filtered files and make sure the pairs are put back in sync.

Search this site for fastx and pair to find threads discussing this problem and possible solutions.
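
Purely as an illustration (not a tool from this thread), here is a rough bash sketch of one way to put two filtered FASTQ files back in sync. It assumes old-style read names ending in /1 and /2 with no spaces, unique names within each file, and placeholder file names:

# Flatten each 4-line FASTQ record onto one tab-separated line,
# keyed by the read name with its /1 or /2 suffix stripped.
flatten() {
    awk 'NR % 4 == 1 { key = $1; sub(/^@/, "", key); sub(/\/[12]$/, "", key); printf "%s", key }
         { printf "\t%s", $0 }
         NR % 4 == 0 { printf "\n" }' "$1" | LC_ALL=C sort
}

flatten reads_1.trimmed.fq > r1.tab
flatten reads_2.trimmed.fq > r2.tab

# Keep only the read names present in both files, then rebuild the two
# FASTQ files with the surviving mates in identical order.
LC_ALL=C join -t $'\t' r1.tab r2.tab > paired.tab
cut -f2-5 paired.tab | tr '\t' '\n' > reads_1.paired.fq
cut -f6-9 paired.tab | tr '\t' '\n' > reads_2.paired.fq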
Old 08-11-2011, 06:53 AM   #14
pbluescript

The paper on random hexamer bias mentions a method for correcting the reads, but in my experience, and in that of a number of other people I have talked to, there is no significant difference in mapping between unaltered reads, corrected reads, and reads with the first few bases trimmed off.
One reason you are having problems getting proper pairs might be an increased standard deviation of the distance between the reads. If you have 2x100 bp reads from a 300 bp insert, the inner distance between the mates is 100 bp. If the reads are trimmed using the command I mentioned above, you could end up with reads of widely varying length, some as short as 20 bp, while the insert size itself has not changed. This can push the standard deviation of the inner distance above the TopHat default. You will need to add a --mate-std-dev option to the TopHat command and supply a number large enough to cover the range of mate distances you will have.
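
For example, building on the command from earlier in the thread (the paths are placeholders and the numbers are only illustrative; estimate them from your own library):

# -r is the expected inner distance between mates; --mate-std-dev
# (default 20) widens the tolerated spread of that distance, which
# trimming reads to variable lengths effectively increases.
tophat -r 200 --mate-std-dev 100 -G /path/to/gtf/dog.gtf -o tophat_dog /path/to/index/dog reads_1.trimmed.fq reads_2.trimmed.fq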