SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
For MAQ: Is there a Tool to convert sanger-format fastq file to illumina-fotmat fastq byb121 Bioinformatics 6 12-20-2013 02:26 AM
Problem with quality in fastq file ThePresident Bioinformatics 11 08-10-2012 05:54 AM
Ideas on collecting quality scores per base in an illumina fastq file brachysclereid Bioinformatics 11 12-05-2011 02:00 PM
mapping quality score in tophat sam file jjpurwar Bioinformatics 5 06-22-2011 12:55 PM
Reduce file size after Illumina FASTQ to Sanger FASTQ conversion? jjw14 Illumina/Solexa 2 06-01-2010 05:35 PM

Reply
 
Thread Tools
Old 09-14-2012, 07:09 AM   #1
narges
Member
 
Location: Finland

Join Date: Aug 2012
Posts: 29
Default quality value of fastq file and tophat

Hi all,
First forgive me if my question is trivial. I have an Illumina rna-seq dataset of two samples (the fastq files).
I have mapped them using tophat and the result of mapping so satisfactory, like nearly 70% of the fragments are mapped and then I used cuffdiff in order to know just the differentially expressed genes like below:
cuffdiff -c 0 -o output -p 12 -u /Genes/genes.gtf -L F,M sample1/accepted_hits.bam sample2/accepted_hits.bam

But the problem is that there are too few significant differentially expressed genes in the gene_exp.diff result file and I know that the correct number is really more than this.
according to my search through the net, some people believe it might be due to the inconsistency between the quality value version of my fastq flies and tophat version I have used. But I do not know how should I check it or can it be the real reason of this strange result or not?
I have used tophat 1.4.0 and this is the quality value I have gained by applying "ShortRead"R package on my fastq file:
> fastqs <- readFastq("sample1.fastq")
> qualities <- quality (fastqs)
> head(qualities)
class: FastqQuality
quality:
A BStringSet instance of length 6
width seq
[1] 46 B@@BABA5!.68A<9>67:<.43540+7(7B%%%%%%%%%%%%%%%
[2] 46 BCBA8<8@A5==9::A61?70;B=((/0:?)5:&0(:=5%%%%%%%
[3] 46 A6AA7ABBBBAA;B=B%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
[4] 46 BAA=>AABB=:BAA>>>%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
[5] 46 BCBBBBA>62@B<0ABBA798-=2?82@?B%%%%%%%%%%%%%%%%
[6] 46 BA@0@/,:B<AA;A7B=<3&5B59?.%%%%%%%%%%%%%%%%%%%%

Any suggestion would be appreciated.
narges is offline   Reply With Quote
Old 09-14-2012, 09:39 AM   #2
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

You might try quality trimming your reads. Those quality scores are phred+33, so the %'s at the ends of the reads are low quality and are probably decreasing mapping efficiency. Regarding the number of differentially regulated genes, there are a couple of comments. Firstly, with only two samples you would not expect to find a huge amount simply due to the inability to properly judge biological variability. Secondly, how do you know that the correct number is higher? The only way to know that would be to do the experiment that you're currently doing.
dpryan is offline   Reply With Quote
Old 09-16-2012, 02:17 AM   #3
narges
Member
 
Location: Finland

Join Date: Aug 2012
Posts: 29
Default

Quote:
Originally Posted by dpryan View Post
Firstly, with only two samples you would not expect to find a huge amount simply due to the inability to properly judge biological variability. Secondly, how do you know that the correct number is higher? The only way to know that would be to do the experiment that you're currently doing.
Dear dpryan, first thank you so much for your always useful answers.
Actually, I also have increased number of biological replicates up to 12 and still there is no significant differentially expressed genes. I know there should be around 480 differentially expressed genes because this dataset has been under analysis using R packages.
narges is offline   Reply With Quote
Old 09-16-2012, 05:37 AM   #4
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

Assuming you mean DESeq or edgeR by "R packages", I would tend to believe them over whatever cuffdiff tells you. You didn't mention what version of the cufflinks suite you're using, so perhaps try upgrading (or maybe downgrading) it. There have been a number of threads on here regarding the changes in output produced by various version of cufflinks.

Oh, one note, I just noticed that you're using "-c 0". You're going to seriously decrease the number of DE genes by using that (more tests == crappier p-values for everything).

Last edited by dpryan; 09-16-2012 at 05:45 AM. Reason: mention -c 0 usage.
dpryan is offline   Reply With Quote
Old 09-16-2012, 06:04 AM   #5
narges
Member
 
Location: Finland

Join Date: Aug 2012
Posts: 29
Default

Quote:
Originally Posted by dpryan View Post
Assuming you mean DESeq or edgeR by "R packages", I would tend to believe them over whatever cuffdiff tells you. You didn't mention what version of the cufflinks suite you're using, so perhaps try upgrading (or maybe downgrading) it. There have been a number of threads on here regarding the changes in output produced by various version of cufflinks.
.
Yes the R packages are exactly what you mentioned.
About the cufflinks version, first I used the latest version :2.0.2-beta
and I gained only 8 DE genes with pvalue <0.01. But then after reading this thread :
http://seqanswers.com/forums/showthr...t=17662&page=2
I decided to use cufflinks/1.3.0 and I used -c 0 .
So I will omit this -c 0 parameter but do you think still I should work with cufflinks/1.3.0 to be less conservative or not?
narges is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 09:44 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2021, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO