Hi everyone,
This is my first post here, so please let me know if I break a rule of conduct or anything. Tips and tricks are appreciated.
The situation:
I'm currently trying to analyze RNA-seq data from Illumina Body Map 2.0. I've built a pipeline that seems reasonable and used it to assess quality, trim, map, and analyze the RNA-seq data with the standard tools. The pipeline is for 100 bp single-end reads.
The pipeline:
Quality (fastx_tools for trimming and filtering, FastQC for reporting)
fastq_quality_filter -Q33 -q 20 -p 80 <FASTQ_FILE>
fastq_quality_trimmer -Q33 -t 20 -l 50 <FILTER_OUT>
fastqc <TRIM_OUT>
Assembling
tophat --solexa-quals <UCSC hg19 REF> <TRIM_OUT>
Analysis
cufflinks <TOPHAT_OUT>
cuffcompare -r <UCSC hg19 ANNOTATION> -R <CUFFLINKS_OUT>
The problems
The output of cuffcompare (cuffcmp.tracking) identifies:
13586 [23.32%] novel (class code j)
6127 [10.51%] intronic (class code i)
19145 [32.86%] contained (class code c)
In this sample, novel+intronic > contained. I doubt these results are trustworthy, since one would not expect such a high number of previously unreported transcripts. If anyone could point out a flaw in the pipeline or in my interpretation of the results, I would greatly appreciate it. Do tell me if I need to give more details on any part.
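For reference, the comparison above works out to 13586 + 6127 = 19713 against 19145. To make the counts easy to reproduce, here is a minimal Python sketch that tallies class codes from the tracking file. It assumes the file is tab-separated with the class code in the 4th column (worth checking against the actual file), and the function names `tally_class_codes` and `summarize` are hypothetical helpers, not part of Cufflinks.

```python
from collections import Counter


def tally_class_codes(lines):
    """Count class codes from the rows of a cuffcompare .tracking file.

    Assumption: rows are tab-separated and the 4th column is the class code.
    """
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4:
            continue  # skip blank or malformed rows
        counts[fields[3]] += 1
    return counts


def summarize(counts):
    """Map each class code to (count, percentage of all counted rows)."""
    total = sum(counts.values())
    return {code: (n, 100.0 * n / total) for code, n in counts.most_common()}
```

Something like `summarize(tally_class_codes(open("cuffcmp.tracking")))` should then give per-code counts and percentages that can be compared with the figures listed above.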
Best regards,
Simon