Hi all,
I am working on some RNAseq data (single-end reads, 36 bp, from an Illumina instrument) from a prostate cancer cell line. All I have for this is a FASTA file of all the reads.
I aligned the reads with TopHat, assembled transcripts with Cufflinks, and then ran Cuffcompare to look at the quality of the transcriptome reconstruction. This was the profile of transfrags I got.
[I just grep'ed the tmap file to find the number of rows with each class code]
Category                  No. of transfrags   % of total
Match                     1533                1.73
Novel                     3561                4.02
Contained                 24080               27.18
Repeat                    0                   0
Intronic                  10115               11.42
Polymerase run-on         1889                2.13
Intergenic                28752               32.46
Overlap on opp. strand    14340               16.19
Total                     88580               100
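For anyone who wants to reproduce this kind of tally without grep, here is a minimal sketch in Python. It assumes the Cuffcompare .tmap file is tab-delimited with a header row that includes a class_code column. The function names and the file name in the usage comment are hypothetical, not part of any Cufflinks tool.

```python
import csv
from collections import Counter


def count_class_codes(tmap_path):
    """Count how many transfrags carry each class code in a .tmap file.

    Assumption: the file is tab-delimited and its header row names a
    column called "class_code".
    """
    counts = Counter()
    with open(tmap_path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        for row in reader:
            counts[row["class_code"]] += 1
    return counts


def class_code_table(counts):
    """Return (class_code, count, percent_of_total) rows, largest first."""
    total = sum(counts.values())
    if total == 0:
        return []
    return [
        (code, n, round(100.0 * n / total, 2))
        for code, n in counts.most_common()
    ]


# Usage (hypothetical file name):
# for code, n, pct in class_code_table(count_class_codes("transcripts.tmap")):
#     print(code, n, pct, sep="\t")
```

Working from the header name rather than a fixed column position means the tally should still work if the column order differs between Cuffcompare versions, as long as the class_code column is present.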
I am new to RNAseq data, so I have no idea what to expect. Still, I find it surprising that only 1.73% of the total transfrags matched a known transcript and that over 32% mapped to intergenic regions, even allowing for the fact that this is a cancer cell line and some amount of change is to be expected.
I was hoping someone with experience could take a look at this and give their opinion. Are numbers like these common, or does this mean the data I got has some problems?
Also, in general, are there any standard quality assurance steps I can use to check RNAseq data?
I would greatly appreciate any help on this.
Thanks!