Hello, all.
I am new to the forums and the whole of bioinformatics (I've been at it two weeks), but I have done a good deal of reading and have been playing around with the tophat->cufflinks pipeline.
Currently, I have RNA-seq libraries constructed from pineal glands of 3 aged patients. I am attempting to identify novel transcripts in this relatively small library, at which point I will move up to a larger library.
However, after assembling the transcripts with cufflinks (using the latest Ensembl human annotation as my reference for the RABT), running cuffcompare to compare the pooled assembly back to that same Ensembl annotation reports a very high percentage of novel features. Specifically, 47.4% of exons and 25.8% of introns are identified as novel, as are 82.4% of loci.
Now, I am fairly certain these numbers cannot be correct. I recognize that we expect to find some number of new annotations, but this seems ludicrously high. I was wondering:
1) What could account for this very high report of novel transcripts? Could it just be lousy coverage resulting in many sparse transcripts being 'false positives'? I know that we did not have large amounts of RNA from these pineal glands (they're small, of course). If it's the data that is indeed the problem, how could I demonstrate this fact?
2) Do you have any suggestions on enhancing this method to identify novel transcripts? I had hoped to use Cuffcompare's class code 'j' to look at possible novel isoforms, but at this point I am getting either exact matches to the reference (code =) or totally unknown transcripts (code u), with very few exceptions. I had an idea to run the cufflinks assembly with the reference supplied as both the reference and the mask. I think I will try that out and see how it works until I get a better idea.
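One way the coverage question in 1) could be probed is sketched below. It is only a sketch, assuming the combined GTF written by cuffcompare carries `transcript_id` and `class_code` attributes on its exon lines; the function name `summarize_class_codes` and the file name in the usage comment are made up for illustration. It tallies class codes per transcript and reports what fraction of the code-u transcripts have a single exon, since a large share of unspliced, single-exon "novel" transcripts would be consistent with sparse coverage or background signal rather than real new genes.

```python
import re
from collections import Counter, defaultdict

# Matches GTF attribute pairs of the form: key "value";
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def summarize_class_codes(gtf_lines):
    """Tally class codes per transcript and the single-exon share of code 'u'.

    Assumes each exon line has transcript_id and class_code attributes
    (an assumption about the cuffcompare combined GTF, check your own file).
    Returns (Counter of class codes, fraction of 'u' transcripts with 1 exon).
    """
    exon_counts = defaultdict(int)  # transcript_id -> number of exon lines
    codes = {}                      # transcript_id -> class code

    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue
        attrs = dict(ATTR_RE.findall(fields[8]))
        tid = attrs.get("transcript_id")
        if tid is None:
            continue
        exon_counts[tid] += 1
        if "class_code" in attrs:
            codes[tid] = attrs["class_code"]

    tally = Counter(codes.values())
    novel = [tid for tid, code in codes.items() if code == "u"]
    single_exon = sum(1 for tid in novel if exon_counts[tid] == 1)
    fraction = single_exon / len(novel) if novel else 0.0
    return tally, fraction


# Hypothetical usage (the path is an assumption, not from my actual run):
# with open("cuffcmp.combined.gtf") as handle:
#     tally, fraction = summarize_class_codes(handle)
#     print(tally, fraction)
```

If most of the code-u transcripts turn out to be single-exon, that would point toward the data rather than the pipeline, though it would not prove it on its own.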
Hopefully this is enough information. I look forward to any advice the sages of this forum can give.
-RP