I'm a beginner in the field. I'm trying to align my reads to a reference genome, but I'm not sure I understand what TopHat is actually doing. More generally, I want to make sure my understanding of RNA sequencing is correct.
Let's say,
"tophat -p 8 -G genes.gtf -o C1_R1_thout genome C1_R1_1.fq C1_R1_2.fq"
From my understanding, the *.fq files are raw sequence reads generated by experiments. Those raw reads are aligned to sequences defined in a reference genome. The reference genome in this example is "genome". genes.gtf lists some known transcripts.
The point of doing this is to analyse unknown transcripts. For example, if some of my reads can't be aligned to the reference genome, that might mean there is a new mutation. Am I correct?
Another question: why do we need to provide an annotation file (genes.gtf) to TopHat? Doesn't TopHat already have the information it needs from the reference genome to align against?
Thanks,