Hello all,
I've been trying to make TopHat align some reads to the known human exome fore some time now, with no success so far. In the web manual, they state that with the -G option you can supply a reference gff3 file for the program to map the reads to. I've found it impossible to find a human exome gff3 file, so I proceeded to download the hg18 exome gtf file from the UCSC browser, and tried to make it into a gff3 file, since the formats are very close already.
The UCSC gtf file reads as
From the gff3 format specification found in the Sequence Ontology webpage, one can see that the main difference resides in the last column, the ID tags. Following their indications, as well as the TopHat requirements as stated in the manual, I've manually converted the gtf file to this hopefully-gff3 format:
which I thought was correct. However, whenever I try to make an alignment using this pseudo-gff3 file, TopHat does not undesrtand it and the alignment does not proceed correctly:
Once this is solved, the other question that remains is if TopHat needs the gff3 file to be zero or one-based; I think UCSC works with zero-based coordinates but the web-manual does not specify which convention does TopHat use, and in splice-junction finding, that single bp is what really makes or breaks the deal.
So, if anyone has any expertise on this or has an idea on why this is not working at all, I'm very interested to know.
Thanks a lot!
I've been trying to make TopHat align some reads to the known human exome fore some time now, with no success so far. In the web manual, they state that with the -G option you can supply a reference gff3 file for the program to map the reads to. I've found it impossible to find a human exome gff3 file, so I proceeded to download the hg18 exome gtf file from the UCSC browser, and tried to make it into a gff3 file, since the formats are very close already.
The UCSC gtf file reads as
Code:
chr1 hg18_knownGene exon 24476 25037 0.000000 - . gene_id "uc001aak.1"; transcript_id "uc001aak.1"; chr1 hg18_knownGene exon 25140 25344 0.000000 - . gene_id "uc001aak.1"; transcript_id "uc001aak.1"; chr1 hg18_knownGene exon 25584 25940 0.000000 - . gene_id "uc001aak.1"; transcript_id "uc001aak.1"; chr1 hg18_knownGene exon 55425 55436 0.000000 + . gene_id "uc009vjh.1"; transcript_id "uc009vjh.1"; chr1 hg18_knownGene exon 55752 55834 0.000000 + . gene_id "uc009vjh.1"; transcript_id "uc009vjh.1"; chr1 hg18_knownGene start_codon 58954 58956 0.000000 + . gene_id "uc009vjh.1"; transcript_id "uc009vjh.1"; chr1 hg18_knownGene CDS 58954 59691 0.000000 + 0 gene_id "uc009vjh.1"; transcript_id "uc009vjh.1"; chr1 hg18_knownGene exon 58900 59692 0.000000 + . gene_id "uc009vjh.1"; transcript_id "uc009vjh.1"; chr1 hg18_knownGene start_codon 58954 58956 0.000000 + . gene_id "uc001aal.1"; transcript_id "uc001aal.1"; chr1 hg18_knownGene CDS 58954 59871 0.000000 + 0 gene_id "uc001aal.1"; transcript_id "uc001aal.1"; chr1 hg18_knownGene exon 58954 59871 0.000000 + . gene_id "uc001aal.1"; transcript_id "uc001aal.1"; chr1 hg18_knownGene exon 124780 126351 0.000000 - . gene_id "uc009vjj.1"; transcript_id "uc009vjj.1";
Code:
chr1 hg18_knownGene gene 24476 25940 . - . ID=uc001aak.1 chr1 hg18_knownGene mRNA 24476 25940 . - . ID=uc001aak.1_mRNA;Parent=uc001aak.1 chr1 hg18_knownGene exon 24476 25037 . - . ID=exon0001_uc001aak.1;Parent=uc001aak.1_mRNA chr1 hg18_knownGene exon 25140 25344 . - . ID=exon0002_uc001aak.1;Parent=uc001aak.1_mRNA chr1 hg18_knownGene exon 25584 25940 . - . ID=exon0003_uc001aak.1;Parent=uc001aak.1_mRNA chr1 hg18_knownGene gene 58954 59871 . + . ID=uc001aal.1 chr1 hg18_knownGene mRNA 58954 59871 . + . ID=uc001aal.1_mRNA;Parent=uc001aal.1 chr1 hg18_knownGene exon 58954 59871 . + . ID=exon0001_uc001aal.1;Parent=uc001aal.1_mRNA chr1 hg18_knownGene gene 127587 128558 . - . ID=uc001aam.2 chr1 hg18_knownGene mRNA 127587 128558 . - . ID=uc001aam.2_mRNA;Parent=uc001aam.2 chr1 hg18_knownGene exon 127587 128558 . - . ID=exon0001_uc001aam.2;Parent=uc001aam.2_mRNA
Code:
$ tophat --solexa-quals -p 6 --coverage-search -o tophat_exome --no-novel-juncs -G ~/UCSC_genome.gff3 ~/bowtie-0.11.3/indexes/hg18 RNA_sequence.fastq [Mon Jan 11 11:30:24 2010] Beginning TopHat run (v1.0.12) ----------------------------------------------- [Mon Jan 11 11:30:24 2010] Preparing output location tophat_exome/ [Mon Jan 11 11:30:24 2010] Checking for Bowtie index files [Mon Jan 11 11:30:24 2010] Checking for reference FASTA file [Mon Jan 11 11:30:24 2010] Checking for Bowtie Bowtie version: 0.11.3.0 [Mon Jan 11 11:30:24 2010] Checking reads seed length: 35bp format: fastq quality scale: --solexa-quals [Mon Jan 11 11:31:58 2010] Reading known junctions from GFF file Warning: TopHat did not find any junctions in GFF file [Mon Jan 11 11:32:22 2010] Mapping reads against hg18 with Bowtie [Mon Jan 11 11:39:06 2010] Joining segment hits Warning: junction database is empty! [Mon Jan 11 11:40:26 2010] Joining segment hits [Mon Jan 11 11:41:38 2010] Reporting output tracks ----------------------------------------------- Run complete [00:15:34 elapsed]
So, if anyone has any expertise on this or has an idea on why this is not working at all, I'm very interested to know.
Thanks a lot!
Comment