SEQanswers

Old 09-15-2009, 06:10 AM   #1
statsteam
Member
 
Location: Californica

Join Date: Sep 2009
Posts: 18
Using TopHat output files with UCSC genome browser

Hi all,

I recently ran TopHat on 76 bp reads and got the results (SAM, BED, and WIG files).

The first few lines of my input (FASTA file) are:
>HWUSI-EAS366:4:1:4:624#0/1:
CTCNGGATGGAGTACAGTGGTGTGATCATGGCTCACTGTAGNNNNNANCNCNTGGGCGCAAGCNNNNNNNNNCTAN
>HWUSI-EAS366:4:1:4:243#0/1:
CGGNGCCGTTGCTGGTTCTCACACCTTTTAGGTCTGTTCTCNNNNNCNGNTNCGACTCTCTCTNNNNNANNNCCGN
>HWUSI-EAS366:4:1:4:1373#0/1:
GAAAAAACCACCCAGCGGTGATGGCAGCGCGCGTGGGTCCCNNNGNGNGNGGGGCGGGTCGCGCNNNNGNNNCGAN
>HWUSI-EAS366:4:1:4:1672#0/1:
GGGCAGGAAAAAAAGGGAAGANAAAATACTGGGGAAGAAAANNNANCNCNGTTTGGCAGCTCTTNNNNGNNNCAGN


And a few lines of junctions.bed file are:

track name=junctions description="TopHat junctions"
gi|29823169|ref|NT_025004.13|Hs18_25160 9690 19656 JUNC00000001 1 + 9690 19656 255,0,0 2 37,38 0,9928
gi|29823169|ref|NT_025004.13|Hs18_25160 14260 19654 JUNC00000002 2 + 14260 19654 255,0,0 2 57,36 0,5358
gi|29823169|ref|NT_025004.13|Hs18_25160 19701 160104 JUNC00000003 3 + 19701 160104 255,0,0 2 32,66 0,140337


A few lines of coverage.wig file are:

track type=bedGraph name="TopHat - read coverage"
gi|29823169|ref|NT_025004.13|Hs18_25160 0 9580 0
gi|29823169|ref|NT_025004.13|Hs18_25160 9580 9655 1
gi|29823169|ref|NT_025004.13|Hs18_25160 9655 9690 0


Here is the problem.

When I paste the results (either the BED file or the WIG file) into the UCSC genome browser, I always get an error. When I change the gi|29823169|ref|NT... part to something like a chromosome name, it works.

As you can see from my input file, my reads don't contain the gi|29823169|ref|NT... part, so I am not sure where TopHat gets that label or reference.

Can someone tell me what the gi|29823169|ref|NT... part means, and how I can convert these files into something the UCSC genome browser understands? I think I need to get the actual chromosome names.

Thank you,
Statsteam
Old 09-15-2009, 07:08 AM   #2
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 871

The gi lines you see are the fasta file headers from the NCBI human assembly. Each chromosome in that assembly comes in a separate file and it is the accession codes for those separate files that you are seeing.

The full header for the first accession you found is:

>gi|29823169|ref|NT_025004.13|Hs18_25160 Homo sapiens chromosome 18 genomic contig, reference assembly

You therefore need to find all the accessions for the different chromosomes and replace them with the corresponding chromosome name.

Alternatively, you could edit the original FASTA files and change the first line of each to contain just a chromosome name, e.g.:

>chr18

and then re-index the genome and run TopHat again. This should put usable chromosome names into your output files.
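
The first option (replacing the accessions in the existing output) can be sketched as below. This is a minimal illustration, not a TopHat or UCSC tool: the function name `rename_track_lines` and the mapping are assumptions, and the mapping has to be built by hand from the FASTA headers. One caveat: the header above describes a "genomic contig", so the coordinates in the output may be relative to that contig rather than to the whole chromosome, in which case renaming alone would not place the features correctly.

```python
def rename_track_lines(lines, name_map):
    """Replace the first column (sequence name) of BED/bedGraph lines.

    Track, browser, comment and empty lines are passed through unchanged.
    Names missing from name_map are left as they are.
    """
    out = []
    for line in lines:
        stripped = line.rstrip("\n")
        if not stripped or stripped.startswith(("track", "browser", "#")):
            out.append(stripped)
            continue
        # Split off the first whitespace-delimited column only.
        fields = stripped.split(None, 1)
        name = name_map.get(fields[0], fields[0])
        rest = fields[1] if len(fields) > 1 else ""
        out.append(f"{name}\t{rest}" if rest else name)
    return out


# Hypothetical mapping, taken from the FASTA header quoted above.
NAME_MAP = {"gi|29823169|ref|NT_025004.13|Hs18_25160": "chr18"}

example = [
    'track type=bedGraph name="TopHat - read coverage"',
    "gi|29823169|ref|NT_025004.13|Hs18_25160 0 9580 0",
    "gi|29823169|ref|NT_025004.13|Hs18_25160 9580 9655 1",
]
for row in rename_track_lines(example, NAME_MAP):
    print(row)
```

The same function works for junctions.bed, since only the first column is touched.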
Old 09-15-2009, 07:20 AM   #3
statsteam
Member
 
Location: Californica

Join Date: Sep 2009
Posts: 18

Thank you, Simon.
I just started bowtie-build with FASTA files whose headers contain only chromosome names.
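
For anyone who wants to script the header edit rather than do it by hand, here is a rough sketch. The function name `rename_fasta_headers`, the mapping, and the tiny sequence lines are made up for illustration; the index still has to be rebuilt and TopHat rerun afterwards, as described above.

```python
def rename_fasta_headers(lines, name_map):
    """Yield FASTA lines with mapped headers rewritten as '>' + new name.

    The lookup key is the first whitespace-delimited token of the header.
    Headers that are not in name_map are passed through unchanged.
    """
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            tokens = line[1:].split()
            key = tokens[0] if tokens else ""
            if key in name_map:
                line = ">" + name_map[key]
        yield line


# Hypothetical mapping and placeholder sequence lines.
name_map = {"gi|29823169|ref|NT_025004.13|Hs18_25160": "chr18"}
fasta = [
    ">gi|29823169|ref|NT_025004.13|Hs18_25160 Homo sapiens chromosome 18 genomic contig, reference assembly",
    "ACGT",
    ">other",
    "GG",
]
for out_line in rename_fasta_headers(fasta, name_map):
    print(out_line)
```

Because it is a generator, it can be fed an open file handle directly and written out line by line without loading a whole chromosome into memory.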

Statsteam
Old 09-15-2009, 07:33 AM   #4
melody
Junior Member
 
Location: china

Join Date: Sep 2008
Posts: 2

In the output above, a few lines of the coverage.wig file are:

track type=bedGraph name="TopHat - read coverage"
gi|29823169|ref|NT_025004.13|Hs18_25160 0 9580 0
gi|29823169|ref|NT_025004.13|Hs18_25160 9580 9655 1

So does position 9580 have 1 hit or 0?
Old 09-15-2009, 08:36 AM   #5
statsteam
Member
 
Location: Californica

Join Date: Sep 2009
Posts: 18

Quote:
Originally Posted by melody View Post
In the output above, a few lines of the coverage.wig file are:

track type=bedGraph name="TopHat - read coverage"
gi|29823169|ref|NT_025004.13|Hs18_25160 0 9580 0
gi|29823169|ref|NT_025004.13|Hs18_25160 9580 9655 1

So does position 9580 have 1 hit or 0?

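The last column is a data column, because the output is in bedGraph format. Each line gives a start, an end, and the value for that interval. bedGraph coordinates are zero-based and the end is exclusive, so position 9580 falls in the second interval and has a value of 1.
When you paste the file with the correct chromosome name, the browser draws a graph based on the value in the data column.

In this example, it draws 0 for chr18:0-9580 and then 1 for chr18:9580-9655.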
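
To make the interval reading concrete, here is a small sketch that looks up the value for a position. The helper `coverage_at` is just an illustration, and it assumes the bedGraph convention of zero-based starts with the end position excluded.

```python
def coverage_at(intervals, pos):
    """Return the value of the interval containing 0-based position pos.

    Each interval is (start, end, value) with start inclusive and end
    exclusive. Returns None if no interval covers pos.
    """
    for start, end, value in intervals:
        if start <= pos < end:
            return value
    return None


# The three coverage.wig lines from the first post, without the name column.
intervals = [(0, 9580, 0), (9580, 9655, 1), (9655, 9690, 0)]

print(coverage_at(intervals, 9579))  # last position of the first interval
print(coverage_at(intervals, 9580))  # first position of the second interval
```

With these intervals, 9579 gives 0 and 9580 gives 1, so the boundary value belongs to the line that starts there, not the line that ends there.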

-Statsteam
Old 09-23-2009, 02:57 PM   #6
sdriscoll
I like code
 
Location: San Diego, CA, USA

Join Date: Sep 2009
Posts: 438

Just to add to this discussion: when working with mouse sequencing data, I found it worked best for all of my reference sources to come from UCSC. I used the per-chromosome FASTA files from UCSC's downloads area to build my Bowtie index, and I used UCSC's Table Browser to produce the GTF file (which I converted to GFF3 using scripts from Sequence Ontology). Only when I had built everything from those sources did I get output files that worked straight away with the UCSC browser. In fact, when I used the NCBI reference (and swapped the chromosome names for UCSC's names), the TopHat output didn't even line up with the genome.
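
A quick way to spot this kind of naming mismatch before uploading is to list the sequence names in a track file that do not look like UCSC names. This is only a sketch: the regular expression is a loose assumption about UCSC-style names (a "chr" prefix), not an official rule, and passing the check says nothing about whether the coordinates match the UCSC assembly.

```python
import re

# Assumption: UCSC-style names start with "chr" (chr1, chrX, chrM, chr1_random...).
UCSC_STYLE = re.compile(r"^chr[0-9A-Za-z_]+$")


def non_ucsc_names(lines):
    """Return the sorted set of first-column names that do not look UCSC-style."""
    names = set()
    for line in lines:
        line = line.strip()
        # Skip header and comment lines, which carry no sequence name.
        if not line or line.startswith(("track", "browser", "#")):
            continue
        name = line.split()[0]
        if not UCSC_STYLE.match(name):
            names.add(name)
    return sorted(names)


track = [
    'track name=junctions description="TopHat junctions"',
    "gi|29823169|ref|NT_025004.13|Hs18_25160 9690 19656 JUNC00000001 1 +",
    "chr18 14260 19654 JUNC00000002 2 +",
]
print(non_ucsc_names(track))
```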
Old 03-23-2010, 07:40 AM   #7
RockChalkJayhawk
Senior Member
 
Location: Rochester, MN

Join Date: Mar 2009
Posts: 191

Quote:
Originally Posted by sdriscoll View Post
Just to add to this discussion: when working with mouse sequencing data, I found it worked best for all of my reference sources to come from UCSC. I used the per-chromosome FASTA files from UCSC's downloads area to build my Bowtie index, and I used UCSC's Table Browser to produce the GTF file (which I converted to GFF3 using scripts from Sequence Ontology). Only when I had built everything from those sources did I get output files that worked straight away with the UCSC browser. In fact, when I used the NCBI reference (and swapped the chromosome names for UCSC's names), the TopHat output didn't even line up with the genome.
What is the config file needed to use the Seq ontology script? I can't find the documentation for it.
Old 05-16-2011, 06:09 PM   #8
NGS newbie
Junior Member
 
Location: MA

Join Date: May 2011
Posts: 7

Quote:
Originally Posted by sdriscoll View Post
Just to add to this discussion: when working with mouse sequencing data, I found it worked best for all of my reference sources to come from UCSC. I used the per-chromosome FASTA files from UCSC's downloads area to build my Bowtie index, and I used UCSC's Table Browser to produce the GTF file (which I converted to GFF3 using scripts from Sequence Ontology). Only when I had built everything from those sources did I get output files that worked straight away with the UCSC browser. In fact, when I used the NCBI reference (and swapped the chromosome names for UCSC's names), the TopHat output didn't even line up with the genome.
I have that exact problem, but is there a way to fix it if all I have is either the raw file or the .bam or .bam.bai files? Do I need to ask my core personnel to realign using the UCSC files? Any help would be greatly appreciated.