  • statsteam
    Member
    • Sep 2009
    • 19

    Using TopHat output files with UCSC genome browser

    Hi all,

    I recently ran TopHat on 76 bp reads and got the results (SAM, BED, and WIG files).

    A few actual lines of my input (FASTA file) are:
    >HWUSI-EAS366:4:1:4:624#0/1:
    CTCNGGATGGAGTACAGTGGTGTGATCATGGCTCACTGTAGNNNNNANCNCNTGGGCGCAAGCNNNNNNNNNCTAN
    >HWUSI-EAS366:4:1:4:243#0/1:
    CGGNGCCGTTGCTGGTTCTCACACCTTTTAGGTCTGTTCTCNNNNNCNGNTNCGACTCTCTCTNNNNNANNNCCGN
    >HWUSI-EAS366:4:1:4:1373#0/1:
    GAAAAAACCACCCAGCGGTGATGGCAGCGCGCGTGGGTCCCNNNGNGNGNGGGGCGGGTCGCGCNNNNGNNNCGAN
    >HWUSI-EAS366:4:1:4:1672#0/1:
    GGGCAGGAAAAAAAGGGAAGANAAAATACTGGGGAAGAAAANNNANCNCNGTTTGGCAGCTCTTNNNNGNNNCAGN


    And a few lines of junctions.bed file are:

    track name=junctions description="TopHat junctions"
    gi|29823169|ref|NT_025004.13|Hs18_25160 9690 19656 JUNC00000001 1 + 9690 19656 255,0,0 2 37,38 0,9928
    gi|29823169|ref|NT_025004.13|Hs18_25160 14260 19654 JUNC00000002 2 + 14260 19654 255,0,0 2 57,36 0,5358
    gi|29823169|ref|NT_025004.13|Hs18_25160 19701 160104 JUNC00000003 3 + 19701 160104 255,0,0 2 32,66 0,140337


    A few lines of coverage.wig file are:

    track type=bedGraph name="TopHat - read coverage"
    gi|29823169|ref|NT_025004.13|Hs18_25160 0 9580 0
    gi|29823169|ref|NT_025004.13|Hs18_25160 9580 9655 1
    gi|29823169|ref|NT_025004.13|Hs18_25160 9655 9690 0


    Here is the problem.

    When I copy and paste the results (either the BED file or the WIG file) into the UCSC Genome Browser, I always get an error. When I change the gi|29823169|ref|NT... part to a chromosome name, it works.

    As you can see from my input file, my reads don't contain the gi|29823169|ref|NT... part, so I am not sure where TopHat finds this label or reference.

    Can someone tell me what the gi|29823169|ref|NT... part means and how I can convert these files into something the UCSC Genome Browser understands? I think I need to get the actual chromosome names.

    Thank you,
    Statsteam
  • simonandrews
    Simon Andrews
    • May 2009
    • 870

    #2
    The gi lines you see are the fasta file headers from the NCBI human assembly. Each chromosome in that assembly comes in a separate file and it is the accession codes for those separate files that you are seeing.

    The full header for the first accession you found is:

    >gi|29823169|ref|NT_025004.13|Hs18_25160 Homo sapiens chromosome 18 genomic contig, reference assembly

    You therefore need to find all the accessions for the different chromosomes and replace them with the corresponding chromosome name.
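A minimal sketch of that renaming step for the BED and bedGraph output, in Python. The mapping table is hypothetical apart from the one accession quoted above, and you would have to fill in the rest yourself. Note also that the header describes NT_025004.13 as a genomic contig, so the coordinates in the output are presumably relative to that contig; renaming the first column fixes the name the browser rejects, but it does not by itself shift positions onto chromosome coordinates.

```python
import re

# Hypothetical lookup table: only this one entry comes from the thread.
ACCESSION_TO_CHROM = {
    "gi|29823169|ref|NT_025004.13|Hs18_25160": "chr18",
}


def rename_chroms(lines, mapping):
    """Yield BED/bedGraph lines with the first column renamed via mapping.

    Track, browser, comment, and blank lines pass through unchanged, as do
    data lines whose first column is not in the mapping.
    """
    for line in lines:
        if line.startswith(("track ", "browser ", "#")) or not line.strip():
            yield line
            continue
        match = re.match(r"(\S+)(.*)", line, re.DOTALL)
        if match is None:
            # Line starts with whitespace; leave it alone.
            yield line
            continue
        name, rest = match.groups()
        yield mapping.get(name, name) + rest


if __name__ == "__main__":
    example = [
        'track type=bedGraph name="TopHat - read coverage"\n',
        "gi|29823169|ref|NT_025004.13|Hs18_25160\t0\t9580\t0\n",
    ]
    for out_line in rename_chroms(example, ACCESSION_TO_CHROM):
        print(out_line, end="")
```

The same function works on junctions.bed and coverage.wig because only the first whitespace-delimited field is touched and the original separators are preserved.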

    Alternatively, you could edit the original FASTA files and change the first line of each to contain just a chromosome name, e.g.:

    >chr18

    ...and then reindex the genome and run TopHat again. This should put usable chromosome names into your output files.
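The header edit could be scripted rather than done by hand. Below is a rough Python sketch that replaces each FASTA header whose first token is a known accession with a bare chromosome name. The function name and the mapping are illustrative assumptions; only the chr18 accession is taken from this thread.

```python
# Hypothetical lookup table: only this one entry comes from the thread.
ACCESSION_TO_CHROM = {
    "gi|29823169|ref|NT_025004.13|Hs18_25160": "chr18",
}


def rewrite_fasta_headers(lines, mapping):
    """Yield FASTA lines, replacing mapped header lines with '>name'.

    The accession is taken to be the first whitespace-delimited token after
    '>'. Sequence lines and unmapped headers pass through unchanged.
    """
    for line in lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            accession = tokens[0] if tokens else ""
            if accession in mapping:
                yield ">" + mapping[accession] + "\n"
                continue
        yield line


if __name__ == "__main__":
    example = [
        ">gi|29823169|ref|NT_025004.13|Hs18_25160 Homo sapiens chromosome 18 "
        "genomic contig, reference assembly\n",
        "ACGTACGT\n",
    ]
    for out_line in rewrite_fasta_headers(example, ACCESSION_TO_CHROM):
        print(out_line, end="")
```

If several contigs map to the same chromosome name, the rewritten files would contain duplicate sequence names, so this sketch is only appropriate when each name is used once.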


    • statsteam
      Member
      • Sep 2009
      • 19

      #3
      Thank you, Simon.
      I have just started bowtie-build on FASTA files whose headers contain only chromosome names.

      Statsteam


      • melody
        Junior Member
        • Sep 2008
        • 2

        #4
        In the output above, a few lines of the coverage.wig file are:

        track type=bedGraph name="TopHat - read coverage"
        gi|29823169|ref|NT_025004.13|Hs18_25160 0 9580 0
        gi|29823169|ref|NT_025004.13|Hs18_25160 9580 9655 1

        So does position 9580 have 1 hit or 0 hits?


        • statsteam
          Member
          • Sep 2009
          • 19

          #5
          In reply to melody (#4):

          The last number on each line is the data column, because the output is in bedGraph format.
          When you paste the file in with the correct chromosome name, the browser draws a graph from the values in that column.

          In this example, it will draw 0 for chr18:0-9580 and then 1 for chr18:9580-9655.
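To make the boundary case concrete: bedGraph coordinates are zero-based and half-open, so each line covers positions from its start up to, but not including, its end. Here is a small Python sketch of a lookup over the three coverage lines from the first post. The function name is made up for illustration, and positions are zero-based as in the file.

```python
def coverage_at(intervals, pos):
    """Return the value of the half-open interval [start, end) containing pos.

    intervals is a list of (start, end, value) tuples using zero-based
    bedGraph coordinates. Returns None if no interval covers pos.
    """
    for start, end, value in intervals:
        if start <= pos < end:
            return value
    return None


# The three coverage lines quoted in the first post, minus the sequence name.
intervals = [(0, 9580, 0), (9580, 9655, 1), (9655, 9690, 0)]

if __name__ == "__main__":
    for pos in (9579, 9580, 9654, 9655):
        print(pos, coverage_at(intervals, pos))
```

Under that convention, zero-based position 9580 belongs to the second line and has a value of 1, while 9579 is the last position with a value of 0.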

          -Statsteam


          • sdriscoll
            I like code
            • Sep 2009
            • 436

            #6
            Just to add to this discussion: when working with mouse sequencing data, I found it worked best for all of my reference sources to come from UCSC. I used the per-chromosome FASTA files from UCSC's downloads area to build my Bowtie index, and I used UCSC's Table Browser to produce the GTF file (which I converted to GFF3 using scripts from the Sequence Ontology project). Only when I had built everything from those sources did I get output files that worked straight away with the UCSC browser. In fact, when I used the NCBI reference (and swapped the chromosome names for UCSC's names), the output from TopHat didn't even align with the genome.
            /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
            Salk Institute for Biological Studies, La Jolla, CA, USA */


            • RockChalkJayhawk
              Senior Member
              • Mar 2009
              • 192

              #7
              In reply to sdriscoll (#6):
              What is the config file needed to use the Seq ontology script? I can't find the documentation for it.


              • NGS newbie
                Junior Member
                • May 2011
                • 7

                #8
                In reply to sdriscoll (#6):
                I have that exact problem. Is there a way to fix it if all I have is either the raw file or the .bam and .bam.bai files? Do I need to ask my core personnel to realign using the UCSC files? Any help would be greatly appreciated.
