  • TopHat... -j vs -G help!

    Hi, I'm currently trying to align RNA-seq reads to a reference genome, using a corresponding .gff file.

    I've built my genome index using bowtie-build, and bowtie-inspect --names returns the values chr1, chr2, chr3, etc. I've edited the .gff file to be in the format name start end strand. For example, a line in my .gff file can be as follows:

    chr9 10190 10248 +

    My problem is that the alignment fails. The error output from TopHat is:

    [Thu Mar 1 23:03:43 2012] Preparing output location ./tophat_out/
    [Thu Mar 1 23:03:43 2012] Checking for Bowtie index files
    [Thu Mar 1 23:03:43 2012] Checking for reference FASTA file
    [Thu Mar 1 23:03:43 2012] Checking for Bowtie
    Bowtie version: 0.12.7.0
    [Thu Mar 1 23:03:43 2012] Checking for Samtools
    Samtools Version: 0.1.18
    [Thu Mar 1 23:03:43 2012] Generating SAM header for MYgenome
    format: fasta
    [Thu Mar 1 23:03:46 2012] Reading known junctions from GTF file
    Warning: TopHat did not find any junctions in GTF file
    [Thu Mar 1 23:03:47 2012] Preparing reads
    left reads: min. length=50, count=8625200
    [Thu Mar 1 23:06:47 2012] Creating transcriptome data files..
    [Thu Mar 1 23:07:03 2012] Building Bowtie index from transcriptome_index.fa
    [FAILED]
    Error: Couldn't build bowtie index with err = 1



    Does anyone know why this process is failing? I don't know why TopHat says it did not find any junctions in the GTF file (in my case a .gff file). I'm using the -G option in the tophat command to pass it the .gff file.

    The manual says a .juncs file is in the format I mentioned above, but that it specifies an inclusive range for introns, with flanking exons. That's why I used -G instead of -j (which takes a .juncs file), since my .gff file specifies an inclusive range for exons.


    Does anyone have any thoughts on this? Thanks for your input.

    Is my format for .gff file correct?

  • #2
    Originally posted by all_your_base
    ...a line in my .gff file can be as follows:

    chr9 10190 10248 +

    Is my format for .gff file correct?
    No. The GFF format specifies 9 tab-separated columns; the start, end and strand information are in columns 4, 5 and 7 respectively. The four-column format you described above is the TopHat .juncs format. You use the -j parameter to pass a .juncs file to TopHat.
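
To make the distinction above concrete, here is a minimal Python sketch that guesses which of the two layouts a single line follows. The function name `classify_annotation_line` is made up for illustration, and the checks are deliberately loose (they only look at column count, integer coordinates and the strand symbol), so treat it as a quick sanity check rather than a format validator.

```python
def classify_annotation_line(line):
    """Guess whether one line looks like a GFF record or a TopHat .juncs record.

    Returns "gff", "juncs", "skip" (blank or comment line) or "unknown".
    """
    stripped = line.rstrip("\n")
    if not stripped.strip() or stripped.startswith("#"):
        return "skip"

    # GFF: 9 tab-separated columns; start, end, strand are columns 4, 5, 7.
    tab_fields = stripped.split("\t")
    if len(tab_fields) == 9:
        start, end, strand = tab_fields[3], tab_fields[4], tab_fields[6]
        if start.isdigit() and end.isdigit() and strand in {"+", "-", ".", "?"}:
            return "gff"

    # .juncs-style: 4 fields (name, left, right, strand).
    # Assumption: any whitespace is accepted as a separator here.
    fields = stripped.split()
    if len(fields) == 4:
        _name, left, right, strand = fields
        if left.isdigit() and right.isdigit() and strand in {"+", "-"}:
            return "juncs"

    return "unknown"


if __name__ == "__main__":
    print(classify_annotation_line("chr9 10190 10248 +"))
    print(classify_annotation_line("chr9\tsrc\texon\t10190\t10248\t.\t+\t.\tID=exon1"))
```

Running this over the first few lines of a file before handing it to -G or -j should show quickly whether the file matches the option being used.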


    • #3
      Thanks very much @kmcarr... This is the info I needed


      • #4
        same thing happens with Homo_sapiens.GRCh37.62.gtf

        Hi,

        I get the same error message and I used the Homo_sapiens.GRCh37.62.gtf
        from ftp://ftp.ensembl.org/pub/release-62/gtf/homo_sapiens/

        So, do I need to provide chr1, chr2, etc. in column 0 instead of the names used by Ensembl?

        Thanks


        • #5
          Maybe you guys have already figured this out by now, but in case you haven't: given that the GFF file and the FASTA file are both in the correct format, TopHat throws this error when the chromosome names don't match between the GFF file and the FASTA file.
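
One way to check for the name mismatch described above is to collect the sequence names from both files and compare them. The Python sketch below does this; the function names are made up for illustration, and it assumes that the sequence name is the first whitespace-delimited token of each FASTA header and the first tab-separated column of each annotation line.

```python
def fasta_names(lines):
    """Collect sequence names from FASTA header lines (those starting with '>')."""
    names = set()
    for line in lines:
        if line.startswith(">"):
            header = line[1:].strip()
            if header:
                # Assumption: the name is the first token of the header.
                names.add(header.split()[0])
    return names


def annotation_names(lines):
    """Collect column-1 names from a tab-separated GFF/GTF file."""
    names = set()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        names.add(line.split("\t")[0].strip())
    return names


def report_mismatch(fasta_lines, annotation_lines):
    """Return the names shared by both inputs and the names unique to each."""
    in_fasta = fasta_names(fasta_lines)
    in_annotation = annotation_names(annotation_lines)
    return {
        "shared": sorted(in_fasta & in_annotation),
        "only_in_fasta": sorted(in_fasta - in_annotation),
        "only_in_annotation": sorted(in_annotation - in_fasta),
    }


if __name__ == "__main__":
    # Tiny in-memory example: "chr"-prefixed FASTA names vs. bare names in the GTF.
    fasta = [">chr1 some description", "ACGT", ">chr2", "GGCC"]
    gtf = ['1\tsrc\texon\t100\t200\t.\t+\t.\tgene_id "g1";',
           '2\tsrc\texon\t300\t400\t.\t-\t.\tgene_id "g2";']
    print(report_mismatch(fasta, gtf))
```

Open file handles iterate line by line, so they can be passed in directly in place of the lists. An empty "shared" list, as in the example with chr1 in the FASTA and 1 in the GTF, is the situation described in this post.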
