
  • #16
    Avoid changing the GTF file

    I would caution against editing the GTF file: the fix is likely to break as soon as Ensembl releases its next version, and it doesn't carry over to other organisms.

    I am running cufflinks and solved the problem differently: I indexed the original Ensembl genome, ran bowtie on it, converted the SAM file to BAM, sorted it, removed the problematic transcript from the Ensembl GTF file, and ran cufflinks normally, without any awk script.

    Here is a log of what I did:

    Code:
    bowtie-build -C Mus_musculus.NCBIM37.61.dna.toplevel.fa Mus_musculus.NCBIM37.61.dna.toplevel_c
    bowtie -f -C -m 1 -p4 --sam $(BOWTIEINDEX) $$i > $(MAPPEDREADS)/`basename $$i .txt`.sam
    samtools view -Sb sam/SL005_R00002_RME033_01pg_F3.csfasta.sam > sam/SL005_R00002_RME033_01pg_F3.csfasta.bam
    samtools sort SL005_R00002_RME033_01pg_F3.csfasta.bam SL005_R00002_RME033_01pg_F3.csfasta.sorted
    512391
    grep -v ENSMUST00000127664 Mus_musculus.NCBIM37.61.gtf > Mus_musculus.NCBIM37.61.corrected.gtf
    ~/software/cufflinks-0.9.3/cufflinks -G Mus_musculus.NCBIM37.61.corrected.gtf -v sam.old/SL005_R00002_RME033_01pg_F3.csfasta.sorted.



    • #17
      Another solution is to go to tophat-1.2.0/src/gff.cpp
      and change
      const uint GFF_MAX_LOCUS = 4000000;
      to
      const uint GFF_MAX_LOCUS = 5000000;
      then recompile tophat.

      The Ensembl database indicates that this transcript, ENSMUST00000127664, is 4.43 Mb long, which exceeds the previous cut-off of GFF_MAX_LOCUS = 4000000.
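      To see whether any other transcript in a given annotation would trip the same cut-off, something like the following could be used. It is only a rough sketch: the function name long_transcripts is made up for illustration, and it assumes the span is measured from the lowest exon start to the highest exon end of each transcript_id, which may not be exactly how tophat computes it.

```shell
# Hypothetical helper: list transcripts whose genomic span exceeds a limit.
# Usage: long_transcripts annotation.gtf [max_span]
# Assumption: span = highest exon end - lowest exon start + 1, per transcript_id.
long_transcripts() {
    gtf=$1
    max=${2:-4000000}   # default mirrors the GFF_MAX_LOCUS value quoted above
    awk -F '\t' -v max="$max" '
        $0 !~ /^#/ && $3 == "exon" {
            if (match($9, /transcript_id "[^"]+"/)) {
                # skip the 15-character prefix and the closing quote
                id = substr($9, RSTART + 15, RLENGTH - 16)
                if (!(id in lo) || $4 + 0 < lo[id]) lo[id] = $4 + 0
                if (!(id in hi) || $5 + 0 > hi[id]) hi[id] = $5 + 0
            }
        }
        END {
            # print transcript id and span, tab-separated, for oversized transcripts
            for (id in lo)
                if (hi[id] - lo[id] + 1 > max)
                    print id "\t" (hi[id] - lo[id] + 1)
        }
    ' "$gtf"
}
```

      Running long_transcripts Mus_musculus.NCBIM37.60.gtf would then print one line per oversized transcript, which can be passed to grep -v or used to decide on a larger cut-off.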

      Originally posted by marcora
      My bad! This last time when I ran tophat I didn't rename the cleaned GTF file to "mm9.ensembl" and therefore tophat couldn't find it. Surprisingly, instead of reporting a missing file, tophat gave me the same exact warning as before.

      In conclusion, with the squeaky clean GTF file obtained from Mus_musculus.NCBIM37.60.gtf as such:

      Code:
      awk '{print "chr"$0}' Mus_musculus.NCBIM37.60.gtf | sed 's/chrMT/chrM/g' | awk '/^chr[1-9XYM]|^chr1[0-9]/' | grep -v "ENSMUST00000127664" > mm9.ensembl.gtf
      I am finally able to run tophat against the ENSEMBL annotation.

      You are my hero!

      Thank you very much for your help.



      • #18
        GTF

        Will try the new approach in the next run; up till now we have been changing the GTF file.



        • #19
          Additional error source

          Just wanted to add -- as of tophat 1.4.0 -- there is another way to get this "did not find any junctions" error message. If your GTF has any entries with nonstandard strand symbols, for instance '*', parsing will apparently fail for the entire GTF, even though all other entries are OK.

          Reading gtf_juncs.log will show you the offending line.
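          The strand column can also be checked before running tophat. A minimal sketch, assuming that only '+', '-' and '.' count as acceptable values in column 7 (whether tophat accepts '.' everywhere is not confirmed here); the function name bad_strands is made up for illustration:

```shell
# Hypothetical helper: report GTF lines whose strand column (field 7)
# is something other than "+", "-" or ".".
# Usage: bad_strands annotation.gtf
bad_strands() {
    awk -F '\t' '
        # skip comment lines and lines too short to have a strand field
        $0 !~ /^#/ && NF >= 7 && $7 != "+" && $7 != "-" && $7 != "." {
            print NR ": " $0   # line number, then the offending line
        }
    ' "$1"
}
```

          No output means every entry has a recognized strand symbol.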



          • #20
            Originally posted by marcora
            My bad! This last time when I ran tophat I didn't rename the cleaned GTF file to "mm9.ensembl" and therefore tophat couldn't find it. Surprisingly, instead of reporting a missing file, tophat gave me the same exact warning as before.

            In conclusion, with the squeaky clean GTF file obtained from Mus_musculus.NCBIM37.60.gtf as such:

            Code:
            awk '{print "chr"$0}' Mus_musculus.NCBIM37.60.gtf | sed 's/chrMT/chrM/g' | awk '/^chr[1-9XYM]|^chr1[0-9]/' | grep -v "ENSMUST00000127664" > mm9.ensembl.gtf
            I am finally able to run tophat against the ENSEMBL annotation.

            You are my hero!

            Thank you very much for your help.
            Dear All

            Please help: I get the same error, shown below. I am using UCSC hg19 for everything; I downloaded and unpacked all the files, but I still get these warnings.
            What should I do?

            Warning: couldn't find fasta record for 'chr17_ctg5_hap1'!
            Warning: couldn't find fasta record for 'chr17_gl000205_random'!
            Warning: couldn't find fasta record for 'chr19_gl000209_random'!
            Warning: couldn't find fasta record for 'chr1_gl000191_random'!
            Warning: couldn't find fasta record for 'chr4_ctg9_hap1'!
            Warning: couldn't find fasta record for 'chr4_gl000193_random'!
            Warning: couldn't find fasta record for 'chr4_gl000194_random'!
            Warning: couldn't find fasta record for 'chr6_apd_hap1'!
            Warning: couldn't find fasta record for 'chr6_cox_hap2'!
            Warning: couldn't find fasta record for 'chr6_dbb_hap3'!
            Warning: couldn't find fasta record for 'chr6_mann_hap4'!
            Warning: couldn't find fasta record for 'chr6_mcf_hap5'!
            Warning: couldn't find fasta record for 'chr6_qbl_hap6'!
            Warning: couldn't find fasta record for 'chr6_ssto_hap7'!
            Warning: couldn't find fasta record for 'chr7_gl000195_random'!
            Warning: couldn't find fasta record for 'chrUn_gl000211'!
            Warning: couldn't find fasta record for 'chrUn_gl000212'!
            Warning: couldn't find fasta record for 'chrUn_gl000218'!
            Warning: couldn't find fasta record for 'chrUn_gl000219'!
            Warning: couldn't find fasta record for 'chrUn_gl000220'!
            Warning: couldn't find fasta record for 'chrUn_gl000222'!
            Warning: couldn't find fasta record for 'chrUn_gl000223'!
            Warning: couldn't find fasta record for 'chrUn_gl000228'!



            • #21
              The command that you referenced is for genome annotation files from ENSEMBL (release 60), not for UCSC files.
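              A quick way to spot this kind of mismatch is to compare the sequence names in the annotation with those in the reference FASTA. This is a sketch only: the function name missing_in_fasta is made up for illustration, and it assumes the FASTA record name is the first whitespace-delimited word after '>'.

```shell
# Hypothetical helper: print sequence names that occur in the GTF
# but have no matching record in the FASTA file.
# Usage: missing_in_fasta annotation.gtf genome.fa
missing_in_fasta() {
    gtf=$1
    fasta=$2
    names_gtf=$(mktemp)
    names_fa=$(mktemp)
    # column 1 of the GTF holds the chromosome/contig name
    grep -v '^#' "$gtf" | cut -f1 | sort -u > "$names_gtf"
    # FASTA headers: drop ">" and keep only the first word
    grep '^>' "$fasta" | sed 's/^>//' | awk '{print $1}' | sort -u > "$names_fa"
    # lines unique to the first (GTF) list
    comm -23 "$names_gtf" "$names_fa"
    rm -f "$names_gtf" "$names_fa"
}
```

              Each name it prints would correspond to a "couldn't find fasta record" warning like the ones quoted above.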



              • #22
                Thank you so much for the guidance. I was so confused, and now that's a relief.
                I am using UCSC data with Ensembl commands; that's the problem.
                Does this mean I should download the Ensembl GRCh37 files?
                But I am unsure about the downstream results: do you think differential expression results from Ensembl data/commands and UCSC data/commands will differ significantly?

                Is one of the two the better choice?

                Please reply if you can.



                • #23
                  Please suggest something.
                  I have hit a known problem but have no solution. I tried downloading the data with wget from 'Ensembl GRCh37 17297 MB May 14 17:23', but it failed after 12 hours:

                  --2013-04-17 02:23:44-- ftp://igenome:*password*@ussd-ftp.il..._GRCh37.tar.gz
                  (try: 2) => `Homo_sapiens_Ensembl_GRCh37.tar.gz'
                  ==> CWD not required.
                  ==> SIZE Homo_sapiens_Ensembl_GRCh37.tar.gz ... Aborted

                  I searched and found this problem posted a couple of times, but I can't find a solution.

                  Can somebody give me a hint?



                  • #24
                    One more thing to share

                    tophat2 2.0.10 still doesn't recognize a gzipped file passed to the -G option, so uncompress the GTF file before you run the command. Otherwise you will first get the warning "TopHat did not find any junctions in GTF file" and then the error "gtf_to_fasta returned an error".

                    I downloaded the FASTA and GTF files for Zea mays from the Ensembl FTP site. The files are gzipped to save space.
                    Last edited by eastasiasnow; 01-15-2014, 12:12 AM. Reason: add more contents
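                    The uncompress step can be wrapped so that it works for both gzipped and plain files. A minimal sketch; the function name prepare_gtf and the file names in the usage note are made up for illustration:

```shell
# Hypothetical helper: make sure a plain-text GTF exists and print its path.
# A *.gz input is decompressed next to the original (which is kept);
# any other path is printed unchanged.
# Usage: prepare_gtf annotation.gtf.gz
prepare_gtf() {
    case $1 in
        *.gz)
            out=${1%.gz}                      # strip the .gz suffix
            gunzip -c "$1" > "$out" && printf '%s\n' "$out"
            ;;
        *)
            printf '%s\n' "$1"
            ;;
    esac
}
```

                    The result can then be passed to -G, for example: tophat2 -G "$(prepare_gtf annotation.gtf.gz)" followed by your usual index and read arguments.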
