  • Cuffmerge Warning: couldn't find fasta record for 'chr1_random'

    Hi everyone,

    I just got my first RNA-seq dataset (50 bp, paired-end) and am trying to analyze it with the usual TopHat, Cufflinks, Cuffdiff workflow. Specifically, I am following the pipeline suggested in this Nature Protocols paper: Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks.

    However, I run into problems when I use cuffmerge.
    The annotation files I use are the mm9 files provided by Illumina, downloaded from the TopHat homepage.

    cuffmerge -g /home/dalgaard/genomes/mm9/Annotation/Genes/genes.gtf -s /home/dalgaard/genomes/mm9/Sequence/WholeGenomeFasta/genome.fa -p 8 assemblies.txt

    assemblies.txt contains:
    /home/dalgaard/xx/sample01/sample01_tophat_out/sample01.cufflinks.out/transcripts.gtf
    /home/dalgaard/xx/sample02/sample02_tophat_out/sample02.cufflinks.out/transcripts.gtf

    The output is shown below: cuffmerge warns that it cannot find FASTA records for some of the chromosome names.

    I really appreciate your help!

    Thanks a lot.

    Kind regards,

    Kevin Dalgaard
    -------

    cufflinks -o ./merged_asm/ -F 0.05 -g /home/dalgaard/genomes/mm9/Annotation/Genes/genes.gtf -q --overhang-tolerance 200 --library-type=transfrags -A 0.0 --min-frags-per-transfrag 0 --no-5-extend -p 8 ./merged_asm/tmp/mergeSam_file9S5P0t
    [bam_header_read] EOF marker is absent.
    [bam_header_read] invalid BAM binary header (this is not a BAM file).
    File ./merged_asm/tmp/mergeSam_file9S5P0t doesn't appear to be a valid BAM file, trying SAM...
    [21:45:58] Loading reference annotation.
    [21:46:02] Inspecting reads and determining fragment length distribution.
    Processed 26894 loci.
    > Map Properties:
    > Normalized Map Mass: 71083.00
    > Raw Map Mass: 71083.00
    > Fragment Length Distribution: Truncated Gaussian (default)
    > Default Mean: 200
    > Default Std Dev: 80
    [21:46:03] Assembling transcripts and estimating abundances.

    Processed 26412 loci.
    [Sun Dec 2 18:39:40 2012] Comparing against reference file /home/dalgaard/refgenome/mm9.igenes.gtf
    Warning: Your version of Cufflinks is not up-to-date. It is recommended that you upgrade to Cufflinks v2.0.2 to benefit from the most recent features and bug fixes (http://cufflinks.cbcb.umd.edu).
    Warning: couldn't find fasta record for 'chr13_random'!
    Warning: couldn't find fasta record for 'chr17_random'!
    Warning: couldn't find fasta record for 'chr1_random'!
    Warning: couldn't find fasta record for 'chr4_random'!
    Warning: couldn't find fasta record for 'chr5_random'!
    Warning: couldn't find fasta record for 'chr7_random'!
    Warning: couldn't find fasta record for 'chr8_random'!
    Warning: couldn't find fasta record for 'chr9_random'!
    Warning: couldn't find fasta record for 'chrUn_random'!
    Warning: couldn't find fasta record for 'chrX_random'!
    Warning: couldn't find fasta record for 'chrY_random'!
    Last edited by DonDolowy; 12-02-2012, 01:12 PM.
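
    To see exactly which sequence names are affected, the warnings can be collapsed into a unique list. This is only a sketch: the log file is a temporary stand-in filled with a few lines copied from the output above, and it assumes a grep that supports -o (GNU and BSD grep do).

```shell
#!/bin/sh
# Stand-in log file holding a few lines from the cuffmerge output above.
log=$(mktemp)
cat > "$log" <<'EOF'
Processed 26412 loci.
Warning: couldn't find fasta record for 'chr13_random'!
Warning: couldn't find fasta record for 'chr1_random'!
Warning: couldn't find fasta record for 'chr1_random'!
EOF

# Pull out the quoted sequence name from each warning, then de-duplicate.
missing=$(grep -o "fasta record for '[^']*'" "$log" | cut -d "'" -f2 | LC_ALL=C sort -u)
printf '%s\n' "$missing"
rm -f "$log"
```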

  • #2
    Hello,

    Did you find a solution to the "couldn't find fasta record for 'chr1_random'" warning? I've run into the same problem.

    Thank you

    -Joe

    • #3
      What I decided to do was use grep to remove all lines containing "_random". That allows you to continue with the analysis.

      • #4
        Hello, which file did you remove the lines containing '_random' from, and how exactly do you do this with grep?

        Thanks

        Alex

        • #5
          I think it is because the chromosome names in the GTF you passed with '-g' differ from those in the genome FASTA file. You can check the chromosome names in both files by running grep "_random" on the GTF and on the FASTA.
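
          A rough sketch of that check, going one step further than a plain grep by listing every sequence name that appears in the GTF but not in the FASTA. The two input files here are tiny made-up stand-ins for genes.gtf and genome.fa; with real data you would point the commands at your own files.

```shell
#!/bin/sh
# Made-up demo inputs standing in for genes.gtf and genome.fa.
work=$(mktemp -d)
printf 'chr1\tdemo\texon\t1\t10\t.\t+\t.\tgene_id "a";\n' >  "$work/genes.gtf"
printf 'chr1_random\tdemo\texon\t1\t10\t.\t+\t.\tgene_id "b";\n' >> "$work/genes.gtf"
printf '>chr1\nACGTACGTAC\n>chr2\nACGTACGTAC\n' > "$work/genome.fa"

# Names used by the annotation: column 1 of every non-comment line.
grep -v '^#' "$work/genes.gtf" | cut -f1 | LC_ALL=C sort -u > "$work/gtf_names.txt"

# Names in the FASTA: first word of each header line, without the '>'.
grep '^>' "$work/genome.fa" | sed 's/^>//; s/[[:space:]].*$//' | LC_ALL=C sort -u > "$work/fa_names.txt"

# Names present in the GTF but absent from the FASTA.
only_in_gtf=$(LC_ALL=C comm -23 "$work/gtf_names.txt" "$work/fa_names.txt")
printf '%s\n' "$only_in_gtf"
rm -rf "$work"
```

          Any name printed at the end is one that cuffmerge would be unable to look up in the FASTA.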

          To solve this problem, you can remove all transcripts associated with chr*_random from the GTF, then run the analysis again.
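
          One way that filtering might look, as a sketch only. It uses awk on the first column rather than a plain grep so that only the sequence name is matched; the demo GTF and the output name genes.no_random.gtf are made up for illustration.

```shell
#!/bin/sh
# Made-up demo annotation standing in for genes.gtf.
work=$(mktemp -d)
printf 'chr1\tdemo\texon\t1\t10\t.\t+\t.\tgene_id "a";\n' >  "$work/genes.gtf"
printf 'chr1_random\tdemo\texon\t1\t10\t.\t+\t.\tgene_id "b";\n' >> "$work/genes.gtf"
printf 'chrUn_random\tdemo\texon\t5\t20\t.\t-\t.\tgene_id "c";\n' >> "$work/genes.gtf"

# Keep only records whose sequence name (column 1) does not end in "_random".
# Writing to a new file leaves the original annotation untouched.
awk -F '\t' '$1 !~ /_random$/' "$work/genes.gtf" > "$work/genes.no_random.gtf"

kept=$(wc -l < "$work/genes.no_random.gtf" | tr -d ' ')
kept_names=$(cut -f1 "$work/genes.no_random.gtf")
printf 'kept %s record(s): %s\n' "$kept" "$kept_names"
rm -rf "$work"
```

          The plain grep version would be grep -v '_random' genes.gtf > genes.no_random.gtf, which also drops any line that happens to contain "_random" outside the first column.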

          • #6
            Thanks, that did remove some, but not all, of the error lines. And couldn't these be important sequences that we are grepping?

            Alex

            • #7
              Originally posted by Alex234
              Thanks, that did remove some, but not all, of the error lines. And couldn't these be important sequences that we are grepping?

              Alex
              Maybe.
              So the best approach is to make sure that the reference GTF and your analysis pipeline use the same version of the genome, both for locating transcripts and for alignment.
              You can download the mouse genome from UCSC here: http://hgdownload.cse.ucsc.edu/downloads.html#mouse. That may solve the problem.

              • #8
                I just find it odd that when you download an iGenomes "package" (e.g. UCSC mm9), the genome.fa and genes.gtf do not correspond and you get this error.

                Personally, I have just removed all lines containing "random".
                If I understand correctly, chr1_random means that when the genome was assembled, these sequences were assigned to chromosome 1, but their exact position on chromosome 1 is not known. They may be repetitive sequences.

                • #9
                  Well, it's odd that the iGenomes files don't always correspond, but the error itself makes sense. I wouldn't recommend removing the *_random lines from either the reference or the annotation. Those sequences/features are actually in the genome, so leaving them out will bias alignment a bit (though the magnitude of this effect is likely fairly small).
