SEQanswers

Old 12-02-2012, 12:58 PM   #1
DonDolowy
Member
 
Location: Freiburg

Join Date: Oct 2012
Posts: 56
Cuffmerge Warning: couldn't find fasta record for 'chr1_random'

Hi everyone,

I just got my first RNA-seq dataset (50 bp, paired-end) and am trying to analyze it with the common TopHat - Cufflinks - Cuffdiff workflow. Specifically, I am following the pipeline suggested in this Nat. Protoc. paper: Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks.

However, I run into problems when I use cuffmerge.
The annotation files I use are the ones for mm9 provided by Illumina, downloaded from the TopHat homepage.

cuffmerge -g /home/dalgaard/genomes/mm9/Annotation/Genes/genes.gtf -s /home/dalgaard/genomes/mm9/Sequence/WholeGenomeFasta/genome.fa -p 8 assemblies.txt

Assemblies.txt contains:
/home/dalgaard/xx/sample01/sample01_tophat_out/sample01.cufflinks.out/transcripts.gtf
/home/dalgaard/xx/sample02/sample02_tophat_out/sample02.cufflinks.out/transcripts.gtf

The output is shown below; the warnings say that it cannot find FASTA records for some of the chromosome names.

I really appreciate your help!

Thanks a lot.

Kind regards,

Kevin Dalgaard
-------

cufflinks -o ./merged_asm/ -F 0.05 -g /home/dalgaard/genomes/mm9/Annotation/Genes/genes.gtf -q --overhang-tolerance 200 --library-type=transfrags -A 0.0 --min-frags-per-transfrag 0 --no-5-extend -p 8 ./merged_asm/tmp/mergeSam_file9S5P0t
[bam_header_read] EOF marker is absent.
[bam_header_read] invalid BAM binary header (this is not a BAM file).
File ./merged_asm/tmp/mergeSam_file9S5P0t doesn't appear to be a valid BAM file, trying SAM...
[21:45:58] Loading reference annotation.
[21:46:02] Inspecting reads and determining fragment length distribution.
Processed 26894 loci.
> Map Properties:
> Normalized Map Mass: 71083.00
> Raw Map Mass: 71083.00
> Fragment Length Distribution: Truncated Gaussian (default)
> Default Mean: 200
> Default Std Dev: 80
[21:46:03] Assembling transcripts and estimating abundances.

Processed 26412 loci.
[Sun Dec 2 18:39:40 2012] Comparing against reference file /home/dalgaard/refgenome/mm9.igenes.gtf
Warning: Your version of Cufflinks is not up-to-date. It is recommended that you upgrade to Cufflinks v2.0.2 to benefit from the most recent features and bug fixes (http://cufflinks.cbcb.umd.edu).
Warning: couldn't find fasta record for 'chr13_random'!
Warning: couldn't find fasta record for 'chr17_random'!
Warning: couldn't find fasta record for 'chr1_random'!
Warning: couldn't find fasta record for 'chr4_random'!
Warning: couldn't find fasta record for 'chr5_random'!
Warning: couldn't find fasta record for 'chr7_random'!
Warning: couldn't find fasta record for 'chr8_random'!
Warning: couldn't find fasta record for 'chr9_random'!
Warning: couldn't find fasta record for 'chrUn_random'!
Warning: couldn't find fasta record for 'chrX_random'!
Warning: couldn't find fasta record for 'chrY_random'!

Last edited by DonDolowy; 12-02-2012 at 01:12 PM.
Old 02-25-2013, 04:07 PM   #2
joseph.troy
Junior Member
 
Location: Urbana, IL

Join Date: Oct 2012
Posts: 4

Hello,

Did you find a solution to the "couldn't find fasta record for 'chr1_random'" warning? I've run into the same problem.

Thank you

-Joe
Old 02-25-2013, 11:35 PM   #3
DonDolowy
Member
 
Location: Freiburg

Join Date: Oct 2012
Posts: 56

What I decided to do was use grep to remove all lines containing "_random". That allows you to continue your analysis.
Old 08-10-2013, 05:49 AM   #4
Alex234
Member
 
Location: UK

Join Date: Aug 2013
Posts: 31

Hello, which file did you remove the lines containing '_random' from, and how exactly do you do this with a grep command?

Thanks

Alex
Old 08-25-2013, 07:43 PM   #5
matrix731
Junior Member
 
Location: Shenzhen

Join Date: Oct 2011
Posts: 4

I think it is because the chromosome names in the GTF you passed with '-g' differ from those in the genome FASTA file. You can check the chromosome names in the two files by grepping each of them for "_random".
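A minimal sketch of that check, assuming a tab-separated GTF whose first column is the sequence name and a FASTA whose header lines start with '>'. The function name missing_in_fasta is made up for illustration and is not part of Cufflinks. Rather than grepping only for "_random", it lists every name that the GTF uses but the FASTA lacks:

```shell
# List sequence names that appear in a GTF (column 1) but have no
# record in a FASTA file. Usage: missing_in_fasta genes.gtf genome.fa
missing_in_fasta() {
    gtf=$1
    fasta=$2
    tmpdir=$(mktemp -d)

    # FASTA record names: strip the leading '>' and keep the first word.
    grep '^>' "$fasta" | sed 's/^>//' | awk '{print $1}' | sort -u > "$tmpdir/fasta_names"

    # GTF sequence names: column 1 of every non-empty, non-comment line.
    awk -F'\t' '!/^#/ && NF {print $1}' "$gtf" | sort -u > "$tmpdir/gtf_names"

    # comm -23 prints lines found only in the first (GTF) list.
    comm -23 "$tmpdir/gtf_names" "$tmpdir/fasta_names"

    rm -r "$tmpdir"
}
```

Running it on the two files passed to cuffmerge with -g and -s should print one name per line for each sequence that would trigger the warning; empty output would mean every name in the GTF has a FASTA record.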

To work around this problem, you can remove all transcripts associated with chr*_random from the GTF, then run the analysis again.
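One way to do that filtering, as a sketch only: the thread mentions grep, but this example uses awk so that only the sequence-name column is tested, not the whole line. The function name drop_random and the output file name in the usage note are hypothetical:

```shell
# Print a GTF without the records whose sequence name (column 1)
# ends in "_random". Usage: drop_random genes.gtf > genes.no_random.gtf
drop_random() {
    awk -F'\t' '$1 !~ /_random$/' "$1"
}
```

The plain grep equivalent would be grep -v '_random' genes.gtf > genes.no_random.gtf, which also drops any line that merely mentions "_random" in another column. Either way the original file is left untouched, so the unfiltered annotation remains available.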
Old 08-27-2013, 05:35 AM   #6
Alex234
Member
 
Location: UK

Join Date: Aug 2013
Posts: 31

Thanks, that did remove some, but not all, of the error lines. And couldn't these be important sequences that we are grepping?

Alex
Old 08-28-2013, 12:30 AM   #7
matrix731
Junior Member
 
Location: Shenzhen

Join Date: Oct 2011
Posts: 4

Quote:
Originally Posted by Alex234 View Post
Thanks, that did remove some, but not all, of the error lines. And couldn't these be important sequences that we are grepping?

Alex
Maybe.
So the best way is to make sure that the reference GTF and your analysis pipeline use the same version of the genome to locate the transcripts and to do the alignment.
You can download the mouse genome from UCSC here: http://hgdownload.cse.ucsc.edu/downloads.html#mouse, which could possibly solve the problem.
Old 08-28-2013, 12:38 AM   #8
DonDolowy
Member
 
Location: Freiburg

Join Date: Oct 2012
Posts: 56

I just find it odd that when you download a certain iGenomes "package" (e.g. UCSC mm9), the genome.fa and genes.gtf do not correspond and you get this error.

Personally, I have just removed all lines containing "random".
If I understand it correctly, chr1_random just means that when the genome was assembled, these sequences were assigned to chromosome 1, but it is not known specifically where on chromosome 1 they go. Maybe they are repetitive sequences.
Old 08-28-2013, 02:02 AM   #9
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480

Well, while it's odd that the iGenomes files don't always correspond, the error itself makes sense. I wouldn't recommend removing the *_random lines from either the reference or the annotation. Those sequences/features are actually in the genome, so leaving them out will bias alignment a bit (the magnitude of this effect is likely fairly small, of course).