SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
Inconsistency between cuffdiff 1.1.0 and cuffdiff 1.0.2 tleonardi Bioinformatics 3 11-16-2011 09:23 AM
cuffdiff q-value fangquan Bioinformatics 1 09-19-2011 07:08 PM
CuffDiff q-value jdanderson Bioinformatics 0 08-01-2011 11:51 AM
What comes after cuffdiff? shurjo Bioinformatics 1 06-05-2011 06:36 AM
Cuffdiff question alvin Bioinformatics 0 02-15-2011 04:59 AM

Reply
 
Thread Tools
Old 10-04-2010, 06:49 AM   #1
Balat
Member
 
Location: Australia

Join Date: May 2010
Posts: 36
Default cuffdiff-0.9.1

Hi,
I am using the latest version of cufflinks (v0.9.1). When I tried to run cuffdiff, with stdout.combined.gtf and SAM files generated from tophat-1.0.13, I got the following error message,
Error: this SAM file doesn't appear to be correctly sorted!
current hit is at scaffold_10:84064, last one was at scaffold_9:38984923
Cufflinks requires that if your file has SQ records in
the SAM header that they appear in the same order as the chromosomes names
in the alignments.
If there are no SQ records in the header, or if the header is missing,
the alignments must be sorted lexicographically by chromsome
name and by position.

I sorted the SAM with "LC_ALL="C" sort -k 3,3 -k 4,4n" command but the problem persists. Could someone please let me know how to sort the SAM lexicographically by chromsome name and by position? According to the notes in the latest release version, this version should correctly handle the SAM files generated using tophat-1.0.14 or earlier versions.
Balat is offline   Reply With Quote
Old 10-04-2010, 07:09 AM   #2
Cole Trapnell
Senior Member
 
Location: Boston, MA

Join Date: Nov 2008
Posts: 212
Default

Quote:
Originally Posted by Balat View Post
Hi,
I am using the latest version of cufflinks (v0.9.1). When I tried to run cuffdiff, with stdout.combined.gtf and SAM files generated from tophat-1.0.13, I got the following error message,
Error: this SAM file doesn't appear to be correctly sorted!
current hit is at scaffold_10:84064, last one was at scaffold_9:38984923
Cufflinks requires that if your file has SQ records in
the SAM header that they appear in the same order as the chromosomes names
in the alignments.
If there are no SQ records in the header, or if the header is missing,
the alignments must be sorted lexicographically by chromsome
name and by position.

I sorted the SAM with "LC_ALL="C" sort -k 3,3 -k 4,4n" command but the problem persists. Could someone please let me know how to sort the SAM lexicographically by chromsome name and by position? According to the notes in the latest release version, this version should correctly handle the SAM files generated using tophat-1.0.14 or earlier versions.
Does this happen during the initial inspection of the reads?
Cole Trapnell is offline   Reply With Quote
Old 10-04-2010, 01:49 PM   #3
zones
Junior Member
 
Location: La Jolla

Join Date: Jun 2010
Posts: 6
Default

I'm having this same problem and think it occurs during inspection. I pasted the output below.

[bam_header_read] EOF marker is absent.
[bam_header_read] invalid BAM binary header (this is not a BAM file).
File /users/zones/PEtophat/PEsams/0hrsL.sam doesn't appear to be a valid BAM file, trying SAM...
[22:26:02] Inspecting reads and determining fragment length distribution.
> Processing Locus scaffold_79:11475-16927 [*********************** ] 92%
Error: this SAM file doesn't appear to be correctly sorted!
current hit is at scaffold_77:5171, last one was at scaffold_76:21975
Cufflinks requires that if your file has SQ records in
the SAM header that they appear in the same order as the chromosomes names
in the alignments.
If there are no SQ records in the header, or if the header is missing,
the alignments must be sorted lexicographically by chromsome
name and by position.
zones is offline   Reply With Quote
Old 10-04-2010, 01:51 PM   #4
Balat
Member
 
Location: Australia

Join Date: May 2010
Posts: 36
Default

Thanks Cole for the reply. Sorry, I couldn't quite get your question. I ran the cufflinks and cuffcompare sucessfully with the newer version (v0.9.1). When I run cuffdiff, it says looking into the reference sequence and I get this error message. I get this message during the initial stages of running the program.
Balat is offline   Reply With Quote
Old 10-04-2010, 02:02 PM   #5
zones
Junior Member
 
Location: La Jolla

Join Date: Jun 2010
Posts: 6
Default

I should have mentioned that I received this error while running cufflinks, not cuffcompare or cuffdiff.
zones is offline   Reply With Quote
Old 10-04-2010, 02:04 PM   #6
Cole Trapnell
Senior Member
 
Location: Boston, MA

Join Date: Nov 2008
Posts: 212
Default

Quote:
Originally Posted by Balat View Post
Thanks Cole for the reply. Sorry, I couldn't quite get your question. I ran the cufflinks and cuffcompare sucessfully with the newer version (v0.9.1). When I run cuffdiff, it says looking into the reference sequence and I get this error message. I get this message during the initial stages of running the program.
My guess is that this is happening because you are using a GTF file that doesn't have records for all of the scaffolds in your assembly, and there are reads that are hitting some of the scaffolds that aren't in the GTF. If this were the case, the current version of Cufflinks will produce this behavior. I suggest that you try generating a SAM header that conforms to the requirements on the website: one SQ record for each scaffold in your assembly, and sorted in the header in lexicographic order. You can then either prepend this header onto your SAM files or convert them to BAM with the header you made (samtools "view" should let you do this).

Another alternative, which might be easier but take longer in wall-clock time is to remap your reads with TopHat 1.1.0, which should produce BAM files with valid headers.
Cole Trapnell is offline   Reply With Quote
Old 10-04-2010, 03:48 PM   #7
zones
Junior Member
 
Location: La Jolla

Join Date: Jun 2010
Posts: 6
Default

Thanks for the response Cole.

I tried one version of your first suggestion: to add the complete SQ header to my .sam file with samtools and feed this to cufflinks with the following work flow (did not convert to BAM format):

samtools faidx chlre4_masked.fa

samtools view -h -S -t chlre4_masked.fa.fai -o 0hrsL_s.sam /users/zones/PEtophat/PEsams/0hrsL.sam

nohup cufflinks -o ./0hrsL -N -r /users/zones/PEtophat/chlre4_masked.fa -m 150 -I 5000 -p 2 -G /users/zones/augGTFs/augustus.u9_addDesc.gtf /users/zones/PEsamtools/0hrsL_s.sam &

This produced the following error:

~/PEcuff_2 > cat nohup.out
[bam_header_read] EOF marker is absent.
[bam_header_read] invalid BAM binary header (this is not a BAM file).
File /users/zones/PEsamtools/0hrsL_s.sam doesn't appear to be a valid BAM file, trying SAM...
[15:23:59] Inspecting reads and determining fragment length distribution.
> Processing Locus chromosome_17:6609984-66144 [***** ] 22%
Error: this SAM file doesn't appear to be correctly sorted!
current hit is at chromosome_2:8891, last one was at chromosome_17:6666955
Cufflinks requires that if your file has SQ records in
the SAM header that they appear in the same order as the chromosomes names
in the alignments.
If there are no SQ records in the header, or if the header is missing,
the alignments must be sorted lexicographically by chromsome
name and by position.


This is interesting because the genomic coordinates of this error are different from those in my previous attempt (earlier post) and also match exactly those of the error produced a few days back by cufflinks-0.9.0.


I'll try converting this SAM to a BAM to see what happens and am also in the process of re-mapping my reads with the new version of tophat.
zones is offline   Reply With Quote
Old 10-04-2010, 04:00 PM   #8
Cole Trapnell
Senior Member
 
Location: Boston, MA

Join Date: Nov 2008
Posts: 212
Default

Quote:
Originally Posted by zones View Post
Thanks for the response Cole.

I tried one version of your first suggestion: to add the complete SQ header to my .sam file with samtools and feed this to cufflinks with the following work flow (did not convert to BAM format):

samtools faidx chlre4_masked.fa

samtools view -h -S -t chlre4_masked.fa.fai -o 0hrsL_s.sam /users/zones/PEtophat/PEsams/0hrsL.sam

nohup cufflinks -o ./0hrsL -N -r /users/zones/PEtophat/chlre4_masked.fa -m 150 -I 5000 -p 2 -G /users/zones/augGTFs/augustus.u9_addDesc.gtf /users/zones/PEsamtools/0hrsL_s.sam &

This produced the following error:

~/PEcuff_2 > cat nohup.out
[bam_header_read] EOF marker is absent.
[bam_header_read] invalid BAM binary header (this is not a BAM file).
File /users/zones/PEsamtools/0hrsL_s.sam doesn't appear to be a valid BAM file, trying SAM...
[15:23:59] Inspecting reads and determining fragment length distribution.
> Processing Locus chromosome_17:6609984-66144 [***** ] 22%
Error: this SAM file doesn't appear to be correctly sorted!
current hit is at chromosome_2:8891, last one was at chromosome_17:6666955
Cufflinks requires that if your file has SQ records in
the SAM header that they appear in the same order as the chromosomes names
in the alignments.
If there are no SQ records in the header, or if the header is missing,
the alignments must be sorted lexicographically by chromsome
name and by position.


This is interesting because the genomic coordinates of this error are different from those in my previous attempt (earlier post) and also match exactly those of the error produced a few days back by cufflinks-0.9.0.


I'll try converting this SAM to a BAM to see what happens and am also in the process of re-mapping my reads with the new version of tophat.
Can you verify that the SQ records in the SAM file you produced are lexicographically ordered? I tried doing the commands you did on a little test case, and I wasn't getting correct headers.
Cole Trapnell is offline   Reply With Quote
Old 10-04-2010, 04:05 PM   #9
zones
Junior Member
 
Location: La Jolla

Join Date: Jun 2010
Posts: 6
Default

I think so...

~/PEsamtools > head -n 95 0hrsL_s.sam

@HD VN:1.0 SO:sorted
@PG ID:TopHat VN:1.0.14 CL:/users/zones/tophat-1.0.14/tophat -o ./0hrsL -g 1 -p 2 -r 150 chlre4_masked 0hrsL_1.fq 0hrsL_2.fq
@SQ SN:chromosome_1 LN:9982135
@SQ SN:chromosome_2 LN:9975745
@SQ SN:chromosome_3 LN:7773333
@SQ SN:chromosome_4 LN:3005669
@SQ SN:chromosome_5 LN:3366352
@SQ SN:chromosome_6 LN:7695193
@SQ SN:chromosome_7 LN:6170768
@SQ SN:chromosome_8 LN:4189090
@SQ SN:chromosome_9 LN:4733070
@SQ SN:chromosome_10 LN:6579462
@SQ SN:chromosome_11 LN:2734619
@SQ SN:chromosome_12 LN:9355449
@SQ SN:chromosome_13 LN:6588689
@SQ SN:chromosome_14 LN:4114342
@SQ SN:chromosome_15 LN:3066274
@SQ SN:chromosome_16 LN:6617689
@SQ SN:chromosome_17 LN:6673064
@SQ SN:scaffold_18 LN:1287285
@SQ SN:scaffold_19 LN:814470
@SQ SN:scaffold_20 LN:598575
@SQ SN:scaffold_21 LN:558747
@SQ SN:scaffold_22 LN:430403
@SQ SN:scaffold_23 LN:373397
@SQ SN:scaffold_24 LN:339605
@SQ SN:scaffold_25 LN:333029
@SQ SN:scaffold_26 LN:317769
@SQ SN:scaffold_27 LN:256895
@SQ SN:scaffold_28 LN:253209
@SQ SN:scaffold_29 LN:245379
@SQ SN:scaffold_30 LN:238238
@SQ SN:scaffold_31 LN:222931
@SQ SN:scaffold_32 LN:217100
@SQ SN:scaffold_33 LN:214180
@SQ SN:scaffold_34 LN:186456
@SQ SN:scaffold_35 LN:179663
@SQ SN:scaffold_36 LN:154127
@SQ SN:scaffold_37 LN:123055
@SQ SN:scaffold_38 LN:120767
@SQ SN:scaffold_39 LN:119177
@SQ SN:scaffold_40 LN:116725
@SQ SN:scaffold_41 LN:99905
@SQ SN:scaffold_42 LN:98780
@SQ SN:scaffold_43 LN:80533
@SQ SN:scaffold_44 LN:80252
@SQ SN:scaffold_45 LN:79195
@SQ SN:scaffold_46 LN:72145
@SQ SN:scaffold_47 LN:67355
@SQ SN:scaffold_48 LN:66577
@SQ SN:scaffold_49 LN:65629
@SQ SN:scaffold_50 LN:63401
@SQ SN:scaffold_51 LN:63266
@SQ SN:scaffold_52 LN:59018
@SQ SN:scaffold_53 LN:55340
@SQ SN:scaffold_54 LN:54374
@SQ SN:scaffold_55 LN:50490
@SQ SN:scaffold_56 LN:47028
@SQ SN:scaffold_57 LN:46469
@SQ SN:scaffold_58 LN:45221
@SQ SN:scaffold_59 LN:43590
@SQ SN:scaffold_60 LN:43313
@SQ SN:scaffold_61 LN:39687
@SQ SN:scaffold_62 LN:38880
@SQ SN:scaffold_63 LN:36644
@SQ SN:scaffold_64 LN:36299
@SQ SN:scaffold_65 LN:36037
@SQ SN:scaffold_66 LN:32450
@SQ SN:scaffold_67 LN:30996
@SQ SN:scaffold_68 LN:29908
@SQ SN:scaffold_69 LN:29423
@SQ SN:scaffold_70 LN:28937
@SQ SN:scaffold_71 LN:28038
@SQ SN:scaffold_72 LN:26265
@SQ SN:scaffold_73 LN:25979
@SQ SN:scaffold_74 LN:25574
@SQ SN:scaffold_75 LN:25288
@SQ SN:scaffold_76 LN:23213
@SQ SN:scaffold_77 LN:22385
@SQ SN:scaffold_78 LN:21742
@SQ SN:scaffold_79 LN:21191
@SQ SN:scaffold_80 LN:20468
@SQ SN:scaffold_81 LN:20314
@SQ SN:scaffold_82 LN:20067
@SQ SN:scaffold_83 LN:18413
@SQ SN:scaffold_84 LN:15458
@SQ SN:scaffold_85 LN:13564
@SQ SN:scaffold_86 LN:12875
@SQ SN:scaffold_87 LN:12675
@SQ SN:scaffold_88 LN:8671
1033:4:3:12907:1656#gggggggaggtgggg 161 chromosome_1 2929 255 76M = 3313 0 GTTCGATTAGTCTTTCGCCCCTATACCCAAGTCTGAAAAGCGATTTGCACGTCAGCACATCTACGAGCCTACGAGG bbbbbbbbbbbbbbbbbbbbbbbbbbbbbcbcbbbbbbbcbcccbbcbcbcaba``caccaaaaaa`caaacca`] NM:i:0 NH:i:1
1033:4:78:16485:17847#aagccttcacgctct 99 chromosome_1 2936 255 76M = 3072 0 TAGTCTTTCGCCCCTATACCCAAGTCTGAAAAGCGATTTGCACGTCAGCACATCTACGAGCCTACGAGGCATTCTT bbbbbbbbbbbbbbbbbbbbb_bbcbbbcbbbcbbabbbbbcbbcccbabcabca`aaaacb`ca`^bb``aaaa` NM:i:0 NH:i:1
1033:4:54:5415:3919#cacactccacgcctc 161 chromosome_1 2938 255 76M = 3242 0 GTCTTTCGCCCCTATACCCAAGTCTGAAAAGCGATTTGCACGTCAGCACATCTACGAGCCTACGAGGCATTCTTGT bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbcbbbb`bbbbbbbbbbbcbbbb_c`ccbababba NM:i:0 NH:i:1
1033:4:57:1482:6284#atcttctctccctcc 97 chromosome_1 2938 255 76M = 3242 0 GTCTTTCGCCCCTATACCCAAGTCTGAAAAGCGATTTGCACGTCAGCACATCTACGAGCCTACGAGGCATTCTTGT bbbbbbbbbbbbbbbbbbbbbbbbbbb_bbbbbbbbbbbbbbbbbbbbbbbbb`bb`cacc`caZaaa^c_aaca[ NM:i:0 NH:i:1
1033:4:87:14930:10184#tcctaaaccacacct 163 chromosome_1 2948 255 76M = 3170 0 CCTATACCCAAGTCTGAAAAGCGATTTGCACGTCAGCACATCTACGAGCCTACGAGGCATTCTTGTGACAATCTCG bbbbabbb`b`bbbb\\^^`bbb`b`bbbbbb\ab`bbbbbbb_b`\bb``bba]ba___Zbb\_ab]U_\___ba NM:i:0 NH:i:1
zones is offline   Reply With Quote
Old 10-04-2010, 04:30 PM   #10
zones
Junior Member
 
Location: La Jolla

Join Date: Jun 2010
Posts: 6
Default

After converting my SAM (with added header) to BAM format and using this file as input to cufflinks, I get an error with the genomic coordinates identical to my original post. I guess I'll wait on tophat.
zones is offline   Reply With Quote
Old 10-04-2010, 04:43 PM   #11
Cole Trapnell
Senior Member
 
Location: Boston, MA

Join Date: Nov 2008
Posts: 212
Default

Ah - the SQ records there are sorted longest to shortest, not lexicographically by scaffold name. I think if you do:

samtools faidx chlre4_masked.fa
sort -k 1,1 chlre4_masked.fa.fai > chlre4_masked.fa.fai.s
samtools view -h -S -t chlre4_masked.fa.fai.s -o 0hrsL_s.sam /users/zones/PEtophat/PEsams/0hrsL.sam

Then things should come out right. Or you could wait for TopHat, as this is essentially what it's doing as well
Cole Trapnell is offline   Reply With Quote
Old 10-04-2010, 07:10 PM   #12
zones
Junior Member
 
Location: La Jolla

Join Date: Jun 2010
Posts: 6
Default

That did it! Many thanks!
zones is offline   Reply With Quote
Old 10-05-2010, 01:19 AM   #13
Balat
Member
 
Location: Australia

Join Date: May 2010
Posts: 36
Default

Thanks Cole. Sorting SAM files has fixed the problem. I am able to run cuffdiff-0.9.1 after sorting SAM files. Thanks Zones for the interaction.
Balat is offline   Reply With Quote
Old 10-05-2010, 05:49 AM   #14
Cole Trapnell
Senior Member
 
Location: Boston, MA

Join Date: Nov 2008
Posts: 212
Default

Thanks to you guys for helping sort this out. I'm going to update our website with a note about this issue and how to workaround it.
Cole Trapnell is offline   Reply With Quote
Old 10-05-2010, 07:00 AM   #15
aherbert@bu.edu
Junior Member
 
Location: Boston

Join Date: Jan 2010
Posts: 2
Default GFT output cufflinks 0.9.1

I am having problems with the combined.gtf from cuffcompare produced on a Linux system - gedit will not open the file, giving the error message

"gedit has not been able to detect the character coding.
Please check that you are not trying to open a binary file.
Select a character coding from the menu and try again."

The file will not load into the UCSC "tracks" with the error "Can't read file:"


It will load into emacs and openoffice - cutting and pasting from openoffice to gedit works - saving the file from gedit allows gedit to then open it.

Also will load into IGV

Seems like some non-ascii characters are being generated by cuffcompare. Any ideas why?

Thanks
aherbert@bu.edu is offline   Reply With Quote
Old 11-21-2010, 04:30 AM   #16
Enrico Palazzo
Junior Member
 
Location: Hamburg, Germany

Join Date: Jul 2010
Posts: 9
Default

Quote:
Originally Posted by aherbert@bu.edu View Post
I am having problems with the combined.gtf from cuffcompare produced on a Linux system - gedit will not open the file, giving the error message

"gedit has not been able to detect the character coding.
Please check that you are not trying to open a binary file.
Select a character coding from the menu and try again."

The file will not load into the UCSC "tracks" with the error "Can't read file:"


It will load into emacs and openoffice - cutting and pasting from openoffice to gedit works - saving the file from gedit allows gedit to then open it.

Also will load into IGV

Seems like some non-ascii characters are being generated by cuffcompare. Any ideas why?

Thanks
Hi,
I had the same problem with cuffcompare. I think cuffcompare generates illegal symbols if the strand orientation is unknown. Do you have entries like "^@" in the 7th column?

I fixed it with:
awk -F "\t" '{if ($7!="+" && $7!="-") print $1 "\t" $2 "\t" $3 "\t" $4 "\t" $5 "\t" $6 "\t" "." "\t" $8 "\t" $9 "\t" $10; else print $0}' stdout.combined.gtf > fixed.gtf

Looks ugly but for me it works
Enrico Palazzo is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 06:15 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO