sorting sam file

crh

Member

Join Date: Dec 2009

Posts: 46
#1

06-15-2011, 08:11 PM

Hi All,

I've read the archives and seen that this problem has been addressed before, but I still can't sort a SAM file so that cufflinks will accept it.
Here is an excerpt from the SAM file:

@SQ SN:chromosome_1 LN:9982135
@SQ SN:chromosome_2 LN:9975745
@SQ SN:chromosome_3 LN:7773333
@SQ SN:chromosome_4 LN:3005669
@SQ SN:chromosome_5 LN:3366352
@SQ SN:chromosome_6 LN:7695193
@SQ SN:chromosome_7 LN:6170768
@SQ SN:chromosome_8 LN:4189090
@SQ SN:chromosome_9 LN:4733070
@SQ SN:chromosome_10 LN:6579462
@SQ SN:chromosome_11 LN:2734619
@SQ SN:chromosome_12 LN:9355449
@SQ SN:chromosome_13 LN:6588689
@SQ SN:chromosome_14 LN:4114342
@SQ SN:chromosome_15 LN:3066274
@SQ SN:chromosome_16 LN:6617689
@SQ SN:chromosome_17 LN:6673064
@SQ SN:scaffold_18 LN:1287285
@SQ SN:scaffold_19 LN:814470
@SQ SN:scaffold_20 LN:598575
@SQ SN:scaffold_21 LN:558747
@SQ SN:scaffold_22 LN:430403
@SQ SN:scaffold_23 LN:373397
@SQ SN:scaffold_24 LN:339605
@SQ SN:scaffold_25 LN:333029
@SQ SN:scaffold_26 LN:317769
@SQ SN:scaffold_27 LN:256895
@SQ SN:scaffold_28 LN:253209
@SQ SN:scaffold_29 LN:245379
@SQ SN:scaffold_30 LN:238238
@SQ SN:scaffold_31 LN:222931
@SQ SN:scaffold_32 LN:217100
@SQ SN:scaffold_33 LN:214180
@SQ SN:scaffold_34 LN:186456
@SQ SN:scaffold_35 LN:179663
@SQ SN:scaffold_36 LN:154127
@SQ SN:scaffold_37 LN:123055
@SQ SN:scaffold_38 LN:120767
@SQ SN:scaffold_39 LN:119177
@SQ SN:scaffold_40 LN:116725
@SQ SN:scaffold_41 LN:99905
@SQ SN:scaffold_42 LN:98780
@SQ SN:scaffold_43 LN:80533
@SQ SN:scaffold_44 LN:80252
@SQ SN:scaffold_45 LN:79195
@SQ SN:scaffold_46 LN:72145
@SQ SN:scaffold_47 LN:67355
@SQ SN:scaffold_48 LN:66577
@SQ SN:scaffold_49 LN:65629
@SQ SN:scaffold_50 LN:63401
@SQ SN:scaffold_51 LN:63266
@SQ SN:scaffold_52 LN:59018
@SQ SN:scaffold_53 LN:55340
@SQ SN:scaffold_54 LN:54374
@SQ SN:scaffold_55 LN:50490
@SQ SN:scaffold_56 LN:47028
@SQ SN:scaffold_57 LN:46469
@SQ SN:scaffold_58 LN:45221
@SQ SN:scaffold_59 LN:43590
@SQ SN:scaffold_60 LN:43313
@SQ SN:scaffold_61 LN:39687
@SQ SN:scaffold_62 LN:38880
@SQ SN:scaffold_63 LN:36644
@SQ SN:scaffold_64 LN:36299
@SQ SN:scaffold_65 LN:36037
@SQ SN:scaffold_66 LN:32450
@SQ SN:scaffold_67 LN:30996
@SQ SN:scaffold_68 LN:29908
@SQ SN:scaffold_69 LN:29423
@SQ SN:scaffold_70 LN:28937
@SQ SN:scaffold_71 LN:28038
@SQ SN:scaffold_72 LN:26265
@SQ SN:scaffold_73 LN:25979
@SQ SN:scaffold_74 LN:25574
@SQ SN:scaffold_75 LN:25288
@SQ SN:scaffold_76 LN:23213
@SQ SN:scaffold_77 LN:22385
@SQ SN:scaffold_78 LN:21742
@SQ SN:scaffold_79 LN:21191
@SQ SN:scaffold_80 LN:20468
@SQ SN:scaffold_81 LN:20314
@SQ SN:scaffold_82 LN:20067
@SQ SN:scaffold_83 LN:18413
@SQ SN:scaffold_84 LN:15458
@SQ SN:scaffold_85 LN:13564
@SQ SN:scaffold_86 LN:12875
@SQ SN:scaffold_87 LN:12675
@SQ SN:scaffold_88 LN:8671
USI-EAS39:1:1:2:1362#0/1_3[0] 65 chromosome_2 28825 30 35M * 0 0 CTGGGTTCCACAGGCACATAGCCAAACCGGTGCCT ``]XD\a``b`b`^aa`aaaa`a]^aaa\UZ^^a^ NM:i:1 MD:Z:4T30
USI-EAS39:1:1:2:1276#0/1_3[0] 65 chromosome_12 6073621 30 35M * 0 0 GTTCGCTTTACACCGTAACATATTCAGCCAAATGC ]_aWDK\bbaaba_Y]ababbaaa`ba`abbab]` NM:i:0 MD:Z:35
USI-EAS39:1:1:2:1649#0/1_3[0] 81 chromosome_17 6301050 30 35M * 0 0 GCTCCAACAACAAGACCTCCTGACATAAGACTCAC ^`_a`a`P__Xaaa]_X^``aa`bbbbba^D[a^a NM:i:1 MD:Z:30G4

I generated the sam file using:
samtools view -h -S -t ~/binf/seq/genome/chlre4/Chlre4_genomic_scaffolds.fasta.fai -o test_sorted.sam soap_mapped_reads.sam

If I run cufflinks, I get the error:

cufflinks -G ~/binf/sto1/models.gtf test_sorted.sam

[bam_header_read] EOF marker is absent.
[bam_header_read] invalid BAM binary header (this is not a BAM file).
File test_sorted.sam doesn't appear to be a valid BAM file, trying SAM...
[23:01:46] Loading reference annotation.
[23:01:51] Inspecting reads and determining fragment length distribution.
> Processing Locus chromosome_17:6295360-62958 [**** ] 18%
Error: this SAM file doesn't appear to be correctly sorted!
current hit is at chromosome_15:2745584, last one was at chromosome_17:6301049
Cufflinks requires that if your file has SQ records in
the SAM header that they appear in the same order as the chromosomes names
in the alignments.
If there are no SQ records in the header, or if the header is missing,
the alignments must be sorted lexicographically by chromsome
name and by position.

So, the sam file is not correctly sorted.
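The ordering rule quoted in the error message can be checked before running cufflinks. Below is a minimal sketch, not cufflinks' own check: it assumes a tab-delimited SAM file whose header lists every reference in an @SQ line, and the function name check_sam_order is made up for illustration. It reports the first mapped alignment whose reference comes earlier in the header than the previous one, or whose position goes backwards within the same reference.

```shell
# Sketch: report the first alignment that breaks the header's @SQ order.
# Assumes tab-delimited SAM with every reference named in an @SQ line.
check_sam_order() {
    awk -F'\t' '
        # Record the rank (1, 2, 3, ...) of each reference as listed in the header.
        /^@SQ/ { for (i = 2; i <= NF; i++) if ($i ~ /^SN:/) rank[substr($i, 4)] = ++n; next }
        # Skip other header lines and unmapped reads.
        /^@/ || $3 == "*" { next }
        {
            r = rank[$3]; p = $4 + 0
            if (r < last_r || (r == last_r && p < last_p)) {
                printf "out of order at line %d: %s:%d after %s:%d\n", NR, $3, p, last_name, last_p
                bad = 1; exit
            }
            last_r = r; last_p = p; last_name = $3
        }
        # Exit status 1 if a violation was found, 0 otherwise.
        END { exit bad + 0 }
    ' "$1"
}
```

For example, `check_sam_order test_sorted.sam` prints the offending line and returns a non-zero status if the file is out of order.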
If I sort using:
sort -k 3,3 -k 4,4n test_sorted.sam > test2_sorted.sam

I lose the header, and get the same error when attempting to run cufflinks:
[23:08:38] Loading reference annotation.
[23:08:43] Inspecting reads and determining fragment length distribution.
> Processing Locus scaffold_88:8392-8579 [************************ ] 98%
Error: this SAM file doesn't appear to be correctly sorted!
current hit is at scaffold_82:5078, last one was at scaffold_81:18777
Cufflinks requires that if your file has SQ records in

I guess the question is how to sort 'chromosome_1'?
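One possible way to get the order the error message asks for is to keep the header untouched and sort only the alignments, ranking each reference by its position in the @SQ lines rather than alphabetically. This is a rough sketch under assumptions: the file is tab-delimited, every reference appears in the header, and the function name sort_sam_by_header is hypothetical. Reads with no listed reference are placed last.

```shell
# Sketch: sort a SAM file so alignments follow the @SQ order of its own header.
# Assumes tab-delimited SAM with every reference named in an @SQ line.
sort_sam_by_header() {
    local sam="$1"
    # 1. Emit the header lines unchanged, in their original order.
    grep '^@' "$sam"
    # 2. Prefix each alignment with the header rank of its reference,
    #    sort by rank and then by position, and drop the helper column.
    #    After the prefix, POS moves from column 4 to column 5.
    awk -F'\t' -v OFS='\t' '
        /^@SQ/ { for (i = 2; i <= NF; i++) if ($i ~ /^SN:/) rank[substr($i, 4)] = ++n; next }
        /^@/   { next }
        { r = ($3 in rank) ? rank[$3] : n + 1; print r, $0 }
    ' "$sam" | LC_ALL=C sort -t "$(printf '\t')" -k1,1n -k5,5n | cut -f2-
}
```

Usage would look like `sort_sam_by_header test_sorted.sam > test2_sorted.sam`. Unlike a plain `sort -k 3,3 -k 4,4n`, this never mixes header lines into the alignments and does not depend on how names like chromosome_2 and chromosome_10 compare as strings.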

charles
peromhc

Senior Member

Join Date: Sep 2009

Posts: 108
#2

06-15-2011, 08:37 PM

I had an identical error message when I had colons ":" in my original fasta file. I'd check that if I were you!
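A quick way to run that check is to list any FASTA header lines containing a colon. This is only a sketch; the function name fasta_headers_with_colon is made up for illustration.

```shell
# Sketch: print FASTA header lines (those starting with '>') that contain a colon.
# Returns a non-zero status when no such header exists.
fasta_headers_with_colon() {
    grep '^>.*:' "$1"
}
```

Running it on the genome fasta file prints nothing if no sequence name contains a colon.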
crh

Member

Join Date: Dec 2009

Posts: 46
#3

06-16-2011, 06:45 AM

I've checked, and there are no colons in the genomic fasta file.
charles