SEQanswers

Old 05-25-2012, 07:58 AM   #1
ians
Member
 
Location: St. Louis, MO

Join Date: Aug 2011
Posts: 53
sorted bam larger than unsorted bam

Recently I've sorted two different alignments, and in both cases the sorted BAM is ~3x larger on disk than the unsorted one. I ran the sort twice and got the same result.
Does anyone have a suspicion of what is going on here?
Old 05-26-2012, 09:05 AM   #2
xied75
Senior Member
 
Location: Oxford

Join Date: Feb 2012
Posts: 129

1. Is the output BAM uncompressed? (Please show the command line.)
2. Are you sorting by name or by coordinate? Could it be that the BGZF blocks don't compress as well after the sort? (Very unlikely, though.)
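
Both questions can be checked from the shell. The following is a minimal sketch, assuming a POSIX shell with `gzip` and `awk` available; the function names `bam_ratio` and `sort_order` are made up for illustration. It relies on BAM being stored as BGZF, which is a series of concatenated gzip members, so plain `gzip -dc` can decompress it.

```shell
# Print compressed size, uncompressed size, and their ratio for a BGZF/gzip file.
# A ratio close to 1 suggests the file was written without compression.
bam_ratio() {
    comp=$(wc -c < "$1")
    raw=$(gzip -dc "$1" | wc -c)
    awk -v c="$comp" -v r="$raw" 'BEGIN { printf "%d %d %.2f\n", c, r, r / c }'
}

# Read SAM header text on stdin and print the SO: value of the @HD line,
# or "unknown" if the header does not declare a sort order.
sort_order() {
    awk -F'\t' '$1 == "@HD" { for (i = 2; i <= NF; i++) if ($i ~ /^SO:/) { print substr($i, 4); found = 1 } }
                END { if (!found) print "unknown" }'
}
```

With placeholder file names, usage would look like `bam_ratio unsorted.bam; bam_ratio sorted.bam` and `samtools view -H sorted.bam | sort_order`. Similar uncompressed sizes with very different compressed sizes would point at compression rather than at extra or duplicated records.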
Old 05-29-2012, 07:06 AM   #3
ians
Member
 
Location: St. Louis, MO

Join Date: Aug 2011
Posts: 53

Here's what I'm running:

Code:
time -v samtools sort $mergedBam $sortedBam
Another run yielded the same result.
Old 05-30-2012, 05:02 AM   #4
jkbonfield
Senior Member
 
Location: Cambridge, UK

Join Date: Jul 2008
Posts: 146

Do you have a combination of shallow coverage and excessively long read names? That can cause the sorting to be of little benefit to sequence and positional compression while also being detrimental to read name compression.
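
The read-name effect can be illustrated without any real data. This is only a rough sketch under stated assumptions: the names are synthetic (modelled loosely on Illumina-style names), the helper functions `make_names`, `scramble`, and `gz_size` are made up for the example, and plain `gzip` stands in for BGZF. It compresses the same set of names twice, once in generation order and once in a scattered order like the one a coordinate sort tends to produce.

```shell
# Generate n synthetic read names in instrument order; neighbouring names
# differ only in their last few digits.
make_names() {
    awk -v n="$1" 'BEGIN {
        for (i = 0; i < n; i++)
            printf "HWI-ST1063_0137:6:%d:%d:%d#0\n", 1101 + int(i / 1000), 1000 + (i % 1000) * 17, 2000 + i * 3
    }'
}

# Deterministically scramble line order (a fixed permutation, not a real
# coordinate sort), so that adjacent names are no longer near-identical.
scramble() {
    awk '{ printf "%d\t%s\n", (NR * 7919) % 10007, $0 }' | sort -n | cut -f2-
}

# Size in bytes of stdin after gzip compression.
gz_size() {
    gzip -c | wc -c | tr -d ' '
}

ordered=$(make_names 10000 | gz_size)
scattered=$(make_names 10000 | scramble | gz_size)
echo "names in original order: $ordered bytes compressed"
echo "names after scrambling:  $scattered bytes compressed"
```

The input bytes are identical in both runs apart from their order, so any size difference comes from how well neighbouring names resemble each other. Whether this outweighs the gains on sequence and position fields in a real BAM depends on coverage, as described above.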
Old 05-30-2012, 05:51 AM   #5
ians
Member
 
Location: St. Louis, MO

Join Date: Aug 2011
Posts: 53

The headers (read names) aren't anything out of the ordinary. Here are a few entries from the sorted BAM:

Code:
HWI-ST1063_0137:6:1308:10615:65342#0	161	chr1	765	59	95M6S	=	1019	354	TGACGGACTACATGAGATAGAAGAGAGAATTTTGGGAGCAGAAGATATCATAGAAAACATTGACACAACCTTCAAAGAGAACGTAAATAGGAAAAAGCTCC	?=++4B0@FHDB:<+C::CE@HC>?38C9CHAFHGEDHIJCDAFCHGI@FCGBH@@AFEGIIEEHHGH??DFD@>@CA3;?A>A;<<BAC:@CDD######	PG:Z:novoalign	AS:i:3UQ:i:36	NM:i:0	MD:Z:95
HWI-ST1063_0137:6:1308:10615:65342#0	81	chr1	1019	1	101M	=	765	-354	ATAAATGTCCATAAGTAGACATGAAGCCTGCAGAATTCCAAATAGAATGGACCAGAAAATAAATTCCTCCTGTCACATAATAGTCAAAACACCAAATGCAC	>>;>;5;@;;A;7.))7;7@C=?CA;@DAA=ADC;ECC=DB@IED<DBB?0D99??0499:<9DDC88:2<F9FAFDCA?CCBEE<:C<?DDABDB+??1?	PG:Z:novoalign	AS:i:1UQ:i:14	NM:i:1	MD:Z:14A86	CC:Z:=	CP:i:71209	ZS:Z:R	ZN:i:3	NH:i:3	HI:i:1	IH:i:3
HWI-ST1063_0137:6:2303:18332:58082#0	419	chr1	1044	1	13S88M	=	1151	210	GAACATAGAAGAAGCCTGCAGAATTCCAAATAGAATGGACCAGAAAATAAATTCCTCCTGTCACATAATAGTCAAAACACCAAATGCACAAAACAAAGAAT	CCCFFFFFHHHHHJJJJJJJGGJJJJJJIGIGJJJIJIIJJJJJJJJIHIEHIDIGIIIIGIGGIJG@GGGGIIIJJHHHHFFFEAACCEEBDDDDACDDC	PG:Z:novoalign	AS:i:105	UQ:i:105	NM:i:0	MD:Z:88	PQ:i:135	SM:i:70	AM:i:70	ZS:Z:R	ZN:i:3	NH:i:3	HI:i:3	IH:i:3
This is an RNA-seq project, and we have a sufficient number of reads to get deep sampling. I suspect something is going wrong at this step, because downstream I am getting 20% coverage over the genome with this sample...


I aligned using novoalign, if that is of any consequence.
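
To put a number on whether read names like the ones quoted above are unusually long, one could measure how much of each record they take up. A minimal sketch, assuming `awk` is available; the function name `qname_share` is hypothetical. It works on SAM text, so the percentage is only a rough guide, since BAM packs sequence and other fields in binary form.

```shell
# Read SAM records on stdin (header lines are skipped); print the record
# count, the mean read-name length, and the percentage of all record
# characters that belong to read names.
qname_share() {
    awk -F'\t' '!/^@/ { n++; q += length($1); t += length($0) }
        END { if (n) printf "%d %.1f %.1f\n", n, q / n, 100 * q / t }'
}
```

With a placeholder file name, usage would be along the lines of `samtools view sorted.bam | head -n 100000 | qname_share`.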
All times are GMT -8.