**GenoMax** · 09-30-2015, 09:08 AM

How are you handling multimappers in the two cases? Perhaps in case B more reads multimap than in case A and are being discarded. If not all of your reads are making it into the SAM file (you are likely not including unmapped reads in your SAM?), that may explain the size difference. As a rule of thumb, though, don't worry about file sizes for NGS data as long as all the reads are accounted for.
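
One way to check whether the reads are "all accounted for" is to tally SAM records by their FLAG field. The sketch below is only an illustration: the function names are made up for this example, and the bit values (0x4 unmapped, 0x100 secondary, 0x800 supplementary) are taken from the SAM specification. It assumes plain-text SAM lines, for instance the output of `samtools view -h`.

```python
from collections import Counter

# FLAG bits as defined in the SAM specification
UNMAPPED = 0x4
SECONDARY = 0x100
SUPPLEMENTARY = 0x800


def tally_sam_records(lines):
    """Count SAM records by category, skipping '@' header lines.

    `lines` is any iterable of SAM text lines (hypothetical helper).
    """
    counts = Counter()
    for line in lines:
        if not line.strip() or line.startswith("@"):
            continue  # blank line or header line
        flag = int(line.split("\t")[1])  # FLAG is the second column
        if flag & (SECONDARY | SUPPLEMENTARY):
            counts["secondary_or_supplementary"] += 1
        elif flag & UNMAPPED:
            counts["unmapped"] += 1
        else:
            counts["mapped_primary"] += 1
    return counts


def primary_records(counts):
    """Records that correspond one-to-one with input reads."""
    return counts["mapped_primary"] + counts["unmapped"]
```

If the aligner wrote every read, `primary_records(...)` should equal the number of reads in the fastq input (each mate counts separately for paired-end data). Secondary and supplementary lines are extra records for reads that are already counted.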

**scami** · 09-30-2015, 09:14 AM

Hi GenoMax,

Thanks for your reply. The parameters are exactly the same for the two alignments. I used bwa with the default settings, except that I set the edit distance to 0.05. I then used samtools to convert SAM to BAM, and to merge and sort the BAM files. In the past bwa produced a SAM file that included all the reads of the input fastq file, and I guess this has not changed. If so, the SAM file, which contains both the sequence and the quality string plus other fields, should not be one third the length of the input fastq file, as it is in my case.
Thanks.

**GenoMax** · 09-30-2015, 09:20 AM

Have you checked your sam files using samtools idxstats?

See this recent thread for a reference on the BAM/SAM file size issue (especially when the files are sorted).

**scami** · 09-30-2015, 10:42 PM

Thanks for the advice. I ran the command you suggested. I divided the number of lines in my fastq files to get the number of reads, which gave a total of 187,595,025 reads. The sum of all mapped and unmapped reads in the BAM file is 65,536,000. I cannot understand where all the other reads went! Also, in the idxstats output, what is the difference between the unmapped reads listed for each chromosome and the unmapped reads at the end of the table (the row with chromosome *)?
This is my idxstats output:

chr1 23037639 2881935 72869
chr10 18140952 2435272 74191
chr11 19818926 2692338 74879
chr12 22702307 2797718 99784
chr13 24396255 2992318 96653
chr14 30274277 3751313 116119
chr15 20304914 2609991 112248
chr16 22053297 2630989 99628
chr17 17126926 2166178 70746
chr18 29360087 3792840 114974
chr19 24021853 3113695 120154
chr2 18779844 2507134 89904
chr3 19341862 2276796 82991
chr4 23867706 2918446 100922
chr5 25021643 3282028 116137
chr6 21508407 2523489 76380
chr7 21026613 2756721 79519
chr8 22385789 2801900 69991
chr9 23006712 3440921 151837
* 0 0 9344052

Thanks again for your help!
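
For anyone wanting to reproduce the sum, here is a minimal sketch that totals an idxstats table. The function name and the returned keys are invented for this example. It assumes the usual four whitespace-separated columns (reference name, sequence length, mapped count, unmapped count). As far as I understand, the fourth column on a chromosome row counts unmapped reads that still carry a position on that chromosome (typically because their mate mapped there), while the `*` row counts unmapped reads with no position at all.

```python
def summarize_idxstats(text):
    """Total the mapped and unmapped columns of `samtools idxstats` output.

    `text` is the raw idxstats output as one string (hypothetical helper).
    """
    mapped = placed_unmapped = unplaced_unmapped = 0
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 4:
            continue  # skip blank or malformed lines
        name, _length, n_mapped, n_unmapped = fields
        mapped += int(n_mapped)
        if name == "*":
            unplaced_unmapped += int(n_unmapped)
        else:
            placed_unmapped += int(n_unmapped)
    return {
        "mapped": mapped,
        "placed_unmapped": placed_unmapped,
        "unplaced_unmapped": unplaced_unmapped,
        "total": mapped + placed_unmapped + unplaced_unmapped,
    }
```

Applied to the table above, this gives 54,372,022 mapped, 1,819,926 placed unmapped and 9,344,052 unplaced unmapped reads, for a total of 65,536,000, which matches the figure quoted in the post.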

**GenoMax** · 10-01-2015, 04:27 AM

Did you divide the number of lines by 4 to arrive at the ~187M number?

Here is a useful thread that explains the idxstats output: https://www.biostars.org/p/14569/
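
The divide-by-4 count can be sketched as below. This is only an illustration with a made-up function name, and it assumes the common four-line fastq layout (header, sequence, `+` separator, quality) with no wrapped sequence lines.

```python
def count_fastq_reads(lines):
    """Count reads in four-line fastq records, with basic sanity checks.

    `lines` is any iterable of text lines, e.g. an open file handle.
    """
    n_lines = 0
    n_reads = 0
    for i, line in enumerate(lines):
        n_lines += 1
        position = i % 4  # 0 = header, 1 = sequence, 2 = '+', 3 = quality
        if position == 0:
            if not line.startswith("@"):
                raise ValueError(f"line {i + 1}: expected '@' header")
            n_reads += 1
        elif position == 2 and not line.startswith("+"):
            raise ValueError(f"line {i + 1}: expected '+' separator")
    if n_lines % 4 != 0:
        raise ValueError("line count is not a multiple of 4 (truncated file?)")
    return n_reads
```

For a plain-text file this would be called as `count_fastq_reads(open("reads.fastq"))`, where `reads.fastq` is a placeholder name. For gzipped input, `gzip.open(path, "rt")` from the standard library gives a compatible handle.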

**scami** · 10-01-2015, 05:25 AM

Yes, I did. I also thought I might have messed up the reference genome file while removing the unwanted chromosomes with my Python script. However, I afterwards calculated the GC content of the chromosomes in the original and in the "stripped" reference file, and they correspond exactly.
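
A check of this kind could look roughly like the sketch below. It is not the script used here, just an illustration with made-up names. It reports the length alongside the GC fraction for each FASTA record, since matching GC content alone does not prove two sequences are identical. GC is computed over A, C, G and T only, so N bases count toward the length but not the fraction.

```python
def fasta_length_and_gc(lines):
    """Return {sequence name: (length, GC fraction)} for FASTA text lines."""
    stats = {}  # name -> [length, gc_count, acgt_count]
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            parts = line[1:].split()
            name = parts[0] if parts else ""  # name is the first header word
            stats[name] = [0, 0, 0]
        elif name is not None:
            seq = line.upper()
            gc = seq.count("G") + seq.count("C")
            at = seq.count("A") + seq.count("T")
            stats[name][0] += len(seq)
            stats[name][1] += gc
            stats[name][2] += gc + at
    return {
        n: (length, gc / acgt if acgt else 0.0)
        for n, (length, gc, acgt) in stats.items()
    }


def shared_sequences_match(original, stripped):
    """True if every sequence kept in `stripped` has the same stats in `original`."""
    return all(original.get(name) == value for name, value in stripped.items())
```

Running `fasta_length_and_gc` on both reference files and passing the two results to `shared_sequences_match` shows whether the retained chromosomes agree in both length and GC fraction.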


Sam file smaller than fastq
