**swbarnes2** · 09-09-2011, 08:31 AM

I don't know about the reads with 2 and 8 both flagged, but you shouldn't need to do eight searches. In one command line you can specify both the flags a read must have (`-f`) and the flags it must not have (`-F`). For instance, `samtools view -f 8 -F 4` should get all the mapped reads whose mates aren't mapped. You don't care whether the read is the first or second in the pair, or about its direction, so you don't filter on those flags at all.
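The bitwise test behind `-f` and `-F` can be sketched in a few lines of Python. This is only an illustration of the logic, not samtools code; the function name `passes_filter` and the example FLAG values are made up for the demonstration.

```python
def passes_filter(flag: int, require: int = 0, exclude: int = 0) -> bool:
    """Mimic the effect of `samtools view -f REQUIRE -F EXCLUDE` on one FLAG."""
    # -f: every bit in `require` must be set in the flag
    # -F: no bit in `exclude` may be set in the flag
    return (flag & require) == require and (flag & exclude) == 0


# Illustrative FLAG values (not taken from any real file):
#   73  = 1 + 8 + 64       paired, mate unmapped, first in pair (read itself mapped)
#   133 = 1 + 4 + 128      paired, read unmapped, second in pair
#   99  = 1 + 2 + 32 + 64  proper pair, both mates mapped
for flag in (73, 133, 99):
    print(flag, passes_filter(flag, require=8, exclude=4))
```

Only the first value passes: it has bit 8 set and bit 4 clear, which is exactly the "mapped read, unmapped mate" case.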

**Ash** · 09-12-2011, 05:53 AM

Many thanks for your response, swbarnes2. I should have picked up on -F being the opposite of -f! I've been playing around with what you suggested and have a few more questions, if I may:

1) Using your suggestion of -f 8 -F 4 certainly cuts down on the number of reads. Looking at the other criteria I can cut down on, I would also like to exclude those reads that fail platform/vendor quality checks and those reads that are either a PCR or an optical duplicate. Am I right in thinking that I should use -f 8 -F 1540? (4 + 512 + 1024 = 1540)

2) In the results that I get using -f 8 -F 4, POS and MPOS are always the same and ISIZE is always 0. I take it this means that POS and MPOS represent the position of the read that did map, and that MPOS is not actually the location of the read that didn't map (as it shouldn't have a position!). So these results are the unmapped reads only?

3) As an aside, I am also interested in those reads whose mates mapped unexpectedly far away (say 1 kb or more downstream/upstream). Is there any way of filtering by MPOS or ISIZE? Or is there a better way of doing this?

Apologies if these questions have already been asked - I've not found anything in the forums or the manual, but I may be missing something...

Many thanks,

A

**swbarnes2** · 09-12-2011, 08:50 AM

Originally posted by Ash:

1) Using your suggestion of -f 8 -F 4 certainly cuts down on the number of reads. Looking at the other criteria I can cut down on, I would also like to exclude those reads that fail platform/vendor quality checks and those reads that are either a PCR or an optical duplicate. Am I right in thinking that I should use -f 8 -F 1540? (4 + 512 + 1024 = 1540)

Sure, if your bam file actually has those categories flagged.
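As a quick check of the arithmetic, the `-F` value can be built from named bits instead of being added by hand. The bit values are those given in the SAM specification; the helper `any_flag_uses` and the example FLAG list are hypothetical, standing in for column 2 of a real file.

```python
# SAM FLAG bits (values from the SAM specification)
READ_UNMAPPED = 4
MATE_UNMAPPED = 8
QC_FAIL = 512
DUPLICATE = 1024

# Value for -F: a read is dropped if ANY of these bits is set
exclude_mask = READ_UNMAPPED | QC_FAIL | DUPLICATE
print(exclude_mask)  # 1540


def any_flag_uses(flags, mask):
    """True if at least one FLAG value has a bit from `mask` set."""
    return any(flag & mask for flag in flags)


# Made-up FLAG values, for illustration only
example_flags = [73, 89, 1097, 137]
print(any_flag_uses(example_flags, DUPLICATE))  # True: 1097 = 1 + 8 + 64 + 1024
print(any_flag_uses(example_flags, QC_FAIL))    # False: no value has bit 512
```

If a check like the last one comes back False across a whole file, the aligner or post-processing step probably never set that flag, and excluding it changes nothing.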

2) In the results that I get using -f 8 -F 4, POS and MPOS are always the same and ISIZE is always 0. I take it this means that POS and MPOS represent the position of the read that did map, and that MPOS is not actually the location of the read that didn't map (as it shouldn't have a position!). So these results are the unmapped reads only?

The SAM spec calls for the unmapped mate of a mapped read to be given the mapping coordinates of the mapped mate, so that the two will sort together. So yes, what you are seeing is totally normal and expected. The only way to know a read is unmapped is to look at the flag: if it has the 4 flag, it's unmapped, no matter what anything else in the line might imply.
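To make that concrete, here is a small sketch with two hand-written SAM records for one pair. The records, read names, and sequences are invented for illustration; only the column layout (FLAG in column 2, RNAME in column 3, POS in column 4) comes from the SAM format.

```python
def parse_sam_core(line):
    """Return (flag, rname, pos) from one tab-separated SAM record."""
    fields = line.rstrip("\n").split("\t")
    return int(fields[1]), fields[2], int(fields[3])


def is_unmapped(line):
    """A read is unmapped if and only if FLAG bit 4 is set."""
    flag, _, _ = parse_sam_core(line)
    return bool(flag & 4)


# Hypothetical pair: the first mate mapped, the second did not.
mapped_read = "pair1\t73\tchr1\t1000\t60\t10M\t=\t1000\t0\tACGTACGTAC\tIIIIIIIIII"
unmapped_mate = "pair1\t133\tchr1\t1000\t0\t*\t=\t1000\t0\tTTGCATTGCA\tIIIIIIIIII"

for record in (mapped_read, unmapped_mate):
    flag, rname, pos = parse_sam_core(record)
    print(flag, rname, pos, "unmapped" if is_unmapped(record) else "mapped")
```

Both records carry `chr1` and position 1000, so the coordinates alone cannot tell them apart; only the flag does.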

3) As an aside, I am also interested in those reads whose mates mapped unexpectedly far away (say 1 kb or more downstream/upstream). Is there any way of filtering by MPOS or ISIZE? Or is there a better way of doing this?

You could try -F 14 (2 + 4 + 8), which should only get reads where both mates mapped but are not a "proper pair". I'm not quite sure what the definition of a proper pair is, and it's probably going to depend on the software that is making the .bam, but reads that span a much larger than average or expected length should qualify.

You could also convert to sam, and use awk, or something like that. I don't know how to do that on a .bam.
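A rough Python equivalent of the "convert to SAM and filter the text" idea is sketched below. It assumes tab-separated SAM records with the insert size (ISIZE, called TLEN in later versions of the spec) in column 9; the function name `mates_far_apart`, the 1000 bp default, and the example records are all made up for illustration.

```python
def mates_far_apart(sam_lines, min_dist=1000):
    """Yield records whose mate mapped at least `min_dist` away or on another reference."""
    for line in sam_lines:
        if line.startswith("@"):        # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        # Need a paired read (bit 1) with read and mate both mapped (bits 4 and 8 clear)
        if not flag & 1 or flag & (4 | 8):
            continue
        rname, rnext, tlen = fields[2], fields[6], int(fields[8])
        if rnext not in ("=", rname):   # mate is on a different reference
            yield line
        elif abs(tlen) >= min_dist:     # same reference, large insert size
            yield line


# Invented records for demonstration
example = [
    "@HD\tVN:1.0\tSO:coordinate",
    "r1\t97\tchr1\t1000\t60\t10M\t=\t5000\t4010\tACGTACGTAC\tIIIIIIIIII",
    "r2\t99\tchr1\t2000\t60\t10M\t=\t2200\t210\tACGTACGTAC\tIIIIIIIIII",
    "r3\t73\tchr1\t3000\t60\t10M\t=\t3000\t0\tACGTACGTAC\tIIIIIIIIII",
    "r4\t65\tchr1\t4000\t60\t10M\tchr2\t500\t0\tACGTACGTAC\tIIIIIIIIII",
]
for record in mates_far_apart(example):
    print(record.split("\t")[0])
```

On real data the same function could be fed the lines of `sys.stdin`, with the text output of `samtools view` piped in. Mates on different references are handled separately because their insert size is reported as 0.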

**Ash** · 09-14-2011, 05:31 AM

Aaaah... now I see! So everything has to be searched from the flag field, and using -f and -F should give you every possible combination you need (with there being some redundancy in the system).

With regard to the unmapped mates, the file sizes I'm getting out vary from 164 MB to 44 GB. Why is there such a difference? Occasionally I get a truncated file message (usually because the connection to the FTP site has been interrupted), but more usually I don't get an error message. Is this kind of variation between these 1000 Genomes files normal, or is something going wrong?

With regard to those reads that are not in a proper pair, again there's a lot of variation in the file sizes (1 GB to 10 GB). Is this also normal? Is there any way of specifying certain positions in the genomes that I'm interested in (I'm thinking this would cut down on the file sizes and the time it's taking)?

Many thanks,

A

**swbarnes2** · 09-14-2011, 08:28 AM

samtools view allows a region to be specified. So does samtools mpileup. BEDTools can give you the intersection between a .bed file and a .bam file.

**Ash** · 09-15-2011, 06:03 AM

Thanks, swbarnes2. I was hoping there was a way of specifying more than one position at a time, but I guess I can get round that with a Perl script.

All the best,

A
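For what it's worth, `samtools view` accepts more than one region after the file name, so a wrapper script may only need to assemble the argument list. The sketch below builds such a command without running it; the helper name `build_view_command`, the file name `sample.bam`, and the coordinates are placeholders, and region queries assume the BAM is coordinate-sorted and indexed.

```python
def build_view_command(bam_path, regions, require=0, exclude=0):
    """Assemble an argument list for `samtools view` with optional flag filters."""
    cmd = ["samtools", "view", "-b"]      # -b: write BAM output
    if require:
        cmd += ["-f", str(require)]       # bits that must be set
    if exclude:
        cmd += ["-F", str(exclude)]       # bits that must not be set
    cmd.append(bam_path)
    cmd.extend(regions)                   # one or more regions, e.g. chr1:10000-20000
    return cmd


# Placeholder file name and coordinates, for illustration only
cmd = build_view_command(
    "sample.bam",
    ["chr1:10000-20000", "chr2:500000-600000"],
    exclude=14,
)
print(" ".join(cmd))
```

The resulting list could be handed to `subprocess.run` on a machine where samtools is installed; it is kept as a list rather than a single string so that no shell quoting is needed.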


Finding reads where the mate is unmapped
