Hi folks,
Apologies in advance for the rambling post!
Out of the paired-end reads in the 1000 Genomes data, I am trying to locate those reads where one end of the pair is unmapped. I understand that I need to use the flag field for this. Originally I thought that the following would pull out what I needed:
./samtools view -f 8 ftp://ftp.1000genomes.ebi.ac.uk/vol1...e.20101123.bam > flagtest.txt
This resulted in a file that's around 50GB! Having a skim through this file, I noticed that the flag field varied considerably. I guess that by using -f 8, I asked samtools to pull out all those reads that have an unmapped mate, and that all the other bits can vary as they like?
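To check my understanding of -f 8, here is a minimal Python sketch of the test I assume samtools applies: a read is kept when every bit of the requested value is set in its flag, and all other bits are ignored. The function name and the list of example flags are made up for illustration; they are not taken from the BAM file.

```python
MATE_UNMAPPED = 0x8  # bit 8 of the SAM flag: the mate is unmapped


def passes_required(flag: int, required: int) -> bool:
    """Mimic my understanding of -f: all bits in `required` must be set."""
    return (flag & required) == required


# Hypothetical flag values, chosen only to illustrate the test.
example_flags = [73, 75, 89, 137, 99, 147]

kept = [f for f in example_flags if passes_required(f, MATE_UNMAPPED)]
print(kept)  # flags whose "mate is unmapped" bit is set
```

If that is right, it would explain why the flag column in my output varies so much: only bit 8 is pinned down.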
Using http://picard.sourceforge.net/explain-flags.html to decipher the SAM flags, I noticed that some of the flags (for example, 75) translate to:
the read is paired in sequencing
the read is mapped in a proper pair
the mate is unmapped
the read is the first read in a pair
I think I'm missing something here. How can a read be "mapped in a proper pair" and also have "the mate is unmapped"? I thought the two would be mutually exclusive?
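For reference, the decoding of 75 can be reproduced by hand. This is a small sketch that splits a flag into its bits using the bit values from the SAM specification; the helper name explain_flag is my own, and the descriptions are paraphrased.

```python
# Bit values as defined in the SAM specification; descriptions paraphrased.
FLAG_BITS = {
    0x1: "the read is paired in sequencing",
    0x2: "the read is mapped in a proper pair",
    0x4: "the query sequence itself is unmapped",
    0x8: "the mate is unmapped",
    0x10: "strand of the query is reverse",
    0x20: "strand of the mate is reverse",
    0x40: "the read is the first read in a pair",
    0x80: "the read is the second read in a pair",
}


def explain_flag(flag: int) -> list[str]:
    """Return the description of every bit that is set in `flag`."""
    return [desc for bit, desc in FLAG_BITS.items() if flag & bit]


for line in explain_flag(75):  # 75 = 64 + 8 + 2 + 1
    print(line)
```

So 75 really does have both the "proper pair" bit (2) and the "mate is unmapped" bit (8) set, which is the combination that confuses me.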
Am I right in thinking that I need to work out what the flag value should be for the various options, and search just for those reads that satisfy those criteria? For example:
the read is paired in sequencing
the mate is unmapped
strand of the query or strand of the query reverse
strand of the mate or strand of the mate reverse
the read is the first read in a pair or the read is the second read in a pair
So this would be 8 searches in total? Or is there a more sophisticated way to do this?
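To make the counting concrete, here is a sketch that enumerates the eight exact flag values from the list above and compares them with a single "required bits / excluded bits" test. The matches helper is hypothetical; I am assuming that the samtools -f option corresponds to the required mask and the -F option to the excluded mask, so that one mask-based search could stand in for eight exact-value searches.

```python
from itertools import product

PAIRED, UNMAPPED, MATE_UNMAPPED = 0x1, 0x4, 0x8
QUERY_REVERSE, MATE_REVERSE = 0x10, 0x20
FIRST_IN_PAIR, SECOND_IN_PAIR = 0x40, 0x80

# The eight exact values: 2 query strands x 2 mate strands x first/second read.
exact_flags = sorted(
    PAIRED | MATE_UNMAPPED | query | mate | end
    for query, mate, end in product(
        (0, QUERY_REVERSE), (0, MATE_REVERSE), (FIRST_IN_PAIR, SECOND_IN_PAIR)
    )
)
print(exact_flags)


def matches(flag: int, required: int, excluded: int = 0) -> bool:
    """Keep a flag if all `required` bits are set and no `excluded` bit is set."""
    return (flag & required) == required and (flag & excluded) == 0


required = PAIRED | MATE_UNMAPPED  # paired, and the mate is unmapped
excluded = UNMAPPED                # but the read itself is mapped

print(all(matches(f, required, excluded) for f in exact_flags))
```

One difference from the eight exact searches: the mask test also accepts flags such as 75, because the "proper pair" bit is left free unless it is added to the excluded mask.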
Apologies if this has already been asked - I have been searching through the threads (which are most informative), but without success.
All help gratefully received!
A
PS I also noticed that there seemed to be a lot of reads where the ISIZE column is 0 - is this to be expected because the mate is unmapped?