Hi,
I have been reading Seqanswers for quite some time, but this my first question.
Does someone has experience with BAM files from UCSC-encode project? I am interested in a small RNASeq dataset from CSHL and I found some issues in the BAM files hosted at UCSC.
It seems there are multiple alignments (2) for the same read, but not flagged as primary and secondary. In the first example bellow, same read name appears twice with same flag (0) and same alignment score (254). In the second example the duplicated read name is the reverse complement sequence.
Related to it, the alignment scores are also different from what I have normally seen in BAM files. It ranges from 246 to 255:
Is this normal? Am I missing something? It seems to happen only for a small percentage of reads. Would you just ignore those with multiple alignments for further analysis? If so, how would you filter them out?
Thanks!!
I have been reading Seqanswers for quite some time, but this my first question.
Does someone has experience with BAM files from UCSC-encode project? I am interested in a small RNASeq dataset from CSHL and I found some issues in the BAM files hosted at UCSC.
It seems there are multiple alignments (2) for the same read, but not flagged as primary and secondary. In the first example bellow, same read name appears twice with same flag (0) and same alignment score (254). In the second example the duplicated read name is the reverse complement sequence.
Code:
# Number of multiple alignments in a sample
samtools view ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeCshlShortRnaSeq//wgEncodeCshlShortRnaSeqA549CellShorttotalTapAlnRep1.bam | head -n 10000| awk '{print $1}' | sort | uniq -c | awk '{print $1}' | sort | uniq -c
9958 1
21 2
# Example 1
view ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeCshlShortRnaSeq/wgEncodeCshlShortRnaSeqA549CellShorttotalTapAlnRep1.bam | grep -m 2 'TUPAC_0038:7:110:15042:19226#0/1'
TUPAC_0038:7:110:15042:19226#0/1 0 chr1 45243516 254 34M2S * 0 0 CTCGTGATGAAAACTTTGTCCAGTTCTGCTACTGAA Ycacc\KW_RSLWSVMTTYT]a\a_`_KXK\Z\RQ_
TUPAC_0038:7:110:15042:19226#0/1 0 chr1 45244067 254 3S31M2S * 0 0 CTCGTGATGAAAACTTTGTCCAGTTCTGCTACTGAA Ycacc\KW_RSLWSVMTTYT]a\a_`_KXK\Z\RQ_
# Example 2
samtools view ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeCshlShortRnaSeq/wgEncodeCshlShortRnaSeqA549CellShorttotalTapAlnRep1.bam | grep -m 2 'TUPAC_0038:7:88:9054:14403#0/1'
TUPAC_0038:7:88:9054:14403#0/1 16 chr1 234792 246 18S18M * 0 0 CCGATCTTTTTTTTTTTCTAAGGACATCCTAAAGGA ghfeecchhhhfhhhhhhghhhhhhhhhhhhhhhhh
TUPAC_0038:7:88:9054:14403#0/1 0 chr1 464897 246 18M18S * 0 0 TCCTTTAGGATGTCCTTAGAAAAAAAAAAAGATCGG hhhhhhhhhhhhhhhhhghhhhhhfhhhhcceefhg
Related to it, the alignment scores are also different from what I have normally seen in BAM files. It ranges from 246 to 255:
Code:
samtools view ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeCshlShortRnaSeq/wgEncodeCshlShortRnaSeqA549CellShorttotalTapAlnRep1.bam | head -n 10000 | awk '{print $5}' | sort | uniq -c
13 246
5 247
149 248
11 249
19 250
26 251
93 252
9587 253
77 254
20 255
Is this normal? Am I missing something? It seems to happen only for a small percentage of reads. Would you just ignore those with multiple alignments for further analysis? If so, how would you filter them out?
Thanks!!