Hi,
I have been reading Seqanswers for quite some time, but this my first question.
Does someone has experience with BAM files from UCSC-encode project? I am interested in a small RNASeq dataset from CSHL and I found some issues in the BAM files hosted at UCSC.
It seems there are multiple alignments (2) for the same read, but not flagged as primary and secondary. In the first example bellow, same read name appears twice with same flag (0) and same alignment score (254). In the second example the duplicated read name is the reverse complement sequence.
Related to it, the alignment scores are also different from what I have normally seen in BAM files. It ranges from 246 to 255:
Is this normal? Am I missing something? It seems to happen only for a small percentage of reads. Would you just ignore those with multiple alignments for further analysis? If so, how would you filter them out?
Thanks!!
I have been reading Seqanswers for quite some time, but this my first question.
Does someone has experience with BAM files from UCSC-encode project? I am interested in a small RNASeq dataset from CSHL and I found some issues in the BAM files hosted at UCSC.
It seems there are multiple alignments (2) for the same read, but not flagged as primary and secondary. In the first example bellow, same read name appears twice with same flag (0) and same alignment score (254). In the second example the duplicated read name is the reverse complement sequence.
Code:
# Number of multiple alignments in a sample samtools view ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeCshlShortRnaSeq//wgEncodeCshlShortRnaSeqA549CellShorttotalTapAlnRep1.bam | head -n 10000| awk '{print $1}' | sort | uniq -c | awk '{print $1}' | sort | uniq -c 9958 1 21 2 # Example 1 view ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeCshlShortRnaSeq/wgEncodeCshlShortRnaSeqA549CellShorttotalTapAlnRep1.bam | grep -m 2 'TUPAC_0038:7:110:15042:19226#0/1' TUPAC_0038:7:110:15042:19226#0/1 0 chr1 45243516 254 34M2S * 0 0 CTCGTGATGAAAACTTTGTCCAGTTCTGCTACTGAA Ycacc\KW_RSLWSVMTTYT]a\a_`_KXK\Z\RQ_ TUPAC_0038:7:110:15042:19226#0/1 0 chr1 45244067 254 3S31M2S * 0 0 CTCGTGATGAAAACTTTGTCCAGTTCTGCTACTGAA Ycacc\KW_RSLWSVMTTYT]a\a_`_KXK\Z\RQ_ # Example 2 samtools view ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeCshlShortRnaSeq/wgEncodeCshlShortRnaSeqA549CellShorttotalTapAlnRep1.bam | grep -m 2 'TUPAC_0038:7:88:9054:14403#0/1' TUPAC_0038:7:88:9054:14403#0/1 16 chr1 234792 246 18S18M * 0 0 CCGATCTTTTTTTTTTTCTAAGGACATCCTAAAGGA ghfeecchhhhfhhhhhhghhhhhhhhhhhhhhhhh TUPAC_0038:7:88:9054:14403#0/1 0 chr1 464897 246 18M18S * 0 0 TCCTTTAGGATGTCCTTAGAAAAAAAAAAAGATCGG hhhhhhhhhhhhhhhhhghhhhhhfhhhhcceefhg
Related to it, the alignment scores are also different from what I have normally seen in BAM files. It ranges from 246 to 255:
Code:
samtools view ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeCshlShortRnaSeq/wgEncodeCshlShortRnaSeqA549CellShorttotalTapAlnRep1.bam | head -n 10000 | awk '{print $5}' | sort | uniq -c 13 246 5 247 149 248 11 249 19 250 26 251 93 252 9587 253 77 254 20 255
Is this normal? Am I missing something? It seems to happen only for a small percentage of reads. Would you just ignore those with multiple alignments for further analysis? If so, how would you filter them out?
Thanks!!