Hi,
I am trying to obtain base alleles for a specific position (polymorphic location) directly from read data. I also use samtools pileup to obtain bases information. And, the results are not matched.
Here is an example (the data is from 1000 genome project...low coverage). I look at the sample NA12716 and at the position 154402806.
These are reads data covering that position. The base at 154402806 is shown in red:
ERR000573.12285810 pPr1 X 154402756 60 9S42M = 154402571 -226
GCGAAGAAGTCAATTAGAAAGTCTTTTCAAGTTATCCAAGCAGGAGGTCTC
27=AA=8A;<?A?@>@>@@@=;>=<>>?A@>@>@:?>AA>?@>??>>=?9?
X0:i:1 X1:i:0 XC:i:42MD:Z:42 RG:Z:ERR000573 AM:i:37 NM:i:0 SM:i:37 MQ:i:60 XT:A:U
BQ:Z:@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
ERR000569.10381384 pPr1 X 154402769 60 51M = 154402592 -227
CTTTTCAAGTTATCCAAGCAGGAGGTCTCAAGTGGCCTGGTCTAGAGTAGT
>6?=8>9>=;1<<:7A@=>$$=@><=%::@@=>;6>><==;>=@>?8>@>?
X0:i:1 X1:i:0 MD:Z:37C13 RG:Z:ERR000569 AM:i:37 NM:i:1 SM:i:37 MQ:i:60
XT:A:U
BQ:Z:@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
ERR000569.428911 pPR2 X 154402785 60 35M16S = 154402963 214
AGCAGGAGGTCTCAAGTGGCCTGGTCTAGAGTAGTGACAGTGGACATTTAA
;??@?=>=;;=@>>:<%=;=>@>>>:@%:::?.99=8=?9$<34=6%<#79
X0:i:1 X1:i:0 XC:i:35MD:Z:21C13 RG:Z:ERR000569 AM:i:37 NM:i:1 SM:i:37 MQ:i:60 XT:A:U
BQ:Z:@@@@@@@@@@@@@@@@@@@@AC@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
ERR000572.854678 pPr2 X 154402794 60 15S36M = 154402599 -230
TATCCAAGCAGGAGGTCTCAAGTGGCCTGGTCTAGAGTAGTGACAGTGGAC
?A75:@A7:=?:A?>;<>?=@:?>===@>=>=>?;@<?@:@=>?@>A?>A<
X0:i:1 X1:i:1 XC:i:36MD:Z:12C23 RG:Z:ERR000572 AM:i:23 NM:i:1 SM:i:23 MQ:i:60 XT:A:U
BQ:Z:@@@@@@@@@@@@@@@@@@@@@@@@@@@C@@@@@@@@@@@@@@@@@@@@@@@
ERR000572.8002077 pPr1 X 154402804 60 15S36M = 154402586 -253
GGAGGTCTCACGTGGCCTGGTCTAGAGTAGTGACAGTGGACATTGAAAGAA
=:B<;<;>3?%;=?93=@>=>;>@>@<>@=@==@@=?>>9>@@@=AAA>A?
X0:i:1 X1:i:0 XC:i:36MD:Z:2C33 RG:Z:ERR000572 AM:i:37 NM:i:1 SM:i:37 MQ:i:60 XT:A:U
BQ:Z:@@@@@@@@@@@@@@@@@C@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@CD
ERR000574.8852864 pPR2 X 154402806 60 39M12S = 154402984 222
TGGTCTAGAGTAGTGACAGTGGACATTGAAAGAAAGAGATTTCAGAGATAC
=@???@>>>>?>>?>?=A>?>=>=@?@3#@?<8@@<>=:?@A?9<>=>A?<
X0:i:1 X1:i:0 XC:i:39MD:Z:0C38 RG:Z:ERR000574 AM:i:37 NM:i:1 SM:i:37 MQ:i:60 XT:A:U
BQ:Z:WJ@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
And the results from pileup is:
X 154402806 C T 9 67 60 5 tTtt^]T <===&
So, from the reads data, bases at position 154402806 should be CTTAAT(three reference alleles), but the results from pileup is TTTTT (all are reference alleles). Also, there only 5 reads from pileup results, but it supposes to have 6 reads coverage.
I checked the default read filter in samtools:
minimum mapping quality for an alignment to be used [0]
minmum base quality for a base to be considered [13]
The base quality of these bases CTTAAT are
63
60
64
65
66
61
Does anyone have any idea what is going wrong?
Thank you very much!!
Anney
I am trying to obtain base alleles for a specific position (polymorphic location) directly from read data. I also use samtools pileup to obtain bases information. And, the results are not matched.
Here is an example (the data is from 1000 genome project...low coverage). I look at the sample NA12716 and at the position 154402806.
These are reads data covering that position. The base at 154402806 is shown in red:
ERR000573.12285810 pPr1 X 154402756 60 9S42M = 154402571 -226
GCGAAGAAGTCAATTAGAAAGTCTTTTCAAGTTATCCAAGCAGGAGGTCTC
27=AA=8A;<?A?@>@>@@@=;>=<>>?A@>@>@:?>AA>?@>??>>=?9?
X0:i:1 X1:i:0 XC:i:42MD:Z:42 RG:Z:ERR000573 AM:i:37 NM:i:0 SM:i:37 MQ:i:60 XT:A:U
BQ:Z:@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
ERR000569.10381384 pPr1 X 154402769 60 51M = 154402592 -227
CTTTTCAAGTTATCCAAGCAGGAGGTCTCAAGTGGCCTGGTCTAGAGTAGT
>6?=8>9>=;1<<:7A@=>$$=@><=%::@@=>;6>><==;>=@>?8>@>?
X0:i:1 X1:i:0 MD:Z:37C13 RG:Z:ERR000569 AM:i:37 NM:i:1 SM:i:37 MQ:i:60
XT:A:U
BQ:Z:@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
ERR000569.428911 pPR2 X 154402785 60 35M16S = 154402963 214
AGCAGGAGGTCTCAAGTGGCCTGGTCTAGAGTAGTGACAGTGGACATTTAA
;??@?=>=;;=@>>:<%=;=>@>>>:@%:::?.99=8=?9$<34=6%<#79
X0:i:1 X1:i:0 XC:i:35MD:Z:21C13 RG:Z:ERR000569 AM:i:37 NM:i:1 SM:i:37 MQ:i:60 XT:A:U
BQ:Z:@@@@@@@@@@@@@@@@@@@@AC@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
ERR000572.854678 pPr2 X 154402794 60 15S36M = 154402599 -230
TATCCAAGCAGGAGGTCTCAAGTGGCCTGGTCTAGAGTAGTGACAGTGGAC
?A75:@A7:=?:A?>;<>?=@:?>===@>=>=>?;@<?@:@=>?@>A?>A<
X0:i:1 X1:i:1 XC:i:36MD:Z:12C23 RG:Z:ERR000572 AM:i:23 NM:i:1 SM:i:23 MQ:i:60 XT:A:U
BQ:Z:@@@@@@@@@@@@@@@@@@@@@@@@@@@C@@@@@@@@@@@@@@@@@@@@@@@
ERR000572.8002077 pPr1 X 154402804 60 15S36M = 154402586 -253
GGAGGTCTCACGTGGCCTGGTCTAGAGTAGTGACAGTGGACATTGAAAGAA
=:B<;<;>3?%;=?93=@>=>;>@>@<>@=@==@@=?>>9>@@@=AAA>A?
X0:i:1 X1:i:0 XC:i:36MD:Z:2C33 RG:Z:ERR000572 AM:i:37 NM:i:1 SM:i:37 MQ:i:60 XT:A:U
BQ:Z:@@@@@@@@@@@@@@@@@C@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@CD
ERR000574.8852864 pPR2 X 154402806 60 39M12S = 154402984 222
TGGTCTAGAGTAGTGACAGTGGACATTGAAAGAAAGAGATTTCAGAGATAC
=@???@>>>>?>>?>?=A>?>=>=@?@3#@?<8@@<>=:?@A?9<>=>A?<
X0:i:1 X1:i:0 XC:i:39MD:Z:0C38 RG:Z:ERR000574 AM:i:37 NM:i:1 SM:i:37 MQ:i:60 XT:A:U
BQ:Z:WJ@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
And the results from pileup is:
X 154402806 C T 9 67 60 5 tTtt^]T <===&
So, from the reads data, bases at position 154402806 should be CTTAAT(three reference alleles), but the results from pileup is TTTTT (all are reference alleles). Also, there only 5 reads from pileup results, but it supposes to have 6 reads coverage.
I checked the default read filter in samtools:
minimum mapping quality for an alignment to be used [0]
minmum base quality for a base to be considered [13]
The base quality of these bases CTTAAT are
63
60
64
65
66
61
Does anyone have any idea what is going wrong?
Thank you very much!!
Anney
Comment