Hi all,
I am using paired-end RNAseq data to investigate allele-specific expression (ASE) in a human population. I mapped the 37bp reads using TopHat to the human reference genome and used samtools to generate a mpileup file per individual. I have a list of known SNP locations that I would like to check for ASE, however I have trouble interpreting the mpileup output. I wrote a parser script in perl to get the read count for each base, which works fine when the output for a certain position looks like this:
chr6 348105 T 15 ,A........,..,. 366>>>>>>>2>>&>
I can count the number of dots and comma's in the 5th field as matches to the reference allele and in this case ACG/acg for counts for the other bases. However, what to do with outputs like these:
chr6 348102 G 15 <>>>>>>>>><>><> <3<<=.>>>>>>>;>
chr6 348883 G 10 <><<...,.^S. >><>><77><
chr6 351221 C 19 .$,$,$,$,$,$,$,$,,,,,,,,... 7<>><>>><>>;>3>*>>>
How to count the other characters like ^ , $, < and > ? I know ^ and $ mark the start and end of a read and < and > mark a reference skip (whatever that means). But can I count them as match to the reference allele? The few parsers that I found online don't count them, but then I end up with a lot of SNP positions that I cannot use.
I am really new to the RNAseq field and sequencing in general, so any help on this is appreciated. Also if you have a better solution for assessing ASE, please let me know.
Thanx!
I am using paired-end RNAseq data to investigate allele-specific expression (ASE) in a human population. I mapped the 37bp reads using TopHat to the human reference genome and used samtools to generate a mpileup file per individual. I have a list of known SNP locations that I would like to check for ASE, however I have trouble interpreting the mpileup output. I wrote a parser script in perl to get the read count for each base, which works fine when the output for a certain position looks like this:
chr6 348105 T 15 ,A........,..,. 366>>>>>>>2>>&>
I can count the number of dots and comma's in the 5th field as matches to the reference allele and in this case ACG/acg for counts for the other bases. However, what to do with outputs like these:
chr6 348102 G 15 <>>>>>>>>><>><> <3<<=.>>>>>>>;>
chr6 348883 G 10 <><<...,.^S. >><>><77><
chr6 351221 C 19 .$,$,$,$,$,$,$,$,,,,,,,,... 7<>><>>><>>;>3>*>>>
How to count the other characters like ^ , $, < and > ? I know ^ and $ mark the start and end of a read and < and > mark a reference skip (whatever that means). But can I count them as match to the reference allele? The few parsers that I found online don't count them, but then I end up with a lot of SNP positions that I cannot use.
I am really new to the RNAseq field and sequencing in general, so any help on this is appreciated. Also if you have a better solution for assessing ASE, please let me know.
Thanx!