Hi:
I'm looking for some guidance on using a text-processing programming language to filter and summarize SNP calls based on a variety of descriptors in a TSV file. The TSV file was obtained using the kissplice and kissplice2reftranscriptome programs.
Here is a sample entry of the TSV file:
#Component_ID SNP_ID Is_in_CDS Query_coverage SNP_position Codon_1 Codon_2 Amino_acid_1 Amino_acid_2 Is_not_synonymous Bubble_is_aligned_on_multiple_comp Bubble_is_aligned_on_multiple_seq Possible_sequencing_error Allele_frequency Read_counts_variant_1 Read_counts_variant_2 Is_condition_specific
c59969_g1_i3 bcc_9888|Cycle_0|Type_0a True 100.0 1073 AGC TGC S C True False True True 0.0|0.0|0.0|0.0|100.0|100.0|100.0|100.0 C1_0|C2_0|C3_0|C4_0|C5_10|C6_80|C7_47|C8_17 C1_71|C2_11|C3_61|C4_55|C5_0|C6_0|C7_0|C8_0 True
In this case, the SNP is A/T. I have 4 paired-end samples split equally between 2 species. Species 1 is represented by the read counts C1, C2, C3, C4 and species 2 is represented by C5, C6, C7, C8. A is supported by 0 reads in species 1, and by 10+80 (sample 1) and 47+17 (sample 2) reads in species 2. T is supported by 71+11 (sample 3) and 61+55 (sample 4) reads in species 1, and by 0 reads in species 2. This is evidence for a putative species-specific SNP.
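The tallying above can be sketched in a few lines of Python (an arbitrary language choice). This is only an illustration: `species_totals` is a hypothetical helper name, and it assumes the first four entries of each read-count field always belong to species 1 and the remaining entries to species 2.

```python
def species_totals(field, n_species1=4):
    """Sum the reads per species from a field such as 'C1_0|C2_0|...'.

    Assumes each entry is '<library>_<count>' and that the first
    n_species1 entries are species 1, the rest species 2.
    """
    counts = [int(item.rsplit("_", 1)[1]) for item in field.split("|")]
    return sum(counts[:n_species1]), sum(counts[n_species1:])


# Read-count fields copied from the sample entry
variant_1 = "C1_0|C2_0|C3_0|C4_0|C5_10|C6_80|C7_47|C8_17"    # A
variant_2 = "C1_71|C2_11|C3_61|C4_55|C5_0|C6_0|C7_0|C8_0"    # T

print(species_totals(variant_1))  # (0, 154)
print(species_totals(variant_2))  # (198, 0)
```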
In short, I would like to select SNPs whose Allele_frequency is 0.0 for the first four values and 100.0 for the next four (or vice versa), while making sure that the read counts are 10 or greater for the "species-specific" variant.
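One way to make these criteria concrete is the rough Python sketch below. It is not a definitive solution and rests on several assumptions: the file is tab-separated with the header line shown above, Allele_frequency refers to variant 1 (as the sample row suggests), the first four libraries are species 1, and "10 or greater" is applied to each library separately, for both variants in the species where they occur. The names `parse_counts`, `is_species_specific` and `filter_snps` are hypothetical.

```python
import csv

MIN_READS = 10  # threshold from the post, applied per library (assumption)
N_SP1 = 4       # C1..C4 are species 1, the remaining libraries species 2


def parse_counts(field):
    """Turn 'C1_0|C2_0|...' into a list of integer counts."""
    return [int(item.rsplit("_", 1)[1]) for item in field.split("|")]


def is_species_specific(row, min_reads=MIN_READS):
    """True if frequencies are 0/100 (or 100/0) split by species and
    every supporting library has at least min_reads reads."""
    freqs = [float(x) for x in row["Allele_frequency"].split("|")]
    v1 = parse_counts(row["Read_counts_variant_1"])
    v2 = parse_counts(row["Read_counts_variant_2"])
    sp1, sp2 = freqs[:N_SP1], freqs[N_SP1:]
    if all(f == 0.0 for f in sp1) and all(f == 100.0 for f in sp2):
        # variant 1 only in species 2, variant 2 only in species 1
        support = v1[N_SP1:] + v2[:N_SP1]
    elif all(f == 100.0 for f in sp1) and all(f == 0.0 for f in sp2):
        # variant 1 only in species 1, variant 2 only in species 2
        support = v1[:N_SP1] + v2[N_SP1:]
    else:
        return False
    return all(count >= min_reads for count in support)


def filter_snps(handle):
    """Return the rows of a tab-separated file that pass the filter."""
    reader = csv.DictReader(handle, delimiter="\t")
    # The header line starts with '#'; strip it from the column names.
    reader.fieldnames = [name.lstrip("#") for name in reader.fieldnames]
    return [row for row in reader if is_species_specific(row)]
```

It could be used along the lines of `with open("snps.tsv") as fh: hits = filter_snps(fh)`, where `snps.tsv` is a placeholder file name. If the threshold should apply to the summed reads per species instead, the final `all(...)` check is the only line that would need to change.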
Please let me know if you need more clarification. This is one of the more complex cases of filtering that I've encountered, and I'm struggling quite a bit.
Any insight would be greatly appreciated. Cheers.
M