SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
text file editing Liy General 1 06-20-2015 01:29 PM
Filtering homozygous variants in vcf file with vcftools Seqql Bioinformatics 4 06-03-2015 11:06 AM
Programming Language Desai Bioinformatics 12 06-02-2014 10:44 AM
How to fetch any older version ACCs from NCBI using programming language? Brace Bioinformatics 5 06-20-2012 08:38 PM
Re: New to programming need help for inserting text kapoormanav Bioinformatics 5 07-09-2010 01:23 PM

Reply
 
Thread Tools
Old 12-04-2017, 09:47 AM   #1
MGlenn
Junior Member
 
Location: Ontario

Join Date: Dec 2017
Posts: 6
Question Help filtering variants from TSV file using text processing programming language

Hi:

I'm looking for some guidance on using text processing programming language to filter and summarize SNP calls based on a variety of descriptors in a TSV file. The TSV file was obtained using the kissplice and and kissplice2reftranscriptome programs.

Here is a sample entry of the TSV file:

#Component_ID SNP_ID Is_in_CDS Query_coverage SNP_position Codon_1 Codon_2 Amino_acid_1 Amino_acid_2 Is_not_synonymous Bubble_is_aligned_on_multiple_comp Bubble_is_aligned_on_multiple_seq Possible_sequencing_error Allele_frequency Read_counts_variant_1 Read_counts_variant_2 Is_condition_specific
c59969_g1_i3 bcc_9888|Cycle_0|Type_0a True 100.0 1073 AGC TGC S C True False True True 0.0|0.0|0.0|0.0|100.0|100.0|100.0|100.0 C1_0|C2_0|C3_0|C4_0|C5_10|C6_80|C7_47|C8_17 C1_71|C2_11|C3_61|C4_55|C5_0|C6_0|C7_0|C8_0 True

The SNP is A/T. In this case. I have 4 paired-end samples represented equally by 2 species. Species 1 is represented by the read counts C1,C2,C3,C4 and species 2 is represented by C5,C6,C7,C8. A is supported by 0 reads in species 1, and 10+80 (sample 1), 47+17 (sample 2) in species 2. T is supported by 71+11 (sample 3), 61+55 (sample 4) reads in species 1 and 0 reads in species 2. This is evidence for a putative species-specific SNP.

In short, I would like to select SNPs that have an Allele_frequency of "|0.0|" in the first four values, and "|100.0|" in the next four (or vice versa), while making sure that the read counts are 10 or greater for the "species-specific" variant.

Please let me know if you need more clarification, this is some of the more complex cases of filtering that I've encountered and I'm struggling quite a bit.

Any insight would be greatly appreciated. Cheers.

M
MGlenn is offline   Reply With Quote
Old 12-04-2017, 07:58 PM   #2
neavemj
Member
 
Location: MA, USA

Join Date: Feb 2014
Posts: 40
Default

Hi MGlenn,

Are you using linux for this analysis? You could print the lines containing that pattern using grep, for example:

grep "\(100.0|\)\{4\}\(0.0.\)\{4\}" your_input_file.txt

This basically means grab any line containing 100.0 repeated 4 times, followed by 0.0 repeated 4 times.

And the vice versa situation:

grep "\(0.0|\)\{4\}\(100.0.\)\{4\}" your_input_file.txt

This will get the 'species-specific' SNPs but it won't take into account read counts - that's too hard for me to do with linux!

I'd say you will need a python / perl script to really get down to what you want. I could probably write you one if you send me the first part (~100 lines) of your file.

Cheers,

Matt.
neavemj is offline   Reply With Quote
Old 12-05-2017, 05:19 AM   #3
MGlenn
Junior Member
 
Location: Ontario

Join Date: Dec 2017
Posts: 6
Default

Hi Matt,

Thanks so much for getting back to me and offering to help, I really appreciate that.

Yes, I'm using linux on a cluster. I agree that it is difficult to do; I was having trouble figuring out the syntax I would need to grab the "|100.0|100.0|100.0|100.0" and "0.0|0.0|0.0|0.0" allele frequencies that are present in the first or last four (eg 0.0|0.0|0.0|0.0|100.0|100.0|100.0|100.0), where one of the variants (eg T) has read support equal or greater than 10 in one species, and zero in the other species, and the other variant (eg A) has the opposite pattern indicating that each of the variants is represented only by one species (eg C1_0|C2_0|C3_0|C4_0|C5_10|C6_80|C7_47|C8_17 C1_71|C2_11|C3_61|C4_55 |C5_0|C6_0|C7_0|C8_0)

I've attached a .txt version of my .tsv file that contains the first 70 lines (to keep under the file size limit for attachments, can email a larger print out if needed). I've also attached a .txt file of the recommendations for my analysis provided by the author of the program, which explains exactly which SNPs I'm trying to capture from the file. It may help to compliment my explanation.

Thanks again for your help Matt! Let me know if you require any other info.

Cheers,

M
Attached Files
File Type: txt kissplice_SNP_author_recommendation.txt (1.2 KB, 2 views)
File Type: txt head_n70_mainOutput.txt (16.5 KB, 2 views)

Last edited by MGlenn; 12-05-2017 at 05:27 AM.
MGlenn is offline   Reply With Quote
Old 12-05-2017, 10:35 AM   #4
MGlenn
Junior Member
 
Location: Ontario

Join Date: Dec 2017
Posts: 6
Default

I guess an alternative would be to first filter the TSV file based on the allele frequencies being "100.0" for the first four entries and "0.0" for the next four (or vice versa), using the grep command you provided. Then, filter the SNPs that meet those requirements based on C1-C4 having 0 reads for one variant, and C1-C4 having equal to or greater than 10 reads for the other variant, and C5-C8 having 0 reads for one variant, and C5-C8 having equal or greater than 10 reads for the other variant (or vice versa).

As long as pairs of C1-C4 and C5-C8 have discordant read values that are 0 in each for one variant and equal to or greater than 10 for each in the other variant.

That would solve the issue that I'm having, I just don't know how I would begin to write the syntax for that.

M
MGlenn is offline   Reply With Quote
Old 12-05-2017, 03:38 PM   #5
neavemj
Member
 
Location: MA, USA

Join Date: Feb 2014
Posts: 40
Default

Hi MGlenn,

yep, I think when it starts to get a bit complicated it's best to go for a python script. Then you also have a clean record of how it was done.

I've attached a script that should do what you want (get_species_SNPs.txt). Once you download it, rename it to 'get_species_SNPs.py', and open it in a text editor. You'll see at the top of the script there are a couple of variables you can change - your input file name, minimum reads required, etc.

Then put the script in the same directory as your SNP file, and run it like so:

python get_species_SNPs.py

It should work with python 2 or 3. You may have to change file permissions to get it to run.

I wrote it fairly quickly, so just check it actually works! Pick a couple of SNPs that you know are species-specific and make sure they're in the output.

I might copy it below as well in case the attachment doesn't work.

Let me know how you go..

Matt.
Attached Files
File Type: txt get_species_SNPs.txt (1.5 KB, 3 views)

Last edited by neavemj; 12-05-2017 at 04:40 PM.
neavemj is offline   Reply With Quote
Old 12-05-2017, 03:44 PM   #6
neavemj
Member
 
Location: MA, USA

Join Date: Feb 2014
Posts: 40
Default

Code:
#!/usr/bin/env python
# python 3

# take kissplice output and get 'species-specific' SNPs
# Matthew J. Neave 06.12.2017

######################################
# the variables below can be changed..
SNP_file = "head_n70_mainOutput.txt"
output = open("species_SNPs.txt", "w")
minimum_reads = 10
######################################

def clean_counts(input_col):
    # returns list of integers of actual read counts
    # the sample name, underscore and pipe char are removed
    sample_counts = input_col.split("|")
    counts = [int(c.split("_")[1]) for c in sample_counts]
    return(counts)

with open(SNP_file) as fl:
    # skip header
    next(fl)
    # now loop through lines
    for line in fl:
        line_s = line.strip()
        cols = line_s.split("\t")
        allele_freq = cols[13]
        counts1 = clean_counts(cols[14])
        counts2 = clean_counts(cols[15])
        # filter out species-specific SNPs for the first case
        if allele_freq == "100.0|100.0|100.0|100.0|0.0|0.0|0.0|0.0" \
        and all(c >= minimum_reads for c in counts1[:4]) \
        and all(i >= minimum_reads for i in counts2[4:]):
            output.write(line)

        # filter out species-specific SNPs for the vice versa case
        elif allele_freq == "0.0|0.0|0.0|0.0|100.0|100.0|100.0|100.0" \
        and all(c >= minimum_reads for c in counts1[4:]) \
        and all(i >= minimum_reads for i in counts2[:4]):
            output.write(line)

Last edited by neavemj; 12-05-2017 at 04:37 PM.
neavemj is offline   Reply With Quote
Old 12-06-2017, 09:11 AM   #7
MGlenn
Junior Member
 
Location: Ontario

Join Date: Dec 2017
Posts: 6
Default

Thanks so much Matt! It seems to have worked. I really appreciate the time you took to write this script. The script was able to filter ~2 million SNPs (all possible combos between 4 samples), down to ~145,000 that are species-specific with adequate read support for each. That's exactly what I needed it to do. Thanks again for replying to my thread, Matt!

M
MGlenn is offline   Reply With Quote
Old 12-06-2017, 02:20 PM   #8
neavemj
Member
 
Location: MA, USA

Join Date: Feb 2014
Posts: 40
Default

Excellent, glad it worked ok!

Best,

Matt.
neavemj is offline   Reply With Quote
Reply

Tags
filtering variants, kisslice2reftranscriptome, kissplice, snp

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 09:55 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2017, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO