Seqanswers Leaderboard Ad

**neavemj** · 12-04-2017, 07:58 PM

Hi MGlenn,

Are you using linux for this analysis? You could print the lines containing that pattern using grep, for example:

grep "\(100.0|\)\{4\}\(0.0.\)\{4\}" your_input_file.txt

This basically means grab any line containing 100.0 repeated 4 times, followed by 0.0 repeated 4 times.

And the vice versa situation:

grep "\(0.0|\)\{4\}\(100.0.\)\{4\}" your_input_file.txt

This will get the 'species-specific' SNPs but it won't take into account read counts - that's too hard for me to do with linux!

I'd say you will need a python / perl script to really get down to what you want. I could probably write you one if you send me the first part (~100 lines) of your file.

Cheers,

Matt.

**MGlenn** · 12-05-2017, 05:19 AM

Hi Matt,

Thanks so much for getting back to me and offering to help, I really appreciate that.

Yes, I'm using linux on a cluster. I agree that it is difficult to do; I was having trouble figuring out the syntax I would need to grab the "|100.0|100.0|100.0|100.0" and "0.0|0.0|0.0|0.0" allele frequencies that are present in the first or last four (eg 0.0|0.0|0.0|0.0|100.0|100.0|100.0|100.0), where one of the variants (eg T) has read support equal or greater than 10 in one species, and zero in the other species, and the other variant (eg A) has the opposite pattern indicating that each of the variants is represented only by one species (eg C1_0|C2_0|C3_0|C4_0|C5_10|C6_80|C7_47|C8_17 C1_71|C2_11|C3_61|C4_55 |C5_0|C6_0|C7_0|C8_0)

I've attached a .txt version of my .tsv file that contains the first 70 lines (to keep under the file size limit for attachments, can email a larger print out if needed). I've also attached a .txt file of the recommendations for my analysis provided by the author of the program, which explains exactly which SNPs I'm trying to capture from the file. It may help to compliment my explanation.

Thanks again for your help Matt! Let me know if you require any other info.

Cheers,

M

Attached Files

**MGlenn** · 12-05-2017, 10:35 AM

I guess an alternative would be to first filter the TSV file based on the allele frequencies being "100.0" for the first four entries and "0.0" for the next four (or vice versa), using the grep command you provided. Then, filter the SNPs that meet those requirements based on C1-C4 having 0 reads for one variant, and C1-C4 having equal to or greater than 10 reads for the other variant, and C5-C8 having 0 reads for one variant, and C5-C8 having equal or greater than 10 reads for the other variant (or vice versa).

As long as pairs of C1-C4 and C5-C8 have discordant read values that are 0 in each for one variant and equal to or greater than 10 for each in the other variant.

That would solve the issue that I'm having, I just don't know how I would begin to write the syntax for that.

M

**neavemj** · 12-05-2017, 03:38 PM

Hi MGlenn,

yep, I think when it starts to get a bit complicated it's best to go for a python script. Then you also have a clean record of how it was done.

I've attached a script that should do what you want (get_species_SNPs.txt). Once you download it, rename it to 'get_species_SNPs.py', and open it in a text editor. You'll see at the top of the script there are a couple of variables you can change - your input file name, minimum reads required, etc.

Then put the script in the same directory as your SNP file, and run it like so:

python get_species_SNPs.py

It should work with python 2 or 3. You may have to change file permissions to get it to run.

I wrote it fairly quickly, so just check it actually works! Pick a couple of SNPs that you know are species-specific and make sure they're in the output.

I might copy it below as well in case the attachment doesn't work.

Let me know how you go..

Matt.

Attached Files

get_species_SNPs.txt (1.5 KB, 3 views)

**neavemj** · 12-05-2017, 03:44 PM

Code:

#!/usr/bin/env python
# python 3

# take kissplice output and get 'species-specific' SNPs
# Matthew J. Neave 06.12.2017

######################################
# the variables below can be changed..
SNP_file = "head_n70_mainOutput.txt"
output = open("species_SNPs.txt", "w")
minimum_reads = 10
######################################

def clean_counts(input_col):
    # returns list of integers of actual read counts
    # the sample name, underscore and pipe char are removed
    sample_counts = input_col.split("|")
    counts = [int(c.split("_")[1]) for c in sample_counts]
    return(counts)

with open(SNP_file) as fl:
    # skip header
    next(fl)
    # now loop through lines
    for line in fl:
        line_s = line.strip()
        cols = line_s.split("\t")
        allele_freq = cols[13]
        counts1 = clean_counts(cols[14])
        counts2 = clean_counts(cols[15])
        # filter out species-specific SNPs for the first case
        if allele_freq == "100.0|100.0|100.0|100.0|0.0|0.0|0.0|0.0" \
        and all(c >= minimum_reads for c in counts1[:4]) \
        and all(i >= minimum_reads for i in counts2[4:]):
            output.write(line)

        # filter out species-specific SNPs for the vice versa case
        elif allele_freq == "0.0|0.0|0.0|0.0|100.0|100.0|100.0|100.0" \
        and all(c >= minimum_reads for c in counts1[4:]) \
        and all(i >= minimum_reads for i in counts2[:4]):
            output.write(line)

**MGlenn** · 12-06-2017, 09:11 AM

Thanks so much Matt! It seems to have worked. I really appreciate the time you took to write this script. The script was able to filter ~2 million SNPs (all possible combos between 4 samples), down to ~145,000 that are species-specific with adequate read support for each. That's exactly what I needed it to do. Thanks again for replying to my thread, Matt!

M

**neavemj** · 12-06-2017, 02:20 PM

Excellent, glad it worked ok!

Best,

Matt.

Topics	Statistics	Last Post
A Closer Look at the Enigmatic Genomes of Oikopleura dioica by seqadmin Started by seqadmin, Yesterday, 06:35 AM	0 responses 14 views 0 likes	Last Post by seqadmin Yesterday, 06:35 AM
Advanced Epigenome Editing Platform Explores Gene Regulation Mechanisms by seqadmin Started by seqadmin, 05-09-2024, 02:46 PM	0 responses 18 views 0 likes	Last Post by seqadmin 05-09-2024, 02:46 PM
Telomere Maintenance by PARP1: A New Perspective in Cancer Research by seqadmin Started by seqadmin, 05-07-2024, 06:57 AM	0 responses 17 views 0 likes	Last Post by seqadmin 05-07-2024, 06:57 AM
Enhanced Neoantigen Detection: Introducing NeoHunter by seqadmin Started by seqadmin, 05-06-2024, 07:17 AM	0 responses 19 views 0 likes	Last Post by seqadmin 05-06-2024, 07:17 AM

Seqanswers Leaderboard Ad

Announcement

Help filtering variants from TSV file using text processing programming language

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News