Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • MGlenn
    Junior Member
    • Dec 2017
    • 6

    Help filtering variants from TSV file using text processing programming language

    Hi:

    I'm looking for some guidance on using text processing programming language to filter and summarize SNP calls based on a variety of descriptors in a TSV file. The TSV file was obtained using the kissplice and and kissplice2reftranscriptome programs.

    Here is a sample entry of the TSV file:

    #Component_ID SNP_ID Is_in_CDS Query_coverage SNP_position Codon_1 Codon_2 Amino_acid_1 Amino_acid_2 Is_not_synonymous Bubble_is_aligned_on_multiple_comp Bubble_is_aligned_on_multiple_seq Possible_sequencing_error Allele_frequency Read_counts_variant_1 Read_counts_variant_2 Is_condition_specific
    c59969_g1_i3 bcc_9888|Cycle_0|Type_0a True 100.0 1073 AGC TGC S C True False True True 0.0|0.0|0.0|0.0|100.0|100.0|100.0|100.0 C1_0|C2_0|C3_0|C4_0|C5_10|C6_80|C7_47|C8_17 C1_71|C2_11|C3_61|C4_55|C5_0|C6_0|C7_0|C8_0 True

    The SNP is A/T. In this case. I have 4 paired-end samples represented equally by 2 species. Species 1 is represented by the read counts C1,C2,C3,C4 and species 2 is represented by C5,C6,C7,C8. A is supported by 0 reads in species 1, and 10+80 (sample 1), 47+17 (sample 2) in species 2. T is supported by 71+11 (sample 3), 61+55 (sample 4) reads in species 1 and 0 reads in species 2. This is evidence for a putative species-specific SNP.

    In short, I would like to select SNPs that have an Allele_frequency of "|0.0|" in the first four values, and "|100.0|" in the next four (or vice versa), while making sure that the read counts are 10 or greater for the "species-specific" variant.

    Please let me know if you need more clarification, this is some of the more complex cases of filtering that I've encountered and I'm struggling quite a bit.

    Any insight would be greatly appreciated. Cheers.

    M
  • neavemj
    Member
    • Feb 2014
    • 58

    #2
    Hi MGlenn,

    Are you using linux for this analysis? You could print the lines containing that pattern using grep, for example:

    grep "\(100.0|\)\{4\}\(0.0.\)\{4\}" your_input_file.txt

    This basically means grab any line containing 100.0 repeated 4 times, followed by 0.0 repeated 4 times.

    And the vice versa situation:

    grep "\(0.0|\)\{4\}\(100.0.\)\{4\}" your_input_file.txt

    This will get the 'species-specific' SNPs but it won't take into account read counts - that's too hard for me to do with linux!

    I'd say you will need a python / perl script to really get down to what you want. I could probably write you one if you send me the first part (~100 lines) of your file.

    Cheers,

    Matt.

    Comment

    • MGlenn
      Junior Member
      • Dec 2017
      • 6

      #3
      Hi Matt,

      Thanks so much for getting back to me and offering to help, I really appreciate that.

      Yes, I'm using linux on a cluster. I agree that it is difficult to do; I was having trouble figuring out the syntax I would need to grab the "|100.0|100.0|100.0|100.0" and "0.0|0.0|0.0|0.0" allele frequencies that are present in the first or last four (eg 0.0|0.0|0.0|0.0|100.0|100.0|100.0|100.0), where one of the variants (eg T) has read support equal or greater than 10 in one species, and zero in the other species, and the other variant (eg A) has the opposite pattern indicating that each of the variants is represented only by one species (eg C1_0|C2_0|C3_0|C4_0|C5_10|C6_80|C7_47|C8_17 C1_71|C2_11|C3_61|C4_55 |C5_0|C6_0|C7_0|C8_0)

      I've attached a .txt version of my .tsv file that contains the first 70 lines (to keep under the file size limit for attachments, can email a larger print out if needed). I've also attached a .txt file of the recommendations for my analysis provided by the author of the program, which explains exactly which SNPs I'm trying to capture from the file. It may help to compliment my explanation.

      Thanks again for your help Matt! Let me know if you require any other info.

      Cheers,

      M
      Attached Files
      Last edited by MGlenn; 12-05-2017, 05:27 AM.

      Comment

      • MGlenn
        Junior Member
        • Dec 2017
        • 6

        #4
        I guess an alternative would be to first filter the TSV file based on the allele frequencies being "100.0" for the first four entries and "0.0" for the next four (or vice versa), using the grep command you provided. Then, filter the SNPs that meet those requirements based on C1-C4 having 0 reads for one variant, and C1-C4 having equal to or greater than 10 reads for the other variant, and C5-C8 having 0 reads for one variant, and C5-C8 having equal or greater than 10 reads for the other variant (or vice versa).

        As long as pairs of C1-C4 and C5-C8 have discordant read values that are 0 in each for one variant and equal to or greater than 10 for each in the other variant.

        That would solve the issue that I'm having, I just don't know how I would begin to write the syntax for that.

        M

        Comment

        • neavemj
          Member
          • Feb 2014
          • 58

          #5
          Hi MGlenn,

          yep, I think when it starts to get a bit complicated it's best to go for a python script. Then you also have a clean record of how it was done.

          I've attached a script that should do what you want (get_species_SNPs.txt). Once you download it, rename it to 'get_species_SNPs.py', and open it in a text editor. You'll see at the top of the script there are a couple of variables you can change - your input file name, minimum reads required, etc.

          Then put the script in the same directory as your SNP file, and run it like so:

          python get_species_SNPs.py

          It should work with python 2 or 3. You may have to change file permissions to get it to run.

          I wrote it fairly quickly, so just check it actually works! Pick a couple of SNPs that you know are species-specific and make sure they're in the output.

          I might copy it below as well in case the attachment doesn't work.

          Let me know how you go..

          Matt.
          Attached Files
          Last edited by neavemj; 12-05-2017, 04:40 PM.

          Comment

          • neavemj
            Member
            • Feb 2014
            • 58

            #6
            Code:
            #!/usr/bin/env python
            # python 3
            
            # take kissplice output and get 'species-specific' SNPs
            # Matthew J. Neave 06.12.2017
            
            ######################################
            # the variables below can be changed..
            SNP_file = "head_n70_mainOutput.txt"
            output = open("species_SNPs.txt", "w")
            minimum_reads = 10
            ######################################
            
            def clean_counts(input_col):
                # returns list of integers of actual read counts
                # the sample name, underscore and pipe char are removed
                sample_counts = input_col.split("|")
                counts = [int(c.split("_")[1]) for c in sample_counts]
                return(counts)
            
            with open(SNP_file) as fl:
                # skip header
                next(fl)
                # now loop through lines
                for line in fl:
                    line_s = line.strip()
                    cols = line_s.split("\t")
                    allele_freq = cols[13]
                    counts1 = clean_counts(cols[14])
                    counts2 = clean_counts(cols[15])
                    # filter out species-specific SNPs for the first case
                    if allele_freq == "100.0|100.0|100.0|100.0|0.0|0.0|0.0|0.0" \
                    and all(c >= minimum_reads for c in counts1[:4]) \
                    and all(i >= minimum_reads for i in counts2[4:]):
                        output.write(line)
            
                    # filter out species-specific SNPs for the vice versa case
                    elif allele_freq == "0.0|0.0|0.0|0.0|100.0|100.0|100.0|100.0" \
                    and all(c >= minimum_reads for c in counts1[4:]) \
                    and all(i >= minimum_reads for i in counts2[:4]):
                        output.write(line)
            Last edited by neavemj; 12-05-2017, 04:37 PM.

            Comment

            • MGlenn
              Junior Member
              • Dec 2017
              • 6

              #7
              Thanks so much Matt! It seems to have worked. I really appreciate the time you took to write this script. The script was able to filter ~2 million SNPs (all possible combos between 4 samples), down to ~145,000 that are species-specific with adequate read support for each. That's exactly what I needed it to do. Thanks again for replying to my thread, Matt!

              M

              Comment

              • neavemj
                Member
                • Feb 2014
                • 58

                #8
                Excellent, glad it worked ok!

                Best,

                Matt.

                Comment

                Latest Articles

                Collapse

                • SEQadmin2
                  From Collection to Sequencing: Why Sample Preparation and Preservation Define Sequencing Data
                  by SEQadmin2


                  Data variability is still an issue in sequencing technologies despite the advances in reproducibility and accuracy of these platforms. But the problem does not originate in the sequencing itself, but in the previous steps, before the sample reaches the sequencer.


                  The first step is collection, followed by preservation and sample preparation for analysis. Most scientists overlook those steps, but not being careful might just be skewing the experiment’s results.
                  ...
                  06-02-2026, 10:05 AM
                • SEQadmin2
                  Single-Cell Sequencing at an Inflection Point: Early Impacts of New Platforms and Emerging Trends
                  by SEQadmin2


                  With the launch of new single-cell sequencing platforms in 2026, the field stands at an exciting inflection point. This article surveys the most impactful advances in the field and discusses how they’re reshaping research in cancer, immunology, and beyond.


                  Introduction

                  Single-cell sequencing technologies have undergone remarkable advances over the past decade, transitioning from low-throughput experimental approaches to highly scalable platforms capable of...
                  05-22-2026, 06:42 AM
                • SEQadmin2
                  Environmental Genomics in the Age of NGS: From Microbes to Conservation Strategies
                  by SEQadmin2

                  Studying ecosystems means dealing with complex, multi-species communities that are hard to observe at scale. This complexity, however, hides many important questions to be answered, from how biogeochemical cycles work and how climate change can affect species distribution to how conservation strategies can work best.


                  Genomics, particularly since the expansion of NGS, has transformed ecosystem ecology. By sequencing environmental DNA, we can now assess biodiversity without direct...
                  05-06-2026, 09:04 AM

                ad_right_rmr

                Collapse

                News

                Collapse

                Topics Statistics Last Post
                Started by SEQadmin2, 06-02-2026, 12:03 PM
                0 responses
                19 views
                0 reactions
                Last Post SEQadmin2  
                Started by SEQadmin2, 06-02-2026, 11:40 AM
                0 responses
                14 views
                0 reactions
                Last Post SEQadmin2  
                Started by SEQadmin2, 05-28-2026, 11:40 AM
                0 responses
                29 views
                0 reactions
                Last Post SEQadmin2  
                Started by SEQadmin2, 05-26-2026, 10:12 AM
                0 responses
                31 views
                0 reactions
                Last Post SEQadmin2  
                Working...