Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Help filtering variants from TSV file using text processing programming language

    Hi:

    I'm looking for some guidance on using text processing programming language to filter and summarize SNP calls based on a variety of descriptors in a TSV file. The TSV file was obtained using the kissplice and and kissplice2reftranscriptome programs.

    Here is a sample entry of the TSV file:

    #Component_ID SNP_ID Is_in_CDS Query_coverage SNP_position Codon_1 Codon_2 Amino_acid_1 Amino_acid_2 Is_not_synonymous Bubble_is_aligned_on_multiple_comp Bubble_is_aligned_on_multiple_seq Possible_sequencing_error Allele_frequency Read_counts_variant_1 Read_counts_variant_2 Is_condition_specific
    c59969_g1_i3 bcc_9888|Cycle_0|Type_0a True 100.0 1073 AGC TGC S C True False True True 0.0|0.0|0.0|0.0|100.0|100.0|100.0|100.0 C1_0|C2_0|C3_0|C4_0|C5_10|C6_80|C7_47|C8_17 C1_71|C2_11|C3_61|C4_55|C5_0|C6_0|C7_0|C8_0 True

    The SNP is A/T. In this case. I have 4 paired-end samples represented equally by 2 species. Species 1 is represented by the read counts C1,C2,C3,C4 and species 2 is represented by C5,C6,C7,C8. A is supported by 0 reads in species 1, and 10+80 (sample 1), 47+17 (sample 2) in species 2. T is supported by 71+11 (sample 3), 61+55 (sample 4) reads in species 1 and 0 reads in species 2. This is evidence for a putative species-specific SNP.

    In short, I would like to select SNPs that have an Allele_frequency of "|0.0|" in the first four values, and "|100.0|" in the next four (or vice versa), while making sure that the read counts are 10 or greater for the "species-specific" variant.

    Please let me know if you need more clarification, this is some of the more complex cases of filtering that I've encountered and I'm struggling quite a bit.

    Any insight would be greatly appreciated. Cheers.

    M

  • #2
    Hi MGlenn,

    Are you using linux for this analysis? You could print the lines containing that pattern using grep, for example:

    grep "\(100.0|\)\{4\}\(0.0.\)\{4\}" your_input_file.txt

    This basically means grab any line containing 100.0 repeated 4 times, followed by 0.0 repeated 4 times.

    And the vice versa situation:

    grep "\(0.0|\)\{4\}\(100.0.\)\{4\}" your_input_file.txt

    This will get the 'species-specific' SNPs but it won't take into account read counts - that's too hard for me to do with linux!

    I'd say you will need a python / perl script to really get down to what you want. I could probably write you one if you send me the first part (~100 lines) of your file.

    Cheers,

    Matt.

    Comment


    • #3
      Hi Matt,

      Thanks so much for getting back to me and offering to help, I really appreciate that.

      Yes, I'm using linux on a cluster. I agree that it is difficult to do; I was having trouble figuring out the syntax I would need to grab the "|100.0|100.0|100.0|100.0" and "0.0|0.0|0.0|0.0" allele frequencies that are present in the first or last four (eg 0.0|0.0|0.0|0.0|100.0|100.0|100.0|100.0), where one of the variants (eg T) has read support equal or greater than 10 in one species, and zero in the other species, and the other variant (eg A) has the opposite pattern indicating that each of the variants is represented only by one species (eg C1_0|C2_0|C3_0|C4_0|C5_10|C6_80|C7_47|C8_17 C1_71|C2_11|C3_61|C4_55 |C5_0|C6_0|C7_0|C8_0)

      I've attached a .txt version of my .tsv file that contains the first 70 lines (to keep under the file size limit for attachments, can email a larger print out if needed). I've also attached a .txt file of the recommendations for my analysis provided by the author of the program, which explains exactly which SNPs I'm trying to capture from the file. It may help to compliment my explanation.

      Thanks again for your help Matt! Let me know if you require any other info.

      Cheers,

      M
      Attached Files
      Last edited by MGlenn; 12-05-2017, 05:27 AM.

      Comment


      • #4
        I guess an alternative would be to first filter the TSV file based on the allele frequencies being "100.0" for the first four entries and "0.0" for the next four (or vice versa), using the grep command you provided. Then, filter the SNPs that meet those requirements based on C1-C4 having 0 reads for one variant, and C1-C4 having equal to or greater than 10 reads for the other variant, and C5-C8 having 0 reads for one variant, and C5-C8 having equal or greater than 10 reads for the other variant (or vice versa).

        As long as pairs of C1-C4 and C5-C8 have discordant read values that are 0 in each for one variant and equal to or greater than 10 for each in the other variant.

        That would solve the issue that I'm having, I just don't know how I would begin to write the syntax for that.

        M

        Comment


        • #5
          Hi MGlenn,

          yep, I think when it starts to get a bit complicated it's best to go for a python script. Then you also have a clean record of how it was done.

          I've attached a script that should do what you want (get_species_SNPs.txt). Once you download it, rename it to 'get_species_SNPs.py', and open it in a text editor. You'll see at the top of the script there are a couple of variables you can change - your input file name, minimum reads required, etc.

          Then put the script in the same directory as your SNP file, and run it like so:

          python get_species_SNPs.py

          It should work with python 2 or 3. You may have to change file permissions to get it to run.

          I wrote it fairly quickly, so just check it actually works! Pick a couple of SNPs that you know are species-specific and make sure they're in the output.

          I might copy it below as well in case the attachment doesn't work.

          Let me know how you go..

          Matt.
          Attached Files
          Last edited by neavemj; 12-05-2017, 04:40 PM.

          Comment


          • #6
            Code:
            #!/usr/bin/env python
            # python 3
            
            # take kissplice output and get 'species-specific' SNPs
            # Matthew J. Neave 06.12.2017
            
            ######################################
            # the variables below can be changed..
            SNP_file = "head_n70_mainOutput.txt"
            output = open("species_SNPs.txt", "w")
            minimum_reads = 10
            ######################################
            
            def clean_counts(input_col):
                # returns list of integers of actual read counts
                # the sample name, underscore and pipe char are removed
                sample_counts = input_col.split("|")
                counts = [int(c.split("_")[1]) for c in sample_counts]
                return(counts)
            
            with open(SNP_file) as fl:
                # skip header
                next(fl)
                # now loop through lines
                for line in fl:
                    line_s = line.strip()
                    cols = line_s.split("\t")
                    allele_freq = cols[13]
                    counts1 = clean_counts(cols[14])
                    counts2 = clean_counts(cols[15])
                    # filter out species-specific SNPs for the first case
                    if allele_freq == "100.0|100.0|100.0|100.0|0.0|0.0|0.0|0.0" \
                    and all(c >= minimum_reads for c in counts1[:4]) \
                    and all(i >= minimum_reads for i in counts2[4:]):
                        output.write(line)
            
                    # filter out species-specific SNPs for the vice versa case
                    elif allele_freq == "0.0|0.0|0.0|0.0|100.0|100.0|100.0|100.0" \
                    and all(c >= minimum_reads for c in counts1[4:]) \
                    and all(i >= minimum_reads for i in counts2[:4]):
                        output.write(line)
            Last edited by neavemj; 12-05-2017, 04:37 PM.

            Comment


            • #7
              Thanks so much Matt! It seems to have worked. I really appreciate the time you took to write this script. The script was able to filter ~2 million SNPs (all possible combos between 4 samples), down to ~145,000 that are species-specific with adequate read support for each. That's exactly what I needed it to do. Thanks again for replying to my thread, Matt!

              M

              Comment


              • #8
                Excellent, glad it worked ok!

                Best,

                Matt.

                Comment

                Latest Articles

                Collapse

                • seqadmin
                  Strategies for Sequencing Challenging Samples
                  by seqadmin


                  Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                  03-22-2024, 06:39 AM
                • seqadmin
                  Techniques and Challenges in Conservation Genomics
                  by seqadmin



                  The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                  Avian Conservation
                  Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                  03-08-2024, 10:41 AM

                ad_right_rmr

                Collapse

                News

                Collapse

                Topics Statistics Last Post
                Started by seqadmin, Yesterday, 06:37 PM
                0 responses
                8 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, Yesterday, 06:07 PM
                0 responses
                8 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, 03-22-2024, 10:03 AM
                0 responses
                49 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, 03-21-2024, 07:32 AM
                0 responses
                67 views
                0 likes
                Last Post seqadmin  
                Working...
                X