  • Genotype calling within your sample set instead of relative to the reference genome

    Hi there,

    I'm developing a workflow to call variants from a dataset of ~600 samples sequenced by genotyping-by-sequencing (GBS) for phylogenomic analyses. My reference genome is rather divergent (around 20 million years). I'm interested in the variants among my samples, not in variants with respect to the reference genome, but the haplotype callers I'm checking (GATK, SAMtools, FreeBayes...) call variants with respect to the reference. Any suggestions for this problem?

    Thanks a lot guys.
    Last edited by Guillefriis; 11-04-2015, 07:11 AM.

  • #2
    You can create a reference by de novo assembly from your 600-sample data set, then align each sample to it to identify sample-specific variants.



    • #3
      Depending on how much data there is, 600 samples may be too much to try to assemble at one time. Perhaps take a sampling approach and compare the assemblies from those attempts to estimate the differences?



      • #4
        I'm not sure I see a problem here. Let's assume you compare your samples against the reference and you see, for example, that sample1 has an A(ref)->T(s1) mutation at position 10 while sample2 has an A(ref)->C(s2) mutation at position 10. The variation between samples (here: C vs T) is easily extractable; you just use your reference as a backbone for the comparison.
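
A minimal sketch of that idea, in Python. The per-sample dictionaries and the function name `between_sample_differences` are made up for illustration and are not any caller's output format. The reference only supplies the coordinates, while the comparison is made between the samples' own alleles:

```python
def between_sample_differences(calls_a, calls_b, ref_alleles):
    """Return {position: (allele_a, allele_b)} where the two samples differ.

    calls_a / calls_b map position -> called allele at sites that differ
    from the reference. ASSUMPTION (for this sketch only): a position
    absent from a dict carries the reference allele.
    """
    differences = {}
    for pos in sorted(set(calls_a) | set(calls_b)):
        allele_a = calls_a.get(pos, ref_alleles[pos])
        allele_b = calls_b.get(pos, ref_alleles[pos])
        if allele_a != allele_b:
            differences[pos] = (allele_a, allele_b)
    return differences


# Toy data following the example above: the reference has A at position 10.
# Position 25 is a hypothetical extra site added for contrast.
ref_alleles = {10: "A", 25: "G"}
sample1 = {10: "T", 25: "C"}  # A->T at 10, G->C at 25
sample2 = {10: "C", 25: "C"}  # A->C at 10, G->C at 25

print(between_sample_differences(sample1, sample2, ref_alleles))
```

Position 25 differs from the reference in both samples but not between them, so it drops out. With real GBS data a missing call is not the same as a reference call, so the "absent means reference" shortcut would need replacing with explicit missing-data handling.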



        • #5
          If @Guillefriis does what you are proposing, then where should the cut-off be set to say that a particular difference is due to divergence (present in > X% of samples) and so is not interesting?



          • #6
            @HESmith I wouldn't like to use a de novo assembly since I need the genomic positions of the variants provided by the zebra finch genome (planning to do a genome scan).

            @WhatsOEver (I don't know if it's good practice to answer two posts in one; please let me know if forum users prefer them separately) I see your point, and I actually thought it could work as you say. It just seems a waste of computational time to go through all the differences with respect to the reference (there are going to be a lot of them) and extract the between-sample variants afterwards. It looks like the GATK SelectVariants tool can do this, but I'm not sure exactly how. Has anybody used it? Also, I'm not sure how the callers behave at heterozygous positions: would a site that varies between my samples be filtered out because both samples have an allele matching the reference?
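
One way to make the heterozygous case concrete is to define "variable among samples" directly on the genotypes, independently of any particular caller. The sketch below assumes genotype strings in the VCF GT convention (0 = reference allele, 1 and up = alternate alleles, `.` = missing); the function names are hypothetical, and this is not how SelectVariants or any specific caller decides what to keep:

```python
def alleles_in_samples(genotypes):
    """Collect the allele indices observed across sample GT strings.

    GT strings follow the VCF convention: allele indices separated by '/'
    or '|', with 0 = reference allele and '.' = missing.
    """
    observed = set()
    for gt in genotypes:
        for allele in gt.replace("|", "/").split("/"):
            if allele != ".":
                observed.add(int(allele))
    return observed


def varies_among_samples(genotypes):
    """True if at least two different alleles are seen among the samples."""
    return len(alleles_in_samples(genotypes)) >= 2


# All samples fixed for the same non-reference allele: divergence only.
print(varies_among_samples(["1/1", "1/1", "1/1"]))  # False
# A heterozygote carrying the reference allele next to an alt homozygote.
print(varies_among_samples(["0/1", "1/1", "./."]))  # True
# Two different non-reference alleles segregating among the samples.
print(varies_among_samples(["1/1", "2/2", "1/2"]))  # True
```

Under this definition a site is kept whenever more than one allele is observed among the samples, whether or not one of those alleles matches the reference.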


            @GenoMax I'm not sure I understood you. I'm not interested in reference-relative variants because my study is focused on phylogenomic relationships within an emberizid genus, while my reference is the Zebra Finch, used only for the mapping and downstream analyses.

            Thank you all, guys.
            Last edited by Guillefriis; 11-05-2015, 02:37 AM.



            • #7
              Originally posted by Guillefriis
              @GenoMax I'm not sure I understood you. I'm not interested in reference-relative variants because my study is focused on phylogenomic relationships within an emberizid genus, while my reference is the Zebra Finch, used only for the mapping and downstream analyses.
              You may not be interested in them but that is how you are going to pick them, right? Have you done a test to see what this result looks like? I am not an evolutionary biologist by a long shot so I don't know how ~20M year difference has affected the overall genome organization (# of chromosomes, sizes etc).

              With 600 samples you likely have enough data to try some assemblies with a random sampling of reads. That may prove to be a better reference.
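
A rough sketch of the random-sampling step, assuming plain 4-line FASTQ records. It uses single-pass reservoir sampling (a technique chosen here for illustration, not something named in the thread) so the full read set never has to sit in memory; the function names and the in-memory test data are hypothetical:

```python
import io
import random


def read_fastq(handle):
    """Yield FASTQ records as 4-line tuples (assumes unwrapped 4-line records)."""
    while True:
        record = tuple(handle.readline() for _ in range(4))
        if not record[0]:  # empty string means end of file
            return
        yield record


def reservoir_sample(records, k, seed=0):
    """Keep a uniform random sample of k records in one pass (Algorithm R)."""
    rng = random.Random(seed)
    sample = []
    for i, record in enumerate(records):
        if i < k:
            sample.append(record)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = record
    return sample


# A tiny in-memory FASTQ stands in for a real file handle.
fastq_text = "".join(f"@read{i}\nACGT\n+\nIIII\n" for i in range(100))
subset = reservoir_sample(read_fastq(io.StringIO(fastq_text)), k=10, seed=42)
print(len(subset))
```

Fixing the seed makes a subsample reproducible, and changing it gives independent subsamples if several trial assemblies are to be compared.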

              It is late and my mind is wandering ...



              • #8
                I wonder if you have considered pyRAD:
                http://dereneaton.com/software/pyrad/



                • #9
                  Originally posted by GenoMax
                  You may not be interested in them but that is how you are going to pick them, right? Have you done a test to see what this result looks like? I am not an evolutionary biologist by a long shot so I don't know how ~20M year difference has affected the overall genome organization (# of chromosomes, sizes etc).

                  With 600 samples you likely have enough data to try some assemblies with a random sampling of reads. That may prove to be a better reference.

                  It is late and my mind is wandering ...
                  I think I'll try. I'll lose the genomic positions of the variants, but I may end up with a larger number of them, which is better in phylogenetic terms. I have never done an assembly, though!



                  • #10
                    Originally posted by nucacidhunter
                    I wonder if you have considered pyRAD:
                    http://dereneaton.com/software/pyrad/
                    You know, @nucacidhunter, I had a look and it seems pretty interesting. I think I'll take the intersection of the SNPs called by both GATK and pyRAD. Thanks, man.
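
A minimal sketch of such an intersection, under the strong assumption that both call sets are expressed as VCF data lines on the same coordinate system (the thread does not say whether the pyRAD output can be placed on reference coordinates). The function name, chromosome names and records below are invented for illustration:

```python
def site_keys(vcf_lines):
    """Return the set of (chrom, pos, ref, alt) keys from VCF data lines."""
    keys = set()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        keys.add((chrom, int(pos), ref, alt))
    return keys


# Hypothetical call sets from two callers (first five VCF columns only).
caller_a = [
    "##fileformat=VCFv4.2",
    "chr1\t100\t.\tA\tT",
    "chr1\t250\t.\tG\tC",
    "chr2\t40\t.\tC\tT",
]
caller_b = [
    "chr1\t100\t.\tA\tT",
    "chr2\t40\t.\tC\tG",
    "chr2\t90\t.\tT\tA",
]

shared = sorted(site_keys(caller_a) & site_keys(caller_b))
print(shared)
```

Keying on the alleles as well as the position means that a site where the two callers report different alternate alleles (chr2:40 here) is not counted as shared; keying on (chrom, pos) alone would be the more permissive choice.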
