Seqanswers Leaderboard Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • KNS
    Member
    • Aug 2010
    • 17

    Compile multi-individual SNP tables from bowtie/samtools mappings

    I am sure someone must have solved this before, and I wonder if there is not an obviously easy way to do this:

    I have GAIIx-RAD data of bowtie-reference-mapped individuals. The resulting individual SNP data (produced with samtools in the form of pileups) is easily merged in a single table, but one major problem remains:

    The samtools pileup only reports on SNPs which are different wrt the reference sequence, but not on those SNPs which happen to be identical to the reference (i.e. those which show differences in other individuals).

    Is there any easy way of discriminating the absence of data from DNA-sequence identity of any given SNP with the reference sequence?

    Any pointers most welcome!
  • Jose Blanca
    Member
    • Aug 2009
    • 70

    #2
    We had the same problem in a plant transcriptome, so we created a tool to do just that. It takes the individual bam files, it merges them and then it looks for the SNP with respect to the reference or to between the reads. You could do something like that.
    Alternatively, you could give a spin to our tool, it might be useful to you, it is named ngs_backbone.

    Comment

    • KNS
      Member
      • Aug 2010
      • 17

      #3
      Dear Jose,
      this is brilliant. I will gove it a shot - no need to reinvent the wheel. Big thanks for your quick reply!

      Comment

      • malcook
        Member
        • Sep 2009
        • 24

        #4
        consolidating pileup across multiple samples

        Here are steps for using samtools to creating pileup output for three samples holding the sample allele at each site
        Code:
        # assuming for each of three read groups (samples) named s1,s2,s3 that you have one file of reads aligned to genome in $RefseqFasta
        
        # use samtools to produce your sorted bam files {s1,s2,s3}.s.bam
        ...
        
        # build the pileups:
        samtools pileup -vcf $RefseqFasta s1.s.bam > s1.pileup.tab
        samtools pileup -vcf $RefseqFasta s2.s.bam > s2.pileup.tab
        samtools pileup -vcf $RefseqFasta s3.s.bam > s3.pileup.tab
        
        # filter them using samtools.pl (or varfilter.py) (or awk)
        samtools.pl varFilter $varFilterOptions s1.pileup.tab > s1.pileup.filtered.tab
        samtools.pl varFilter $varFilterOptions s2.pileup.tab > s2.pileup.filtered.tab
        samtools.pl varFilter $varFilterOptions s3.pileup.tab > s3.pileup.filtered.tab
        
        # produce consolidated "list of sites" (c.f. [url]http://samtools.sourceforge.net/samtools.shtml[/url]) holding at least one filtered pileup
        # note: the s{1,2,3} relies upon your shell being bash.
        cut -f 1-2  s{1,2,3}.pileup.filtered.tab | sort  --key=1,1 --key=2,2n | uniq > s1-3.sites.tab
        
        # extract the pileup for each sample at each site
        samtools pileup -l s1-3.sites.tab -c -f $RefseqFasta s1.s.bam > s1.pileup.atsites.tab
        samtools pileup -l s1-3.sites.tab -c -f $RefseqFasta s2.s.bam > s2.pileup.atsites.tab
        samtools pileup -l s1-3.sites.tab -c -f $RefseqFasta s3.s.bam > s3.pileup.atsites.tab
        
        # Use whatever tools you have to compare s{1,2,3}.pileup.atsites.tab
        # They all are in the same order and have the same values in the 
        # first three columns (chromosome, position, reference sequence).
        # Put them in a database and write join commands?  
        # Use the unix `paste` command to see them side-by-side?
        # Load them into "R" and compare them there?
        # Use ensembl'e snp_effect_predictor to predict consequences?
        # ...
        I typically craft the above logic using Makefiles....

        Cheers,

        Malcolm

        Comment

        • lh3
          Senior Member
          • Feb 2008
          • 686

          #5
          Use mpileup

          Comment

          • sdavis
            Member
            • Jan 2010
            • 14

            #6
            Originally posted by lh3 View Post
            Use mpileup
            Using mpileup does a fantastic job of simplifying the workflow. However, the per-sample information (strand bias, sequence depth, etc.) are largely lost. Is there a way to maintain that sample-level information (DP, SB, and potentially others) so that more interesting filtering can be done on individual samples? In the end, we are typically interested in having very high quality variant calls per sample and want to guard against errors at that level and not so much at the level of the population.

            Comment

            • lh3
              Senior Member
              • Feb 2008
              • 686

              #7
              For now, call candidates first and then run mpileup on the candidates to get the pileup format (i.e., without the "-g" option). From pileup, you will know DP. As to strand bias, you may perform an exact test based on the pileup string, like what bcftools is doing.

              Or when you get the candidate lists, collect information individually by running mpileup on each individual.

              EDIT: in future, pileup lines may be folded into VCF, but this will not be soon.
              Last edited by lh3; 10-29-2010, 04:35 AM.

              Comment

              • lh3
                Senior Member
                • Feb 2008
                • 686

                #8
                I have just committed a new revision by which you can compute per-sample DP and SP (phred-scaled strand bias P-value), via the -D and -S options of mpileup. They are not the default behavior. In addition, -S may have performance overhead as Fisher's exact test really takes time.

                EDIT: for now, I think DP and SP are the most informative ones.
                Last edited by lh3; 10-29-2010, 08:21 AM.

                Comment

                • Jose Blanca
                  Member
                  • Aug 2009
                  • 70

                  #9
                  We keep, sample and library information as well as a mean quality per allele. That's why we use our own snp caller and not pileup. If pileup kept that information we would have use it.

                  Comment

                  • lh3
                    Senior Member
                    • Feb 2008
                    • 686

                    #10
                    mpileup keeps per sample information. this is what it is designed for. the quality of each base is there, so you can also compute a mean. on the other hand, I am not sure how much library information may help. I guess we mostly use one library per sample now.

                    nonetheless, for deep resequencing of a couple of individuals, the model for snp calling is not important. any reasonable snp callers will give sensible results. no need to be bound to samtools. just you should apply the baq thing before snp calling.

                    Comment

                    • aslihan
                      Member
                      • Jun 2011
                      • 23

                      #11
                      Hi Ih3

                      Could you write the command for it? Where I put -D and -S?

                      samtools mpileup -EuDSf *.fa s1.sorted.bam s2.sorted.bam
                      bcftools view -bvcg
                      bcftools/vcfutils.pl varFilter -D100

                      Thanks


                      I have just committed a new revision by which you can compute per-sample DP and SP (phred-scaled strand bias P-value), via the -D and -S options of mpileup. They are not the default behavior. In addition, -S may have performance overhead as Fisher's exact test really takes time.

                      EDIT: for now, I think DP and SP are the most informative ones.

                      Comment

                      Latest Articles

                      Collapse

                      • seqadmin
                        Pathogen Surveillance with Advanced Genomic Tools
                        by seqadmin




                        The COVID-19 pandemic highlighted the need for proactive pathogen surveillance systems. As ongoing threats like avian influenza and newly emerging infections continue to pose risks, researchers are working to improve how quickly and accurately pathogens can be identified and tracked. In a recent SEQanswers webinar, two experts discussed how next-generation sequencing (NGS) and machine learning are shaping efforts to monitor viral variation and trace the origins of infectious...
                        03-24-2025, 11:48 AM
                      • seqadmin
                        New Genomics Tools and Methods Shared at AGBT 2025
                        by seqadmin


                        This year’s Advances in Genome Biology and Technology (AGBT) General Meeting commemorated the 25th anniversary of the event at its original venue on Marco Island, Florida. While this year’s event didn’t include high-profile musical performances, the industry announcements and cutting-edge research still drew the attention of leading scientists.

                        The Headliner
                        The biggest announcement was Roche stepping back into the sequencing platform market. In the years since...
                        03-03-2025, 01:39 PM

                      ad_right_rmr

                      Collapse

                      News

                      Collapse

                      Topics Statistics Last Post
                      Started by seqadmin, 03-20-2025, 05:03 AM
                      0 responses
                      49 views
                      0 reactions
                      Last Post seqadmin  
                      Started by seqadmin, 03-19-2025, 07:27 AM
                      0 responses
                      57 views
                      0 reactions
                      Last Post seqadmin  
                      Started by seqadmin, 03-18-2025, 12:50 PM
                      0 responses
                      50 views
                      0 reactions
                      Last Post seqadmin  
                      Started by seqadmin, 03-03-2025, 01:15 PM
                      0 responses
                      201 views
                      0 reactions
                      Last Post seqadmin  
                      Working...