Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Combining VCF files using GATK?

    Here is my task. I have 4 treatment groups containing 5 replicates in each group.

    For each treatment group, I am looking for at least 2 replicates that have the same base substitution.

    So far I have generated a vcf file for each replicate in each treatment group.

    1. What is the best way to combined these files so that I have 4 vcf files, 1 for each treatment group that contains 5 replicates?

    2. What is the best way to keep track of the position of sites where at least 2 replicates showed the same base substitution so that I can produce results once the gene annotation is complete?

  • #2


    If you want to only show variants in 2+ duplicates, use "--minimumN 2" option.

    Comment


    • #3
      The issue seems to be that when I combine all 5 vcf files for the replicates, it shows a replicate at 1 site, but shows a '.' for the other 4.

      I'm not really sure how to go about this.

      Comment


      • #4
        Originally posted by prs321 View Post
        The issue seems to be that when I combine all 5 vcf files for the replicates, it shows a replicate at 1 site, but shows a '.' for the other 4.

        I'm not really sure how to go about this.
        Does that mean only 1 replicate has that called variant?

        Comment


        • #5
          Yes so for example I have 5 alignment files: rep1.bam, rep2.bam, rep3.bam, rep4.bam, rep5.bam

          All of them have been processed (markduplicates, addreadgroups etc).

          Then I called snps on each alignment to produce the following: rep1.vcf, rep2.vcf, rep3.vcf, rep4.vcf, rep5.vcf

          Comment


          • #6
            Originally posted by prs321 View Post
            Yes so for example I have 5 alignment files: rep1.bam, rep2.bam, rep3.bam, rep4.bam, rep5.bam

            All of them have been processed (markduplicates, addreadgroups etc).

            Then I called snps on each alignment to produce the following: rep1.vcf, rep2.vcf, rep3.vcf, rep4.vcf, rep5.vcf

            What will this give you?

            Code:
            java -Xmx2g -jar GenomeAnalysisTK.jar \
               -R ref.fasta \
               -T CombineVariants \
               -minN 2 \
               --variant rep1.vcf \
               --variant rep2.vcf \
               --variant rep3.vcf \
               --variant rep4.vcf \
               --variant rep5.vcf \
               -o combined.output.vcf

            Comment


            • #7
              0/0:6:21:24,0,100:44:299:15 . . . .

              Comment


              • #8
                There is only one such column? If you combine 5 samples (replicates), there should be 5 such columns.
                If not, make sure the header for the sample info are different in each of the 5 vcf files. Otherwise GATK thinks it's the same sample (i.e., same replicate in this case).
                0/0 means reference calls in both copies.
                Last edited by lethalfang; 05-22-2014, 07:33 AM.

                Comment


                • #9
                  There are actually 5 columns, but only the first column is listed with the genotype. The other 4 have a '.' under them.

                  Comment


                  • #10
                    Originally posted by prs321 View Post
                    There are actually 5 columns, but only the first column is listed with the genotype. The other 4 have a '.' under them.
                    Can you show me a few rows of the output?

                    Comment


                    • #11
                      #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT cs1 cs2 cs3 cs4 cs5
                      NODE_4_length_282_cov_38.510639 66 . G A 169.04 . AC=1;AF=0.500;AN=2;CIGAR=1X;DP=66;DPRA=0;GTI=0;LEN=1;MQM=70;MQMR=70;NS=1;NUMALT=1;PAIREDR=1;PAO=0;PQA=0;PQR=0;PRO=0;RUN=1;TYPE=snp;set=Intersection GT:AOP:PL:QA:QR:RO 0/1:9:19:100,0
                      ,100:271:306:10 . . . .
                      NODE_4_length_282_cov_38.510639 99 . C T 214.21 . AC=1;AF=0.500;AN=2;CIGAR=1X;DP=92;DPRA=0;GTI=0;LEN=1;MQM=70;MQMR=70;NS=1;NUMALT=1;PAIREDR=1;PAO=0;PQA=0;PQR=0;PRO=0;RUN=1;TYPE=snp;set=Intersection GT:AOP:PL:QA:QR:RO 0/1:12:24:100,
                      0,100:319:364:11 . . . .
                      NODE_4_length_282_cov_38.510639 114 . A G 346.87 . AC=1;AF=0.500;AN=2;CIGAR=1X;DP=102;DPRA=0;GTI=0;LEN=1;MEANALT=1;MQM=70;MQMR=70;NS=1;NUMALT=1;PAIREDR=1;PAO=0;PQA=0;PQR=0;PRO=0;RUN=1;TYPE=snp;set=Intersection GT:AOP:PL:QA:QR:RO 0/1:14
                      :26:100,0,100:475:413:12 . . . .
                      NODE_4_length_282_cov_38.510639 154 . G A 452.29 . AC=1;AF=0.500;AN=2;CIGAR=1X;DP=123;DPRA=0;GTI=0;LEN=1;MEANALT=1;MQM=70;MQMR=70;NS=1;NUMALT=1;PAIREDR=1;PAO=0;PQA=0;PQR=0;PRO=0;RUN=1;TYPE=snp;set=Intersection GT:AOP:PL:QA:QR:RO 0/1:18
                      :31:100,0,100:578:467:13 . . . .
                      NODE_4_length_282_cov_38.510639 247 . T G 355.03 . AC=1;AF=0.500;AN=2;CIGAR=1X;DP=131;DPRA=0;GTI=0;LEN=1;MEANALT=1;MQM=70;MQMR=70;NS=1;NUMALT=1;PAIRED=1;PAIREDR=1;PAO=0;PQA=0;PQR=0;PRO=0;RUN=1;TYPE=snp;set=Intersection GT:AOP:PL:QA:QR:RO
                      0/1:14:32:100,0,100:484:576:18 . . . .
                      NODE_9_length_16257_cov_15.910501 5514 . C A 248.89 . AB=0;ABP=0;AC=2;AF=1.00;AN=2;CIGAR=1X;DP=68;DPRA=0;EPPR=0;GTI=0;LEN=1;MEANALT=1;MQM=70;MQMR=0;NS=1;NUMALT=1;PAIRED=1;PAIREDR=0;PAO=0;PQA=0;PQR=0;PRO=0;QR=0;RO=0;RPPR=0;RUN=1;SRF=0;SR
                      P=0;SRR=0;TYPE=snp;set=Intersection GT:AOP:PL:QA:QR:RO 1/1:11:11:100,33,0:341:0:0 . . . .
                      NODE_9_length_16257_cov_15.910501 9753 . T G 0 . AB=0;ABP=0;AC=0;AF=0.00;AN=2;CIGAR=1X;DP=45;DPRA=0;GTI=0;LEN=1;MEANALT=1;MQM=70;MQMR=70;NS=1;NUMALT=1;PAIRED=1;PAIREDR=1;PAO=0;PQA=0;PQR=0;PRO=0;RUN=1;SAR=0;TYPE=snp;set=cs1-cs4
                      GT:AOP:PL:QA:QR:RO 0/0:6:28:7,0,100:39:587:22 . . . .
                      NODE_9_length_16257_cov_15.910501 12869 . T G 0 . AB=0;ABP=0;AC=0;AF=0.00;AN=2;CIGAR=1X;DP=85;DPRA=0;GTI=0;LEN=1;MEANALT=1;MQM=70;MQMR=70;NS=1;NUMALT=1;PAIRED=1;PAIREDR=1;PAO=0;PQA=0;PQR=0;PRO=0;RUN=1;TYPE=snp;set=cs1-cs2-cs3 GT:AO:
                      DP:PL:QA:QR:RO 0/0:8:28:34,0,100:59:485:20 . . . .
                      NODE_9_length_16257_cov_15.910501 14192 . T G 0 . AB=0;ABP=0;AC=0;AF=0.00;AN=2;CIGAR=1X;DP=37;DPRA=0;GTI=0;LEN=1;MQM=70;MQMR=70;NS=1;NUMALT=1;PAIRED=1;PAIREDR=1;PAO=0;PQA=0;PQR=0;PRO=0;RUN=1;SAR=0;TYPE=snp;set=cs2-cs4 GT:AOP:PL:QA
                      :QR:RO . 0/0:4:19:1,0,100:24:385:15 . . .
                      NODE_9_length_16257_cov_15.910501 14201 . A G 0 . AB=0;ABP=0;AC=0;AF=0.00;AN=2;CIGAR=1X;DP=33;DPRA=0;GTI=0;LEN=1;MEANALT=1;MQM=70;MQMR=70;NS=1;NUMALT=1;PAIRED=1;PAIREDR=1;PAO=0;PQA=0;PQR=0;PRO=0;RUN=1;SAR=0;TYPE=snp;set=cs2-cs3
                      GT:AOP:PL:QA:QR:RO . 0/0:4:17:5,0,100:24:385:13 . . .
                      NODE_9_length_16257_cov_15.910501 14240 . T G 0 . AB=0;ABP=0;AC=0;AF=0.00;AN=2;AO=4;CIGAR=1X;DP=36;DPB=18;DPRA=0;EPP=11.6962;EPPR=3.63072;GTI=0;LEN=1;MEANALT=1;MQM=70;MQMR=70;NS=1;NUMALT=1;PAIRED=1;PAIREDR=1;PAO=0;PQA=0;PQR=0;PRO=0;
                      RO=14;RPP=11.6962;RUN=1;SAF=4;SAP=11.6962;SAR=0;TYPE=snp;set=cs2-cs4 GT:AOP:PL:QA:QR:RO . 0/0:4:18:3,0,100:24:383:14 . . .
                      NODE_9_length_16257_cov_15.910501 14261 . T G 0 . AB=0;ABP=0;AC=0;AF=0.00;AN=2;CIGAR=1X;DP=36;DPRA=0;GTI=0;LEN=1;MEANALT=1;MQM=70;MQMR=70;NS=1;NUMALT=1;PAIRED=1;PAIREDR=1;PAO=0;PQA=0;PQR=0;PRO=0;QA=44;RUN=1;SAR=1;TYPE=snp;set=cs1-cs
                      3 GT:AOP:PL:QA:QR:RO 0/0:6:21:24,0,100:44:299:15 . . . .
                      NODE_9_length_16257_cov_15.910501 14263 . A G 0.01 . AB=0;ABP=0;AC=0;AF=0.00;AN=2;CIGAR=1X;DP=27;DPRA=0;GTI=0;LEN=1;MEANALT=1;MQM=70;MQMR=70;NS=1;NUMALT=1;PAIRED=1;PAIREDR=1;PAO=0;PQA=0;PQR=0;PRO=0;RUN=1;SAF=1;TYPE=snp;set=cs3-cs4
                      GT:AOP:PL:QA:QR:RO . . 0/0:5:15:22,0,61:35:78:10 . .

                      Comment


                      • #12
                        You may have mostly positions that are only called in one single replicates.

                        Try this as a test, just to extract only those positions that exist in all 5 replicates, to see if things are working as intended.

                        Code:
                        java -Xmx2g -jar GenomeAnalysisTK.jar \
                           -R ref.fasta \
                           -T CombineVariants \
                           -minN 5 \
                           --variant rep1.vcf \
                           --variant rep2.vcf \
                           --variant rep3.vcf \
                           --variant rep4.vcf \
                           --variant rep5.vcf \
                           -o combined.output.vcf

                        Comment


                        • #13
                          Thank you, I will try it out. Appreciate it.

                          Comment

                          Latest Articles

                          Collapse

                          • seqadmin
                            Recent Advances in Sequencing Analysis Tools
                            by seqadmin


                            The sequencing world is rapidly changing due to declining costs, enhanced accuracies, and the advent of newer, cutting-edge instruments. Equally important to these developments are improvements in sequencing analysis, a process that converts vast amounts of raw data into a comprehensible and meaningful form. This complex task requires expertise and the right analysis tools. In this article, we highlight the progress and innovation in sequencing analysis by reviewing several of the...
                            05-06-2024, 07:48 AM
                          • seqadmin
                            Essential Discoveries and Tools in Epitranscriptomics
                            by seqadmin




                            The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
                            04-22-2024, 07:01 AM

                          ad_right_rmr

                          Collapse

                          News

                          Collapse

                          Topics Statistics Last Post
                          Started by seqadmin, 05-10-2024, 06:35 AM
                          0 responses
                          16 views
                          0 likes
                          Last Post seqadmin  
                          Started by seqadmin, 05-09-2024, 02:46 PM
                          0 responses
                          21 views
                          0 likes
                          Last Post seqadmin  
                          Started by seqadmin, 05-07-2024, 06:57 AM
                          0 responses
                          19 views
                          0 likes
                          Last Post seqadmin  
                          Started by seqadmin, 05-06-2024, 07:17 AM
                          0 responses
                          21 views
                          0 likes
                          Last Post seqadmin  
                          Working...
                          X