Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Pipeline to find somatic mutations

    Hi, I am trying to analyze Illumina paired-end reads of a tumor/normal study with two lanes per sample (i.e. normal.lane1, normal.lane2, tumor.lane1, tumor.lane2) to assess somatic mutations. My intention is to work with a single bam per individual as recommended in the GATK best practice variant detection in tumor-normal studies (http://www.broadinstitute.org/gsa/wi...th_the_GATK_v2).

    Therefore, the workflow I am trying to perform is the following:
    Code:
        For each reads_file (lane):
    
    - alignment using Bfast bwaaln / localalign / postprocess
    - sam to bam using samtools view
    - add read groups using picard AddOrReplaceReadGroups
    - remove duplicates using picard RemoveDuplicates
    Then, the bams of the two samples are merged:

    Code:
        sample.bam <- picard MergeSamFile (tumor.lane1.bam, tumor.lane2.bam, normal.lane1.bam, normal.lane2.bam)
    On this single bam (which contains reads from the 4 lanes and also 4 lines of @RG readgroup annotations in the header), I performed two more processing steps:

    Code:
        realigned.bam <- GATK_realign(individual.bam)
        recal.bam <- GATK_recalibrate(realigned.bam)
    Then I called the variants:

    Code:
    GATK_call_for_variants(recal.bam) using the GATK UnifiedGenotyper and VariantFiltration

    I used the --genotypeFilterExpression argument in the GATK_VariantFiltration step to have results at sample level, so what I've obtained by using a dummy example with a few reads per sample in the resulting vcf file looks like:

    Code:
    #CHROM    POS        ID REF  ALT  QUAL FILTER    INFO           FORMAT                    normal                 tumor
    chr1    115401136    .    G    A    40.14    PASS      (..)    GT:AD:DP:GQ:PL    1/1:0,1:1:3:29,3,0    1/1:0,2:2:3.01:44,3,0
    chr2    207625567    .    T    C    32.79    PASS      (..)    GT:AD:DP:GQ:PL    ./.		     1/1:0,2:2:6.01:64,6,0
    Therefore, ignoring the fact that the number of reads supporting the evidence is low in this case, could I interpret that the first SNP is a germline mutation (the same GT is present in both normal and tumor samples of the individual) whereas the second is a somatic mutation (the variant is present only in the normal sample)?

    I'm quite unsure how is the best way to interpret these results, also I'm not convinced whether the workflow I've used is the optimal.

    Any help/suggestion will be really appreciated!
    cheers
    David

  • #2
    Three suggestions:

    a) map your reads to:

    ftp://ftp-trace.ncbi.nih.gov/1000gen...mbly_sequence/

    b) use two distinct mappers (e.g. bwaaln and bwa-sw), call mutations separately and then take the intersection.

    c) (optional) you may consider samtools contrast caller.

    Comment


    • #3
      I think the question was more referring to how to discover the differences between paired normal and tumor samples from one single patient.

      The multi-sample vcf allows the following information:

      - It lists the variant positions and changes detected in any of the two samples.
      - In the columns after the FORMAT column, normal and tumor in the example, it displays the stats for each of the variants in the two samples. Obviously, there are variants that are recorded in both normal and tumor, and there are also those that are recorded in the normal sample only and not in the tumor and vice versa:
      - Variants present in both "normal" and "tumoral" may be interpreted as "germline" variants that are present in all cells of the patient, tumor or not.
      - Then, we can obtain a rough selection of "tumor-specific" variants (those that are recorded only in "tumor").
      - Variants that appear in "normal" only, may point at deletions of this region in the tumor sample, given sufficient read coverage at the region of interest.

      However, the multi-sample vcf does not give the statistical information described in FORMAT for the respective sample that does NOT contain the variant (entry is ./. ).

      Maybe it would help to run the Unified Genotyper without distinguishing between the two samples, in order to get the overall genotype stats, but I am not sure about that.

      Any opinion on / experience with that?

      Comment


      • #4
        There are my command lines for calling snps and indels with GATK. It did work well for me and might be useful for you.

        bwa aln -t 4 ucsc.hg19 /home/li/prostate_cancer_RNA/ERR031030_1.fastq > ERR031030_1.sai

        bwa aln -t 4 ucsc.hg19 /home/li/prostate_cancer_RNA/ERR031030_2.fastq > ERR031030_2.sai

        bwa sampe -r "@RG\tID:ERR031030\tLB:ERR031030\tSM:ERR031030\tPL:ILLUMINA" ucsc.hg19 ERR031030_1.sai ERR031030_2.sai /home/li/prostate_cancer_RNA/ERR031030_1.fastq /home/li/prostate_cancer_RNA/ERR031030_2.fastq > ERR031030.sam

        java -Xmx8g -Djava.io.tmpdir=/tmp -jar /home/li/prostate_cancer_RNA/picard-tools-1.82/SortSam.jar SO=coordinate INPUT=ERR031030.sam OUTPUT=ERR031030.bam VALIDATION_STRINGENCY=LENIENT CREATE_INDEX=true

        java -Xmx8g -Djava.io.tmpdir=/tmp -jar /home/li/prostate_cancer_RNA/picard-tools-1.82/MarkDuplicates.jar INPUT=ERR031030.bam OUTPUT=ERR031030.marked.bam METRICS_FILE=metrics CREATE_INDEX=true VALIDATION_STRINGENCY=LENIENT

        java -Xmx8g -jar GenomeAnalysisTK-2.3-9-ge5ebf34/GenomeAnalysisTK.jar -T RealignerTargetCreator -R ucsc.hg19.fasta -o ERR031030.bam.list -I ERR031030.marked.bam
        java -Xmx8g -Djava.io.tmpdir=/tmp -jar GenomeAnalysisTK-2.3-9-ge5ebf34/GenomeAnalysisTK.jar -I ERR031030.marked.bam -R ucsc.hg19.fasta -T IndelRealigner -targetIntervals ERR031030.bam.list -o ERR031030.marked.realigned.bam

        java -Xmx8g -Djava.io.tmpdir=/tmp -jar /home/li/prostate_cancer_RNA/picard-tools-1.82/FixMateInformation.jar INPUT=ERR031030.marked.realigned.bam OUTPUT=ERR031030.marked.realigned.fixed.bam SO=coordinate VALIDATION_STRINGENCY=LENIENT CREATE_INDEX=true

        java -Xmx8g -jar GenomeAnalysisTK-2.3-9-ge5ebf34/GenomeAnalysisTK.jar -T BaseRecalibrator -R ucsc.hg19.fasta -I ERR031030.marked.realigned.fixed.bam -knownSites Mills_and_1000G_gold_standard.indels.hg19.vcf -knownSites 1000G_phase1.indels.hg19.vcf -knownSites dbsnp_135.hg19.vcf -cov ReadGroupCovariate -cov QualityScoreCovariate -cov CycleCovariate -cov ContextCovariate -o ERR031030.recal_data.grp

        java -Xmx8g -jar GenomeAnalysisTK-2.3-9-ge5ebf34/GenomeAnalysisTK.jar -T PrintReads -R ucsc.hg19.fasta -I ERR031030.marked.realigned.fixed.bam -BQSR ERR031030.recal_data.grp --out ERR031030.marked.realigned.fixed.recal.bam

        java -Xmx16g -jar GenomeAnalysisTK-2.3-9-ge5ebf34/GenomeAnalysisTK.jar -glm BOTH -R ucsc.hg19.fasta -T UnifiedGenotyper -I ERR031029.marked.realigned.fixed.recal.bam -I ERR031030.marked.realigned.fixed.recal.bam -D dbsnp_135.hg19.vcf -o ERR031030.raw.snps.indels.vcf -metrics snps.metrics -stand_call_conf 50.0 -stand_emit_conf 10.0 -dcov 1000

        java -Xmx8g -jar GenomeAnalysisTK-2.3-9-ge5ebf34/GenomeAnalysisTK.jar -R ucsc.hg19.fasta -T SelectVariants --variant ERR031030.raw.snps.indels.vcf -o ERR031030.snpsonly.vcf -selectType SNP

        java -Xmx8g -jar GenomeAnalysisTK-2.3-9-ge5ebf34/GenomeAnalysisTK.jar -R ucsc.hg19.fasta -T SelectVariants --variant ERR031030.raw.snps.indels.vcf -o ERR031030.indelsonly.vcf -selectType INDEL

        java -Xmx8g -jar GenomeAnalysisTK-2.3-9-ge5ebf34/GenomeAnalysisTK.jar -T VariantRecalibrator -R ucsc.hg19.fasta -input ERR031030.snpsonly.vcf -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.hg19.vcf -resourcemni,known=false,training=true,truth=false,prior=12.0 1000G_phase1.indels.hg19.vcf -resource:dbsnp,known=true,training=false,truth=false,prior=6.0 dbsnp_135.hg19.vcf -an QD -an HaplotypeScore -an MQRankSum -an ReadPosRankSum -an MQ -mode SNP -recalFile ERR031030.snp.recal.vcf -tranchesFile ERR031030.snp.tranches.vcf -rscriptFile ERR031030.plots.R

        java -Xmx8g -jar GenomeAnalysisTK-2.3-9-ge5ebf34/GenomeAnalysisTK.jar -R ucsc.hg19.fasta -T ApplyRecalibration -input ERR031030.snpsonly.vcf -tranchesFile ERR031030.snp.tranches.vcf -recalFile ERR031030.snp.recal.vcf -o ERR031030.snps.filtered.vcf

        java -Xmx16g -jar GenomeAnalysisTK-2.3-9-ge5ebf34/GenomeAnalysisTK.jar -R ucsc.hg19.fasta -T VariantFiltration --variant ERR031030.snps.filtered.vcf -o ERR031030.final.filtered.vcf --filterName "Nov28filters && QD < 2.0 && ReadPosRankSum < -8.0 && MQ < 40.0 && FS > 60.0 && MQRandkSum < -12.5" --filterExpression "HaplotypeScore > 13.0"

        By the way, to obtain a rough selection of "tumor-specific" variants (those that are recorded only in "tumor"), what command line do you use? any suggestion?

        Comment


        • #5
          Hi,

          I'm far (honestly) from being an expert in this issue, but as far as I know, if you perform the mutation call for the tumor sample and for the normal sample separately, once you have the final set of mutations for each then you simply substract one from the other to get the somatic ones.

          Of course, there is the option to use a variant caller which is already intended to deal with normal/tumor sample experiments. I've used Varscan, and seems pretty good for SNPs (not so much for indels), altough as I told you I'm not an expert in the field and for instance I was not able to use some of the filters that the second release of the tool made available. I also heard good things about SomaticSniper, but I did not use this one. At the end of the day, I think that this issue is far from being solved, and what many people do is to combine the results of several methods to obtain a list of highly confident somatic mutations.

          Hope it helps,
          best
          David

          Comment


          • #6
            Doing variant calls separately and then looking at the different will produce a large number of false positives. You really should be using somatic variant callers like MuTect or SomaticSniper that look at both the tumor and normal samples at the same time.

            Comment


            • #7
              Originally posted by id0 View Post
              Doing variant calls separately and then looking at the different will produce a large number of false positives. You really should be using somatic variant callers like MuTect or SomaticSniper that look at both the tumor and normal samples at the same time.
              Do I still need to use two aligners as suggested by lh3?

              Comment

              Latest Articles

              Collapse

              • seqadmin
                Strategies for Sequencing Challenging Samples
                by seqadmin


                Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                03-22-2024, 06:39 AM
              • seqadmin
                Techniques and Challenges in Conservation Genomics
                by seqadmin



                The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                Avian Conservation
                Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                03-08-2024, 10:41 AM

              ad_right_rmr

              Collapse

              News

              Collapse

              Topics Statistics Last Post
              Started by seqadmin, Yesterday, 06:37 PM
              0 responses
              10 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, Yesterday, 06:07 PM
              0 responses
              9 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, 03-22-2024, 10:03 AM
              0 responses
              50 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, 03-21-2024, 07:32 AM
              0 responses
              67 views
              0 likes
              Last Post seqadmin  
              Working...
              X