  • Need advice on whole exome sequence analysis..

    I am entirely new to NextGen Sequencing data analysis and have been working on a project for a week. We are working with whole human exome, 100 bp paired-end data from an Illumina HiSeq system. Base calls are of good quality as assessed by FastQC.

    I am using only open source software. Can you please tell me whether this is a good pipeline for processing raw reads before variant calling?

    Raw reads - Index reference genome with BWA - Align with BWA - sampe with BWA, adding the RG line - SAM to BAM with samtools - Mark and remove PCR duplicates with Picard - RealignerTargetCreator and IndelRealigner with GATK, using the 1000G and Mills and 1000G known indel sets - FixMateInformation with Picard - Count covariates using dbSNP135 and base quality score recalibration with GATK. All with default parameters.

    Is there something you would suggest I modify?
    In your opinion what is the best mutation caller for comparing cancer vs normal exomes, for further processing?

    Any advice would be appreciated.

    Thanks.
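As a rough illustration of the first steps of the pipeline described in the post above, here is a minimal Python sketch that only builds the BWA and samtools command lines (index, aln, sampe with a read group, SAM to BAM). The file names, the sample ID and the `PL:ILLUMINA` read-group field are hypothetical placeholders, not values from the thread. The Picard and GATK steps are left out because their arguments depend on the installed version.

```python
import subprocess


def bwa_steps(ref, fq1, fq2, sample, prefix):
    """Return the BWA/samtools commands for one paired-end sample as argv lists."""
    # Read group line passed to `bwa sampe -r`; the \t escapes are kept literal
    # because BWA expands them itself. ID/SM/PL values here are placeholders.
    read_group = "@RG\\tID:%s\\tSM:%s\\tPL:ILLUMINA" % (sample, sample)
    sai1, sai2 = prefix + "_1.sai", prefix + "_2.sai"
    sam, bam = prefix + ".sam", prefix + ".bam"
    return [
        ["bwa", "index", ref],
        ["bwa", "aln", "-f", sai1, ref, fq1],
        ["bwa", "aln", "-f", sai2, ref, fq2],
        ["bwa", "sampe", "-r", read_group, "-f", sam, ref, sai1, sai2, fq1, fq2],
        ["samtools", "view", "-bS", "-o", bam, sam],
    ]


def run_all(steps):
    """Run each command in order, stopping at the first failure."""
    for cmd in steps:
        subprocess.check_call(cmd)


# Print the commands instead of running them, so they can be reviewed first.
for cmd in bwa_steps("ucsc.hg19.fasta", "P01_1.fastq", "P01_2.fastq", "P01", "P01_normal"):
    print(" ".join(cmd))
```

Printing the commands first makes it easy to check paths before committing a laptop to hours of alignment; `run_all` can then be called on the same list.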

  • #2
    Your pipeline is fine. It is pretty much identical to the one you'll see if you search for "exome sequencing manual". You can use samtools/GATK to call SNPs or look into VarScan; there are lots of options.


    • #3
      Thanks for your reply Heisman.

      I don't think GATK/samtools can analyse tumour-normal pairs, can they? VarScan 2 sounds good, I will check it out. I also found MuTect (beta) from the Broad Institute. Any experience with it?


      • #4
        Just ran MuTect with my data:


        E:\Exome>java -Xmx1g -jar MuTect\mutect.jar --analysis_type MuTect --reference_sequence UCSChg19\ucsc.hg19.fasta -B:cosmic,VCF Mutect\hg19_cosmic.vcf -B:dbsnp,VCF ucschg19\dbsnp_135.hg19.vcf --input_file:normal P01_normal_ready.bam --input_file:tumor P01_cancer_ready.bam --out call_stats.out --coverage_file coverage.wig.txt
        INFO 10:38:26,672 HelpFormatter - ---------------------------------------------------------------------------------
        INFO 10:38:26,682 HelpFormatter - The Genome Analysis Toolkit (GATK) v1.1-37-g5cedb2d, Compiled 2011/09/14 10:01:32
        INFO 10:38:26,683 HelpFormatter - Copyright (c) 2010 The Broad Institute
        INFO 10:38:26,683 HelpFormatter - Please view our documentation at http://www.broadinstitute.org/gsa/wiki
        INFO 10:38:26,683 HelpFormatter - For support, please view our support site at
        INFO 10:38:26,684 HelpFormatter - Program Args: --analysis_type MuTect --reference_sequence UCSChg19\ucsc.hg19.fasta -B:cosmic,VCF Mutect\hg19_cosmic.vcf -B:dbsnp,VCF ucschg19\dbsnp_135.hg19.vcf --input_file:normal P01_normal_ready.bam --input_file:tumor P01_cancer_ready.bam --out call_stats.out --coverage_file coverage.wig.txt
        INFO 10:38:26,684 HelpFormatter - Date/Time: 2012/06/06 10:38:26
        INFO 10:38:26,684 HelpFormatter - ---------------------------------------------------------------------------------
        INFO 10:38:26,686 HelpFormatter - ---------------------------------------------------------------------------------
        INFO 10:38:26,707 GenomeAnalysisEngine - Strictness is SILENT
        INFO 10:38:27,308 RMDTrackBuilder - Loading Tribble index from disk for file Mutect\hg19_cosmic.vcf
        INFO 10:38:27,862 RMDTrackBuilder - Loading Tribble index from disk for file ucschg19\dbsnp_135.hg19.vcf
        ##### ERROR ------------------------------------------------------------------------------------------
        ##### ERROR A USER ERROR has occurred (version 1.1-37-g5cedb2d):
        ##### ERROR The invalid arguments or inputs must be corrected before the GATK can proceed
        ##### ERROR Please do not post this error to the GATK forum
        ##### ERROR
        ##### ERROR See the documentation (rerun with -h) for this tool to view allowable command-line arguments.
        ##### ERROR Visit our wiki for extensive documentation http://www.broadinstitute.org/gsa/wiki
        ##### ERROR Visit our forum to view answers to commonly asked questions http://getsatisfaction.com/gsa
        ##### ERROR
        ##### ERROR MESSAGE: Input files cosmic and reference have incompatible contigs: No overlapping contigs found.
        ##### ERROR cosmic contigs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22]
        ##### ERROR reference contigs = [chrM, chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX, chrY, chr1_gl000191_random, chr1_gl000192_random, chr4_ctg9_hap1, chr4_gl000193_random, chr4_gl000194_random, chr6_apd_hap1, chr6_cox_hap2, chr6_dbb_hap3, chr6_mann_hap4, chr6_mcf_hap5, chr6_qbl_hap6, chr6_ssto_hap7, chr7_gl000195_random, chr8_gl000196_random, chr8_gl000197_random, chr9_gl000198_random, chr9_gl000199_random, chr9_gl000200_random, chr9_gl000201_random, chr11_gl000202_random, chr17_ctg5_hap1, chr17_gl000203_random, chr17_gl000204_random, chr17_gl000205_random, chr17_gl000206_random, chr18_gl000207_random, chr19_gl000208_random, chr19_gl000209_random, chr21_gl000210_random, chrUn_gl000211, chrUn_gl000212, chrUn_gl000213, chrUn_gl000214, chrUn_gl000215, chrUn_gl000216, chrUn_gl000217, chrUn_gl000218, chrUn_gl000219, chrUn_gl000220, chrUn_gl000221, chrUn_gl000222, chrUn_gl000223, chrUn_gl000224, chrUn_gl000225, chrUn_gl000226, chrUn_gl000227, chrUn_gl000228, chrUn_gl000229, chrUn_gl000230, chrUn_gl000231, chrUn_gl000232, chrUn_gl000233, chrUn_gl000234, chrUn_gl000235, chrUn_gl000236, chrUn_gl000237, chrUn_gl000238, chrUn_gl000239, chrUn_gl000240, chrUn_gl000241, chrUn_gl000242, chrUn_gl000243, chrUn_gl000244, chrUn_gl000245, chrUn_gl000246, chrUn_gl000247, chrUn_gl000248, chrUn_gl000249]
        ##### ERROR ------------------------------------------------------------------------------------------

        Any ideas on how to solve this error??


        • #5
          Your reference has chr1, chr2... as contig names and the cosmic VCF has 1, 2... without the "chr". Other than that, GATK could also complain if the set of contigs in one input file doesn't match the other.
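To check which kind of mismatch is involved before changing any files, a small Python sketch like the following can compare two contig name lists. The function names and the three labels it returns are made up for this illustration; the two example lists are shortened versions of the ones printed in the error message.

```python
def strip_chr(name):
    """Return the contig name without a leading 'chr' prefix."""
    return name[3:] if name.startswith("chr") else name


def diagnose_contigs(vcf_contigs, ref_contigs):
    """Classify how two contig name lists relate to each other."""
    if set(vcf_contigs) & set(ref_contigs):
        return "overlap"
    # No exact matches: check whether the names agree once 'chr' is ignored.
    vcf_bare = {strip_chr(c) for c in vcf_contigs}
    ref_bare = {strip_chr(c) for c in ref_contigs}
    if vcf_bare & ref_bare:
        return "prefix mismatch"
    return "unrelated"


# Shortened versions of the lists from the GATK error message.
cosmic = [str(i) for i in range(1, 23)]
reference = ["chrM"] + ["chr%d" % i for i in range(1, 23)] + ["chrX", "chrY"]
print(diagnose_contigs(cosmic, reference))  # prints "prefix mismatch"
```

A "prefix mismatch" result means the two files describe the same chromosomes under different names, so renaming is enough and no realignment is needed.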


          • #6
            Seek Answers - yeah, but how to FIX it? Without starting from scratch, that is.

            I am doing my analysis on a laptop PC, and reprocessing my reads with a new reference genome to match the one that MuTect understands is going to take forever. I am looking for an easy way, if anybody knows one.


            • #7
              You could try modifying the file using sed to get rid of the 'chr' prefix, then converting the result back to BAM:

              samtools view -h Input.bam | sed 's/chr//g' | samtools view -bS - > modified.bam

              Not 100% sure though; you could try it on one chromosome to see if it works.
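A blanket `s/chr//g` edits every field of every line, including read names. A narrower alternative, sketched here in Python under the assumption that the input is SAM text as printed by `samtools view -h`, renames only the `@SQ` header names and the RNAME and RNEXT columns. The function name is hypothetical, and optional tags that embed contig names (for example BWA's XA tag) are deliberately left untouched, so this is not a complete conversion.

```python
def strip_chr_from_sam(lines):
    """Yield SAM lines with the 'chr' prefix removed from reference names only."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if line.startswith("@"):
            if fields[0] == "@SQ":
                # Header sequence entries look like: @SQ  SN:chr1  LN:249250621
                fields = [
                    "SN:" + f[6:] if f.startswith("SN:chr") else f
                    for f in fields
                ]
        else:
            # Column 3 is RNAME and column 7 is RNEXT (0-based indexes 2 and 6).
            # RNEXT may be "=" or "*", which are left as they are.
            for col in (2, 6):
                if fields[col].startswith("chr"):
                    fields[col] = fields[col][3:]
        yield "\t".join(fields) + "\n"
```

Read names, sequences and quality strings pass through unchanged, which is the main difference from the global substitution.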


              • #8
                Why is your reference sequence not the same all the way throughout?

                Whatever you used for the alignment, it only had chr 1-22. So use that reference all the way throughout, rather than switching to a new one that has all those other partial chromosomes.


                • #9
                  swbarnes - My reference has been the same all the way from the beginning. It's the UCSC hg19 build, obtained from ftp://ftp.broadinstitute.org/bundle/1.5/hg19/ (GATK resource bundle). So what I used for alignment did not have only chr 1-22, as you seem to have understood. It had all the haps and randoms too. Everything ran perfectly through my processing pipeline (described in the first post in this thread).

                  But when I attempted to do variant calling with MuTect (beta) this error showed up, because one of the input files that they provided for use with the caller (hg19_cosmic.vcf, as seen in the java command line above) has different contigs, from the reference I used..

                  One way to solve the issue is to redo my pipeline with the reference build (also hg19, strangely) that MuTect provides, which presumably has contigs named 1, 2, 3... 22. But I don't have the computing power to do that without wasting a lot of time..


                  • #10
                    SeekAnswers - that would modify my input files, right? What about the extra contigs? Is there a way to selectively remove the M, X, Y, haps and randoms from .bam files? (Though I doubt that's recommended, even if possible.)


                    • #11
                      Fix the cosmic vcf chromosome nomenclature, don't try to fix your .bam
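One possible way to do this, sketched in Python: prepend "chr" to the CHROM column and to any `##contig` header lines of the VCF, leaving everything else alone. The function names and file paths are hypothetical. This assumes the VCF only uses names such as 1 to 22 (as in the error message above), so a mitochondrial contig named MT would need separate handling. Any index file that already exists next to the old VCF would presumably need to be regenerated for the new one.

```python
def add_chr_prefix(lines):
    """Yield VCF lines with 'chr' prepended to contig names."""
    for line in lines:
        if line.startswith("##contig=<ID="):
            # Header contig declaration, e.g. ##contig=<ID=1,length=249250621>
            if not line.startswith("##contig=<ID=chr"):
                line = line.replace("##contig=<ID=", "##contig=<ID=chr", 1)
            yield line
        elif line.startswith("#") or not line.strip():
            # Other header lines and blank lines pass through unchanged.
            yield line
        else:
            # Data line: CHROM is the first tab-separated column.
            chrom, sep, rest = line.partition("\t")
            if not chrom.startswith("chr"):
                chrom = "chr" + chrom
            yield chrom + sep + rest


def convert_vcf(src_path, dst_path):
    """Write a renamed copy of the VCF at src_path to dst_path."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        dst.writelines(add_chr_prefix(src))
```

For example, `convert_vcf("hg19_cosmic.vcf", "hg19_cosmic.chr.vcf")` would write a renamed copy. Because only a prefix is added, records stay in their original order, so chr1 to chr22 keep the same relative order as in the reference listed in the error.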


                      • #12
                        I am new to NGS. How??


                        • #13
                          Originally posted by shyam_la
                          I am new to NGS. How??
                          I came across the same problem!

                          GATK requires consistency in the reference ordering and names.
                          Use the Broad reference genome for alignments:

                          ftp://ftp.broadinstitute.org/pub/seq...sembly19.fasta

                          I guess you will be fine then!


                          • #14
                            Originally posted by xhyuo
                            I came across the same problem!

                            GATK requires consistency in the reference ordering and names.
                            Use the Broad reference genome for alignments:

                            ftp://ftp.broadinstitute.org/pub/seq...sembly19.fasta

                            I guess you will be fine then!
                            Thank you! I have made considerable progress now. That was a month back!
                            I finally used GRCh37.67 from Ensembl and joined the chromosomes together with "cat", leaving out the haps and randoms, given that they are not particularly useful.
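For anyone starting from a single combined FASTA rather than per-chromosome files, a similar result can be sketched in Python by keeping only the records whose names are on a whitelist. The whitelist below assumes Ensembl-style names (1 to 22, X, Y, MT), and the function name is made up for this example; adjust both to match the actual headers in the file.

```python
# Assumed Ensembl-style primary chromosome names.
PRIMARY = {str(i) for i in range(1, 23)} | {"X", "Y", "MT"}


def keep_primary_contigs(lines, wanted):
    """Yield FASTA lines belonging to records whose name is in `wanted`."""
    keep = False
    for line in lines:
        if line.startswith(">"):
            # The record name is the first whitespace-delimited token.
            tokens = line[1:].split()
            keep = bool(tokens) and tokens[0] in wanted
        if keep:
            yield line
```

Records come out in the same order as in the input, so the chromosome order of the source file is preserved.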
