Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • #61
    Hi Mali,
    The 1000Genomes people tend to put many samples in a single BAM file separated through their Readgroup tags. GATK states VariantRecalibration needs a lot of SNPs to work smoothly, so since we analyse very few Exomes I had to switch to VariantFiltrator.
    I'd do the analysis separately for each sample and look afterwards to combine the data.
    In case you're searching for a disease causing gene in a set of patients I'd recommend to have a look at IBD2 (http://compbio.charite.de/contao/ind...cles/xibd.html). It could extract regions with same haplotypes for different patients. THat decreases the amount of SNPs drastically...

    You're very welcome,
    Peter

    Comment


    • #62
      Wow, thanks a lot Peter for the helpful information
      Mali

      Comment


      • #63
        Hi Peter and everybody

        First of all. I want to thank you for this excellent pipeline and all explanations. I downloaded the pdf you gave in the first post and was able to reproduce the steps therein until arriving to the section 2.6 "Quality score recalibration" where i
        am now stopped and would really appreciate if you or anybody else could clarify me some questions.

        In this section 2.6. you suggest to run the Step Count covariates of GATK using the option --DBSNP with the lastest SNP release version of UCSC (in your example dbsnp132.txt i took snp135 from UCSC). About this i am wondered about how to resolve the following troubles I met in order to keep going the protocol:


        1) In the version (i think the last one) i have of GATK the option -DBSNP does not exist. I saw the options and check threads and i think that the option DBSNP is now called "knownSites". I guess so could you confirm me that this is correct?

        2) "KnownSites" let me to run Count covariates but it gives me the error above, asking me to use a SNP database in the VCF format of the 1000 genomes project.

        ------------------------------------------------------------------------------------------------------------
        ~/assembling/tools/gatk/GenomeAnalysisTK-1.4-14-g2e47336> java -Xmx4g -jar GenomeAnalysisTK.jar -l INFO -R hg19ss.fasta -knownSites:snp135.txt -I illumina.marked.realigned.fixed.bam -T CountCovariates -cov ReadGroupCovariate -cov QualityScoreCovariate -cov CycleCovariate -cov DinucCovariate -recalFile SNP.recal_data.csv
        INFO 17:45:22,015 HelpFormatter - ---------------------------------------------------------------------------------
        INFO 17:45:22,016 HelpFormatter - The Genome Analysis Toolkit (GATK) v1.4-14-g2e47336, Compiled 2012/01/11 11:42:24
        INFO 17:45:22,016 HelpFormatter - Copyright (c) 2010 The Broad Institute
        INFO 17:45:22,016 HelpFormatter - Please view our documentation at http://www.broadinstitute.org/gsa/wiki
        INFO 17:45:22,016 HelpFormatter - For support, please view our support site at http://getsatisfaction.com/gsa
        INFO 17:45:22,017 HelpFormatter - Program Args: -l INFO -R hg19ss.fasta -knownSites:snp135.txt -I illumina.marked.realigned.fixed.bam -T CountCovariates -cov ReadGroupCovariate -cov QualityScoreCovariate -cov CycleCovariate -cov DinucCovariate -recalFile SNP.recal_data.csv
        INFO 17:45:22,017 HelpFormatter - Date/Time: 2012/02/03 17:45:22
        INFO 17:45:22,017 HelpFormatter - ---------------------------------------------------------------------------------
        INFO 17:45:22,017 HelpFormatter - ---------------------------------------------------------------------------------
        INFO 17:45:22,027 GenomeAnalysisEngine - Strictness is SILENT
        INFO 17:45:22,059 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
        INFO 17:45:22,071 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.01
        INFO 17:45:23,776 GATKRunReport - Uploaded run statistics report to AWS S3
        ##### ERROR ------------------------------------------------------------------------------------------
        ##### ERROR A USER ERROR has occurred (version 1.4-14-g2e47336):
        ##### ERROR The invalid arguments or inputs must be corrected before the GATK can proceed
        ##### ERROR Please do not post this error to the GATK forum
        ##### ERROR
        ##### ERROR See the documentation (rerun with -h) for this tool to view allowable command-line arguments.
        ##### ERROR Visit our wiki for extensive documentation http://www.broadinstitute.org/gsa/wiki
        ##### ERROR Visit our forum to view answers to commonly asked questions http://getsatisfaction.com/gsa
        ##### ERROR
        ##### ERROR MESSAGE: Invalid command line: This calculation is critically dependent on being able to skip over known variant sites. Please provide a VCF file containing known sites of genetic variation.
        ##### ERROR ------------------------------------------------------------------------------------------

        -----------------------------------------------------------------------------------------------------------------

        3) My question (but only if my question to step 1 is that i was doing the correct think) here is about if UCSC does provide any tool to convert plain files to VCF or if is there any conversor available to convert UCSC SNPs DBs into VCF?

        4) If there is not a conversor from UCSC to VCF i think I could get the dbsnp from the NCBI FTP but now the question is if i do that could I use this NCBI dbSNP over exome data mapped over the hg19 of the UCSC?? or i should repeat the whole protocol
        from the beginning using this time the human genome downloaded from NCBI as a reference?

        thank you in advance
        Carlos

        Comment


        • #64
          Hi Carlos,

          1)The dbSNP to knownsites change is correct as you figured out the thread is using an older version than the current GATK 1.4.x

          2)As you notice the use of properly formated VCF files is pretty uniform now in the GATK so I would move to using those files.

          3)You should not have an issue using a VCF from NCBI though you may need to modify to match by adding chr to each chromosome postion. Anyone that follows my threads would see I'd strongly recommend not using UCSC based references or annotations, remember GATK is largely related to 1000 genomes which uses Sanger accepted references (ie. GRCh37).

          4) Most of the GATK steps will work better if you use the provided files on the GSA ftp site as they are all formatted correctly though they expect reference files from NCBI,ensembl,1000 genomes. However, last I check they were dbSNP132 versions.

          5) If you are interested we have posted a pipeline that implements most of the suggestions on this thread less a couple of changes to match the current GATK best practice guidelines for exomes. (http://seqanswers.com/forums/showthr...?t=4589&page=4) or (www.keatslab.org)

          Comment


          • #65
            Hi Jon
            Thank you for your prompt and instructive answer. I going to check the posts and links you indicate here.

            Comment


            • #66
              The GS ftp site of GATK is just exactly what i needed. thank you Jon.

              Comment


              • #67
                thanks for the great manual.

                I have a short question:
                I'm analysing Nimblegen exome and not sure which bed file I have to use for SNP calling. The bed file from nimblegen is not proper format compared to the bed file that Broad inst. provides. Otherwise, can I just use the bed file that the manual recommends (10bp interval target bed file from ucsc)?

                Comment


                • #68
                  sehrrot, you can also download it from the original Nimblegen website:

                  Just open "Design and annotation files", and version 2 of the BED file should be there for download. Make sure also that the Nimblegen kit is the same, because they recently started selling version 3
                  "Though it may seem that all's been said and done, originality still lives on" - some unoriginal guy who had nothing better to write as his signature

                  Comment


                  • #69
                    Thanks Orr

                    I've tried that with V2 bed file but got an error message

                    File associated with name ../../apps/annotations/Design_Annotation_files/Target_Regions/SeqCap_EZ_Exome_v2.bed is malformed: Couldn't parse line 'track name=target_region description="Target Regions
                    I cannot exactly understand why the format of bed files is different - ucsc bed file and nimblegen one.

                    Comment


                    • #70
                      What software do you use to integrate the BED file into the analysis? I'm using samtools, and it works fine.

                      Did you get a BED file for the Nimblegen kit from the UCSC website? Can you give me a link to it? I don't know how the BED file from UCSC looks like, but the Nimblegen one starts with this line:

                      track name=target_region description="Target Regions"

                      Also, it has 2 sections: section 1 indicates positions of enrichment regions, and section 2 indicates complete probe regions (so enrichments regions + a few bases before and after them). So it could be that your software cannot deal with the header in the second section.
                      "Though it may seem that all's been said and done, originality still lives on" - some unoriginal guy who had nothing better to write as his signature

                      Comment


                      • #71
                        Sehrrot
                        Try to see if your bed file has header and if so remove it and try to run the command again. I had the same bug report and doing so it worked succesfully.
                        Carlos

                        Comment


                        • #72
                          Hello,

                          Really helpful manual! Nice work!

                          I saw that you are working with Illumina data. I was wondering if you clipped of the 3' adaptor sequence? If so, with which tool?

                          Comment


                          • #73
                            thanks a lot. Up to now, there was no need to remove adaptor sequences from the exome-sequencing data we got. However, for other experiments I was using the cutadapt tool (http://code.google.com/p/cutadapt/). Many people use the FastX-Toolkit, which somehow didn't work for my data (from the MiSeq)...

                            Comment


                            • #74
                              Thanks for the fast reply!

                              I do have another question: I downloaded the newest UCSC release of the human genome. 7When I unpack this .tar.gz file there are a lot of different files called f.e. chr17_gl000 ...

                              Should I also include these in the single genome reference file?

                              Comment


                              • #75
                                That's a good question. Those files correspnd to contigs that have not been reliably placed on a chromosome due to problematic sequences. I personally just use chr1-22 X, Y and M. There should be some chrUn_... contigs as well, they have not been associated with a chromosome so far. It depends a bit what you want to do...
                                However, I am not sure if GATK is very happy with those chromosome files. I never tried it out.

                                Another thing to consider would be the alternative haplotypes for thr MHC and the other two regions. I am not sure how much they deviate from the original reference sequence, or if there is a good reason for including them.

                                Lots of things to consider, but for getting started I'd recommend to use only the chr1-22, X, Y and M(T) chromosomes (in this order)

                                Comment

                                Latest Articles

                                Collapse

                                • seqadmin
                                  Current Approaches to Protein Sequencing
                                  by seqadmin


                                  Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                                  04-04-2024, 04:25 PM
                                • seqadmin
                                  Strategies for Sequencing Challenging Samples
                                  by seqadmin


                                  Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                  03-22-2024, 06:39 AM

                                ad_right_rmr

                                Collapse

                                News

                                Collapse

                                Topics Statistics Last Post
                                Started by seqadmin, 04-11-2024, 12:08 PM
                                0 responses
                                27 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 04-10-2024, 10:19 PM
                                0 responses
                                30 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 04-10-2024, 09:21 AM
                                0 responses
                                26 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 04-04-2024, 09:00 AM
                                0 responses
                                52 views
                                0 likes
                                Last Post seqadmin  
                                Working...
                                X