Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Pipeline to simulate short reads and call indels

    Hi

    http://biotechies.blogspot.com/2011/...ct-indels.html is a kind of howto for doing the following pipeline:

    1. simulate short reads (wgsim)

    2. align short reads
    2.1. using BWA
    2.2. using Bowtie2

    3. call indels
    3.1. using GATK UnifiedGenotyper
    3.2. using Dindel

    I initially started to write it for my own usage to remember command calls but it may help others since I've spent a lot of time discovering software parameters and pipelines.

    Please help me to correct and complete the post as much as possible. Any suggestion is very welcome too.

    Regards.
    @pascal_biotech

  • #2
    Some of those commands are not as useful for real life problems, so maybe you want to add alternate commands, for once you really start working.

    For starters, you may want to note that the default bwa index won't work on a genome > 2 Gb. You need to add "-a bwtsw" to the command. So I think chr 20 will index alright, but the whole genome will not. And in real life, most people have access to computers with a few processors, so they'd add a "-t 2" or whatever to the bwa aln command.

    In real life, .sam files are just too big. If you pipe the command that makes the .sam file straight into samtools view, you won't make a .sam file, just the much smaller .bam file. You can use samtools view to convert part of the .bam back into a .sam if you want to look at the raw alignments. And I'd add -h to the samtools view command, to make sure the headers are carried on. You need them for some commands to work.

    So the bwa/samtools command woud be:

    bwa sampe -r '@RG\tID:foo\tSM:bar' human_g1k_v37_chr20.fasta out1.sai out2.sai out.read1b.fq out.read2b.fq | samtools view -hbS - > out.bam

    Comment


    • #3
      wow :-O
      Thanks for taking the time to read it and for your very interesting comments. I will modify the post accordingly.

      I believe that wgsim is limited compared to maq simulator and I plan to add a section along wgsim to simulate reads using MAQ. What do you thing about that?
      @pascal_biotech

      Comment


      • #4
        Personally, It's a lot more fun to learn with real data, like off of SRA, but it might be hard to find a data set that's suitable for practice.

        Comment


        • #5
          I confirm your suspicion: it is not that easy to find a data sets suitable for practicing. Simulated data sets is a good complementary source of data.

          Of course you have data sets from 1000g project and co. but I also like very much the flexibility of simulated data sets (eg. indels are well known, you can have as many SVs as you want etc.).
          @pascal_biotech

          Comment


          • #6
            Originally posted by swbarnes2 View Post
            Some of those commands are not as useful for real life problems, so maybe you want to add alternate commands, for once you really start working.

            For starters, you may want to note that the default bwa index won't work on a genome > 2 Gb. You need to add "-a bwtsw" to the command. So I think chr 20 will index alright, but the whole genome will not. And in real life, most people have access to computers with a few processors, so they'd add a "-t 2" or whatever to the bwa aln command.

            In real life, .sam files are just too big. If you pipe the command that makes the .sam file straight into samtools view, you won't make a .sam file, just the much smaller .bam file. You can use samtools view to convert part of the .bam back into a .sam if you want to look at the raw alignments. And I'd add -h to the samtools view command, to make sure the headers are carried on. You need them for some commands to work.

            So the bwa/samtools command woud be:
            One comment I'll add is that if you intend to sort the bam file with samtools sort after generating it, you can add the -u option to form an uncompressed bam file. After sorting with samtools it will be compressed. This will cut down on the time of generating the bam file.

            Comment


            • #7
              Thanks Heisman for your comment.

              If I have understood correctly you propose to generate the .bam file in an uncompressed format:

              bwa sampe -f out.sam -r '@RG\tID:foo\tSM:bar' human_g1k_v37_chr20.fasta out1.sai out2.sai out.read1b.fq out.read2b.fq | samtools view -hbS -u - > out.bam

              because the command

              samtools sort out.bam out_sorted

              will compress it anyway? So it will save time on computing the sort?
              @pascal_biotech

              Comment


              • #8
                Yes, correct. Very simple to test too if you have a stopwatch.

                Comment


                • #9
                  Thank you Sir! I added your comment into the post.
                  @pascal_biotech

                  Comment


                  • #10
                    Can dindel be run on Exome data?

                    Im trying to use dindel on exome human data but am getting about 100 indels/exome. I would like to know if there is something wrong with my pipeline (setting parameter values incorrectly; using re-calibrated, de-duped bams as opposed to using the raw bams, etc..) or if dindel cannot be used on exome data at all.

                    Any information is helpful.

                    Comment


                    • #11
                      Is there a reason you leave out the realignment step (RealignerTargetCreator and IndelRealigner) when using UnifiedGenotyper? The step is recommended in Broad's "best practices" (http://www.broadinstitute.org/gsa/wi...th_the_GATK_v3), and Dindel does the equivalent using the result of getCIGARindels.

                      Comment


                      • #12
                        I have done the realignment step too. The SNPs using GATK pipeline look ok but indels do not. I have the intermediate files and am wondering what indel caller to use?

                        Comment


                        • #13
                          @Alex Renwick: I am not sure whether it helps to do realignment around known indels here (I guess you refer to the steps described in http://www.broadinstitute.org/gsa/wi..._around_indels) as this pipeline starts from simulating reads and simulating indels. My understanding is that indels are scattered randomly so there are probably not "known" sites and not referenced in dbSNP or 1000genomes databases.

                          Please let me know if I misunderstood it and if local realignment is necessary.
                          @pascal_biotech

                          Comment


                          • #14
                            Originally posted by pmaugeri View Post
                            @Alex Renwick: I am not sure whether it helps to do realignment around known indels here ... as this pipeline starts from simulating reads and simulating indels...
                            The Broad documentation does seem to focus on the issue of handling known sites, but you can use the realigner to deal with novel indels.

                            RealignerTargetCreator scans the bam file for likely indel regions (probably looking at the CIGAR string) and then IndelRealigner performs a local alignment in those areas. The local alignment takes all reads into account at once, which requires more computation and finds a better result than the usual practice of instead of handling each read independently.

                            You can use wgsim_eval.pl to compare the correctness of the bam file before and after realignment, and you'll see read placement improve.

                            Comment


                            • #15
                              @Alex,

                              I just ended another round of indels calling against several sets of simulated data. In brief I simulated sets of reads for chromosome 20 at four different coverages (5x, 11x, 22x, 44x).

                              Then I run GATK UnifiedGenotyper with or without local realignment. I found no difference between the number of indels calls with or without realignment?!

                              So is it worth doing local realignment steps prior to call for indels ON simulated data?

                              Or maybe I just did something wrong in the realignement steps?!

                              Here are my commands:

                              java -jar GenomeAnalysisTK.jar -R human_g1k_v37_chr20.fasta -T RealignerTargetCreator -o output.intervals --known 1000G_biallelic.indels.b37.vcf

                              java -jar GenomeAnalysisTK.jar -I out_sorted.bam -R human_g1k_v37_chr20.fasta -T IndelRealigner -targetIntervals output.intervals -o realigned_out_sorted.bam -known 1000G_biallelic.indels.b37.vcf --consensusDeterminationModel KNOWNS_ONLY -LOD 0.4

                              java -jar GenomeAnalysisTK.jar -R human_g1k_v37_chr20.fasta -T UnifiedGenotyper -I realigned_out_sorted.bam -o gatk_indels.vcf -glm INDEL

                              Regards.
                              @pascal_biotech

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Techniques and Challenges in Conservation Genomics
                                by seqadmin



                                The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                                Avian Conservation
                                Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                                03-08-2024, 10:41 AM
                              • seqadmin
                                The Impact of AI in Genomic Medicine
                                by seqadmin



                                Artificial intelligence (AI) has evolved from a futuristic vision to a mainstream technology, highlighted by the introduction of tools like OpenAI's ChatGPT and Google's Gemini. In recent years, AI has become increasingly integrated into the field of genomics. This integration has enabled new scientific discoveries while simultaneously raising important ethical questions1. Interviews with two researchers at the center of this intersection provide insightful perspectives into...
                                02-26-2024, 02:07 PM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, 03-14-2024, 06:13 AM
                              0 responses
                              34 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-08-2024, 08:03 AM
                              0 responses
                              72 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-07-2024, 08:13 AM
                              0 responses
                              82 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-06-2024, 09:51 AM
                              0 responses
                              68 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X