Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • pmaugeri
    Member
    • Sep 2011
    • 11

    Pipeline to simulate short reads and call indels

    Hi

    http://biotechies.blogspot.com/2011/...ct-indels.html is a kind of howto for doing the following pipeline:

    1. simulate short reads (wgsim)

    2. align short reads
    2.1. using BWA
    2.2. using Bowtie2

    3. call indels
    3.1. using GATK UnifiedGenotyper
    3.2. using Dindel

    I initially started to write it for my own usage to remember command calls but it may help others since I've spent a lot of time discovering software parameters and pipelines.

    Please help me to correct and complete the post as much as possible. Any suggestion is very welcome too.

    Regards.
    @pascal_biotech
  • swbarnes2
    Senior Member
    • May 2008
    • 910

    #2
    Some of those commands are not as useful for real life problems, so maybe you want to add alternate commands, for once you really start working.

    For starters, you may want to note that the default bwa index won't work on a genome > 2 Gb. You need to add "-a bwtsw" to the command. So I think chr 20 will index alright, but the whole genome will not. And in real life, most people have access to computers with a few processors, so they'd add a "-t 2" or whatever to the bwa aln command.

    In real life, .sam files are just too big. If you pipe the command that makes the .sam file straight into samtools view, you won't make a .sam file, just the much smaller .bam file. You can use samtools view to convert part of the .bam back into a .sam if you want to look at the raw alignments. And I'd add -h to the samtools view command, to make sure the headers are carried on. You need them for some commands to work.

    So the bwa/samtools command woud be:

    bwa sampe -r '@RG\tID:foo\tSM:bar' human_g1k_v37_chr20.fasta out1.sai out2.sai out.read1b.fq out.read2b.fq | samtools view -hbS - > out.bam

    Comment

    • pmaugeri
      Member
      • Sep 2011
      • 11

      #3
      wow :-O
      Thanks for taking the time to read it and for your very interesting comments. I will modify the post accordingly.

      I believe that wgsim is limited compared to maq simulator and I plan to add a section along wgsim to simulate reads using MAQ. What do you thing about that?
      @pascal_biotech

      Comment

      • swbarnes2
        Senior Member
        • May 2008
        • 910

        #4
        Personally, It's a lot more fun to learn with real data, like off of SRA, but it might be hard to find a data set that's suitable for practice.

        Comment

        • pmaugeri
          Member
          • Sep 2011
          • 11

          #5
          I confirm your suspicion: it is not that easy to find a data sets suitable for practicing. Simulated data sets is a good complementary source of data.

          Of course you have data sets from 1000g project and co. but I also like very much the flexibility of simulated data sets (eg. indels are well known, you can have as many SVs as you want etc.).
          @pascal_biotech

          Comment

          • Heisman
            Senior Member
            • Dec 2010
            • 534

            #6
            Originally posted by swbarnes2 View Post
            Some of those commands are not as useful for real life problems, so maybe you want to add alternate commands, for once you really start working.

            For starters, you may want to note that the default bwa index won't work on a genome > 2 Gb. You need to add "-a bwtsw" to the command. So I think chr 20 will index alright, but the whole genome will not. And in real life, most people have access to computers with a few processors, so they'd add a "-t 2" or whatever to the bwa aln command.

            In real life, .sam files are just too big. If you pipe the command that makes the .sam file straight into samtools view, you won't make a .sam file, just the much smaller .bam file. You can use samtools view to convert part of the .bam back into a .sam if you want to look at the raw alignments. And I'd add -h to the samtools view command, to make sure the headers are carried on. You need them for some commands to work.

            So the bwa/samtools command woud be:
            One comment I'll add is that if you intend to sort the bam file with samtools sort after generating it, you can add the -u option to form an uncompressed bam file. After sorting with samtools it will be compressed. This will cut down on the time of generating the bam file.

            Comment

            • pmaugeri
              Member
              • Sep 2011
              • 11

              #7
              Thanks Heisman for your comment.

              If I have understood correctly you propose to generate the .bam file in an uncompressed format:

              bwa sampe -f out.sam -r '@RG\tID:foo\tSM:bar' human_g1k_v37_chr20.fasta out1.sai out2.sai out.read1b.fq out.read2b.fq | samtools view -hbS -u - > out.bam

              because the command

              samtools sort out.bam out_sorted

              will compress it anyway? So it will save time on computing the sort?
              @pascal_biotech

              Comment

              • Heisman
                Senior Member
                • Dec 2010
                • 534

                #8
                Yes, correct. Very simple to test too if you have a stopwatch.

                Comment

                • pmaugeri
                  Member
                  • Sep 2011
                  • 11

                  #9
                  Thank you Sir! I added your comment into the post.
                  @pascal_biotech

                  Comment

                  • vyellapa
                    Member
                    • Oct 2011
                    • 59

                    #10
                    Can dindel be run on Exome data?

                    Im trying to use dindel on exome human data but am getting about 100 indels/exome. I would like to know if there is something wrong with my pipeline (setting parameter values incorrectly; using re-calibrated, de-duped bams as opposed to using the raw bams, etc..) or if dindel cannot be used on exome data at all.

                    Any information is helpful.

                    Comment

                    • Alex Renwick
                      Member
                      • Jul 2011
                      • 44

                      #11
                      Is there a reason you leave out the realignment step (RealignerTargetCreator and IndelRealigner) when using UnifiedGenotyper? The step is recommended in Broad's "best practices" (http://www.broadinstitute.org/gsa/wi...th_the_GATK_v3), and Dindel does the equivalent using the result of getCIGARindels.

                      Comment

                      • vyellapa
                        Member
                        • Oct 2011
                        • 59

                        #12
                        I have done the realignment step too. The SNPs using GATK pipeline look ok but indels do not. I have the intermediate files and am wondering what indel caller to use?

                        Comment

                        • pmaugeri
                          Member
                          • Sep 2011
                          • 11

                          #13
                          @Alex Renwick: I am not sure whether it helps to do realignment around known indels here (I guess you refer to the steps described in http://www.broadinstitute.org/gsa/wi..._around_indels) as this pipeline starts from simulating reads and simulating indels. My understanding is that indels are scattered randomly so there are probably not "known" sites and not referenced in dbSNP or 1000genomes databases.

                          Please let me know if I misunderstood it and if local realignment is necessary.
                          @pascal_biotech

                          Comment

                          • Alex Renwick
                            Member
                            • Jul 2011
                            • 44

                            #14
                            Originally posted by pmaugeri View Post
                            @Alex Renwick: I am not sure whether it helps to do realignment around known indels here ... as this pipeline starts from simulating reads and simulating indels...
                            The Broad documentation does seem to focus on the issue of handling known sites, but you can use the realigner to deal with novel indels.

                            RealignerTargetCreator scans the bam file for likely indel regions (probably looking at the CIGAR string) and then IndelRealigner performs a local alignment in those areas. The local alignment takes all reads into account at once, which requires more computation and finds a better result than the usual practice of instead of handling each read independently.

                            You can use wgsim_eval.pl to compare the correctness of the bam file before and after realignment, and you'll see read placement improve.

                            Comment

                            • pmaugeri
                              Member
                              • Sep 2011
                              • 11

                              #15
                              @Alex,

                              I just ended another round of indels calling against several sets of simulated data. In brief I simulated sets of reads for chromosome 20 at four different coverages (5x, 11x, 22x, 44x).

                              Then I run GATK UnifiedGenotyper with or without local realignment. I found no difference between the number of indels calls with or without realignment?!

                              So is it worth doing local realignment steps prior to call for indels ON simulated data?

                              Or maybe I just did something wrong in the realignement steps?!

                              Here are my commands:

                              java -jar GenomeAnalysisTK.jar -R human_g1k_v37_chr20.fasta -T RealignerTargetCreator -o output.intervals --known 1000G_biallelic.indels.b37.vcf

                              java -jar GenomeAnalysisTK.jar -I out_sorted.bam -R human_g1k_v37_chr20.fasta -T IndelRealigner -targetIntervals output.intervals -o realigned_out_sorted.bam -known 1000G_biallelic.indels.b37.vcf --consensusDeterminationModel KNOWNS_ONLY -LOD 0.4

                              java -jar GenomeAnalysisTK.jar -R human_g1k_v37_chr20.fasta -T UnifiedGenotyper -I realigned_out_sorted.bam -o gatk_indels.vcf -glm INDEL

                              Regards.
                              @pascal_biotech

                              Comment

                              Latest Articles

                              Collapse

                              • SEQadmin2
                                From Collection to Sequencing: Why Sample Preparation and Preservation Define Sequencing Data
                                by SEQadmin2


                                Data variability is still an issue in sequencing technologies despite the advances in reproducibility and accuracy of these platforms. But the problem does not originate in the sequencing itself, but in the previous steps, before the sample reaches the sequencer.


                                The first step is collection, followed by preservation and sample preparation for analysis. Most scientists overlook those steps, but not being careful might just be skewing the experiment’s results.
                                ...
                                06-02-2026, 10:05 AM
                              • SEQadmin2
                                Single-Cell Sequencing at an Inflection Point: Early Impacts of New Platforms and Emerging Trends
                                by SEQadmin2


                                With the launch of new single-cell sequencing platforms in 2026, the field stands at an exciting inflection point. This article surveys the most impactful advances in the field and discusses how they’re reshaping research in cancer, immunology, and beyond.


                                Introduction

                                Single-cell sequencing technologies have undergone remarkable advances over the past decade, transitioning from low-throughput experimental approaches to highly scalable platforms capable of...
                                05-22-2026, 06:42 AM
                              • SEQadmin2
                                Environmental Genomics in the Age of NGS: From Microbes to Conservation Strategies
                                by SEQadmin2

                                Studying ecosystems means dealing with complex, multi-species communities that are hard to observe at scale. This complexity, however, hides many important questions to be answered, from how biogeochemical cycles work and how climate change can affect species distribution to how conservation strategies can work best.


                                Genomics, particularly since the expansion of NGS, has transformed ecosystem ecology. By sequencing environmental DNA, we can now assess biodiversity without direct...
                                05-06-2026, 09:04 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by SEQadmin2, 06-02-2026, 12:03 PM
                              0 responses
                              19 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, 06-02-2026, 11:40 AM
                              0 responses
                              14 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, 05-28-2026, 11:40 AM
                              0 responses
                              29 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, 05-26-2026, 10:12 AM
                              0 responses
                              31 views
                              0 reactions
                              Last Post SEQadmin2  
                              Working...