Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • A problem about Ion Torrent Whole Exome Sequencing

    First, I'd like to say Ion Torrent really sucks!
    We used to use Illumina data generated by a third party for our research. Last year, my boss decided to buy our own sequencing machine. Although everyone suggested him to use Illumina, due to cost consideration, we switched to Ion Torrent platform...That's where the nightmare began.

    We have a pipeline on our cluster for Illumina data which works fine. It reads the vcf files and then output a filtered annotated variants lists. The filter options are really simple, just based on MAF and inheritance type. Since we switched to Ion Torrent platform, we need a pipeline like this as well.

    Basically, the vcf files are the same. The pipeline should also work for Ion Torrent vcf files. GATK can call multiple samples together. So you can get genotype of every sample for a variant. Ion Torrent can only call variant for one sample. So if the sample doesn't have this variant, the vcf file won't have it. But you don't really know if the genotype at that position is not observed or same as the reference without looking into the bam files. I asked the Ion Torrent support people, they said if the variant is listed in the vcf, you can consider it same as reference.
    Fine. Let's go with this. Then, we lots of cases as the following one:
    In a proband, there is a variant like this:
    chr2 202006095 . AG A,AA 1/1
    If you look for this variant in the unaffected samples, they don't have it. OK, this looks promising, a deletion that proband has but none of the unaffected samples have it.
    Then I notice in the vcf files of all the unaffected samples, they have the following variant:
    chr2 202006096 . G A 0/1.

    Ok, now I believe the proband should have the same variant G->A as other samples at position chr2 202006096. But Ion Torrent identified it as an INDEL. We found lots of cases like this. I believe, in the vcf files of Illumina data, it will at least list this 2 variants, 1 as indel and 1 as a SNP. With genotype of other samples. You can easily recoginize it. But for Ion Torrent, if we don't look into it very carefully, we won't notice this.

    I believe everyone here are using Ion Torrent. How do you overcome this problem? Is there a way to overcome this problem without checking the variant manually?

  • #2
    I don't know the solution, but I can tell you Ion Torrent has a lot of deletion errors that are sequencing artifacts when sequencing through a long (>5bp) homopolymer region. If this is where your variants fall, you're going to have a bad time.

    Comment


    • #3
      Hi Jiayi Zhou,

      Using AmpliSeq Exome on Proton must not be that bad. It allowed one of researchers at your institution to find mutation causing hearing loss in Newfoundland population:


      Have you contacted your local Ion Torrent support regarding your issue?

      Comment


      • #4
        Hi zhoujiayi,

        I don't really like the practice of combining indels and SNPs (or, generally, different variants) on the same line in VCF, as it makes the format even more confusing and difficult to read. So, BBMap's CallVariants tool puts each variant on a unique line, which might be what you're looking for. You use it like this:

        callvariants.sh in=sample1.sam,sample2.sam ref=ref.fa out=vars.vcf multisample ploidy=2 prefilter

        I'd be interested in hearing how well it performs on Ion Torrent data; I've only been testing it on Illumina so far.

        Comment


        • #5
          Originally posted by mlariviere View Post
          Hi Jiayi Zhou,

          Using AmpliSeq Exome on Proton must not be that bad. It allowed one of researchers at your institution to find mutation causing hearing loss in Newfoundland population:


          Have you contacted your local Ion Torrent support regarding your issue?
          Yes. I am one of the author of this paper. The performance of detecting SNPs seems good for Ion Torrrent. But we are having troubles with INDELs. We are now working on improving the performance of detecting INDELs.

          Comment


          • #6
            Originally posted by Brian Bushnell View Post
            Hi zhoujiayi,

            I don't really like the practice of combining indels and SNPs (or, generally, different variants) on the same line in VCF, as it makes the format even more confusing and difficult to read. So, BBMap's CallVariants tool puts each variant on a unique line, which might be what you're looking for. You use it like this:

            callvariants.sh in=sample1.sam,sample2.sam ref=ref.fa out=vars.vcf multisample ploidy=2 prefilter

            I'd be interested in hearing how well it performs on Ion Torrent data; I've only been testing it on Illumina so far.
            Thank you for your suggestion. But this software cannot help for my case. Because Ion Torrent detected the variants as a INDEL chr2 202006095 . AG A,AA 1/1, even I use the software to split it into 2 line. I got chr2 202006095 . AG A and chr2 202006095 . AG AA. Then, if I use software to check if the parents have these 2 variants, the parents won't have them. But in fact, chr2 202006095 . AG AA should be chr2 202006096 G A. The parents have this variant. So it is not the variant we are looking for. For sure, we can check this kind of variants manually. But we have lots of variants like this. It is impossible to filter them manually.

            Comment


            • #7
              Actually, in this case, BBMap would produce two lines that look like this:

              Code:
              chr2 202006095 . AG A
              chr2 202006096 . G A

              Comment


              • #8
                Originally posted by Brian Bushnell View Post
                Actually, in this case, BBMap would produce two lines that look like this:

                Code:
                chr2 202006095 . AG A
                chr2 202006096 . G A
                Thank you so much. I'll try it.

                Comment


                • #9
                  Premise: I don't hold a grudge towards semiconductor sequencing. It's pretty good for some applications. It's great for germline data on gene panels. It's ok for investigating by exome analysis a case where there are a few candidate genes and you're not going to order a tailored panel for that specific case.
                  But it's a nightmare when you want to perform a broader analysis.

                  I agree with you that Ion Torrent default algorithms are very far from perfect. I've been in your situation too, and it's so unnerving.

                  Originally posted by zhoujiayi View Post
                  Ion Torrent can only call variant for one sample. So if the sample doesn't have this variant, the vcf file won't have it. But you don't really know if the genotype at that position is not observed or same as the reference without looking into the bam files. I asked the Ion Torrent support people, they said if the variant is listed in the vcf, you can consider it same as reference.
                  Here you probably mean "if the variant is not listed in the vcf, then it's equal to the reference".
                  If this is what the support people told you, I wholeheartedly disagree with them. It's just plainly not true.
                  Just take a look at the VariantCaller parameters and you'll see a list of things that will result in a variant not being called, which are there in the first place to avoid the sequencer calling too many of the errors it makes.
                  For example with default parameters, a region with homozygous SNV with < 5 total reads is not called; for indels, you need 10+ reads, and at least 5 of them on either strand. This might work ok in optimal conditions, i.e. high coverage and high quality, end-to-end reads; but in ordinary conditions, where some amplicons have consistently low coverage, you do get some trouble from it.
                  One example is trying to compare a trio: you'll see a lot of de novo mutations in the child and when you look it up on the parents, you'll see that the child has 4 of 6 reads with the variant (was called), and one parent shows it in 2 of 4 total reads (not called); and that's where the "de novo" variant came from. If you're only interested on a small set of genes, this is hardly a problem because you can verify manually. But broaden up your analysis, and you have to sort 200 de novo variants. Yeah sure. Either they're not de novo, or they're so much "de novo" that they started existing today in the sequencer.

                  Another example of bad behavior that results in a variant not being listed consistently, which happened in some exomes I've ran, is the fact that the Torrent Suite sometimes aligns a deletion on the 5' side, and sometimes on the 3' side of a repeated element. This results in (a) two possible variants describing the same deletion on a series of patients, and (b) the frequency of the deletion being split between 2 positions, possibly resulting in the variant not being called at all.

                  What I did once, was to take all "de novo" variants in a trio, make them into a hotspot .bed file to be fed to VariantCaller, and then run the VariantCaller plugin again. Every entry of a hotspot must be listed in the final vcf; if it was not called, then the reason must be stated, so you'd spot the risky calls.
                  A warning though: the default parameters for hotspot regions in Variant Caller are more stringent, so if you keep the default parameters, you will end up with some variants that will not be called anymore even in the sample that had them in the first place.
                  Another thing I did in that case was to write a simple Python script to go check every "de novo" variant (from the VCF) on the parents' BED files, and see if they actually, honestly did not have it.
                  Also if you have trios, IonReporter has a specific analysis option for that, and it might be helpful. If it wasn't so, so slow.

                  Anyways. I feel your pain too. If BBmap helps, let us know.

                  EDIT: why wait. I'll try BBMap on that data of mine.
                  Last edited by r.rosati; 02-07-2017, 08:46 AM.

                  Comment


                  • #10
                    Originally posted by r.rosati View Post
                    EDIT: why wait. I'll try BBMap on that data of mine.
                    Thanks, I'd like to see the results! Probably I need to do some calibration to get rid of the most obvious IonTorrent-specific error modes.

                    Comment


                    • #11
                      Originally posted by r.rosati View Post
                      Premise: I don't hold a grudge towards semiconductor sequencing. It's pretty good for some applications. It's great for germline data on gene panels. It's ok for investigating by exome analysis a case where there are a few candidate genes and you're not going to order a tailored panel for that specific case.
                      But it's a nightmare when you want to perform a broader analysis.

                      I agree with you that Ion Torrent default algorithms are very far from perfect. I've been in your situation too, and it's so unnerving.



                      Here you probably mean "if the variant is not listed in the vcf, then it's equal to the reference".
                      If this is what the support people told you, I wholeheartedly disagree with them. It's just plainly not true.
                      Just take a look at the VariantCaller parameters and you'll see a list of things that will result in a variant not being called, which are there in the first place to avoid the sequencer calling too many of the errors it makes.
                      For example with default parameters, a region with homozygous SNV with < 5 total reads is not called; for indels, you need 10+ reads, and at least 5 of them on either strand. This might work ok in optimal conditions, i.e. high coverage and high quality, end-to-end reads; but in ordinary conditions, where some amplicons have consistently low coverage, you do get some trouble from it.
                      One example is trying to compare a trio: you'll see a lot of de novo mutations in the child and when you look it up on the parents, you'll see that the child has 4 of 6 reads with the variant (was called), and one parent shows it in 2 of 4 total reads (not called); and that's where the "de novo" variant came from. If you're only interested on a small set of genes, this is hardly a problem because you can verify manually. But broaden up your analysis, and you have to sort 200 de novo variants. Yeah sure. Either they're not de novo, or they're so much "de novo" that they started existing today in the sequencer.

                      Another example of bad behavior that results in a variant not being listed consistently, which happened in some exomes I've ran, is the fact that the Torrent Suite sometimes aligns a deletion on the 5' side, and sometimes on the 3' side of a repeated element. This results in (a) two possible variants describing the same deletion on a series of patients, and (b) the frequency of the deletion being split between 2 positions, possibly resulting in the variant not being called at all.

                      What I did once, was to take all "de novo" variants in a trio, make them into a hotspot .bed file to be fed to VariantCaller, and then run the VariantCaller plugin again. Every entry of a hotspot must be listed in the final vcf; if it was not called, then the reason must be stated, so you'd spot the risky calls.
                      A warning though: the default parameters for hotspot regions in Variant Caller are more stringent, so if you keep the default parameters, you will end up with some variants that will not be called anymore even in the sample that had them in the first place.
                      Another thing I did in that case was to write a simple Python script to go check every "de novo" variant (from the VCF) on the parents' BED files, and see if they actually, honestly did not have it.
                      Also if you have trios, IonReporter has a specific analysis option for that, and it might be helpful. If it wasn't so, so slow.

                      Anyways. I feel your pain too. If BBmap helps, let us know.

                      EDIT: why wait. I'll try BBMap on that data of mine.
                      I installed bbmap on my cluster. Follow the instructions here: http://jgi.doe.gov/data-and-tools/bb...llation-guide/ . However, if I run $ (installation directory)/stats.sh in=(installation directory)/resources/phix174_ill.ref.fa.gz, I got error as command not found. But I can run bash $ (installation directory)/stats.sh in=(installation directory)/resources/phix174_ill.ref.fa.gz to get the same result as mentioned. So I think I have installed the bbmap correctly.

                      Then I tried to run your command, this is the command I ran:
                      bash bbmap/callvariants.sh in=/home/aiden/data/IonXpress_016.sam,/home/aiden/data/IonXpress_015.sam,/home/aiden/data/IonXpress_016.sam ref=/IonTorrent/HG19/hg19.fasta out=vars.vcf multisample ploidy=2 prefilter 1 > /home/aiden/result.txt 2> /home/aiden/errorbb.txt &
                      But I cannot get the command run. The error message in the errrorbb.txt is as following:
                      java -ea -Xmx46006m -Xms46006m -cp /home/aiden/bbmap/current/ var2.CallVariants in=/home/aiden/data/IonXpress_016.sam,/home/aiden/data/IonXpress_015.sam,/home/aiden/data/IonXpress_016.sam ref=/IonTorrent/HG19/hg19.fasta out=vars.vcf multisample ploidy=2 prefilter 1
                      Executing var2.CallVariants2 [in=/home/aiden/data/IonXpress_016.sam,/home/aiden/data/IonXpress_015.sam,/home/aiden/data/IonXpress_016.sam, ref=/IonTorrent/HG19/hg19.fasta, out=vars.vcf, multisample, ploidy=2, prefilter, 1]

                      Unknown parameter 1
                      Exception in thread "main" java.lang.AssertionError: Unknown parameter 1
                      at var2.CallVariants2.<init>(CallVariants2.java:199)
                      at var2.CallVariants2.main(CallVariants2.java:49)
                      at var2.CallVariants.main(CallVariants.java:48)

                      Do you know what I have done wrong?
                      Thank you.

                      Comment


                      • #12
                        Originally posted by zhoujiayi View Post
                        bash bbmap/callvariants.sh in=/home/aiden/data/IonXpress_016.sam,/home/aiden/data/IonXpress_015.sam,/home/aiden/data/IonXpress_016.sam ref=/IonTorrent/HG19/hg19.fasta out=vars.vcf multisample ploidy=2 prefilter 1 > /home/aiden/result.txt 2> /home/aiden/errorbb.txt &
                        I reckon the "1" that I underlined above might be the "parameter 1" that is giving the error. If you look at Brian Bushnell's message, there's no "1" in that position.

                        Comment


                        • #13
                          Originally posted by r.rosati View Post
                          I reckon the "1" that I underlined above might be the "parameter 1" that is giving the error. If you look at Brian Bushnell's message, there's no "1" in that position.
                          If you want to save the stdout and stderr of a command, 1 is for output the stdout to file result.txt and 2 is for output the stderr to errorbbmap.txt.This is just Linux command.

                          Comment


                          • #14
                            Originally posted by zhoujiayi View Post
                            If you want to save the stdout and stderr of a command, 1 is for output the stdout to file result.txt and 2 is for output the stderr to errorbbmap.txt.This is just Linux command.
                            Oh, my bad. Then the space must be the culprit, i.e. "1>" and not "1 >"

                            Comment


                            • #15
                              I have tried bbmap. I got 123057 snps and 51208 indels called by bbmap. Just based on the figures, it seems too many indels has been called. Ion Torrent called 46821 SNPs, 372 MNPs and 2187 indels. So I guess Ion Torrent did have some tricks in their algorithm to optimize the result of their sequencing data.
                              I also tried to use GATK or samtools or freebayes to call the Ion Torrent data before, all of them have the same problem as bbmap. Too many false positives have been called.

                              I tried to use GATK to find the concordance between these 2 results. It seems due to the format of bbmap's result(<ID=FAIL,Description="Fail">), GATK cannot get through.

                              I think we will keep using the TVC to call the variants, otherwise we need to start from the beginning to find a way to filter the results of other software.

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Advancing Precision Medicine for Rare Diseases in Children
                                by seqadmin




                                Many organizations study rare diseases, but few have a mission as impactful as Rady Children’s Institute for Genomic Medicine (RCIGM). “We are all about changing outcomes for children,” explained Dr. Stephen Kingsmore, President and CEO of the group. The institute’s initial goal was to provide rapid diagnoses for critically ill children and shorten their diagnostic odyssey, a term used to describe the long and arduous process it takes patients to obtain an accurate...
                                12-16-2024, 07:57 AM
                              • seqadmin
                                Recent Advances in Sequencing Technologies
                                by seqadmin



                                Innovations in next-generation sequencing technologies and techniques are driving more precise and comprehensive exploration of complex biological systems. Current advancements include improved accessibility for long-read sequencing and significant progress in single-cell and 3D genomics. This article explores some of the most impactful developments in the field over the past year.

                                Long-Read Sequencing
                                Long-read sequencing has seen remarkable advancements,...
                                12-02-2024, 01:49 PM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, 12-17-2024, 10:28 AM
                              0 responses
                              26 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 12-13-2024, 08:24 AM
                              0 responses
                              43 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 12-12-2024, 07:41 AM
                              0 responses
                              29 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 12-11-2024, 07:45 AM
                              0 responses
                              42 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X