Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Variant calling for high-coverage Illumina data

    Hi,

    I've got a fairly unusual data-set: Illumina PE 50 bp data on a 25 kb region of human genomic DNA sequenced in ~1,500 samples to very high coverage, typically over 300X. (For the curious, it was generated using barcoding of 96 samples per lane.)

    I've mapped all of the reads with Maq and finished BAM conversion and pileup with SAMtools. Now my mission is to call genotypes for SNPs, indels and SVs as comprehensively as possible for all of the samples.

    I've started playing around with SAMtools varFilter for SNP and indel calling, but before I get too involved in building up a pipeline for this I was hoping to get some insight from the various experts here: given this data-set, what approaches would you use for SNP, indel and SV calling? Are there reliable pipelines and parameters established for these analyses that anyone can recommend?

    Cheers,


    Daniel.

  • #2
    Daniel,

    How did the variant calling work out for you? I, too, have next-gen illumina data, but mine covers the whole exome and so is not quite of such a high depth.

    I have been using MAQ's built-in variant caller. The problem that I'm finding is that using the easyrun parameters I get >180,000 snps in the filtered snp list (cns.filter.snp).

    This seems like waaaay to many to be biologically plausible -- I was expecting something closer to 20K -- but I don't know how to go about distinguishing the true calls from the false-positives.

    Did you encounter any similar issues?

    Comment


    • #3
      I would see


      For the parameters used for SNP calling for whole exome sequencing.

      Comment


      • #4
        I finally got to the bottom of why I had so many variants. I had neglected to run the sol2sanger command, which, as Heng Li warns, results in unreliable snp calls. (This is because all of the base-qualities appear higher than they actually are).

        After running this command and re-aligning the data, the number of variants (in cns.final.snp) dropped to about ~50,000.

        Comment


        • #5
          Reported read depth equals 255, but pileup shows otherwise

          I have come across another issue. It seems like in the cns.snp file, the maximum read depth which MAQ will report is 255. This is despite the fact that the pileup command indicates that there were many more reads covering this location.

          (The only explanation I can think of for this is that by limiting the number of reads to 255, it allows MAQ to use a smaller data type and thus reduce its memory footprint).

          Regardless of why it's done, it raises a number of questions about the reads that are not reported. In particular, we are developing our own genotype-caller and, for the purposes of comparison, we need to know exactly which reads MAQ is using when it makes a call. How can we get the id's of the reads that are used?

          Another pertinent question is: If only 255 reads are used to make the call, how are those reads chosen? Are the highest quality ones used?

          Comment


          • #6
            I think a read ceiling is set to guard against background noise.

            I am not sure about MAQ, but if you use SAMTools pileup command with the consensus calling, which uses the same model as MAQ, it reports all the reads, which can then be filtered using samtools.pl varFilter function. The default max reads is set to 100 though.

            Someone correct me if I'm wrong as I'm new to all this.

            Comment


            • #7
              Correct. MAQ is tested for depth of coverage 20-40x I believe, and more depth adds noise leading to more SNP calls.

              Perhaps you could randomly select 40-60x average coverage for your samples, and then maq align and call SNPs.
              --
              bioinfosm

              Comment


              • #8
                High coverage usually means it is a repetitive region.

                Comment


                • #9
                  Correct. MAQ is tested for depth of coverage 20-40x I believe, and more depth adds noise leading to more SNP calls.
                  I am having a tough time understanding what "noise" you are referring to. Are you talking about "noise" from reads showing other nucleotides, because my understanding is that, when doing concensus calling, MAQ considers only the reads that correspond to the two most frequently observed nucleotides. (This means reads showing additional (different) nucleotides are discarded).

                  Comment


                  • #10
                    By noise I mean the instrument error rate. My understanding is that using too high depth of coverage increases the errors at each position making it hard to call accurately.

                    We wanted to work on high depth sequencing data to call rare variants, but were unable to go below 2%. For regular homozygous or heterozygous calls, 20-40x depth of coverage seems to give best results.
                    --
                    bioinfosm

                    Comment


                    • #11
                      I've got a similar problem: most of the exome is highly covered by at least 200x.

                      I already convert Illumina scores to Sanger scores by using the latest version of BWA and adding -I, but I still have too many false positives, and worse, too many false negatives, since high covered areas are treated as background noise/contig collapsing due to repetitive regions and get a low score.

                      When I clean up my pileup file to a minimal coverage of 10x and a consensus-, snp- and RMS score of at least 15, I still have over 80,000 variants left.

                      Does anyone has any thoughts on how to tackle this problem?

                      Thanks a bunch!
                      Last edited by Papillon; 03-30-2011, 01:31 PM.

                      Comment


                      • #12
                        You mention BWA.. what is the variant caller you used? Maybe trying a different variant caller would give you a better perspective
                        --
                        bioinfosm

                        Comment


                        • #13
                          Thank you for responding! I'm using SAMTools to do my variant calling.

                          I am still a student, quite new to exome analysis and since I am self educating I sometimes make rookie mistakes. But I have changed tactics:

                          I now first filter out reads with a low mapping quality within the SAM-file and reads that are 'B' flagged for over 90% by Illumina within FASTQ-files and then remove duplicate reads with Picard. I also increased the maximum allowed coverage to reduce the number of false negatives.

                          I hope the extreme high coverage will reduce and I can set the maximum allowed coverage to a respected value again (like maybe 2x average coverage).
                          I do expect my false positive and false negative read is dropped. It is running now, so I will know soon.

                          Comment


                          • #14
                            Actually in whole exome data you should be getting 200K to 250K SNPs, whole genome data about 3 million SNPs.

                            Comment


                            • #15
                              Wouldn't that mean that there are relatively speaking more SNP's in the exome than in the genome?? The exome is much more conserved, so my estimate would be that 200k to 250k would be way to much.

                              Besides, most studies speak about 15,000 - 20,000, although I believe the actual number will be higher.

                              Please correct me if I make a miss assumption somewhere.

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Strategies for Sequencing Challenging Samples
                                by seqadmin


                                Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                03-22-2024, 06:39 AM
                              • seqadmin
                                Techniques and Challenges in Conservation Genomics
                                by seqadmin



                                The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                                Avian Conservation
                                Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                                03-08-2024, 10:41 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, Yesterday, 06:37 PM
                              0 responses
                              7 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, Yesterday, 06:07 PM
                              0 responses
                              7 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-22-2024, 10:03 AM
                              0 responses
                              49 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-21-2024, 07:32 AM
                              0 responses
                              66 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X