Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • dgmacarthur
    Junior Member
    • Jun 2008
    • 3

    Variant calling for high-coverage Illumina data

    Hi,

    I've got a fairly unusual data-set: Illumina PE 50 bp data on a 25 kb region of human genomic DNA sequenced in ~1,500 samples to very high coverage, typically over 300X. (For the curious, it was generated using barcoding of 96 samples per lane.)

    I've mapped all of the reads with Maq and finished BAM conversion and pileup with SAMtools. Now my mission is to call genotypes for SNPs, indels and SVs as comprehensively as possible for all of the samples.

    I've started playing around with SAMtools varFilter for SNP and indel calling, but before I get too involved in building up a pipeline for this I was hoping to get some insight from the various experts here: given this data-set, what approaches would you use for SNP, indel and SV calling? Are there reliable pipelines and parameters established for these analyses that anyone can recommend?

    Cheers,


    Daniel.
  • erichpowell
    Member
    • Nov 2009
    • 10

    #2
    Daniel,

    How did the variant calling work out for you? I, too, have next-gen illumina data, but mine covers the whole exome and so is not quite of such a high depth.

    I have been using MAQ's built-in variant caller. The problem that I'm finding is that using the easyrun parameters I get >180,000 snps in the filtered snp list (cns.filter.snp).

    This seems like waaaay to many to be biologically plausible -- I was expecting something closer to 20K -- but I don't know how to go about distinguishing the true calls from the false-positives.

    Did you encounter any similar issues?

    Comment

    • NextGenSeq
      Senior Member
      • Apr 2009
      • 482

      #3
      I would see


      For the parameters used for SNP calling for whole exome sequencing.

      Comment

      • erichpowell
        Member
        • Nov 2009
        • 10

        #4
        I finally got to the bottom of why I had so many variants. I had neglected to run the sol2sanger command, which, as Heng Li warns, results in unreliable snp calls. (This is because all of the base-qualities appear higher than they actually are).

        After running this command and re-aligning the data, the number of variants (in cns.final.snp) dropped to about ~50,000.

        Comment

        • erichpowell
          Member
          • Nov 2009
          • 10

          #5
          Reported read depth equals 255, but pileup shows otherwise

          I have come across another issue. It seems like in the cns.snp file, the maximum read depth which MAQ will report is 255. This is despite the fact that the pileup command indicates that there were many more reads covering this location.

          (The only explanation I can think of for this is that by limiting the number of reads to 255, it allows MAQ to use a smaller data type and thus reduce its memory footprint).

          Regardless of why it's done, it raises a number of questions about the reads that are not reported. In particular, we are developing our own genotype-caller and, for the purposes of comparison, we need to know exactly which reads MAQ is using when it makes a call. How can we get the id's of the reads that are used?

          Another pertinent question is: If only 255 reads are used to make the call, how are those reads chosen? Are the highest quality ones used?

          Comment

          • jeffhsu3
            Junior Member
            • Jan 2010
            • 5

            #6
            I think a read ceiling is set to guard against background noise.

            I am not sure about MAQ, but if you use SAMTools pileup command with the consensus calling, which uses the same model as MAQ, it reports all the reads, which can then be filtered using samtools.pl varFilter function. The default max reads is set to 100 though.

            Someone correct me if I'm wrong as I'm new to all this.

            Comment

            • bioinfosm
              Senior Member
              • Jan 2008
              • 483

              #7
              Correct. MAQ is tested for depth of coverage 20-40x I believe, and more depth adds noise leading to more SNP calls.

              Perhaps you could randomly select 40-60x average coverage for your samples, and then maq align and call SNPs.
              --
              bioinfosm

              Comment

              • NextGenSeq
                Senior Member
                • Apr 2009
                • 482

                #8
                High coverage usually means it is a repetitive region.

                Comment

                • erichpowell
                  Member
                  • Nov 2009
                  • 10

                  #9
                  Correct. MAQ is tested for depth of coverage 20-40x I believe, and more depth adds noise leading to more SNP calls.
                  I am having a tough time understanding what "noise" you are referring to. Are you talking about "noise" from reads showing other nucleotides, because my understanding is that, when doing concensus calling, MAQ considers only the reads that correspond to the two most frequently observed nucleotides. (This means reads showing additional (different) nucleotides are discarded).

                  Comment

                  • bioinfosm
                    Senior Member
                    • Jan 2008
                    • 483

                    #10
                    By noise I mean the instrument error rate. My understanding is that using too high depth of coverage increases the errors at each position making it hard to call accurately.

                    We wanted to work on high depth sequencing data to call rare variants, but were unable to go below 2%. For regular homozygous or heterozygous calls, 20-40x depth of coverage seems to give best results.
                    --
                    bioinfosm

                    Comment

                    • Papillon
                      Member
                      • Mar 2011
                      • 13

                      #11
                      I've got a similar problem: most of the exome is highly covered by at least 200x.

                      I already convert Illumina scores to Sanger scores by using the latest version of BWA and adding -I, but I still have too many false positives, and worse, too many false negatives, since high covered areas are treated as background noise/contig collapsing due to repetitive regions and get a low score.

                      When I clean up my pileup file to a minimal coverage of 10x and a consensus-, snp- and RMS score of at least 15, I still have over 80,000 variants left.

                      Does anyone has any thoughts on how to tackle this problem?

                      Thanks a bunch!
                      Last edited by Papillon; 03-30-2011, 01:31 PM.

                      Comment

                      • bioinfosm
                        Senior Member
                        • Jan 2008
                        • 483

                        #12
                        You mention BWA.. what is the variant caller you used? Maybe trying a different variant caller would give you a better perspective
                        --
                        bioinfosm

                        Comment

                        • Papillon
                          Member
                          • Mar 2011
                          • 13

                          #13
                          Thank you for responding! I'm using SAMTools to do my variant calling.

                          I am still a student, quite new to exome analysis and since I am self educating I sometimes make rookie mistakes. But I have changed tactics:

                          I now first filter out reads with a low mapping quality within the SAM-file and reads that are 'B' flagged for over 90% by Illumina within FASTQ-files and then remove duplicate reads with Picard. I also increased the maximum allowed coverage to reduce the number of false negatives.

                          I hope the extreme high coverage will reduce and I can set the maximum allowed coverage to a respected value again (like maybe 2x average coverage).
                          I do expect my false positive and false negative read is dropped. It is running now, so I will know soon.

                          Comment

                          • NextGenSeq
                            Senior Member
                            • Apr 2009
                            • 482

                            #14
                            Actually in whole exome data you should be getting 200K to 250K SNPs, whole genome data about 3 million SNPs.

                            Comment

                            • Papillon
                              Member
                              • Mar 2011
                              • 13

                              #15
                              Wouldn't that mean that there are relatively speaking more SNP's in the exome than in the genome?? The exome is much more conserved, so my estimate would be that 200k to 250k would be way to much.

                              Besides, most studies speak about 15,000 - 20,000, although I believe the actual number will be higher.

                              Please correct me if I make a miss assumption somewhere.

                              Comment

                              Latest Articles

                              Collapse

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by SEQadmin2, 06-05-2026, 10:09 AM
                              0 responses
                              14 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, 06-04-2026, 08:59 AM
                              0 responses
                              26 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, 06-02-2026, 12:03 PM
                              0 responses
                              33 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, 06-02-2026, 11:40 AM
                              0 responses
                              23 views
                              0 reactions
                              Last Post SEQadmin2  
                              Working...