Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • very common mutation not in SNP databases

    We've got DNA sequencing for exome-capture run for about 40 cancer cell lines. I'm finding some very common differences from the reference genome that isn't in dbSNP or 1000 Genomes and I was wondering if anyone could help explain what it is and if there is a reasonable way to filter off these features.

    One example is A1BG with a T109G (ACC->CCC at position 325 [transcript ENST00000453054], reverse strand), chr19:58862835 T->G (hg19, chr19:63554647 for hg18). I have that mutation on 37/42 samples.

    The various QC metrics are good and the reads are nicely clustering around exonic regions as expected. I checked with a known gene (p53) and the resulting mutations also match up.

    The fact that it appears in so many samples and isn't in either dbSNP or 1000 genomes worries me a little bit. We have over 100 genes that have unknown mutations in over 30 samples, so I'd love to have a decent filter for them.

  • #2
    Either you are very lucky to find a cancer (or at least cell line) specific mutation or perhaps due to the informatics. I'd Sanger sequence the mutation and see if it's indeed what you think it is. Another thing to consider would be pseudogenes and duplicated genomic regions. Both of which may have slight variation with the gene of interest but still pass informatic filters.

    Comment


    • #3
      There is something going here. This does appear to be "common" in the many bamfiles I've looked at.

      Are you seeing something like this in this region (image is centered on your "snp")? :


      The SNP in question is in the middle of the di-sulfide bridge bond motif.
      Notes for Summary of A1BG (from refseq)
      The protein encoded by this gene is a plasma glycoprotein of unknown function. The protein shows sequence similarity to the variable regions of some immunoglobulin supergene family member proteins.

      Comment


      • #4
        csoong: this seems improbable enough that I'm betting against it being a cancer gene (hence my posting here). It would be an incredible case of beginner's luck to find something that prevalent that no one has find before. As I said, this is one example of over a hundred. Resequencing seems like a decent suggestion. I'll check for pseudogenes or duplicated regions and post back.

        Richard: It's more focused in my case, very little changes in the other bases. This sample 9/40 reads that have the mutation here. I'm glad to see that I'm not the only one seeing this.
        Attached Files

        Comment


        • #5
          I don't think it's real. The noise I see is too consistent.

          1) Is this Illumina ?
          2) Is the SNP call all from one direction of aligned reads (all plus or all minus)?
          3) Is the SNP being call only have evidence for it in reads with a sudden drop off in quality right near the called SNP?

          Comment


          • #6
            fpepin: What is it? I think its a false positive due to misalignment. How to filter it? depends on first question. I think its a false positive because I also found it in an exom sample with similar ~45 depth, ELAND alignment. 30% of reads the variant was at ends. 10% there was other mismatches withing 5bps from this position. I have it annotated differently from your annotation as (c.832T>G het p.T278TP not T109G can you check your annotation?) are you using illumina? what alignment algorithm did you use?

            csoong: i have found handful of pseudogenes that cause false positive calls one of which I was able to confirm with sequencing. My experience been two and more variants within close proximity from each other <10bps usually in same read. I have some examples but I don't want to go into details. What platform are you using? what alignment algorithm did you use?

            thanks for this post. This has been worrying me and I havn't found a good way to handle it. Is there other than BWT algorithms that have been used for exom alignment. I have used two different algorithms with same results both use BWT methodology mainly because of large reference one of them is ELAND but I like to try another method of alignment.
            Last edited by husamia; 02-09-2011, 02:17 PM.

            Comment


            • #7
              Originally posted by Richard Finney View Post
              I don't think it's real. The noise I see is too consistent.

              1) Is this Illumina ?
              2) Is the SNP call all from one direction of aligned reads (all plus or all minus)?
              3) Is the SNP being call only have evidence for it in reads with a sudden drop off in quality right near the called SNP?
              1) Yes, it's a GAII.
              2) Yes, that is correct, in this case. Checking with some of my other problem genes, this seems to be the case in all but 1 (which I'm willing to believe is actually interesting). What made you think that this might be the case?
              3) Generally. There is a trend toward the end of the reads (last 5 bases) and other cases where the quality drops sharply and comes back up but there are also a fair number of cases where the quality remains strong for a while after (10+ bases).

              It also seems that some of these reads are mis-aligned, 2 of my problematic genes are on chromosome Y. They account for the totality of the mutations called on chromosome Y on my breast cancer cell lines. These genes also have a lot of mutations close together, leading me to believe that those were mapping errors.

              Looking more in depth, it looks like I've got 16 genes with the strange one-strand-only mutations and 193 that are more likely to be due to mapping errors.

              Is there an explanation as to what would cause a such a consistent artifact like this only on reads from a given strand over dozens of samples?

              --edit--
              husamia: The alignment is done using BWA. You are correct, position 109 is wrong, it is indeed 278. Looks like the annotation code that I inherited is wrong.
              Last edited by fpepin; 02-09-2011, 03:29 PM. Reason: adding response to husamia

              Comment


              • #8
                Try using BWA or Novoalign to realign this data. Call variants with samtools or GATK.

                Edit: Okay. Which variant caller then?

                Did not find the same thing in my exome data. Aligned with novoalign (hard clipping on) and used GATK for variant calling.
                Last edited by Michael.James.Clark; 02-09-2011, 04:01 PM.
                Mendelian Disorder: A blogshare of random useful information for general public consumption. [Blog]
                Breakway: A Program to Identify Structural Variations in Genomic Data [Website] [Forum Post]
                Projects: U87MG whole genome sequence [Website] [Paper]

                Comment


                • #9
                  I was getting some hints of a mutation there, in other Illumina samples I have.

                  Using bambino viewer tool, I see this :
                  Unlimited space to host images, easy to use image uploader, albums, photo hosting, sharing, dynamic image resizing on web and mobile.


                  The little red nib at top between the T and Y is a measure of snpness.
                  All of the snpness is from reads in one direction and they are all on the border of some quality valley. The quality is likely bad the extra two reads in , just not showing up as bad. I asked the bambino author (tool picture shown) and he said :

                  I see some reads have runs of nucleotide quality 2, which smells like the Illumina pipeline masking issue ...

                  Discussion of next-gen sequencing related bioinformatics: resources, algorithms, open source efforts, etc

                  (comment from Torst, 04-24-2010, 09:21 PM)

                  I'm NOT sure if he's right or wrong,but that wall of quality drop off is very strange to look at. Maybe someone else knows what's up.

                  Comment


                  • #10
                    Originally posted by Michael.James.Clark View Post
                    Try using BWA or Novoalign to realign this data. Call variants with samtools or GATK.

                    Edit: Okay. Which variant caller then?

                    Did not find the same thing in my exome data. Aligned with novoalign (hard clipping on) and used GATK for variant calling.
                    This is also using GATK (java -jar GenomeAnalysisTK.jar -T UnifiedGenotyper -pl SOLEXA ...) for the variant calling. Is samtools likely to be different?

                    Could it be that slightly more stringent threshold for trimming reads might get rid of these artifacts, since they tend to be at the end of the reads?

                    Comment


                    • #11
                      Originally posted by fpepin View Post
                      This is also using GATK (java -jar GenomeAnalysisTK.jar -T UnifiedGenotyper -pl SOLEXA ...) for the variant calling. Is samtools likely to be different?

                      Could it be that slightly more stringent threshold for trimming reads might get rid of these artifacts, since they tend to be at the end of the reads?
                      I think if you're seeing it with GATK, you'll probably see it with samtools, though you could try it.

                      Why not try clipping the ends off, realigning, and doing variant calling? Just trim five or ten bases off the end. This is something people often do.

                      Personally, I use Novoalign's hard clipping and adaptor stripping, and that seems to bump up specificity.

                      Also, for GATK, take a look at the -baq parameter (see their wiki). It's recommended to use that if you didn't.
                      Mendelian Disorder: A blogshare of random useful information for general public consumption. [Blog]
                      Breakway: A Program to Identify Structural Variations in Genomic Data [Website] [Forum Post]
                      Projects: U87MG whole genome sequence [Website] [Paper]

                      Comment


                      • #12
                        Were these data sequenced one sample per lane (which I expect), or multiple samples per lane?

                        Were other, non cancer cell line samples sequenced on the same run as cancer cell line samples? (I mean, in our facility there's more than one research group that uses the sequencer, there may be like 5 whole exomes for one group and 3 for another going on)

                        If that's the case, does that data show the same artefact?

                        Comment


                        • #13
                          Bruins:This is 2 lanes per sample. I haven't checked the results from the individual lanes in my haste to get results, but I'm running it now.

                          Also, the runs are basically all cancer-related. I could look for some normals were run with the same protocols, but the artificiality of the having all the mutations in reads on one strand strongly suggest that the problem istechnical in nature.

                          Michael: I have an old version of GATK in my pipeline (1.0.3185) that doesn't have -baq implemented. It looks like I would need to tweak things to get the newer version to work, as it complains about the ordering of my contigs (lexicographically: 10, 11 , ..., 1 ,20, 21, ..., instead of karyotypically: 1,2,3). It can't be hard to fix, though.

                          Having not followed the development of GATK, how important is it to be up to date?

                          Comment


                          • #14
                            Originally posted by fpepin View Post
                            Michael: I have an old version of GATK in my pipeline (1.0.3185) that doesn't have -baq implemented. It looks like I would need to tweak things to get the newer version to work, as it complains about the ordering of my contigs (lexicographically: 10, 11 , ..., 1 ,20, 21, ..., instead of karyotypically: 1,2,3). It can't be hard to fix, though.

                            Having not followed the development of GATK, how important is it to be up to date?
                            It seems to be fairly important at this stage because of baq implementation and the -glm DINDEL feature of the UnifiedGenotyper (the UG is used to more sensitively detect indels than the IndelGenotyperV2!).
                            Mendelian Disorder: A blogshare of random useful information for general public consumption. [Blog]
                            Breakway: A Program to Identify Structural Variations in Genomic Data [Website] [Forum Post]
                            Projects: U87MG whole genome sequence [Website] [Paper]

                            Comment


                            • #15
                              Thanks Micheal, I'll update my pipeline then.

                              I also talked to more people about this issue and the answer is that this is a well-known issue when you have a specific tetramer (CCGG, if I remember correctly) that ends up causing bases to be skipped.

                              This explains why the differences are strand specific and why it happens in very specific places in the genome (including in completely different labs). The reads generally end up being clipped soon after because the indel, but there are cases that are repetitive enough that the few other bases match by chance, leading for the trimming to occur later and the read to be kept.

                              I'm not quite sure where would be best place to recognize this corner case and deal with it, but in my case I'll take the lazy way out and just filter them out at the end.

                              Thanks for all who offered suggestions.

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Strategies for Sequencing Challenging Samples
                                by seqadmin


                                Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                03-22-2024, 06:39 AM
                              • seqadmin
                                Techniques and Challenges in Conservation Genomics
                                by seqadmin



                                The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                                Avian Conservation
                                Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                                03-08-2024, 10:41 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, Yesterday, 06:37 PM
                              0 responses
                              8 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, Yesterday, 06:07 PM
                              0 responses
                              8 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-22-2024, 10:03 AM
                              0 responses
                              49 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-21-2024, 07:32 AM
                              0 responses
                              67 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X