  • help me with dbsnp

    Hi, all,

    I am checking somatic mutations from cancer cell line RNASeq, could someone tell me in detail how to filter out polymorphisms in human reference genome using dbSNP database?

    Thanks

  • #2
    What does your rnaseq "mutation" data look like?
    What does your dbsnp file look like?
    What do you mean by "filter out" ?



    • #3
      I ran VarScan on my RNA-Seq data and it predicted millions of mutations. To get the somatic mutations, I want to remove from the VarScan predictions any polymorphisms that are in dbSNP. I don't know how to do it. Which dbSNP database should I download, and how do I use it?
      Last edited by zjrouc; 07-17-2015, 12:25 PM.



      • #4
        Do you know any programming languages?
        Can you script in shell languages?
        What build is the varscan output (hg18,hg19,grch38) ?

        Many compressed copies of various versions of dbSNP for hg19 (human) are here ...
        ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/

        For example : snp138.txt.gz or snp142Common.txt.gz

        Note: there are likely somatic mutations in dbSNP.
        Last edited by Richard Finney; 07-17-2015, 12:42 PM.



        • #5
          I am using the GRCh38 version of the reference genome. I have downloaded the snp142Common.txt.gz file from ftp://hgdownload.cse.ucsc.edu/goldenPath/hg38/database/. Could you please tell me which software I should use next to remove those polymorphisms?



          • #6
            What do the first few lines of your varscan output look like ?

            (not the header)



            • #7
              Hi, it looks like this:
              chr1 10443 . C T . PASS ADP=8;WT=0;HET=1;HOM=0;NC=0 GT:GQ:SDP:DP:RD:AD:FREQ:PVAL:RBQ:ABQ:RDF:RDR:ADF:ADR 0/1:14:8:8:3:4:57.14%:3.4965E-2:32:32:0:3:0:4
              chr1 131628 . C A . PASS ADP=58;WT=0;HET=1;HOM=0;NC=0 GT:GQ:SDP:DP:RD:AD:FREQ:PVAL:RBQ:ABQ:RDF:RDR:ADF:ADR 0/1:28:58:58:44:9:16.98%:1.3495E-3:37:40:8:36:6:3



              • #8
                Ok.

                You want to remove any line from the varscan output whose chrom and position match col2 (chrom) and col3 (chromStart) in snp142Common.

                What's your favorite programming language?

                What do you think the next step is?



                • #9
                  Well, the first idea that came to mind is to combine these two columns into an identifier and find anything that is in my varscan output but not in the dbSNP database. As for language, I know some bash commands, but nothing sophisticated. I think sed can do this, right?
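A minimal sketch of that "combine two columns into an identifier" idea, using awk rather than sed (awk is a swapped-in choice here, since sed has no convenient way to look up keys from a second file). The function name `filter_known` and the idea of a pre-extracted two-column key file are assumptions for illustration; the sketch also assumes the key file already uses the same 1-based positions as the VCF and that all keys fit in memory.

```shell
# filter_known KEYS VCF   (hypothetical helper, not from any tool)
#   KEYS: tab-separated chrom and position, one known polymorphism per line
#   VCF:  varscan output
# Prints the VCF header plus every variant whose chrom:pos is NOT in KEYS.
filter_known() {
    awk -F'\t' '
        FILENAME == ARGV[1] { known[$1 ":" $2] = 1; next }  # first file: remember known positions
        /^#/ { print; next }                                # second file: keep VCF header lines
        !(($1 ":" $2) in known)                             # print variants at unseen positions
    ' "$1" "$2"
}
```

Usage would look like `filter_known keys.txt varscan.vcf > filtered.vcf`, where both file names are placeholders.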



                  • #10
                    METHOD1 ... (using sort and uniq -d, then filtering on the dupes)

                    Ok. dbSNP is a very large file, so even a little scripting will take a long time on it.
                    A programming language really comes in handy ... but bash and the unix utilities are up to the task.

                    Check out "sort" and "uniq".
                    Cut col1 and col2 from varscan.
                    Cut col2 and col3 from dbsnp.
                    "cat" the two cut files and pipe them to "sort | uniq -d" to get the duplicates (call the result "dupes").
                    "sort" might need the "--buffer-size" param set to a few gig to sort in RAM (not on disk).

                    You may need to slap a tab on the end of each line in "dupes". "sed" can do this for you.
                    The reason is that we don't want "chr1\t123" to match "chr1\t1234", so we make it "chr1\t123\t", because varscan separates col2 and col3 with a tab ("\t").


                    Theoretically, this should then work ... "-f" says "use this file for matches" and "-v" says "actually, do the opposite, don't match them". See "man grep" for details.

                    fgrep -v -f dupes varscan.output.
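The steps of METHOD1 can be strung together roughly as follows. This is a sketch under assumptions, not a tested pipeline for real data: the function name `method1` and its optional third argument (which dbsnp column to use as the position, defaulting to col3 as described above) are made up for illustration. Each key list is de-duplicated with `sort -u` before the two are combined, so that a position listed twice in one file alone is not mistaken for a dupe. `awk` is used instead of `sed` to append the tab, because `\t` in a sed replacement is not portable.

```shell
# method1 VCF SNPTABLE_GZ [POS_COL]   (hypothetical helper)
#   VCF:         varscan output
#   SNPTABLE_GZ: gzipped UCSC snp table (col2 = chrom)
#   POS_COL:     dbsnp column holding the position, default 3 (chromStart)
method1() {
    vcf=$1
    snp_gz=$2
    pos_col=${3:-3}
    dupes=$(mktemp)
    {
        # chrom and pos from varscan, header lines dropped
        grep -v '^#' "$vcf" | cut -f1,2 | sort -u
        # chrom and position from dbsnp
        gzip -dc "$snp_gz" | cut -f"2,$pos_col" | sort -u
    } | sort | uniq -d | awk '{ print $0 "\t" }' > "$dupes"   # keys seen in both, plus trailing tab
    # drop every varscan line containing one of the "chrom<TAB>pos<TAB>" keys
    grep -F -v -f "$dupes" "$vcf"
    rm -f "$dupes"
}
```

On a real dbsnp file the `sort` calls would probably want the `--buffer-size` option mentioned above (a GNU sort option). Note that `grep -F` matches the key anywhere in the line, not only at the start, which should be harmless for VCF lines but is worth keeping in mind.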

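                    Make sure varscan and dbsnp are not "one off", that is, that their coordinates agree. UCSC tables store a 0-based chromStart while VCF positions are 1-based, so if nothing matches on col3 (chromStart), try col4 (chromEnd) instead.
                    Make sure you get the tabs right.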
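One way to check for the "off by one" problem before filtering is to count how many varscan positions show up in each candidate dbsnp column. A rough sketch, where `count_matches` is a hypothetical helper name and the column layout (col2 = chrom, col3 = chromStart, col4 = chromEnd) is the UCSC snp table layout assumed throughout this thread:

```shell
# count_matches VCF SNPTABLE_GZ COL   (hypothetical helper)
#   Prints how many distinct chrom/pos pairs from VCF also occur as
#   col2 (chrom) plus column COL of the gzipped dbsnp table.
count_matches() {
    vcf=$1
    snp_gz=$2
    col=$3
    gzip -dc "$snp_gz" | awk -F'\t' -v c="$col" '
        # first input (the VCF): remember every non-header chrom/pos pair
        FILENAME == ARGV[1] { if ($0 !~ /^#/) want[$1 "\t" $2] = 1; next }
        # second input (dbsnp on stdin): record pairs that the VCF also has
        ($2 "\t" $c) in want { hit[$2 "\t" $c] = 1 }
        END { n = 0; for (k in hit) n++; print n }
    ' "$vcf" -
}
```

Running it once with `3` and once with `4` as the last argument and comparing the two counts should show which column lines up with the VCF positions. Only the varscan keys are held in memory, so the size of dbsnp should not matter much here.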

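                    METHOD2 ... (using "comm")
                    cut -f1,2 varscanoutput | sort > file1
                    zcat hg38.snp142Common.txt.gz | cut -f2,3 | sort --buffer-size=20G > file2
                    # keep the keys common to both files and append a tab to each whole line
                    comm -12 file1 file2 | awk '{print $0"\t"}' > dupes
                    # "-f" must be followed directly by the pattern file name
                    fgrep -v -f dupes varscanoutput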



                    • #11
                      Thank you for your help, really appreciated.
                      Last edited by zjrouc; 07-20-2015, 01:35 PM.



                      • #12
                        Check out bedtools:



                        ExAC may also be a better source of rare germline SNPs in coding regions:

                        ftp://ftp.broadinstitute.org/pub/ExA...se/release0.3/
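For the bedtools suggestion, a minimal sketch of how it might be applied to this thread's files. The helper names `snp_table_to_bed` and `remove_known` are made up for illustration, and the sketch assumes bedtools is installed and that the dbsnp file is the UCSC snp table discussed earlier (col2 to col5 = chrom, chromStart, chromEnd, name, which is already BED layout). bedtools reads VCF directly and handles the 1-based versus 0-based difference itself.

```shell
# snp_table_to_bed SNPTABLE_GZ   (hypothetical helper)
#   Writes chrom, chromStart, chromEnd, name to stdout as a 4-column BED.
snp_table_to_bed() {
    gzip -dc "$1" | cut -f2-5
}

# remove_known VCF BED   (hypothetical helper, needs bedtools on the PATH)
#   -v keeps only VCF records that overlap nothing in BED;
#   -header copies the VCF header lines to the output.
remove_known() {
    bedtools intersect -v -header -a "$1" -b "$2"
}
```

Usage would be along the lines of `snp_table_to_bed snp142Common.txt.gz > snp142Common.bed` followed by `remove_known varscan.vcf snp142Common.bed > filtered.vcf`, with all file names being placeholders.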
