**Richard Finney** · 07-17-2015, 09:36 AM

What does your rnaseq "mutation" data look like?
What does your dbsnp file look like?
What do you mean by "filter out" ?

**zjrouc** · 07-17-2015, 12:21 PM

I used VarScan to call millions of mutations in my RNA-seq data. To get at the somatic mutations, I want to filter out the polymorphisms listed in dbSNP from the mutations VarScan predicted. I don't know how to do it. Which dbSNP database should I download, and how do I use it?

**Richard Finney** · 07-17-2015, 12:36 PM

Do you know any programming languages?
Can you script in shell languages?
What build is the varscan output (hg18,hg19,grch38) ?

Many compressed copies of various versions of dbSNP for hg19 (human) are here:
ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/

For example : snp138.txt.gz or snp142Common.txt.gz

Note: there are likely somatic mutations in dbSNP.

**zjrouc** · 07-20-2015, 08:52 AM

I am using the GRCh38 version of the reference genome. I have downloaded the snp142Common.txt.gz file from ftp://hgdownload.cse.ucsc.edu/goldenPath/hg38/database/. Would you please tell me which software I should use next to remove those polymorphisms?

**Richard Finney** · 07-20-2015, 09:08 AM

What do the first few lines of your varscan output look like ?

(not the header)

**zjrouc** · 07-20-2015, 10:37 AM

Hi, it looks like this:
chr1 10443 . C T . PASS ADP=8;WT=0;HET=1;HOM=0;NC=0 GT:GQ:SDP:DP:RD:AD:FREQ:PVAL:RBQ:ABQ:RDF:RDR:ADF:ADR 0/1:14:8:8:3:4:57.14%:3.4965E-2:32:32:0:3:0:4
chr1 131628 . C A . PASS ADP=58;WT=0;HET=1;HOM=0;NC=0 GT:GQ:SDP:DP:RD:AD:FREQ:PVAL:RBQ:ABQ:RDF:RDR:ADF:ADR 0/1:28:58:58:44:9:16.98%:1.3495E-3:37:40:8:36:6:3

**Richard Finney** · 07-20-2015, 12:37 PM

Ok.

You want to remove any line from the varscan output whose chromosome and position (col1 and col2) also appear as col2 (chrom) and col3 (chromStart) in dbsnp142common.

What's your favorite programming language?

What do you think the next step is?

**zjrouc** · 07-20-2015, 12:42 PM

Well, the first idea that came to mind is to combine these two columns into an identifier and find anything that is in my varscan output but not in the dbSNP database. As for languages, I know some bash commands, but nothing sophisticated. I think sed can do this, right?
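
The "combine two columns into an identifier" idea can be sketched as a two-file awk pass, which is arguably a more natural fit than sed for a lookup like this. This is only a minimal sketch on made-up toy data: the file names, the `known` array and the `chrom:pos` key format are illustrative assumptions, and which dbSNP column actually lines up with the VCF position still has to be checked on the real files.

```shell
# Toy inputs; file names and contents are invented for illustration.
workdir=$(mktemp -d)
# Stand-in for VarScan output: col1 = chrom, col2 = position.
printf 'chr1\t10443\t.\tC\tT\nchr1\t131628\t.\tC\tA\nchr2\t500\t.\tG\tA\n' > "$workdir/varscan.vcf"
# Stand-in for the dbSNP table: col2 = chrom, col3 = the position column being matched.
printf '585\tchr1\t10443\t10444\trs1\n586\tchr2\t5000\t5001\trs2\n' > "$workdir/dbsnp.txt"

# First file (NR == FNR): remember every chrom:pos key seen in dbSNP.
# Second file: pass header lines through, then print only lines whose key is unknown.
awk -F'\t' '
    NR == FNR { known[$2 ":" $3] = 1; next }
    /^#/      { print; next }
    !(($1 ":" $2) in known)
' "$workdir/dbsnp.txt" "$workdir/varscan.vcf" > "$workdir/filtered.vcf"

cat "$workdir/filtered.vcf"
```

The whole dbSNP key set is held in memory here, so on the full snp142Common table this may need a few gigabytes of RAM; the sort-based approaches trade that for disk and sorting time.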

**Richard Finney** · 07-20-2015, 01:07 PM

METHOD1 ... (using sort | uniq -d, then filtering on the dupes)

Ok. dbSNP is a very large file, so a naive script will take a long time on it.
A programming language really comes in handy ... but bash and the unix utilities are up to the task.

Check out "sort" and "uniq".
Cut col1 and col2 from varscan.
Cut col2 and col3 from dbsnp.
"cat" the two files, pipe to "sort", then to "uniq -d" to keep only the duplicated lines (call the result "dupes").
"sort" might need the "--buffer-size" param set to a few gigabytes so it sorts in RAM (not on disk).

You may need to append a tab to the end of each line in "dupes". "sed" can do this for you.
The reason is we don't want "chr1\t123" to match "chr1\t1234", so we make it "chr1\t123\t". This works because varscan separates col2 and col3 with a tab ("\t").

Theoretically, this should then work. "-f" says "use this file for matches" and "-v" says "actually, do the opposite, don't match them". See "man grep" for details.

fgrep -v -f dupes varscan.output
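
Put together, METHOD1 might look roughly like the sketch below. It runs on invented toy files (all names under `$workdir` are placeholders), uses `grep -F`, which is the same thing as `fgrep`, and adds a `sort -u` on each side so that a position listed twice within one file is not mistaken for a cross-file duplicate. That last step is my own precaution, not part of the recipe above.

```shell
workdir=$(mktemp -d)
# Toy VarScan output (col1 = chrom, col2 = pos) and toy dbSNP table (col2, col3).
printf 'chr1\t123\t.\tC\tT\nchr1\t1234\t.\tC\tA\nchr2\t500\t.\tG\tA\n' > "$workdir/varscan.vcf"
printf '585\tchr1\t123\t124\trs1\n586\tchr2\t5000\t5001\trs2\n' > "$workdir/dbsnp.txt"

# One key per line, de-duplicated within each file.
cut -f1,2 "$workdir/varscan.vcf" | sort -u > "$workdir/keys.varscan"
cut -f2,3 "$workdir/dbsnp.txt"   | sort -u > "$workdir/keys.dbsnp"

# Keys present in both files appear twice after sorting; uniq -d keeps exactly those.
# sed then appends a literal tab so "chr1<TAB>123" cannot match "chr1<TAB>1234".
tab=$(printf '\t')
cat "$workdir/keys.varscan" "$workdir/keys.dbsnp" \
    | sort | uniq -d | sed "s/\$/$tab/" > "$workdir/dupes"

# Drop every VarScan line that contains one of the duplicated keys.
grep -F -v -f "$workdir/dupes" "$workdir/varscan.vcf" > "$workdir/filtered.vcf"

cat "$workdir/filtered.vcf"
```

On the toy data only the chr1 123 call is removed; chr1 1234 survives because of the trailing tab.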

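Make sure varscan and dbsnp coordinates agree and aren't "off by one". UCSC tables store chromStart 0-based while VCF positions are 1-based, so look up a known SNP in both files and check which dbsnp column (chromStart or chromEnd) matches before cutting.
Make sure you got the tabs right.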
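
One rough way to check for an off-by-one is to count how many VarScan positions are shared with each candidate dbSNP column and see which one wins. The sketch below does that on toy data that I built to mimic a table with a 0-based start in col3 and a 1-based end in col4; the file names and the `shared_col3` / `shared_col4` variables are made up for the example.

```shell
workdir=$(mktemp -d)
# Toy VarScan output: 1-based positions in col2.
printf 'chr1\t10443\t.\tC\tT\nchr1\t131628\t.\tC\tA\nchr2\t500\t.\tG\tA\n' > "$workdir/varscan.vcf"
# Toy dbSNP rows: col3 = start (position minus one), col4 = end (position).
printf '585\tchr1\t10442\t10443\trs1\n585\tchr1\t131627\t131628\trs2\n' > "$workdir/dbsnp.txt"

cut -f1,2 "$workdir/varscan.vcf" | sort -u > "$workdir/keys.varscan"

# comm -12 prints only the lines common to both sorted inputs ("-" is stdin).
shared_col3=$(cut -f2,3 "$workdir/dbsnp.txt" | sort -u \
    | comm -12 "$workdir/keys.varscan" - | wc -l | tr -d ' ')
shared_col4=$(cut -f2,4 "$workdir/dbsnp.txt" | sort -u \
    | comm -12 "$workdir/keys.varscan" - | wc -l | tr -d ' ')

echo "shared with dbSNP col3: $shared_col3"
echo "shared with dbSNP col4: $shared_col4"
```

On real data, whichever column shares far more positions with the VarScan calls is probably the one to cut. Indels are likely to need more care than this simple position match gives them.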

METHOD2 ... (using "comm")
cut -f1,2 varscanoutput | sort > file1
zcat hg38.snp142Common.txt.gz | cut -f2,3 | sort --buffer-size=20G > file2
comm -12 file1 file2 | awk '{print $0"\t"}' > dupes
fgrep -v -f dupes varscanoutput

**zjrouc** · 07-20-2015, 01:26 PM

Thank you for your help, really appreciated.

**m_two** · 07-27-2015, 02:47 PM

Check out bedtools:

intersect — bedtools 2.31.0 documentation

http://bedtools.readthedocs.org/en/latest/content/tools/intersect.html
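
As a rough idea of what the bedtools route could look like: columns 2-4 of the UCSC snp table (chrom, chromStart, chromEnd) already have the shape of a BED interval, and `bedtools intersect -v` reports the entries of `-a` that overlap nothing in `-b`. The sketch below uses invented toy files and assumes bedtools is installed; it skips the intersect step if it is not.

```shell
workdir=$(mktemp -d)
# Toy VCF with a minimal header, plus one toy dbSNP row covering chr1:10443.
printf '##fileformat=VCFv4.1\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' > "$workdir/varscan.vcf"
printf 'chr1\t10443\t.\tC\tT\t.\tPASS\t.\nchr1\t131628\t.\tC\tA\t.\tPASS\t.\n' >> "$workdir/varscan.vcf"
printf '585\tchr1\t10442\t10443\trs1\n' > "$workdir/dbsnp.txt"

# chrom, chromStart, chromEnd -> three-column BED.
cut -f2,3,4 "$workdir/dbsnp.txt" > "$workdir/dbsnp.bed"

if command -v bedtools >/dev/null 2>&1; then
    # -v: keep only -a records with no overlap in -b; -header: keep the VCF header.
    bedtools intersect -v -header \
        -a "$workdir/varscan.vcf" -b "$workdir/dbsnp.bed" > "$workdir/filtered.vcf"
    cat "$workdir/filtered.vcf"
else
    echo "bedtools not found on PATH; skipping the intersect step" >&2
fi
```

Because bedtools works on intervals, the coordinate conventions of the VCF and BED inputs are handled by the tool rather than by hand, which is one reason to prefer it over plain text matching.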

ExAC may also be a better source of rare germline SNPs in coding regions:

ftp://ftp.broadinstitute.org/pub/ExA...se/release0.3/

ExAC browser

http://exac.broadinstitute.org/
