SEQanswers

07-17-2015, 08:38 AM   #1
zjrouc
Member
Location: USA
Join Date: Sep 2010
Posts: 25
help me with dbsnp

Hi, all,

I am looking for somatic mutations in RNA-Seq data from a cancer cell line. Could someone tell me in detail how to filter out known polymorphisms in the human reference genome using the dbSNP database?

Thanks

07-17-2015, 09:36 AM   #2
Richard Finney
Senior Member
Location: bethesda
Join Date: Feb 2009
Posts: 700

What does your rnaseq "mutation" data look like?
What does your dbsnp file look like?
What do you mean by "filter out" ?

07-17-2015, 12:21 PM   #3
zjrouc
Member
Location: USA
Join Date: Sep 2010
Posts: 25

I ran VarScan on my RNA-Seq data and it predicted millions of mutations. To get at the somatic mutations, I want to remove the polymorphisms listed in dbSNP from the VarScan predictions, but I don't know how to do it. Which dbSNP file should I download, and how do I use it?

Last edited by zjrouc; 07-17-2015 at 12:25 PM.

07-17-2015, 12:36 PM   #4
Richard Finney
Senior Member
Location: bethesda
Join Date: Feb 2009
Posts: 700

Do you know any programming languages?
Can you script in shell languages?
What build is the varscan output (hg18,hg19,grch38) ?

Many compressed copies of various versions of dbsnp for hg19 (human) are here ...
ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/

For example : snp138.txt.gz or snp142Common.txt.gz

Note: there are likely somatic mutations in dbSNP.

Last edited by Richard Finney; 07-17-2015 at 12:42 PM.

07-20-2015, 08:52 AM   #5
zjrouc
Member
Location: USA
Join Date: Sep 2010
Posts: 25

I am using the GRCh38 version of the reference genome. I have downloaded the snp142Common.txt.gz file from ftp://hgdownload.cse.ucsc.edu/goldenPath/hg38/database/. Would you please tell me which software I should use next to remove those polymorphisms?

07-20-2015, 09:08 AM   #6
Richard Finney
Senior Member
Location: bethesda
Join Date: Feb 2009
Posts: 700

What do the first few lines of your varscan output look like ?

(not the header)

07-20-2015, 10:37 AM   #7
zjrouc
Member
Location: USA
Join Date: Sep 2010
Posts: 25

Hi, it looks like this:
chr1 10443 . C T . PASS ADP=8;WT=0;HET=1;HOM=0;NC=0 GT:GQ:SDP:DP:RD:AD:FREQ:PVAL:RBQ:ABQ:RDF:RDR:ADF:ADR 0/1:14:8:8:3:4:57.14%:3.4965E-2:32:32:0:3:0:4
chr1 131628 . C A . PASS ADP=58;WT=0;HET=1;HOM=0;NC=0 GT:GQ:SDP:DP:RD:AD:FREQ:PVAL:RBQ:ABQ:RDF:RDR:ADF:ADR 0/1:28:58:58:44:9:16.98%:1.3495E-3:37:40:8:36:6:3

07-20-2015, 12:37 PM   #8
Richard Finney
Senior Member
Location: bethesda
Join Date: Feb 2009
Posts: 700

Ok.

You want to remove anything from the varscan output whose chromosome and position (col1 and col2) also appear as col2 (chrom) and col3 (chromStart) in snp142Common.

What's your favorite programming language?

What do you think the next step is?

07-20-2015, 12:42 PM   #9
zjrouc
Member
Location: USA
Join Date: Sep 2010
Posts: 25

Well, the first idea that came to mind is to combine these two columns into an identifier, and then find anything that is in my VarScan output but not in the dbSNP database. As for languages, I know some bash commands, but nothing that sophisticated. I think sed can do this, right?

07-20-2015, 01:07 PM   #10
Richard Finney
Senior Member
Location: bethesda
Join Date: Feb 2009
Posts: 700

METHOD1 ... (using sort and uniq -d, then filtering based on the dupes)

Ok. dbsnp is a muy grande file. A little scripting will take a long time on a file that big.
A programming language really comes in handy ... but bash and the unix utilities are up to the task.

Check out "sort" and "uniq".
Cut col1 and col2 from varscan.
Cut col2 and col3 from dbsnp.
"cat" the two files, pipe to "sort", then to "uniq -d" to keep only the duplicated lines (call the result "dupes").
"sort" might need the "--buffer-size" param set to a few gig so it sorts in RAM (not on disk).

You may need to slap a tab on the end of each line in "dupes". "sed" can do this for you.
The reason is we don't want "chr\t123" to match "chr\t1234", so we make it "chr\t123\t"; this works because varscan separates col2 and col3 with a tab ("\t").


Theoretically, this should then work ... "-f" says "use this file for matches" and "-v" says "actually, do the opposite, don't match them". See "man grep" for details.

fgrep -v -f dupes varscan.output

Make sure varscan and dbsnp are not "one off", that is, their coordinates agree and aren't "off by one".
Make sure you get the tabs right.
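
For what it's worth, METHOD1 can be sketched end to end on toy data as below. Everything in it is an assumption for illustration: the file names, the three fake VarScan rows and the two fake dbSNP rows. It also assumes SNVs only and takes col4 (chromEnd) from the UCSC table rather than col3, because UCSC chromStart is 0-based while the VCF POS column is 1-based; that is the "off by one" to watch for. awk is used in place of sed to append the tab, since "\t" in a sed replacement is not portable, and "grep -F" is the same thing as fgrep.

```shell
#!/usr/bin/env bash
# Toy run of METHOD1. All file names and rows below are made up.
tmp=$(mktemp -d)

# Fake VarScan rows: CHROM, POS, ID, REF, ALT (tab separated, no header).
printf 'chr1\t123\t.\tC\tT\n'  >  "$tmp/varscan.output"
printf 'chr1\t1234\t.\tG\tA\n' >> "$tmp/varscan.output"
printf 'chr2\t500\t.\tA\tG\n'  >> "$tmp/varscan.output"

# Fake UCSC snp rows: bin, chrom, chromStart (0-based), chromEnd, name.
printf '585\tchr1\t122\t123\trs1\n' >  "$tmp/snp.txt"
printf '585\tchr2\t998\t999\trs2\n' >> "$tmp/snp.txt"

# One "chrom<TAB>pos" key per line. sort -u de-duplicates each list first,
# so that uniq -d below only reports keys present in BOTH files.
cut -f1,2 "$tmp/varscan.output" | sort -u > "$tmp/vs.keys"
# ASSUMPTION: SNVs only, so chromEnd (col4) equals the 1-based VCF POS.
cut -f2,4 "$tmp/snp.txt" | sort -u > "$tmp/db.keys"

# Keys seen twice are in both files; append a tab so "123" cannot match "1234".
cat "$tmp/vs.keys" "$tmp/db.keys" | sort | uniq -d \
  | awk '{ print $0 "\t" }' > "$tmp/dupes"

# Keep only the VarScan rows that match none of the dupes.
filtered=$(grep -F -v -f "$tmp/dupes" "$tmp/varscan.output")
printf '%s\n' "$filtered"
```

On the toy data this drops the chr1 123 row and keeps the chr1 1234 and chr2 500 rows.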

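METHOD2 ... (using "comm")
cut -f1,2 varscanoutput | sort > file1
zcat hg38.snp142Common.txt.gz | cut -f2,3 | sort --buffer-size=20G > file2
# comm -12 prints only the lines common to both sorted files; $0 keeps the whole "chrom<TAB>pos" key
comm -12 file1 file2 | awk '{print $0"\t"}' > dupes
# -v must come before -f, otherwise grep takes "-v" as the pattern file name
fgrep -v -f dupes varscanoutput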
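
If grep with a huge pattern file turns out to be slow, one alternative (not from the posts above, just a common awk idiom) is an exact lookup on the two key columns. The sketch below uses made-up file names and rows, again assumes SNVs only so that col4 (chromEnd) of the UCSC table can be compared with the 1-based VCF POS, and holds the VarScan keys in memory while the dbSNP rows stream past them.

```shell
#!/usr/bin/env bash
# Toy run of an exact-key filter in awk. File names and rows are made up.
tmp=$(mktemp -d)

printf '#CHROM\tPOS\tID\tREF\tALT\n' >  "$tmp/varscan.output"
printf 'chr1\t123\t.\tC\tT\n'        >> "$tmp/varscan.output"
printf 'chr1\t1234\t.\tG\tA\n'       >> "$tmp/varscan.output"
printf 'chr11\t123\t.\tA\tG\n'       >> "$tmp/varscan.output"

# Fake UCSC snp rows: bin, chrom, chromStart (0-based), chromEnd, name.
printf '585\tchr1\t122\t123\trs1\n'   >  "$tmp/snp.txt"
printf '585\tchr1\t4999\t5000\trs2\n' >> "$tmp/snp.txt"

# Pass 1: remember every VarScan key, then print the dbSNP keys that hit one.
# ASSUMPTION: SNVs only, so chromEnd (col4) is the 1-based position.
awk -F'\t' '
  NR == FNR { if ($0 !~ /^#/) want[$1 FS $2]; next }
  (($2 FS $4) in want) { print $2 FS $4 }
' "$tmp/varscan.output" "$tmp/snp.txt" | sort -u > "$tmp/hits"

# Pass 2: print header lines plus every VarScan row whose key is not a hit.
novel=$(awk -F'\t' -v hits="$tmp/hits" '
  BEGIN { while ((getline line < hits) > 0) known[line] }
  /^#/ || !(($1 FS $2) in known)
' "$tmp/varscan.output")
printf '%s\n' "$novel"
```

With the real table, the second file argument in pass 1 could presumably be replaced by "-" and fed from zcat, so the dbSNP file never has to be uncompressed on disk.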

07-20-2015, 01:26 PM   #11
zjrouc
Member
Location: USA
Join Date: Sep 2010
Posts: 25

Thank you for your help, really appreciated.

Last edited by zjrouc; 07-20-2015 at 01:35 PM.

07-27-2015, 02:47 PM   #12
m_two
Member
Location: USA
Join Date: Mar 2010
Posts: 50

Check out bedtools:

http://bedtools.readthedocs.org/en/l...intersect.html

ExAC may also be a better source of rare germline SNPs in coding regions:

ftp://ftp.broadinstitute.org/pub/ExA...se/release0.3/
http://exac.broadinstitute.org/
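
A minimal sketch of the bedtools route, assuming bedtools is installed and using made-up file names and rows. It turns the UCSC table into BED by keeping chrom, chromStart, chromEnd and name (cols 2-5), then asks "bedtools intersect -v" for the VarScan records that overlap nothing in it. bedtools treats VCF positions as 1-based and BED starts as 0-based, so no manual shift should be needed. Like the grep approach, this matches on position only, not on the alternate allele.

```shell
#!/usr/bin/env bash
# Toy run of the bedtools approach. File names and rows are made up.
tmp=$(mktemp -d)

# Minimal fake VCF with a header and two records.
printf '##fileformat=VCFv4.1\n'                          >  "$tmp/varscan.vcf"
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' >> "$tmp/varscan.vcf"
printf 'chr1\t123\t.\tC\tT\t.\tPASS\tADP=8\n'            >> "$tmp/varscan.vcf"
printf 'chr1\t1234\t.\tG\tA\t.\tPASS\tADP=58\n'          >> "$tmp/varscan.vcf"

# Fake UCSC snp rows: bin, chrom, chromStart, chromEnd, name.
printf '585\tchr1\t122\t123\trs1\n' > "$tmp/snp.txt"

# Drop the bin column to get BED4: chrom, start (0-based), end, name.
cut -f2-5 "$tmp/snp.txt" > "$tmp/snp.bed"

if command -v bedtools >/dev/null 2>&1; then
  # -v: report records in -a with no overlap in -b; -header: keep the VCF header.
  kept=$(bedtools intersect -v -header -a "$tmp/varscan.vcf" -b "$tmp/snp.bed")
  printf '%s\n' "$kept"
else
  echo "bedtools not found; skipping" >&2
fi
```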

All times are GMT -8.