SEQanswers

Old 09-03-2010, 12:15 PM   #1
Jon_Keats
Senior Member
 
Location: Phoenix, AZ

Join Date: Mar 2010
Posts: 279
Who let M's and R's in the Genome

Just looking for other people's input on something.

Has anyone noticed that there are two R and one M IUPAC codes on chromosome 3 in the reference genomes, both NCBI36 and GRCh37? Maybe not surprisingly, they sit in the FHIT gene.

Does this worry anyone for genome indexing or the like? It seems minor at 3 bases out of roughly 3 billion.
Attached Images
File Type: jpg GRCh37vsNCBI36_Blast.jpg (64.9 KB, 34 views)
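
For anyone wanting to check their own copy of a reference, here is a minimal Python sketch that scans FASTA input and reports every base outside A, C, G, T and N. The function name `find_ambiguity_codes` and the toy sequence are made up for illustration; the real chromosome 3 coordinates are not reproduced here.

```python
import io
from collections import Counter


def find_ambiguity_codes(fasta_handle, expected="ACGTN"):
    """Yield (sequence_name, 1-based position, character) for each unexpected base.

    Hypothetical helper: reads a FASTA file line by line, so it never
    holds a whole chromosome in memory.
    """
    allowed = set(expected)
    name, pos = None, 0
    for line in fasta_handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # New record: keep the first word of the header, reset the counter.
            parts = line[1:].split()
            name = parts[0] if parts else ""
            pos = 0
            continue
        # Upper-case so soft-masked (lower-case) bases are not reported.
        for ch in line.upper():
            pos += 1
            if ch not in allowed:
                yield name, pos, ch


# Toy input standing in for a real reference file (assumption, not real data).
toy = io.StringIO(">chr3 toy\nACGTRACG\nTTMNNACR\n")
hits = list(find_ambiguity_codes(toy))
for name, pos, ch in hits:
    print(name, pos, ch)
print(Counter(ch for _, _, ch in hits))
```

On a real genome you would pass an open file handle instead of the `io.StringIO` toy, e.g. `open("genome.fa")`, where the file name is whatever your local copy is called.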
Old 09-03-2010, 01:46 PM   #2
john_mu
Member
 
Location: Stanford, CA

Join Date: May 2010
Posts: 88

Yes I noticed that, they are quite odd... I just replace them with N's when doing analysis.
__________________
SpliceMap: De novo detection of splice junctions from RNA-seq
Download SpliceMap Comment here
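
The "replace them with N's" approach in the post above could be sketched roughly as follows. This is only one possible way to do it, not necessarily what the poster used; `mask_ambiguity_codes` is a hypothetical name, and the sketch assumes plain DNA FASTA text where header lines start with `>`.

```python
# Translation table: every IUPAC ambiguity code other than N becomes N,
# keeping case so that soft-masking (lower-case) is preserved.
_AMBIGUOUS = "RYSWKMBDHV"
_TO_N = str.maketrans(
    _AMBIGUOUS + _AMBIGUOUS.lower(),
    "N" * len(_AMBIGUOUS) + "n" * len(_AMBIGUOUS),
)


def mask_ambiguity_codes(fasta_text):
    """Return FASTA text with ambiguity codes in sequence lines replaced by N/n.

    Header lines (starting with '>') are left untouched, since they may
    legitimately contain letters such as R or M.
    """
    out = []
    for line in fasta_text.splitlines(keepends=True):
        if line.startswith(">"):
            out.append(line)
        else:
            out.append(line.translate(_TO_N))
    return "".join(out)


print(mask_ambiguity_codes(">chrM R\nACGTRacgm\n"), end="")
```

Sequence length is unchanged, so coordinates in downstream annotation stay valid.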
Old 09-08-2010, 07:36 AM   #3
bioinfosm
Senior Member
 
Location: USA

Join Date: Jan 2008
Posts: 482

Surprisingly, I noticed this yesterday too!
Not expected, but no clue if it affects things downstream...
__________________
--
bioinfosm
Old 09-08-2010, 08:51 AM   #4
foxyg
Member
 
Location: US

Join Date: May 2010
Posts: 54

I know that in some cases they use N to mask sequences that are repeats.
Old 09-08-2010, 12:50 PM   #5
Joann
Senior Member
 
Location: Woodbridge CT

Join Date: Oct 2008
Posts: 231
the other letters in the code

m= a or c, as in amino

r= g or a, as in purine

n= a or g or c or t/u, unknown, or other
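
The three codes above are part of the standard IUPAC nucleotide table. As a rough sketch, the DNA portion of that table can be written as a lookup from each code to the set of bases it stands for. The names `IUPAC_DNA`, `expand` and `compatible` are made up for this example.

```python
# Standard IUPAC nucleotide codes (DNA alphabet), code -> bases it can represent.
IUPAC_DNA = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG",   # puRine
    "Y": "CT",   # pYrimidine
    "S": "CG",   # Strong (three hydrogen bonds)
    "W": "AT",   # Weak (two hydrogen bonds)
    "K": "GT",   # Keto
    "M": "AC",   # aMino
    "B": "CGT",  # not A
    "D": "AGT",  # not C
    "H": "ACT",  # not G
    "V": "ACG",  # not T
    "N": "ACGT", # any base
}


def expand(code):
    """Return the set of concrete bases a single IUPAC code can stand for."""
    return set(IUPAC_DNA[code.upper()])


def compatible(code, base):
    """True if a concrete base (A, C, G or T) is allowed by the IUPAC code."""
    return base.upper() in expand(code)


print(sorted(expand("M")))
print(compatible("R", "G"), compatible("R", "C"))
```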
Old 09-08-2010, 01:05 PM   #6
janejane
Member
 
Location: Midwest

Join Date: Aug 2010
Posts: 15

Quote:
Originally Posted by Joann
m= a or c, as in amino

r= g or a, as in purine

n= a or g or c or t/u, unknown, or other
ah ha, good to know, thanks!
Old 09-08-2010, 01:31 PM   #7
Joann
Senior Member
 
Location: Woodbridge CT

Join Date: Oct 2008
Posts: 231
Wait! there's more...

Please see Annex C, Appendix 2, Table 1, page 16 at, for example,

http://www.noip.gov.vn/noip/resource.nsf/vwResourceList/B4F5E35FA26A8AA4472577360013F1D3/$FILE/Standards%20%E2%80%93%20ST25.pdf

for a complete list of nucleotide letter symbols in use per a current international standard.

See also:
"An extended IUPAC nomenclature code for polymorphic nucleic acids", doi:10.1093/bioinformatics/btq098

Last edited by Joann; 01-21-2011 at 02:09 PM. Reason: update
Old 09-08-2010, 11:34 PM   #8
seq_GA
Senior Member
 
Location: Asiana

Join Date: Feb 2009
Posts: 124

Interesting to know something apart from A,T,G,C,N.

But UCSC has them as "N"s
Old 09-09-2010, 05:28 PM   #9
malachig
Senior Member
 
Location: WashU

Join Date: Aug 2010
Posts: 117

These additional letters are sometimes called 'ambiguity codes'. Back in the day when a 30X human genome sequence cost a billion dollars instead of several thousand, every piece of sequence information was much more precious. Knowing a position was a purine was better than calling it an N. The codes are also useful for reporting heterozygous genotype information as a single letter. The fact that they still occur in reference genomes is mostly just a nuisance for bioinformatics, and thus some resources such as UCSC convert them to N's. I believe human genome sequences retrieved via Ensembl may still contain them, though.
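
To illustrate the point about reporting a heterozygous genotype as a single letter, here is a small sketch that maps a pair of alleles to its IUPAC code. The function name `genotype_to_code` is hypothetical, and the sketch only covers biallelic single-base genotypes.

```python
# The six two-base IUPAC codes, keyed by the unordered pair of bases.
_TWO_BASE_CODES = {
    frozenset("AG"): "R",
    frozenset("CT"): "Y",
    frozenset("CG"): "S",
    frozenset("AT"): "W",
    frozenset("GT"): "K",
    frozenset("AC"): "M",
}
_BASES = {"A", "C", "G", "T"}


def genotype_to_code(allele1, allele2):
    """Collapse a diploid single-base genotype into one IUPAC letter.

    Homozygous genotypes return the base itself; heterozygous genotypes
    return the matching two-base ambiguity code. Allele order is ignored.
    """
    a, b = allele1.upper(), allele2.upper()
    if a not in _BASES or b not in _BASES:
        raise ValueError("alleles must each be one of A, C, G, T")
    if a == b:
        return a
    return _TWO_BASE_CODES[frozenset((a, b))]


print(genotype_to_code("A", "G"))  # heterozygous A/G
print(genotype_to_code("C", "A"))  # heterozygous A/C
print(genotype_to_code("T", "T"))  # homozygous T
```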