SEQanswers


thedamian 02-05-2013 08:08 AM

UCSC refSeq Gene and hg19 coordinate
 
Hello,
I have a list of mRNA NM_ numbers.
From the UCSC hg19 refGene table, I can get the exon and CDS coordinates for every NM_ accession.

However, when I pull out a subsequence from hg19 based on the refGene coordinates, the result does not seem to be correct for the reverse strand. Taking the reverse complement of the pulled exons doesn't work either.

-------
example:
I have NM_012345.3.
From UCSC I know that for NM_012345 the first CDS is between 50000 and 50100, strand: "-", chr1.
Then I use:
Code:

samtools faidx /path/hg19.fa chr1:50000-50100
The result doesn't start with ATG (and it should).


Where is the problem? I know that UCSC doesn't use the version suffix (NM_012345 instead of NM_012345.3), but it should still work.

(hg19 is downloaded from http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/)

dpryan 02-05-2013 09:37 AM

NM_012345 is on chromosome 13 (check the genome browser). I expect you're either reading something wrong or got the wrong refGene table.

thedamian 02-05-2013 11:17 AM

Quote:

Originally Posted by dpryan (Post 95607)
NM_012345 is on chromosome 13 (check the genome browser). I expect you're either reading something wrong or got the wrong refGene table.

heh, it was an abstract example :) 012345 is like abcdef :)
I tested ~3000 genes this way. 1600 work fine; they are on the "+" strand.
~1400 are on the "-" strand, and when I use samtools faidx for them, I can't get the correct mRNA or CDS.

dpryan 02-05-2013 11:22 AM

Ah, in the future, always give working examples :)

Remember that anything on the "-" strand, when pulled from the forward-strand reference, should end in ATG (actually CAT, its reverse complement), rather than start with it.
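
A quick way to see this, as a minimal sketch: reverse-complement whatever comes back from the forward strand. The `revcomp` helper and the toy sequence below are made up for illustration (they are not from refGene or samtools), and the sketch assumes `rev` and `tr` are available.

```shell
# Reverse-complement a DNA string (hypothetical helper, not part of samtools).
revcomp() {
    printf '%s\n' "$1" | rev | tr 'ACGTacgt' 'TGCAtgca'
}

# Toy forward-strand slice covering a made-up "-" strand CDS:
# it begins with TTA (reverse complement of the stop codon TAA)
# and ends with CAT (reverse complement of ATG).
forward="TTACCCCAT"
revcomp "$forward"   # prints ATGGGGTAA
```

Read in transcript orientation, the same bases start with ATG and end with the stop codon.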

thedamian 02-06-2013 11:47 PM

ok, real example:
I have gene IL10, NM_000572.2.
Based on NM_000572 from UCSC I get:

name: NM_000572
chrom: chr1
strand: -
txStart: 206940947
txEnd: 206945839
cdsStart: 206941980
cdsEnd: 206945780
exonStarts: 206940947,206943173,206944251,206944700,206945615,
exonEnds: 206942073,206943239,206944404,206944760,206945839,
name2: IL10

So the first CDS is from 206941980 to 206942073.

Then I use:
Code:

samtools faidx hg19.fa chr1:206941978-206942075
(I padded each side by 2 bases because UCSC coordinates are 0-based and samtools regions are 1-based.)
The output:
GTCTCAGTTTCGTATCTTCATTGTCATGTAGGCTTCTATGTAGTTGATGAAGATGTCAAACTCACTCATGGCTTTGTAGATGCCTTTCTCTTGGAGCT

No ATG at the start, and no TAC either ;/
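
For what it's worth, here is a minimal sketch of the coordinate conversion, assuming the refGene table follows the usual UCSC convention of 0-based starts with the end not included. Under that assumption only the start needs +1 to get a 1-based, inclusive samtools region; the end stays as it is. The function name `ucsc_to_region` is made up for this example.

```shell
# Convert a UCSC table interval (0-based start, end not included)
# into a 1-based inclusive region string for samtools faidx.
ucsc_to_region() {
    chrom=$1; start=$2; end=$3
    printf '%s:%d-%d\n' "$chrom" "$((start + 1))" "$end"
}

# CDS part of the first listed IL10 exon: cdsStart .. first exonEnd
region=$(ucsc_to_region chr1 206941980 206942073)
echo "$region"                       # chr1:206941981-206942073
echo "$((206942073 - 206941980))"    # interval length in bases: 93

# Then, for example: samtools faidx hg19.fa "$region"
```

With these coordinates the region is 93 bases, i.e. 31 whole codons, whereas chr1:206941978-206942075 spans 98 bases, with 3 extra bases on the left and 2 on the right.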

dpryan 02-07-2013 12:13 AM

It's on the '-' strand, so you're grabbing the end, rather than the beginning :)

thedamian 02-07-2013 12:18 AM

Quote:

Originally Posted by dpryan (Post 95792)
It's on the '-' strand, so you're grabbing the end, rather than the beginning :)

heh yes, I've just realised it.
If the strand is "-", the start codon is at cdsEnd and the stop codon is at cdsStart! Very confusing!
+1 to experience :)
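
To spell that out, a rough sketch of picking the start-codon region by strand, again assuming 0-based, end-exclusive refGene coordinates and 1-based inclusive samtools regions. `start_codon_region` is a hypothetical helper name, and it ignores the corner case of a start codon split across two exons.

```shell
# Print the samtools region holding the start codon of a transcript.
# Arguments: chrom strand cdsStart cdsEnd (as in the refGene table).
start_codon_region() {
    chrom=$1; strand=$2; cds_start=$3; cds_end=$4
    if [ "$strand" = "-" ]; then
        # "-" strand: translation starts at the high-coordinate end
        printf '%s:%d-%d\n' "$chrom" "$((cds_end - 2))" "$cds_end"
    else
        # "+" strand: translation starts at the low-coordinate end
        printf '%s:%d-%d\n' "$chrom" "$((cds_start + 1))" "$((cds_start + 3))"
    fi
}

start_codon_region chr1 - 206941980 206945780   # chr1:206945778-206945780

# The stop codon sits at the other end. Bases 4-6 of the 98-base output
# posted above are chr1:206941981-206941983; reverse-complemented:
printf 'TCA\n' | rev | tr 'ACGT' 'TGCA'         # prints TGA
```

For a "-" strand gene the three bases fetched from that region still need to be reverse-complemented, so on the forward strand one would expect to see CAT there.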

