SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
not able to load a bed file into UCSC genome browser ataaillah Introductions 0 01-18-2012 07:46 AM
UCSC LiftOver questions geneart Bioinformatics 5 09-26-2011 12:17 AM
exon numbers in UCSC genome browser ryantkoehler Bioinformatics 1 07-26-2011 03:13 PM
De novo exon quantification from BAM files nsalomonis Bioinformatics 0 03-31-2011 10:20 PM
UCSC .bed file hsmart Bioinformatics 3 06-10-2010 06:49 AM

Reply
 
Thread Tools
Old 08-25-2010, 11:29 AM   #1
yuanzhi
Member
 
Location: New York

Join Date: Aug 2010
Posts: 19
Default Questions about exon bed files downloaded from UCSC

Dear All

I just downloaded a refseq exon table (BED format) from UCSC table browser. I notice that the exon coordinates are not unique in the table. For example, in the following case, the exon start and end coordinates are all the same, but the gene IDs are different. Does it mean different genes or different isoforms?

Code:
chr1    100100400       100100567       NM_000642_exon_3_0_chr1_100100401_f     0       +
chr1    100100400       100100567       NM_000644_exon_3_0_chr1_100100401_f     0       +
chr1    100100400       100100567       NM_000643_exon_3_0_chr1_100100401_f     0       +
chr1    100100400       100100567       NM_000028_exon_3_0_chr1_100100401_f     0       +
chr1    100100400       100100567       NM_000646_exon_3_0_chr1_100100401_f     0       +
chr1    100100400       100100567       NM_000645_exon_1_0_chr1_100100401_f     0       +
Also, in this following case,

Code:
chr1    100088632       100088762       NM_000644_exon_0_0_chr1_100088633_f     0       +
chr1    100088632       100088762       NM_000643_exon_0_0_chr1_100088633_f     0       +
chr1    100088632       100089042       NM_000028_exon_0_0_chr1_100088633_f     0       +
the exon end coordinate for the last one is different from the other two. The gene IDs also differ. What does this mean?

All your help will be appreciated.

-YZ
yuanzhi is offline   Reply With Quote
Old 08-25-2010, 12:16 PM   #2
adamdeluca
Member
 
Location: Iowa City, IA

Join Date: Jul 2010
Posts: 95
Default

Refseq ids get assigned to each isoform so there is no distinguishing between different splice variants and different genes. You might be better off using ensembl gene id's and transcript id's if you care about the distinction. The ensGene table in UCSC contains these data.
adamdeluca is offline   Reply With Quote
Old 08-25-2010, 12:25 PM   #3
yuanzhi
Member
 
Location: New York

Join Date: Aug 2010
Posts: 19
Default

Thanks for your response.

I was told that ensemble gene coordinates are slightly different from UCSC's. Currently all my work are based on UCSC's refseq. I am not sure if ensGene is compatible with my current data.

How about if I manually remove all the redundant coordinates from refseq exons? I mean, I will just keep those exons with different start/end coordinates. Will this be ok?

Thanks
yuanzhi is offline   Reply With Quote
Old 08-25-2010, 12:34 PM   #4
adamdeluca
Member
 
Location: Iowa City, IA

Join Date: Jul 2010
Posts: 95
Default

http://code.google.com/p/bedtools/

bedtools is really useful for those types of manipulations, you can use mergeBed to get a non-redundant bed file.

Example:
chr1 1 5
chr1 3 10
would become:
chr1 1 10
adamdeluca is offline   Reply With Quote
Old 08-26-2010, 01:07 AM   #5
steven
Senior Member
 
Location: Southern France

Join Date: Aug 2009
Posts: 269
Default

What you want to do?
It is true that the "gene_id" field from UCSC exons is actually a transcript ID, but the same happens if you download the ensembl annotation from UCSC. As adamdeluca said, you need an additional table to map the IDs (there is an equivalent of ensGene for RefSeq of course). I would say that if you start with refseq, sticking on refseq is indeed the best option.
As for the next step, it depends on what you want to do. Keep in mind that merging exons leads to artificial exons (like the one in the example above) that do not exist in vivo.
If you just want to remove redundancy and keep distinct exons, a simple command like
awk '{print $1,$2,$3,$6}' your_exon_file.bed | sort -u
or
awk '{OFS="\t";$4=".";print $0}' your_exon_file.bed | sort -u
should work fine.
s.
steven is offline   Reply With Quote
Old 08-26-2010, 06:01 AM   #6
yuanzhi
Member
 
Location: New York

Join Date: Aug 2010
Posts: 19
Default

steven, thank you for your response first.

I want to know how much percentage of all refseq exon bases in one chromosome is covered by my sequencing data. The formula = the number of refseq exon bases covered by my sequencing/the total number of all refseq exon bases in that chromosome.

In this case, it seems like I should merge exons, right?

when you said "If you just want to remove redundancy and keep distinct exons", how do you define "distinct exons"?

if in this case
chr1 1 5
chr1 3 10

should I consider them as distinct exons or unite them to chr1 1 10?

Thanks
yuanzhi is offline   Reply With Quote
Old 08-26-2010, 06:27 AM   #7
steven
Senior Member
 
Location: Southern France

Join Date: Aug 2009
Posts: 269
Default

OK i see. Two analyzes could be considered, each of them gives you an indication about the transcriptome coverage.
A) Proportion of (distinct) exons that overlap a read.
B) Proportion of the exonic bases that are covered by a read.
("a read" = "at least one read")

A) Exon-level.
1. Extract distinct exons as indicated in my previous post. Distinct exons are defined as not identical, therefore with different {chrom,start,end,strand} values just ignore the strand if your RNA-seq protocol is not strand-specific. In the previous example the 2 exons are distinct.
2. Compare them with your reads and define how many of them share at least N bases with a read (usually N=1).

B) Base-level.
1. Merge your exons (do not take the strand into account) to get the transcribed regions of the genome. In the example you would get {chr1 1 10} indeed.
2. Compute the proportion among these bases of the ones that are included in the reads.

For example, with a read like
chr1 8 10
and your 2 exons from above,
A=50%
B=30% (if 1-based system like gff, otherwise 22.2% in 0-based bed system).

A is easier, B is more precise.
You can use BEDtools or Galaxy do do all that.
s.
steven is offline   Reply With Quote
Old 08-26-2010, 07:38 AM   #8
adamdeluca
Member
 
Location: Iowa City, IA

Join Date: Jul 2010
Posts: 95
Default

B is really easy with bedtools. To get the number of covered exonic bases:
Code:
mergeBed -i exons.bed | intersectBed -a reads.bed -b stdin | mergeBed -i stdin | awk '{SUM += $3-$2} END {print SUM}'
adamdeluca is offline   Reply With Quote
Old 08-26-2010, 07:59 AM   #9
yuanzhi
Member
 
Location: New York

Join Date: Aug 2010
Posts: 19
Default

Steven

Quote:
B) Base-level.
1. Merge your exons (do not take the strand into account) to get the transcribed regions of the genome. In the example you would get {chr1 1 10} indeed.
2. Compute the proportion among these bases of the ones that are included in the reads.
why not take the strand into account? I use BWA for mapping and was told BWA maps reads only to the "+" strand. If it is correct, maybe I should not merge exons which are on the "-" strand?

Thanks

-y

Last edited by yuanzhi; 08-26-2010 at 11:52 AM.
yuanzhi is offline   Reply With Quote
Old 08-27-2010, 05:35 AM   #10
steven
Senior Member
 
Location: Southern France

Join Date: Aug 2009
Posts: 269
Default

Quote:
Originally Posted by yuanzhi View Post
Steven



why not take the strand into account? I use BWA for mapping and was told BWA maps reads only to the "+" strand. If it is correct, maybe I should not merge exons which are on the "-" strand?

Thanks

-y
If the protocol is not strand-specific, each read can come from any strand, including the genes from the minus strand. You do not want to remove those from the analysis. Reads mapped by BWA on the plus strand will overlap the genes from the minus strand anyway if you do not take the strand info into account. not sure i am clear though..
steven is offline   Reply With Quote
Old 05-22-2011, 08:01 AM   #11
sneha
Junior Member
 
Location: Boston

Join Date: Feb 2011
Posts: 3
Default .BED files

Hi all,

I am trying to get the .BED files of all the tracks from USCS browser. Is there any way that I can proceed with this? Thanks
sneha is offline   Reply With Quote
Old 05-09-2012, 05:36 AM   #12
qqtwee
Member
 
Location: Beijing

Join Date: Feb 2011
Posts: 16
Default how can I take the strand info into account

Quote:
Originally Posted by steven View Post
If the protocol is not strand-specific, each read can come from any strand, including the genes from the minus strand. You do not want to remove those from the analysis. Reads mapped by BWA on the plus strand will overlap the genes from the minus strand anyway if you do not take the strand info into account. not sure i am clear though..
Hello, now my data is strand specific paired end RNA-seq,could you tell me how can I take the strand info into account when I do mapping with BWA? Is there some parameter in BWA that deal with the strand information? I am looking forward to your reply. Thanks !
Best wishes!
qqtwee is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 05:46 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO