Unconfigured Ad

**adamdeluca** · 08-25-2010, 12:16 PM

Refseq ids get assigned to each isoform so there is no distinguishing between different splice variants and different genes. You might be better off using ensembl gene id's and transcript id's if you care about the distinction. The ensGene table in UCSC contains these data.

**yuanzhi** · 08-25-2010, 12:25 PM

Thanks for your response.

I was told that ensemble gene coordinates are slightly different from UCSC's. Currently all my work are based on UCSC's refseq. I am not sure if ensGene is compatible with my current data.

How about if I manually remove all the redundant coordinates from refseq exons? I mean, I will just keep those exons with different start/end coordinates. Will this be ok?

Thanks

**adamdeluca** · 08-25-2010, 12:34 PM

Google Code Archive - Long-term storage for Google Code Project Hosting.

http://code.google.com/p/bedtools/

bedtools is really useful for those types of manipulations, you can use mergeBed to get a non-redundant bed file.

Example:
chr1 1 5
chr1 3 10
would become:
chr1 1 10

**steven** · 08-26-2010, 01:07 AM

What you want to do?
It is true that the "gene_id" field from UCSC exons is actually a transcript ID, but the same happens if you download the ensembl annotation from UCSC. As adamdeluca said, you need an additional table to map the IDs (there is an equivalent of ensGene for RefSeq of course). I would say that if you start with refseq, sticking on refseq is indeed the best option.
As for the next step, it depends on what you want to do. Keep in mind that merging exons leads to artificial exons (like the one in the example above) that do not exist in vivo.
If you just want to remove redundancy and keep distinct exons, a simple command like
awk '{print $1,$2,$3,$6}' your_exon_file.bed | sort -u
or
awk '{OFS="\t";$4=".";print $0}' your_exon_file.bed | sort -u
should work fine.
s.

**yuanzhi** · 08-26-2010, 06:01 AM

steven, thank you for your response first.

I want to know how much percentage of all refseq exon bases in one chromosome is covered by my sequencing data. The formula = the number of refseq exon bases covered by my sequencing/the total number of all refseq exon bases in that chromosome.

In this case, it seems like I should merge exons, right?

when you said "If you just want to remove redundancy and keep distinct exons", how do you define "distinct exons"?

if in this case
chr1 1 5
chr1 3 10

should I consider them as distinct exons or unite them to chr1 1 10?

Thanks

**steven** · 08-26-2010, 06:27 AM

OK i see. Two analyzes could be considered, each of them gives you an indication about the transcriptome coverage.
A) Proportion of (distinct) exons that overlap a read.
B) Proportion of the exonic bases that are covered by a read.
("a read" = "at least one read")

A) Exon-level.
1. Extract distinct exons as indicated in my previous post. Distinct exons are defined as not identical, therefore with different {chrom,start,end,strand} values just ignore the strand if your RNA-seq protocol is not strand-specific. In the previous example the 2 exons are distinct.
2. Compare them with your reads and define how many of them share at least N bases with a read (usually N=1).

B) Base-level.
1. Merge your exons (do not take the strand into account) to get the transcribed regions of the genome. In the example you would get {chr1 1 10} indeed.
2. Compute the proportion among these bases of the ones that are included in the reads.

For example, with a read like
chr1 8 10
and your 2 exons from above,
A=50%
B=30% (if 1-based system like gff, otherwise 22.2% in 0-based bed system).

A is easier, B is more precise.
You can use BEDtools or Galaxy do do all that.
s.

**adamdeluca** · 08-26-2010, 07:38 AM

B is really easy with bedtools. To get the number of covered exonic bases:

Code:

mergeBed -i exons.bed | intersectBed -a reads.bed -b stdin | mergeBed -i stdin | awk '{SUM += $3-$2} END {print SUM}'

**yuanzhi** · 08-26-2010, 07:59 AM

Steven

B) Base-level.
1. Merge your exons (do not take the strand into account) to get the transcribed regions of the genome. In the example you would get {chr1 1 10} indeed.
2. Compute the proportion among these bases of the ones that are included in the reads.

why not take the strand into account? I use BWA for mapping and was told BWA maps reads only to the "+" strand. If it is correct, maybe I should not merge exons which are on the "-" strand?

Thanks

-y

**steven** · 08-27-2010, 05:35 AM

Originally posted by yuanzhi View Post

Steven

why not take the strand into account? I use BWA for mapping and was told BWA maps reads only to the "+" strand. If it is correct, maybe I should not merge exons which are on the "-" strand?

Thanks

-y

If the protocol is not strand-specific, each read can come from any strand, including the genes from the minus strand. You do not want to remove those from the analysis. Reads mapped by BWA on the plus strand will overlap the genes from the minus strand anyway if you do not take the strand info into account. not sure i am clear though..

**sneha** · 05-22-2011, 08:01 AM

.BED files

Hi all,

I am trying to get the .BED files of all the tracks from USCS browser. Is there any way that I can proceed with this? Thanks

**qqtwee** · 05-09-2012, 05:36 AM

how can I take the strand info into account

Originally posted by steven View Post

If the protocol is not strand-specific, each read can come from any strand, including the genes from the minus strand. You do not want to remove those from the analysis. Reads mapped by BWA on the plus strand will overlap the genes from the minus strand anyway if you do not take the strand info into account. not sure i am clear though..

Hello, now my data is strand specific paired end RNA-seq,could you tell me how can I take the strand info into account when I do mapping with BWA? Is there some parameter in BWA that deal with the strand information? I am looking forward to your reply. Thanks !
Best wishes!

Topics	Statistics	Last Post
Sequencing the Two-Toed Sloth Genome Reveals Jumping Genes Tied to Its Extreme Metabolism by SEQadmin2 Started by SEQadmin2, Today, 11:58 AM	0 responses 9 views 0 reactions	Last Post by SEQadmin2 Today, 11:58 AM
A New Method Makes Hantavirus Genome Analysis Faster and More Accessible by SEQadmin2 Started by SEQadmin2, 06-05-2026, 10:09 AM	0 responses 25 views 0 reactions	Last Post by SEQadmin2 06-05-2026, 10:09 AM
A New Single-Cell Method Maps DNA-Protein Interactions by SEQadmin2 Started by SEQadmin2, 06-04-2026, 08:59 AM	0 responses 34 views 0 reactions	Last Post by SEQadmin2 06-04-2026, 08:59 AM
Long-Read RNA Sequencing Uncovers a Hidden Layer of Immune Cell Regulation by SEQadmin2 Started by SEQadmin2, 06-02-2026, 12:03 PM	0 responses 56 views 0 reactions	Last Post by SEQadmin2 06-02-2026, 12:03 PM

Unconfigured Ad

Questions about exon bed files downloaded from UCSC

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News