Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Questions about exon bed files downloaded from UCSC

    Dear All

    I just downloaded a refseq exon table (BED format) from UCSC table browser. I notice that the exon coordinates are not unique in the table. For example, in the following case, the exon start and end coordinates are all the same, but the gene IDs are different. Does it mean different genes or different isoforms?

    Code:
    chr1    100100400       100100567       NM_000642_exon_3_0_chr1_100100401_f     0       +
    chr1    100100400       100100567       NM_000644_exon_3_0_chr1_100100401_f     0       +
    chr1    100100400       100100567       NM_000643_exon_3_0_chr1_100100401_f     0       +
    chr1    100100400       100100567       NM_000028_exon_3_0_chr1_100100401_f     0       +
    chr1    100100400       100100567       NM_000646_exon_3_0_chr1_100100401_f     0       +
    chr1    100100400       100100567       NM_000645_exon_1_0_chr1_100100401_f     0       +
    Also, in this following case,

    Code:
    chr1    100088632       100088762       NM_000644_exon_0_0_chr1_100088633_f     0       +
    chr1    100088632       100088762       NM_000643_exon_0_0_chr1_100088633_f     0       +
    chr1    100088632       100089042       NM_000028_exon_0_0_chr1_100088633_f     0       +
    the exon end coordinate for the last one is different from the other two. The gene IDs also differ. What does this mean?

    All your help will be appreciated.

    -YZ

  • #2
    Refseq ids get assigned to each isoform so there is no distinguishing between different splice variants and different genes. You might be better off using ensembl gene id's and transcript id's if you care about the distinction. The ensGene table in UCSC contains these data.

    Comment


    • #3
      Thanks for your response.

      I was told that ensemble gene coordinates are slightly different from UCSC's. Currently all my work are based on UCSC's refseq. I am not sure if ensGene is compatible with my current data.

      How about if I manually remove all the redundant coordinates from refseq exons? I mean, I will just keep those exons with different start/end coordinates. Will this be ok?

      Thanks

      Comment


      • #4


        bedtools is really useful for those types of manipulations, you can use mergeBed to get a non-redundant bed file.

        Example:
        chr1 1 5
        chr1 3 10
        would become:
        chr1 1 10

        Comment


        • #5
          What you want to do?
          It is true that the "gene_id" field from UCSC exons is actually a transcript ID, but the same happens if you download the ensembl annotation from UCSC. As adamdeluca said, you need an additional table to map the IDs (there is an equivalent of ensGene for RefSeq of course). I would say that if you start with refseq, sticking on refseq is indeed the best option.
          As for the next step, it depends on what you want to do. Keep in mind that merging exons leads to artificial exons (like the one in the example above) that do not exist in vivo.
          If you just want to remove redundancy and keep distinct exons, a simple command like
          awk '{print $1,$2,$3,$6}' your_exon_file.bed | sort -u
          or
          awk '{OFS="\t";$4=".";print $0}' your_exon_file.bed | sort -u
          should work fine.
          s.

          Comment


          • #6
            steven, thank you for your response first.

            I want to know how much percentage of all refseq exon bases in one chromosome is covered by my sequencing data. The formula = the number of refseq exon bases covered by my sequencing/the total number of all refseq exon bases in that chromosome.

            In this case, it seems like I should merge exons, right?

            when you said "If you just want to remove redundancy and keep distinct exons", how do you define "distinct exons"?

            if in this case
            chr1 1 5
            chr1 3 10

            should I consider them as distinct exons or unite them to chr1 1 10?

            Thanks

            Comment


            • #7
              OK i see. Two analyzes could be considered, each of them gives you an indication about the transcriptome coverage.
              A) Proportion of (distinct) exons that overlap a read.
              B) Proportion of the exonic bases that are covered by a read.
              ("a read" = "at least one read")

              A) Exon-level.
              1. Extract distinct exons as indicated in my previous post. Distinct exons are defined as not identical, therefore with different {chrom,start,end,strand} values just ignore the strand if your RNA-seq protocol is not strand-specific. In the previous example the 2 exons are distinct.
              2. Compare them with your reads and define how many of them share at least N bases with a read (usually N=1).

              B) Base-level.
              1. Merge your exons (do not take the strand into account) to get the transcribed regions of the genome. In the example you would get {chr1 1 10} indeed.
              2. Compute the proportion among these bases of the ones that are included in the reads.

              For example, with a read like
              chr1 8 10
              and your 2 exons from above,
              A=50%
              B=30% (if 1-based system like gff, otherwise 22.2% in 0-based bed system).

              A is easier, B is more precise.
              You can use BEDtools or Galaxy do do all that.
              s.

              Comment


              • #8
                B is really easy with bedtools. To get the number of covered exonic bases:
                Code:
                mergeBed -i exons.bed | intersectBed -a reads.bed -b stdin | mergeBed -i stdin | awk '{SUM += $3-$2} END {print SUM}'

                Comment


                • #9
                  Steven

                  B) Base-level.
                  1. Merge your exons (do not take the strand into account) to get the transcribed regions of the genome. In the example you would get {chr1 1 10} indeed.
                  2. Compute the proportion among these bases of the ones that are included in the reads.
                  why not take the strand into account? I use BWA for mapping and was told BWA maps reads only to the "+" strand. If it is correct, maybe I should not merge exons which are on the "-" strand?

                  Thanks

                  -y
                  Last edited by yuanzhi; 08-26-2010, 11:52 AM.

                  Comment


                  • #10
                    Originally posted by yuanzhi View Post
                    Steven



                    why not take the strand into account? I use BWA for mapping and was told BWA maps reads only to the "+" strand. If it is correct, maybe I should not merge exons which are on the "-" strand?

                    Thanks

                    -y
                    If the protocol is not strand-specific, each read can come from any strand, including the genes from the minus strand. You do not want to remove those from the analysis. Reads mapped by BWA on the plus strand will overlap the genes from the minus strand anyway if you do not take the strand info into account. not sure i am clear though..

                    Comment


                    • #11
                      .BED files

                      Hi all,

                      I am trying to get the .BED files of all the tracks from USCS browser. Is there any way that I can proceed with this? Thanks

                      Comment


                      • #12
                        how can I take the strand info into account

                        Originally posted by steven View Post
                        If the protocol is not strand-specific, each read can come from any strand, including the genes from the minus strand. You do not want to remove those from the analysis. Reads mapped by BWA on the plus strand will overlap the genes from the minus strand anyway if you do not take the strand info into account. not sure i am clear though..
                        Hello, now my data is strand specific paired end RNA-seq,could you tell me how can I take the strand info into account when I do mapping with BWA? Is there some parameter in BWA that deal with the strand information? I am looking forward to your reply. Thanks !
                        Best wishes!

                        Comment

                        Latest Articles

                        Collapse

                        • seqadmin
                          Essential Discoveries and Tools in Epitranscriptomics
                          by seqadmin




                          The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
                          04-22-2024, 07:01 AM
                        • seqadmin
                          Current Approaches to Protein Sequencing
                          by seqadmin


                          Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                          04-04-2024, 04:25 PM

                        ad_right_rmr

                        Collapse

                        News

                        Collapse

                        Topics Statistics Last Post
                        Started by seqadmin, Today, 08:47 AM
                        0 responses
                        10 views
                        0 likes
                        Last Post seqadmin  
                        Started by seqadmin, 04-11-2024, 12:08 PM
                        0 responses
                        60 views
                        0 likes
                        Last Post seqadmin  
                        Started by seqadmin, 04-10-2024, 10:19 PM
                        0 responses
                        57 views
                        0 likes
                        Last Post seqadmin  
                        Started by seqadmin, 04-10-2024, 09:21 AM
                        0 responses
                        53 views
                        0 likes
                        Last Post seqadmin  
                        Working...
                        X