Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • yuanzhi
    Member
    • Aug 2010
    • 19

    Questions about exon bed files downloaded from UCSC

    Dear All

    I just downloaded a refseq exon table (BED format) from UCSC table browser. I notice that the exon coordinates are not unique in the table. For example, in the following case, the exon start and end coordinates are all the same, but the gene IDs are different. Does it mean different genes or different isoforms?

    Code:
    chr1    100100400       100100567       NM_000642_exon_3_0_chr1_100100401_f     0       +
    chr1    100100400       100100567       NM_000644_exon_3_0_chr1_100100401_f     0       +
    chr1    100100400       100100567       NM_000643_exon_3_0_chr1_100100401_f     0       +
    chr1    100100400       100100567       NM_000028_exon_3_0_chr1_100100401_f     0       +
    chr1    100100400       100100567       NM_000646_exon_3_0_chr1_100100401_f     0       +
    chr1    100100400       100100567       NM_000645_exon_1_0_chr1_100100401_f     0       +
    Also, in this following case,

    Code:
    chr1    100088632       100088762       NM_000644_exon_0_0_chr1_100088633_f     0       +
    chr1    100088632       100088762       NM_000643_exon_0_0_chr1_100088633_f     0       +
    chr1    100088632       100089042       NM_000028_exon_0_0_chr1_100088633_f     0       +
    the exon end coordinate for the last one is different from the other two. The gene IDs also differ. What does this mean?

    All your help will be appreciated.

    -YZ
  • adamdeluca
    Member
    • Jul 2010
    • 95

    #2
    Refseq ids get assigned to each isoform so there is no distinguishing between different splice variants and different genes. You might be better off using ensembl gene id's and transcript id's if you care about the distinction. The ensGene table in UCSC contains these data.

    Comment

    • yuanzhi
      Member
      • Aug 2010
      • 19

      #3
      Thanks for your response.

      I was told that ensemble gene coordinates are slightly different from UCSC's. Currently all my work are based on UCSC's refseq. I am not sure if ensGene is compatible with my current data.

      How about if I manually remove all the redundant coordinates from refseq exons? I mean, I will just keep those exons with different start/end coordinates. Will this be ok?

      Thanks

      Comment

      • adamdeluca
        Member
        • Jul 2010
        • 95

        #4


        bedtools is really useful for those types of manipulations, you can use mergeBed to get a non-redundant bed file.

        Example:
        chr1 1 5
        chr1 3 10
        would become:
        chr1 1 10

        Comment

        • steven
          Senior Member
          • Aug 2009
          • 269

          #5
          What you want to do?
          It is true that the "gene_id" field from UCSC exons is actually a transcript ID, but the same happens if you download the ensembl annotation from UCSC. As adamdeluca said, you need an additional table to map the IDs (there is an equivalent of ensGene for RefSeq of course). I would say that if you start with refseq, sticking on refseq is indeed the best option.
          As for the next step, it depends on what you want to do. Keep in mind that merging exons leads to artificial exons (like the one in the example above) that do not exist in vivo.
          If you just want to remove redundancy and keep distinct exons, a simple command like
          awk '{print $1,$2,$3,$6}' your_exon_file.bed | sort -u
          or
          awk '{OFS="\t";$4=".";print $0}' your_exon_file.bed | sort -u
          should work fine.
          s.

          Comment

          • yuanzhi
            Member
            • Aug 2010
            • 19

            #6
            steven, thank you for your response first.

            I want to know how much percentage of all refseq exon bases in one chromosome is covered by my sequencing data. The formula = the number of refseq exon bases covered by my sequencing/the total number of all refseq exon bases in that chromosome.

            In this case, it seems like I should merge exons, right?

            when you said "If you just want to remove redundancy and keep distinct exons", how do you define "distinct exons"?

            if in this case
            chr1 1 5
            chr1 3 10

            should I consider them as distinct exons or unite them to chr1 1 10?

            Thanks

            Comment

            • steven
              Senior Member
              • Aug 2009
              • 269

              #7
              OK i see. Two analyzes could be considered, each of them gives you an indication about the transcriptome coverage.
              A) Proportion of (distinct) exons that overlap a read.
              B) Proportion of the exonic bases that are covered by a read.
              ("a read" = "at least one read")

              A) Exon-level.
              1. Extract distinct exons as indicated in my previous post. Distinct exons are defined as not identical, therefore with different {chrom,start,end,strand} values just ignore the strand if your RNA-seq protocol is not strand-specific. In the previous example the 2 exons are distinct.
              2. Compare them with your reads and define how many of them share at least N bases with a read (usually N=1).

              B) Base-level.
              1. Merge your exons (do not take the strand into account) to get the transcribed regions of the genome. In the example you would get {chr1 1 10} indeed.
              2. Compute the proportion among these bases of the ones that are included in the reads.

              For example, with a read like
              chr1 8 10
              and your 2 exons from above,
              A=50%
              B=30% (if 1-based system like gff, otherwise 22.2% in 0-based bed system).

              A is easier, B is more precise.
              You can use BEDtools or Galaxy do do all that.
              s.

              Comment

              • adamdeluca
                Member
                • Jul 2010
                • 95

                #8
                B is really easy with bedtools. To get the number of covered exonic bases:
                Code:
                mergeBed -i exons.bed | intersectBed -a reads.bed -b stdin | mergeBed -i stdin | awk '{SUM += $3-$2} END {print SUM}'

                Comment

                • yuanzhi
                  Member
                  • Aug 2010
                  • 19

                  #9
                  Steven

                  B) Base-level.
                  1. Merge your exons (do not take the strand into account) to get the transcribed regions of the genome. In the example you would get {chr1 1 10} indeed.
                  2. Compute the proportion among these bases of the ones that are included in the reads.
                  why not take the strand into account? I use BWA for mapping and was told BWA maps reads only to the "+" strand. If it is correct, maybe I should not merge exons which are on the "-" strand?

                  Thanks

                  -y
                  Last edited by yuanzhi; 08-26-2010, 11:52 AM.

                  Comment

                  • steven
                    Senior Member
                    • Aug 2009
                    • 269

                    #10
                    Originally posted by yuanzhi View Post
                    Steven



                    why not take the strand into account? I use BWA for mapping and was told BWA maps reads only to the "+" strand. If it is correct, maybe I should not merge exons which are on the "-" strand?

                    Thanks

                    -y
                    If the protocol is not strand-specific, each read can come from any strand, including the genes from the minus strand. You do not want to remove those from the analysis. Reads mapped by BWA on the plus strand will overlap the genes from the minus strand anyway if you do not take the strand info into account. not sure i am clear though..

                    Comment

                    • sneha
                      Junior Member
                      • Feb 2011
                      • 3

                      #11
                      .BED files

                      Hi all,

                      I am trying to get the .BED files of all the tracks from USCS browser. Is there any way that I can proceed with this? Thanks

                      Comment

                      • qqtwee
                        Member
                        • Feb 2011
                        • 16

                        #12
                        how can I take the strand info into account

                        Originally posted by steven View Post
                        If the protocol is not strand-specific, each read can come from any strand, including the genes from the minus strand. You do not want to remove those from the analysis. Reads mapped by BWA on the plus strand will overlap the genes from the minus strand anyway if you do not take the strand info into account. not sure i am clear though..
                        Hello, now my data is strand specific paired end RNA-seq,could you tell me how can I take the strand info into account when I do mapping with BWA? Is there some parameter in BWA that deal with the strand information? I am looking forward to your reply. Thanks !
                        Best wishes!

                        Comment

                        Latest Articles

                        Collapse

                        ad_right_rmr

                        Collapse

                        News

                        Collapse

                        Topics Statistics Last Post
                        Started by SEQadmin2, Today, 11:58 AM
                        0 responses
                        9 views
                        0 reactions
                        Last Post SEQadmin2  
                        Started by SEQadmin2, 06-05-2026, 10:09 AM
                        0 responses
                        25 views
                        0 reactions
                        Last Post SEQadmin2  
                        Started by SEQadmin2, 06-04-2026, 08:59 AM
                        0 responses
                        34 views
                        0 reactions
                        Last Post SEQadmin2  
                        Started by SEQadmin2, 06-02-2026, 12:03 PM
                        0 responses
                        56 views
                        0 reactions
                        Last Post SEQadmin2  
                        Working...