  • Gene level ensembl bed file?

    Hey all,
    I'm trying to run Scripture's score task and look at human gene level expression. Does anyone know where I can find a bed6 format file with Ensembl Gene annotation (ENSG) instead of Ensembl transcript annotation (ENST)?
    I realize I could use the counts from the transcript level output to get gene level output, but then I lose the fwer p-values that scripture outputs.

  • #2
    Do you mean that you wish to merge overlapping exons for Ensembl genes that have multiple isoforms? So that you have a single BED line for each gene?



    • #3
      Yes, I am looking for (or working on creating) a BED6 file in which each Ensembl gene occupies one line.

      I'm not so concerned with isoforms. Most Ensembl genes have multiple Ensembl transcripts that fall within the location of the gene. The only Ensembl human genome BED6 files I have found contain Ensembl transcript (ENST) IDs rather than Ensembl gene (ENSG) IDs. I do have a BED12 file at the gene level instead of the transcript level, but it is taking me some time to write a script to convert it. That is why I am looking for a human ENSG BED6 file.

      Does that answer your question?
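For the BED12-to-BED6 step mentioned above, a minimal sketch follows. It relies on the fact that the first six BED12 columns (chrom, start, end, name, score, strand) are exactly the BED6 columns, so the conversion is just a column cut. It assumes the name column of the gene-level BED12 file already holds the ENSG ID; the function name `bed12_to_bed6` and the example line (`GENE_A` and its coordinates) are made up for illustration.

```python
def bed12_to_bed6(lines):
    """Yield BED6 lines from an iterable of BED12 (or wider) lines."""
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines and optional track/browser/comment header lines.
        if not line or line.startswith(("track", "browser", "#")):
            continue
        fields = line.split("\t")
        if len(fields) < 6:
            raise ValueError(f"expected at least 6 columns, got {len(fields)}: {line!r}")
        # Columns 1-6 of BED12 are chrom, start, end, name, score, strand.
        yield "\t".join(fields[:6])


# Hypothetical gene-level BED12 line (not real annotation).
example = [
    "chr1\t1000\t5000\tGENE_A\t0\t+\t1000\t5000\t0\t2\t100,200\t0,3800\n",
]
for bed6_line in bed12_to_bed6(example):
    print(bed6_line)
```

Dropping columns 7-12 discards the block (exon) structure, which is fine if only the gene span is needed.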



      • #4
        Use the UCSC Table Browser. Instead of selecting BED as the output format, choose "selected fields from table" and pick the 6 fields you need. I'm not sure whether the fields will come out in the order you prefer. If not, that would be easy to fix if you do it through Galaxy.



        • #5
          This was my initial thought as well. However, when I tried it in Galaxy, the import just sat there forever with the message 'waiting to run'. Perhaps this was just a temporary problem and Galaxy will do the trick...

          When I tried it directly in the UCSC table browser it gave me a BED file with one line per transcript, even though I selected the Gene table and specified that the BED be created with 'one line per whole gene'.

          Of course, the info you would need to create a gene-level BED file is in this file. I was also able to get the necessary info from Ensembl BioMart, but again I didn't see an obvious way to output directly to BED6 format.

          Since rkusko already has the required info in BED12 format, neither the UCSC nor the Ensembl option seems more convenient.

          Where did the BED12 version come from? Can you post it, or a sample of it in case someone has a ready-made converter to try...
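If someone does go the BioMart route, a rough sketch of turning a tab-separated gene table into BED6 is below. The column names (`gene_id`, `chromosome`, `start`, `end`, `strand`) are hypothetical and would need to match the header of the actual export. The one non-obvious step is the coordinate shift: Ensembl coordinates are 1-based and inclusive, while BED starts are 0-based, so the start is reduced by one and the end is left alone. Ensembl reports strand as 1 or -1.

```python
import csv
import io


def biomart_to_bed6(handle):
    """Yield BED6 lines from a tab-separated gene table with a header row.

    Column names are assumptions; adjust them to the real export.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        chrom = row["chromosome"]
        # UCSC-style names carry a "chr" prefix; Ensembl names usually do not.
        # (The mitochondrial chromosome is named differently and needs extra care.)
        if not chrom.startswith("chr"):
            chrom = "chr" + chrom
        start = int(row["start"]) - 1  # 1-based inclusive -> 0-based start
        end = int(row["end"])          # end is unchanged
        strand = "+" if row["strand"] in ("1", "+") else "-"
        yield "\t".join([chrom, str(start), str(end), row["gene_id"], "0", strand])


# Hypothetical one-gene table (not real annotation).
tsv = "gene_id\tchromosome\tstart\tend\tstrand\nGENE_A\t21\t1001\t5000\t-1\n"
bed_lines = list(biomart_to_bed6(io.StringIO(tsv)))
print(bed_lines[0])
```

The score column is filled with 0 because the table carries no score.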



          • #6
            Update:

            My Galaxy task did finally complete. I used Galaxy to import the Ensembl gene annotations from UCSC and output them as BED (which can be done entirely at UCSC just as easily). Unfortunately, this produces one line per transcript, not one line per gene. In retrospect, this is not surprising given that UCSC's concept of a 'gene' is basically a transcript.

            Anyway, why not start with the transcript-level file and use BEDTools to merge overlapping features on the same strand? You should be able to do this using the 'mergeBed' function with the '-s' option to force strandedness.

            One potential problem I see with this approach is that in rare cases there may be multiple genes on the same strand with some overlap... So you might accidentally merge these into a single gene... mergeBed allows you to report the names of the things that were merged, so you could use this option and then explicitly look for cases where transcripts from different genes were merged.
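To make the check described above concrete, here is a small pure-Python sketch of a strand-aware merge that keeps track of which gene IDs went into each merged interval. It is not mergeBed itself, only an imitation of the idea; the function name `merge_stranded`, the record layout, and the example genes are all made up. In this sketch, intervals that merely touch are treated as overlapping.

```python
from collections import defaultdict


def merge_stranded(records):
    """Merge overlapping intervals per (chrom, strand), collecting gene IDs.

    records: iterable of (chrom, start, end, gene_id, strand).
    Returns a list of (chrom, start, end, gene_ids, strand), where gene_ids
    is a sorted tuple of every gene that contributed to the merged interval.
    """
    by_key = defaultdict(list)
    for chrom, start, end, gene_id, strand in records:
        by_key[(chrom, strand)].append((start, end, gene_id))

    merged = []
    for (chrom, strand), intervals in sorted(by_key.items()):
        intervals.sort()
        cur_start, cur_end, genes = intervals[0][0], intervals[0][1], {intervals[0][2]}
        for start, end, gene_id in intervals[1:]:
            if start <= cur_end:
                # Overlaps (or touches) the current cluster: extend it.
                cur_end = max(cur_end, end)
                genes.add(gene_id)
            else:
                merged.append((chrom, cur_start, cur_end, tuple(sorted(genes)), strand))
                cur_start, cur_end, genes = start, end, {gene_id}
        merged.append((chrom, cur_start, cur_end, tuple(sorted(genes)), strand))
    return merged


# Hypothetical transcripts: GENE_A and GENE_B overlap on the + strand.
records = [
    ("chr1", 100, 500, "GENE_A", "+"),
    ("chr1", 150, 600, "GENE_A", "+"),
    ("chr1", 550, 900, "GENE_B", "+"),
    ("chr1", 200, 400, "GENE_C", "-"),
    ("chr1", 2000, 3000, "GENE_D", "+"),
]
merged = merge_stranded(records)
# Merged intervals that contain transcripts from more than one gene.
suspicious = [m for m in merged if len(m[3]) > 1]
print(suspicious)
```

Anything that ends up in `suspicious` is a case where a plain stranded merge would have fused two genes into one line.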



            • #7
              Originally posted by malachig
              One potential problem I see with this approach is that in rare cases there may be multiple genes, on the same strand with some overlap... So you might accidentally merge these into a single gene
              A hack, but... BEDTools' mergeBed just treats the chromosome as a string. Concatenate the ENSG ID and the chr# and it will merge the way you want.

              Code:
              ENSG00000166157_chr21	9928080	10012775
              ENSG00000166157_chr21	9928080	10012775
              ENSG00000166157_chr21	9928080	9993593
              ENSG00000166157_chr21	9928080	10012775
              ENSG00000166157_chr21	9928611	10012753
              becomes
              Code:
              ENSG00000166157_chr21	9928080	10012775
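The same trick can be sketched in plain Python, for anyone who wants to see why it works: because the first column is only ever compared as a string, intervals are merged within each "geneID_chr" key and never across keys. The functions `merge_by_name` and `split_key` are illustrative names, not part of BEDTools, and the sketch treats touching intervals as overlapping. The input rows are the five lines shown above.

```python
from collections import defaultdict


def merge_by_name(rows):
    """Merge overlapping (name, start, end) rows; name is an opaque string key."""
    by_name = defaultdict(list)
    for name, start, end in rows:
        by_name[name].append((start, end))

    merged = []
    for name in sorted(by_name):
        intervals = sorted(by_name[name])
        cur_start, cur_end = intervals[0]
        for start, end in intervals[1:]:
            if start <= cur_end:
                cur_end = max(cur_end, end)
            else:
                merged.append((name, cur_start, cur_end))
                cur_start, cur_end = start, end
        merged.append((name, cur_start, cur_end))
    return merged


def split_key(row):
    """Undo the concatenation: ('ENSG..._chr21', s, e) -> (chrom, s, e, gene_id)."""
    key, start, end = row
    # Split on the first underscore only, since chromosome names may contain underscores.
    gene_id, chrom = key.split("_", 1)
    return (chrom, start, end, gene_id)


rows = [
    ("ENSG00000166157_chr21", 9928080, 10012775),
    ("ENSG00000166157_chr21", 9928080, 10012775),
    ("ENSG00000166157_chr21", 9928080, 9993593),
    ("ENSG00000166157_chr21", 9928080, 10012775),
    ("ENSG00000166157_chr21", 9928611, 10012753),
]
merged = merge_by_name(rows)
print(merged)
print([split_key(r) for r in merged])
```

One caveat: a gene whose transcripts do not all overlap would still come out as more than one line per key, so the output may need a final min-start/max-end pass per gene.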



              • #8
                The Bedtools are interesting. Thanks :-)
