  • #16
    It seems that instead of the "gene start" and "gene end" columns, I should be looking for the "5' UTR Start" and "3' UTR End" columns. However, these columns appear to be unavailable: when I select them in the Biomart web query, the output file contains empty values for "5' UTR Start" and "3' UTR End".

    Any suggestions?

    Comment


    • #17
      Can you give us an actual example?

      Comment


      • #18
        For example, have a look at S. cerevisiae, gene YAL001C:
        Biomart gives its boundaries as 147594-151166,
        whereas the experimentally confirmed TSS-TTS for this gene are 147531-151187.

        (It's on the "-" strand, so the TTS comes first, then the TSS. Either way, they differ from Biomart's "gene start" and "gene end" by several dozen bp. It's like this with almost all genes that I tested. I could not compare with Biomart's 5' UTR and 3' UTR, because Biomart returns empty values for these columns.)
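The size of the mismatch can be worked out directly from the four coordinates quoted above. This is only a minimal sketch: the function name `flank_lengths` is made up for illustration, and it assumes the Biomart interval lies entirely inside the measured transcript and that coordinates are 1-based on the forward strand.

```python
def flank_lengths(annotated, transcript, strand):
    """Return (five_prime_bp, three_prime_bp) by which the measured
    transcript extends beyond the annotated interval.

    Both arguments are (start, end) pairs with start <= end on the
    forward strand. Assumes the annotated interval is nested inside
    the transcript.
    """
    left = annotated[0] - transcript[0]    # extension at the low-coordinate side
    right = transcript[1] - annotated[1]   # extension at the high-coordinate side
    if left < 0 or right < 0:
        raise ValueError("annotated interval is not inside the transcript")
    # On the "-" strand the 5' end sits at the high coordinate.
    return (right, left) if strand == "-" else (left, right)


# Coordinates quoted in the post for YAL001C ("-" strand).
biomart_gene = (147594, 151166)   # Biomart "gene start" / "gene end"
measured = (147531, 151187)       # experimentally confirmed TTS..TSS

five, three = flank_lengths(biomart_gene, measured, "-")
print(f"5' flank: {five} bp, 3' flank: {three} bp")
# 5' flank: 21 bp, 3' flank: 63 bp
```

Whether these flanks correspond exactly to the 5' and 3' UTRs depends on the Biomart interval being the ORF, which is not confirmed here.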
        Last edited by rebrendi; 03-03-2012, 03:16 AM.

        Comment


        • #19
          The Biomart boundaries coincide with those from Ensembl, which is to be expected. Ensembl mentions that for S. cerevisiae, it just imports data from the Saccharomyces Genome Database (SGD). If you go to the SGD website, you also get the coordinates that you found from Biomart, but then you'll notice that you can instead search for YAL001C_5UTR and YAL001C_3UTR as a landmark. Those seem to give (more or less) the coordinates that you listed. That suggests that this is due to a quirk of how the SGD structures its data.

          If this is correct, then you might want to just parse whatever SGD has available. It'd then be good if someone notified Ensembl. I haven't seen this sort of thing happen with human or mouse data.

          Comment


          • #20
            dpryan is right: the S. cerevisiae data is an import, so this might be an oddity of how SGD stores the data rather than of Ensembl.

            Comment


            • #21
              dpryan,
              How do you know that this does not happen with mouse or human? Could you please tell me exactly how you download the 5' UTR coordinates from Biomart? Or do you mean that for mouse/human, but not for yeast, "gene start" and "gene end" correspond to the outer boundaries of the 5' UTR and 3' UTR?

              Comment


              • #22
                For Ensembl-annotated species, gene start is the 5'-most coordinate and gene end is the 3'-most coordinate. For many species this means the UTR boundaries.

                Comment


                • #23
                  Hmm, it would be nice to know whether this is so for mouse and human, actually...

                  Comment


                  • #24
                    You can trivially see the differences in the Ensembl mouse/human and S. cerevisiae genomes by looking at the genome browser. The Ensembl mouse/human genomes have obvious UTRs, but that's not the case for S. cerevisiae. You can also see this in the mouse and human GTF files, which I assume is the source of the Biomart information (I don't actually use it myself).
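For anyone who would rather work from the GTF files than from Biomart, here is a rough sketch of pulling a strand-aware TSS per transcript. It assumes the standard nine tab-separated GTF columns and `exon` rows carrying a `transcript_id` attribute. The function name `transcript_tss` and the demo records are hypothetical and not taken from any real annotation file.

```python
import re

_TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def transcript_tss(gtf_lines):
    """Map transcript_id -> (seqname, tss, strand) using exon rows only.

    The TSS is taken as the lowest exon start on the "+" strand and
    the highest exon end on the "-" strand.
    """
    spans = {}
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue
        match = _TRANSCRIPT_ID.search(fields[8])
        if match is None:
            continue
        tid = match.group(1)
        start, end, strand = int(fields[3]), int(fields[4]), fields[6]
        if tid in spans:
            seqname, lo, hi, _ = spans[tid]
            spans[tid] = (seqname, min(lo, start), max(hi, end), strand)
        else:
            spans[tid] = (fields[0], start, end, strand)
    return {
        tid: (seqname, hi if strand == "-" else lo, strand)
        for tid, (seqname, lo, hi, strand) in spans.items()
    }


# Hypothetical records for illustration only.
DEMO_GTF = [
    'chrDemo\tdemo\texon\t100\t200\t.\t+\t.\tgene_id "g1"; transcript_id "t1";',
    'chrDemo\tdemo\texon\t300\t450\t.\t+\t.\tgene_id "g1"; transcript_id "t1";',
    'chrDemo\tdemo\texon\t900\t1000\t.\t-\t.\tgene_id "g2"; transcript_id "t2";',
    'chrDemo\tdemo\texon\t700\t800\t.\t-\t.\tgene_id "g2"; transcript_id "t2";',
]

for tid, (seqname, tss, strand) in sorted(transcript_tss(DEMO_GTF).items()):
    print(tid, seqname, tss, strand)
```

A coordinate obtained this way is only as good as the exon annotation: if a file lists coding exons without UTRs, the "TSS" is really the start of the ORF.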

                    Comment


                    • #25
                      Human and mouse are Ensembl-annotated species, so yes, this is true for human and mouse. Go and look.

                      Comment


                      • #26
                        Ok, great, thank you guys!

                        Comment


                        • #27
                          There is still something that I do not understand, though:

                          For example, let's take the Ensembl mouse annotation: around 95,000 entries.

                          Now, let's look at some other database that contains all known mouse UTRs, e.g. http://utrdb.ba.itb.cnr.it/home/statistics
                          It has only around 25,000 entries for mouse.

                          It seems that it is technically more difficult to determine the TSS position than to just define the ORF.

                          Now, can someone explain which values are substituted in the Biomart output file for mouse, which contains ~95,000 entries, if only ~25,000 genes have been experimentally characterized in terms of their TSS?

                          How would I tell which "gene start" is the real gene start, and which "gene start" is just the start of the ORF?

                          Comment


                          • #28
                            Originally posted by rebrendi View Post
                            There is still something that I do not understand, though:

                            For example, let's take the Ensembl mouse annotation: around 95,000 entries.

                            Now, let's look at some other database that contains all known mouse UTRs, e.g. http://utrdb.ba.itb.cnr.it/home/statistics
                            It has only around 25,000 entries for mouse.

                            It seems that it is technically more difficult to determine the TSS position than to just define the ORF.

                            Now, can someone explain which values are substituted in the Biomart output file for mouse, which contains ~95,000 entries, if only ~25,000 genes have been experimentally characterized in terms of their TSS?

                            How would I tell which "gene start" is the real gene start, and which "gene start" is just the start of the ORF?
                            Have you read how Ensembl generates its annotation? Have you then compared it to how the database you linked to was created? You should be able to answer your own question.

                            Comment


                            • #29
                              Originally posted by dpryan View Post
                              Have you read how Ensembl generates its annotation? Have you then compared it to how the database you linked to was created? You should be able to answer your own question.
                              You mean that the former is created both automatically and manually, and the latter is created manually? OK, but that does not answer my question.

                              I cannot check each individual gene, as I did in the example above when I found that the S. cerevisiae genes are annotated somewhat differently from the other species. I am just looking for a simple way to download a data set that contains all TSS coordinates (not the ORF coordinates).

                              Comment


                              • #30
                                Originally posted by rebrendi View Post
                                You mean that the former is created both automatically and manually, and the latter is created manually? OK, but that does not answer my question.

                                I cannot check each individual gene, as I did in the example above when I found that the S. cerevisiae genes are annotated somewhat differently from the other species. I am just looking for a simple way to download a data set that contains all TSS coordinates (not the ORF coordinates).
                                Neither of them is manually created, and they source from only partly overlapping datasets (well, one gets its data only from EMBL/GenBank). Your question regarding downloading TSS coordinates was already answered for mouse and human. Most other genomes are probably the same. Some, such as S. cerevisiae, aren't created by Ensembl and so could be different.

                                Unless you're downloading hundreds of genomes, it's not a problem to quickly check a couple of genes to make sure the dataset is what you think it is. That's a good thing to do anyway for any dataset you don't produce yourself. Frankly, you could have done that between when you wrote your last message and my reply.
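A spot check like that can be scripted once gene and coding-region coordinates sit side by side. The sketch below is one possible way to do it, not an established procedure. The field names, the function names `five_prime_utr_annotated` and `orf_only_starts`, and the records themselves are all hypothetical. It flags genes whose 5' boundary coincides with the start of the coding region, meaning "gene start" there is just the ORF start.

```python
def five_prime_utr_annotated(gene_start, gene_end, cds_start, cds_end, strand):
    """True if the gene extends beyond the coding region at its 5' end.

    All coordinates are forward-strand positions with start <= end.
    """
    if not gene_start <= cds_start <= cds_end <= gene_end:
        raise ValueError("coding region must lie inside the gene interval")
    if strand == "-":
        return gene_end > cds_end      # 5' end is the high coordinate
    return cds_start > gene_start      # 5' end is the low coordinate


def orf_only_starts(records):
    """Return ids of genes whose 5' boundary equals the coding-region boundary."""
    return [
        r["id"]
        for r in records
        if not five_prime_utr_annotated(
            r["gene_start"], r["gene_end"], r["cds_start"], r["cds_end"], r["strand"]
        )
    ]


# Hypothetical records for illustration only.
records = [
    {"id": "geneA", "gene_start": 1000, "gene_end": 5000,
     "cds_start": 1200, "cds_end": 4800, "strand": "+"},
    {"id": "geneB", "gene_start": 7000, "gene_end": 9000,
     "cds_start": 7000, "cds_end": 9000, "strand": "-"},
]

print(orf_only_starts(records))
```

Running this over a handful of genes per genome is usually enough to tell which kind of annotation you have. Depending on the file format, the stop codon may be listed separately from the coding region, so a small gap at the 3' end is not necessarily a UTR.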

                                Comment
