Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • How to get mean intron, mean gene size etc. from gff

    Hi

    I have a gff file from a Maker annotation of my genome and I would like to have simple statistics like mean gene size, mean intron etc..

    I tried to do it with extraction of the features using grep (e.g. exon) and importing the resulting table into libreoffice calc. I then could subtract the end from the start coordinates of all the exons by simply copying the formula (end - start + 1) across the entire row.

    Unfortunately the Maker features do not include "intron" and the "gene" feature is defined as intron+exon+UTR. This would not be a problem if all the UTRs were a gapless extension of the first and last exon. Then I could just sum up all UTR lengths, substract that from the sum of all genes (including the UTR) and divide by number of genes to get mean gene size without UTR. Having mean gene size without UTR I could easily calculate mean intron.

    But many times a "Maker gene" has more than one 5-prime or 3-prime UTR feature per gene and contains a gap between the features because the respective transcript used as evidence aligns with a gap to the genomic sequence. If I used the procedure above to calculate mean gene and intron size I would add the gap between UTR features. For a single gene I could manually subtract that gap but there is no way to do that in a table document for many thousand genes and I have no experience in writing scripts.

    Is there any way to get this information out of the annotation? Any program?


    Thank you

  • #2
    Could you provide an example of one of your "maker genes" (those nasty ones)?

    I just recently wrote some kind of gtf/gff parser, which can handle some of your statistics. Since it is not yet well tested, i'd rather check it works before i'd give it away.

    Btw, where are you from in Germany?

    cheers,

    markus

    Comment


    • #3
      Hi

      thanks for your reply
      I'm from Hannover

      below is an example of such a "nasty gene"

      actually it is not a nasty gene. It depends on what you define as "exon": only the coding portion or any region that is represented in the transcript (including UTR). If you incorporate the UTR you can hardly compare the values of gene size, exon size or intron size between two species because the UTR prediction is highly dependent on the quality of the transcriptome.

      I have two closely related genomes with transcriptomes. But one transcriptome is better than the other. If I used the "gene" values below the species with the better transcriptome has a larger gene size and larger introns but in reality the values are almost the same. They only, artificially, differ in transcript lengths.

      So in the example below I am interested in getting the gene size without UTR (and the gaps between UTRs). The exons without UTR are already specified as "CDS". With both values I could calculate the introns.

      Hope this makes sense

      Cheers



      scaffold_148 maker gene 178766 184218 . + . ID=TrispH2_008730;Name=TrispH2_008730;Alias=augustus_masked-scaffold_148-processed-gene-0.10;Note=Similar to SVOP: Synaptic vesicle 2-related protein (Bos taurus);
      scaffold_148 maker mRNA 178766 184218 . + . ID=TrispH2_008730-RA;Parent=TrispH2_008730;Name=TrispH2_008730-RA;Alias=augustus_masked-scaffold_148-processed-gene-0.10-mRNA-1;_AED=0.24;_QI=541|0.85|0.86|1|0.78|0.73|15|539|302;_eAED=0.24;Note=Similar to SVOP: Synaptic vesicle 2-related protein (Bos taurus);
      scaffold_148 maker exon 178766 178861 . + . ID=TrispH2_008730-RA:exon:13161;Parent=TrispH2_008730-RA;
      scaffold_148 maker exon 180084 180182 . + . ID=TrispH2_008730-RA:exon:13162;Parent=TrispH2_008730-RA;
      scaffold_148 maker exon 180472 180543 . + . ID=TrispH2_008730-RA:exon:13163;Parent=TrispH2_008730-RA;
      scaffold_148 maker exon 180709 180975 . + . ID=TrispH2_008730-RA:exon:13164;Parent=TrispH2_008730-RA;
      scaffold_148 maker exon 181081 181144 . + . ID=TrispH2_008730-RA:exon:13165;Parent=TrispH2_008730-RA;
      scaffold_148 maker exon 181486 181614 . + . ID=TrispH2_008730-RA:exon:13166;Parent=TrispH2_008730-RA;
      scaffold_148 maker exon 181801 181874 . + . ID=TrispH2_008730-RA:exon:13167;Parent=TrispH2_008730-RA;
      scaffold_148 maker exon 181967 182037 . + . ID=TrispH2_008730-RA:exon:13168;Parent=TrispH2_008730-RA;
      scaffold_148 maker exon 182269 182367 . + . ID=TrispH2_008730-RA:exon:13169;Parent=TrispH2_008730-RA;
      scaffold_148 maker exon 182614 182721 . + . ID=TrispH2_008730-RA:exon:13170;Parent=TrispH2_008730-RA;
      scaffold_148 maker exon 182814 182971 . + . ID=TrispH2_008730-RA:exon:13171;Parent=TrispH2_008730-RA;
      scaffold_148 maker exon 183056 183145 . + . ID=TrispH2_008730-RA:exon:13172;Parent=TrispH2_008730-RA;
      scaffold_148 maker exon 183362 183495 . + . ID=TrispH2_008730-RA:exon:13173;Parent=TrispH2_008730-RA;
      scaffold_148 maker exon 183605 183641 . + . ID=TrispH2_008730-RA:exon:13174;Parent=TrispH2_008730-RA;
      scaffold_148 maker exon 183728 184218 . + . ID=TrispH2_008730-RA:exon:13175;Parent=TrispH2_008730-RA;
      scaffold_148 maker five_prime_UTR 178766 178861 . + . ID=TrispH2_008730-RA:five_prime_utr;Parent=TrispH2_008730-RA;
      scaffold_148 maker five_prime_UTR 180084 180182 . + . ID=TrispH2_008730-RA:five_prime_utr;Parent=TrispH2_008730-RA;
      scaffold_148 maker five_prime_UTR 180472 180543 . + . ID=TrispH2_008730-RA:five_prime_utr;Parent=TrispH2_008730-RA;
      scaffold_148 maker five_prime_UTR 180709 180975 . + . ID=TrispH2_008730-RA:five_prime_utr;Parent=TrispH2_008730-RA;
      scaffold_148 maker five_prime_UTR 181081 181087 . + . ID=TrispH2_008730-RA:five_prime_utr;Parent=TrispH2_008730-RA;
      scaffold_148 maker CDS 181088 181144 . + 0 ID=TrispH2_008730-RA:cds;Parent=TrispH2_008730-RA;
      scaffold_148 maker CDS 181486 181614 . + 0 ID=TrispH2_008730-RA:cds;Parent=TrispH2_008730-RA;
      scaffold_148 maker CDS 181801 181874 . + 0 ID=TrispH2_008730-RA:cds;Parent=TrispH2_008730-RA;
      scaffold_148 maker CDS 181967 182037 . + 1 ID=TrispH2_008730-RA:cds;Parent=TrispH2_008730-RA;
      scaffold_148 maker CDS 182269 182367 . + 2 ID=TrispH2_008730-RA:cds;Parent=TrispH2_008730-RA;
      scaffold_148 maker CDS 182614 182721 . + 2 ID=TrispH2_008730-RA:cds;Parent=TrispH2_008730-RA;
      scaffold_148 maker CDS 182814 182971 . + 2 ID=TrispH2_008730-RA:cds;Parent=TrispH2_008730-RA;
      scaffold_148 maker CDS 183056 183145 . + 0 ID=TrispH2_008730-RA:cds;Parent=TrispH2_008730-RA;
      scaffold_148 maker CDS 183362 183484 . + 0 ID=TrispH2_008730-RA:cds;Parent=TrispH2_008730-RA;
      scaffold_148 maker three_prime_UTR 183485 183495 . + . ID=TrispH2_008730-RA:three_prime_utr;Parent=TrispH2_008730-RA;
      scaffold_148 maker three_prime_UTR 183605 183641 . + . ID=TrispH2_008730-RA:three_prime_utr;Parent=TrispH2_008730-RA;
      scaffold_148 maker three_prime_UTR 183728 184218 . + . ID=TrispH2_008730-RA:three_prime_utr;Parent=TrispH2_008730-RA;

      Comment


      • #4
        Hmm, as in any bioinformatics topic, definitions of gtf/gff files differ, but I think i got something that might actually help you :

        GTF File: /usr/local/storage/references/smalltest.gtf
        Stats File: stats.tsv
        Start read in for: /usr/local/storage/references/smalltest.gtf on 8 threads

        Adding Children to chromosome named scaffold_148 : 1
        scaffold_148 chromosome gene 0 1 1 1 5453 5453
        scaffold_148 gene mRNA 0 1 1 1 5453 5453
        scaffold_148 mRNA exon 0 1 15 15 1989 132.6
        scaffold_148 mRNA CDS 0 1 9 9 909 101
        scaffold_148 mRNA exon 1 1 14 14 3464 247.429
        scaffold_148 mRNA CDS 1 1 10 10 4544 454.4
        scaffold_148 mRNA exon 2 1 14 14 3464 247.429
        scaffold_148 mRNA CDS 2 1 8 8 1488 186
        total chromosome gene 0 1 1 1 5453 5453
        total gene mRNA 0 1 1 1 5453 5453
        total mRNA exon 0 1 15 15 1989 132.6
        total mRNA CDS 0 1 9 9 909 101
        total mRNA exon 1 1 14 14 3464 247.429
        total mRNA CDS 1 1 10 10 4544 454.4
        total mRNA exon 2 1 14 14 3464 247.429
        total mRNA CDS 2 1 8 8 1488 186

        Bette double check the results, but I hope it's quite ok.

        The 4th column describes the mode. 0 uses the feature in column 3, 1 takes the difference of feature in column 2 - feature in column 3.
        2 is similar to 1, but sets the boundaries to subtract from to [min(start feature col 3), max( end feature col 3)].

        For isntance "scaffold_148 mRNA exon 0 1 15 15 1989 132.6" would get you details on any exon within mRNA in the scaffold_148 "chromosome".
        Here there is just one mRNA, 15 exons, making an average of 15 exons per mRNA.
        Those 15 exons sum up to 1989bp, on average 132.6 bp per exon.

        The line "total mRNA CDS 2 1 8 8 1488 186" describes that in all chromosome, in total 1 mRNA has been found. This mRNA has 8 intronic features (8 on average) having 1488bp in total, and thus an average length of 186bp per intron.

        Here an intron must lie between two CDSs. In fact, in mode 2, the region from 178766 to 181087 is NOT an intron, while in mode 1, it is an intron.

        One question: do you have linux/mingw, or is it windows only? (the latter requiring some kind of GUI)

        Cheers ^_^

        Comment


        • #5
          Thanks Markus, I have to check that next week, cause I am away from home on a meeting. I am working with Linux so that is not a problem. Cheers

          Comment


          • #6
            Hi Markus, I have to say, this is exactly the kind of tool I've been looking for. Any chance it could be made available? I'm doing a lot of genome annotation and this looks like it would be great to get an idea of general stats of a given annotation. Very nice.

            Comment


            • #7
              Originally posted by mjoppich View Post
              Adding Children to chromosome named scaffold_148 : 1
              scaffold_148 chromosome gene 0 1 1 1 5453 5453
              scaffold_148 gene mRNA 0 1 1 1 5453 5453
              scaffold_148 mRNA exon 0 1 15 15 1989 132.6
              scaffold_148 mRNA CDS 0 1 9 9 909 101
              scaffold_148 mRNA exon 1 1 14 14 3464 247.429
              scaffold_148 mRNA CDS 1 1 10 10 4544 454.4
              scaffold_148 mRNA exon 2 1 14 14 3464 247.429
              scaffold_148 mRNA CDS 2 1 8 8 1488 186
              total chromosome gene 0 1 1 1 5453 5453
              total gene mRNA 0 1 1 1 5453 5453
              total mRNA exon 0 1 15 15 1989 132.6
              total mRNA CDS 0 1 9 9 909 101
              total mRNA exon 1 1 14 14 3464 247.429
              total mRNA CDS 1 1 10 10 4544 454.4
              total mRNA exon 2 1 14 14 3464 247.429
              total mRNA CDS 2 1 8 8 1488 186
              Hi Markus

              thank you very much. Looks quite useful. As I understand the output, the results seem to be OK.

              - so first column is mode, next column is # mRNAs, then total count of respective feature, then mean of feature/mRNA, then length summation of feature, then mean length of feature

              - in mode 2 the same; only that the space between features is counted as feature (intron)

              - only thing I don't get is mode 1. For exon it is the same as mode 2 but not for CDs where it has 10 features (but there are only 9 CDs features). But I don't think I need it.

              Another question I have: Does it sum up only for each scaffold or also for the entire assembly? This would also be important for the calculation of "gene size" without UTR (start of first CDs to end of last CDs), because it is not included in the output. The mean can be easily calculated with total CDs length plus total intron length divided by number of mRNAs but only if it is summed up over the entire assembly.

              Best
              Kai

              Comment


              • #8
                @jcorn427
                Sorry for the late answer, for some reason I have not gotten an email notification ...

                Public release is the plan. However I still need to test a bit more, add something here and there.

                Also I am quite a fan of providing bioinformatic tools which are easy to use (by anyone). Thus having some kind of (basic) Qt-based GUI is a must before any release.

                @balaena
                For mode 2 on CDS any non-CDS interval within 181088 and 183484 is taken. That makes 8 gaps, which is also shown below.

                For mode 1 on CDS any non-CDS interval within 178766 to 184218 is taken (mRNA as reference). So you have two additional gaps at the beginning ( 178766 to 181088) and end (183484 to 184218). Sums up to 10 then

                Output currently is for each scaffold, as well as for all (total).

                Since it is not yet published, would you mind contacting me via [email protected] for the software (english/german)?
                As this part of the library has been customized for your needs at the moment, you are kind of a beta-tester ^_^

                Comment

                Latest Articles

                Collapse

                • seqadmin
                  Essential Discoveries and Tools in Epitranscriptomics
                  by seqadmin


                  The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist on Modified Bases...
                  Yesterday, 07:01 AM
                • seqadmin
                  Current Approaches to Protein Sequencing
                  by seqadmin


                  Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                  04-04-2024, 04:25 PM

                ad_right_rmr

                Collapse

                News

                Collapse

                Topics Statistics Last Post
                Started by seqadmin, 04-11-2024, 12:08 PM
                0 responses
                39 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, 04-10-2024, 10:19 PM
                0 responses
                41 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, 04-10-2024, 09:21 AM
                0 responses
                35 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, 04-04-2024, 09:00 AM
                0 responses
                55 views
                0 likes
                Last Post seqadmin  
                Working...
                X