Hi
I have a GFF file from a Maker annotation of my genome and I would like to get simple statistics from it, such as mean gene size, mean intron size, etc.
I tried to do this by extracting the features of one type (e.g. exon) with grep and importing the resulting table into LibreOffice Calc. I could then get the length of every exon by subtracting the start from the end coordinate, simply copying the formula (end - start + 1) down the whole column.
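As a rough illustration, the same per-feature calculation could be done without a spreadsheet. This is only a sketch: the function name `mean_feature_lengths` is made up for this example, and it assumes a tab-separated GFF with the feature type in column 3 and the start and end coordinates in columns 4 and 5.

```python
from collections import defaultdict


def mean_feature_lengths(gff_lines):
    """Return {feature_type: mean length}, with length = end - start + 1.

    Hypothetical helper; assumes standard 9-column, tab-separated GFF lines.
    """
    totals = defaultdict(int)
    counts = defaultdict(int)
    for line in gff_lines:
        if line.startswith("##FASTA"):
            break  # an embedded sequence section, if present, ends the annotation
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments/directives
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue  # not a feature line
        ftype, start, end = cols[2], int(cols[3]), int(cols[4])
        totals[ftype] += end - start + 1
        counts[ftype] += 1
    return {ftype: totals[ftype] / counts[ftype] for ftype in totals}
```

It could be called as `mean_feature_lengths(open("annotation.gff"))`, where the file name is a placeholder.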
Unfortunately, the Maker features do not include "intron", and the "gene" feature is defined as intron + exon + UTR. This would not be a problem if all the UTRs were a gapless extension of the first and last exon. Then I could just sum up all UTR lengths, subtract that from the summed length of all genes (including the UTRs) and divide by the number of genes to get the mean gene size without UTR. With the mean gene size without UTR I could easily calculate the mean intron size.
But a "Maker gene" often has more than one 5-prime or 3-prime UTR feature, with a gap between those features, because the transcript used as evidence aligns to the genomic sequence with a gap. If I used the procedure above to calculate mean gene and intron size, I would be counting the gaps between UTR features as well. For a single gene I could subtract that gap manually, but there is no way to do that in a spreadsheet for many thousands of genes, and I have no experience in writing scripts.
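One possible way around the UTR gaps, sketched here under assumptions, is to ignore the UTR features altogether and work only from the CDS features of each transcript: the span from the first to the last CDS segment is the gene size without UTR, and the gaps between consecutive CDS segments are the introns. The function name `span_and_gap_stats` is made up for this example. It assumes GFF3-style attributes in column 9 where each CDS line carries a `Parent=` tag naming its transcript, and it counts per transcript, so a gene with several isoforms contributes once per isoform.

```python
from collections import defaultdict


def span_and_gap_stats(gff_lines, feature_type="CDS"):
    """Return (mean span, mean gap) for one feature type, grouped by Parent.

    Hypothetical helper. Span = outermost coordinates of a transcript's
    features; gaps = stretches between consecutive features (introns).
    """
    by_parent = defaultdict(list)
    for line in gff_lines:
        if line.startswith("##FASTA"):
            break
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != feature_type:
            continue
        # Assumed attribute layout: key=value pairs separated by ";"
        attrs = dict(kv.split("=", 1) for kv in cols[8].split(";") if "=" in kv)
        for parent in attrs.get("Parent", "").split(","):
            if parent:
                by_parent[parent].append((int(cols[3]), int(cols[4])))

    spans, gaps = [], []
    for intervals in by_parent.values():
        intervals.sort()
        first_start = intervals[0][0]
        last_end = max(end for _, end in intervals)
        spans.append(last_end - first_start + 1)
        for (_, prev_end), (next_start, _) in zip(intervals, intervals[1:]):
            gap = next_start - prev_end - 1
            if gap > 0:  # ignore touching or overlapping segments
                gaps.append(gap)

    def mean(values):
        return sum(values) / len(values) if values else 0.0

    return mean(spans), mean(gaps)
```

Passing `feature_type="exon"` instead would give the transcript size including UTRs and count the gaps inside UTRs as introns too.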
Is there any way to get this information out of the annotation? Any program?
Thank you