Seqanswers Leaderboard Ad

**mjoppich** · 09-09-2015, 09:01 AM

Could you provide an example of one of your "maker genes" (those nasty ones)?

I just recently wrote some kind of gtf/gff parser, which can handle some of your statistics. Since it is not yet well tested, i'd rather check it works before i'd give it away.

Btw, where are you from in Germany?

cheers,

markus

**balaena** · 09-10-2015, 05:52 AM

Hi

thanks for your reply
I'm from Hannover

below is an example of such a "nasty gene"

actually it is not a nasty gene. It depends on what you define as "exon": only the coding portion or any region that is represented in the transcript (including UTR). If you incorporate the UTR you can hardly compare the values of gene size, exon size or intron size between two species because the UTR prediction is highly dependent on the quality of the transcriptome.

I have two closely related genomes with transcriptomes. But one transcriptome is better than the other. If I used the "gene" values below the species with the better transcriptome has a larger gene size and larger introns but in reality the values are almost the same. They only, artificially, differ in transcript lengths.

So in the example below I am interested in getting the gene size without UTR (and the gaps between UTRs). The exons without UTR are already specified as "CDS". With both values I could calculate the introns.

Hope this makes sense

Cheers

scaffold_148 maker gene 178766 184218 . + . ID=TrispH2_008730;Name=TrispH2_008730;Alias=augustus_masked-scaffold_148-processed-gene-0.10;Note=Similar to SVOP: Synaptic vesicle 2-related protein (Bos taurus);
scaffold_148 maker mRNA 178766 184218 . + . ID=TrispH2_008730-RA;Parent=TrispH2_008730;Name=TrispH2_008730-RA;Alias=augustus_masked-scaffold_148-processed-gene-0.10-mRNA-1;_AED=0.24;_QI=541|0.85|0.86|1|0.78|0.73|15|539|302;_eAED=0.24;Note=Similar to SVOP: Synaptic vesicle 2-related protein (Bos taurus);
scaffold_148 maker exon 178766 178861 . + . ID=TrispH2_008730-RA:exon:13161;Parent=TrispH2_008730-RA;
scaffold_148 maker exon 180084 180182 . + . ID=TrispH2_008730-RA:exon:13162;Parent=TrispH2_008730-RA;
scaffold_148 maker exon 180472 180543 . + . ID=TrispH2_008730-RA:exon:13163;Parent=TrispH2_008730-RA;
scaffold_148 maker exon 180709 180975 . + . ID=TrispH2_008730-RA:exon:13164;Parent=TrispH2_008730-RA;
scaffold_148 maker exon 181081 181144 . + . ID=TrispH2_008730-RA:exon:13165;Parent=TrispH2_008730-RA;
scaffold_148 maker exon 181486 181614 . + . ID=TrispH2_008730-RA:exon:13166;Parent=TrispH2_008730-RA;
scaffold_148 maker exon 181801 181874 . + . ID=TrispH2_008730-RA:exon:13167;Parent=TrispH2_008730-RA;
scaffold_148 maker exon 181967 182037 . + . ID=TrispH2_008730-RA:exon:13168;Parent=TrispH2_008730-RA;
scaffold_148 maker exon 182269 182367 . + . ID=TrispH2_008730-RA:exon:13169;Parent=TrispH2_008730-RA;
scaffold_148 maker exon 182614 182721 . + . ID=TrispH2_008730-RA:exon:13170;Parent=TrispH2_008730-RA;
scaffold_148 maker exon 182814 182971 . + . ID=TrispH2_008730-RA:exon:13171;Parent=TrispH2_008730-RA;
scaffold_148 maker exon 183056 183145 . + . ID=TrispH2_008730-RA:exon:13172;Parent=TrispH2_008730-RA;
scaffold_148 maker exon 183362 183495 . + . ID=TrispH2_008730-RA:exon:13173;Parent=TrispH2_008730-RA;
scaffold_148 maker exon 183605 183641 . + . ID=TrispH2_008730-RA:exon:13174;Parent=TrispH2_008730-RA;
scaffold_148 maker exon 183728 184218 . + . ID=TrispH2_008730-RA:exon:13175;Parent=TrispH2_008730-RA;
scaffold_148 maker five_prime_UTR 178766 178861 . + . ID=TrispH2_008730-RA:five_prime_utr;Parent=TrispH2_008730-RA;
scaffold_148 maker five_prime_UTR 180084 180182 . + . ID=TrispH2_008730-RA:five_prime_utr;Parent=TrispH2_008730-RA;
scaffold_148 maker five_prime_UTR 180472 180543 . + . ID=TrispH2_008730-RA:five_prime_utr;Parent=TrispH2_008730-RA;
scaffold_148 maker five_prime_UTR 180709 180975 . + . ID=TrispH2_008730-RA:five_prime_utr;Parent=TrispH2_008730-RA;
scaffold_148 maker five_prime_UTR 181081 181087 . + . ID=TrispH2_008730-RA:five_prime_utr;Parent=TrispH2_008730-RA;
scaffold_148 maker CDS 181088 181144 . + 0 ID=TrispH2_008730-RA:cds;Parent=TrispH2_008730-RA;
scaffold_148 maker CDS 181486 181614 . + 0 ID=TrispH2_008730-RA:cds;Parent=TrispH2_008730-RA;
scaffold_148 maker CDS 181801 181874 . + 0 ID=TrispH2_008730-RA:cds;Parent=TrispH2_008730-RA;
scaffold_148 maker CDS 181967 182037 . + 1 ID=TrispH2_008730-RA:cds;Parent=TrispH2_008730-RA;
scaffold_148 maker CDS 182269 182367 . + 2 ID=TrispH2_008730-RA:cds;Parent=TrispH2_008730-RA;
scaffold_148 maker CDS 182614 182721 . + 2 ID=TrispH2_008730-RA:cds;Parent=TrispH2_008730-RA;
scaffold_148 maker CDS 182814 182971 . + 2 ID=TrispH2_008730-RA:cds;Parent=TrispH2_008730-RA;
scaffold_148 maker CDS 183056 183145 . + 0 ID=TrispH2_008730-RA:cds;Parent=TrispH2_008730-RA;
scaffold_148 maker CDS 183362 183484 . + 0 ID=TrispH2_008730-RA:cds;Parent=TrispH2_008730-RA;
scaffold_148 maker three_prime_UTR 183485 183495 . + . ID=TrispH2_008730-RA:three_prime_utr;Parent=TrispH2_008730-RA;
scaffold_148 maker three_prime_UTR 183605 183641 . + . ID=TrispH2_008730-RA:three_prime_utr;Parent=TrispH2_008730-RA;
scaffold_148 maker three_prime_UTR 183728 184218 . + . ID=TrispH2_008730-RA:three_prime_utr;Parent=TrispH2_008730-RA;

**mjoppich** · 09-15-2015, 08:44 AM

Hmm, as in any bioinformatics topic, definitions of gtf/gff files differ, but I think i got something that might actually help you :

GTF File: /usr/local/storage/references/smalltest.gtf
Stats File: stats.tsv
Start read in for: /usr/local/storage/references/smalltest.gtf on 8 threads

Adding Children to chromosome named scaffold_148 : 1
scaffold_148 chromosome gene 0 1 1 1 5453 5453
scaffold_148 gene mRNA 0 1 1 1 5453 5453
scaffold_148 mRNA exon 0 1 15 15 1989 132.6
scaffold_148 mRNA CDS 0 1 9 9 909 101
scaffold_148 mRNA exon 1 1 14 14 3464 247.429
scaffold_148 mRNA CDS 1 1 10 10 4544 454.4
scaffold_148 mRNA exon 2 1 14 14 3464 247.429
scaffold_148 mRNA CDS 2 1 8 8 1488 186
total chromosome gene 0 1 1 1 5453 5453
total gene mRNA 0 1 1 1 5453 5453
total mRNA exon 0 1 15 15 1989 132.6
total mRNA CDS 0 1 9 9 909 101
total mRNA exon 1 1 14 14 3464 247.429
total mRNA CDS 1 1 10 10 4544 454.4
total mRNA exon 2 1 14 14 3464 247.429
total mRNA CDS 2 1 8 8 1488 186

Bette double check the results, but I hope it's quite ok.

The 4th column describes the mode. 0 uses the feature in column 3, 1 takes the difference of feature in column 2 - feature in column 3.
2 is similar to 1, but sets the boundaries to subtract from to [min(start feature col 3), max( end feature col 3)].

For isntance "scaffold_148 mRNA exon 0 1 15 15 1989 132.6" would get you details on any exon within mRNA in the scaffold_148 "chromosome".
Here there is just one mRNA, 15 exons, making an average of 15 exons per mRNA.
Those 15 exons sum up to 1989bp, on average 132.6 bp per exon.

The line "total mRNA CDS 2 1 8 8 1488 186" describes that in all chromosome, in total 1 mRNA has been found. This mRNA has 8 intronic features (8 on average) having 1488bp in total, and thus an average length of 186bp per intron.

Here an intron must lie between two CDSs. In fact, in mode 2, the region from 178766 to 181087 is NOT an intron, while in mode 1, it is an intron.

One question: do you have linux/mingw, or is it windows only? (the latter requiring some kind of GUI)

Cheers ^_^

**balaena** · 09-15-2015, 12:39 PM

Thanks Markus, I have to check that next week, cause I am away from home on a meeting. I am working with Linux so that is not a problem. Cheers

**jcorn427** · 09-16-2015, 10:20 AM

Hi Markus, I have to say, this is exactly the kind of tool I've been looking for. Any chance it could be made available? I'm doing a lot of genome annotation and this looks like it would be great to get an idea of general stats of a given annotation. Very nice.

**balaena** · 09-21-2015, 01:39 AM

Originally posted by mjoppich View Post

Adding Children to chromosome named scaffold_148 : 1
scaffold_148 chromosome gene 0 1 1 1 5453 5453
scaffold_148 gene mRNA 0 1 1 1 5453 5453
scaffold_148 mRNA exon 0 1 15 15 1989 132.6
scaffold_148 mRNA CDS 0 1 9 9 909 101
scaffold_148 mRNA exon 1 1 14 14 3464 247.429
scaffold_148 mRNA CDS 1 1 10 10 4544 454.4
scaffold_148 mRNA exon 2 1 14 14 3464 247.429
scaffold_148 mRNA CDS 2 1 8 8 1488 186
total chromosome gene 0 1 1 1 5453 5453
total gene mRNA 0 1 1 1 5453 5453
total mRNA exon 0 1 15 15 1989 132.6
total mRNA CDS 0 1 9 9 909 101
total mRNA exon 1 1 14 14 3464 247.429
total mRNA CDS 1 1 10 10 4544 454.4
total mRNA exon 2 1 14 14 3464 247.429
total mRNA CDS 2 1 8 8 1488 186

Hi Markus

thank you very much. Looks quite useful. As I understand the output, the results seem to be OK.

- so first column is mode, next column is # mRNAs, then total count of respective feature, then mean of feature/mRNA, then length summation of feature, then mean length of feature

- in mode 2 the same; only that the space between features is counted as feature (intron)

- only thing I don't get is mode 1. For exon it is the same as mode 2 but not for CDs where it has 10 features (but there are only 9 CDs features). But I don't think I need it.

Another question I have: Does it sum up only for each scaffold or also for the entire assembly? This would also be important for the calculation of "gene size" without UTR (start of first CDs to end of last CDs), because it is not included in the output. The mean can be easily calculated with total CDs length plus total intron length divided by number of mRNAs but only if it is summed up over the entire assembly.

Best
Kai

**mjoppich** · 09-21-2015, 05:02 AM

@jcorn427
Sorry for the late answer, for some reason I have not gotten an email notification ...

Public release is the plan. However I still need to test a bit more, add something here and there.

Also I am quite a fan of providing bioinformatic tools which are easy to use (by anyone). Thus having some kind of (basic) Qt-based GUI is a must before any release.

@balaena
For mode 2 on CDS any non-CDS interval within 181088 and 183484 is taken. That makes 8 gaps, which is also shown below.

For mode 1 on CDS any non-CDS interval within 178766 to 184218 is taken (mRNA as reference). So you have two additional gaps at the beginning ( 178766 to 181088) and end (183484 to 184218). Sums up to 10 then

Output currently is for each scaffold, as well as for all (total).

Since it is not yet published, would you mind contacting me via [email protected] for the software (english/german)?
As this part of the library has been customized for your needs at the moment, you are kind of a beta-tester ^_^

Topics	Statistics	Last Post
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 39 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 41 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 35 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM
Evolutionary Dynamics of Centromeres: A Comparative Genomic Analysis by seqadmin Started by seqadmin, 04-04-2024, 09:00 AM	0 responses 55 views 0 likes	Last Post by seqadmin 04-04-2024, 09:00 AM

Seqanswers Leaderboard Ad

Announcement

How to get mean intron, mean gene size etc. from gff

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News