[GlimmerHMM] Is my understanding right?

syintel87

Member

Join Date: Dec 2012

Posts: 81
#1


02-05-2014, 05:34 AM

Hello, I have a question about GlimmerHMM.
Could you look at the description below and tell me whether my understanding is correct?
My organism is M. chitwoodi.
Thank you in advance.

1. What I have as input data:
- contig.fasta (output of ABySS): 185,458 contigs
- cDNA.fasta (cDNA clusters from nematode.net): 5,880 genes

contig.fasta (output of ABySS, de novo assembler)

> contig1
...
> contig2
...
> contig185458
...

cDNA.fasta (M. chitwoodi cDNA clusters from nematode.net)

> MC1
...
> MC2
...
> MC5880
...

2. There are two options to run GlimmerHMM:

2-1. glimmerhmm

Input: only the longest contig, extracted from contig.fasta
If I run just "glimmerhmm", I do not need the whole contig.fasta file; the input sequence can be the longest contig alone.
I can use the built-in training directory (Celegans) to predict genes on that contig.
(+): easy, fast
(-): The result could be biased, and genes are predicted on only one contig.
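Extracting the longest contig for option 2-1 could look roughly like the sketch below. It is only an illustration: the helper names `read_fasta` and `longest_record` are made up for this example, and it assumes each record ID is the first whitespace-delimited token after ">".

```python
from typing import Iterable, Iterator, Optional, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (record_id, sequence) pairs from the lines of a FASTA file."""
    name = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Assumption: the ID is the first token after '>' ("> contig1" -> "contig1").
            header = line[1:].strip()
            name = header.split()[0] if header else ""
            chunks = []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def longest_record(lines: Iterable[str]) -> Optional[Tuple[str, str]]:
    """Return the (record_id, sequence) pair with the longest sequence, or None if empty."""
    return max(read_fasta(lines), key=lambda rec: len(rec[1]), default=None)
```

With a real file this would be called as `longest_record(open("contig.fasta"))`, and the returned sequence written to its own FASTA file before being passed to glimmerhmm.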

2-2. trainGlimmerHMM

Input: the whole contig.fasta file and an exon file
If I train on the whole contig file, the contigs themselves serve as the training sequences.
However, I need to create the exon file.

Align one contig from contig.fasta against the whole set in cDNA.fasta to find the start and end sites of its exons.
The alignment can be done with BLAST or sim4.
Repeat this 185,458 times.
Merge the 185,458 exon files into one (first column: contig ID, second column: start site, third column: end site).
Train on contig.fasta together with the exon file.
(+): reliable result, gene prediction on every contig
(-): too much time and computation for running BLAST and creating the exon file
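As a rough sketch of the "alignment to exon file" step, the function below turns BLAST tabular output (the standard 12-column `-outfmt 6` layout, with the contig as the query) into three-column lines. Several things here are assumptions rather than anything stated above: the function name, the 95% identity cutoff, and the idea of treating each aligned segment as one exon. BLAST segment boundaries only approximate exon boundaries, strand is not recorded, and the trainGlimmerHMM documentation should be checked for the exact exon file layout (for example, how separate genes are delimited).

```python
from typing import Iterable, List


def blast_tab_to_exon_lines(rows: Iterable[str], min_identity: float = 95.0) -> List[str]:
    """Convert BLAST tabular rows (contig as query) into 'contig start end' lines.

    Expected columns: qseqid sseqid pident length mismatch gapopen
                      qstart qend sstart send evalue bitscore
    """
    exon_lines = []
    for row in rows:
        row = row.strip()
        if not row or row.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = row.split("\t")
        if len(fields) < 12:
            raise ValueError(f"expected 12 tab-separated columns, got {len(fields)}: {row!r}")
        contig_id = fields[0]
        identity = float(fields[2])
        qstart, qend = int(fields[6]), int(fields[7])
        # Hypothetical filter: keep only high-identity segments.
        if identity < min_identity:
            continue
        exon_lines.append(f"{contig_id} {qstart} {qend}")
    return exon_lines
```

Because the contig ID is carried in the first column of every row, the output for many contigs can simply be concatenated into one file.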

Last edited by syintel87; 02-05-2014, 05:36 AM.
syintel87

#2

02-05-2014, 11:10 AM

I want to clarify my question above.
If I choose to run trainGlimmerHMM, could you check whether my understanding below is right?

***
"contig.fasta" consists of 185,457 contigs that were assembled de novo with ABySS. Each contig is introduced by a ">" header line, like:
> contig1
...
> contig185457

***
If I want to run trainGlimmerHMM with "contig.fasta", I have to provide an exon file that contains the exon coordinates of every contig, like:
contig1 exon1startSite endSite
contig1 exon2startSite endSite
...
contig185457 exon1startSite endSite
contig185457 exon2startSite endSite
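A file in the three-column layout above can be sanity-checked before training with something like the following sketch. The function name is hypothetical, and it assumes whitespace-separated columns with integer coordinates; blank lines are skipped.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def parse_exon_lines(lines: Iterable[str]) -> Dict[str, List[Tuple[int, int]]]:
    """Group 'contigID start end' lines into {contigID: [(start, end), ...]}."""
    exons = defaultdict(list)
    for lineno, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue  # blank line
        if len(fields) != 3:
            raise ValueError(f"line {lineno}: expected 3 columns, got {len(fields)}")
        contig_id, start, end = fields
        # int() raises ValueError if a coordinate is not a whole number.
        exons[contig_id].append((int(start), int(end)))
    return dict(exons)
```

Comparing the keys of the returned dictionary with the contig IDs in contig.fasta shows which contigs have no exon annotation at all.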

***
However, it seems that sim4 or BLAST takes only one contig as the query. This means I need to split out each contig and align it against the cDNA clusters separately, like:
blast contig1.fasta cDNA.fasta
blast contig2.fasta cDNA.fasta
...
blast contig185457.fasta cDNA.fasta
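If per-contig files really are needed, splitting contig.fasta could be sketched as below. The function name and the `<ID>.fasta` naming scheme are assumptions for illustration, and the record ID is assumed to be safe to use as a file name.

```python
import os
from typing import Iterable, List


def split_fasta(lines: Iterable[str], out_dir: str) -> List[str]:
    """Write each FASTA record to its own file in out_dir and return the file paths."""
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    out = None
    try:
        for line in lines:
            if line.startswith(">"):
                if out is not None:
                    out.close()
                # Assumption: the ID is the first token after '>'.
                header = line[1:].strip()
                record_id = header.split()[0] if header else f"record{len(paths) + 1}"
                path = os.path.join(out_dir, f"{record_id}.fasta")
                out = open(path, "w")
                paths.append(path)
            if out is not None:
                out.write(line if line.endswith("\n") else line + "\n")
    finally:
        if out is not None:
            out.close()
    return paths
```

This would be called as `split_fasta(open("contig.fasta"), "contigs_split")`. It may be worth checking first whether splitting is necessary: BLAST+ programs generally accept a multi-record FASTA file as the query and report the query ID with each hit.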

***
Then I would get the exon sites for each contig, so that I can give trainGlimmerHMM an exon file containing all exon sites of all contigs in contig.fasta.

***
In summary, I need to run BLAST individually on all 185,457 contigs to find their exon sites. Each BLAST run takes a different contig FASTA file but the same cDNA FASTA file as input.