SEQanswers

Old 10-09-2012, 08:17 AM   #1
gaby
Junior Member
 
Location: Spain

Join Date: Sep 2010
Posts: 5
Default GFF to fasta (intergenic regions and introns)

Hello everyone!
Does anyone know a way, given a GFF file and the corresponding genomic segment in FASTA format, to extract intron and intergenic sequences in FASTA format? I'm just starting to write a script to do the job, but if anyone knows of an existing solution I would greatly appreciate it.
Cheers,
Gabriel
Old 10-09-2012, 08:21 AM   #2
aggp11
Member
 
Location: Wisconsin

Join Date: Jun 2011
Posts: 81
Default

Hi,

So does the GFF file have the intergenic and intronic coordinates in it? If that's the case, you could try Bedtools' "fastaFromBed" module, which takes an input FASTA file and a BED/GFF/VCF file and outputs the target FASTA file.

Praful
Old 10-09-2012, 08:39 AM   #3
gaby
Junior Member
 
Location: Spain

Join Date: Sep 2010
Posts: 5
Default

Hi Praful,
Thank you very much for your reply. No, my GFF file doesn't have the intergenic and intronic coordinates. It only has the "exon", "gene", "CDS" and "mRNA" features. So I was thinking of using the exon coordinates to derive the introns and the gene coordinates to derive the intergenic regions...
Gabriel
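
A minimal sketch of the exon-based idea, assuming a GFF3-style file in which each exon carries a Parent= attribute naming its transcript (GFF2 attributes are laid out differently and would need another parser). The function names are made up for illustration. An intron is taken to be the gap between consecutive exons of the same transcript, and coordinates stay 1-based and inclusive, as in GFF.

```python
from collections import defaultdict


def parse_attributes(field):
    """Parse a GFF3 attribute column such as 'ID=e1;Parent=m1' into a dict."""
    pairs = (item.split("=", 1) for item in field.strip().split(";") if "=" in item)
    return {key.strip(): value.strip() for key, value in pairs}


def introns_from_gff3(lines):
    """Return (seqid, start, end, strand, parent) for each gap between exons.

    Exons are grouped by the transcript named in their Parent attribute.
    """
    exons = defaultdict(list)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "exon":
            continue
        # An exon shared by several transcripts lists them comma-separated.
        for parent in parse_attributes(cols[8]).get("Parent", "").split(","):
            if parent:
                exons[(cols[0], cols[6], parent)].append((int(cols[3]), int(cols[4])))

    introns = []
    for (seqid, strand, parent), spans in exons.items():
        spans.sort()
        for (_, prev_end), (next_start, _) in zip(spans, spans[1:]):
            if next_start - prev_end > 1:  # ignore adjacent or overlapping exons
                introns.append((seqid, prev_end + 1, next_start - 1, strand, parent))
    return introns
```

The resulting coordinates could then be written out as GFF or BED lines and handed to a sequence extractor such as the fastaFromBed module mentioned above.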
Old 10-09-2012, 08:43 AM   #4
aggp11
Member
 
Location: Wisconsin

Join Date: Jun 2011
Posts: 81
Default

Gabriel,

Bedtools also has a "maskFastaFromBed" module. Maybe you could use this to first mask out the exons in your FASTA file (using the GFF file) and then write a short script to pull out the unmasked sequences. I don't know if this would work, but it is something you could try as a quick check.

Praful
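
The "pull out what is not masked" step can also be done on coordinates alone. The sketch below is an illustration, not part of Bedtools: the hypothetical function unmasked_regions returns the 1-based inclusive spans of a sequence that are not covered by any feature. Note that masking exons leaves introns and intergenic sequence mixed together; passing the "gene" features instead would leave only the intergenic spans.

```python
def unmasked_regions(seq_length, features):
    """Return 1-based inclusive (start, end) spans not covered by any feature.

    `features` is an iterable of 1-based inclusive (start, end) pairs on one
    sequence; they may overlap and may be given in any order.
    """
    gaps = []
    cursor = 1  # first position not yet known to be covered
    for start, end in sorted(features):
        if start > cursor:
            gaps.append((cursor, start - 1))
        cursor = max(cursor, end + 1)
    if cursor <= seq_length:
        gaps.append((cursor, seq_length))  # tail after the last feature
    return gaps
```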
Old 10-09-2012, 10:13 AM   #5
gaby
Junior Member
 
Location: Spain

Join Date: Sep 2010
Posts: 5
Default

Thanks Praful,
That's a good idea to start with!
Old 10-16-2012, 07:06 PM   #6
achal13r
Junior Member
 
Location: Paris, France

Join Date: Oct 2012
Posts: 5
Default GFF-Ex: A genome feature extraction package

Visit GFF-Ex: http://bioinfo.icgeb.res.in/gff/

GFF-Ex, a genome feature extraction package, extracts gene, exon, intron, upstream region of gene (promoter), intergenic and CDS/cDNA sequences simply by feeding in the genome feature file (GFF) along with the corresponding genome/chromosome sequence. GFF-Ex is a fusion of shell and Perl, developed for platforms supporting a UNIX-based file system.

Installation is easy and the tool is very user friendly. It works well with files in GFF-2 format and will be upgraded to support GFF-3 format as well.
Old 02-05-2013, 10:10 AM   #7
cascoamarillo
Senior Member
 
Location: MA

Join Date: Oct 2010
Posts: 101
Default

Hi,
I'm in the same situation; I want to pull out the introme dataset (e.g. as a FASTA file), but my annotation (GFF3) file doesn't have introns as features. So I started using GFF-Ex. In principle, it seems to work fine, extracting all these features into FASTA files; but when you take a closer look at the sequences, something goes wrong. The number and size of the introns for each gene ID look correct (if you compare them with the gaps between exons in your reference), but the sequence itself (after BLASTing it in the GBrowser) doesn't match that region at all; instead it matches some other, unrelated region of the genome. I've checked this with several genes in Neurospora crassa. Has anyone experienced this? Is there another way to pull out the introns? Thanks.
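
One way to narrow down a mismatch like this is to slice the reference directly at the annotated coordinates and compare the result with what the tool produced. The sketch below is only an illustration with made-up function names; it assumes the chromosome sequence is already in memory as a plain string and that coordinates are 1-based and inclusive, as in GFF. It does not diagnose what any particular tool does internally.

```python
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def extract(seq, start, end, strand="+"):
    """Slice 1-based inclusive coordinates out of a sequence string.

    Minus-strand features are returned reverse-complemented.
    """
    piece = seq[start - 1:end]
    if strand == "-":
        piece = piece.translate(COMPLEMENT)[::-1]
    return piece


def check_extracted(seq, extracted, start, end, strand="+"):
    """Compare a tool's output with the reference at the given coordinates."""
    other = "-" if strand == "+" else "+"
    if extracted.upper() == extract(seq, start, end, strand).upper():
        return "match"
    if extracted.upper() == extract(seq, start, end, other).upper():
        return "wrong strand"
    return "mismatch"
```

A "wrong strand" result would point to a reverse-complement issue, while a plain "mismatch" is what one would expect if the annotation and the genome sequence come from different assemblies.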
Old 02-06-2013, 04:10 AM   #8
gaby
Junior Member
 
Location: Spain

Join Date: Sep 2010
Posts: 5
Default

Quote:
Originally Posted by cascoamarillo View Post
Hi,
I'm in the same situation; I want to pull out the introme dataset (e.g. as a FASTA file), but my annotation (GFF3) file doesn't have introns as features. So I started using GFF-Ex. In principle, it seems to work fine, extracting all these features into FASTA files; but when you take a closer look at the sequences, something goes wrong. The number and size of the introns for each gene ID look correct (if you compare them with the gaps between exons in your reference), but the sequence itself (after BLASTing it in the GBrowser) doesn't match that region at all; instead it matches some other, unrelated region of the genome. I've checked this with several genes in Neurospora crassa. Has anyone experienced this? Is there another way to pull out the introns? Thanks.
I had the same problems with GFF-Ex. I'm not sure what is going on with this software. In the end I wrote my own script to do it. It just needs some changes to be useful in the general case. I can try to make them soon and make the script available.
Old 02-07-2013, 03:38 PM   #9
cascoamarillo
Senior Member
 
Location: MA

Join Date: Oct 2010
Posts: 101
Default

Quote:
Originally Posted by gaby View Post
I had the same problems with GFF-Ex. I'm not sure what is going on with this software. In the end I wrote my own script to do it. It just needs some changes to be useful in the general case. I can try to make them soon and make the script available.
All right, good to know that it doesn't happen only to me. I've also tried with a different annotated reference genome and got the same result. It would be good if you find an approach to do this, because I'm running out of options for getting the intron sequences. Let me know if I can help in any way. Thanks.
Old 02-09-2013, 08:55 AM   #10
Laval
Junior Member
 
Location: Levensque West

Join Date: Feb 2013
Posts: 4
Default

I also tried GFF-Ex, but it does not work well with large data sets like the soybean genome. I am trying to get intron sequences using a GFF3 file (from Phytozome) that includes coordinates for mRNA, UTRs, exons and CDS. Please suggest any tool or script for getting intron sequences from GFF3.
Old 02-09-2013, 12:41 PM   #11
AlexReynolds
Member
 
Location: Seattle, WA

Join Date: Feb 2013
Posts: 38
Default

Convert GFF3 to BED with gff2bed.

This BED data will contain the <feature> type (intron, CDS, etc.), so you could grep on that feature type to filter the BED data down to classes or categories of data, e.g.:

Code:
$ gff2bed < foo.gff | sort-bed - | grep intron - > introns.foo.bed
Then convert from BED to FASTA with bed2fasta or similar scripts.
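
For readers curious about what such a conversion involves: GFF coordinates are 1-based and inclusive, whereas BED coordinates are 0-based and half-open, so the start is reduced by one and the end is left unchanged. The sketch below illustrates only that coordinate shift; the function name is hypothetical and the six-column layout is chosen for illustration, with no claim that it matches the column order gff2bed actually emits.

```python
def gff_line_to_bed(line):
    """Convert one tab-separated GFF line into a six-column BED line.

    Output columns: seqid, start (0-based), end, feature type, score, strand.
    """
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 9:
        raise ValueError("expected 9 tab-separated columns, got %d" % len(cols))
    seqid, _source, feature, start, end, score, strand = cols[:7]
    bed_start = int(start) - 1  # 1-based inclusive -> 0-based half-open
    return "\t".join([seqid, str(bed_start), end, feature, score, strand])
```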
Old 02-10-2013, 06:54 PM   #12
Laval
Junior Member
 
Location: Levensque West

Join Date: Feb 2013
Posts: 4
Default

Quote:
Originally Posted by AlexReynolds View Post
Convert GFF3 to BED with gff2bed.

This BED data will contain the <feature> type (intron, CDS, etc.), so you could grep on that feature type to filter the BED data down to classes or categories of data, e.g.:

Code:
$ gff2bed < foo.gff | sort-bed - | grep intron - > introns.foo.bed
Then convert from BED to FASTA with bed2fasta or similar scripts.

I have tried it, but I'm getting an error like this:
sort-bed: command not found
Traceback (most recent call last):
  File "./gff2bed.py", line 107, in <module>
    sys.exit(main(*sys.argv))
  File "./gff2bed.py", line 94, in main
    cols['attributes']])
IOError: [Errno 32] Broken pipe
Old 02-10-2013, 07:53 PM   #13
AlexReynolds
Member
 
Location: Seattle, WA

Join Date: Feb 2013
Posts: 38
Default

Install the BEDOPS tools that come packaged with the gff2bed script (follow the gff2bed link for more info). The suite comes with sort-bed, as well as gff2bed.
Old 03-22-2013, 05:59 AM   #14
achal13r
Junior Member
 
Location: Paris, France

Join Date: Oct 2012
Posts: 5
Default GFF-Ex: @Laval, @cascoamarillo, @gaby

I think you are mistaken. Although the accuracy of the tool had been tested earlier, I re-verified the results of GFF-Ex because of these reports, and the results are satisfactory.
I ran GFF-Ex with the example files provided in the installation directory. After obtaining the intron sequences, I took a random sequence from the output intron file and aligned it to the corresponding genome file using BLAST. The result matched the coordinates specified in the GFF file that was used. What I infer from this is that you may be aligning the intron sequences against a genome reference that differs, in version or source, from the one the annotation file (GFF) was built on.
In any case, a few things have to be taken care of when running GFF-Ex:
1. The GFF file should be in GFF2 format. (A GFF-Ex version for GFF3 files is in progress and is to be released soon; keep visiting GFF-Ex.)
2. The genome file should be in FASTA format.
3. Both input files (GFF and genome FASTA) should be of the same version and from the same source. You cannot combine GFF and genome files from different versions or sources.
GFF-Ex is a user-friendly tool that offers accuracy, speed and sensitivity. It is suitable for extracting sequence information for multiple features, whether specified (gene, exon, CDS) or unspecified (introns, intergenic regions and regions upstream of genes) in the GFF file. I would encourage you all to explore it further and to take care with the input files.
Personal annotation queries related to GFF-Ex can also be sent to [email protected]
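
The requirement that both input files agree can be partly checked before running any extraction tool. The sketch below is a generic illustration, not part of GFF-Ex, and its function names are made up: it verifies that every sequence name in the GFF file exists in the FASTA file and that no feature extends past the end of its sequence. Passing this check does not prove the two files come from the same assembly version; failing it shows that they do not fit together.

```python
def read_fasta_lengths(lines):
    """Map each FASTA record name (first word of the header) to its length."""
    lengths = {}
    name = None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            fields = line[1:].split()
            name = fields[0] if fields else ""
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def find_inconsistencies(gff_lines, lengths):
    """Return (line_number, message) for features that do not fit the FASTA."""
    problems = []
    for number, line in enumerate(gff_lines, 1):
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 5:
            continue  # malformed line; not the concern of this check
        seqid, end = cols[0], int(cols[4])
        if seqid not in lengths:
            problems.append((number, "unknown sequence " + seqid))
        elif end > lengths[seqid]:
            problems.append((number, "feature ends beyond " + seqid))
    return problems
```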
Old 08-27-2013, 10:37 AM   #15
panos_ed
Junior Member
 
Location: Geneva, Switzerland

Join Date: May 2010
Posts: 8
Default

Quote:
Originally Posted by AlexReynolds View Post
Install the BEDOPS tools that come packaged with the gff2bed script (follow the gff2bed link for more info). The suite comes with sort-bed, as well as gff2bed.
Hello Alex!

I've downloaded the precompiled binaries for Linux (64-bit) and I still get an error, this time "index out of range". In my case, however, the problem appears to be in the "source" field (the second one), not in the "attributes"...

Code:
Traceback (most recent call last):
  File "/home/panos/Programs/temp/bedops-read-only/bin/gff2bed", line 212, in <module>
    sys.exit(main(*sys.argv))
  File "/home/panos/Programs/temp/bedops-read-only/bin/gff2bed", line 162, in main
    cols['source'] = elems[1]
IndexError: list index out of range
You can get one of the gff files I'm using from here.

Any ideas?
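
An IndexError on elems[1] suggests that some line yielded fewer fields than the script expected when it was split. A quick way to test that hypothesis is to list the lines that have fewer than nine tab-separated columns. The sketch below assumes the converter splits on tabs, which is a guess based on the traceback and not something confirmed in this thread; the function name is made up for illustration.

```python
def malformed_lines(lines, expected_columns=9):
    """Return (line_number, column_count) for data lines with too few columns.

    Blank lines and lines starting with '#' are ignored.
    """
    bad = []
    for number, line in enumerate(lines, 1):
        stripped = line.rstrip("\n")
        if not stripped.strip() or stripped.startswith("#"):
            continue
        count = len(stripped.split("\t"))
        if count < expected_columns:
            bad.append((number, count))
    return bad
```

Typical culprits would be space-separated lines, header lines without a leading '#', or sequence data embedded at the end of the annotation file.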
Old 08-27-2013, 10:53 AM   #16
AlexReynolds
Member
 
Location: Seattle, WA

Join Date: Feb 2013
Posts: 38
Default

That appears to be a different error from what is described in a previous comment. Thanks for the error report and sample GFF file. I'll take a look when I can.
Old 08-28-2013, 09:57 AM   #17
niti217
Member
 
Location: USA

Join Date: Dec 2011
Posts: 10
Default

Quote:
Originally Posted by achal13r View Post
Visit GFF-Ex: http://bioinfo.icgeb.res.in/gff/

GFF-Ex, a genome feature extraction package, extracts gene, exon, intron, upstream region of gene (promoter), intergenic and CDS/cDNA sequences simply by feeding in the genome feature file (GFF) along with the corresponding genome/chromosome sequence. GFF-Ex is a fusion of shell and Perl, developed for platforms supporting a UNIX-based file system.

Installation is easy and the tool is very user friendly. It works well with files in GFF-2 format and will be upgraded to support GFF-3 format as well.

I was wondering whether I could use GFF-Ex to get information on the reverse strand, especially the non-coding regions on the opposite strand. Please let me know. Thanks
Old 09-24-2013, 02:41 AM   #18
albireo
Member
 
Location: Europe

Join Date: Sep 2012
Posts: 39
Default

Hi, I'm getting the same error. Has anybody found out the reason for this exception? Thanks!


Quote:
Originally Posted by panos_ed View Post
Hello Alex!

I've downloaded the precompiled binaries for Linux (64-bit) and I still get an error, this time "index out of range". In my case, however, the problem appears to be in the "source" field (the second one), not in the "attributes"...

Code:
Traceback (most recent call last):
  File "/home/panos/Programs/temp/bedops-read-only/bin/gff2bed", line 212, in <module>
    sys.exit(main(*sys.argv))
  File "/home/panos/Programs/temp/bedops-read-only/bin/gff2bed", line 162, in main
    cols['source'] = elems[1]
IndexError: list index out of range
You can get one of the gff files I'm using from here.

Any ideas?
Old 09-24-2013, 03:47 AM   #19
AlexReynolds
Member
 
Location: Seattle, WA

Join Date: Feb 2013
Posts: 38
Default

There has been a change to the GTF standard, such that gtf2bed fails to parse newer GTF files (such as the example input that you link to, thanks).

This conversion script will be fixed in BEDOPS 2.3, which will be released later this week.
Old 09-24-2013, 04:02 AM   #20
albireo
Member
 
Location: Europe

Join Date: Sep 2012
Posts: 39
Default

Thanks a lot Alex
Tags
extraction, fasta, gff3, intergene region, intron