SEQanswers

11-06-2012, 09:53 AM   #1
archgen
Junior Member
 
Location: Arizona

Join Date: Aug 2011
Posts: 5
The one file to rule them all - ref genome

This might be a simple question. But since I'm a molecular archaeologist, I'm years behind the bioinfo times (or it feels that way), and I'm hoping this forum will be a good place to start.

I just got back a boatload of Illumina PE sequencing reads for a handful of species in one genus. In order to start with any analysis, I need a reference genome, in one, neat little file (ok maybe not 'little').

There are FTP sites (specifically, Sanger and NCBI genome) where I can access the sequence data for the three previously completed genomes in my genus of interest. But upon initial examination, each chromosome is represented by files with the following eleven extensions: *.asn *.faa *.ffn *.fna *.frn *.gbk *.gff *.ptt *.rnt *.rpt *.val.

I know that faa, ffn, fna, etc. are all FASTA file formats with different types of information. Do I just need to cat the fna files for each chromosome?

Simply, how do I build the one file to rule them all? And is this how others have approached creating a reference genome file- to index in BWA, for example?

Any insight is appreciated!
11-06-2012, 10:27 AM   #2
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,080

You can "cat" the fasta formatted nucleotide sequence files to create a common "reference genome" file. This can be used for making the indexes.
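
For anyone who would rather script this step than type it at the shell, here is a minimal Python sketch of the same concatenation. It assumes all the per-chromosome .fna files sit in one directory; the function name `build_reference` and the paths in the usage note are made up for illustration. The one thing it adds over a plain "cat" is a guard for files that lack a final newline, which would otherwise glue the next ">" header onto the end of the previous sequence.

```python
from pathlib import Path


def build_reference(fna_dir, out_path):
    """Concatenate every *.fna file in fna_dir into one multi-FASTA file.

    Returns the input file names in the order they were written.
    """
    # Sort so the chromosome order is reproducible between runs.
    fna_files = sorted(Path(fna_dir).glob("*.fna"))
    if not fna_files:
        raise FileNotFoundError(f"no .fna files found in {fna_dir}")

    with open(out_path, "w") as out:
        for fna in fna_files:
            with open(fna) as handle:
                # Stream line by line so large chromosomes are not
                # loaded into memory all at once.
                for line in handle:
                    # Only the last line of a file can lack a newline;
                    # add one so the next header starts on its own line.
                    out.write(line if line.endswith("\n") else line + "\n")

    return [f.name for f in fna_files]
```

Calling `build_reference("genome_dir", "ref.fa")` (both names hypothetical) should leave a single ref.fa that can then be indexed, for example with `bwa index ref.fa`.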
11-06-2012, 01:53 PM   #3
archgen
Junior Member
 
Location: Arizona

Join Date: Aug 2011
Posts: 5

Thanks for the reply.

Just to be clear on your response, "cat" only the .fna files for each chromosome, not any of the other fasta formatted sequence files, i.e. the .ffn with coding region info?

Again, much appreciated. Relieved that it seems like a simple solution.
11-07-2012, 05:19 AM   #4
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,080

Quote:
Originally Posted by archgen
Thanks for the reply.

Just to be clear on your response, "cat" only the .fna files for each chromosome, not any of the other fasta formatted sequence files, i.e. the .ffn with coding region info?

Again, much appreciated. Relieved that it seems like a simple solution.
What you need is a simple "multi-fasta" formatted file: one that contains only the ">fasta header" followed by the sequence starting on the next line, repeated for every sequence.
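
As a quick sanity check on the combined file, a short Python sketch like the one below can list each record's name and length, so you can confirm there is one entry per chromosome and nothing else slipped in. The function name `fasta_records` is hypothetical, and taking the first whitespace-delimited token of the header as the record name is an assumption about how you want the sequences labelled, not a rule of the format.

```python
def fasta_records(path):
    """Return a list of (name, sequence_length) for a multi-FASTA file."""
    records = []
    name, length = None, 0
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            if line.startswith(">"):
                # A new header closes the previous record, if any.
                if name is not None:
                    records.append((name, length))
                parts = line[1:].split()
                # Assumption: the first token of the header is the name.
                name, length = (parts[0] if parts else ""), 0
            elif name is None:
                raise ValueError("sequence data found before any '>' header")
            else:
                length += len(line)
    # Close the final record.
    if name is not None:
        records.append((name, length))
    return records
```

If the number of records does not match the number of chromosomes you concatenated, or a length looks far too short, a file other than the .fna ones probably got included.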
11-07-2012, 07:35 AM   #5
westerman
Rick Westerman
 
Location: Purdue University, Indiana, USA

Join Date: Jun 2008
Posts: 1,104

Also, the repository often has a whole-genome file already available, which removes the need to cat the individual chromosome files.
11-07-2012, 08:18 AM   #6
TonyBrooks
Senior Member
 
Location: London

Join Date: Jun 2009
Posts: 298

Illumina have helpfully supplied iGenomes archives for some common species.
These contain BWA and Bowtie indices, making alignment a walk in the park (even I can do it!). There's no need to deal with FASTA (although that data is also in the archive you download from the Illumina website).
https://my.illumina.com/Message/iGenome/
I think some of these files are also available on the Cufflinks page (http://cufflinks.cbcb.umd.edu/igenomes.html) if you don't have an Illumina login. They also contain RNA-Seq annotation, but you can just ignore that for genome assembly - the references are still there.
11-07-2012, 03:32 PM   #7
archgen
Junior Member
 
Location: Arizona

Join Date: Aug 2011
Posts: 5

Sadly, I'm not working with any model organisms with well-known reference genomes. But it's good to know those sites exist for future projects.

Thanks again for the feedback.