SEQanswers

02-23-2010, 02:22 PM   #1
iloveneworleans
Member
 
Location: new orleans

Join Date: Jun 2009
Posts: 12
Where can I find the complete FASTA format sequence (human and mouse)?

On the EBI database website (http://www.ebi.ac.uk/astd/download.html), they only provide FASTA sequences of the exons or transcripts for download.
Does anybody know where I can find the complete genome sequence in FASTA format (human and mouse) that matches "Feb 2008 Release 1.1"? I want to use the complete sequence as the reference genome for aligning RNA-seq data.
Thanks in advance!
02-23-2010, 11:42 PM   #2
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 869

You can get all complete assemblies from Ensembl:

http://www.ensembl.org/info/data/ftp/index.html

...or NCBI:

ftp://ftp.ncbi.nih.gov/genomes/

You'll need to check the details of exactly which assembly to use, though. The description you included doesn't obviously match any human or mouse assembly. Maybe you're looking at a description of an annotation set rather than of an underlying assembly? Both of those sites will give you the latest assembly for each species by default.
02-24-2010, 05:46 AM   #3
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 991

Simon Andrews pointed out the right places to look.

Three remarks on Ensembl's human FASTA files, to save you the time of falling into these traps:

- Do not use the repeat-masked sequences ("_rm" in the filenames). Judging which repeats are detrimental is better left to the aligner.

- It seems convenient to download the file denoted "toplevel", as it contains all the other FASTA sequences in one big file. However, this means that all the MHC variants are included. If you feed this to the aligner, it will not realize that all these MHC sequences are variants of the _same_ region and will treat the region as repetitive. Better to kick out the variant sequences before using the toplevel file, or download the chromosome files individually and feed them all together to the aligner.

- If you later use annotation, be sure to use the corresponding data, e.g., the GTF file from Ensembl. If you mix different assemblies, or maybe even NCBI's and Ensembl's representations of the same assembly build, the coordinates might not fit.

Simon
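
The second remark above (dropping the variant sequences from the toplevel file) could be sketched in Python as follows. This is only a minimal illustration, not a tested pipeline: it assumes that the primary human sequences are named 1 to 22, X, Y and MT in the FASTA headers, and the haplotype name in the usage example is purely illustrative. Check the headers of your own file first. Note that this rule also drops any unplaced scaffolds.

```python
from typing import Callable, Iterable, Iterator

# Assumption: primary chromosomes are named like this in the FASTA headers.
PRIMARY = {str(i) for i in range(1, 23)} | {"X", "Y", "MT"}


def is_primary(name: str) -> bool:
    """Hypothetical rule: keep only the primary chromosomes."""
    return name in PRIMARY


def filter_fasta(lines: Iterable[str],
                 keep: Callable[[str], bool]) -> Iterator[str]:
    """Yield the lines of those FASTA records whose name passes keep()."""
    keeping = False
    for line in lines:
        if line.startswith(">"):
            # The record name is the first whitespace-delimited token
            # after the ">" character.
            parts = line[1:].split()
            name = parts[0] if parts else ""
            keeping = keep(name)
        if keeping:
            yield line
```

Usage, with hypothetical file names: open the toplevel file for reading and an output file for writing, then call `out.writelines(filter_fasta(src, is_primary))`. Because the function is a generator working line by line, the whole genome never has to fit in memory.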
02-24-2010, 10:05 AM   #4
iloveneworleans
Member
 
Location: new orleans

Join Date: Jun 2009
Posts: 12

Thanks, Simon Andrews and Simon Anders!

We can get all complete assemblies from the Ensembl and NCBI FTP servers. But I think EBI might have its own complete assemblies for download. As Simon Anders said, if I align the RNA-seq data against the complete assemblies downloaded from Ensembl or NCBI and then use the annotation file (GTF file) from EBI, the coordinates might not fit.

Although EBI provides FASTA sequence files and annotation files (GTF files) for download, the FASTA files contain only the exons or transcripts rather than the complete sequence. I think these exon and transcript sequences must have been extracted from a complete sequence file. Why doesn't EBI provide it for download? Or is EBI using the same complete assemblies as Ensembl or NCBI?
02-24-2010, 10:44 AM   #5
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 991

First of all: I got quite confused about what you mean by EBI. Note that the European Bioinformatics Institute (EBI, in Hinxton, Cambs., England) hosts a lot of data, among them the whole EnsEMBL project (which they administer jointly with the Sanger Institute, also in Hinxton) and the ASTD project that you mentioned in the first post.

That confusion aside, two points:

- How deeply do you want to go into alternative splicing? Note that the GTF file from Ensembl also contains information about all well-documented transcripts, i.e., it is usually all you need. Making use of this information is actually not that easy, but the new 'cufflinks' tool might help a lot.

- I'd expect that the GTF files from the ASTD project are very likely compatible with the coordinates of the Ensembl FASTA files, as both come from Hinxton.

I just had a look into one of the GTF files from ASTD. The features are annotated with Ensembl Gene IDs ("ENSG000..."), which looks promising. To make sure that the coordinates are consistent, you can simply compare the coordinates of a few features from the file with those of the same genes on the Ensembl web site.
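
For such a spot check, a small script can list the chromosome and coordinate range of each gene in a GTF file, so the values can be compared by hand with the Ensembl web site. The following is a minimal Python sketch. It assumes the standard nine tab-separated GTF columns and a `gene_id "..."` entry in the attribute column; whether the ASTD files follow this layout exactly is an assumption that should be checked against the actual file.

```python
import re
from typing import Dict, Iterable, Tuple

# Assumption: the gene ID is stored as gene_id "..." in the attribute column.
GENE_ID = re.compile(r'gene_id "([^"]+)"')


def gene_extents(lines: Iterable[str]) -> Dict[str, Tuple[str, int, int]]:
    """Map each gene ID to (seqname, smallest start, largest end)."""
    extents: Dict[str, Tuple[str, int, int]] = {}
    for line in lines:
        # Skip comment lines and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a complete GTF record
        match = GENE_ID.search(fields[8])
        if match is None:
            continue
        gene = match.group(1)
        seqname, start, end = fields[0], int(fields[3]), int(fields[4])
        if gene in extents:
            # Widen the range to cover every feature seen for this gene.
            _, lo, hi = extents[gene]
            extents[gene] = (seqname, min(lo, start), max(hi, end))
        else:
            extents[gene] = (seqname, start, end)
    return extents
```

Calling `gene_extents(open(path))` on a GTF file (the path is yours to fill in) and printing a handful of entries gives coordinates to look up on the Ensembl site. The range only spans the features present in the file, so it may be narrower than the gene range shown by Ensembl if some features are missing.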

However, the file also states:

# Datasources:
# ASTD release 1.1(15/02/2008)
# EnsEMBL homo_sapiens 41_36c

This might indicate an old data version. The current Ensembl version is 56, using Homo sapiens build GRCh37. Maybe this is for the previous build, NCBI36? Note the small link "View in archive site" at the bottom of the Ensembl home page, which allows you to access old versions of the data.
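
To check many annotation files at once, the data source line can be read programmatically. The Python sketch below only handles comment lines shaped like the `# EnsEMBL homo_sapiens 41_36c` line quoted above. Reading `41` as the Ensembl release and `36c` as an assembly tag is an assumption, in line with the guess above that the file refers to NCBI36.

```python
import re
from typing import Iterable, Optional, Tuple

# Matches comment lines shaped like: "# EnsEMBL homo_sapiens 41_36c"
SOURCE_LINE = re.compile(r"^#\s*EnsEMBL\s+(\S+)\s+(\d+)_(\w+)")


def parse_ensembl_source(lines: Iterable[str]) -> Optional[Tuple[str, int, str]]:
    """Return (species, release, assembly tag) from the header, if present."""
    for line in lines:
        if not line.startswith("#"):
            break  # the header comments are over
        match = SOURCE_LINE.match(line)
        if match:
            species, release, assembly = match.groups()
            # Assumption: the first number is the Ensembl release and the
            # remainder identifies the assembly version.
            return species, int(release), assembly
    return None
```

If the function returns `None`, the file either has no such header line or uses a different layout, and the version has to be determined some other way.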

Simon
02-24-2010, 04:00 PM   #6
iloveneworleans
Member
 
Location: new orleans

Join Date: Jun 2009
Posts: 12

Thanks very much, Simon!

I thought Ensembl was also an institute, like the European Bioinformatics Institute (EBI) and NCBI, but it is actually a joint project between EMBL-EBI and the Wellcome Trust Sanger Institute. That's why I was confused as well.

So the annotation files and FASTA format sequence files provided on the EBI website (http://www.ebi.ac.uk/astd/download.html) are actually the same as the releases on the Ensembl web site (http://uswest.ensembl.org/info/data/ftp/index.html).
The only difference is that the current release on the EBI website is based on the old Ensembl data version (41_36c) instead of the latest version (56).