
  • Where can I find the complete FASTA-format sequence (human and mouse)?

    On the EBI database website (http://www.ebi.ac.uk/astd/download.html), only the FASTA-format sequences of all exons or transcripts are offered for download.
    Does anybody know where I can find the complete FASTA-format genome sequence (human and mouse) that matches "Feb 2008 Release 1.1"? I want to use the complete sequence as the reference genome for aligning RNA-seq data.
    Thanks in advance!

  • #2
    You can get all complete assemblies from Ensembl:

    http://www.ensembl.org/info/data/ftp/index.html

    ..or NCBI

    ftp://ftp.ncbi.nih.gov/genomes/

    You'll need to check exactly which assembly to use, though. The description you included doesn't obviously match any human or mouse assembly. Maybe you're looking at the description of an annotation set rather than of an underlying assembly? Both of those sites will give you the latest assembly for each species by default.



    • #3
      Simon Andrews pointed out the right places to look.

      Three remarks on Ensembl's human FASTA files, to save you from falling into these traps:

      - Do not use the repeat-masked sequences ("_rm" in the filenames). Judging which repeats are detrimental is better left to the aligner.

      - It seems convenient to download the file denoted "toplevel", as it contains all the other FASTA sequences in one big file. However, this means that all the MHC variants are included. If you feed this to the aligner, it will not realize that all these MHC sequences are variants of the _same_ region and will treat the region as repetitive. It is better to remove the variant sequences before using the toplevel file, or to download all the chromosome files individually and feed them to the aligner together.

      - If you later use annotation, be sure to use the corresponding data, e.g., the GTF file from Ensembl. If you mix different assemblies, or maybe even NCBI's and Ensembl's representation of the same assembly build, the coordinates might not fit.
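The second remark (dropping the variant sequences from a combined FASTA file) could be sketched roughly as follows. This is only an illustration under assumptions: the record name is taken to be the first token after `>`, and the allowlist `PRIMARY` as well as the demo headers (including the variant name) are made up for the example, so check the actual headers in your download before relying on it.

```python
import io

def filter_fasta(lines, keep_names):
    """Yield only the lines of FASTA records whose name is in keep_names.

    Assumption: the record name is the first whitespace-delimited token
    after '>' in the header line.
    """
    keep = False
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            name = fields[0] if fields else ""
            keep = name in keep_names
        if keep:
            yield line

# Hypothetical allowlist: primary human chromosomes, named as in the file.
PRIMARY = {str(i) for i in range(1, 23)} | {"X", "Y", "MT"}

# Tiny made-up input; real files are read the same way, line by line.
demo = io.StringIO(
    ">6 primary chromosome 6\nACGT\nACGT\n"
    ">c6_COX made-up name for an MHC variant record\nTTTT\n"
    ">X primary chromosome X\nGGGG\n"
)
filtered = "".join(filter_fasta(demo, PRIMARY))
print(filtered)
```

Because the function works on any iterable of lines, an open file handle can be passed in place of `demo`, and the output lines written straight to a new file without loading the genome into memory.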

      Simon



      • #4
        Thanks Simon Andrews and Simon Anders!

        From the Ensembl and NCBI FTP servers, we can get all complete assemblies. But I think EBI might have its own complete assemblies to download. As Simon Anders said, if I align the RNA-seq data against the complete assemblies downloaded from Ensembl or NCBI and use the annotation file (GTF file) from EBI, then the coordinates might not fit.

        Although EBI provides FASTA sequence files and annotation files (GTF files) for download, the FASTA files contain only the exons or transcripts, not the complete sequence. I think these exon and transcript sequences must have been extracted from a complete sequence file. Why doesn't EBI provide that file for download? Or is EBI using the same complete assemblies as Ensembl or NCBI?



        • #5
          First of all: I got quite confused about what you mean by EBI. Note that the European Bioinformatics Institute (EBI, in Hinxton, Cambs., England) hosts a lot of data, including the whole EnsEMBL project (which it administers jointly with the Sanger Institute, also in Hinxton) and the ASTD project that you mentioned in the first post.

          That confusion aside, two points:

          - How deeply do you want to go into alternative splicing? Note that the GTF file from Ensembl also contains information about all well-documented transcripts, i.e., it is usually all you need. Making use of this information is actually not that easy, but the new 'cufflinks' tool might help a lot.

          - I'd expect there is a very good chance that the GTF files from the ASTD project are compatible with the coordinates of the Ensembl FASTA files, as both come from Hinxton.

          I just had a look at one of the GTF files from ASTD. The features are annotated with Ensembl Gene IDs ("ENSG000..."), which looks promising. You can simply compare the coordinates of a few features from the file with those of the same genes on the Ensembl web site to make sure that the coordinates are consistent.
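To make such a spot check easier, one could collapse the features of each gene into a single span and compare that with what the Ensembl web site shows. The sketch below assumes the usual nine tab-separated GTF columns and a `gene_id "..."` attribute in the last column; whether the ASTD files follow exactly this layout is an assumption, and the demo lines use invented IDs and coordinates.

```python
import re

GENE_ID = re.compile(r'gene_id "([^"]+)"')

def gene_spans(gtf_lines):
    """Map each gene_id to (chromosome, smallest start, largest end)."""
    spans = {}
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue  # not a complete GTF record
        match = GENE_ID.search(cols[8])
        if match is None:
            continue
        gene, chrom = match.group(1), cols[0]
        start, end = int(cols[3]), int(cols[4])
        if gene in spans:
            _, old_start, old_end = spans[gene]
            spans[gene] = (chrom, min(old_start, start), max(old_end, end))
        else:
            spans[gene] = (chrom, start, end)
    return spans

# Invented demo records (IDs and coordinates are not real data).
demo = [
    "# comment line\n",
    '6\tASTD\texon\t100\t200\t.\t+\t.\tgene_id "ENSG_DEMO1"; transcript_id "T1";\n',
    '6\tASTD\texon\t350\t500\t.\t+\t.\tgene_id "ENSG_DEMO1"; transcript_id "T1";\n',
    'X\tASTD\texon\t10\t20\t.\t-\t.\tgene_id "ENSG_DEMO2"; transcript_id "T2";\n',
]
spans = gene_spans(demo)
print(spans)
```

Note that an annotation set with extra alternative transcripts may give a slightly wider span than the gene boundaries shown by Ensembl, so small differences at the ends are less alarming than a shift of the whole gene.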

          However, the file also states:

          # Datasources:
          # ASTD release 1.1(15/02/2008)
          # EnsEMBL homo_sapiens 41_36c

          This might indicate an old data version. The current Ensembl version is 56, using Homo sapiens build GRCh37. Maybe this is for the previous build, NCBI36? Note the small link "View in archive site" at the bottom of the Ensembl home page, which allows you to access old versions of the data.
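If you handle several annotation files, the header can be checked programmatically. A minimal sketch, assuming the provenance is given in leading `#` lines as in the excerpt above (the data line in the demo is invented):

```python
def header_comments(lines):
    """Collect leading '#' comment lines, stopping at the first other line."""
    comments = []
    for line in lines:
        if not line.startswith("#"):
            break
        comments.append(line.lstrip("#").strip())
    return comments

demo = [
    "# Datasources:\n",
    "# ASTD release 1.1(15/02/2008)\n",
    "# EnsEMBL homo_sapiens 41_36c\n",
    '6\tASTD\texon\t100\t200\t.\t+\t.\tgene_id "ENSG_DEMO";\n',  # invented
]
header = header_comments(demo)
# Pick out the line naming the Ensembl release the annotation was built on.
ensembl_lines = [h for h in header if h.lower().startswith("ensembl")]
print(ensembl_lines)
```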

          Simon



          • #6
            Thanks very much, Simon!

            I thought Ensembl was also an institute, like the European Bioinformatics Institute (EBI) and NCBI; actually, Ensembl is a joint project between EMBL-EBI and the Wellcome Trust Sanger Institute. That's why I was confused as well.

            So the annotation files and FASTA-format sequence files provided on the EBI website (http://www.ebi.ac.uk/astd/download.html) are actually the same as the releases on the Ensembl web site (http://uswest.ensembl.org/info/data/ftp/index.html).
            The only difference is that the current release on the EBI website is based on an old Ensembl data version (41_36c) rather than the latest version (56).
