  • Where can I find the complete FASTA format sequence (human and mouse)?

    On the EBI database website (http://www.ebi.ac.uk/astd/download.html), they only provide FASTA format sequences of all exons or transcripts for download.
    Does anybody know where I can find the complete FASTA format sequence (human and mouse) that matches "Feb 2008 Release 1.1"? I want to use the complete sequence as the reference genome for aligning RNA-seq data.
    Thanks in advance!

  • #2
    You can get all complete assemblies from Ensembl:

    http://www.ensembl.org/info/data/ftp/index.html

    ..or NCBI

    ftp://ftp.ncbi.nih.gov/genomes/

    You'll need to check the details of the exact assembly to use though. The description you included doesn't obviously match any human or mouse assembly - maybe you're looking at a description of an annotation set rather than an underlying assembly? Both of those sites will give you the latest assembly for each species by default.


    • #3
      Simon Andrews pointed out the right places to look.

      Three remarks on Ensembl's human FASTA files to save you the time of falling into these traps:

      - Do not use the repeat-masked sequences ("_rm" in the filenames). Judging which repeats are detrimental is better left to the aligner.

      - It seems convenient to download the file denoted "toplevel", as it contains all the other FASTA sequences in one big file. However, this means that all the MHC variants are included. If you feed this to the aligner, it will not realize that all these MHC sequences are variants of the _same_ region and will treat the region as repetitive. Better kick out the variant sequences before using the toplevel file, or download all the chromosome files individually and feed them all together to the aligner.

      - If you later use annotation, be sure to use the corresponding data, e.g., the GTF file from Ensembl. If you mix different assemblies, or maybe even NCBI's and Ensembl's representation of the same assembly build, the coordinates might not fit.
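
      The second remark (kicking the variant sequences out of the toplevel file) could be sketched roughly as below. This is only a minimal illustration, not a tested pipeline: the set of sequence names to keep and the record names in the demo input (e.g. "HSCHR6_MHC_COX") are assumptions, so check the actual header lines of your download before relying on it.

```python
import io

def filter_fasta(lines, keep_ids):
    """Yield only the lines of FASTA records whose ID is in keep_ids."""
    keeping = False
    for line in lines:
        if line.startswith(">"):
            # The record ID is the first whitespace-delimited token after '>'.
            fields = line[1:].split()
            keeping = bool(fields) and fields[0] in keep_ids
        if keeping:
            yield line

# Assumed names of the primary human sequences; adjust to your file.
PRIMARY = {str(i) for i in range(1, 23)} | {"X", "Y", "MT"}

# Made-up miniature input standing in for a toplevel FASTA file.
demo = io.StringIO(
    ">6 dna:chromosome\nACGT\n"
    ">HSCHR6_MHC_COX dna:chromosome\nGGGG\nCCCC\n"
    ">MT dna:chromosome\nTTTT\n"
)
kept = list(filter_fasta(demo, PRIMARY))
print("".join(kept), end="")
```

      For a real file, pass an open file handle instead of the StringIO object and write the yielded lines to a new FASTA file.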

      Simon


      • #4
        Thanks Simon Andrews and Simon Anders!

        From the Ensembl and NCBI FTP servers, we can get all complete assemblies. But I thought EBI might have its own complete assemblies to download. As Simon Anders said, if I align the RNA-seq data against complete assemblies downloaded from Ensembl or NCBI but use the annotation file (GTF file) from EBI, the coordinates might not fit.

        Although EBI provides FASTA sequence files and annotation files (GTF files) for download, the FASTA files contain only exons or transcripts rather than the complete sequence. I assume these exon and transcript FASTA files were extracted from a complete sequence file. Why doesn't EBI provide it for download? Or does EBI use the same complete assemblies as Ensembl or NCBI?


        • #5
          First of all: I got quite confused about what you mean by EBI. Note that the European Bioinformatics Institute (EBI, in Hinxton, Cambs., England) hosts a lot of data, including the whole EnsEMBL project (which they administer jointly with the Sanger Institute, also in Hinxton) and the ASTD project that you mentioned in the first post.

          That confusion aside, two points:

          - How deeply do you want to go into alternative splicing? Note that the GTF file from Ensembl also contains information about all well-documented transcripts, i.e., it is usually all you need. Making use of this information is actually not that easy, but the new 'cufflinks' tool might help a lot.

          - I'd suppose that you have very good chances that the GTF files from the ASTD project are compatible with the coordinates from the Ensembl FASTA files, as both come from Hinxton.

          I just had a look into one of the GTF files from ASTD. The features are annotated with Ensembl Gene IDs ("ENSG000..."), which looks promising. You can simply compare the coordinates of a few of the features from the file with the same genes on the Ensembl web site to make sure that the coordinates are consistent.
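
          One way to make that spot check less tedious is to collapse the GTF features into one span per gene and then look up a few of those genes on the Ensembl web site. The sketch below assumes the standard nine-column, tab-separated GTF layout with a `gene_id "..."` attribute; the records and gene IDs in the demo are made up, and I have not verified that the ASTD files follow this layout exactly.

```python
import re

# Assumes attributes are written as: gene_id "SOME_ID";
GENE_ID = re.compile(r'gene_id "([^"]+)"')

def gene_spans(gtf_lines):
    """Return {gene_id: (seqname, min_start, max_end)} over all features."""
    spans = {}
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue  # not a complete GTF record
        match = GENE_ID.search(cols[8])
        if match is None:
            continue
        gene = match.group(1)
        chrom, start, end = cols[0], int(cols[3]), int(cols[4])
        if gene in spans:
            _, lo, hi = spans[gene]
            spans[gene] = (chrom, min(lo, start), max(hi, end))
        else:
            spans[gene] = (chrom, start, end)
    return spans

# Made-up records for illustration only.
demo = [
    "# made-up records\n",
    '1\tdemo\texon\t100\t200\t.\t+\t.\tgene_id "ENSG_DEMO_1"; transcript_id "T1";\n',
    '1\tdemo\texon\t300\t450\t.\t+\t.\tgene_id "ENSG_DEMO_1"; transcript_id "T1";\n',
    '2\tdemo\texon\t50\t80\t.\t-\t.\tgene_id "ENSG_DEMO_2"; transcript_id "T2";\n',
]
spans = gene_spans(demo)
for gene, (chrom, start, end) in sorted(spans.items()):
    print(gene, chrom, start, end)
```

          The span from the outermost exons is only a rough proxy for the gene location shown on the web site, but if the assemblies differ the numbers are usually off by far more than that.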

          However, the file also states:

          # Datasources:
          # ASTD release 1.1(15/02/2008)
          # EnsEMBL homo_sapiens 41_36c

          This might indicate an old data version. The current Ensembl version is 56, using Homo sapiens build GRCh37. Maybe this is for the previous build, NCBI36? Note the small link "View in archive site" at the bottom of the Ensembl home page, which allows you to access old versions of the data.
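
          If you have many such files, the version string can be pulled out of the header automatically. The sketch below assumes that "41_36c" follows the usual Ensembl naming pattern of release number, underscore, assembly build number, and an optional letter; the function name and the returned field names are my own invention for illustration.

```python
import re

# Matches a header line such as: "# EnsEMBL homo_sapiens 41_36c"
PATTERN = re.compile(r"#\s*EnsEMBL\s+(\S+)\s+(\d+)_(\d+)([a-z]?)")

def parse_ensembl_source(line):
    """Split an 'EnsEMBL <species> <release>_<build><suffix>' header line.

    Returns None when the line does not match the assumed pattern.
    """
    match = PATTERN.match(line.strip())
    if match is None:
        return None
    species, release, build, suffix = match.groups()
    return {
        "species": species,
        "release": int(release),  # Ensembl release number (assumed meaning)
        "build": int(build),      # assembly build number (assumed meaning)
        "suffix": suffix,
    }

info = parse_ensembl_source("# EnsEMBL homo_sapiens 41_36c")
print(info)
```

          A build number of 36 would fit NCBI36, in which case the FASTA files should come from a matching release on the archive site rather than from the current release.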

          Simon


          • #6
            Thanks Simon very much!

            I thought Ensembl was also an institute like the European Bioinformatics Institute (EBI) and NCBI, but Ensembl is actually a joint project between EMBL-EBI and the Wellcome Trust Sanger Institute. That's why I was confused too.

            So the annotation file and FASTA format sequence file provided on the EBI website (http://www.ebi.ac.uk/astd/download.html) are the same as the corresponding releases on the Ensembl website (http://uswest.ensembl.org/info/data/ftp/index.html).
            The only difference is that the current release on the EBI website is based on an old Ensembl data version (41_36c) rather than the latest version (56).
