  • The one file to rule them all - ref genome

    This might be a simple question. But since I'm a molecular archaeologist, I'm years behind the bioinfo times (or it feels that way), and I'm hoping this forum will be a good place to start.

    I just got back a boatload of Illumina PE sequencing reads for a handful of species in one genus. In order to start with any analysis, I need a reference genome, in one, neat little file (ok maybe not 'little').

    There are FTP sites (specifically, Sanger and NCBI Genome) where I can access the sequence data for the three previously completed genomes in my genus of interest. But upon initial examination, each chromosome is represented by files with the following eleven extensions: *.asn *.faa *.ffn *.fna *.frn *.gbk *.gff *.ptt *.rnt *.rpt *.val.

    I know that faa, ffn, fna, etc. are all FASTA file formats with different types of information. Do I just need to cat the fna files for each chromosome?

    Simply, how do I build the one file to rule them all? And is this how others have approached creating a reference genome file, to index in BWA, for example?

    Any insight is appreciated!

  • #2
    You can "cat" the fasta formatted nucleotide sequence files to create a common "reference genome" file. This can be used for making the indexes.
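
For anyone who would rather script the step than run "cat" by hand, here is a minimal Python sketch of the same concatenation. The function name `build_reference` and the file names in the usage comment are assumptions made for illustration and are not from this thread. Unlike plain "cat", it checks that each input starts with a ">" header and that every file ends with a newline, so a header is never glued onto the previous sequence line.

```python
from pathlib import Path


def build_reference(fna_paths, out_path):
    """Concatenate per-chromosome nucleotide FASTA files into one multi-FASTA.

    Returns the number of '>' header lines written.
    """
    n_records = 0
    with open(out_path, "w") as out:
        for path in fna_paths:
            # Reads the whole file into memory; fine for small genomes.
            text = Path(path).read_text()
            if not text.startswith(">"):
                raise ValueError(f"{path} does not look like FASTA (no '>' header)")
            # Ensure a trailing newline so the next file's header starts
            # on its own line.
            if not text.endswith("\n"):
                text += "\n"
            out.write(text)
            n_records += sum(1 for line in text.splitlines() if line.startswith(">"))
    return n_records


# Hypothetical usage, assuming one .fna file per chromosome in ./genome/:
#   chromosomes = sorted(Path("genome").glob("*.fna"))
#   build_reference(chromosomes, "reference.fna")
```

The resulting file could then be indexed, for example with `bwa index reference.fna`.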



    • #3
      Thanks for the reply.

      Just to be clear on your response, "cat" only the .fna files for each chromosome, not any of the other fasta formatted sequence files, i.e. the .ffn with coding region info?

      Again, much appreciated. Relieved that it seems like a simple solution.



      • #4
        What you want is a simple "multi-fasta" formatted file: for each sequence, a ">fasta header" line followed by the sequence starting on the next line.
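
As a quick sanity check on the combined file, the multi-fasta layout described above can be verified with a short Python sketch. The function name `fasta_lengths` is hypothetical, and the script assumes a plain-text (uncompressed) FASTA file. It reports one entry per ">" header, keyed by the first whitespace-delimited token of the header, which is the part BWA uses as the reference name.

```python
def fasta_lengths(path):
    """Return {sequence_name: length} for every record in a multi-FASTA file."""
    lengths = {}
    name = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            if line.startswith(">"):
                fields = line[1:].split()
                if not fields:
                    raise ValueError("empty FASTA header")
                # The name is the first token after '>'.
                name = fields[0]
                if name in lengths:
                    raise ValueError(f"duplicate sequence name: {name}")
                lengths[name] = 0
            elif name is None:
                raise ValueError("sequence data found before any '>' header")
            else:
                lengths[name] += len(line)
    return lengths
```

If the number of entries matches the number of chromosomes you concatenated and no duplicate names are reported, the file is at least structurally ready for indexing.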



        • #5
          Also, the repository often has a whole-genome file already available, which removes the need to cat the individual chromosome files.



          • #6
            Illumina have helpfully supplied iGenomes archives for some common species.
            These contain BWA and Bowtie indices, making alignment a walk in the park (even I can do it!). There's no need to deal with FASTA (although that data is also in the archive you download from the Illumina website).

            I think some of these files are also available on the Cufflinks page (http://cufflinks.cbcb.umd.edu/igenomes.html) if you don't have an Illumina login. They also contain RNA-Seq annotation, but you can just ignore that for genome assembly - the references are still there.



            • #7
              Sadly, I'm not working with any model organisms with well-known reference genomes. But it's good to know those sites exist for future projects.

              Thanks again for the feedback.
