  • The one file to rule them all - ref genome

    This might be a simple question. But since I'm a molecular archaeologist, I'm years behind the bioinfo times (or it feels that way), and I'm hoping this forum will be a good place to start.

    I just got back a boatload of Illumina PE sequencing reads for a handful of species in one genus. In order to start with any analysis, I need a reference genome, in one, neat little file (ok maybe not 'little').

    There are FTP sites (specifically, Sanger and NCBI Genome) where I can access the sequence data for the three previously completed genomes in my genus of interest. But upon initial examination, each chromosome is represented by files with the following eleven extensions: *.asn *.faa *.ffn *.fna *.frn *.gbk *.gff *.ptt *.rnt *.rpt *.val.

    I know that faa, ffn, fna, etc. are all FASTA-format files with different types of information. Do I just need to cat the fna files for each chromosome?

    Simply put, how do I build the one file to rule them all? And is this how others have approached creating a reference genome file, for example to index with BWA?

    Any insight is appreciated!

  • #2
    You can "cat" the fasta formatted nucleotide sequence files to create a common "reference genome" file. This can be used for making the indexes.
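As an illustration of the "cat" step, here is a minimal Python sketch that does the same job as `cat chr*.fna > reference.fasta`. The function name `concatenate_fasta` and the file names in the usage note are hypothetical, not taken from the thread. The one thing it adds over a plain `cat` is a guard for input files that lack a trailing newline, so the next ">" header cannot end up glued to the end of the previous sequence.

```python
def concatenate_fasta(fna_paths, out_path, chunk_size=1 << 20):
    """Write the FASTA files in fna_paths to out_path, one after another.

    Files are copied in the order given and streamed in chunks, so whole
    chromosomes never need to fit in memory at once.
    """
    with open(out_path, "wb") as out:
        for path in fna_paths:
            last_byte = b"\n"  # an empty input file needs no extra newline
            with open(path, "rb") as src:
                while True:
                    chunk = src.read(chunk_size)
                    if not chunk:
                        break
                    out.write(chunk)
                    last_byte = chunk[-1:]
            # Make sure the next file's ">" header starts on its own line.
            if last_byte != b"\n":
                out.write(b"\n")
    return out_path
```

Assuming the per-chromosome files sit in a directory called `genome_dir`, a call such as `concatenate_fasta(sorted(Path("genome_dir").glob("*.fna")), "reference.fasta")` (with `from pathlib import Path`) would produce the combined file, which can then be indexed with `bwa index reference.fasta`. Giving the output a name that the `*.fna` pattern does not match avoids reading it back in as an input on a second run.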

    • #3
      Thanks for the reply.

      Just to be clear on your response: "cat" only the .fna files for each chromosome, and not any of the other FASTA-formatted sequence files, e.g. the .ffn files with coding region info?

      Again, much appreciated. Relieved that it seems like a simple solution.

      • #4
        Originally posted by archgen:
        Thanks for the reply.

        Just to be clear on your response, "cat" only the .fna files for each chromosome, not any of the other fasta formatted sequence files, i.e. the .ffn with coding region info?

        Again, much appreciated. Relieved that it seems like a simple solution.

        What you want is a simple "multi-fasta" formatted file: for every sequence, only a ">fasta header" line followed by the sequence itself, starting on the next line.
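To check that a combined file really has the multi-fasta layout described above, a small sketch like the following can list each record and its length. The function name `fasta_record_lengths` is hypothetical. It treats the first whitespace-delimited token after ">" as the sequence name, which is, as far as I know, how aligners such as BWA name reference sequences; take that as an assumption rather than something stated in the thread.

```python
def fasta_record_lengths(path):
    """Return {sequence_name: sequence_length} for a multi-FASTA file."""
    lengths = {}
    current = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            if line.startswith(">"):
                fields = line[1:].split()
                current = fields[0] if fields else ""
                if current in lengths:
                    raise ValueError(f"duplicate sequence name: {current!r}")
                lengths[current] = 0
            elif current is None:
                raise ValueError("sequence data found before the first header")
            else:
                lengths[current] += len(line)
    return lengths
```

Each chromosome should show up once, with a plausible length. A duplicate name usually means the same file was concatenated twice, and a gene-sized length usually means a .ffn file slipped in alongside the .fna files.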

        • #5
          Also, the repository often already has a whole-genome file available, which removes the need to cat the individual chromosome files.

          • #6
            Illumina have helpfully supplied iGenomes archives for some common species.
            These contain BWA and Bowtie indices, making alignment a walk in the park (even I can do it!). There's no need to deal with FASTA (although that data is also in the archive you download from the Illumina website).

            I think some of these files are also available on the Cufflinks page (http://cufflinks.cbcb.umd.edu/igenomes.html) if you don't have an Illumina login. They also contain RNA-Seq annotation, but you can just ignore that for genome assembly; the references are still there.

            • #7
              Sadly, I'm not working with any model organisms with well-known reference genomes. But it's good to know those sites exist for future projects.

              Thanks again for the feedback.
