  • The one file to rule them all - ref genome

    This might be a simple question. But since I'm a molecular archaeologist, I'm years behind the bioinfo times (or it feels that way), and I'm hoping this forum will be a good place to start.

    I just got back a boatload of Illumina PE sequencing reads for a handful of species in one genus. In order to start with any analysis, I need a reference genome, in one, neat little file (ok maybe not 'little').

    There are FTP sites (specifically, Sanger and NCBI Genome) where I can access the sequence data for the three previously completed genomes in my genus of interest. But upon initial examination, each chromosome is represented by files with the following eleven extensions: *.asn *.faa *.ffn *.fna *.frn *.gbk *.gff *.ptt *.rnt *.rpt *.val.

    I know that faa, ffn, fna, etc. are all FASTA file formats with different types of information. Do I just need to cat the fna files for each chromosome?

    Simply put, how do I build the one file to rule them all? And is this how others have approached creating a reference genome file, for example to index with BWA?

    Any insight is appreciated!

  • #2
    You can "cat" the fasta formatted nucleotide sequence files to create a common "reference genome" file. This can be used for making the indexes.
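
The concatenation described above can also be sketched in Python. This is only an illustration, not a required tool: the function name `concat_fasta` and the file names in the demo are made up for the example. Unlike a plain `cat`, it also adds a newline when an input file does not end with one, so the next ">" header cannot end up glued to the previous sequence line.

```python
import os
import tempfile


def concat_fasta(fna_paths, out_path):
    """Concatenate FASTA files into one multi-FASTA file.

    Returns the number of ">" header lines (sequences) written.
    """
    n_records = 0
    with open(out_path, "w") as out:
        for path in fna_paths:
            last = "\n"  # treat an empty input file as already newline-terminated
            with open(path) as handle:
                for line in handle:
                    if line.startswith(">"):
                        n_records += 1
                    out.write(line)
                    last = line
            # Make sure the next file's header starts on its own line.
            if not last.endswith("\n"):
                out.write("\n")
    return n_records


if __name__ == "__main__":
    # Tiny made-up inputs standing in for per-chromosome .fna files.
    workdir = tempfile.mkdtemp()
    chr1 = os.path.join(workdir, "chr1.fna")
    chr2 = os.path.join(workdir, "chr2.fna")
    with open(chr1, "w") as fh:
        fh.write(">chr1\nACGT")  # note: no trailing newline
    with open(chr2, "w") as fh:
        fh.write(">chr2\nGGCC\n")
    reference = os.path.join(workdir, "reference.fa")
    print(concat_fasta([chr1, chr2], reference), "sequences written")
```

The resulting file is what you would then hand to the indexer, e.g. `bwa index reference.fa`.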

    • #3
      Thanks for the reply.

      Just to be clear on your response, "cat" only the .fna files for each chromosome, not any of the other fasta formatted sequence files, i.e. the .ffn with coding region info?

      Again, much appreciated. Relieved that it seems like a simple solution.

      • #4
        Originally posted by archgen
        Thanks for the reply.

        Just to be clear on your response, "cat" only the .fna files for each chromosome, not any of the other fasta formatted sequence files, i.e. the .ffn with coding region info?

        Again, much appreciated. Relieved that it seems like a simple solution.
        What you need is a simple "multi-fasta" formatted file: for every sequence, a ">fasta header" line followed by the sequence itself starting on the next line.
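
To check that a combined file really has that layout, a small reader along these lines may help. It is a sketch under assumptions: `fasta_lengths` is a name invented for this example, and it takes the sequence name to be the first whitespace-delimited word after ">", which is how aligners such as BWA usually name reference sequences.

```python
def fasta_lengths(path):
    """Return {sequence_name: sequence_length} for a multi-FASTA file.

    Raises ValueError if the file does not follow the
    ">header line, then sequence lines" layout.
    """
    lengths = {}
    name = None
    with open(path) as handle:
        for raw in handle:
            line = raw.strip()
            if not line:
                continue  # ignore blank lines
            if line.startswith(">"):
                fields = line[1:].split()
                if not fields:
                    raise ValueError("header line with no sequence name")
                name = fields[0]
                if name in lengths:
                    raise ValueError("duplicate sequence name: " + name)
                lengths[name] = 0
            elif name is None:
                raise ValueError("sequence data found before the first header")
            else:
                lengths[name] += len(line)
    return lengths
```

Calling `fasta_lengths("reference.fa")` on the combined file (the file name is hypothetical) should list one entry per chromosome; a missing or unexpectedly short entry suggests something went wrong during concatenation.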

        • #5
          Also, the repository often has a whole-genome file already available, which removes the need to cat the individual chromosome files.

          • #6
            Illumina have helpfully supplied iGenomes archives for some common species.
            These contain BWA and Bowtie indices, making alignment a walk in the park (even I can do it!). There's no need to deal with FASTA (although that data is also in the archive you download from the Illumina website).

            I think some of these files are also available on the Cufflinks page (http://cufflinks.cbcb.umd.edu/igenomes.html) if you don't have an Illumina login. They also contain RNA-Seq annotation, but you can just ignore that for genome assembly - the references are still there.

            • #7
              Sadly, I'm not working with any model organisms with well-known reference genomes. But it's good to know those sites exist for future projects.

              Thanks again for the feedback.
