  • Generating unique sequence from a fasta file

    Hi,

    I am a computer programmer with absolutely negligible biology background working on an application framework for analyzing the human genome. Now I have access to the genome dataset from NCBI's FTP site.

    I have decided to use the GRCh38 encoded sequence files for the purposes of my application. However, since there are multiple overlapping sequences in these files pertaining to the individual chromosomes, I would like to extract the entire stretch with non-overlapping/unique sequences only.

    I need some guidance as to how I can proceed with this.

    Based on some preliminary research, I found that I can use the FASTX Toolkit for the tasks that I am looking to accomplish. However, I am not able to understand the purpose and function of the different tools, such as fasta_formatter or fastx_collapser, from the available documentation, so I cannot tell whether what I am doing is correct.
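
For readers unsure what those two tools do: as far as I understand the FASTX-Toolkit documentation, fastx_collapser merges identical sequences into a single record with a count, and fasta_formatter changes the width of the sequence lines. The Python sketch below only imitates those two ideas and is not the toolkit's code. The function names `collapse_identical` and `rewrap`, the case-folding step, and the sample records are all assumptions made for illustration.

```python
from collections import Counter


def collapse_identical(records):
    """Count exact-duplicate sequences in (name, sequence) pairs.

    Case is folded to upper case before comparing; that is a choice
    made for this sketch, not documented toolkit behaviour.
    Returns (sequence, count) pairs, most frequent first.
    """
    counts = Counter(seq.upper() for _name, seq in records)
    return counts.most_common()


def rewrap(sequence, width=60):
    """Split one sequence string into lines of at most `width` characters."""
    if width < 1:
        raise ValueError("width must be at least 1")
    return [sequence[i:i + width] for i in range(0, len(sequence), width)]


if __name__ == "__main__":
    # Hypothetical toy records, not taken from any real genome file.
    demo = [("r1", "ACGT"), ("r2", "acgt"), ("r3", "GGCC")]
    for seq, count in collapse_identical(demo):
        print(count, seq)
    print(rewrap("ACGTACGTAC", width=4))
```

Collapsing of this kind removes exact duplicates only. It does not detect partial overlaps between long sequences, which is the harder problem described in the question.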

  • #2
    This is kind of difficult... you could try a program like Minimus2, or Dedupe to get rid of redundancy, but I think it would be best to either use the whole genome, or else just use only the primary chromosomes (1-22, X, Y, M) and throw away all the little addenda and alt contigs.


    • #3
      Thank you for your reply. I am sorry, how can I consume the entire genome?
      What I have with me right now are the FASTA files that represent the sequences of the individual chromosomes 1-22, X, Y and M. Is there any documentation that you could point me to so that I can understand the jargon associated with such data (alt contigs and the like)?

      Thank You


      • #4
        If you have 25 files, you're fine. Just use those. The alt contigs are smaller files that are not really necessary in most cases. They represent differences that are present in some people.

        NCBI's FTP site does have some files that describe the contents of each directory, but they are a little hard to understand... I'm not really sure where to find a good resource describing the human genome files.

        Suffice it to say - a "typical" person should have DNA corresponding to the 25 files 1-22, X, M, and possibly Y, depending on sex. If you have more than 25 files, the remainder are more controversial - maybe only some people have them; or maybe everyone has them but it's not clear where they go in the chromosome. For most analyses it's safe to ignore them. I'd say it's best to ignore them unless you understand exactly what they are and how to use them properly, since using them when mapping, for example, can cause spurious multimapping of reads which gives you inferior results.


        • #5
          Thank you for your prompt reply. A confirmation from you about what those files represent makes things a lot clearer for me.

          I just have one question, though. These files contain multiple sequences, and my application logic consumes an entire file for processing rather than just a single sequence from the file. Can these sequences overlap?

          The first line of the file referencing chromosome 1 begins with
          >gi|568815364|ref|NT_077402.3| Homo sapiens chromosome 1 genomic scaffold, GRCh38 Primary Assembly HSCHR1_CTG1
          There are multiple entries like the one above, in the rest of the file. Each record begins with a descriptor like this and is then followed by a huge sequence.

          For the purpose of analysis, is it a sound idea to get rid of the descriptor entries and concatenate the sequences together? If the sequences in different descriptor records overlap, concatenating them will hinder my analysis. However, if there is no scope for overlap, it will make my life much easier with respect to programming.
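
Until the overlap question is settled, one option is to keep each record separate instead of concatenating. The Python sketch below reads a FASTA file record by record and reports each record's identifier and length, so the scaffold boundaries stay visible. The function names `read_fasta` and `summarize` are hypothetical, and the sketch assumes plain FASTA text where every record starts with a `>` descriptor line.

```python
import io


def read_fasta(handle):
    """Yield (descriptor, sequence) pairs from an open FASTA text handle."""
    descriptor, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if descriptor is not None:
                yield descriptor, "".join(chunks)
            descriptor, chunks = line[1:], []
        elif descriptor is not None:
            chunks.append(line)  # sequence lines are joined per record
    if descriptor is not None:
        yield descriptor, "".join(chunks)


def summarize(handle):
    """Return (record_id, length) for each record, in file order.

    The record id is taken as the first whitespace-delimited token of
    the descriptor line, which is a common convention but an assumption.
    """
    return [(desc.split()[0], len(seq)) for desc, seq in read_fasta(handle)]


if __name__ == "__main__":
    # The sequence letters here are made up; only the first descriptor
    # comes from the post above.
    demo = io.StringIO(
        ">gi|568815364|ref|NT_077402.3| Homo sapiens chromosome 1 genomic scaffold\n"
        "ACGT\nNNAC\n"
        ">second_record demo\n"
        "GG\n"
    )
    for record_id, length in summarize(demo):
        print(record_id, length)
```

For a real chromosome-sized file, `summarize` still holds one full record in memory at a time, so a streaming length count would be a better fit if memory is tight.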


          • #6
            You may have downloaded the wrong files.

            Go here:
            ftp://ftp.ncbi.nlm.nih.gov/genomes/H...romosomes/seq/

            And download these:

            ftp://ftp.ncbi.nlm.nih.gov/genomes/H....p2_chr1.fa.gz

            ...etc. Generally,

            ftp://ftp.ncbi.nlm.nih.gov/genomes/H....p2_chr*.fa.gz

            There are also other things in the directory like "hs_ref_GRCh38.p2_unplaced.fa.gz" and "hs_ref_GRCh38.p2_alts.fa.gz" and "hs_ref_GRCh38.p2_unlocalized.fa.gz". You can get those if you want.

            But you do not want any of the .mfa.gz files, or the ones that look like this:
            "hs_alt_CHM1_1.1_chr1.fa.gz".
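
The naming rules above can be sketched as a small filename filter in Python. This is only an illustration of the rule as stated in this post: keep per-chromosome `.fa.gz` files, and skip `.mfa.gz` files and anything starting with `hs_alt_`. The function name `is_wanted` and the `demo_` file names are hypothetical, and the glob pattern is an assumption about how the per-chromosome files are named.

```python
from fnmatch import fnmatchcase


def is_wanted(filename):
    """Return True for per-chromosome .fa.gz files worth downloading."""
    if filename.endswith(".mfa.gz"):
        return False  # the post says to avoid the .mfa.gz files
    if filename.startswith("hs_alt_"):
        return False  # e.g. "hs_alt_CHM1_1.1_chr1.fa.gz"
    # Assumed pattern: the name contains "_chr" and ends in ".fa.gz".
    return fnmatchcase(filename, "*_chr*.fa.gz")


if __name__ == "__main__":
    listing = [
        "demo_chr1.fa.gz",                  # hypothetical name
        "demo_chr1.mfa.gz",                 # hypothetical name
        "hs_alt_CHM1_1.1_chr1.fa.gz",       # quoted in the post
        "hs_ref_GRCh38.p2_unplaced.fa.gz",  # quoted in the post, optional
    ]
    for name in listing:
        print(name, is_wanted(name))
```

The unplaced, unlocalized, and alts files are rejected by this filter because they are optional according to the post; loosen the pattern if you decide you want them.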
