  • Generating unique sequence from a fasta file

    Hi,

    I am a computer programmer with almost no biology background, working on an application framework for analyzing the human genome. I now have access to the genome dataset from NCBI's FTP site.

    I have decided to use the GRCh38 sequence files for my application. However, since these files contain multiple overlapping sequences for each chromosome, I would like to extract the entire stretch using only the non-overlapping (unique) sequences.

    I need some guidance as to how I can proceed with this.

    Based on some preliminary research, I found that I can use the FASTX Toolkit for what I am trying to accomplish. However, the available documentation does not make clear to me what tools like fasta_formatter or fastx_collapser are for, so I cannot tell whether what I am doing is correct.

  • #2
    This is kind of difficult... you could try a program like Minimus2, or Dedupe to get rid of redundancy, but I think it would be best to either use the whole genome, or else just use only the primary chromosomes (1-22, X, Y, M) and throw away all the little addenda and alt contigs.
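
To make "getting rid of redundancy" concrete, here is a minimal Python sketch of the simplest case only: dropping records whose sequences are exactly identical. It is not how Minimus2 or Dedupe work internally, and it does nothing about partial overlaps between sequences. The function names `parse_fasta` and `drop_exact_duplicates` and the example records are made up for illustration.

```python
from typing import Iterable, Iterator, List, Set, Tuple

Record = Tuple[str, str]  # (header without the leading '>', sequence)


def parse_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs from the lines of a FASTA file."""
    header = None
    chunks: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def drop_exact_duplicates(records: Iterable[Record]) -> Iterator[Record]:
    """Keep the first record for each distinct sequence (case-insensitive)."""
    seen: Set[str] = set()
    for header, seq in records:
        key = seq.upper()
        if key in seen:
            continue
        seen.add(key)
        yield header, seq


# Hypothetical toy input: record "b" repeats the sequence of record "a".
example = """>a
ACGT
ACGT
>b
acgtacgt
>c
TTTT
""".splitlines()

unique = list(drop_exact_duplicates(parse_fasta(example)))
for header, seq in unique:
    print(header, len(seq))
```

Keeping every distinct sequence in memory is fine for small inputs but not for whole chromosomes, which is one more reason to prefer the advice above about choosing the right files over deduplicating after the fact.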

    • #3
      Originally posted by Brian Bushnell
      This is kind of difficult... you could try a program like Minimus2, or Dedupe to get rid of redundancy, but I think it would be best to either use the whole genome, or else just use only the primary chromosomes (1-22, X, Y, M) and throw away all the little addenda and alt contigs.
      Thank you for your reply. I am sorry, how can I consume the entire genome?
      What I have right now are the FASTA files that represent the sequences of the individual chromosomes 1-22, X, Y and M. Is there any documentation you could point me to so that I can understand the jargon associated with such data (alt contigs and the like)?

      Thank You

      • #4
        If you have 25 files, you're fine. Just use those. The alt contigs are smaller files that are not really necessary in most cases. They represent differences that are present in some people.

        NCBI's FTP site does have some files that describe the contents of each directory, but they are a little hard to understand... I'm not really sure where a good resource is describing the human genome files.

        Suffice to say - a "typical" person should have DNA corresponding to the 25 files 1-22, X, M, and possibly Y, depending on gender. If you have more than 25 files, the remainder are more controversial - maybe only some people have them; or maybe everyone has them but it's not clear where they go in the chromosome. For most analyses it's safe to ignore them. I'd say it's best to ignore them unless you understand exactly what they are and how to use them properly, since using them when mapping, for example, can cause spurious multimapping of reads which gives you inferior results.
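
A quick way to check the "25 files" rule of thumb is to compare the chromosome labels you actually have against the expected set. This is only a sketch: the labels follow the naming used in this post (1-22, X, Y, M), and how you extract a label from your own file names is up to you. The function name `missing_chromosomes` is made up for illustration.

```python
from typing import Iterable, List

# The 25 labels mentioned above: autosomes 1-22 plus X, Y and M.
EXPECTED: List[str] = [str(n) for n in range(1, 23)] + ["X", "Y", "M"]


def missing_chromosomes(found: Iterable[str]) -> List[str]:
    """Return the expected labels that are absent from `found`, in order."""
    have = set(found)
    return [label for label in EXPECTED if label not in have]


# Hypothetical example: only three chromosomes downloaded so far.
print(len(EXPECTED))
print(missing_chromosomes(["1", "2", "X"])[:3])
```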

        • #5
          Originally posted by Brian Bushnell
          If you have 25 files, you're fine. Just use those. The alt contigs are smaller files that are not really necessary in most cases. They represent differences that are present in some people.

          NCBI's FTP site does have some files that describe the contents of each directory, but they are a little hard to understand... I'm not really sure where a good resource is describing the human genome files.

          Suffice to say - a "typical" person should have DNA corresponding to the 25 files 1-22, X, M, and possibly Y, depending on gender. If you have more than 25 files, the remainder are more controversial - maybe only some people have them; or maybe everyone has them but it's not clear where they go in the chromosome. For most analyses it's safe to ignore them. I'd say it's best to ignore them unless you understand exactly what they are and how to use them properly, since using them when mapping, for example, can cause spurious multimapping of reads which gives you inferior results.
          Thank you for your prompt reply. Your confirmation of what those files represent makes things a lot clearer for me.

          I just have one question, though. These files contain multiple sequences, and my application logic consumes an entire file at a time rather than a single sequence from the file. Can these sequences overlap?

          The first line of the file referencing chromosome 1 begins with
          >gi|568815364|ref|NT_077402.3| Homo sapiens chromosome 1 genomic scaffold, GRCh38 Primary Assembly HSCHR1_CTG1
          There are multiple entries like the one above in the rest of the file. Each record begins with a descriptor line like this, followed by a huge sequence.

          For the purpose of analysis, is it a sound idea to get rid of the descriptor lines and concatenate the sequences together? If the sequences in different records overlap, that will hinder my analysis. However, if there is no scope for overlap, it will make my life so much easier with respect to programming.
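
Before deciding whether to concatenate, it can help to see how many records a file holds and how long each one is, without loading the sequences into memory. The sketch below is one way to do that in Python; `fasta_record_lengths` and the two example records are hypothetical, and the code says nothing about whether the records overlap, only how they are laid out.

```python
from typing import Iterable, List, Tuple


def fasta_record_lengths(lines: Iterable[str]) -> List[Tuple[str, int]]:
    """Return (descriptor, sequence length) for each record, in file order."""
    lengths: List[Tuple[str, int]] = []
    header = None
    count = 0
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # A descriptor line ends the previous record and starts a new one.
            if header is not None:
                lengths.append((header, count))
            header, count = line[1:], 0
        elif header is not None:
            count += len(line)
    if header is not None:
        lengths.append((header, count))
    return lengths


# Hypothetical toy input standing in for a multi-record chromosome file.
example = [
    ">scaffold_1 hypothetical record",
    "ACGTACGT",
    "ACGT",
    ">scaffold_2 hypothetical record",
    "NNNNAC",
]

summary = fasta_record_lengths(example)
for header, length in summary:
    print(f"{length:>6}  {header}")
```

For a real file, the same function can be fed an open file handle, for example `gzip.open(path, "rt")` from the standard library for a `.fa.gz` file. Note that stripping the descriptors and joining records end to end would place bases next to each other that are not adjacent in any single record.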

          • #6
            You may have downloaded the wrong files.

            Go here:
            ftp://ftp.ncbi.nlm.nih.gov/genomes/H...romosomes/seq/

            And download these:

            ftp://ftp.ncbi.nlm.nih.gov/genomes/H....p2_chr1.fa.gz

            ...etc. Generally,

            ftp://ftp.ncbi.nlm.nih.gov/genomes/H....p2_chr*.fa.gz

            There are also other things in the directory like "hs_ref_GRCh38.p2_unplaced.fa.gz" and "hs_ref_GRCh38.p2_alts.fa.gz" and "hs_ref_GRCh38.p2_unlocalized.fa.gz". You can get those if you want.

            But you do not want any of the .mfa.gz files, or the ones that look like this:
            "hs_alt_CHM1_1.1_chr1.fa.gz".
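
The selection rule above can be written down as a small filter over a directory listing. This is a sketch under assumptions: the per-chromosome names in the example list are guessed from the prefix of the "unplaced" and "alts" files quoted above (the URLs in this post are truncated, so check the real listing), and the helper `select_reference_files` is made up for illustration.

```python
from fnmatch import fnmatchcase
from typing import Iterable, List


def select_reference_files(names: Iterable[str]) -> List[str]:
    """Keep per-chromosome reference files; skip .mfa.gz and hs_alt_ files."""
    keep: List[str] = []
    for name in names:
        if name.endswith(".mfa.gz"):
            continue  # the post says these are not wanted
        if name.startswith("hs_alt_"):
            continue  # alternate assemblies such as CHM1
        if fnmatchcase(name, "hs_ref_*_chr*.fa.gz"):
            keep.append(name)
    return keep


# Example listing. The first three names are ASSUMED, not copied from the site.
listing = [
    "hs_ref_GRCh38.p2_chr1.fa.gz",
    "hs_ref_GRCh38.p2_chrX.fa.gz",
    "hs_ref_GRCh38.p2_chr1.mfa.gz",
    "hs_ref_GRCh38.p2_unplaced.fa.gz",
    "hs_ref_GRCh38.p2_alts.fa.gz",
    "hs_alt_CHM1_1.1_chr1.fa.gz",
]

selected = select_reference_files(listing)
print(selected)
```

The unplaced, unlocalized and alts files are left out here on purpose, in line with the earlier advice to ignore them unless you know how to use them; add them to the pattern if you want them.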
