  • Publicly available FASTA Database ?

    Hi, I'm very new to bioinformatics, and I'm working on DNA sequencing using aligners like BWA, BFAST, etc.

    My question is: is there a public source where I can download a FASTA database file to use as a reference sequence?
    or
    Is there someone here who can help me get access to a FASTA database file?

    Thanks in Advance...

  • #2
    You can find FASTA-formatted sequence files at:

    NCBI: ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/
    UCSC: ftp://hgdownload.cse.ucsc.edu/goldenPath/

    With the UCSC link, you will need to navigate down to the right directory for the genome that you are interested in.

    NCBI has the files grouped by BLAST database.



    • #3
      Thank you for your reply.

      I checked, downloaded the human_genomic file and extracted it, but the file inside does not have a .fasta or .fa extension.

      Do I need to do anything additional to convert it to a fasta file?
      Also, there is an md5 file for each database. What do we do with these md5 files?



      • #4
        I don't recall exactly where things are on the NCBI server, but have you looked here on the UCSC server? I'm pretty sure those would be hg19 reference sequence fasta files.

        The md5 files are probably checksums of the associated files, so you can make sure that the fasta (or whatever format) file isn't corrupted. You can read the Wikipedia article on MD5 for more details.
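
As a rough illustration of the checksum idea, the sketch below builds a small stand-in file, writes a checksum file for it, and verifies it with `md5sum -c`. It assumes GNU coreutils `md5sum` is installed; `demo.fa` and the temporary directory are hypothetical names, not files from the NCBI or UCSC servers.

```shell
# Hypothetical stand-in for a downloaded file; not a real NCBI/UCSC file.
workdir=$(mktemp -d)
printf '>demo\nACGTACGT\n' > "$workdir/demo.fa"

# A server would normally publish this checksum file next to the download.
(cd "$workdir" && md5sum demo.fa > demo.fa.md5)

# Verify: recompute the checksum and compare it with the recorded one.
md5_result=$(cd "$workdir" && md5sum -c demo.fa.md5)
echo "$md5_result"
```

If the file had been damaged in transfer, `md5sum -c` would report `FAILED` for it instead of `OK`.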



        • #5
          I've downloaded the fasta files and unzipped them.
          I indexed one of the fasta files and then ran the command below to align my fastq file (SRR035022_1.filt.fastq) to it.

          $ bwa aln ~/fasta/chr1.fa ~/datasets/SRR035022_1.filt.fastq > aln_sa.sai
          [bwa_aln] 17bp reads: max_diff = 2
          [bwa_aln] 38bp reads: max_diff = 3
          [bwa_aln] 64bp reads: max_diff = 4
          [bwa_aln] 93bp reads: max_diff = 5
          [bwa_aln] 124bp reads: max_diff = 6
          [bwa_aln] 157bp reads: max_diff = 7
          [bwa_aln] 190bp reads: max_diff = 8
          [bwa_aln] 225bp reads: max_diff = 9
          [bwa_seq_open] fail to open file '/home/ukursuncu/datasets/SRR035022_1.filt.fastq'. Abort!
          Aborted

          What am I doing wrong?



          • #6
            Does /home/ukursuncu/datasets/SRR035022_1.filt.fastq exist, and do you have read permission for it? Try typing the following and posting the output:
            ls -l /home/ukursuncu/datasets/SRR035022_1.filt.fastq
            I would guess that you just have a typo (happens to me all the time!).
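
The existence and permission checks suggested above can also be scripted. This is only a sketch: `check_readable` is a hypothetical helper (not part of bwa), and the file names are made-up stand-ins created in a temporary directory.

```shell
# check_readable is a hypothetical helper, not part of bwa.
check_readable() {
    if [ ! -e "$1" ]; then
        echo "missing: $1"
    elif [ ! -r "$1" ]; then
        echo "not readable: $1"
    else
        echo "ok: $1"
    fi
}

# Stand-in reads file so the example runs anywhere.
demo_dir=$(mktemp -d)
printf '@read1\nACGT\n+\nIIII\n' > "$demo_dir/reads.fastq"

check_readable "$demo_dir/reads.fastq"
check_readable "$demo_dir/raeds.fastq"   # deliberate typo in the name
```

Running the same check on the exact path passed to `bwa aln` is a quick way to rule out a typo or a permissions problem.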



            • #7
              The output is as below:

              -rw-r--r-- 1 ukursuncu ukursuncu 5571960039 2011-05-10 14:00 /home/ukursuncu/datasets/SRR035022_1.filt.fastq



              • #8
                Or can you suggest some resources where I can download fastq reads from?



                • #9
                  kursuni:

                  Reference sequences are not downloaded in fastq format. Sequence reads are generally in "fastq" format (and based on the file name you provided, yours seems to be). The reference sequence that you downloaded is a fasta-format file. It appears that you downloaded the right files and indexed them (can you send the output of ls -l chr1* from your home directory)?

                  What operating system are you running this on? Is it 32-bit or 64-bit? How much RAM do you have on your machine?
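
To make the fasta/fastq distinction above concrete, here is a small sketch that guesses a file's format from its first character ('>' for fasta, '@' for fastq) and counts reads in a fastq file. `guess_format` is a hypothetical helper and the two files are tiny stand-ins; the read count assumes the common four-lines-per-read fastq layout.

```shell
# guess_format is a hypothetical helper that looks only at the first character.
guess_format() {
    first_char=$(head -c 1 "$1")
    case "$first_char" in
        '>') echo "fasta" ;;
        '@') echo "fastq" ;;
        *)   echo "unknown" ;;
    esac
}

# Tiny stand-in files: one fasta record and one fastq read.
fmt_dir=$(mktemp -d)
printf '>chrDemo\nACGTACGT\n' > "$fmt_dir/ref.fa"
printf '@read1\nACGT\n+\nIIII\n' > "$fmt_dir/reads.fastq"

guess_format "$fmt_dir/ref.fa"
guess_format "$fmt_dir/reads.fastq"

# Assumes four lines per read (header, sequence, '+', qualities).
read_count=$(( $(wc -l < "$fmt_dir/reads.fastq") / 4 ))
echo "$read_count"
```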

                  Originally posted by kursuni View Post
                  Or Can you suggest some resources where I can download fastq reads from ?



                  • #10
                    Get the "chromFa.tar.gz" file from this link: ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/ (this is hg19, the newest assembly of the human genome). Once you unzip and untar the file, there will be several large fasta files (one per chromosome) with the extension ".fa". You can index the individual files, or concatenate them into a single "hg19.fa" that you can then index with bwa.

                    *chromFa.tar.gz - The assembly sequence in one file per chromosome.
                    Repeats from RepeatMasker and Tandem Repeats Finder (with period
                    of 12 or less) are shown in lower case; non-repeating sequence is
                    shown in upper case.
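
The unzip-and-untar step can be sketched as follows. To keep the example runnable without a large download, it first builds a tiny stand-in archive; `demo_chromFa.tar.gz` and its contents are hypothetical, and the real chromFa.tar.gz is far larger.

```shell
# Build a tiny stand-in archive with two made-up chromosome files.
tar_dir=$(mktemp -d)
mkdir "$tar_dir/src" "$tar_dir/out"
printf '>chr1\nACGT\n' > "$tar_dir/src/chr1.fa"
printf '>chr2\nTTGA\n' > "$tar_dir/src/chr2.fa"
tar -czf "$tar_dir/demo_chromFa.tar.gz" -C "$tar_dir/src" chr1.fa chr2.fa

# Unzip and untar in one step, then list the per-chromosome fasta files.
tar -xzf "$tar_dir/demo_chromFa.tar.gz" -C "$tar_dir/out"
ls "$tar_dir/out"
```

The `-z` option makes `tar` handle the gzip layer itself, so a separate `gunzip` step is not needed.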

                    MD5 files are used to check for integrity of transferred data (like a fingerprint). If you did not get any errors during the download then you can ignore the MD5 files for now to keep things simple.



                    • #11
                      Originally posted by kursuni View Post
                      Or Can you suggest some resources where I can download fastq reads from ?
                      If you want, you can generate simulated fastq files from your fasta file using software such as dwgsim. Just an alternative.

                      -A



                      • #12
                        Thank you for your reply.

                        I've already downloaded the fasta files from the source you gave above. The problem is not with the fasta file: I first index the fasta file and then try to align my fastq file to the reference fasta, but it returns the error shown in my previous posts.

                        By the way, how can we concatenate these files to make a single hg19.fa?



                        • #13
                          You can concatenate files with the "cat" command. At the command line, type:
                          man cat
                          for usage. You'd do well to find someone local familiar with unix. That would likely solve your problems.



                          • #14
                            If you can send answers to the questions in my post #9, we may be able to see what the problem is.

                            To concatenate, list the names of all the chromosome files in place of the parenthetical; this will make a single large multi-fasta file for hg19:
                            $ cat chr1.fa chr2.fa (names of all other chromosome files) chrM.fa > hg19.fa
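
A small runnable version of the concatenation, using three made-up chromosome files in a temporary directory (the sequences and the `hg19_demo.fa` name are hypothetical stand-ins). Counting header lines afterwards is a quick sanity check that every input made it into the combined file.

```shell
# Made-up stand-ins for the per-chromosome fasta files.
cat_dir=$(mktemp -d)
printf '>chr1\nACGT\n' > "$cat_dir/chr1.fa"
printf '>chr2\nTTGA\n' > "$cat_dir/chr2.fa"
printf '>chrM\nGGCC\n' > "$cat_dir/chrM.fa"

# List the files explicitly so the chromosome order is deliberate.
cat "$cat_dir/chr1.fa" "$cat_dir/chr2.fa" "$cat_dir/chrM.fa" > "$cat_dir/hg19_demo.fa"

# Each sequence contributes one '>' header line; three inputs, three headers.
seq_count=$(grep -c '^>' "$cat_dir/hg19_demo.fa")
echo "$seq_count"
```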



                            • #15
                              I ran the command above and it did create a single hg19.fa file, but when I try to index it, it gives the error below.

                              $ bwa index -p index_hg19 -a bwtsw -c /home/path/fasta/hg19.fa
                              [bwa_index] fail to open file '/home/path/fasta/hg19.fa'. Abort!
                              Aborted

                              What could be wrong?

                              Thanks in Advance...
