  • kursuni
    Member
    • May 2011
    • 15

    Publicly available FASTA database?

    Hi, I'm very new to bioinformatics, and I'm working on DNA sequencing using tools such as BWA, BFAST, etc.

    My question is: is there a public source from which I can download a FASTA database file to use as a reference sequence?
    Or is there someone here who can help me get access to a FASTA database file?

    Thanks in advance...
  • GenoMax
    Senior Member
    • Feb 2008
    • 7142

    #2
    You can find fasta-formatted sequence files at:

    NCBI: ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/
    UCSC: ftp://hgdownload.cse.ucsc.edu/goldenPath/

    With the UCSC link, you will need to traverse down to the right directory for the genome that you are interested in.

    NCBI has the files grouped based on the blast databases.

    • kursuni
      Member
      • May 2011
      • 15

      #3
      Thank you for your reply.

      I downloaded the human_genomic file and extracted it, but the file inside does not have a .fasta or .fa extension.

      Do I need to do anything additional to convert it to a fasta file?
      Also, there is an md5 file for each database. What do we do with these md5 files?

      • dpryan
        Devon Ryan
        • Jul 2011
        • 3478

        #4
        I don't recall exactly where things are on the NCBI server, but have you looked here on the UCSC server? I'm pretty sure those would be hg19 reference sequence fasta files.

        The md5 files are probably checksums of the associated files, so you can ensure that the fasta (or whatever format) file isn't corrupted. You can read the Wikipedia article on MD5 for more details.
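
        A minimal sketch of such a checksum comparison, assuming GNU coreutils `md5sum` is installed and that the md5 file uses the usual "checksum, two spaces, file name" layout. The file names `example.fa` and `example.fa.md5` are placeholders created by the sketch itself, not files from either server:

        ```shell
        # Sketch: verify a file against its .md5 companion (placeholder names).
        workdir=$(mktemp -d)

        # Stand-in for a downloaded file; in practice this comes from the server.
        printf '>chrTest\nACGTACGT\n' > "$workdir/example.fa"

        # Stand-in for the published checksum file.
        (cd "$workdir" && md5sum example.fa > example.fa.md5)

        # Recompute the checksum and compare it with the recorded one.
        # md5sum -c exits 0 and reports OK when the two match.
        check_result=$(cd "$workdir" && md5sum -c example.fa.md5)
        echo "$check_result"
        ```

        With a real download, only the `md5sum -c` step is needed, run from the directory that holds both the data file and its md5 file.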

        • kursuni
          Member
          • May 2011
          • 15

          #5
          I've downloaded the fasta files and unzipped them.
          I indexed one of the fasta files and then ran the command below to align my fastq file (SRR035022_1.filt.fastq) to the fasta.

          $ bwa aln ~/fasta/chr1.fa ~/datasets/SRR035022_1.filt.fastq > aln_sa.sai
          [bwa_aln] 17bp reads: max_diff = 2
          [bwa_aln] 38bp reads: max_diff = 3
          [bwa_aln] 64bp reads: max_diff = 4
          [bwa_aln] 93bp reads: max_diff = 5
          [bwa_aln] 124bp reads: max_diff = 6
          [bwa_aln] 157bp reads: max_diff = 7
          [bwa_aln] 190bp reads: max_diff = 8
          [bwa_aln] 225bp reads: max_diff = 9
          [bwa_seq_open] fail to open file '/home/ukursuncu/datasets/SRR035022_1.filt.fastq'. Abort!
          Aborted

          What am I doing wrong?

          • dpryan
            Devon Ryan
            • Jul 2011
            • 3478

            #6
            Does /home/ukursuncu/datasets/SRR035022_1.filt.fastq exist, and do you have read permission for it? Try typing the following and posting the output:
            ls -l /home/ukursuncu/datasets/SRR035022_1.filt.fastq
            I would guess that you just have a typo (happens to me all the time!).
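
            The same check can be scripted. This is only a sketch: `check_readable` is a hypothetical helper name, and the demonstration runs on a temporary stand-in file rather than the real dataset:

            ```shell
            # Sketch: report whether a file exists and is readable (hypothetical helper).
            check_readable() {
                file=$1
                if [ ! -e "$file" ]; then
                    echo "missing: $file"
                elif [ ! -r "$file" ]; then
                    echo "not readable: $file"
                else
                    # wc -c reports the size in bytes; tr strips padding some systems add.
                    size=$(wc -c < "$file" | tr -d ' ')
                    echo "ok: $file ($size bytes)"
                fi
            }

            # Demonstration on a temporary stand-in file.
            demo=$(mktemp)
            printf '@read1\nACGT\n+\nIIII\n' > "$demo"
            check_readable "$demo"
            check_readable "$demo.does-not-exist"
            rm -f "$demo"
            ```

            Calling `check_readable` with the full path from the error message separates a typo (missing) from a permissions problem (not readable).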

            • kursuni
              Member
              • May 2011
              • 15

              #7
              The output is as below:

              -rw-r--r-- 1 ukursuncu ukursuncu 5571960039 2011-05-10 14:00 /home/ukursuncu/datasets/SRR035022_1.filt.fastq

              • kursuni
                Member
                • May 2011
                • 15

                #8
                Or can you suggest some resources where I can download fastq reads?

                • GenoMax
                  Senior Member
                  • Feb 2008
                  • 7142

                  #9
                  kursuni:

                  One does not download reference sequences in fastq format. Your sequence (reads) file is generally in "fastq" format, and judging by the name you have provided, yours seems to be. The reference sequence that you downloaded is a fasta format file. It appears that you downloaded the right files and indexed them. Can you send the output of ls -l chr1* from your home directory?

                  What operating system are you running this on? Is it 32-bit or 64-bit? How much RAM do you have on your machine?
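
                  One way to collect those details from the command line is sketched below. It assumes a Unix-like system with `uname` and `getconf`; the memory lookup relies on /proc/meminfo, which is Linux-specific, so it is guarded:

                  ```shell
                  # Sketch: gather OS, architecture, word size and total RAM.
                  os_name=$(uname -s)            # kernel name, e.g. Linux or Darwin
                  machine=$(uname -m)            # hardware architecture, e.g. x86_64 or i686
                  word_size=$(getconf LONG_BIT)  # 32 or 64

                  echo "OS: $os_name"
                  echo "Architecture: $machine"
                  echo "Word size: $word_size-bit"

                  # Total RAM: /proc/meminfo only exists on Linux, so check first.
                  if [ -r /proc/meminfo ]; then
                      grep MemTotal /proc/meminfo
                  else
                      echo "MemTotal: unavailable (no /proc/meminfo on this system)"
                  fi
                  ```

                  These details can matter because, for example, a 32-bit program built without large-file support cannot open files larger than 2 GB.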

                  • GenoMax
                    Senior Member
                    • Feb 2008
                    • 7142

                    #10
                    Get the "chromFa.tar.gz" file from this link: ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/ (this is for hg19, the newest assembly of the human genome). Once you unzip and untar the file, there will be several large fasta files (one per chromosome) with the extension ".fa". You can index the individual files, or concatenate them into a single "hg19.fa" that you can then index with bwa.

                    *chromFa.tar.gz - The assembly sequence in one file per chromosome.
                    Repeats from RepeatMasker and Tandem Repeats Finder (with period
                    of 12 or less) are shown in lower case; non-repeating sequence is
                    shown in upper case.
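
                    A rough sketch of the unzip-and-untar step, assuming a `tar` that supports the `-z` (gzip) option. To stay runnable without a network connection, it builds a tiny stand-in archive named like the real one; the directory names `src` and `fasta` are made up for the example, and it does not show how the real archive is laid out inside:

                    ```shell
                    # Sketch: unpack a chromFa-style tarball (stand-in archive, not the real download).
                    workdir=$(mktemp -d)

                    # Build a tiny stand-in archive with two fake chromosome files.
                    mkdir "$workdir/src"
                    printf '>chr1\nACGT\n' > "$workdir/src/chr1.fa"
                    printf '>chrM\nTTAA\n' > "$workdir/src/chrM.fa"
                    tar -czf "$workdir/chromFa.tar.gz" -C "$workdir/src" chr1.fa chrM.fa

                    # -x extract, -z decompress gzip, -f archive file, -C directory to extract into.
                    mkdir "$workdir/fasta"
                    tar -xzf "$workdir/chromFa.tar.gz" -C "$workdir/fasta"

                    # Count the extracted .fa files.
                    n_files=$(ls "$workdir/fasta"/*.fa | wc -l | tr -d ' ')
                    echo "extracted $n_files fasta files"
                    ```

                    For the real download, only the `tar -xzf` line is needed, pointed at the downloaded archive and a directory of your choice.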

                    MD5 files are used to check for integrity of transferred data (like a fingerprint). If you did not get any errors during the download then you can ignore the MD5 files for now to keep things simple.

                    • arkal
                      advancing one byte at a time!
                      • Jun 2011
                      • 56

                      #11
                      If you want fastq reads to experiment with, you can generate simulated fastq files from your fasta file using software such as dwgsim. Just an alternative.

                      -A

                      • kursuni
                        Member
                        • May 2011
                        • 15

                        #12
                        Thank you for your reply.

                        I've already downloaded the fasta files from the source you gave above. The problem is not with the fasta file: I first index the fasta file and then try to align my fastq file to the reference fasta, but it returns the error shown in my previous posts.

                        By the way, how can we concatenate these files to make a single hg19.fa?

                        • dpryan
                          Devon Ryan
                          • Jul 2011
                          • 3478

                          #13
                          You can concatenate files with the "cat" command. At the command line, type:
                          man cat
                          for usage. You'd do well to find someone local familiar with unix. That would likely solve your problems.

                          • GenoMax
                            Senior Member
                            • Feb 2008
                            • 7142

                            #14
                            If you can answer the questions in my post #9, we may be able to see what the problem is.

                            $ cat chr1.fa chr2.fa ... chrM.fa > hg19.fa
                            Type the names of all the chromosome files in place of the "..."; this will make a single large multi-fasta file for hg19.
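
                            A small sketch of the same idea with a sanity check afterwards. The three input files are tiny stand-ins written by the sketch, not real chromosome sequence, and the output only borrows the hg19.fa name:

                            ```shell
                            # Sketch: concatenate per-chromosome fasta files and count the records.
                            workdir=$(mktemp -d)

                            # Tiny stand-ins for chr1.fa, chr2.fa and chrM.fa.
                            printf '>chr1\nACGT\n' > "$workdir/chr1.fa"
                            printf '>chr2\nGGCC\n' > "$workdir/chr2.fa"
                            printf '>chrM\nTTAA\n' > "$workdir/chrM.fa"

                            # List the files explicitly so the chromosome order is under your control.
                            cat "$workdir/chr1.fa" "$workdir/chr2.fa" "$workdir/chrM.fa" > "$workdir/hg19.fa"

                            # Each fasta record starts with '>', so counting those lines counts sequences.
                            n_records=$(grep -c '^>' "$workdir/hg19.fa")
                            echo "records in hg19.fa: $n_records"
                            ```

                            If the record count differs from the number of input files, a file was probably left out of the cat command or listed twice.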

                            • kursuni
                              Member
                              • May 2011
                              • 15

                              #15
                              I ran the command above and it did create a single hg19.fa file, but when I try to index it, it gives the error below.

                              $ bwa index -p index_hg19 -a bwtsw -c /home/path/fasta/hg19.fa
                              [bwa_index] fail to open file '/home/path/fasta/hg19.fa'. Abort!
                              Aborted

                              What could be wrong?

                              Thanks in advance...
