SEQanswers

Old 09-24-2011, 08:58 AM   #1
kursuni
Member
 
Location: Hoboken, NJ

Join Date: May 2011
Posts: 15
Publicly available FASTA Database?

Hi, I'm very new to bioinformatics, and I'm working on DNA sequencing data using algorithms such as BWA and BFAST.

My question is: is there a public source where I can download a FASTA database file to use as a reference sequence?
Or is there someone here who can help me get access to a FASTA database file?

Thanks in advance.
Old 09-24-2011, 11:51 AM   #2
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,049

You can find FASTA-formatted sequence files at:

NCBI: ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/
UCSC: ftp://hgdownload.cse.ucsc.edu/goldenPath/

With the UCSC link, you will need to navigate down to the right directory for the genome you are interested in.

NCBI groups the files by BLAST database.
Old 09-26-2011, 10:28 AM   #3
kursuni
Member
 
Location: Hoboken, NJ

Join Date: May 2011
Posts: 15

Quote:
Originally Posted by GenoMax View Post
You can find fasta formatted sequence files at -

NCBI: ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/
UCSC: ftp://hgdownload.cse.ucsc.edu/goldenPath/

You will need to traverse down the right directory for the genome that you are interested in with UCSC link.

NCBI has the files grouped based on the blast databases.
Thank you for your reply.

I downloaded the human_genomic file and extracted it, but the file inside does not have a .fasta or .fa extension.

Do I need to do anything else to convert it to a FASTA file?
Also, there is an md5 file for each database. What do we do with these md5 files?
Old 09-26-2011, 11:21 AM   #4
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480

I don't recall exactly where things are on the NCBI server, but have you looked here on the UCSC server? I'm pretty sure those would be hg19 reference sequence fasta files.

The md5 files are probably the checksums of the associated files, so you can make sure that the FASTA (or whatever format) file isn't corrupted. You can read the Wikipedia article on MD5 for more details.
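
As a rough sketch of how a checksum file like this is typically used, assuming GNU coreutils md5sum is installed and that the checksum file is in md5sum's own "hash  filename" format (I have not checked the NCBI files). toy.fa and toy.fa.md5 are made-up stand-ins for a real download and its checksum file:

```shell
# Sketch: verify a file against a recorded MD5 checksum with GNU md5sum.
# toy.fa / toy.fa.md5 are hypothetical stand-ins for a download and its .md5 file.
set -eu

workdir=$(mktemp -d)
cd "$workdir"

# Stand-in for the downloaded sequence file.
printf '>toy\nACGTACGT\n' > toy.fa

# Stand-in for the checksum file published next to the download.
md5sum toy.fa > toy.fa.md5

# Recompute the checksum and compare it with the recorded one.
if md5sum -c toy.fa.md5 >/dev/null 2>&1; then
    intact="yes"
else
    intact="no"
fi
echo "toy.fa intact: $intact"

# Simulate a damaged transfer by appending one byte, then check again.
printf 'N' >> toy.fa
if md5sum -c toy.fa.md5 >/dev/null 2>&1; then
    after_damage="yes"
else
    after_damage="no"
fi
echo "toy.fa intact after damage: $after_damage"
```

For a real download you would skip the toy setup and run md5sum -c on the checksum file from the directory that holds the downloaded file.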
Old 09-26-2011, 01:50 PM   #5
kursuni
Member
 
Location: Hoboken, NJ

Join Date: May 2011
Posts: 15

I've downloaded the FASTA files and unzipped them.
I indexed one of the FASTA files and then ran the command below to align my FASTQ file (SRR035022_1.filt.fastq) to it.

$ bwa aln ~/fasta/chr1.fa ~/datasets/SRR035022_1.filt.fastq > aln_sa.sai
[bwa_aln] 17bp reads: max_diff = 2
[bwa_aln] 38bp reads: max_diff = 3
[bwa_aln] 64bp reads: max_diff = 4
[bwa_aln] 93bp reads: max_diff = 5
[bwa_aln] 124bp reads: max_diff = 6
[bwa_aln] 157bp reads: max_diff = 7
[bwa_aln] 190bp reads: max_diff = 8
[bwa_aln] 225bp reads: max_diff = 9
[bwa_seq_open] fail to open file '/home/ukursuncu/datasets/SRR035022_1.filt.fastq'. Abort!
Aborted

What could I be doing wrong?
Old 09-26-2011, 02:24 PM   #6
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480

Does /home/ukursuncu/datasets/SRR035022_1.filt.fastq exist, and do you have read permission for it? Try typing the following and posting the output:
ls -l /home/ukursuncu/datasets/SRR035022_1.filt.fastq
I would guess that you just have a typo (happens to me all the time!).
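
The same existence and permission checks can be scripted. This is only a sketch: check_input is a hypothetical helper (not part of bwa), and the demo files are temporary stand-ins rather than the real reads.

```shell
# Sketch: basic checks on an input file before handing it to an aligner.
# check_input is a hypothetical helper, not a bwa feature.

check_input() {
    if [ ! -e "$1" ]; then
        echo "missing: $1"
        return 1
    fi
    if [ ! -r "$1" ]; then
        echo "not readable: $1"
        return 2
    fi
    if [ ! -s "$1" ]; then
        echo "empty: $1"
        return 3
    fi
    echo "ok: $1"
    return 0
}

# Demonstration on temporary stand-in files.
demo_dir=$(mktemp -d)
printf '@read1\nACGT\n+\nIIII\n' > "$demo_dir/reads.fastq"
: > "$demo_dir/empty.fastq"

# Record the return code of each check.
rc_ok=0;      check_input "$demo_dir/reads.fastq" || rc_ok=$?
rc_missing=0; check_input "$demo_dir/nope.fastq"  || rc_missing=$?
rc_empty=0;   check_input "$demo_dir/empty.fastq" || rc_empty=$?
```

If all three checks pass on the real file, the path and permissions are probably not the problem and it is worth looking elsewhere.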
Old 09-26-2011, 02:44 PM   #7
kursuni
Member
 
Location: Hoboken, NJ

Join Date: May 2011
Posts: 15

The output is as below:

-rw-r--r-- 1 ukursuncu ukursuncu 5571960039 2011-05-10 14:00 /home/ukursuncu/datasets/SRR035022_1.filt.fastq
Old 09-26-2011, 04:31 PM   #8
kursuni
Member
 
Location: Hoboken, NJ

Join Date: May 2011
Posts: 15

Or can you suggest some resources where I can download FASTQ reads?
Old 09-27-2011, 03:48 AM   #9
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,049

kursuni:

One does not download reference sequences in FASTQ format. Sequence reads are generally in FASTQ format (and, based on the file name you gave, yours seem to be). The reference sequence that you downloaded is a FASTA file. It appears that you downloaded the right files and indexed them (can you send the output of ls -l chr1* from your home directory?).

What operating system are you running this on? Is it 32-bit or 64-bit? How much RAM do you have on your machine?
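
On the FASTA/FASTQ distinction above: FASTA records start with a ">" header line and FASTQ records start with an "@" header line, so a quick look at the first character usually tells you which one you have. A minimal sketch, where guess_format is a hypothetical helper and the two files are toy stand-ins (it will report "unknown" for gzipped files):

```shell
# Sketch: guess whether a file is FASTA or FASTQ from its first character.
# guess_format is a hypothetical helper; it only inspects the first byte,
# so it is a quick sanity check, not a format validator.

guess_format() {
    first=$(head -c 1 "$1")
    case "$first" in
        '>') echo "fasta" ;;
        '@') echo "fastq" ;;
        *)   echo "unknown" ;;
    esac
}

# Toy stand-ins for a reference file and a reads file.
demo_dir=$(mktemp -d)
printf '>chrToy\nACGTACGT\n' > "$demo_dir/ref.fa"
printf '@read1\nACGT\n+\nIIII\n' > "$demo_dir/reads.fastq"

ref_format=$(guess_format "$demo_dir/ref.fa")
reads_format=$(guess_format "$demo_dir/reads.fastq")
echo "ref.fa looks like: $ref_format"
echo "reads.fastq looks like: $reads_format"
```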

Quote:
Originally Posted by kursuni View Post
Or Can you suggest some resources where I can download fastq reads from ?
Old 09-27-2011, 03:54 AM   #10
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,049

Get the "chromFa.tar.gz" file from this link: ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/ (this is for hg19, the newest assembly of the human genome). Once you unzip and untar the file, there will be several large FASTA files (one per chromosome) with the extension ".fa". You can index the individual files or concatenate them into a single "hg19.fa" that you can then index with bwa.

*chromFa.tar.gz - The assembly sequence in one file per chromosome.
Repeats from RepeatMasker and Tandem Repeats Finder (with period
of 12 or less) are shown in lower case; non-repeating sequence is
shown in upper case.
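
The unzip-and-untar step can be done with a single tar call. A minimal sketch, using a tiny archive built on the spot as a stand-in for the real chromFa.tar.gz (the directory names src and ref are just assumptions for the demo):

```shell
# Sketch: inspect and unpack a .tar.gz of per-chromosome FASTA files.
# A tiny archive is built locally as a stand-in for the real chromFa.tar.gz.
set -eu

work=$(mktemp -d)
mkdir "$work/src" "$work/ref"
printf '>chr1\nACGTACGT\n' > "$work/src/chr1.fa"
printf '>chrM\nGATTACA\n'  > "$work/src/chrM.fa"
tar -czf "$work/chromFa.tar.gz" -C "$work/src" chr1.fa chrM.fa

# List the archive contents first, so you know where the files will land.
tar -tzf "$work/chromFa.tar.gz"

# Decompress and unpack in one step, into a dedicated directory.
tar -xzf "$work/chromFa.tar.gz" -C "$work/ref"

# Count the extracted FASTA files.
n_fa=$(ls "$work/ref"/*.fa | wc -l | tr -d ' ')
echo "extracted $n_fa FASTA files"
```

Listing before extracting is optional, but it avoids surprises when an archive unpacks into the current directory instead of its own subdirectory.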

MD5 files are used to check the integrity of transferred data (like a fingerprint). If you did not get any errors during the download, you can ignore the MD5 files for now to keep things simple.

Quote:
Originally Posted by kursuni View Post
Thank you for your reply..

I checked and download the human_genomic file and extract it. but the file inside it is not a file with the fasta or fa extension.

Do I need to do any additional thing to transform it to a fasta file ?
and there is md5 file for each database. What do we do with these md5 files ?
Old 09-27-2011, 03:59 AM   #11
arkal
advancing one byte at a time!
 
Location: Bangalore, India

Join Date: Jun 2011
Posts: 56

Quote:
Originally Posted by kursuni View Post
Or Can you suggest some resources where I can download fastq reads from ?
If you want, you can generate simulated FASTQ files from your FASTA file using software such as dwgsim. Just an alternative.

-A
Old 09-27-2011, 01:20 PM   #12
kursuni
Member
 
Location: Hoboken, NJ

Join Date: May 2011
Posts: 15

Quote:
Originally Posted by GenoMax View Post
Get the "chromFa.tar.gz" file from this link: ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/ (this is for hg19 the newest assembly of human genome). Once you unzip and untar the file there will be several large fasta files (one per chromosome) with an extension ".fa". You can index the individual files or concatenate them together to make a single "hg19.fa" that you can then index with bwa.

*chromFa.tar.gz - The assembly sequence in one file per chromosome.
Repeats from RepeatMasker and Tandem Repeats Finder (with period
of 12 or less) are shown in lower case; non-repeating sequence is
shown in upper case.

MD5 files are used to check for integrity of transferred data (like a fingerprint). If you did not get any errors during the download then you can ignore the MD5 files for now to keep things simple.
Thank you for your reply.

I've already downloaded the FASTA files from the source you gave above. The problem is not with the FASTA file: I first index the FASTA file and then try to align my FASTQ file to the reference, but it returns the error shown in my previous posts.

By the way, how can we concatenate these files to make a single hg19.fa?
Old 09-27-2011, 01:52 PM   #13
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480

You can concatenate files with the "cat" command. At the command line, type:
man cat
for usage. You'd do well to find someone local who is familiar with Unix. That would likely solve your problems.
Old 09-28-2011, 03:47 AM   #14
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,049

If you can answer the questions in my post #9, we may be able to see what the problem is.

$ cat chr1.fa chr2.fa ... chrM.fa > hg19.fa

Replace "..." with the names of all the remaining chromosome files. This will make a single large multi-FASTA file for hg19.
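
As a rough sketch of the concatenation step, a shell glob can stand in for typing every file name. The three toy files below are stand-ins for the real per-chromosome files; if your directory holds extra chr*.fa files that you do not want in the reference, list the files explicitly instead of using the glob.

```shell
# Sketch: concatenate per-chromosome FASTA files and sanity-check the result.
# The three toy files are stand-ins for the real chr*.fa files.
set -eu

work=$(mktemp -d)
cd "$work"
printf '>chr1\nACGTACGT\n' > chr1.fa
printf '>chr2\nTTGGCCAA\n' > chr2.fa
printf '>chrM\nGATTACA\n'  > chrM.fa

# The glob expands in the shell's sort order (chr1, chr2, chrM here).
# The output name does not match chr*.fa, so a rerun will not pick up
# the previous output as an input.
cat chr*.fa > hg19.fa

# One '>' header per sequence: the count should equal the number of inputs.
n_inputs=$(ls chr*.fa | wc -l | tr -d ' ')
n_headers=$(grep -c '^>' hg19.fa)
echo "inputs: $n_inputs, sequences in hg19.fa: $n_headers"

# Confirm the combined file exists and note its full path for the indexer.
ls -l "$work/hg19.fa"
```

Comparing the header count with the number of input files is a cheap way to confirm nothing was skipped before spending time on indexing.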

Quote:
Originally Posted by kursuni View Post
Thank you for your reply..

I've already downloaded the fasta files from the source you gave above. The problem is not with the fasta file, I first index the fasta file and then I try to align my fastq file to the reference fasta, but it returns error written previous posts.

by the way, how can we concatenate these files to make a single hg19.fa ?
Old 09-29-2011, 04:58 PM   #15
kursuni
Member
 
Location: Hoboken, NJ

Join Date: May 2011
Posts: 15

Quote:
Originally Posted by GenoMax View Post
If you can send answers for questions in my post #9, we may be able to see what the problem is.

$ cat chr1.fa chr2.fa (type names of all chromosome files) chrM.fa > hg19.fa (will make a single large multiple fasta file for hg19).
I ran the command above and it did create a single hg19.fa file, but when I try to index it, I get the error below.

$ bwa index -p index_hg19 -a bwtsw -c /home/path/fasta/hg19.fa
[bwa_index] fail to open file '/home/path/fasta/hg19.fa'. Abort!
Aborted

What could be wrong?

Thanks in advance.
Old 09-30-2011, 12:49 AM   #16
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480

Let's try a little problem solving ourselves prior to posting next time. You received an error message that says, "fail to open file '/home/path/fasta/hg19.fa'. Abort!". Did you bother checking if '/home/path/fasta/hg19.fa' exists? If you don't know how to do that then you should buy a book on using Linux before doing anything else. That will solve most of your problems.