
**GenoMax** · 09-24-2011, 11:51 AM

You can find FASTA-formatted sequence files at:

NCBI: ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/
UCSC: ftp://hgdownload.cse.ucsc.edu/goldenPath/

With the UCSC link, you will need to navigate down to the directory for the genome you are interested in.

NCBI has the files grouped based on the blast databases.

**kursuni** · 09-26-2011, 10:28 AM

Thank you for your reply.

I downloaded the human_genomic file and extracted it, but the file inside does not have a .fasta or .fa extension.

Do I need to do anything else to convert it to a FASTA file?
Also, there is an md5 file for each database. What do we do with these md5 files?

**dpryan** · 09-26-2011, 11:21 AM

I don't recall exactly where things are on the NCBI server, but have you looked here on the UCSC server? I'm pretty sure those would be hg19 reference sequence fasta files.

The md5 files probably hold the checksum of the associated file, so you can make sure that the FASTA (or whatever format) file isn't corrupted. You can read the Wikipedia article on md5 for more details.
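As a rough sketch of how such a check usually goes, assuming GNU `md5sum` is installed and the .md5 file uses the common "checksum  filename" layout: the file `demo.fa` below is a made-up stand-in for a real download, created only so the example runs anywhere.

```shell
#!/bin/sh
# Sketch: verify a download against its .md5 file with GNU md5sum.
# demo.fa is a hypothetical stand-in; in practice it would be the file
# fetched from the FTP server, and demo.fa.md5 the checksum published next to it.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf '>chrDemo\nACGTACGT\n' > demo.fa      # placeholder "download"
md5sum demo.fa > demo.fa.md5                 # placeholder for the published checksum

# -c recomputes the checksum of each file listed in the .md5 file and compares.
if md5sum -c demo.fa.md5 >/dev/null 2>&1; then
    md5_status="OK"
else
    md5_status="FAILED"
fi
echo "checksum: $md5_status"

# Simulate a corrupted transfer by appending one byte, then check again.
printf 'N' >> demo.fa
if md5sum -c demo.fa.md5 >/dev/null 2>&1; then
    md5_after="OK"
else
    md5_after="FAILED"
fi
echo "after corruption: $md5_after"
```

For a real download, running `md5sum -c` on the .md5 file in the same directory as the downloaded file should be enough.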

**kursuni** · 09-26-2011, 01:50 PM

I've downloaded the FASTA files and unzipped them.
I indexed one of the FASTA files and then ran the command below to align my fastq file (SRR035022_1.filt.fastq) to it.

$ bwa aln ~/fasta/chr1.fa ~/datasets/SRR035022_1.filt.fastq > aln_sa.sai
[bwa_aln] 17bp reads: max_diff = 2
[bwa_aln] 38bp reads: max_diff = 3
[bwa_aln] 64bp reads: max_diff = 4
[bwa_aln] 93bp reads: max_diff = 5
[bwa_aln] 124bp reads: max_diff = 6
[bwa_aln] 157bp reads: max_diff = 7
[bwa_aln] 190bp reads: max_diff = 8
[bwa_aln] 225bp reads: max_diff = 9
[bwa_seq_open] fail to open file '/home/ukursuncu/datasets/SRR035022_1.filt.fastq'. Abort!
Aborted

What am I doing wrong?

**dpryan** · 09-26-2011, 02:24 PM

Does /home/ukursuncu/datasets/SRR035022_1.filt.fastq exist, and do you have read permission for it? Try typing the following and posting the output:
ls -l /home/ukursuncu/datasets/SRR035022_1.filt.fastq
I would guess that you just have a typo (happens to me all the time!).
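The same check can be scripted. This is only a sketch: the reads file here is a throwaway demo file made with `mktemp`, not the real dataset, and the variable names are made up for illustration.

```shell
#!/bin/sh
# Sketch: confirm a reads file exists, is readable, and starts like FASTQ.
# "$reads" is a hypothetical demo file, not the poster's real data.
reads=$(mktemp)
printf '@read1\nACGT\n+\nIIII\n' > "$reads"

if [ ! -e "$reads" ]; then
    file_status="missing"
elif [ ! -r "$reads" ]; then
    file_status="not readable"
else
    file_status="readable"
fi
echo "$reads: $file_status"

# A FASTQ record starts with '@'; a FASTA record starts with '>'.
first_char=$(head -c 1 "$reads")
echo "first character: $first_char"
```

Pointing `reads` at the real path instead of the demo file would report the same three cases.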

**kursuni** · 09-26-2011, 02:44 PM

The output is as below:

-rw-r--r-- 1 ukursuncu ukursuncu 5571960039 2011-05-10 14:00 /home/ukursuncu/datasets/SRR035022_1.filt.fastq

**kursuni** · 09-26-2011, 04:31 PM

Or can you suggest some resources where I can download fastq reads from?

**GenoMax** · 09-27-2011, 03:48 AM

kursuni:

One does not download reference sequences in fastq format. Sequence reads are generally in "fastq" format (and judging by its name, your file is). The reference sequence that you downloaded is a FASTA-format file. It appears that you downloaded the right files and indexed them. Can you send the output of ls -l chr1* from your home directory?

What operating system are you running this on? Is it 32-bit or 64-bit? How much RAM do you have on your machine?
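A minimal sketch for collecting those details, assuming a Linux machine (the `/proc/meminfo` part is Linux-specific). One possible reason they matter, not confirmed in this thread: the reads file is over 5 GB, and a 32-bit program built without large-file support can fail to open a file that size.

```shell
#!/bin/sh
# Sketch: report operating system, word size, and total RAM.
os_name=$(uname -s)          # kernel name, e.g. Linux
arch=$(uname -m)             # hardware name, e.g. x86_64 or i686
bits=$(getconf LONG_BIT)     # 32 or 64
echo "OS: $os_name  arch: $arch  word size: ${bits}-bit"

# /proc/meminfo exists on Linux only; skip quietly elsewhere.
if [ -r /proc/meminfo ]; then
    grep MemTotal /proc/meminfo
fi
```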

**GenoMax** · 09-27-2011, 03:54 AM

Get the "chromFa.tar.gz" file from this link: ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/ (this is for hg19, the newest assembly of the human genome). Once you unzip and untar the file there will be several large FASTA files (one per chromosome) with the extension ".fa". You can index the individual files, or concatenate them into a single "hg19.fa" and index that with bwa.

*chromFa.tar.gz - The assembly sequence in one file per chromosome.
Repeats from RepeatMasker and Tandem Repeats Finder (with period
of 12 or less) are shown in lower case; non-repeating sequence is
shown in upper case.

MD5 files are used to check for integrity of transferred data (like a fingerprint). If you did not get any errors during the download then you can ignore the MD5 files for now to keep things simple.
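To illustrate the unzip-and-untar step, here is a sketch that runs offline on a tiny stand-in archive. `chromFa_demo.tar.gz` and its three toy sequences are made up for the example; with real data the archive would be the chromFa.tar.gz downloaded from the link above.

```shell
#!/bin/sh
# Sketch: unpack a per-chromosome tarball and list the FASTA files inside.
# chromFa_demo.tar.gz is a hypothetical stand-in built here for illustration.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

mkdir src
printf '>chr1\nACGT\n' > src/chr1.fa
printf '>chr2\nGGCC\n' > src/chr2.fa
printf '>chrM\nTTAA\n' > src/chrM.fa
tar -czf chromFa_demo.tar.gz -C src chr1.fa chr2.fa chrM.fa

# -x extract, -z gunzip, -f archive name: one command does both
# the "unzip" and the "untar"; -C chooses the output directory.
mkdir fasta
tar -xzf chromFa_demo.tar.gz -C fasta

n_files=$(ls fasta/*.fa | wc -l | tr -d ' ')
echo "extracted $n_files FASTA files:"
ls -l fasta
```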

**arkal** · 09-27-2011, 03:59 AM

If you want, you can generate simulated fastq files from your FASTA file using software such as dwgsim. Just an alternative.

-A

**kursuni** · 09-27-2011, 01:20 PM

Thank you for your reply.

I had already downloaded the FASTA files from the source you gave above. The problem is not with the FASTA file: I first index the FASTA file and then try to align my fastq file to the reference, but it returns the error shown in my previous posts.

By the way, how can we concatenate these files to make a single hg19.fa?

**dpryan** · 09-27-2011, 01:52 PM

You can concatenate files with the "cat" command. At the command line, type:
man cat
for usage. You'd do well to find someone local familiar with unix. That would likely solve your problems.

**GenoMax** · 09-28-2011, 03:47 AM

If you can send answers for questions in my post #9, we may be able to see what the problem is.

$ cat chr1.fa chr2.fa [names of all the other chromosome files] chrM.fa > hg19.fa
This will make a single large multi-FASTA file for hg19.
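Typing every name can be avoided with a loop. The sketch below uses five toy files so it runs anywhere; the file contents and the name `hg19_demo.fa` are made up. A loop with an explicit list is used instead of a glob like `chr*.fa` because a glob sorts names as text (chr10 before chr2) and, for hg19, would also pick up the extra chrUn, random and haplotype files that ship in the same archive.

```shell
#!/bin/sh
# Sketch: concatenate per-chromosome FASTA files in an explicit order.
# The toy files and hg19_demo.fa are hypothetical names for illustration.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

for c in 1 2 10 X M; do
    printf '>chr%s\nACGT\n' "$c" > "chr$c.fa"
done

# For real hg19 the list would be: 1 2 ... 22 X Y M
: > hg19_demo.fa                      # start with an empty output file
for c in 1 2 10 X M; do
    cat "chr$c.fa" >> hg19_demo.fa
done

# Each sequence in a multi-FASTA file begins with a '>' header line.
n_seqs=$(grep -c '^>' hg19_demo.fa)
third=$(grep '^>' hg19_demo.fa | sed -n '3p')
echo "$n_seqs sequences; third header is $third"
```

Counting the header lines afterwards is a quick way to confirm that every chromosome made it into the combined file.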

**kursuni** · 09-29-2011, 04:58 PM

I ran the cat command and it did create a single hg19.fa file, but when I try to index it, I get the error below.

$ bwa index -p index_hg19 -a bwtsw -c /home/path/fasta/hg19.fa
[bwa_index] fail to open file '/home/path/fasta/hg19.fa'. Abort!
Aborted

What could be wrong?

Thanks in Advance...

Publicly available FASTA Database ?
