**GenoMax** · 01-28-2015, 09:37 AM

The Fasta file format is meant for plain sequence files (without quality information). There may be extensions of the Fasta format, but it is normally used for plain sequence.

What you are looking for are Fastq format files, which have become the de facto standard for NGS data.

You can get Fastq data files from the NCBI Sequence Read Archive (SRA, formerly the Short Read Archive) (http://www.ncbi.nlm.nih.gov/sra). Retrieving the data requires a utility called the SRA Toolkit. Fastq files can be several gigabytes in size.

**dpryan** · 01-28-2015, 09:37 AM

1. Probably JGI. They seem to have a LOT of bacterial datasets, which should be more tractable.
2. No, that would be incredibly unusual. They're usually stored in fastq, typically in separate files.
3. Each fragment sequenced produces two reads, one from each end. So if you sequence 100 million fragments, you'll have 200 million reads (100 million pairs). This is as opposed to single-end reads, where you just sequence one end of each fragment.
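To make points 2 and 3 concrete, here is a minimal Python sketch of walking two paired fastq files in lockstep. It assumes the common layout where the first file holds one end of every fragment and the second file holds the other end in the same order, with 4 lines per record. The names `fastq_records` and `read_pairs` and the example reads are made up for illustration.

```python
import io
from itertools import zip_longest


def fastq_records(handle):
    """Yield (name, sequence, quality) tuples, assuming exactly 4 lines per record."""
    lines = (line.rstrip("\n") for line in handle)
    # Passing the same generator four times makes zip consume 4 lines per step.
    for header, seq, _plus, qual in zip(lines, lines, lines, lines):
        yield header[1:], seq, qual


def read_pairs(handle1, handle2):
    """Yield (read1, read2) tuples; both files must list fragments in the same order."""
    for r1, r2 in zip_longest(fastq_records(handle1), fastq_records(handle2)):
        if r1 is None or r2 is None:
            raise ValueError("the two files contain different numbers of reads")
        yield r1, r2


# Made-up example data: two fragments, each sequenced from both ends.
file1 = io.StringIO("@frag1/1\nACGT\n+\nIIII\n@frag2/1\nGGCC\n+\nIIII\n")
file2 = io.StringIO("@frag1/2\nTTAA\n+\nIIII\n@frag2/2\nCCGG\n+\nIIII\n")
pairs = list(read_pairs(file1, file2))
print(len(pairs), "pairs,", 2 * len(pairs), "reads")
```

With the two example fragments this reports 2 pairs and 4 reads, the same ratio as 100 million fragments giving 200 million reads.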

**mastal** · 01-28-2015, 09:42 AM

2. fastq format

**schakalakka** · 02-03-2015, 06:04 AM

Thank you for your answers.

So I have to look up fastq and write a parser. I had hoped I could avoid that.

**rhinoceros** · 02-03-2015, 06:13 AM

2. Some assemblers require input in fasta format. For example, IDBA wants pairs to be consecutive sequences in fasta format, and it comes bundled with a small script for converting from fastq to fasta.
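As a rough illustration of that "pairs as consecutive sequences" layout, the Python sketch below interleaves two paired fastq inputs into one fasta output. This is not the script bundled with IDBA, only a sketch under the assumption of 4-line fastq records and files that list fragments in the same order. The function names and the example reads are hypothetical.

```python
import io


def fastq_names_and_seqs(handle):
    """Yield (name, sequence) tuples, assuming exactly 4 lines per record."""
    lines = (line.rstrip("\n") for line in handle)
    for header, seq, _plus, _qual in zip(lines, lines, lines, lines):
        yield header[1:], seq


def interleave_to_fasta(fastq1, fastq2, out):
    """Write each pair as two consecutive fasta records; return the number of pairs."""
    count = 0
    # Note: zip stops silently at the shorter input, so check read counts separately.
    pairs = zip(fastq_names_and_seqs(fastq1), fastq_names_and_seqs(fastq2))
    for (name1, seq1), (name2, seq2) in pairs:
        out.write(f">{name1}\n{seq1}\n>{name2}\n{seq2}\n")
        count += 1
    return count


# Made-up example data: one fragment sequenced from both ends.
out = io.StringIO()
n_pairs = interleave_to_fasta(
    io.StringIO("@frag1/1\nACGT\n+\nIIII\n"),
    io.StringIO("@frag1/2\nTTAA\n+\nIIII\n"),
    out,
)
print(out.getvalue(), end="")
```

The quality lines are simply dropped, since fasta has no place for them.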

Also, don't reinvent the wheel. I'm sure there are many OSS fastq parsers available. A good place to start could be https://github.com/samtools/htslib

**dpryan** · 02-03-2015, 06:28 AM

HTSlib doesn't have a fastq parser. Anyway, with any modern data it's fine to assume that fastq entries are always 4 lines, so a parser is then trivial to write.
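Under that 4-lines-per-entry assumption, a parser could look something like the Python sketch below. The function names are made up, the sanity checks are optional extras, and the quality decoding assumes the Phred+33 encoding, which is an assumption about the data rather than something the file itself states.

```python
import io


def parse_fastq(handle):
    """Yield (name, sequence, quality) tuples from a 4-line-per-record fastq handle."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        if not header.strip():
            continue  # tolerate blank lines between or after records
        seq = handle.readline().rstrip()
        plus = handle.readline()
        qual = handle.readline().rstrip()
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("record does not follow the 4-line fastq layout")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        yield header[1:].rstrip(), seq, qual


def phred_scores(qual, offset=33):
    """Decode a quality string into integer scores (assumes Phred+33 by default)."""
    return [ord(char) - offset for char in qual]


# Made-up example record.
example = io.StringIO("@read1 sample\nACGT\n+\nII!I\n")
for name, seq, qual in parse_fastq(example):
    print(name, seq, phred_scores(qual))
```

For real files, open them in text mode and pass the handle in. Older multi-line fastq files would break the 4-line assumption.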

**milw** · 02-03-2015, 07:47 AM

If you're using Python, there's a decent parser already at https://scipher.wordpress.com/2010/0...-fastq-parser/


Sample FASTA data and questions about paired end reads
