I'm interested in running baseline benchmarks on several short-read alignment tools to see how they behave as machine specs change (CPU type, core count, clock speeds, RAM speeds, cache sizes, etc.). There can be all kinds of subtle effects, especially on large 24- or 32-core servers, which have non-uniform memory access (NUMA) issues of their own.
I'm hoping to run Bowtie, SOAP, and Maq as my baseline tools. I'm not really comparing the three tools against each other; I'm more interested in hardware effects on short-read alignment in general. I'll probably focus on Bowtie just because it's so powerful (and much faster!).
My problem is (as a CS guy, not a bio guy) deciding what test data to use as my baseline. I don't want to make my own synthetic data; I want to run a typical problem that real users of these tools submit.
Could someone recommend where I could find, or how I could create, a test suite of data? It would probably consist of a few tens of millions of short reads and one or more larger reference databases to align them against. I would probably run multiple cases, perhaps four runs allowing k = 0, 1, 2, 3 mismatches.
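For concreteness, here is a sketch of the benchmark matrix I have in mind. It only echoes the commands (a dry run); the index name `hg_index` and read file `reads.fq` are placeholders, and I'm assuming Bowtie's `-v` flag, which caps end-to-end mismatches at a value between 0 and 3:

```shell
#!/bin/sh
# Dry-run sketch of the planned benchmark sweep: one Bowtie run per
# mismatch setting k = 0..3. Replace the echo with the real command
# (and "hg_index"/"reads.fq" with a real index and read set) to run it.
for k in 0 1 2 3; do
  echo "bowtie -v $k -t hg_index reads.fq /dev/null"
done
```

Timing each invocation (e.g. with `-t` or an external `time`) across different thread counts and machines would then expose the hardware sensitivities I'm after.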
I notice the 2008 Bowtie paper takes samples from the 1000 Genomes Project, trims them to 35 bases, and aligns them against the human genome reference. Is this a good test case, typical of real use of aligners? Would typical users use longer input sequences, shorter ones, a mix? Again, I'm just looking for typical workloads where the software's speed is measurable and I can see where the hardware sensitivities are.
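If trimming to 35 bases is indeed representative, the preprocessing step is simple. A minimal sketch, assuming a plain four-line-per-record FASTQ file (the filenames are placeholders):

```shell
#!/bin/sh
# Trim the sequence and quality lines of 4-line FASTQ records to 35 bp,
# mirroring the 2008 Bowtie paper's setup. Header and "+" lines pass
# through unchanged. Usage: trim.sh < in.fq > out_35bp.fq
awk 'NR % 4 == 2 || NR % 4 == 0 { print substr($0, 1, 35); next } { print }'
```

This keeps the benchmark input uniform, which should make timing differences attributable to the hardware rather than to read-length variation.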
I appreciate any help in getting these baseline benchmarks run on my hardware!