#1 |
Junior Member
Location: USA Join Date: Dec 2011
Posts: 5
Hi guys,
I'm very new to this. I have a paired-end data set from a HiSeq 1000. I want to take the first 10,000 or 100,000 reads out of ~40 million to use for tests, rather than putting all 40 million reads through the tests. What is the easiest way to generate files containing only the first 100,000 reads? Thanks.
#2 |
Senior Member
Location: San Diego Join Date: May 2008
Posts: 912
I'm not really familiar with HiSeq data formats, but there are probably some kind of coordinates, maybe by tile. If the data below from the checked answer is correct, you could use grep to get only the entries with '[sequencer name]_[run]:[lane]:10:'. That'd be fairly random.
Or simpler, head -400000 to get the top 100k reads (a FASTQ record is 4 lines)? The ones at the very beginning are probably from the edge of the flow cell, and will have more bad-quality reads. Pulling from the middle might get you more good reads. http://biostar.stackexchange.com/que...d-naming-tiles
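The head idea, plus the "pull from the middle" variant, could look roughly like the Python sketch below. It assumes uncompressed FASTQ with exactly 4 lines per read (no wrapped sequence lines). The function name `take_reads`, the `skip_reads` parameter, and the file names in the usage comment are made up for illustration.

```python
from itertools import islice

LINES_PER_READ = 4  # header, sequence, '+' separator, qualities


def take_reads(src_path, dst_path, n_reads, skip_reads=0):
    """Copy n_reads FASTQ records from src_path to dst_path.

    skip_reads records are skipped first, which lets you pull a block
    from further into the file instead of the very top.
    Assumes plain-text FASTQ with 4 lines per record.
    """
    start = skip_reads * LINES_PER_READ
    stop = start + n_reads * LINES_PER_READ
    with open(src_path) as src, open(dst_path, "w") as dst:
        # islice streams the file, so the full 40M reads are never held in memory
        dst.writelines(islice(src, start, stop))


# Hypothetical usage: same arguments for both mate files so pairs stay matched.
# take_reads("sample_R1.fastq", "test_R1.fastq", n_reads=100000, skip_reads=1000000)
# take_reads("sample_R2.fastq", "test_R2.fastq", n_reads=100000, skip_reads=1000000)
```

Using identical `n_reads` and `skip_reads` on both files only keeps mates together if the two files list the reads in the same order, which is the usual layout but worth checking.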
#3 |
Devon Ryan
Location: Freiburg, Germany Join Date: Jul 2011
Posts: 3,480
Have a look at this thread. You'll often find that the first batch of reads in a file are crap (at least from some machines), so you're better off randomly selecting them.
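For random selection that keeps the two mate files in sync, one possible approach is a two-pass sketch like the one below. It is not necessarily the method from the linked thread. It assumes uncompressed 4-line FASTQ records and that both files hold the same reads in the same order. The names `read_records`, `count_records` and `subsample_pairs` are illustrative.

```python
import random
from itertools import islice


def read_records(handle):
    """Yield FASTQ records as 4-line tuples (assumes unwrapped records)."""
    while True:
        record = tuple(islice(handle, 4))
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        yield record


def count_records(path):
    """First pass: count the records in one file."""
    with open(path) as handle:
        return sum(1 for _ in read_records(handle))


def subsample_pairs(r1_in, r2_in, r1_out, r2_out, n_reads, seed=0):
    """Write the same randomly chosen record positions from both mate files.

    Returns the number of pairs written. Output keeps the original file order.
    """
    total = count_records(r1_in)
    rng = random.Random(seed)  # fixed seed makes the subset reproducible
    keep = set(rng.sample(range(total), min(n_reads, total)))
    with open(r1_in) as in1, open(r2_in) as in2, \
            open(r1_out, "w") as out1, open(r2_out, "w") as out2:
        # Second pass: walk both files in lockstep and copy the chosen positions
        pairs = zip(read_records(in1), read_records(in2))
        for i, (rec1, rec2) in enumerate(pairs):
            if i in keep:
                out1.writelines(rec1)
                out2.writelines(rec2)
    return len(keep)
```

Because the same positions are taken from both files, each read in the R1 output still has its mate at the same position in the R2 output. For 100,000 reads out of ~40 million, the set of chosen positions is small, so memory use stays modest.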