#1 |
Senior Member
Location: Boston, MA Join Date: Nov 2010
Posts: 100
Hi all,

I often subsample my fastq files by using the unix 'head' command, rather than taking reads from random positions in my fastq file. My setup is as follows. Casava output:

Code:
file1.fastq.gz
file2.fastq.gz
.
.
file#.fastq.gz

I concatenate these into a single file:

Code:
gunzip -c *.gz | gzip -9 > file.fastq.gz

and then take the first reads for downstream analysis:

Code:
zcat file.fastq.gz | head -n 100000 | <downstream analysis, e.g. blastn>

Is this a valid way to subsample, or does taking reads from the top of the file introduce a bias? Thanks very much in advance
#2 |
Member
Location: Denver, CO Join Date: Mar 2011
Posts: 37
It's definitely not very rigorous. HTSeq has a fastq reader in it which could help you subsample randomly; see the htseq tour.

Previous discussion on subsampling paired-end data: http://seqanswers.com/forums/showthread.php?t=12070
#3 |
David Eccles (gringer)
Location: Wellington, New Zealand Join Date: May 2011
Posts: 838
I do this often (mostly to make sure that the output data format is correct), and often find that it's a really bad representation of the data, due to strange biases, low mapping quality, and odd mappings with unequal distribution, to name a few. SOLiD and Illumina output files have a pile of rubbish at the start of the file from bad sequencing reads (the edges of chips / cells seem to be more prone to error), which heavily influences the results gleaned from the first few reads.

If you've already been doing this, I'm somewhat surprised you haven't discovered this yourself. If you can afford the time and want something more representative while keeping within quick command-line stuff, use 'shuf -n' instead of 'head -n'. You could also try dumping the first million or so reads (or more) and then taking the next million or so (but there'll still be some bias there as well).

Last edited by gringer; 12-28-2011 at 11:18 AM.
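A note on 'shuf': applied directly to a FASTQ file it shuffles individual lines and breaks up the 4-line records. A common workaround (a sketch, not from this thread; the file names are hypothetical) is to join each record onto one line with 'paste', shuffle whole records, then split them back out:

```shell
# Build a tiny demo FASTQ (3 records), then subsample 2 complete records.
printf '@r1\nACGT\n+\nIIII\n@r2\nTTTT\n+\nJJJJ\n@r3\nGGGG\n+\nKKKK\n' > demo.fastq

# paste - - - - joins every 4 lines with tabs; shuf -n picks whole records;
# tr restores the original 4-line layout.
paste - - - - < demo.fastq | shuf -n 2 | tr '\t' '\n' > subsample.fastq

wc -l < subsample.fastq   # 8 lines = 2 intact records
```

This stays a streaming one-liner, though unlike 'head' it still has to read the whole file.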
#4 |
Senior Member
Location: Boston, MA Join Date: Nov 2010
Posts: 100
Thanks for the replies - that's very helpful.

I have only been running this over the last couple of days (but we have loads of files where I need to do it) and didn't notice anything majorly wrong, other than some weird QC scores for a couple of files (gringer, I guess that is what you have seen as well). We have a script to subsample correctly, so I will probably just use that - the main problem being that it needs to read the entire file before it can output a subsample. In that sense 'head' was a lot faster.

gringer, as for shuf, I don't think that will work, since each read spans 2 (fasta) or 4 (fastq) lines. Otherwise, this is exactly what I would be looking for: given a file of n lines, randomly take a read starting at line k, where k is always divisible by 4 (so each read is 4 lines). But I haven't been able to find a script or command that will do that.
#5 | |
Peter (Biopython etc)
Location: Dundee, Scotland, UK Join Date: Jul 2009
Posts: 1,543
Quote:

Yes - the read order produced by Illumina reflects the slide positions, so the first and last reads will be from the edges and generally of poorer quality (so I've read).
#6 | |
David Eccles (gringer)
Location: Wellington, New Zealand Join Date: May 2011
Posts: 838
Quote:

The only ways I can think of at the moment are fairly complex piped shell commands, or small programs. I guess one "simple" way to do it would be to seek to a random file position, then read forward until you find a start-of-sequence marker, but that's a lot of work for too little benefit. FWIW, a strict random seek on a fastq file is not possible, because both @ and + can appear in the quality strings, and you can get multi-line fastq files.

Edit: I've just thought of another way that would work in a single pass (i.e. could work for piped input), assuming you knew the approximate number of records in the file. You decide in advance which records will be selected (i.e. generate 1M numbers between 1 and 100M), sort the list, then run through the file and print only the records that appear in the list.

Last edited by gringer; 12-28-2011 at 04:39 PM.
#7 | |
Senior Member
Location: Boston, MA Join Date: Nov 2010
Posts: 100
Quote:

So, a talented student wrote up a quick script in python for me and it works really well. It appears to be about as fast as writing to disk, so overall it's what I was looking for. It takes the following arguments:

Code:
python subsampler.py <filename> <number of sequences to extract>

Code:
#!/usr/bin/env python
import random
import sys

#name of the input file (fasta or fastq)
#assumes input file is standard fasta/fastq format
fileName = sys.argv[1]
#number of sequences to subsample
numSeq = int(sys.argv[2])

increment = 0
#if it's a fasta file
if fileName.find(".fasta") != -1:
    increment = 2
#else if it's a fastq file
elif fileName.find(".fastq") != -1:
    increment = 4
#quit if neither
else:
    sys.stdout.write("not a fasta/fastq file\n")
    sys.exit()

#put the entire file in a list, one read per element
FILE = open(fileName, 'r')
reads = list()
index = 0
for line in FILE:
    if index % increment == 0:
        reads.append(line)
    else:
        reads[index / increment] += line
    index += 1
FILE.close()

#figure out total number of reads, error if applicable
totalReads = index / increment
if totalReads < numSeq:
    sys.stdout.write("You only have " + str(totalReads) + " reads!\n")
    sys.exit()

#shuffle the reads!
random.shuffle(reads)

#output the reads to stdout
for i in range(0, numSeq):
    sys.stdout.write(reads[i])
#8 | |
David Eccles (gringer)
Location: Wellington, New Zealand Join Date: May 2011
Posts: 838
Quote:

Edit: Just noticed that it won't work with multi-line FASTA or FASTQ files, but for most NGS things you don't need to worry about that.
#9 |
Senior Member
Location: Western Australia Join Date: Feb 2010
Posts: 308
kga1978, thanks for posting the script. I was just working on a perl script to do the same, but being a beginner at scripting it was going to take me an annoying amount of time. So thanks - now I can do something more productive today!

--------------
Ethan
#10 |
Senior Member
Location: Boston, MA Join Date: Nov 2010
Posts: 100
Hey gringer,

Thanks for your comments - very helpful, and I'll make sure to pass them on. For now though, I don't really mind holding everything in memory - it should make things faster, and memory isn't a problem (for now....). And actually, it won't work on stdin at the moment, as it requires the .fasta or .fastq extension to work (this could be easily fixed by specifying the file type, though...).
#11 |
Senior Member
Location: Boston, MA Join Date: Nov 2010
Posts: 100
Just to follow up on the first part of my question: I have now run a couple of subsample experiments and compared them to the entire dataset, as well as to 'head -n' and 'tail -n'. As noted by other posters, there are definitely biases when doing 'head' and 'tail'.

I subsampled 1 million reads from a total of 50 million and ran FastQC (to calculate avg. Q-score) and aligned the reads to hg19 using BWA. This is what I got:

[comparison table was posted as an image; not preserved]
#12 |
Senior Member
Location: Boston, MA Join Date: Nov 2010
Posts: 100
Also, following gringer's suggestion, here's a less memory-hungry version of the downsampling script:

Code:
#!/usr/bin/env python
import random
import sys

random.seed()

#name of the input file (fasta or fastq)
#assumes input file is standard fasta/fastq format
fileName = sys.argv[1]
#number of sequences to subsample
numSeq = int(sys.argv[2])

increment = 0
#if it's a fasta file
if fileName.find(".fasta") != -1:
    increment = 2
#else if it's a fastq file
elif fileName.find(".fastq") != -1:
    increment = 4
#quit if neither
else:
    sys.stdout.write("not a fasta/fastq file\n")
    sys.exit()

#first pass: collect the record indices
FILE = open(fileName, 'r')
totalReads = list()
index = 0
for line in FILE:
    if index % increment == 0:
        totalReads.append(index / increment)
    index += 1
FILE.close()

if len(totalReads) < numSeq:
    sys.stdout.write("You only have " + str(len(totalReads)) + " reads!\n")
    sys.exit()

#pick a random subset of record indices and sort them
ttl = len(totalReads)
random.shuffle(totalReads)
totalReads = totalReads[0:numSeq]
totalReads.sort()

#second pass: print only the selected records
FILE = open(fileName, 'r')
listIndex = 0
for i in range(0, ttl):
    curRead = ""
    for j in range(0, increment):
        curRead += FILE.readline()
    if i == totalReads[listIndex]:
        sys.stdout.write(curRead)
        listIndex += 1
        if listIndex == numSeq:
            break
FILE.close()
#13 |
Member
Location: Chicago, IL Join Date: Sep 2010
Posts: 13
Nobody asked for it, but I wanted it, so maybe somebody else can use it too.

Here's a rewrite of the python script that can now take two paired files as input and maintain pairing in the subsets. It can also still take just a single-end file as input. It's not pretty, but it does the job.

Changes: you now have to enter the number of reads desired first, then the input file name(s). Instead of going to stdout, the reads are output to out_1.fastq (or .fasta) and out_2.fastq (or .fasta). It would be relatively simple to put in an output prefix parameter, but this was good enough for my purposes.

Code:
#!/usr/bin/env python
import random
import sys

random.seed()

#number of sequences to subsample
numSeq = int(sys.argv[1])
#name of the input file(s) (fasta or fastq)
#assumes input files are standard fasta/fastq format
fileName1 = sys.argv[2]
#the second (mate) file is optional
fileName2 = ""
if len(sys.argv) > 3:
    fileName2 = sys.argv[3]

increment = 0
#if it's a fasta file
if fileName1.find(".fasta") != -1:
    increment = 2
#else if it's a fastq file
elif fileName1.find(".fastq") != -1:
    increment = 4
#quit if neither
else:
    sys.stdout.write("not a fasta/fastq file\n")
    sys.exit()

#first pass: collect record indices for file 1
FILE1 = open(fileName1, 'r')
totalReads1 = list()
index = 0
for line in FILE1:
    if index % increment == 0:
        totalReads1.append(index / increment)
    index += 1
FILE1.close()

if len(totalReads1) < numSeq:
    sys.stdout.write("You only have " + str(len(totalReads1)) + " reads!\n")
    sys.exit()

#check that the mate file (if any) has the same number of records
if fileName2:
    FILE2 = open(fileName2, 'r')
    totalReads2 = list()
    index = 0
    for line in FILE2:
        if index % increment == 0:
            totalReads2.append(index / increment)
        index += 1
    FILE2.close()
    if len(totalReads1) != len(totalReads2):
        sys.stdout.write("read counts do not match\n")
        sys.exit()

#pick a random subset of record indices and sort them
ttl = len(totalReads1)
random.shuffle(totalReads1)
totalReads1 = totalReads1[0:numSeq]
totalReads1.sort()

FILE1 = open(fileName1, 'r')
if fileName2:
    FILE2 = open(fileName2, 'r')
listIndex = 0

if increment == 4:
    OUTFILE1 = open('out_1.fastq', 'w')
    if fileName2:
        OUTFILE2 = open('out_2.fastq', 'w')
else:
    OUTFILE1 = open('out_1.fasta', 'w')
    if fileName2:
        OUTFILE2 = open('out_2.fasta', 'w')

#second pass: write only the selected records, keeping pairs in sync
for i in range(0, ttl):
    curRead1 = ""
    curRead2 = ""
    for j in range(0, increment):
        curRead1 += FILE1.readline()
        if fileName2:
            curRead2 += FILE2.readline()
    if i == totalReads1[listIndex]:
        OUTFILE1.write(curRead1)
        if fileName2:
            OUTFILE2.write(curRead2)
        listIndex += 1
        if listIndex == numSeq:
            break

FILE1.close()
OUTFILE1.close()
if fileName2:
    FILE2.close()
    OUTFILE2.close()
#14 | |
Junior Member
Location: balitmore, md Join Date: Jan 2012
Posts: 3
Quote:

Code:
>File_Name
GACGAACGCTGGCGGCGTGCCTAATACATGCAAGTCGAGCGAGTCTGCCTTGAAGATCGG
AGTGCTTGCACTCTGTGAAACAAGATACAGGCTAGCGGCGGACGGGTGAGTAACACGTGG...

i.e., the sequence is not one continuous string, but rather a set of lines. Do you know how to modify the code, or modify my files, to work with subsampler.py?
#15 | |
David Eccles (gringer)
Location: Wellington, New Zealand Join Date: May 2011
Posts: 838
Quote:

A rough pseudo-code for this is as follows:

First pass -- counting records
...
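The rest of that pseudo-code list did not survive the page, but the two-pass idea it opens with (first count the records, then select) can be sketched roughly like this for multi-line FASTA (a hypothetical sketch, not gringer's original):

```python
import random

def subsample_fasta(lines, k):
    """Pick k random multi-line FASTA records from a list of lines."""
    # first pass -- count records (every '>' line starts one)
    total = sum(1 for l in lines if l.startswith('>'))
    chosen = set(random.sample(range(total), k))
    # second pass -- emit only the lines belonging to chosen records
    out, idx = [], -1
    for l in lines:
        if l.startswith('>'):
            idx += 1
        if idx in chosen:
            out.append(l)
    return out

demo = ['>a', 'ACGT', 'ACGT', '>b', 'TTTT', '>c', 'GG', 'GG']
print(subsample_fasta(demo, 2))
```

On a real file the first pass would stream through it with a single counter rather than holding the lines in memory.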
#16 |
Senior Member
Location: San Diego Join Date: May 2008
Posts: 912
I think a simpler way to do it: just get all the reads from a particular set of tiles (I don't know how that translates on HiSeq machines). Since the first few and last few tiles are bad, pick some tiles in the middle by grepping the read names, which have the cluster coordinates embedded in them - or zgrep, if the fastqs are already zipped. That should be random with respect to where the reads align to the genome, with better quality than picking the top and bottom tiles with head or tail.

But yes, this won't work on multi-line fastqs.
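A sketch of that tile-filtering idea (hypothetical: the ':1101:' tile field and the demo read names assume Casava-style headers, and a plain grep across all four lines would be unsafe, so the records are joined onto single lines first):

```shell
# Two demo records from tiles 1101 and 2208 (hypothetical read names).
printf '@M00:1:FC:1:1101:5:9\nACGT\n+\nIIII\n@M00:1:FC:1:2208:6:8\nTTTT\n+\nJJJJ\n' > demo.fastq

# Join each 4-line record, keep only records from tile 1101, split back out.
paste - - - - < demo.fastq | grep ':1101:' | tr '\t' '\n' > tile.fastq
```

For gzipped input, replace the redirect with `zcat file.fastq.gz | paste - - - - | ...`.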
#17 |
Junior Member
Location: Sydney, Australia Join Date: Jun 2012
Posts: 4
Hey, just noticed this thread. This Python solution may be a bit late, but it will work on both multi-line fasta and fastq files. It is set up so that all files in the same dir will be subsampled, but this can be changed. Three subsamplings are also taken, but you can change this easily enough, or just use one of them.

Code:
import random
import glob

inFiles = glob.glob('*.fasta') + glob.glob('*.fastq')
outFiles = []
num = int(raw_input("Enter number of random sequences to select:\n"))

for i in range(len(inFiles)):
    #take three independent subsamplings of each file
    for k in range(3):
        fNames = []
        fSeqs = []
        same = 0
        outFiles.append(file(str(inFiles[i])+'_'+'Rand_'+str(num)+'-'+str(k+1)+'.fa', 'wt'))
        #read the file, gluing continuation lines onto the current sequence
        for line in open(inFiles[i]):
            if (line[0] == '>') or (line[0] == '@'):
                fNames.append(line)
                same = 0
            else:
                if same != 1:
                    fSeqs.append(line)
                    same = 1
                else:
                    fSeqs[len(fSeqs)-1] += line
        #pop random name/sequence pairs into the output file
        curr = len(outFiles)-1
        for j in range(num):
            a = random.randint(0, len(fNames)-1)
            outFiles[curr].write(fNames.pop(a))
            outFiles[curr].write(fSeqs.pop(a))

raw_input("Done.")
#18 | |
David Eccles (gringer)
Location: Wellington, New Zealand Join Date: May 2011
Posts: 838
Quote:

[post body not preserved]
#19 |
Senior Member
Location: bethesda Join Date: Feb 2009
Posts: 700
A simple script extracting random fastq records using reservoir sampling:

Code:
# usage: program inputfastq numsubsamples outputsubampfqfile
# example: ./job.fqreservoir SRR451888.fq 1000 subsamp.fq
cat $1 | \
awk '{if ((p%4)<3) printf("%s\t",$0); else printf("%s\n",$0); p++;}' | \
awk -v k=$2 'BEGIN{srand(systime() + PROCINFO["pid"]);}{y=x++<k?x-1:int(rand()*x);if(y<k)a[y]=$0}END{for(z in a)print a[z]}' | \
tr '\t' '\n' > $3

See previous discussion here: http://seqanswers.com/forums/showthread.php?t=16845

Note the actual line #3 is a hack of lh3's script WITH seeding, i.e. BEGIN{srand(systime() + PROCINFO["pid"]);}. Remove the seed code to get the same results every time you run it; otherwise leave it in to get different results.

This program joins the 4 fastq lines together into one line, separating them with a tab character. These combined lines are subsampled, and at the end the "tr" command separates the selected combined lines back into 4 lines. The script uses the standard UNIX/LINUX/BSD utilities "tr", "cat", and "awk".

Last edited by Richard Finney; 06-17-2012 at 09:39 AM.
#20 | |
David Eccles (gringer)
Location: Wellington, New Zealand Join Date: May 2011
Posts: 838
Quote:

http://blogs.msdn.com/b/spt/archive/...-sampling.aspx

The modified pseudo-code for a fasta/fastq file using reservoir sampling would be as follows:
...
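The pseudo-code list itself did not survive the page. As a rough illustration of the same idea (a hedged sketch, not gringer's original), reservoir sampling over whole 4-line FASTQ records looks like this:

```python
import random

def reservoir_sample_fastq(lines, k):
    """Uniformly sample k complete 4-line FASTQ records in one pass."""
    sample = []
    n = 0
    for i in range(0, len(lines), 4):
        rec = lines[i:i+4]
        if n < k:
            # fill the reservoir with the first k records
            sample.append(rec)
        else:
            # replace a random slot with probability k/(n+1)
            j = random.randint(0, n)
            if j < k:
                sample[j] = rec
        n += 1
    return sample

demo = []
for r in range(10):
    demo += ['@read%d' % r, 'ACGT', '+', 'IIII']
print(len(reservoir_sample_fastq(demo, 3)))   # 3
```

Unlike the two-pass scripts above, this never needs to know the record count in advance, so it works on piped input.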