![]() |
|
![]() |
||||
Thread | Thread Starter | Forum | Replies | Last Post |
where is the error in my input files? | shuang | Bioinformatics | 3 | 08-23-2011 02:23 AM |
input files for IMAGE | Maegwin | Bioinformatics | 4 | 04-22-2011 05:54 PM |
SVA input files | srd | Introductions | 0 | 03-16-2011 07:17 AM |
IMAGE input files | skingan | Genomic Resequencing | 0 | 07-29-2010 01:02 PM |
BWA - input files | Bruins | Bioinformatics | 2 | 07-07-2010 12:43 AM |
![]() |
|
Thread Tools |
![]() |
#1 |
Senior Member
Location: Boston, MA Join Date: Nov 2010
Posts: 100
|
![]()
Hi all,
I am using Novoalign to align 50bp illumina SE reads to a viral genome. The reads were processed with Casava 1.8, so I have a bunch of individual read files - e.g. read001.gz read002.gz . . read036.gz Now, is there a way I can tell Novoalign to align all of these reads in one go? (I know I could pipe it with gunzip, but I would rather not do that). I have tried to put them in a folder and point Novoalign like so: novoalign -f Reads/ -c 6 -F ILM1.8 --**ILQ_SKIP -d reference -k -o SAM 2> reads.novoalign_logS.txt | samtools view -S -b -q 1 - | samtools sort - reads_sortedS But that doesn't work - neither does replacing 'Reads/' with read0*.gz. Any thoughts? |
![]() |
![]() |
![]() |
#2 |
Senior Member
Location: Rochester, MN Join Date: Mar 2009
Posts: 191
|
![]()
Why not just align each of them independantly, then merge BAMS?
|
![]() |
![]() |
![]() |
#3 |
Senior Member
Location: Boston, MA Join Date: Nov 2010
Posts: 100
|
![]()
I guess I could do that, but that workflow isn't very optimal either - what we do isn't (currently) scripted, so I would have to make a billion commands to make this work.
I guess I will settle with the gunzip pipe and then hopefully NovoCraft will get this into a future release - other programs like Mosaik can easily handle multiple small files - and with the new Illumina output, it would be good for other programs to do that as well. Any other workarounds? |
![]() |
![]() |
![]() |
#4 |
Senior Member
Location: Birmingham, UK Join Date: Jul 2009
Posts: 356
|
![]()
Yeah, why not just gunzip -c *.gz > reads.fq and start from there?
|
![]() |
![]() |
![]() |
#5 |
Senior Member
Location: Boston, MA Join Date: Nov 2010
Posts: 100
|
![]()
Hi nickloman,
Ya, that may be the best option - I'm just worried that this will slow down the whole process, but maybe not that much. I guess I could test it with a real experiment ![]() |
![]() |
![]() |
![]() |
#6 |
Senior Member
Location: Birmingham, UK Join Date: Jul 2009
Posts: 356
|
![]()
Yep "why think, why not do the experiment"
![]() |
![]() |
![]() |
![]() |
#7 |
Senior Member
Location: Boston, MA Join Date: Nov 2010
Posts: 100
|
![]()
Bummer, I spoke too soon - I thought Novoalign could read from stdin, but that doesn't appear to be the case. I tried the following commands, but none of them work:
gunzip *.gz -c | novoalign -f - -c 6 -F ILM1.8 **ILQ_SKIP -d reference -k -o SAM gunzip *.gz -c | novoalign -f stdin -c 6 -F ILM1.8 **ILQ_SKIP -d reference -k -o SAM gunzip *.gz -c | novoalign -c 6 -F ILM1.8 **ILQ_SKIP -d reference -k -o SAM I could of course gunzip, cat and save the file and then use that as the input, but again, that workflow really is pretty far from optimal. Bummer. NovoCraft - we need the ability to input more than one sequence file (or two for PE) with the new Casava 1.8 pipeline. |
![]() |
![]() |
![]() |
#8 |
Senior Member
Location: Birmingham, UK Join Date: Jul 2009
Posts: 356
|
![]()
That was the point of my post, simply "gunzip -c *.gz > reads.fq" and then use reads.fq as the input.
|
![]() |
![]() |
![]() |
#9 |
Senior Member
Location: Birmingham, UK Join Date: Jul 2009
Posts: 356
|
![]()
Also if you really want pipes, you might find /dev/stdin or /dev/fd/0 may work.
|
![]() |
![]() |
![]() |
#10 |
Senior Member
Location: Boston, MA Join Date: Nov 2010
Posts: 100
|
![]()
You know what - that actually worked....
gunzip *.gz -c | novoalign -f /dev/stdin -F ILM1.8 --ILQ_USE.... I'll settle with that option - not too many compromises here. Thanks nickloman! |
![]() |
![]() |
![]() |
#11 | |
Senior Member
Location: Rochester, MN Join Date: Mar 2009
Posts: 191
|
![]() Quote:
|
|
![]() |
![]() |
![]() |
#12 |
Senior Member
Location: Boston, MA Join Date: Nov 2010
Posts: 100
|
![]()
Hi RockChalkJayhawk
I guess, by 'optimal' I meant 'so I don't have to write too many commands'. I'm not familiar with bash loops, but I just googled it and it looks fairly easy - thanks for heads up. I have been avoiding any sort of scripting (we don't have that many sequences and our genomes are small), but I guess I need to take the plunge ![]() |
![]() |
![]() |
![]() |
#13 |
Senior Member
Location: Rochester, MN Join Date: Mar 2009
Posts: 191
|
![]()
if the names of your files are read001.gz, read002.gz, read036.gz, etc. do
Code:
for x in read *gz do gunzip $x| novoalign -f /dev/stdin -F ILM1.8 --ILQ_USE....{enter the rest here] done |
![]() |
![]() |
![]() |
#14 | |
Senior Member
Location: 45°30'25.22"N / 9°15'53.00"E Join Date: Apr 2009
Posts: 258
|
![]() Quote:
https://bitbucket.org/dawe/utils/src...e6/makeMake.sh Just modify it for novoalign, like this: Code:
#!/bin/bash # Look for binaries we need, exit if not found SAMTOOLS=/home/elia/bin/samtools NOVOALIGN=/home/elia/dawe/novocraft/novoalign INDEXPATH=/home/elia/dawe/index [[ -z ${BWA} ]] && exit 1 [[ -z ${SAMTOOLS} ]] && exit 1 # Use the SampleSheet.csv to create the script and align # SampleSheets has a header we want to skip nlines=(`wc -l SampleSheet.csv`) while read line do # There may be reasons to skip this line... i.e. when there's no indexed genome Skip=1 # Split the SampleSheet line using commas, store values we need (ID, Lane, Genome and Index) IFS="," lineArray=($line) Lane=`printf "%03d" ${lineArray[1]}` ID=${lineArray[2]} Genome=${lineArray[3]} if [ -z ${lineArray[4]} ] then # If there's no index, set it to NoIndex, just like Illumina does Index="NoIndex" else Index=${lineArray[4]} fi # Create the Experiment name, this is also the name Illumina gives to files expName=${ID}_${Index}_L${Lane} # Check how many chunks we have, this will be the number of SGE array tasks ntask=`ls ${ID}_${Index}_L${Lane}_R1_*.fastq.gz | wc -l` # Also check if this is paired end sequencing. This changes the bwa line isPE=0 if [ -f ${ID}_${Index}_L${Lane}_R2_001.fastq.gz ] then isPE=1 fi # Ok, what if we don't have the genome? [[ -d ${INDEXPATH}/${Genome} ]] && Skip=0 if [ $Skip -eq 1 ] then echo "Missing ${Genome}, you should correct the file by hand (align_${expName}) and qsub it" GFile="FIXME" else # If there's the genome, look for the proper index (assuming there's one index per directory ^__^ ) GFile=`ls ${INDEXPATH}/${Genome}/*.nix` #GFile=${GFile%%.bwt} echo "Using ${GFile}" fi done < <(tail -n $(( ${nlines[0]} - 1 )) SampleSheet.csv) # generate Makefile unset IFS # set the list of BAM files bamTargets='' ibam='' for x in `seq ${ntask}` do chunk=$(printf "%03d" ${x}) ibam="${ibam} INPUT=${expName}_${chunk}.bam" bamTargets="${bamTargets} ${expName}_${chunk}.bam" done cat > Makefile << ENDOFMAKE SHELL = /bin/bash NOVO = /home/elia/dawe/novocraft/novoalign SAMTOOLS = ${SAMTOOLS} REFINDEX = ${GFile} PICARD_BASE = /home/elia/dawe/software/picard-tools-1.56 all: ${expName}.bam.bai /bin/mv ${expName}.bam ${expName}.bam.lock /bin/rm -f ${bamTargets} /bin/mv ${expName}.bam.lock ${expName}.bam @echo "Time to give some meaning to these files" ${expName}.bam: ${bamTargets} java -jar \$(PICARD_BASE)/MergeSamFiles.jar OUTPUT=${expName}.bam ${ibam} ASSUME_SORTED=True SORT_ORDER=coordinate VALIDATION_STRINGENCY=SILENT ${expName}.bam.bai: ${expName}.bam \$(SAMTOOLS) index ${expName}.bam ENDOFMAKE for x in `seq ${ntask}` do chunk=`printf "%03d" ${x}` if [ $isPE == 1 ] then cat >> Makefile << ENDOFMAKE1 ${expName}_${chunk}.bam: ${expName}_R1_${chunk}.fastq.gz ${expName}_R2_${chunk}.fastq.gz \$(NOVO) -d \$(REFINDEX) -f ${expName}_R1_${chunk}.fastq.gz ${expName}_R2_${chunk}.fastq.gz -i 200,50 -o SAM | \$(SAMTOOLS) view -Su - | \$(SAMTOOLS) sort - ${expName}_${chunk} ENDOFMAKE1 else cat >> Makefile << ENDOFMAKE2 ${expName}_${chunk}.bam: ${expName}_R1_${chunk}.fastq.gz \$(BWA) samse \$(BWAOPT_SE) \$(REFINDEX) <(\$(BWA) aln \$(BWAOPT_ALN) \$(REFINDEX) ${expName}_R1_${chunk}.fastq.gz) ${expName}_R1_${chunk}.fastq.gz | \$(SAMTOOLS) view -Su - | \$(SAMTOOLS) sort - ${expName}_${chunk} ENDOFMAKE2 fi done |
|
![]() |
![]() |
![]() |
#15 | |
Senior Member
Location: Kuala Lumpur, Malaysia Join Date: Mar 2008
Posts: 126
|
![]() Quote:
One issue with Novoalign is that it reads a few records to check the file format then rewinds the files and starts again. If it detects a named pipe as input then it skips the file format check and you must use the -F option. I haven't tried with -f /dev/stdin but I expect it appears as a named pipe to Novoalign so the above should work. Before we implemented use of gzipped input files in Novoalign we were doing paired end like this: mkfifo read1_pipe read2_pipe gunzip -c read1.fastq.gz > read1.pipe gunzip -c read2.fastq.gz > read2.pipe novoalign -f read1.pipe read2.pipe -F STDFQ .... The same technique could be used to process multiple paired read files. Colin |
|
![]() |
![]() |
![]() |
#16 | |
Senior Member
Location: St. Louis, MO, USA Join Date: Apr 2011
Posts: 124
|
![]() Quote:
thanks! Justin |
|
![]() |
![]() |
![]() |
#17 | |
Senior Member
Location: Kuala Lumpur, Malaysia Join Date: Mar 2008
Posts: 126
|
![]() Quote:
You can do multiple paired end files using named pipes mkfifo read1_pipe read2_pipe cat lane1_read1.fastq lane2_read1.fastq lane3_read1.fastq > read1.pipe cat lane1_read2.fastq lane2_read2.fastq lane3_read2.fastq > read2.pipe novoalign -f read1.pipe read2.pipe -F STDFQ .... rm read1_pipe read2_pipe You could also just cat all the fastq files into one one real file rather than a named pipe but that uses extra disk space. Usually we would run a novoalign for each file of reads and then sort each report and merge the files. This gives a slight advantage in that multiple sorts can be run at the same time. When running samtools sort we usually write uncompressed bam to save CPU for compression. The other question is will the results be different? Each read is aligned independently even multi-hit reads, if you use -r Random then alignment is chosen randomly from alignments for that read only. However, it is possible that they will be very slightly different. The first issue for paired end is that Novoalign adds a fragment length penalty to paired alignments and this can affect which alignment is reported for some pairs. It's not often that this penalty changes the alignment location but it can for multi-mapped reads. The initial fragment length penalties in Novoalign are calculated from a normal distribution using the mean and standard deviation entered on the -i option. However as pairs are aligned Novoalign builds a histogram of actual proper pair lengths and then starts to recalculate penalties based on the actual distribution. So if the actual fragment length distribution is not normal or the initial -i settings are not accurate then the fragment length penalties can change. Basically the first few thousand alignments for each run will have slightly different fragment length penalties than the later pairs. Setting -i accurately will minimise any differences. In any case, if you have millions of reads the start up effect is probably not noticeable. If you do merge read files they should all be from the same sample prep and all have the same fragment length distribution. The second issue would be if you used quality calibration, the -k option. This also starts off on the assumption that qualities are accurate and then slowly adjusts as more reads are aligned so the first few thousand reads in each run will not have the benefit of quality calibration. (You can avoid this using by doing a trial run to collect calibration data) But calibration is also a good reason to run separate Novoaligns for each lane of reads as each lane should be calibrated separately, I've seen cases where individual lanes had very different calibration profiles. So it's possible to process multiple files in this way but I don't see a real benefit. Kind Regards, Colin |
|
![]() |
![]() |
![]() |
#18 |
Senior Member
Location: St. Louis, MO, USA Join Date: Apr 2011
Posts: 124
|
![]()
Hi Colin,
Thank you for the very helpful information. I was able to get the named pipes up and running, and I think that solution will work for us or at least be the easiest thing to place into the pipeline with minimum changes. I'll also see if aligning them individually provides some performance gains since we are running the alignments on a cluster. Writing the pre-merge files to an uncompressed BAM is also a nice idea. Thanks for the detailed explanations of the differences we may see since we would need to be able to explain the differences if we switch to aligning one big file or multiple small files. Thanks for all the work! Justin |
![]() |
![]() |
![]() |
#19 | |
Senior Member
Location: 45°30'25.22"N / 9°15'53.00"E Join Date: Apr 2009
Posts: 258
|
![]() Quote:
d |
|
![]() |
![]() |
![]() |
Thread Tools | |
|
|