SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
where is the error in my input files? shuang Bioinformatics 3 08-23-2011 02:23 AM
input files for IMAGE Maegwin Bioinformatics 4 04-22-2011 05:54 PM
SVA input files srd Introductions 0 03-16-2011 07:17 AM
IMAGE input files skingan Genomic Resequencing 0 07-29-2010 01:02 PM
BWA - input files Bruins Bioinformatics 2 07-07-2010 12:43 AM

Reply
 
Thread Tools
Old 11-15-2011, 07:02 PM   #1
kga1978
Senior Member
 
Location: Boston, MA

Join Date: Nov 2010
Posts: 100
Default Several input files w. Novoalign?

Hi all,

I am using Novoalign to align 50bp illumina SE reads to a viral genome. The reads were processed with Casava 1.8, so I have a bunch of individual read files - e.g.

read001.gz
read002.gz
.
.
read036.gz

Now, is there a way I can tell Novoalign to align all of these reads in one go? (I know I could pipe it with gunzip, but I would rather not do that). I have tried to put them in a folder and point Novoalign like so:

novoalign -f Reads/ -c 6 -F ILM1.8 --**ILQ_SKIP -d reference -k -o SAM 2> reads.novoalign_logS.txt | samtools view -S -b -q 1 - | samtools sort - reads_sortedS

But that doesn't work - neither does replacing 'Reads/' with read0*.gz.

Any thoughts?
kga1978 is offline   Reply With Quote
Old 11-15-2011, 11:15 PM   #2
RockChalkJayhawk
Senior Member
 
Location: Rochester, MN

Join Date: Mar 2009
Posts: 191
Default

Why not just align each of them independantly, then merge BAMS?
RockChalkJayhawk is offline   Reply With Quote
Old 11-16-2011, 06:49 AM   #3
kga1978
Senior Member
 
Location: Boston, MA

Join Date: Nov 2010
Posts: 100
Default

I guess I could do that, but that workflow isn't very optimal either - what we do isn't (currently) scripted, so I would have to make a billion commands to make this work.

I guess I will settle with the gunzip pipe and then hopefully NovoCraft will get this into a future release - other programs like Mosaik can easily handle multiple small files - and with the new Illumina output, it would be good for other programs to do that as well.

Any other workarounds?
kga1978 is offline   Reply With Quote
Old 11-16-2011, 06:58 AM   #4
nickloman
Senior Member
 
Location: Birmingham, UK

Join Date: Jul 2009
Posts: 356
Default

Yeah, why not just gunzip -c *.gz > reads.fq and start from there?
nickloman is offline   Reply With Quote
Old 11-16-2011, 07:00 AM   #5
kga1978
Senior Member
 
Location: Boston, MA

Join Date: Nov 2010
Posts: 100
Default

Hi nickloman,

Ya, that may be the best option - I'm just worried that this will slow down the whole process, but maybe not that much. I guess I could test it with a real experiment .
kga1978 is offline   Reply With Quote
Old 11-16-2011, 07:01 AM   #6
nickloman
Senior Member
 
Location: Birmingham, UK

Join Date: Jul 2009
Posts: 356
Default

Yep "why think, why not do the experiment"
nickloman is offline   Reply With Quote
Old 11-16-2011, 07:26 AM   #7
kga1978
Senior Member
 
Location: Boston, MA

Join Date: Nov 2010
Posts: 100
Default

Bummer, I spoke too soon - I thought Novoalign could read from stdin, but that doesn't appear to be the case. I tried the following commands, but none of them work:

gunzip *.gz -c | novoalign -f - -c 6 -F ILM1.8 **ILQ_SKIP -d reference -k -o SAM
gunzip *.gz -c | novoalign -f stdin -c 6 -F ILM1.8 **ILQ_SKIP -d reference -k -o SAM
gunzip *.gz -c | novoalign -c 6 -F ILM1.8 **ILQ_SKIP -d reference -k -o SAM

I could of course gunzip, cat and save the file and then use that as the input, but again, that workflow really is pretty far from optimal.

Bummer. NovoCraft - we need the ability to input more than one sequence file (or two for PE) with the new Casava 1.8 pipeline.
kga1978 is offline   Reply With Quote
Old 11-16-2011, 07:28 AM   #8
nickloman
Senior Member
 
Location: Birmingham, UK

Join Date: Jul 2009
Posts: 356
Default

That was the point of my post, simply "gunzip -c *.gz > reads.fq" and then use reads.fq as the input.
nickloman is offline   Reply With Quote
Old 11-16-2011, 07:30 AM   #9
nickloman
Senior Member
 
Location: Birmingham, UK

Join Date: Jul 2009
Posts: 356
Default

Also if you really want pipes, you might find /dev/stdin or /dev/fd/0 may work.
nickloman is offline   Reply With Quote
Old 11-16-2011, 07:44 AM   #10
kga1978
Senior Member
 
Location: Boston, MA

Join Date: Nov 2010
Posts: 100
Default

You know what - that actually worked....

gunzip *.gz -c | novoalign -f /dev/stdin -F ILM1.8 --ILQ_USE....

I'll settle with that option - not too many compromises here.

Thanks nickloman!
kga1978 is offline   Reply With Quote
Old 11-16-2011, 04:46 PM   #11
RockChalkJayhawk
Senior Member
 
Location: Rochester, MN

Join Date: Mar 2009
Posts: 191
Default

Quote:
Originally Posted by kga1978 View Post
I guess I could do that, but that workflow isn't very optimal either - what we do isn't (currently) scripted, so I would have to make a billion commands to make this work.

I guess I will settle with the gunzip pipe and then hopefully NovoCraft will get this into a future release - other programs like Mosaik can easily handle multiple small files - and with the new Illumina output, it would be good for other programs to do that as well.

Any other workarounds?
I'm not quite sure what you mean about being optimal or a billlion commands. You will use the same amount of resources to do the alignment. It would be very simple to construct a bash loop to submit all your jobs too.
RockChalkJayhawk is offline   Reply With Quote
Old 11-16-2011, 04:59 PM   #12
kga1978
Senior Member
 
Location: Boston, MA

Join Date: Nov 2010
Posts: 100
Default

Hi RockChalkJayhawk

I guess, by 'optimal' I meant 'so I don't have to write too many commands'. I'm not familiar with bash loops, but I just googled it and it looks fairly easy - thanks for heads up. I have been avoiding any sort of scripting (we don't have that many sequences and our genomes are small), but I guess I need to take the plunge .
kga1978 is offline   Reply With Quote
Old 11-16-2011, 05:04 PM   #13
RockChalkJayhawk
Senior Member
 
Location: Rochester, MN

Join Date: Mar 2009
Posts: 191
Default

if the names of your files are read001.gz, read002.gz, read036.gz, etc. do
Code:
for x in read *gz
do
gunzip $x| novoalign -f /dev/stdin -F ILM1.8 --ILQ_USE....{enter the rest here]
done
RockChalkJayhawk is offline   Reply With Quote
Old 11-21-2011, 03:13 PM   #14
dawe
Senior Member
 
Location: 4530'25.22"N / 915'53.00"E

Join Date: Apr 2009
Posts: 258
Default

Quote:
Originally Posted by kga1978 View Post
Hi nickloman,

Ya, that may be the best option - I'm just worried that this will slow down the whole process, but maybe not that much. I guess I could test it with a real experiment .
You may try this:

https://bitbucket.org/dawe/utils/src...e6/makeMake.sh

Just modify it for novoalign, like this:

Code:
#!/bin/bash

# Look for binaries we need, exit if not found

SAMTOOLS=/home/elia/bin/samtools
NOVOALIGN=/home/elia/dawe/novocraft/novoalign
INDEXPATH=/home/elia/dawe/index

[[ -z ${BWA} ]] && exit 1
[[ -z ${SAMTOOLS} ]] && exit 1

# Use the SampleSheet.csv to create the script and align
# SampleSheets has a header we want to skip

nlines=(`wc -l SampleSheet.csv`)
while read line
do
  # There may be reasons to skip this line... i.e. when there's no indexed genome 
  Skip=1

  # Split the SampleSheet line using commas, store values we need (ID, Lane, Genome and Index)
  IFS=","
  lineArray=($line)
  Lane=`printf "%03d" ${lineArray[1]}`
  ID=${lineArray[2]}
  Genome=${lineArray[3]}
  if [ -z ${lineArray[4]} ]
  then 
    # If there's no index, set it to NoIndex, just like Illumina does
    Index="NoIndex"
  else 
    Index=${lineArray[4]}
  fi
  
  # Create the Experiment name, this is also the name Illumina gives to files
  expName=${ID}_${Index}_L${Lane}
  
  # Check how many chunks we have, this will be the number of SGE array tasks

  ntask=`ls ${ID}_${Index}_L${Lane}_R1_*.fastq.gz | wc -l`

  # Also check if this is paired end sequencing. This changes the bwa line
  isPE=0
  if [ -f ${ID}_${Index}_L${Lane}_R2_001.fastq.gz ]
  then
    isPE=1
  fi  
  
  # Ok, what if we don't have the genome?
  [[ -d ${INDEXPATH}/${Genome} ]] && Skip=0
  if [ $Skip -eq 1 ]
  then
    echo "Missing ${Genome}, you should correct the file by hand (align_${expName}) and qsub it"
    GFile="FIXME"
  else
    # If there's the genome, look for the proper index (assuming there's one index per directory ^__^ )
    GFile=`ls ${INDEXPATH}/${Genome}/*.nix`
    #GFile=${GFile%%.bwt}
    echo "Using ${GFile}"
  fi  
done < <(tail -n $(( ${nlines[0]} - 1 )) SampleSheet.csv)

# generate Makefile

unset IFS

# set the list of BAM files
bamTargets=''
ibam=''
for x in `seq ${ntask}`
do
  chunk=$(printf "%03d" ${x})
  ibam="${ibam} INPUT=${expName}_${chunk}.bam"
  bamTargets="${bamTargets} ${expName}_${chunk}.bam"
done  

cat > Makefile << ENDOFMAKE
SHELL = /bin/bash
NOVO = /home/elia/dawe/novocraft/novoalign 
SAMTOOLS = ${SAMTOOLS}
REFINDEX = ${GFile}
PICARD_BASE = /home/elia/dawe/software/picard-tools-1.56

all: ${expName}.bam.bai
	/bin/mv ${expName}.bam ${expName}.bam.lock
	/bin/rm -f ${bamTargets}
	/bin/mv ${expName}.bam.lock ${expName}.bam
	@echo "Time to give some meaning to these files"

${expName}.bam: ${bamTargets}
	java -jar  \$(PICARD_BASE)/MergeSamFiles.jar OUTPUT=${expName}.bam ${ibam} ASSUME_SORTED=True SORT_ORDER=coordinate VALIDATION_STRINGENCY=SILENT

${expName}.bam.bai: ${expName}.bam
	\$(SAMTOOLS) index ${expName}.bam

ENDOFMAKE

for x in `seq ${ntask}`
do
  chunk=`printf "%03d" ${x}`
  if [ $isPE == 1 ]
  then
    cat >> Makefile << ENDOFMAKE1
${expName}_${chunk}.bam: ${expName}_R1_${chunk}.fastq.gz ${expName}_R2_${chunk}.fastq.gz
	\$(NOVO) -d \$(REFINDEX) -f ${expName}_R1_${chunk}.fastq.gz ${expName}_R2_${chunk}.fastq.gz -i 200,50 -o SAM | \$(SAMTOOLS) view -Su - | \$(SAMTOOLS) sort - ${expName}_${chunk}
	
ENDOFMAKE1

  else
    cat >> Makefile << ENDOFMAKE2
${expName}_${chunk}.bam: ${expName}_R1_${chunk}.fastq.gz
	\$(BWA) samse \$(BWAOPT_SE) \$(REFINDEX) <(\$(BWA) aln \$(BWAOPT_ALN) \$(REFINDEX) ${expName}_R1_${chunk}.fastq.gz) ${expName}_R1_${chunk}.fastq.gz | \$(SAMTOOLS) view -Su - | \$(SAMTOOLS) sort - ${expName}_${chunk}
	
ENDOFMAKE2

  fi
done
Launch it in each Sample directory created by CASAVA, you will find a Makefile. That done you may just issue make -j N (just like CASAVA) to perform parallel processing of your fastq.gz files. Everything will be done up to your final merged sorted BAM file.
dawe is offline   Reply With Quote
Old 11-21-2011, 09:03 PM   #15
sparks
Senior Member
 
Location: Kuala Lumpur, Malaysia

Join Date: Mar 2008
Posts: 126
Default

Quote:
Originally Posted by kga1978 View Post
You know what - that actually worked....

gunzip *.gz -c | novoalign -f /dev/stdin -F ILM1.8 --ILQ_USE....

I'll settle with that option - not too many compromises here.

Thanks nickloman!
That's a neat way and it may improve performance as decompression is in its own thread.
One issue with Novoalign is that it reads a few records to check the file format then rewinds the files and starts again. If it detects a named pipe as input then it skips the file format check and you must use the -F option. I haven't tried with -f /dev/stdin but I expect it appears as a named pipe to Novoalign so the above should work.

Before we implemented use of gzipped input files in Novoalign we were doing paired end like this:

mkfifo read1_pipe read2_pipe
gunzip -c read1.fastq.gz > read1.pipe
gunzip -c read2.fastq.gz > read2.pipe
novoalign -f read1.pipe read2.pipe -F STDFQ ....

The same technique could be used to process multiple paired read files.

Colin
sparks is offline   Reply With Quote
Old 11-22-2011, 12:08 PM   #16
BAMseek
Senior Member
 
Location: St. Louis, MO, USA

Join Date: Apr 2011
Posts: 124
Default

Quote:
Originally Posted by RockChalkJayhawk View Post
Why not just align each of them independantly, then merge BAMS?
I am also currently trying to figure out the best way to go from multiple fastq files (produced by CASAVA 1.8) to alignment with Novoalign. I am not too familiar with the internal workings of novoalign - would aligning the fastq files separately then merging the BAMs give a different result than merging the FASTQs first then aligning? For example, does Novoalign choose the best hit of a multi-hit read based on other aligned reads?

thanks!
Justin
BAMseek is offline   Reply With Quote
Old 11-22-2011, 05:40 PM   #17
sparks
Senior Member
 
Location: Kuala Lumpur, Malaysia

Join Date: Mar 2008
Posts: 126
Default

Quote:
Originally Posted by BAMseek View Post
I am also currently trying to figure out the best way to go from multiple fastq files (produced by CASAVA 1.8) to alignment with Novoalign. I am not too familiar with the internal workings of novoalign - would aligning the fastq files separately then merging the BAMs give a different result than merging the FASTQs first then aligning? For example, does Novoalign choose the best hit of a multi-hit read based on other aligned reads?

thanks!
Justin
Hi Justin,

You can do multiple paired end files using named pipes

mkfifo read1_pipe read2_pipe
cat lane1_read1.fastq lane2_read1.fastq lane3_read1.fastq > read1.pipe
cat lane1_read2.fastq lane2_read2.fastq lane3_read2.fastq > read2.pipe
novoalign -f read1.pipe read2.pipe -F STDFQ ....
rm read1_pipe read2_pipe

You could also just cat all the fastq files into one one real file rather than a named pipe but that uses extra disk space.

Usually we would run a novoalign for each file of reads and then sort each report and merge the files. This gives a slight advantage in that multiple sorts can be run at the same time. When running samtools sort we usually write uncompressed bam to save CPU for compression.

The other question is will the results be different? Each read is aligned independently even multi-hit reads, if you use -r Random then alignment is chosen randomly from alignments for that read only.

However, it is possible that they will be very slightly different.

The first issue for paired end is that Novoalign adds a fragment length penalty to paired alignments and this can affect which alignment is reported for some pairs. It's not often that this penalty changes the alignment location but it can for multi-mapped reads. The initial fragment length penalties in Novoalign are calculated from a normal distribution using the mean and standard deviation entered on the -i option. However as pairs are aligned Novoalign builds a histogram of actual proper pair lengths and then starts to recalculate penalties based on the actual distribution. So if the actual fragment length distribution is not normal or the initial -i settings are not accurate then the fragment length penalties can change. Basically the first few thousand alignments for each run will have slightly different fragment length penalties than the later pairs. Setting -i accurately will minimise any differences. In any case, if you have millions of reads the start up effect is probably not noticeable. If you do merge read files they should all be from the same sample prep and all have the same fragment length distribution.

The second issue would be if you used quality calibration, the -k option. This also starts off on the assumption that qualities are accurate and then slowly adjusts as more reads are aligned so the first few thousand reads in each run will not have the benefit of quality calibration. (You can avoid this using by doing a trial run to collect calibration data) But calibration is also a good reason to run separate Novoaligns for each lane of reads as each lane should be calibrated separately, I've seen cases where individual lanes had very different calibration profiles.

So it's possible to process multiple files in this way but I don't see a real benefit.

Kind Regards, Colin
sparks is offline   Reply With Quote
Old 11-22-2011, 10:51 PM   #18
BAMseek
Senior Member
 
Location: St. Louis, MO, USA

Join Date: Apr 2011
Posts: 124
Default

Hi Colin,

Thank you for the very helpful information. I was able to get the named pipes up and running, and I think that solution will work for us or at least be the easiest thing to place into the pipeline with minimum changes. I'll also see if aligning them individually provides some performance gains since we are running the alignments on a cluster. Writing the pre-merge files to an uncompressed BAM is also a nice idea. Thanks for the detailed explanations of the differences we may see since we would need to be able to explain the differences if we switch to aligning one big file or multiple small files.

Thanks for all the work!
Justin
BAMseek is offline   Reply With Quote
Old 11-23-2011, 01:57 AM   #19
dawe
Senior Member
 
Location: 4530'25.22"N / 915'53.00"E

Join Date: Apr 2009
Posts: 258
Default

Quote:
Originally Posted by sparks View Post
So it's possible to process multiple files in this way but I don't see a real benefit.
If you have a clustered environment, aligning separate files separately allows you to spawn your process across the cluster. It is true that novoalign supports this with MPI, nevertheless I still prefer different processes in a Map/Reduce-like fashion.

d
dawe is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 01:34 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO