Unconfigured Ad

**RockChalkJayhawk** · 11-15-2011, 11:15 PM

Why not just align each of them independantly, then merge BAMS?

**kga1978** · 11-16-2011, 06:49 AM

I guess I could do that, but that workflow isn't very optimal either - what we do isn't (currently) scripted, so I would have to make a billion commands to make this work.

I guess I will settle with the gunzip pipe and then hopefully NovoCraft will get this into a future release - other programs like Mosaik can easily handle multiple small files - and with the new Illumina output, it would be good for other programs to do that as well.

Any other workarounds?

**nickloman** · 11-16-2011, 06:58 AM

Yeah, why not just gunzip -c *.gz > reads.fq and start from there?

**kga1978** · 11-16-2011, 07:00 AM

Hi nickloman,

Ya, that may be the best option - I'm just worried that this will slow down the whole process, but maybe not that much. I guess I could test it with a real experiment

.

**nickloman** · 11-16-2011, 07:01 AM

Yep "why think, why not do the experiment"

**kga1978** · 11-16-2011, 07:26 AM

Bummer, I spoke too soon - I thought Novoalign could read from stdin, but that doesn't appear to be the case. I tried the following commands, but none of them work:

gunzip *.gz -c | novoalign -f - -c 6 -F ILM1.8 **ILQ_SKIP -d reference -k -o SAM
gunzip *.gz -c | novoalign -f stdin -c 6 -F ILM1.8 **ILQ_SKIP -d reference -k -o SAM
gunzip *.gz -c | novoalign -c 6 -F ILM1.8 **ILQ_SKIP -d reference -k -o SAM

I could of course gunzip, cat and save the file and then use that as the input, but again, that workflow really is pretty far from optimal.

Bummer. NovoCraft - we need the ability to input more than one sequence file (or two for PE) with the new Casava 1.8 pipeline.

**nickloman** · 11-16-2011, 07:28 AM

That was the point of my post, simply "gunzip -c *.gz > reads.fq" and then use reads.fq as the input.

**nickloman** · 11-16-2011, 07:30 AM

Also if you really want pipes, you might find /dev/stdin or /dev/fd/0 may work.

**kga1978** · 11-16-2011, 07:44 AM

You know what - that actually worked....

gunzip *.gz -c | novoalign -f /dev/stdin -F ILM1.8 --ILQ_USE....

I'll settle with that option - not too many compromises here.

Thanks nickloman!

**RockChalkJayhawk** · 11-16-2011, 04:46 PM

Originally posted by kga1978 View Post

I guess I could do that, but that workflow isn't very optimal either - what we do isn't (currently) scripted, so I would have to make a billion commands to make this work.

I guess I will settle with the gunzip pipe and then hopefully NovoCraft will get this into a future release - other programs like Mosaik can easily handle multiple small files - and with the new Illumina output, it would be good for other programs to do that as well.

Any other workarounds?

I'm not quite sure what you mean about being optimal or a billlion commands. You will use the same amount of resources to do the alignment. It would be very simple to construct a bash loop to submit all your jobs too.

**kga1978** · 11-16-2011, 04:59 PM

Hi RockChalkJayhawk

I guess, by 'optimal' I meant 'so I don't have to write too many commands'. I'm not familiar with bash loops, but I just googled it and it looks fairly easy - thanks for heads up. I have been avoiding any sort of scripting (we don't have that many sequences and our genomes are small), but I guess I need to take the plunge

.

**RockChalkJayhawk** · 11-16-2011, 05:04 PM

if the names of your files are read001.gz, read002.gz, read036.gz, etc. do

Code:

for x in read *gz
do
gunzip $x| novoalign -f /dev/stdin -F ILM1.8 --ILQ_USE....{enter the rest here]
done

**dawe** · 11-21-2011, 03:13 PM

Originally posted by kga1978 View Post

Hi nickloman,

Ya, that may be the best option - I'm just worried that this will slow down the whole process, but maybe not that much. I guess I could test it with a real experiment

.

You may try this:

404 — Bitbucket

https://bitbucket.org/dawe/utils/src/0cde7b2d87e6/makeMake.sh

Just modify it for novoalign, like this:

Code:

#!/bin/bash

# Look for binaries we need, exit if not found

SAMTOOLS=/home/elia/bin/samtools
NOVOALIGN=/home/elia/dawe/novocraft/novoalign
INDEXPATH=/home/elia/dawe/index

[[ -z ${BWA} ]] && exit 1
[[ -z ${SAMTOOLS} ]] && exit 1

# Use the SampleSheet.csv to create the script and align
# SampleSheets has a header we want to skip

nlines=(`wc -l SampleSheet.csv`)
while read line
do
  # There may be reasons to skip this line... i.e. when there's no indexed genome 
  Skip=1

  # Split the SampleSheet line using commas, store values we need (ID, Lane, Genome and Index)
  IFS=","
  lineArray=($line)
  Lane=`printf "%03d" ${lineArray[1]}`
  ID=${lineArray[2]}
  Genome=${lineArray[3]}
  if [ -z ${lineArray[4]} ]
  then 
    # If there's no index, set it to NoIndex, just like Illumina does
    Index="NoIndex"
  else 
    Index=${lineArray[4]}
  fi
  
  # Create the Experiment name, this is also the name Illumina gives to files
  expName=${ID}_${Index}_L${Lane}
  
  # Check how many chunks we have, this will be the number of SGE array tasks

  ntask=`ls ${ID}_${Index}_L${Lane}_R1_*.fastq.gz | wc -l`

  # Also check if this is paired end sequencing. This changes the bwa line
  isPE=0
  if [ -f ${ID}_${Index}_L${Lane}_R2_001.fastq.gz ]
  then
    isPE=1
  fi  
  
  # Ok, what if we don't have the genome?
  [[ -d ${INDEXPATH}/${Genome} ]] && Skip=0
  if [ $Skip -eq 1 ]
  then
    echo "Missing ${Genome}, you should correct the file by hand (align_${expName}) and qsub it"
    GFile="FIXME"
  else
    # If there's the genome, look for the proper index (assuming there's one index per directory ^__^ )
    GFile=`ls ${INDEXPATH}/${Genome}/*.nix`
    #GFile=${GFile%%.bwt}
    echo "Using ${GFile}"
  fi  
done < <(tail -n $(( ${nlines[0]} - 1 )) SampleSheet.csv)

# generate Makefile

unset IFS

# set the list of BAM files
bamTargets=''
ibam=''
for x in `seq ${ntask}`
do
  chunk=$(printf "%03d" ${x})
  ibam="${ibam} INPUT=${expName}_${chunk}.bam"
  bamTargets="${bamTargets} ${expName}_${chunk}.bam"
done  

cat > Makefile << ENDOFMAKE
SHELL = /bin/bash
NOVO = /home/elia/dawe/novocraft/novoalign 
SAMTOOLS = ${SAMTOOLS}
REFINDEX = ${GFile}
PICARD_BASE = /home/elia/dawe/software/picard-tools-1.56

all: ${expName}.bam.bai
	/bin/mv ${expName}.bam ${expName}.bam.lock
	/bin/rm -f ${bamTargets}
	/bin/mv ${expName}.bam.lock ${expName}.bam
	@echo "Time to give some meaning to these files"

${expName}.bam: ${bamTargets}
	java -jar  \$(PICARD_BASE)/MergeSamFiles.jar OUTPUT=${expName}.bam ${ibam} ASSUME_SORTED=True SORT_ORDER=coordinate VALIDATION_STRINGENCY=SILENT

${expName}.bam.bai: ${expName}.bam
	\$(SAMTOOLS) index ${expName}.bam

ENDOFMAKE

for x in `seq ${ntask}`
do
  chunk=`printf "%03d" ${x}`
  if [ $isPE == 1 ]
  then
    cat >> Makefile << ENDOFMAKE1
${expName}_${chunk}.bam: ${expName}_R1_${chunk}.fastq.gz ${expName}_R2_${chunk}.fastq.gz
	\$(NOVO) -d \$(REFINDEX) -f ${expName}_R1_${chunk}.fastq.gz ${expName}_R2_${chunk}.fastq.gz -i 200,50 -o SAM | \$(SAMTOOLS) view -Su - | \$(SAMTOOLS) sort - ${expName}_${chunk}
	
ENDOFMAKE1

  else
    cat >> Makefile << ENDOFMAKE2
${expName}_${chunk}.bam: ${expName}_R1_${chunk}.fastq.gz
	\$(BWA) samse \$(BWAOPT_SE) \$(REFINDEX) <(\$(BWA) aln \$(BWAOPT_ALN) \$(REFINDEX) ${expName}_R1_${chunk}.fastq.gz) ${expName}_R1_${chunk}.fastq.gz | \$(SAMTOOLS) view -Su - | \$(SAMTOOLS) sort - ${expName}_${chunk}
	
ENDOFMAKE2

  fi
done

Launch it in each Sample directory created by CASAVA, you will find a Makefile. That done you may just issue make -j N (just like CASAVA) to perform parallel processing of your fastq.gz files. Everything will be done up to your final merged sorted BAM file.

**sparks** · 11-21-2011, 09:03 PM

Originally posted by kga1978 View Post

You know what - that actually worked....

gunzip *.gz -c | novoalign -f /dev/stdin -F ILM1.8 --ILQ_USE....

I'll settle with that option - not too many compromises here.

Thanks nickloman!

That's a neat way and it may improve performance as decompression is in its own thread.
One issue with Novoalign is that it reads a few records to check the file format then rewinds the files and starts again. If it detects a named pipe as input then it skips the file format check and you must use the -F option. I haven't tried with -f /dev/stdin but I expect it appears as a named pipe to Novoalign so the above should work.

Before we implemented use of gzipped input files in Novoalign we were doing paired end like this:

mkfifo read1_pipe read2_pipe
gunzip -c read1.fastq.gz > read1.pipe
gunzip -c read2.fastq.gz > read2.pipe
novoalign -f read1.pipe read2.pipe -F STDFQ ....

The same technique could be used to process multiple paired read files.

Colin

Topics	Statistics	Last Post
Whole-Genome Sequencing Traces Faroe Islands Ancestry to a North Atlantic Founder Population by SEQadmin2 Started by SEQadmin2, Yesterday, 06:09 AM	0 responses 16 views 0 reactions	Last Post by SEQadmin2 Yesterday, 06:09 AM
Sequencing the Two-Toed Sloth Genome Reveals Jumping Genes Tied to Its Extreme Metabolism by SEQadmin2 Started by SEQadmin2, 06-09-2026, 11:58 AM	0 responses 36 views 0 reactions	Last Post by SEQadmin2 06-09-2026, 11:58 AM
A New Method Makes Hantavirus Genome Analysis Faster and More Accessible by SEQadmin2 Started by SEQadmin2, 06-05-2026, 10:09 AM	0 responses 42 views 0 reactions	Last Post by SEQadmin2 06-05-2026, 10:09 AM
A New Single-Cell Method Maps DNA-Protein Interactions by SEQadmin2 Started by SEQadmin2, 06-04-2026, 08:59 AM	0 responses 49 views 0 reactions	Last Post by SEQadmin2 06-04-2026, 08:59 AM

Unconfigured Ad

Several input files w. Novoalign?

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News