Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • kga1978
    Senior Member
    • Nov 2010
    • 100

    Several input files w. Novoalign?

    Hi all,

    I am using Novoalign to align 50bp illumina SE reads to a viral genome. The reads were processed with Casava 1.8, so I have a bunch of individual read files - e.g.

    read001.gz
    read002.gz
    .
    .
    read036.gz

    Now, is there a way I can tell Novoalign to align all of these reads in one go? (I know I could pipe it with gunzip, but I would rather not do that). I have tried to put them in a folder and point Novoalign like so:

    novoalign -f Reads/ -c 6 -F ILM1.8 --**ILQ_SKIP -d reference -k -o SAM 2> reads.novoalign_logS.txt | samtools view -S -b -q 1 - | samtools sort - reads_sortedS

    But that doesn't work - neither does replacing 'Reads/' with read0*.gz.

    Any thoughts?
  • RockChalkJayhawk
    Senior Member
    • Mar 2009
    • 192

    #2
    Why not just align each of them independantly, then merge BAMS?

    Comment

    • kga1978
      Senior Member
      • Nov 2010
      • 100

      #3
      I guess I could do that, but that workflow isn't very optimal either - what we do isn't (currently) scripted, so I would have to make a billion commands to make this work.

      I guess I will settle with the gunzip pipe and then hopefully NovoCraft will get this into a future release - other programs like Mosaik can easily handle multiple small files - and with the new Illumina output, it would be good for other programs to do that as well.

      Any other workarounds?

      Comment

      • nickloman
        Senior Member
        • Jul 2009
        • 355

        #4
        Yeah, why not just gunzip -c *.gz > reads.fq and start from there?

        Comment

        • kga1978
          Senior Member
          • Nov 2010
          • 100

          #5
          Hi nickloman,

          Ya, that may be the best option - I'm just worried that this will slow down the whole process, but maybe not that much. I guess I could test it with a real experiment .

          Comment

          • nickloman
            Senior Member
            • Jul 2009
            • 355

            #6
            Yep "why think, why not do the experiment"

            Comment

            • kga1978
              Senior Member
              • Nov 2010
              • 100

              #7
              Bummer, I spoke too soon - I thought Novoalign could read from stdin, but that doesn't appear to be the case. I tried the following commands, but none of them work:

              gunzip *.gz -c | novoalign -f - -c 6 -F ILM1.8 **ILQ_SKIP -d reference -k -o SAM
              gunzip *.gz -c | novoalign -f stdin -c 6 -F ILM1.8 **ILQ_SKIP -d reference -k -o SAM
              gunzip *.gz -c | novoalign -c 6 -F ILM1.8 **ILQ_SKIP -d reference -k -o SAM

              I could of course gunzip, cat and save the file and then use that as the input, but again, that workflow really is pretty far from optimal.

              Bummer. NovoCraft - we need the ability to input more than one sequence file (or two for PE) with the new Casava 1.8 pipeline.

              Comment

              • nickloman
                Senior Member
                • Jul 2009
                • 355

                #8
                That was the point of my post, simply "gunzip -c *.gz > reads.fq" and then use reads.fq as the input.

                Comment

                • nickloman
                  Senior Member
                  • Jul 2009
                  • 355

                  #9
                  Also if you really want pipes, you might find /dev/stdin or /dev/fd/0 may work.

                  Comment

                  • kga1978
                    Senior Member
                    • Nov 2010
                    • 100

                    #10
                    You know what - that actually worked....

                    gunzip *.gz -c | novoalign -f /dev/stdin -F ILM1.8 --ILQ_USE....

                    I'll settle with that option - not too many compromises here.

                    Thanks nickloman!

                    Comment

                    • RockChalkJayhawk
                      Senior Member
                      • Mar 2009
                      • 192

                      #11
                      Originally posted by kga1978 View Post
                      I guess I could do that, but that workflow isn't very optimal either - what we do isn't (currently) scripted, so I would have to make a billion commands to make this work.

                      I guess I will settle with the gunzip pipe and then hopefully NovoCraft will get this into a future release - other programs like Mosaik can easily handle multiple small files - and with the new Illumina output, it would be good for other programs to do that as well.

                      Any other workarounds?
                      I'm not quite sure what you mean about being optimal or a billlion commands. You will use the same amount of resources to do the alignment. It would be very simple to construct a bash loop to submit all your jobs too.

                      Comment

                      • kga1978
                        Senior Member
                        • Nov 2010
                        • 100

                        #12
                        Hi RockChalkJayhawk

                        I guess, by 'optimal' I meant 'so I don't have to write too many commands'. I'm not familiar with bash loops, but I just googled it and it looks fairly easy - thanks for heads up. I have been avoiding any sort of scripting (we don't have that many sequences and our genomes are small), but I guess I need to take the plunge .

                        Comment

                        • RockChalkJayhawk
                          Senior Member
                          • Mar 2009
                          • 192

                          #13
                          if the names of your files are read001.gz, read002.gz, read036.gz, etc. do
                          Code:
                          for x in read *gz
                          do
                          gunzip $x| novoalign -f /dev/stdin -F ILM1.8 --ILQ_USE....{enter the rest here]
                          done

                          Comment

                          • dawe
                            Senior Member
                            • Apr 2009
                            • 258

                            #14
                            Originally posted by kga1978 View Post
                            Hi nickloman,

                            Ya, that may be the best option - I'm just worried that this will slow down the whole process, but maybe not that much. I guess I could test it with a real experiment .
                            You may try this:



                            Just modify it for novoalign, like this:

                            Code:
                            #!/bin/bash
                            
                            # Look for binaries we need, exit if not found
                            
                            SAMTOOLS=/home/elia/bin/samtools
                            NOVOALIGN=/home/elia/dawe/novocraft/novoalign
                            INDEXPATH=/home/elia/dawe/index
                            
                            [[ -z ${BWA} ]] && exit 1
                            [[ -z ${SAMTOOLS} ]] && exit 1
                            
                            # Use the SampleSheet.csv to create the script and align
                            # SampleSheets has a header we want to skip
                            
                            nlines=(`wc -l SampleSheet.csv`)
                            while read line
                            do
                              # There may be reasons to skip this line... i.e. when there's no indexed genome 
                              Skip=1
                            
                              # Split the SampleSheet line using commas, store values we need (ID, Lane, Genome and Index)
                              IFS=","
                              lineArray=($line)
                              Lane=`printf "%03d" ${lineArray[1]}`
                              ID=${lineArray[2]}
                              Genome=${lineArray[3]}
                              if [ -z ${lineArray[4]} ]
                              then 
                                # If there's no index, set it to NoIndex, just like Illumina does
                                Index="NoIndex"
                              else 
                                Index=${lineArray[4]}
                              fi
                              
                              # Create the Experiment name, this is also the name Illumina gives to files
                              expName=${ID}_${Index}_L${Lane}
                              
                              # Check how many chunks we have, this will be the number of SGE array tasks
                            
                              ntask=`ls ${ID}_${Index}_L${Lane}_R1_*.fastq.gz | wc -l`
                            
                              # Also check if this is paired end sequencing. This changes the bwa line
                              isPE=0
                              if [ -f ${ID}_${Index}_L${Lane}_R2_001.fastq.gz ]
                              then
                                isPE=1
                              fi  
                              
                              # Ok, what if we don't have the genome?
                              [[ -d ${INDEXPATH}/${Genome} ]] && Skip=0
                              if [ $Skip -eq 1 ]
                              then
                                echo "Missing ${Genome}, you should correct the file by hand (align_${expName}) and qsub it"
                                GFile="FIXME"
                              else
                                # If there's the genome, look for the proper index (assuming there's one index per directory ^__^ )
                                GFile=`ls ${INDEXPATH}/${Genome}/*.nix`
                                #GFile=${GFile%%.bwt}
                                echo "Using ${GFile}"
                              fi  
                            done < <(tail -n $(( ${nlines[0]} - 1 )) SampleSheet.csv)
                            
                            # generate Makefile
                            
                            unset IFS
                            
                            # set the list of BAM files
                            bamTargets=''
                            ibam=''
                            for x in `seq ${ntask}`
                            do
                              chunk=$(printf "%03d" ${x})
                              ibam="${ibam} INPUT=${expName}_${chunk}.bam"
                              bamTargets="${bamTargets} ${expName}_${chunk}.bam"
                            done  
                            
                            cat > Makefile << ENDOFMAKE
                            SHELL = /bin/bash
                            NOVO = /home/elia/dawe/novocraft/novoalign 
                            SAMTOOLS = ${SAMTOOLS}
                            REFINDEX = ${GFile}
                            PICARD_BASE = /home/elia/dawe/software/picard-tools-1.56
                            
                            all: ${expName}.bam.bai
                            	/bin/mv ${expName}.bam ${expName}.bam.lock
                            	/bin/rm -f ${bamTargets}
                            	/bin/mv ${expName}.bam.lock ${expName}.bam
                            	@echo "Time to give some meaning to these files"
                            
                            ${expName}.bam: ${bamTargets}
                            	java -jar  \$(PICARD_BASE)/MergeSamFiles.jar OUTPUT=${expName}.bam ${ibam} ASSUME_SORTED=True SORT_ORDER=coordinate VALIDATION_STRINGENCY=SILENT
                            
                            ${expName}.bam.bai: ${expName}.bam
                            	\$(SAMTOOLS) index ${expName}.bam
                            
                            ENDOFMAKE
                            
                            for x in `seq ${ntask}`
                            do
                              chunk=`printf "%03d" ${x}`
                              if [ $isPE == 1 ]
                              then
                                cat >> Makefile << ENDOFMAKE1
                            ${expName}_${chunk}.bam: ${expName}_R1_${chunk}.fastq.gz ${expName}_R2_${chunk}.fastq.gz
                            	\$(NOVO) -d \$(REFINDEX) -f ${expName}_R1_${chunk}.fastq.gz ${expName}_R2_${chunk}.fastq.gz -i 200,50 -o SAM | \$(SAMTOOLS) view -Su - | \$(SAMTOOLS) sort - ${expName}_${chunk}
                            	
                            ENDOFMAKE1
                            
                              else
                                cat >> Makefile << ENDOFMAKE2
                            ${expName}_${chunk}.bam: ${expName}_R1_${chunk}.fastq.gz
                            	\$(BWA) samse \$(BWAOPT_SE) \$(REFINDEX) <(\$(BWA) aln \$(BWAOPT_ALN) \$(REFINDEX) ${expName}_R1_${chunk}.fastq.gz) ${expName}_R1_${chunk}.fastq.gz | \$(SAMTOOLS) view -Su - | \$(SAMTOOLS) sort - ${expName}_${chunk}
                            	
                            ENDOFMAKE2
                            
                              fi
                            done
                            Launch it in each Sample directory created by CASAVA, you will find a Makefile. That done you may just issue make -j N (just like CASAVA) to perform parallel processing of your fastq.gz files. Everything will be done up to your final merged sorted BAM file.

                            Comment

                            • sparks
                              Senior Member
                              • Mar 2008
                              • 126

                              #15
                              Originally posted by kga1978 View Post
                              You know what - that actually worked....

                              gunzip *.gz -c | novoalign -f /dev/stdin -F ILM1.8 --ILQ_USE....

                              I'll settle with that option - not too many compromises here.

                              Thanks nickloman!
                              That's a neat way and it may improve performance as decompression is in its own thread.
                              One issue with Novoalign is that it reads a few records to check the file format then rewinds the files and starts again. If it detects a named pipe as input then it skips the file format check and you must use the -F option. I haven't tried with -f /dev/stdin but I expect it appears as a named pipe to Novoalign so the above should work.

                              Before we implemented use of gzipped input files in Novoalign we were doing paired end like this:

                              mkfifo read1_pipe read2_pipe
                              gunzip -c read1.fastq.gz > read1.pipe
                              gunzip -c read2.fastq.gz > read2.pipe
                              novoalign -f read1.pipe read2.pipe -F STDFQ ....

                              The same technique could be used to process multiple paired read files.

                              Colin

                              Comment

                              Latest Articles

                              Collapse

                              • SEQadmin2
                                Nine Things a Sample Prep Scientist Thinks About Before Sequencing
                                by SEQadmin2


                                I’m not a sequencing expert. I’m a purification scientist who uses NGS to evaluate workflows my group develops. With this perspective, we think about the sample first and the NGS workflow second. The sequencer is an exceptionally honest reporter, but it can only report on what you give it, so whether you get clean, interpretable data from an NGS workflow is largely determined before you begin.


                                Here are nine questions we think about, in roughly the order they matter, before...
                                Today, 07:11 AM
                              • SEQadmin2
                                From Collection to Sequencing: Why Sample Preparation and Preservation Define Sequencing Data
                                by SEQadmin2


                                Data variability is still an issue in sequencing technologies despite the advances in reproducibility and accuracy of these platforms. But the problem does not originate in the sequencing itself, but in the previous steps, before the sample reaches the sequencer.


                                The first step is collection, followed by preservation and sample preparation for analysis. Most scientists overlook those steps, but not being careful might just be skewing the experiment’s results.
                                ...
                                06-02-2026, 10:05 AM
                              • SEQadmin2
                                Single-Cell Sequencing at an Inflection Point: Early Impacts of New Platforms and Emerging Trends
                                by SEQadmin2


                                With the launch of new single-cell sequencing platforms in 2026, the field stands at an exciting inflection point. This article surveys the most impactful advances in the field and discusses how they’re reshaping research in cancer, immunology, and beyond.


                                Introduction

                                Single-cell sequencing technologies have undergone remarkable advances over the past decade, transitioning from low-throughput experimental approaches to highly scalable platforms capable of...
                                05-22-2026, 06:42 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by SEQadmin2, Yesterday, 06:09 AM
                              0 responses
                              16 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, 06-09-2026, 11:58 AM
                              0 responses
                              36 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, 06-05-2026, 10:09 AM
                              0 responses
                              42 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, 06-04-2026, 08:59 AM
                              0 responses
                              49 views
                              0 reactions
                              Last Post SEQadmin2  
                              Working...