Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Several input files w. Novoalign?

    Hi all,

    I am using Novoalign to align 50bp illumina SE reads to a viral genome. The reads were processed with Casava 1.8, so I have a bunch of individual read files - e.g.

    read001.gz
    read002.gz
    .
    .
    read036.gz

    Now, is there a way I can tell Novoalign to align all of these reads in one go? (I know I could pipe it with gunzip, but I would rather not do that). I have tried to put them in a folder and point Novoalign like so:

    novoalign -f Reads/ -c 6 -F ILM1.8 --**ILQ_SKIP -d reference -k -o SAM 2> reads.novoalign_logS.txt | samtools view -S -b -q 1 - | samtools sort - reads_sortedS

    But that doesn't work - neither does replacing 'Reads/' with read0*.gz.

    Any thoughts?

  • #2
    Why not just align each of them independantly, then merge BAMS?

    Comment


    • #3
      I guess I could do that, but that workflow isn't very optimal either - what we do isn't (currently) scripted, so I would have to make a billion commands to make this work.

      I guess I will settle with the gunzip pipe and then hopefully NovoCraft will get this into a future release - other programs like Mosaik can easily handle multiple small files - and with the new Illumina output, it would be good for other programs to do that as well.

      Any other workarounds?

      Comment


      • #4
        Yeah, why not just gunzip -c *.gz > reads.fq and start from there?

        Comment


        • #5
          Hi nickloman,

          Ya, that may be the best option - I'm just worried that this will slow down the whole process, but maybe not that much. I guess I could test it with a real experiment .

          Comment


          • #6
            Yep "why think, why not do the experiment"

            Comment


            • #7
              Bummer, I spoke too soon - I thought Novoalign could read from stdin, but that doesn't appear to be the case. I tried the following commands, but none of them work:

              gunzip *.gz -c | novoalign -f - -c 6 -F ILM1.8 **ILQ_SKIP -d reference -k -o SAM
              gunzip *.gz -c | novoalign -f stdin -c 6 -F ILM1.8 **ILQ_SKIP -d reference -k -o SAM
              gunzip *.gz -c | novoalign -c 6 -F ILM1.8 **ILQ_SKIP -d reference -k -o SAM

              I could of course gunzip, cat and save the file and then use that as the input, but again, that workflow really is pretty far from optimal.

              Bummer. NovoCraft - we need the ability to input more than one sequence file (or two for PE) with the new Casava 1.8 pipeline.

              Comment


              • #8
                That was the point of my post, simply "gunzip -c *.gz > reads.fq" and then use reads.fq as the input.

                Comment


                • #9
                  Also if you really want pipes, you might find /dev/stdin or /dev/fd/0 may work.

                  Comment


                  • #10
                    You know what - that actually worked....

                    gunzip *.gz -c | novoalign -f /dev/stdin -F ILM1.8 --ILQ_USE....

                    I'll settle with that option - not too many compromises here.

                    Thanks nickloman!

                    Comment


                    • #11
                      Originally posted by kga1978 View Post
                      I guess I could do that, but that workflow isn't very optimal either - what we do isn't (currently) scripted, so I would have to make a billion commands to make this work.

                      I guess I will settle with the gunzip pipe and then hopefully NovoCraft will get this into a future release - other programs like Mosaik can easily handle multiple small files - and with the new Illumina output, it would be good for other programs to do that as well.

                      Any other workarounds?
                      I'm not quite sure what you mean about being optimal or a billlion commands. You will use the same amount of resources to do the alignment. It would be very simple to construct a bash loop to submit all your jobs too.

                      Comment


                      • #12
                        Hi RockChalkJayhawk

                        I guess, by 'optimal' I meant 'so I don't have to write too many commands'. I'm not familiar with bash loops, but I just googled it and it looks fairly easy - thanks for heads up. I have been avoiding any sort of scripting (we don't have that many sequences and our genomes are small), but I guess I need to take the plunge .

                        Comment


                        • #13
                          if the names of your files are read001.gz, read002.gz, read036.gz, etc. do
                          Code:
                          for x in read *gz
                          do
                          gunzip $x| novoalign -f /dev/stdin -F ILM1.8 --ILQ_USE....{enter the rest here]
                          done

                          Comment


                          • #14
                            Originally posted by kga1978 View Post
                            Hi nickloman,

                            Ya, that may be the best option - I'm just worried that this will slow down the whole process, but maybe not that much. I guess I could test it with a real experiment .
                            You may try this:



                            Just modify it for novoalign, like this:

                            Code:
                            #!/bin/bash
                            
                            # Look for binaries we need, exit if not found
                            
                            SAMTOOLS=/home/elia/bin/samtools
                            NOVOALIGN=/home/elia/dawe/novocraft/novoalign
                            INDEXPATH=/home/elia/dawe/index
                            
                            [[ -z ${BWA} ]] && exit 1
                            [[ -z ${SAMTOOLS} ]] && exit 1
                            
                            # Use the SampleSheet.csv to create the script and align
                            # SampleSheets has a header we want to skip
                            
                            nlines=(`wc -l SampleSheet.csv`)
                            while read line
                            do
                              # There may be reasons to skip this line... i.e. when there's no indexed genome 
                              Skip=1
                            
                              # Split the SampleSheet line using commas, store values we need (ID, Lane, Genome and Index)
                              IFS=","
                              lineArray=($line)
                              Lane=`printf "%03d" ${lineArray[1]}`
                              ID=${lineArray[2]}
                              Genome=${lineArray[3]}
                              if [ -z ${lineArray[4]} ]
                              then 
                                # If there's no index, set it to NoIndex, just like Illumina does
                                Index="NoIndex"
                              else 
                                Index=${lineArray[4]}
                              fi
                              
                              # Create the Experiment name, this is also the name Illumina gives to files
                              expName=${ID}_${Index}_L${Lane}
                              
                              # Check how many chunks we have, this will be the number of SGE array tasks
                            
                              ntask=`ls ${ID}_${Index}_L${Lane}_R1_*.fastq.gz | wc -l`
                            
                              # Also check if this is paired end sequencing. This changes the bwa line
                              isPE=0
                              if [ -f ${ID}_${Index}_L${Lane}_R2_001.fastq.gz ]
                              then
                                isPE=1
                              fi  
                              
                              # Ok, what if we don't have the genome?
                              [[ -d ${INDEXPATH}/${Genome} ]] && Skip=0
                              if [ $Skip -eq 1 ]
                              then
                                echo "Missing ${Genome}, you should correct the file by hand (align_${expName}) and qsub it"
                                GFile="FIXME"
                              else
                                # If there's the genome, look for the proper index (assuming there's one index per directory ^__^ )
                                GFile=`ls ${INDEXPATH}/${Genome}/*.nix`
                                #GFile=${GFile%%.bwt}
                                echo "Using ${GFile}"
                              fi  
                            done < <(tail -n $(( ${nlines[0]} - 1 )) SampleSheet.csv)
                            
                            # generate Makefile
                            
                            unset IFS
                            
                            # set the list of BAM files
                            bamTargets=''
                            ibam=''
                            for x in `seq ${ntask}`
                            do
                              chunk=$(printf "%03d" ${x})
                              ibam="${ibam} INPUT=${expName}_${chunk}.bam"
                              bamTargets="${bamTargets} ${expName}_${chunk}.bam"
                            done  
                            
                            cat > Makefile << ENDOFMAKE
                            SHELL = /bin/bash
                            NOVO = /home/elia/dawe/novocraft/novoalign 
                            SAMTOOLS = ${SAMTOOLS}
                            REFINDEX = ${GFile}
                            PICARD_BASE = /home/elia/dawe/software/picard-tools-1.56
                            
                            all: ${expName}.bam.bai
                            	/bin/mv ${expName}.bam ${expName}.bam.lock
                            	/bin/rm -f ${bamTargets}
                            	/bin/mv ${expName}.bam.lock ${expName}.bam
                            	@echo "Time to give some meaning to these files"
                            
                            ${expName}.bam: ${bamTargets}
                            	java -jar  \$(PICARD_BASE)/MergeSamFiles.jar OUTPUT=${expName}.bam ${ibam} ASSUME_SORTED=True SORT_ORDER=coordinate VALIDATION_STRINGENCY=SILENT
                            
                            ${expName}.bam.bai: ${expName}.bam
                            	\$(SAMTOOLS) index ${expName}.bam
                            
                            ENDOFMAKE
                            
                            for x in `seq ${ntask}`
                            do
                              chunk=`printf "%03d" ${x}`
                              if [ $isPE == 1 ]
                              then
                                cat >> Makefile << ENDOFMAKE1
                            ${expName}_${chunk}.bam: ${expName}_R1_${chunk}.fastq.gz ${expName}_R2_${chunk}.fastq.gz
                            	\$(NOVO) -d \$(REFINDEX) -f ${expName}_R1_${chunk}.fastq.gz ${expName}_R2_${chunk}.fastq.gz -i 200,50 -o SAM | \$(SAMTOOLS) view -Su - | \$(SAMTOOLS) sort - ${expName}_${chunk}
                            	
                            ENDOFMAKE1
                            
                              else
                                cat >> Makefile << ENDOFMAKE2
                            ${expName}_${chunk}.bam: ${expName}_R1_${chunk}.fastq.gz
                            	\$(BWA) samse \$(BWAOPT_SE) \$(REFINDEX) <(\$(BWA) aln \$(BWAOPT_ALN) \$(REFINDEX) ${expName}_R1_${chunk}.fastq.gz) ${expName}_R1_${chunk}.fastq.gz | \$(SAMTOOLS) view -Su - | \$(SAMTOOLS) sort - ${expName}_${chunk}
                            	
                            ENDOFMAKE2
                            
                              fi
                            done
                            Launch it in each Sample directory created by CASAVA, you will find a Makefile. That done you may just issue make -j N (just like CASAVA) to perform parallel processing of your fastq.gz files. Everything will be done up to your final merged sorted BAM file.

                            Comment


                            • #15
                              Originally posted by kga1978 View Post
                              You know what - that actually worked....

                              gunzip *.gz -c | novoalign -f /dev/stdin -F ILM1.8 --ILQ_USE....

                              I'll settle with that option - not too many compromises here.

                              Thanks nickloman!
                              That's a neat way and it may improve performance as decompression is in its own thread.
                              One issue with Novoalign is that it reads a few records to check the file format then rewinds the files and starts again. If it detects a named pipe as input then it skips the file format check and you must use the -F option. I haven't tried with -f /dev/stdin but I expect it appears as a named pipe to Novoalign so the above should work.

                              Before we implemented use of gzipped input files in Novoalign we were doing paired end like this:

                              mkfifo read1_pipe read2_pipe
                              gunzip -c read1.fastq.gz > read1.pipe
                              gunzip -c read2.fastq.gz > read2.pipe
                              novoalign -f read1.pipe read2.pipe -F STDFQ ....

                              The same technique could be used to process multiple paired read files.

                              Colin

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Techniques and Challenges in Conservation Genomics
                                by seqadmin



                                The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                                Avian Conservation
                                Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                                03-08-2024, 10:41 AM
                              • seqadmin
                                The Impact of AI in Genomic Medicine
                                by seqadmin



                                Artificial intelligence (AI) has evolved from a futuristic vision to a mainstream technology, highlighted by the introduction of tools like OpenAI's ChatGPT and Google's Gemini. In recent years, AI has become increasingly integrated into the field of genomics. This integration has enabled new scientific discoveries while simultaneously raising important ethical questions1. Interviews with two researchers at the center of this intersection provide insightful perspectives into...
                                02-26-2024, 02:07 PM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, 03-14-2024, 06:13 AM
                              0 responses
                              32 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-08-2024, 08:03 AM
                              0 responses
                              71 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-07-2024, 08:13 AM
                              0 responses
                              80 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-06-2024, 09:51 AM
                              0 responses
                              68 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X