Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • ReorderSam (Picard) failing due to tiny read

    Hi,

    I am new to the forum and just starting out with my first forays into next-gen pipelines. We are currently using a proprietary platform for alignment/variant calling, so I've been tasked with finding a suitable open-source replacement, if possible.

    I have started off with bwa and am looking to use GATK for the variant calling part. My workflow so far looks like this:

    - Illumina MiSEQ paired reads (~600k reads) - cutadapt to remove adapter sequences
    - Reference - sequences downloaded from NCBI, indexed using bwa
    - Alignment of reads to reference - using bwa mem
    - Conversion of SAM to BAM using samtools

    Now here's where I'm coming a bit unstuck. Before I can do anything in GATK it seems I need to sort the bam file to match the contig order in the reference file. GATK recommends ReorderSam for this. Here's my command line (having first created the sequence dictionary successfully):

    java -jar ReorderSam.jar INPUT=alignment.bam OUTPUT=alignment_reorder.bam REFERENCE=/path/to/humanGenome/reference.fasta
    I get the following error:

    Exception in thread "main" net.sf.samtools.SAMFormatException: SAM validation error: ERROR: Record 27388, Read name [read name], Zero-length read without FZ, CS or CQ tag
    So... I dutifully go back to my FASTQ to look for [read name], and it's not zero length! It *is* very short though (8 base pairs...). Before the program stops I get a partial output file - if I do:

    samtools view -H alignment_reorder.bam | head -1
    I get:

    [bam_header_read] EOF marker is absent. The input is probably truncated.
    @HD VN:1.4 SO:coordinate
    ...so it's sort of right?! I think this mega-short read is tripping it up but I'm not sure where to go from here. I can probably discard reads less than a certain length but preferably would just ignore them if possible. If anyone has seen similar and can shed light I'd be very grateful indeed. Apologies for the insanely long post and I hope it makes sense.

    TIA.

  • #2
    With cutadapt you can discard reads less than X basepairs (you can define X), so that's the easiest thing to do.

    Be careful though when doing this with cutadapt with paired end reads as it can discard one but not both as cutadapt will consider pairs separately. There are other threads that address this issue, but one option is to use Trim Galore, which is a wrapper around cutadapt that will handle paired end reads appropriately.

    Comment


    • #3
      I'm very embarassed - thanks for taking the time to reply, but I was looking at the wrong FASTQ file when searching for the offending read. So - the read in question *is* in fact empty.

      My next question is therefore - what is the best way to discard reads that have no sequence or quality in them? It looks like this in the FASTQ:

      @READ_NAME

      +
      To clarify (just realised I left out a step in the workflow above) - first I use cutadapt to remove adapters and at the same time discard any reads less than 1bp long to get rid of empty reads. Then I need to do an additional step to get rid of 5bp from either end of the reads (technology-specific requirement) so I've got a Perl script for that - this is where the empty reads are coming from - so I'll just modify my script to get rid of them at this stage too. Unless there's a better way to do it? Not sure what ulitities are out there for manipulating FASTQs. Thanks again for replying!

      Comment


      • #4
        I don't know if you're willing to re-analyze everything, but if you are you can specify cut adapt to get rid of reads that are 6bp or shorter. But again, if you have paired end reads you need to be careful that if you discard one end to also discard the other end. If you're not willing to start over with Trim Galore (or something else) then you'll need to write a bit of code to do this.

        Why do you have to get rid of 5bp from each end after running cutadapt? That doesn't sound logical so I'm curious.

        Comment


        • #5
          The 5bp thing is to remove residual bases from the enzyme restriction footprint. Going to start again today with the whole analysis and keep better track of which files are being generated - I ended up with about 6 in one directory yesterday with very similar names hence my confusion over where the empty sequence came from, that was tripping up downstream analysis.

          As an aside - I had previously tried to process the reads using BioPerl for removing the 5bp from either end but had to look for an alternative as the Bio::SeqIO module can't seem to cope with (what I think is) well-formed FASTQ (the example in my previous post where there is a sequence name/id but no actual sequence or qualities) - maybe that warrents a message to the BioPerl mailing list though...

          Comment


          • #6
            bwa-mem paired end

            Just thought of another question - not sure if I should have made this a new topic actually - I'm using bwa mem for alignment. Reading the man page, it seems that if both forward and reverse reads are present, "this command assumes the i-th read in reads.fq and the i-th read in mates.fq constitute a read pair"

            So... thanks to various pre-processing steps, the forward and reverse now have different numbers of reads, so not every read in the forward has its corresponding pair, and vice versa. Should I then process these files further to ensure the condition above about the i-th read in both files being paired accordingly?

            Comment


            • #7
              If you want to align the reads as paired-end reads, yes, you should.

              Comment


              • #8
                Thank you for the reply

                Comment


                • #9
                  As mastal said, you should; that's what I was alluding to above. Could you specify your restriction footprint as part of your adapter sequence? Would the footprint always be the same? If so that can be trimmed off too.

                  Also, a quick linux way to get rid of 5 characters from the end of each read:

                  Code:
                  sed 'n;s,^.....,,' input.fq | sed 'n;s,.....$,,'

                  Comment


                  • #10
                    Not sure how I missed your point about paired ends - sorry for asking the same question twice! Re: restriction footprint - I will have to check with my colleagues who run the samples (am not very au fait with the technology but getting there!). Thanks for the one-liner, very useful. I have now got to the stage where I have an indexed, sorted and reordered BAM with read groups, plus the reference FASTA. From looking at the GATK best practice document I guess my next step is local realignment...

                    Comment

                    Latest Articles

                    Collapse

                    • seqadmin
                      Strategies for Sequencing Challenging Samples
                      by seqadmin


                      Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                      03-22-2024, 06:39 AM
                    • seqadmin
                      Techniques and Challenges in Conservation Genomics
                      by seqadmin



                      The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                      Avian Conservation
                      Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                      03-08-2024, 10:41 AM

                    ad_right_rmr

                    Collapse

                    News

                    Collapse

                    Topics Statistics Last Post
                    Started by seqadmin, Yesterday, 06:37 PM
                    0 responses
                    11 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, Yesterday, 06:07 PM
                    0 responses
                    10 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 03-22-2024, 10:03 AM
                    0 responses
                    51 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 03-21-2024, 07:32 AM
                    0 responses
                    67 views
                    0 likes
                    Last Post seqadmin  
                    Working...
                    X