Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • making SAM: too few good pairs

    Hi Everyone,

    I am new to the community - learning to analyze whole exome data using bwa and samtools. First I will tell you what I have been doing, then point out the red flags, and then my question is at the bottom.

    This is what I have done so far:

    1. I analyzed the fastq files using fastQC and they look good. The reads are from Otogenetics.
    2. I downloaded the most recent b37 folder from the GATK resource bundle.
    3. I indexed the genome using bwa-0.6.2, and aligned 2 illumina fastq files and then merged them into a SAM file.

    Here are the commands I used for steps 2 and 3:
    ./bwa index -a bwtsw human_g1k_v37.fasta
    ./bwa aln human_g1k_v37.fasta Ot6000_R1_001.fastq > Ot6000_R1.sai
    ./bwa aln human_g1k_v37.fasta Ot6000_R2_001.fastq > Ot6000_R2.sai
    ./bwa sampe -r "@RG\tID:6000\tSM:6000\tLB:ga\tPL:Illumina" human_g1k_v37.fasta Ot6000_R1.sai Ot6000_R2.sai Ot6000_R1_001.fastq Ot6000_R2_001.fastq > Ot6000.sam

    4. I also made bam files, sorted the bam files and indexed the bam files.
    5. I ran flagstat on the sorted bam files. Here is the output:

    There are these 3 red flags in the output:
    1. output during merging of 2 sai files into sam files:

    [bwa_sai2sam_pe_core] convert to sequence coordinate...
    [infer_isize] fail to infer insert size: too few good pairs
    [bwa_sai2sam_pe_core] time elapses: 1.00 sec
    [bwa_sai2sam_pe_core] changing coordinates of 0 alignments.
    [bwa_sai2sam_pe_core] align unmapped mate...
    [bwa_sai2sam_pe_core] time elapses: 0.39 sec
    [bwa_sai2sam_pe_core] refine gapped alignments... 0.01 sec
    [bwa_sai2sam_pe_core] print alignments... 0.18 sec
    [bwa_sai2sam_pe_core] 28604456 sequences have been processed.

    2. bai files are empty

    3. output from flagstat:
    57208912 + 0 in total (QC-passed reads + QC-failed reads)
    0 + 0 duplicates
    0 + 0 mapped (0.00%:nan%)
    57208912 + 0 paired in sequencing
    28604456 + 0 read1
    28604456 + 0 read2
    0 + 0 properly paired (0.00%:nan%)
    0 + 0 with itself and mate mapped
    0 + 0 singletons (0.00%:nan%)
    0 + 0 with mate mapped to a different chr
    0 + 0 with mate mapped to a different chr (mapQ>=5)

    Something is going on with the pairing of the reads, based on the output while making the sam file and from the flagstat output.

    I have re-downloaded the GATK resource bundle, re-indexed the genome, tried other fastq files, etc, and I'm getting the same result. I can't find the problem. Any ideas?

    Thanks!
    Allison

  • #2
    Taking the data as face value; nothing mapped. Have you spot-checked a few reads and confirmed that they BLAT to the human genome? Did you at least check to see if on all the indexing files that should be there are there?

    Have you looked at the .sam file? What do the entries look like? They all have the 4 flag? The quality strings look sane? The sequences look sane?

    I don't think pairing is the problem, but if you think it is, have you tried making .sam files based on read 1 alone? What do those look like?

    Comment


    • #3
      hi, thanks for your help.

      I confirmed that the fastq reads BLAT to the human genome. However, looking at the SAM file, the flags are 77 and 141, throughout. I looked up the flags and they mean this:

      "All Bad"
      At the other end of the spectrum we have "all bad" i.e. neither the read nor its mate mapped:
      77 - 0001001101 - First in pair, both reads in pair unmapped. "All bad"
      141 - 0010001101 - Second in pair and "all bad".

      Comment


      • #4
        You could try reindexing the reference. Or maybe a chromosome, to speed up troubleshooting.

        Have you eyeballed the fastq? What about the rest of the .sam entry, besides the flag? Does that look sane? Did you eyeball the reference fasta?

        Something is off, and likely, it's because something is wrong with one of your files

        Comment


        • #5
          Can you post examples of the fastq headers for both files? Maybe the pairs are not correctly recognised as pairs because the header is not, for example, similar, only with /1 and /2 at the end.

          Comment


          • #6
            Hi again,
            I gave the fastq files to a person who helps me and they worked fine for him. Something is going wrong (for me) with either indexing the genome, or aligning the fastq files to the reference genome. The fastq reads aren't aligning to the genome. I have downloaded fresh copies of both the human genome and the bwa software, and I'm still getting the same results. My sai files are 114 MB and they should be 800 MB (that's what the other person got with the same files).

            Comment


            • #7
              hi everyone,
              I guess the newer versions of bwa don't work on a Mac. I downloaded an older one - 0.5.10 - and already things are looking different. Hopefully that solved the problem.
              Thanks!

              Comment

              Latest Articles

              Collapse

              • seqadmin
                Advancing Precision Medicine for Rare Diseases in Children
                by seqadmin




                Many organizations study rare diseases, but few have a mission as impactful as Rady Children’s Institute for Genomic Medicine (RCIGM). “We are all about changing outcomes for children,” explained Dr. Stephen Kingsmore, President and CEO of the group. The institute’s initial goal was to provide rapid diagnoses for critically ill children and shorten their diagnostic odyssey, a term used to describe the long and arduous process it takes patients to obtain an accurate...
                12-16-2024, 07:57 AM
              • seqadmin
                Recent Advances in Sequencing Technologies
                by seqadmin



                Innovations in next-generation sequencing technologies and techniques are driving more precise and comprehensive exploration of complex biological systems. Current advancements include improved accessibility for long-read sequencing and significant progress in single-cell and 3D genomics. This article explores some of the most impactful developments in the field over the past year.

                Long-Read Sequencing
                Long-read sequencing has seen remarkable advancements,...
                12-02-2024, 01:49 PM

              ad_right_rmr

              Collapse

              News

              Collapse

              Topics Statistics Last Post
              Started by seqadmin, 12-17-2024, 10:28 AM
              0 responses
              26 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, 12-13-2024, 08:24 AM
              0 responses
              42 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, 12-12-2024, 07:41 AM
              0 responses
              28 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, 12-11-2024, 07:45 AM
              0 responses
              42 views
              0 likes
              Last Post seqadmin  
              Working...
              X