
  • Why don't my SAM files list the chromosomes?

    I used the latest version of BWA and ran it four different ways on the same paired-end reads to see which gives the best-quality alignment.

    The first approach used mem. I ran it once on the paired-end reads with only the adaptor sequences trimmed off, then trimmed the poor-quality bases from those same reads and ran BWA again.

    The second approach used aln and sampe, again on both the adaptor-trimmed and the quality-trimmed reads.

    After this, I ran samtools on each SAM file produced: I converted it to BAM, sorted the BAM file, ran the index command on the sorted BAM file, and finally ran idxstats for stats.
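The samtools steps described here can be sketched as a small Python helper that only builds the command lines and runs nothing. This is a sketch under stated assumptions: it assumes samtools 1.x option syntax (`view -b -o`, `sort -o`), and the function name `samtools_steps` and the file prefix `mem_trimmed` are hypothetical.

```python
def samtools_steps(prefix):
    """Build the command lines for SAM -> BAM -> sorted BAM -> index -> idxstats.

    Assumes samtools 1.x syntax. Nothing is executed here.
    """
    sam = f"{prefix}.sam"
    bam = f"{prefix}.bam"
    sorted_bam = f"{prefix}.sorted.bam"
    return [
        ["samtools", "view", "-b", "-o", bam, sam],      # convert SAM to BAM
        ["samtools", "sort", "-o", sorted_bam, bam],     # coordinate-sort the BAM
        ["samtools", "index", sorted_bam],               # index the sorted BAM
        ["samtools", "idxstats", sorted_bam],            # per-reference read counts
    ]


# Hypothetical prefix; one call per SAM file produced by BWA.
for cmd in samtools_steps("mem_trimmed"):
    print(" ".join(cmd))
```

Each list could be passed to `subprocess.run` once samtools is installed. Note that the index and idxstats steps operate on the sorted BAM, not the unsorted one.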

    My questions:

    1. After aligning with bwa and then sorting and indexing with samtools, I checked each final BAM file by converting it back to a SAM file and viewing it in the terminal.

    I couldn't find the chromosome, which I think should be in the third column. Why?


    Example from SAM file:
    Code:
    M00532:8:000000000-A17VF:1:1101:16380:1451      83      Serratia        3298780 29      229M1S  =       3298620 -389    TGTCGTTCGCCAACTTCAGCGTGCTCTGGACCTCAATGGCCTTTNTGCTCGCCGCGCCGCCGTTCAACTATTCCGAGGGAGTGATCGGGCTGTTCGGCCTGGNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGCANCTGGCGGACAAAGGCAAGGCCGGNCTGACNACCACCGTCGGCCTGGTGTTNCTGCTGCTGTCCTGGATCCCTATCGCGTTCGCCAAN  D>ED>4'8?1*1*?*AEED>FEA?A1*A???:??A?8A8)8800#;DDDDDDDD?D8D;ECECA?E?C?CC;EDFEEEFFFEDDDDEE?:DDDDDA8)0)0.#####################################?44#EEEEEEFFFFFFFFFFHHFF@?4#HFD?5#HHHEHHHHHHHIHIHHFEA5#IIHHIHIIIHHIIIFFFFFBDDDDDDDD@@???<5#  XT:A:M  NM:i:49 SM:i:29 AM:i:29 XM:i:7  XO:i:0  XG:i:0  MD:Z:44C34G22T0G0G0G0C0G0C0C0G0C0C0G0G0G0G0C0G0C0T0G0G0C0C0G0C0T0T0C0G0C0G0C0G0C0C0G0G3T14T2A5G0T4C20G0T1A26A5

    2. What does the last line mean after running idxstats?

    Serratia 5113802 307778 2900
    * 0 0 155004


    And just for clarification, does the first line read reference sequence name, sequence length, # of mapped reads and # of unmapped reads?

  • #2
    This comes down to how you built the index for BWA. What FASTA file(s) did you use? If you didn't build the index from FASTA sequences that are full chromosome references then you won't get alignments in terms of chromosomes.
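To see which names the aligner will report, one can list the header lines of the FASTA file used for the index: the reference name in the SAM output is the text after `>` up to the first whitespace. A minimal sketch, where the function name `fasta_names` and the example FASTA content are hypothetical and not taken from db11.fasta:

```python
def fasta_names(lines):
    """Return the sequence names from FASTA header lines.

    The name is the text after '>' up to the first whitespace.
    """
    names = []
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            names.append(fields[0] if fields else "")
    return names


# Hypothetical FASTA content, for illustration only.
example = [
    ">Serratia hypothetical description text",
    "ACGTACGT",
    "GGCCTTAA",
]
print(fasta_names(example))
```

For a real file, the same function can be fed an open file handle, since iterating over it yields lines.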

    Also that last line of idxstats is probably just the number of unaligned reads. Typically unmapped reads have an '*' in the third column.
    /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
    Salk Institute for Biological Studies, La Jolla, CA, USA */



    • #3
      Did you read the SAM format description?

      Yes, the third column of a sam file has the chromosome name.
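As an illustration, the eleven mandatory SAM columns can be named and the example record split on tabs. This is a sketch, not a full SAM parser: the function name `parse_sam_line` is hypothetical, and SEQ and QUAL are replaced with `*` and only one optional tag is kept, purely to keep the example short.

```python
SAM_COLUMNS = [
    "QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
    "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL",
]


def parse_sam_line(line):
    """Split one tab-separated SAM alignment line into named fields."""
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(SAM_COLUMNS, fields[:11]))
    record["TAGS"] = fields[11:]  # optional TAG:TYPE:VALUE fields
    return record


# Leading fields copied from the record in the question.
# SEQ and QUAL are abbreviated to '*' here for brevity.
line = "\t".join([
    "M00532:8:000000000-A17VF:1:1101:16380:1451", "83", "Serratia",
    "3298780", "29", "229M1S", "=", "3298620", "-389", "*", "*", "NM:i:49",
])
rec = parse_sam_line(line)
flag = int(rec["FLAG"])

print("reference name (column 3):", rec["RNAME"])
print("read unmapped:", bool(flag & 0x4))
print("reverse strand:", bool(flag & 0x10))
```

In the posted record, the third column is `Serratia`, which is the reference sequence name.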

      You've done something very wrong, though.

      MD:Z:44C34G22T0G0G0G0C0G0C0C0G0C0C0G0G0G0G0C0G0C0T0G0G0C0C0G0C0T0T0C0G0C0G0C0G0C0C0G0G3T14T2A5G0T4C20G0T1A26A5
      Means that you used the wrong fastq file in the sampe step.
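For anyone who wants to inspect an MD string themselves, here is a minimal sketch that tallies matched bases, mismatches, and deleted reference bases. The function name `summarize_md` is hypothetical, and the parser covers only the basic MD grammar (numbers, single mismatched bases, and `^`-prefixed deletions).

```python
import re

# One token is a run of matches, a deletion (^BASES), or a single mismatch.
MD_TOKEN = re.compile(r"(\d+)|\^([A-Z]+)|([A-Z])")


def summarize_md(md):
    """Tally an MD value (without the 'MD:Z:' prefix)."""
    matches = mismatches = deleted = 0
    for num, deletion, base in MD_TOKEN.findall(md):
        if num:
            matches += int(num)
        elif deletion:
            deleted += len(deletion)
        else:
            mismatches += 1
    return {"matches": matches, "mismatches": mismatches, "deleted": deleted}


# MD value copied from the record above.
md = ("44C34G22T0G0G0G0C0G0C0C0G0C0C0G0G0G0G0C0G0C0T0G0G0C0C0G0C0T0T0C0G0C0G0C0G0"
      "C0C0G0G3T14T2A5G0T4C20G0T1A26A5")
print(summarize_md(md))
```

The tally can be compared with the `NM:i:49` tag and the `229M` in the CIGAR string of the same record.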



      • #4
        Also I recommend mem over the aln/sampe pipeline. It's simpler and it works better.
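To make the difference concrete, here is a sketch that builds both sets of command lines without running them. The reference name db11.fasta comes from this thread; the FASTQ file names and the function names are hypothetical. Every command writes to standard output, so in practice each one needs its output redirected to a file.

```python
def bwa_mem_cmd(ref, fq1, fq2):
    """Single-step paired-end alignment; SAM goes to stdout."""
    return ["bwa", "mem", ref, fq1, fq2]


def bwa_aln_sampe_cmds(ref, fq1, fq2):
    """Three-step paired-end alignment with aln and sampe.

    The .sai names are where each aln output is assumed to be redirected.
    """
    sai1, sai2 = fq1 + ".sai", fq2 + ".sai"
    return [
        ["bwa", "aln", ref, fq1],                     # redirect stdout to sai1
        ["bwa", "aln", ref, fq2],                     # redirect stdout to sai2
        # sampe takes the two .sai files followed by the SAME two FASTQ
        # files, in the same order, that were given to aln.
        ["bwa", "sampe", ref, sai1, sai2, fq1, fq2],
    ]


# Hypothetical FASTQ names.
print(" ".join(bwa_mem_cmd("db11.fasta", "reads_1.fq", "reads_2.fq")))
for cmd in bwa_aln_sampe_cmds("db11.fasta", "reads_1.fq", "reads_2.fq"):
    print(" ".join(cmd))
```

With mem there is only one place to name the FASTQ files, whereas aln/sampe names them twice, which leaves room for pairing a .sai file with the wrong FASTQ.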


        • #5
          Originally posted by sdriscoll
          This comes down to how you built the index for BWA. What FASTA file(s) did you use? If you didn't build the index from FASTA sequences that are full chromosome references then you won't get alignments in terms of chromosomes.

          Also that last line of idxstats is probably just the number of unaligned reads. Typically unmapped reads have an '*' in the third column.
          I used db11.fasta

          I did build the index.

          And the last part of what you said makes no sense to me, because the first row lists the name, sequence length, # of mapped reads, and # of unmapped reads. How does the second row (* 0 0 32694) describe the # of unmapped reads when the first row already lists the # of unmapped reads?



          • #6
            Originally posted by swbarnes2
            Did you read SAM format description?

            Yes, the third column of a sam file has the chromosome name.

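            You've done something very wrong, though.

            MD:Z:44C34G22T0G0G0G0C0G0C0C0G0C0C0G0G0G0G0C0G0C0T0G0G0C0C0G0C0T0T0C0G0C0G0C0G0C0C0G0G3T14T2A5G0T4C20G0T1A26A5
            Means that you used the wrong fastq file in the sampe step.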
            Could this have anything to do with the fact that Serratia marcescens is a bacterium with only 1 chromosome?



            • #7
              A read can be unmapped and still be associated with a chromosome if it hangs off the edge. You have 2900 such reads. The rest of the unmapped reads didn't map at all; that's the 155004.
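The two idxstats rows can be read programmatically to make the split explicit. A minimal sketch, assuming the four documented idxstats columns (reference name, sequence length, mapped reads, unmapped reads); the function name `parse_idxstats` is hypothetical, and the numbers are the ones posted in the question.

```python
def parse_idxstats(text):
    """Parse samtools idxstats output into a list of dicts."""
    rows = []
    for line in text.strip().splitlines():
        name, length, mapped, unmapped = line.split()
        rows.append({
            "name": name,
            "length": int(length),
            "mapped": int(mapped),
            "unmapped": int(unmapped),
        })
    return rows


# Output copied from the question.
IDXSTATS = "Serratia\t5113802\t307778\t2900\n*\t0\t0\t155004\n"

rows = parse_idxstats(IDXSTATS)
placed_unmapped = sum(r["unmapped"] for r in rows if r["name"] != "*")
unplaced_unmapped = sum(r["unmapped"] for r in rows if r["name"] == "*")

print("unmapped but assigned to a reference:", placed_unmapped)
print("unmapped with no reference (*):", unplaced_unmapped)
print("total unmapped:", placed_unmapped + unplaced_unmapped)
```

The `*` row has length 0 and 0 mapped reads because it is not a real reference sequence; it only collects the reads that have no reference assigned.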

              I use bwa and samtools on single-chromosome bacterial references all the time. You messed up your sampe command; that's why you have that nonsense MD part. That's the only mistake you appear to have made. Everything else looks normal, so I'm not sure what you think the problem is.
