  • Sorting SAM files

    Hello all. I've been fiddling around with htseq-count using SAM files that were output by hisat2. I am using paired-end reads here and realized, as I was going through the htseq-count manual, that I should sort my data beforehand using
    Code:
    samtools sort
    I used this command to execute htseq-count:
    Code:
    htseq-count -m intersection-nonempty -i Name -s reverse -r name Sample_1_hisat2_results_sorted.sam /Volumes/cachannel/RNA_SEQ/Notch_RNASeq/9.1_Reference_Files/XENLA_UTAmayball_cdna_longest_CHRS2.gff3 >Sample_1_Hisat2_Counts.txt 2>Sample_1_Hisat2_Counts_OUTPUT_WARNINGS.txt
    So now I have two htseq-count output files to compare: one where the data is sorted and one where the data is unsorted.

    In the unsorted warning file I notice at the bottom: "4674733 SAM alignment pairs processed"
    In the sorted warning file I notice at the bottom: "8572367 SAM alignment pairs processed"

    That is about 2X the number of pairs processed for the sorted data! The count txt files for both runs look good, but I'm guessing the sorted one contains many more counts. When comparing counts for specific genes between the two, the sorted run appears to have around double the count. My question is: why is this, and what are the consequences of sorting? Does it lead to more precise counts, or is it not even worth it?
    Last edited by ronaldrcutler; 06-16-2016, 09:36 AM.

  • #2
    Also, another related question: are the SAM alignments output by hisat2 sorted?



    • #3
      htseq-count wants you to "samtools sort -n", not "samtools sort". The difference is the cause of the differing results. You do not need to sort the output of hisat2 before giving it to htseq-count.

      Note that since you coordinate-sorted the file and then told htseq-count that it was name-sorted, the results for that run are...inaccurate. The file with the smaller number of processed alignments is the correct one.
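A minimal sketch of the name sort mentioned above, assuming samtools 1.3 or later is on the PATH (for the `-o` option of `sort`) and that the input path ends in `.sam`. The function name `name_sort_sam` and the `.namesorted.bam` suffix are made up for illustration and are not from the thread.

```shell
# Hypothetical helper: name-sort a SAM file so that mates sit on adjacent lines.
# Assumes samtools >= 1.3; the output file name is an illustrative choice.
name_sort_sam() {
    in_sam="$1"
    out_bam="${in_sam%.sam}.namesorted.bam"
    # -n sorts by read name instead of by coordinate (the samtools default).
    samtools sort -n -o "$out_bam" "$in_sam" && printf '%s\n' "$out_bam"
}
```

The resulting BAM could then be given to htseq-count with `-f bam -r name`. A coordinate-sorted file would instead need `-r pos`, which is the mismatch described in this post.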



      • #4
        Thanks for the clarification, this will save a lot of time!



        • #5
          Originally posted by dpryan
          You do not need to sort the output of hisat2 before giving it to htseq-count.
          When examining the header of some SAM files output by hisat2 that I have been working with, I noticed that it contains this line:
          Code:
          @HD	VN:1.0	SO:unsorted
          I know you said hisat2 outputs sorted SAM files, so what does this mean?



          • #6
            Originally posted by ronaldrcutler
            When examining the header of some SAM files output by hisat2 that I have been working with, I noticed that it contains this line:
            Code:
            @HD	VN:1.0	SO:unsorted
            I know you said hisat2 outputs sorted SAM files, so what does this mean?
            You could use featureCounts instead. It is much faster and will sort the BAM/SAM files if needed.

            Looks like HISAT2's output is unsorted.
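A rough sketch for checking the declared sort order of a SAM file, using only awk. The function name `sam_sort_order` is hypothetical. It reads the `SO:` tag from the `@HD` line and prints `unknown` when the tag is missing, which is the default the SAM specification gives for an absent `SO`. Keep in mind that the tag only records what the file claims about coordinate or name sorting; it does not by itself say whether mates are on adjacent lines.

```shell
# Hypothetical helper: print the SO: value from the @HD header line of a SAM file.
sam_sort_order() {
    awk -F'\t' '
        # Scan the fields of the @HD line for the SO: tag.
        /^@HD/ { for (i = 2; i <= NF; i++) if ($i ~ /^SO:/) { print substr($i, 4); found = 1; exit } }
        # Stop at the first alignment record; the header is over.
        !/^@/  { exit }
        END    { if (!found) print "unknown" }
    ' "$1"
}
```

For the header line quoted above, `sam_sort_order Sample_1.sam` (file name illustrative) would print `unsorted`.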
            Last edited by GenoMax; 06-28-2016, 05:31 PM.



            • #7
              To follow up: sorting the SAM files removed these warnings that I had in all of them:
              Code:
              Warning: Malformed SAM line: MRNM != '*' although flag bit &0x0008 set
              Warning: Malformed SAM line: RNAME != '*' although flag bit &0x0004 set
              Warning: Malformed SAM line: MRNM == '=' although read is not aligned.
              But not this warning, which was similar in all of them (however, I just ignored it):
              Code:
              Warning: Read ACB052:253:C76YKACXX:2:1101:2245:1957 claims to have an aligned mate which could not be found in an adjacent line.
              When comparing the sorted and unsorted files using the 'diff' command, there were no differences!
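Before ignoring the mate warning, it may help to know how many reads triggered it. A small sketch, assuming the warnings were redirected to a text file as in the first post (`2>Sample_1_Hisat2_Counts_OUTPUT_WARNINGS.txt`); the function name `count_mate_warnings` is made up for illustration.

```shell
# Hypothetical helper: count htseq-count "mate not found" warnings in a log file.
count_mate_warnings() {
    awk '/claims to have an aligned mate which could not be found in an adjacent line/ { n++ }
         END { print n + 0 }' "$1"
}
```

Comparing this number with the "SAM alignment pairs processed" total at the bottom of the same file gives a sense of whether the affected reads are a small fraction. If the log does not list every warning, treat the count as a lower bound.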
