Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • ronaldrcutler
    Member
    • May 2016
    • 80

    Sorting SAM files

    Hello all. I've been fiddling around with htseq-count using SAM files that were output by hisat2. I am using paired reads here and realized as I was going through the Htseq-count Manual that I should sort my data beforehand using
    Code:
    samtools sort
    I used this command to execute htseq-count:
    Code:
    htseq-count -m intersection-nonempty -i Name -s reverse -r name Sample_1_hisat2_results_sorted.sam /Volumes/cachannel/RNA_SEQ/Notch_RNASeq/9.1_Reference_Files/XENLA_UTAmayball_cdna_longest_CHRS2.gff3 >Sample_1_Hisat2_Counts.txt 2>Sample_1_Hisat2_Counts_OUTPUT_WARNINGS.txt
    So now I have two htseq-count output files to compare: one where the data is sorted and one where the data is unsorted.

    In the unsorted warning file I notice at the bottom: "4674733 SAM alignment pairs processed"
    In the sorted warning file I notice at the bottom: "8572367 SAM alignment paris processed"

    That is about 2X the amount of pairs processed for the sorted data! The count txt files for both runs look good, but I'm guessing there is much more data on counts in the sorted one. When comparing counts for specific genes between the two, the sorted appears to have around double the amount. My question is why is this and what are the consequences of sorting? Does it lead to more precise counts, or is it not even worth it?
    Last edited by ronaldrcutler; 06-16-2016, 09:36 AM.
  • ronaldrcutler
    Member
    • May 2016
    • 80

    #2
    Also, another related question would be: is the SAM alignments output by hisat2 sorted?

    Comment

    • dpryan
      Devon Ryan
      • Jul 2011
      • 3478

      #3
      htseq-count wants you to "samtools sort -n", not "samtools sort". The difference is the cause of the differing results. You do not need to sort the output of hisat2 before giving it to htseq-count.

      Note that since you coordinate sorted the file and then told htseq-count that it was name sorted that the results for that are...inaccurate. The file with the smaller number of processed alignments is the correct one.

      Comment

      • ronaldrcutler
        Member
        • May 2016
        • 80

        #4
        Thanks for the clarification, this will save a lot of time!

        Comment

        • ronaldrcutler
          Member
          • May 2016
          • 80

          #5
          Originally posted by dpryan View Post
          You do not need to sort the output of hisat2 before giving it to htseq-count.
          When examining the head of some SAM files I have been working with output from hisat2, I noticed that the head contains this line:
          Code:
          @HD	VN:1.0	SO:unsorted
          I know you said hisat2 outputs sorted SAM files, so what does this mean?

          Comment

          • GenoMax
            Senior Member
            • Feb 2008
            • 7142

            #6
            Originally posted by ronaldrcutler View Post
            When examining the head of some SAM files I have been working with output from hisat2, I noticed that the head contains this line:
            Code:
            @HD	VN:1.0	SO:unsorted
            I know you said hisat2 outputs sorted SAM files, so what does this mean?
            You could use instead featureCounts. It is much faster and will sort the bam/sam files if needed.

            Looks like HISAT2's output is unsorted.
            Last edited by GenoMax; 06-28-2016, 05:31 PM.

            Comment

            • ronaldrcutler
              Member
              • May 2016
              • 80

              #7
              To follow up: sorting the sam files removed this error that I had in all of them:
              Code:
              Warning: Malformed SAM line: MRNM != '*' although flag bit &0x0008 set
              Warning: Malformed SAM line: RNAME != '*' although flag bit &0x0004 set
              Warning: Malformed SAM line: MRNM == '=' although read is not aligned.
              But not this error, which was similar in all of them (however, I just ignored it):
              Code:
              Warning: Read ACB052:253:C76YKACXX:2:1101:2245:1957 claims to have an aligned mate which could not be found in an adjacent line.
              When comparing the sorted and unsorted files using the 'diff' command, there were no differences!

              Comment

              Latest Articles

              Collapse

              ad_right_rmr

              Collapse

              News

              Collapse

              Topics Statistics Last Post
              Started by SEQadmin2, Yesterday, 10:09 AM
              0 responses
              10 views
              0 reactions
              Last Post SEQadmin2  
              Started by SEQadmin2, 06-04-2026, 08:59 AM
              0 responses
              19 views
              0 reactions
              Last Post SEQadmin2  
              Started by SEQadmin2, 06-02-2026, 12:03 PM
              0 responses
              26 views
              0 reactions
              Last Post SEQadmin2  
              Started by SEQadmin2, 06-02-2026, 11:40 AM
              0 responses
              21 views
              0 reactions
              Last Post SEQadmin2  
              Working...