  • Sorting SAM files

    Hello all. I've been experimenting with htseq-count on SAM files that were output by hisat2. I am using paired-end reads and realized, while going through the htseq-count manual, that I should sort my data beforehand using
    Code:
    samtools sort
    I used this command to execute htseq-count:
    Code:
    htseq-count -m intersection-nonempty -i Name -s reverse -r name Sample_1_hisat2_results_sorted.sam /Volumes/cachannel/RNA_SEQ/Notch_RNASeq/9.1_Reference_Files/XENLA_UTAmayball_cdna_longest_CHRS2.gff3 >Sample_1_Hisat2_Counts.txt 2>Sample_1_Hisat2_Counts_OUTPUT_WARNINGS.txt
    So now I have two htseq-count output files to compare: one where the data is sorted and one where the data is unsorted.

    In the unsorted warning file I notice at the bottom: "4674733 SAM alignment pairs processed"
    In the sorted warning file I notice at the bottom: "8572367 SAM alignment pairs processed"

    That is nearly twice the number of pairs processed for the sorted data! The count txt files for both runs look good, but I'm guessing the sorted one contains many more counts. When I compare counts for specific genes between the two, the sorted run has roughly double. Why is this, and what are the consequences of sorting? Does it lead to more precise counts, or is it not even worth doing?
    Last edited by ronaldrcutler; 06-16-2016, 09:36 AM.
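As a rough check of the "about double" impression, the two totals quoted in this post can be compared directly. The sketch below recreates the two summary lines in temporary demo files (the file variables and helper names `pairs_processed` and `pair_ratio` are made up for illustration) and prints the ratio of sorted to unsorted pairs.

```shell
# Demo warning files holding the two summary lines quoted in the post
# (hypothetical stand-ins for the real htseq-count warning files).
unsorted_log="$(mktemp)"
sorted_log="$(mktemp)"
echo "4674733 SAM alignment pairs processed" > "$unsorted_log"
echo "8572367 SAM alignment pairs processed" > "$sorted_log"

# Print the leading number from the last "pairs processed" line in a file.
pairs_processed() {
    awk '/SAM alignment pairs processed/ { n = $1 } END { print n }' "$1"
}

# Ratio of the first file's total to the second file's total, two decimals.
pair_ratio() {
    awk -v a="$(pairs_processed "$1")" -v b="$(pairs_processed "$2")" \
        'BEGIN { printf "%.2f\n", a / b }'
}

pair_ratio "$sorted_log" "$unsorted_log"   # prints 1.83
```

So the coordinate-sorted run reports about 1.83 times as many pairs as the unsorted run, which is close to, but not exactly, double.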

  • #2
    Also, a related question: are the SAM alignments output by hisat2 sorted?



    • #3
      htseq-count wants you to use "samtools sort -n" (sort by read name), not "samtools sort" (sort by coordinate). The difference is the cause of the differing results. You do not need to sort the output of hisat2 before giving it to htseq-count.

      Note that because you coordinate-sorted the file and then told htseq-count that it was name-sorted, the results for that run are inaccurate. The file with the smaller number of processed alignments is the correct one.
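If a file does need re-sorting before counting (for example, one that has already been coordinate-sorted), the name-sort route described in this reply could look roughly like the sketch below. The function name `name_sort_and_count`, its three arguments, and the output file names are hypothetical; the htseq-count options are the ones from post #1, and the sketch assumes samtools 1.x and htseq-count are on the PATH.

```shell
# Name-sort a SAM file, then run htseq-count with -r name so the declared
# order matches the real order.
# Hypothetical arguments: $1 = input SAM, $2 = GFF3 annotation, $3 = output prefix.
name_sort_and_count() {
    in_sam="$1"
    gff="$2"
    prefix="$3"

    # -n sorts by read name (QNAME) instead of coordinate; -O sam keeps SAM output.
    samtools sort -n -O sam -o "${prefix}_namesorted.sam" "$in_sam" || return 1

    # Counts go to one file, warnings and progress messages to another.
    htseq-count -m intersection-nonempty -i Name -s reverse -r name \
        "${prefix}_namesorted.sam" "$gff" \
        > "${prefix}_counts.txt" 2> "${prefix}_warnings.txt"
}

# Example call (file names are placeholders):
# name_sort_and_count Sample_1_hisat2_results.sam annotation.gff3 Sample_1
```

The important part is that the sort actually performed and the value given to `-r` agree; a coordinate-sorted file would need `-r pos` instead.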



      • #4
        Thanks for the clarification, this will save a lot of time!



        • #5
          Originally posted by dpryan View Post
          You do not need to sort the output of hisat2 before giving it to htseq-count.
          When examining the head of some SAM files I have been working with output from hisat2, I noticed that the head contains this line:
          Code:
          @HD	VN:1.0	SO:unsorted
          I know you said hisat2 outputs sorted SAM files, so what does this mean?



          • #6
            Originally posted by ronaldrcutler View Post
            When examining the head of some SAM files I have been working with output from hisat2, I noticed that the head contains this line:
            Code:
            @HD	VN:1.0	SO:unsorted
            I know you said hisat2 outputs sorted SAM files, so what does this mean?
            You could use featureCounts instead. It is much faster and will sort the BAM/SAM files if needed.

            Looks like HISAT2's output is unsorted.
            Last edited by GenoMax; 06-28-2016, 05:31 PM.
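The `@HD` line quoted above records the file's declared sort order in its `SO:` tag; the SAM specification allows the values `unknown`, `unsorted`, `queryname` and `coordinate`. A minimal sketch for reading that tag with standard tools follows. The demo file and the helper name `sam_sort_order` are made up for illustration.

```shell
# Build a tiny demo SAM header so the snippet runs on its own
# (hypothetical stand-in for a real hisat2 output file).
demo_sam="$(mktemp)"
printf '@HD\tVN:1.0\tSO:unsorted\n@SQ\tSN:chr1\tLN:1000\n' > "$demo_sam"

# Print the value of the SO: tag on the @HD line, or "unknown" if there is none.
sam_sort_order() {
    awk -F'\t' '
        $1 == "@HD" {
            for (i = 2; i <= NF; i++)
                if ($i ~ /^SO:/) { print substr($i, 4); found = 1 }
            exit
        }
        !/^@/ { exit }          # stop at the first alignment record
        END { if (!found) print "unknown" }
    ' "$1"
}

sam_sort_order "$demo_sam"   # prints: unsorted
```

Note that `SO:unsorted` only says the file is not globally sorted. It does not say whether the two mates of a pair sit on adjacent lines, which is the property htseq-count relies on for paired-end data.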



            • #7
              To follow up: sorting the SAM files removed these warnings, which I had in all of them:
              Code:
              Warning: Malformed SAM line: MRNM != '*' although flag bit &0x0008 set
              Warning: Malformed SAM line: RNAME != '*' although flag bit &0x0004 set
              Warning: Malformed SAM line: MRNM == '=' although read is not aligned.
              But not this warning, which was similar in all of them (however, I just ignored it):
              Code:
              Warning: Read ACB052:253:C76YKACXX:2:1101:2245:1957 claims to have an aligned mate which could not be found in an adjacent line.
              When comparing the sorted and unsorted files using the 'diff' command, there were no differences!
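The "mate which could not be found in an adjacent line" warning concerns records whose mate is not on a neighbouring line. One rough way to look for such records is to list read names that appear only once in a row. This is a sketch on a made-up three-record SAM file; the helper name `lone_reads` is hypothetical, and the check looks only at read names, not at the flags htseq-count actually inspects.

```shell
# Demo SAM: readA has both mates on adjacent lines, readB appears alone
# although its flag (97) claims a mapped mate.
pairs_sam="$(mktemp)"
printf '@HD\tVN:1.0\tSO:unsorted\n' > "$pairs_sam"
printf 'readA\t99\tchr1\t100\t60\t50M\t=\t300\t250\t*\t*\n' >> "$pairs_sam"
printf 'readA\t147\tchr1\t300\t60\t50M\t=\t100\t-250\t*\t*\n' >> "$pairs_sam"
printf 'readB\t97\tchr1\t500\t60\t50M\t=\t700\t250\t*\t*\n' >> "$pairs_sam"

# Print every read name whose run of consecutive identical QNAMEs has length 1.
lone_reads() {
    awk -F'\t' '
        /^@/ { next }                                  # skip header lines
        $1 != prev {
            if (prev != "" && n == 1) print prev       # previous run was a singleton
            prev = $1; n = 0
        }
        { n++ }
        END { if (prev != "" && n == 1) print prev }   # check the final run
    ' "$1"
}

lone_reads "$pairs_sam"   # prints: readB
```

Reads listed this way are candidates for the warning; whether htseq-count actually complains also depends on the flag bits of each record.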
