Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Sorting SAM files

    Hello all. I've been fiddling around with htseq-count using SAM files that were output by hisat2. I am using paired reads here and realized as I was going through the Htseq-count Manual that I should sort my data beforehand using
    Code:
    samtools sort
    I used this command to execute htseq-count:
    Code:
    htseq-count -m intersection-nonempty -i Name -s reverse -r name Sample_1_hisat2_results_sorted.sam /Volumes/cachannel/RNA_SEQ/Notch_RNASeq/9.1_Reference_Files/XENLA_UTAmayball_cdna_longest_CHRS2.gff3 >Sample_1_Hisat2_Counts.txt 2>Sample_1_Hisat2_Counts_OUTPUT_WARNINGS.txt
    So now I have two htseq-count output files to compare: one where the data is sorted and one where the data is unsorted.

    In the unsorted warning file I notice at the bottom: "4674733 SAM alignment pairs processed"
    In the sorted warning file I notice at the bottom: "8572367 SAM alignment paris processed"

    That is about 2X the amount of pairs processed for the sorted data! The count txt files for both runs look good, but I'm guessing there is much more data on counts in the sorted one. When comparing counts for specific genes between the two, the sorted appears to have around double the amount. My question is why is this and what are the consequences of sorting? Does it lead to more precise counts, or is it not even worth it?
    Last edited by ronaldrcutler; 06-16-2016, 09:36 AM.

  • #2
    Also, another related question would be: is the SAM alignments output by hisat2 sorted?

    Comment


    • #3
      htseq-count wants you to "samtools sort -n", not "samtools sort". The difference is the cause of the differing results. You do not need to sort the output of hisat2 before giving it to htseq-count.

      Note that since you coordinate sorted the file and then told htseq-count that it was name sorted that the results for that are...inaccurate. The file with the smaller number of processed alignments is the correct one.

      Comment


      • #4
        Thanks for the clarification, this will save a lot of time!

        Comment


        • #5
          Originally posted by dpryan View Post
          You do not need to sort the output of hisat2 before giving it to htseq-count.
          When examining the head of some SAM files I have been working with output from hisat2, I noticed that the head contains this line:
          Code:
          @HD	VN:1.0	SO:unsorted
          I know you said hisat2 outputs sorted SAM files, so what does this mean?

          Comment


          • #6
            Originally posted by ronaldrcutler View Post
            When examining the head of some SAM files I have been working with output from hisat2, I noticed that the head contains this line:
            Code:
            @HD	VN:1.0	SO:unsorted
            I know you said hisat2 outputs sorted SAM files, so what does this mean?
            You could use instead featureCounts. It is much faster and will sort the bam/sam files if needed.

            Looks like HISAT2's output is unsorted.
            Last edited by GenoMax; 06-28-2016, 05:31 PM.

            Comment


            • #7
              To follow up: sorting the sam files removed this error that I had in all of them:
              Code:
              Warning: Malformed SAM line: MRNM != '*' although flag bit &0x0008 set
              Warning: Malformed SAM line: RNAME != '*' although flag bit &0x0004 set
              Warning: Malformed SAM line: MRNM == '=' although read is not aligned.
              But not this error, which was similar in all of them (however, I just ignored it):
              Code:
              Warning: Read ACB052:253:C76YKACXX:2:1101:2245:1957 claims to have an aligned mate which could not be found in an adjacent line.
              When comparing the sorted and unsorted files using the 'diff' command, there were no differences!

              Comment

              Latest Articles

              Collapse

              • seqadmin
                Strategies for Sequencing Challenging Samples
                by seqadmin


                Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                03-22-2024, 06:39 AM
              • seqadmin
                Techniques and Challenges in Conservation Genomics
                by seqadmin



                The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                Avian Conservation
                Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                03-08-2024, 10:41 AM

              ad_right_rmr

              Collapse

              News

              Collapse

              Topics Statistics Last Post
              Started by seqadmin, Yesterday, 06:37 PM
              0 responses
              10 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, Yesterday, 06:07 PM
              0 responses
              9 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, 03-22-2024, 10:03 AM
              0 responses
              50 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, 03-21-2024, 07:32 AM
              0 responses
              67 views
              0 likes
              Last Post seqadmin  
              Working...
              X