Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • HTSeq sort sam

    Hi,

    I spent a few hours on this problem. I am using the latest HTSeq (0.5.3).

    The sam file was generated via bwa. I used sort, samtools, and picard to sort the sam files but in all cases HTSeq complained excessively about "(Is the SAM file properly sorted?)".

    I used "cat my_file.sorted.sam |grep "read_name"" and here are two examples:

    1)
    5G_BaTVPgyNB42 147 All_unigene019861 858 255 75M = 745 -188 GTGTAGTCCGTGTGTGAAAGGCTGGGATAAAAAGTCAGGTTATTAGGTTGCTGTAGGGAACAGTCTGTTAATTCA bbbbbbcacbacbbcbbbbbbbbbb_abbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbabb XA:i:0 MD:Z:75 NM:i:0

    5G_BaTVPgyNB42 99 All_unigene019861 745 255 75M = 858 188 GTGTAAAGCTACTCGACAGTTGATTCAAACGCAATGAATTAGAATAGAAGATTTTGTTTGAATCACGAGCAACAG bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb_a_bbbbbbbcbbbbcba XA:i:0 MD:Z:32G42 NM:i:1

    2)
    1D_J7uMriyNB42 163 All_unigene003941 81 29 75M = 201 178 AAACCTACCACCGATTTCACAGACAAAGTCATAGAAGGCGAGGATGGACTAAAATTCGACACTCCGTCGTCCGCA bbbbbbbbbbbbbbbbbbbbbbbbbbbbcbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbcbbbabbacbb`c XT:A:U NM:i:0 SM:i:29 AM:i:29 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:75
    1D_J7uMriyNB42 87 All_unigene003941 201 29 58M17S = 81 -178 GTCGTCCAGTTGTCCGACAAACAAGGAGATCTGACTGAACAGAGTGACACGACGTCTGCAACTGGACATCTTGAC \aba\bcc`bcaabb`cbbbbbbabbbbbbbbbabbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb XT:A:M NM:i:1 SM:i:29 AM:i:29 XM:i:1 XO:i:0 XG:i:0 MD:Z:55T2

    Anybody can offer an explanation?

    Thank you,
    Douglas

  • #2
    Does HTSeq expect reads to be sorted by name? In which case the sort order should be easy to check:
    cat my_file.sorted.sam | sort -s -c -k 1,1

    Comment


    • #3
      Yes. HTSeq expects the reads sorted by name. My problem is that the paired reads are in the sam file and should be right next to each other. But the program still complained that it could not locate the mate in the sam. I tried different programs to sort the same SAM file but the warning messages persisted (and were abundant).

      Comment


      • #4
        Have you actually tried this?
        cat my_file.sorted.sam | sort -s -c -k 1,1

        Another thought is that in your example reads are not sorted by mate order, that is the second mate in the pair comes first. Not sure if this is relevant though.

        Comment


        • #5
          Hi vadim,

          Thank you for your suggestion. I will try your method and report back. (BTW, in the first example, the second mate was reported first; in the second one it is fine...)

          Douglas

          Comment


          • #6
            Hi vadim,

            I used "sort my_file.sam >my_file.sorted.sam". Then as per your suggestion, I used "cat my_file.sorted.sam |sort -s -c -k 1,1"). here is the output:

            sort: -:7094: disorder: 11_hu4B0jyNB42 99 All_unigene030891 58 075M = 174 191 GGATTTGGTGAAGTTTGTTTAATACTATTTATCTCTGGAGATTGAAATATAACATTCGATTTTTGTATATTAACA bbbbbbbbbbbbbbbbbbbaabbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb_abbbbbbbb XT:A:R NM:i:0 SM:i:0 AM:i:0 X0:i:2 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:75 XA:Z:All_unigene003037,+263,75M,0;

            What do you think?

            Thank you,
            Douglas

            Comment


            • #7
              Could you try sorting with this command?
              sort -s -k 1,1 my_file.sam > my_file.sorted.sam

              This will sort by the first column only, ignoring everything else.

              Comment


              • #8
                You may also want to do this prior to sorting:
                export LC_ALL=POSIX

                Comment


                • #9
                  Hi vadim,

                  I am happy to report it worked! I guess "export LC_ALL=POSIX" helped. I had not known it was related to the sorting or htseq-count. But you pointed me to the right direction and offered a working solution.

                  Thank you very much and now I am off for a great July 4th!

                  Douglas

                  Comment


                  • #10
                    Hi vadim
                    I had the same problem as DZhang in that my .sam file would not sort properly but thanks to your help it is not working also.

                    Unfortunately, i then encountered a new error message with Htseq-count.

                    Warning: Incorrect 'proper_pair' flag value for read pair HWI-EAS283_0004_FC:2:100:11697:4473#0


                    My data is paired-end reads frim Illumina GA. I used tophat v1.2.0 to align.

                    Any ideas or suggestions would be greatly appreciated!
                    Thanks
                    Kesa

                    Comment


                    • #11
                      I donot know why I have sort my reads. It still not work.

                      The output is from BWA as in this attachments.
                      Attached Files

                      Comment


                      • #12
                        Hi,

                        I should have posted my final resolution on my problem. Apparently HTSeq has problems with BWA-generated sam-files, or vice versa. I do not know why; I spent many hours on this and did not fix all the complaints from HTSeq. Then I simply used bowtie-generated sam file - same set of reads, reference, etc - and it worked without any issue.

                        Comment


                        • #13
                          My seq is RNA-seq, so if I use Tophat will have no problem?

                          If we use bwa and do not care the complaints from HTSeq, will this have some potential error?

                          Thank you for your share.

                          Originally posted by DZhang View Post
                          Hi,

                          I should have posted my final resolution on my problem. Apparently HTSeq has problems with BWA-generated sam-files, or vice versa. I do not know why; I spent many hours on this and did not fix all the complaints from HTSeq. Then I simply used bowtie-generated sam file - same set of reads, reference, etc - and it worked without any issue.

                          Comment


                          • #14
                            Hi fabrice,

                            If you use Tophat, I think it should work as it uses bowtie for alignment. As to HTSeq's complaint(s) about BWA-generated sam files, I would not take the chance to ignore the warning messages unless I can understand the source of the errors. (In my case, I did not understand the errors so I switched to bowtie.)

                            Comment


                            • #15
                              I do not get too much warning.

                              It is better to ask the author.

                              I have written to the author, but no response.

                              I looked into the source, it is here to give the warning.

                              def pair_SAM_alignments( alignments, bundle=False ):

                              def process_list( almnt_list ):
                              while len( almnt_list ) > 0:
                              a1 = almnt_list.pop( 0 )
                              # Find its mate
                              for a2 in almnt_list:
                              if a1.pe_which == a2.pe_which:
                              continue
                              if a1.aligned != a2.mate_aligned or a1.mate_aligned != a2.aligned:
                              continue
                              if not (a1.aligned and a2.aligned):
                              break
                              if a1.iv.chrom == a2.mate_start.chrom and a1.iv.start == a2.mate_start.pos and \
                              a2.iv.chrom == a1.mate_start.chrom and a2.iv.start == a1.mate_start.pos:
                              break
                              else:
                              if a1.mate_aligned:
                              warnings.warn( "Read " + a1.read.name + " claims to have an aligned mate " +
                              "which could not be found. (Is the SAM file properly sorted?)" )
                              a2 = None
                              if a2 is not None:
                              almnt_list.remove( a2 )
                              if a1.pe_which == "first":
                              yield ( a1, a2 )
                              else:
                              assert a1.pe_which == "second"
                              yield ( a2, a1 )
                              Originally posted by DZhang View Post
                              Hi fabrice,

                              If you use Tophat, I think it should work as it uses bowtie for alignment. As to HTSeq's complaint(s) about BWA-generated sam files, I would not take the chance to ignore the warning messages unless I can understand the source of the errors. (In my case, I did not understand the errors so I switched to bowtie.)
                              Last edited by fabrice; 08-09-2011, 12:49 PM.

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Advancing Precision Medicine for Rare Diseases in Children
                                by seqadmin




                                Many organizations study rare diseases, but few have a mission as impactful as Rady Children’s Institute for Genomic Medicine (RCIGM). “We are all about changing outcomes for children,” explained Dr. Stephen Kingsmore, President and CEO of the group. The institute’s initial goal was to provide rapid diagnoses for critically ill children and shorten their diagnostic odyssey, a term used to describe the long and arduous process it takes patients to obtain an accurate...
                                12-16-2024, 07:57 AM
                              • seqadmin
                                Recent Advances in Sequencing Technologies
                                by seqadmin



                                Innovations in next-generation sequencing technologies and techniques are driving more precise and comprehensive exploration of complex biological systems. Current advancements include improved accessibility for long-read sequencing and significant progress in single-cell and 3D genomics. This article explores some of the most impactful developments in the field over the past year.

                                Long-Read Sequencing
                                Long-read sequencing has seen remarkable advancements,...
                                12-02-2024, 01:49 PM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, 12-17-2024, 10:28 AM
                              0 responses
                              33 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 12-13-2024, 08:24 AM
                              0 responses
                              49 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 12-12-2024, 07:41 AM
                              0 responses
                              34 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 12-11-2024, 07:45 AM
                              0 responses
                              46 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X