Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • HTSeq error using STAR alignments

    Hi,
    I am trying to use STAR alignments as input to the HTSeq count tools suitable for either DESeq or DEXSeq.

    The error I am getting is:

    2181086 GFF lines processed.
    Error occured when reading first line of sam file.
    Error: ("SAM optional field with illegal type letter ':'", 'line 28 of file GRL1259_76_GAGATTCC_L005_R1_001_AT_QT.paire
    R_aligned.sam.bam.sorted.bam.sam')
    [Exception type: ValueError, raised in _HTSeq.pyx:1180]

    This is the first line of the samfile, the 28th line looks similar with a different read.
    HWI-1KL163:53:C2191ACXX:1:1204:19189:58727 99 1 155979169 255 51M = 155979289 171 CCTTCACTATGGCTGAGAGCAGGGCAGGATCCAGGAGAAAGTGGCCAAGGG CCCFFFFFHHHHHJJJJJJIJJJJJJJJIJJJJIJFHHJJJFHGIJJJJJJ NH:i:1 HI:i:1 AS:i:100 nM:i:0

    I am using the following commands:

    STAR --genomeDir $genome_DIR --runThreadN 12 --genomeLoad NoSharedMemory --outSAMmode Full --outFilterMismatchNmax 3 --outFilterMultimapNmax 5 --readFilesIn $align_FILES

    samtools view -h file.bam -o file.bam.sam

    python -m HTSeq.scripts.count -m intersection-strict -t exon -s no -i gene_id $BAM_NAME.sorted.bam.sam $GTF >$BAM_NAME.sorted.bam.sam.out

    I am using samtools 0.1.19, HTSeq 0.5.4p3, STAR 2.3.0

    Thanks,
    Bob

  • #2
    Can you post the actual 28th line? I think that this will include the header, so it could be the line that you posted. One odd thing is the "nM" field, which should be "NM". You might also try just making a smaller version of the same file, excluding the line that causes issues. You can then see if this crops up elsewhere.

    Comment


    • #3
      Hi,
      it looks like it doesn't like the second sequence in all cases, 3000 jobs all choked.
      Here are the first two from one of the files.
      HWI-1KL163:5422YEACXX:5:1101:1627:2223 99 3 196083660 255 51M = 196089177 5568 TGAGAAGTTCAAAACGTTCAT
      TTGGGTATCCTTTAGACTGCACGTGCTTCA @CCFDDFDHHHHHJJJIJJJJJJJJJGHIJJJJJIHIJJJJJJJJJJJJJG NH:i:1 HI:i:1 AS:i:99 nM:i:0 jM:B:c,-1 jI:B:i,-1
      HWI-1KL163:5422YEACXX:5:1101:1627:2223 147 3 196089177 255 51M = 196083660 -5568 TCCCCTCCACTACTCCATCTG
      CTTTTTCAGGTGGCATCTCCAGTATCTCTG GJJHF?HGFHHDIGJJJIJJJJJJJJJJJJJIJJJJIJHHHHHFFFFFCCC NH:i:1 HI:i:1 AS:i:99 nM:i:0 jM:B:c,-1 jI:B:i,-1

      Comment


      • #4
        I wonder if it's choking on the jM and/or jI tags. Try making a miniature SAM file with just the header and a couple read-pairs and then try removing those two fields (optionally changing "nM" to "NM"). I recall in the STAR manual, there's mention made that some of the additional fields can break other programs.

        Comment


        • #5
          Good ideas. I will give that a try.

          Comment


          • #6
            Hi,
            thanks, with that as a hint, I removed the --outSAMmode Full option and went with the defaults, all seems to work now.
            Bob

            Comment


            • #7
              Originally posted by bioBob View Post
              Hi,
              thanks, with that as a hint, I removed the --outSAMmode Full option and went with the defaults, all seems to work now.
              Bob
              That must be the option I'm remembering, glad that everything's working.

              Devon

              Comment


              • #8
                Originally posted by dpryan View Post
                That must be the option I'm remembering, glad that everything's working.

                Devon
                Hi @bioBob and @dpryan

                you are right - HTseq cannot deal with jM:B:c,-1 and jI:B:i,-1 SAM attributes which are output if you use (non-default) --outSAMmode Full. These new type of "array" attributes are relatively new to SAM, they appeared in 0.1.17, I believe. HTseq (as well as Cufflinks and probably others) seem to have a problem with these attributes. These attributes are convenient but not really required, so you can either run STAR without them, or cut the corresponding columns before feeding them to HTseq or Cufflinks. I probably need to report this problem to HTseq authors.

                nM attribute is intentionally named differently than the standard NM attributes because nM contains the number of mismatches in both mates of a paired-end read, and does not include indels. It should not cause any problems with the downstream processing.

                Cheers
                Alex

                Comment


                • #9
                  HTSeq works fine with BAM

                  I had the same issue when trying to count sam files in HTSeq. However, if you got pysam installed you can run it with the -f bam flag and the new fields are no longer an issue.

                  Comment

                  Latest Articles

                  Collapse

                  • seqadmin
                    Strategies for Sequencing Challenging Samples
                    by seqadmin


                    Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                    03-22-2024, 06:39 AM
                  • seqadmin
                    Techniques and Challenges in Conservation Genomics
                    by seqadmin



                    The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                    Avian Conservation
                    Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                    03-08-2024, 10:41 AM

                  ad_right_rmr

                  Collapse

                  News

                  Collapse

                  Topics Statistics Last Post
                  Started by seqadmin, Yesterday, 06:37 PM
                  0 responses
                  11 views
                  0 likes
                  Last Post seqadmin  
                  Started by seqadmin, Yesterday, 06:07 PM
                  0 responses
                  10 views
                  0 likes
                  Last Post seqadmin  
                  Started by seqadmin, 03-22-2024, 10:03 AM
                  0 responses
                  51 views
                  0 likes
                  Last Post seqadmin  
                  Started by seqadmin, 03-21-2024, 07:32 AM
                  0 responses
                  67 views
                  0 likes
                  Last Post seqadmin  
                  Working...
                  X