  • Error using HTSeq

    Hi,

    I am relatively new to using Linux and to RNA-seq data analysis. I have been given some BAM files and been told to analyse the data in R. However, before I do that, I need to generate a count table for each sample. I am using HTSeq to do this but am hitting an error every time. Below is what I'm typing and the resulting error:

    $ htseq-count -s no 01_v2.sam "gencode.v21.annotation.gtf" > 01_v2.counts
    100000 GFF lines processed.
    200000 GFF lines processed.
    300000 GFF lines processed.
    400000 GFF lines processed.
    500000 GFF lines processed.
    600000 GFF lines processed.
    700000 GFF lines processed.
    800000 GFF lines processed.
    900000 GFF lines processed.
    1000000 GFF lines processed.
    1100000 GFF lines processed.
    1200000 GFF lines processed.
    1300000 GFF lines processed.
    1400000 GFF lines processed.
    1500000 GFF lines processed.
    1600000 GFF lines processed.
    1700000 GFF lines processed.
    1800000 GFF lines processed.
    1900000 GFF lines processed.
    2000000 GFF lines processed.
    2100000 GFF lines processed.
    2200000 GFF lines processed.
    2300000 GFF lines processed.
    2400000 GFF lines processed.
    2500000 GFF lines processed.
    2546594 GFF lines processed.
    Error occured when reading first line of sam file.
    Error: ("Malformed SAM line: MRNM == '*' although flag bit &0x0008 cleared", 'line 1 of file 01_v2.sam')
    [Exception type: ValueError, raised in _HTSeq.pyx:1321]


    $ head 01_v2.sam
    FCC64Y1ACXX:1:2311:16630:3201#GCCAATAT 81 chr10 60021 0 49M * 0 0 GCATCGGGGTGCTCTGGTTTTGTTGTTGTTATTTCTGAATGACATTTAC hiihhiiiiiiiiiiiiihdhiiiihhfiiihihhheggggeeeeebb_ NM:i:0 MD:Z:49
    FCC64Y1ACXX:1:2206:15375:73069#GCCAATAT 65 chr10 60124 0 49M * 0 0 GACAGGTCTTAATTGACGCGCTGTTCAGCCCTTTGAGTTCGGTTGAGTT bbbeeecegggggifhhfhiiiihhiiiiihhihfefcffhiefhcg_c NM:i:0 MD:Z:49
    FCC64Y1ACXX:1:2214:12808:62966#NCCAATAT 65 chr10 60154 0 49M * 0 0 CTTTGAGTTCGGTTGAGTTTTGGGTTGGAGAATTTTCTTCCACAAGGGA bbbeeeeeggggghihiggiiiiighiihhihhiiihiiihihiiiiig NM:i:0 MD:Z:49
    FCC64Y1ACXX:1:1102:13233:83253#GCCAATAT 65 chr10 60853 0 49M * 0 0 CGCAGATGGATAGATTACTGTTATTAGTTCTCATTTCATTGTTAATTTT bbbeeeeeggfgghiiiiiihhdhidgghfhihhiiihhhifhihhhhi NM:i:1 MD:Z:0T48
    FCC64Y1ACXX:1:1308:1800:34170#GCCAATAT 65 chr10 60930 30 49M * 0 0 TGCCTTTCAATATACCTTAGTGGAATTTATTAAATTTTCCTGGATGTCC bbbeeeeegggggiiiihiighheiiihghhiiiiiiiiiiiihiigii NM:i:0 MD:Z:49
    FCC64Y1ACXX:1:1307:19223:71708#GCCAATAT 65 chr10 61145 0 49M * 0 0 AATTCCACTTGGTTATATTGTCTAACTTTTTTCTAATTTTCTTTCATTT ^[\cccSaaccccd[bKQ[`Y^`[Y^`ecaddccd[accdccccdcbcd NM:i:0 MD:Z:49
    FCC64Y1ACXX:1:2206:7623:63302#GCCAATAT 81 chr10 62142 0 49M * 0 0 CTATTTGCACATATAGTTTTAATACCAATGACGTTAAAATGTATAACAC ghfcf^fhhiiiiiiiiiiiihggiiiiiihiiiiigggggeeeee_ab NM:i:0 MD:Z:49
    FCC64Y1ACXX:1:1216:3028:53281#NCCAATAT 65 chr10 62384 0 49M * 0 0 GTCCAGAGACAAATATTTTAAATATTGAAGTTGAAGACCTAAAAATGTG ___`ccdegeegehhhgfffhhhhXeeg_gghhhf_ddfghhfghaaa_ NM:i:1 MD:Z:0T48
    FCC64Y1ACXX:1:1207:12316:62577#GCCAATAT 81 chr10 66812 0 49M * 0 0 TGAAAGCATTCCCTTTGAGAATTGGAACAAGACGAGGAGACTACTCTCA iiiiiiiiihiihihhhifiiiiiiihhiiiiiiiigggggeceeebb_ NM:i:0 MD:Z:49


    I have also posted the first few lines of the file (the output of head above), but I don't know what is wrong with line 1.

    If anyone can help, that would be greatly appreciated,

    Thanks
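
    For readers trying to interpret the error message: the second column of each SAM record is a bitwise FLAG, and the seventh column (RNEXT, called MRNM in older versions of the spec) names the reference the mate aligned to. Going by the wording of the error, HTSeq appears to object when a read is flagged as paired and the "mate unmapped" bit (0x8) is not set, yet RNEXT is '*'. Below is a minimal sketch for checking this by hand. The helper names `decode_flag` and `mate_fields_inconsistent` are made up for this illustration and are not part of HTSeq, and the example record is shortened and assumes tab-separated fields as in a real SAM file.

```python
# Bit meanings of the SAM FLAG column, as defined in the SAM specification.
SAM_FLAGS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse",
    0x20: "mate_reverse",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag):
    """Return the names of all bits that are set in a SAM FLAG value."""
    return [name for bit, name in SAM_FLAGS.items() if flag & bit]


def mate_fields_inconsistent(sam_line):
    """True if the read is flagged as paired with a mapped mate but RNEXT is '*'.

    This mirrors the condition described in the error message; it is an
    assumption about what HTSeq checks, not a copy of its code.
    """
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])
    rnext = fields[6]
    return bool(flag & 0x1) and not (flag & 0x8) and rnext == "*"


# Shortened, hypothetical record with the same FLAG and mate fields as line 1.
example = "read1\t81\tchr10\t60021\t0\t49M\t*\t0\t0\tGCATC\thiihh\n"
print(decode_flag(81))
print(decode_flag(65))
print(mate_fields_inconsistent(example))
```

    With the values from the file above, FLAG 81 decodes to paired, reverse strand, first in pair, and FLAG 65 to paired, first in pair. Neither has 0x8 set, while RNEXT is '*' on every line shown, which matches the complaint about line 1.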

  • #2
    You may want to have a look at this Biostars thread.



    • #3
      Thanks for the reply. I have looked at that post previously. So does that mean the only way to resolve the problem would be to realign the raw data? Is there no way around this?

      It's just that I was given the aligned BAM files, as the alignment was done by an external company.

      Thanks
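
      One possible way around realigning, offered only as a sketch and not as a method confirmed anywhere in this thread: if the mates really are absent from the file and it is acceptable to count the reads as single-end, the pairing-related FLAG bits can be cleared so that no mate information is expected. Whether that is appropriate depends on how the external company produced the alignments, so it is worth asking them first. The function names `strip_pairing` and `rewrite_sam` are hypothetical helpers written for this example.

```python
# FLAG bits that only make sense for paired-end reads (SAM specification):
# paired, proper pair, mate unmapped, mate reverse, first in pair, second in pair.
PAIRING_BITS = 0x1 | 0x2 | 0x8 | 0x20 | 0x40 | 0x80


def strip_pairing(sam_line):
    """Return the SAM line with all pairing-related FLAG bits cleared.

    Header lines (starting with '@') are passed through unchanged.
    Assumes tab-separated fields, as in a real SAM file.
    """
    if sam_line.startswith("@"):
        return sam_line
    fields = sam_line.rstrip("\n").split("\t")
    fields[1] = str(int(fields[1]) & ~PAIRING_BITS)
    return "\t".join(fields) + "\n"


def rewrite_sam(in_handle, out_handle):
    """Copy a SAM stream, clearing pairing bits on every alignment record."""
    for line in in_handle:
        out_handle.write(strip_pairing(line))
```

      To apply it, open the original SAM file for reading and a new file for writing, pass both handles to `rewrite_sam`, and run htseq-count on the new file. Strand and mapping bits such as 0x10 are left untouched, so FLAG 81 becomes 16 and FLAG 65 becomes 0.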
