Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • #31
    Dear Simon,
    I am using HT-Seq counts for summarizing the data of arabidopsis gene. When i use the command: 'htseq-count -m intersection-strict -s no aligned_074262 TAIR10_GTF.gtf > counts_074262'.
    I get an error saying unable to read first line of the sam file. unable to determine 11 delimited tabs in the file. What can be the possible solution for this?
    Thanks.

    Comment


    • #32
      I get an error saying unable to read first line of the sam file.
      Well, how does the first line of your sam file look like?

      Comment


      • #33
        htseq-count -m intersection-strict -s no aligned_074262 TAIR10_GTF.gtf > counts_074262
        100000 GFF lines processed.
        200000 GFF lines processed.
        300000 GFF lines processed.
        400000 GFF lines processed.
        478324 GFF lines processed.
        Error occured when reading first line of sam file.
        Error: ('SAM line does not contain at least 11 tab-delimited fields.', 'line 1 of file aligned_074262')
        [Exception type: ValueError, raised in _HTSeq.pyx:1220]

        The file looks like
        SRR074262.3 HWI-EAS121:2:1:14:1646 + Chr3 328884 AGGGAGGGAGGGAGTGAGATTGNTTCGATCGCCAAT BCCCBCCBACCCBCACBCBBC:!<B@A<?>@:>8AA 0 22:A>N
        SRR074262.4 HWI-EAS121:2:1:15:1470 + Chr1 14296071 CTGGGTTTTTGTGTTATTGAGANTCTGAGTTTGAGA BBA>?96>@@>==?A?@A<<=4!.5;7268?@607= 0 22:G>N

        Comment


        • #34
          I think your "SAM" file has a bit of a strange format. What alignment program did you use to create it?
          Take a look at the example of my SAM file a few lines back and you can see some differences right away.

          Comment


          • #35
            I created it using Bowtie . I am actually following the same method as described in the seqwiki page. What do you think I should be doing?

            Comment


            • #36
              Did you use the -S option when running Bowtie? That's what will give you the standard SAM output file. Otherwise I think it gives you the SAM-like file that you currently have. Try running your Bowtie command again with -S and then use that file with HTseq.

              Comment


              • #37
                Thanks a lot for your replies. I had used the -S option funnily though when I used --sam option it did work out fine. Thanks a lot for your feedback I really appreciate it.

                Comment


                • #38
                  Dear all,

                  I encounter the same error message ('SAM line does not contain at least 11 tab-delimited fields.') as mentioned before when running htseq-count with this command:
                  Code:
                  htseq-count merged_align.bam UCSC_annotation_hg19.gtf > htseq_test.txt
                  However, when I inspect the provided bam file via samtools view it looks like this:

                  HWI-ST301L:234:C0GECACXX:3:1102:2230:56147 355 chr1 11164 1 101M = 11274 211 GACCGCCCCTTGCTTGCAGCCGGGCACTACAGGACCCGCTTGCTCACGGTGCTGTGCCAGGGCGCCCCCTGCTGGCGACTAGGGCAACTGCAGGGCTCTCT * AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:101 YT:Z:UU NH:i:3 CC:Z:chr15 CP:i:102519906 HI:i:0

                  The alignments have been done via Tophat2 + bowtie2 (most recent releases).
                  Does anyone have an idea why this error occurs?

                  Best regards

                  Comment


                  • #39
                    Dear Seqanswer community, dear Simon,

                    I'm dealing with stickleback RNA-Seq data. Mapping was done with Tophat2. The ref.fasta and the anno.gtf (both from ensembl) were included in the tophat options.
                    ./tophat -p 8 -G stickle_gen_anno.gtf -o 193_ctrl24_CT_Shared_thout stickle_genome 193_ctrl24_CT_1_Shared.fastq 193_ctrl24_CT_2_Shared.fastq

                    I used samtools for sorting and converting the accepted_hits.bam (from tophat) to sam format and did the HTseq-count analysis with the anno.gtf I already used in tophat.
                    Coming to my recent problem. Is it normal that so many genes have no feature? So more than genes that are counted?
                    Here as an example (with one line added to see haw many genes are counted in total:

                    count sum 1266408
                    no_feature 2005250
                    ambiguous 9
                    too_low_aQual 0
                    not_aligned 0
                    alignment_not_unique 862188

                    So first I checked if the chromosomes displayed in the warning messages of htseq_count
                    e.g. Warning: Skipping read 'H108:2920J82ACXX:1:1101:3626:48576', because chromosome 'scaffold_893', to which it has been aligned, did not appear in the GFF file.
                    are not in the anno.gtf file. And indeed they are not. So they have to be just in the ref.fasta file.
                    But isn't there a possibility to count these genes? If not the loss of information is really big!?

                    Many thanks for an answer!

                    Cheers Alex

                    Comment


                    • #40
                      Damn! I just saw that default option in htseq-count is strand-specific...and as I used the RNA TruSeq V2 PE Library prep. I can be as good as certain that it is unstranded... so did HTseq count with the --stranded=no option and get less no_feature counts. Here is the result:

                      count sum 2464902
                      no_feature 728598
                      ambiguous 78167
                      too_low_aQual 0
                      not_aligned 0
                      alignment_not_unique 862188

                      Before (without --stranded=no)

                      see above

                      So the result looks better. And I thought about my question, if it is not possible to get less no feature genes, but I think when the gene is not yet annotated then there is no Gene ID and it wont make sense to include genes without an ID. So I have to live with that...
                      Can you confirm my thoughts?

                      Cheers Alex
                      Last edited by Alex.Manuel; 12-11-2012, 09:19 AM.

                      Comment


                      • #41
                        Yes, I think you already found your problem. You have reads mapping to contigs in your reference file, but these contigs must not have any genes annotated on them and therefore are not present in your .gff file.

                        You could check this by removing these contigs from your reference file and running your whole pipeline again, or just using an awk script to remove all the SAM lines with alignments to one of these contigs, then run HT-Seq again. That should get rid of a big proportion of the "no_feature" counts.

                        If you want to get useful information from these reads you're going to have to annotate genes on these contigs, which you could do with something like cufflinks.

                        Good luck

                        Comment

                        Latest Articles

                        Collapse

                        • seqadmin
                          Essential Discoveries and Tools in Epitranscriptomics
                          by seqadmin




                          The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
                          04-22-2024, 07:01 AM
                        • seqadmin
                          Current Approaches to Protein Sequencing
                          by seqadmin


                          Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                          04-04-2024, 04:25 PM

                        ad_right_rmr

                        Collapse

                        News

                        Collapse

                        Topics Statistics Last Post
                        Started by seqadmin, Today, 08:47 AM
                        0 responses
                        12 views
                        0 likes
                        Last Post seqadmin  
                        Started by seqadmin, 04-11-2024, 12:08 PM
                        0 responses
                        60 views
                        0 likes
                        Last Post seqadmin  
                        Started by seqadmin, 04-10-2024, 10:19 PM
                        0 responses
                        59 views
                        0 likes
                        Last Post seqadmin  
                        Started by seqadmin, 04-10-2024, 09:21 AM
                        0 responses
                        54 views
                        0 likes
                        Last Post seqadmin  
                        Working...
                        X