Seqanswers Leaderboard Ad

**himanshu04** · 03-17-2012, 07:51 AM

Dear Simon,
I am using HT-Seq counts for summarizing the data of arabidopsis gene. When i use the command: 'htseq-count -m intersection-strict -s no aligned_074262 TAIR10_GTF.gtf > counts_074262'.
I get an error saying unable to read first line of the sam file. unable to determine 11 delimited tabs in the file. What can be the possible solution for this?
Thanks.

**Simon Anders** · 03-17-2012, 09:02 AM

I get an error saying unable to read first line of the sam file.

Well, how does the first line of your sam file look like?

**himanshu04** · 03-17-2012, 09:16 AM

htseq-count -m intersection-strict -s no aligned_074262 TAIR10_GTF.gtf > counts_074262
100000 GFF lines processed.
200000 GFF lines processed.
300000 GFF lines processed.
400000 GFF lines processed.
478324 GFF lines processed.
Error occured when reading first line of sam file.
Error: ('SAM line does not contain at least 11 tab-delimited fields.', 'line 1 of file aligned_074262')
[Exception type: ValueError, raised in _HTSeq.pyx:1220]

The file looks like
SRR074262.3 HWI-EAS121:2:1:14:1646 + Chr3 328884 AGGGAGGGAGGGAGTGAGATTGNTTCGATCGCCAAT BCCCBCCBACCCBCACBCBBC:!<B@A<?>@:>8AA 0 22:A>N
SRR074262.4 HWI-EAS121:2:1:15:1470 + Chr1 14296071 CTGGGTTTTTGTGTTATTGAGANTCTGAGTTTGAGA BBA>?96>@@>==?A?@A<<=4!.5;7268?@607= 0 22:G>N

**Camg** · 03-17-2012, 11:27 AM

I think your "SAM" file has a bit of a strange format. What alignment program did you use to create it?
Take a look at the example of my SAM file a few lines back and you can see some differences right away.

**himanshu04** · 03-17-2012, 11:43 AM

I created it using Bowtie . I am actually following the same method as described in the seqwiki page. What do you think I should be doing?

**Camg** · 03-17-2012, 11:48 AM

Did you use the -S option when running Bowtie? That's what will give you the standard SAM output file. Otherwise I think it gives you the SAM-like file that you currently have. Try running your Bowtie command again with -S and then use that file with HTseq.

**himanshu04** · 03-17-2012, 12:41 PM

Thanks a lot for your replies. I had used the -S option funnily though when I used --sam option it did work out fine. Thanks a lot for your feedback I really appreciate it.

**rboettcher** · 06-21-2012, 05:41 AM

Dear all,

I encounter the same error message ('SAM line does not contain at least 11 tab-delimited fields.') as mentioned before when running htseq-count with this command:

Code:

htseq-count merged_align.bam UCSC_annotation_hg19.gtf > htseq_test.txt

However, when I inspect the provided bam file via samtools view it looks like this:

HWI-ST301L:234:C0GECACXX:3:1102:2230:56147 355 chr1 11164 1 101M = 11274 211 GACCGCCCCTTGCTTGCAGCCGGGCACTACAGGACCCGCTTGCTCACGGTGCTGTGCCAGGGCGCCCCCTGCTGGCGACTAGGGCAACTGCAGGGCTCTCT * AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:101 YT:Z:UU NH:i:3 CC:Z:chr15 CP:i:102519906 HI:i:0

The alignments have been done via Tophat2 + bowtie2 (most recent releases).
Does anyone have an idea why this error occurs?

Best regards

**Alex.Manuel** · 12-11-2012, 07:27 AM

Dear Seqanswer community, dear Simon,

I'm dealing with stickleback RNA-Seq data. Mapping was done with Tophat2. The ref.fasta and the anno.gtf (both from ensembl) were included in the tophat options.
./tophat -p 8 -G stickle_gen_anno.gtf -o 193_ctrl24_CT_Shared_thout stickle_genome 193_ctrl24_CT_1_Shared.fastq 193_ctrl24_CT_2_Shared.fastq

I used samtools for sorting and converting the accepted_hits.bam (from tophat) to sam format and did the HTseq-count analysis with the anno.gtf I already used in tophat.
Coming to my recent problem. Is it normal that so many genes have no feature? So more than genes that are counted?
Here as an example (with one line added to see haw many genes are counted in total:

count sum 1266408
no_feature 2005250
ambiguous 9
too_low_aQual 0
not_aligned 0
alignment_not_unique 862188

So first I checked if the chromosomes displayed in the warning messages of htseq_count
e.g. Warning: Skipping read 'H108:292

0J82ACXX:1:1101:3626:48576', because chromosome 'scaffold_893', to which it has been aligned, did not appear in the GFF file.
are not in the anno.gtf file. And indeed they are not. So they have to be just in the ref.fasta file.
But isn't there a possibility to count these genes? If not the loss of information is really big!?

Many thanks for an answer!

Cheers Alex

**Alex.Manuel** · 12-11-2012, 09:16 AM

Damn! I just saw that default option in htseq-count is strand-specific...and as I used the RNA TruSeq V2 PE Library prep. I can be as good as certain that it is unstranded... so did HTseq count with the --stranded=no option and get less no_feature counts. Here is the result:

count sum 2464902
no_feature 728598
ambiguous 78167
too_low_aQual 0
not_aligned 0
alignment_not_unique 862188

Before (without --stranded=no)

see above

So the result looks better. And I thought about my question, if it is not possible to get less no feature genes, but I think when the gene is not yet annotated then there is no Gene ID and it wont make sense to include genes without an ID. So I have to live with that...
Can you confirm my thoughts?

Cheers Alex

**Camg** · 12-11-2012, 12:43 PM

Yes, I think you already found your problem. You have reads mapping to contigs in your reference file, but these contigs must not have any genes annotated on them and therefore are not present in your .gff file.

You could check this by removing these contigs from your reference file and running your whole pipeline again, or just using an awk script to remove all the SAM lines with alignments to one of these contigs, then run HT-Seq again. That should get rid of a big proportion of the "no_feature" counts.

If you want to get useful information from these reads you're going to have to annotate genes on these contigs, which you could do with something like cufflinks.

Good luck

Topics	Statistics	Last Post
Genetic Variants and Diabetes Risk in Childhood Cancer Survivors by seqadmin Started by seqadmin, Today, 08:47 AM	0 responses 12 views 0 likes	Last Post by seqadmin Today, 08:47 AM
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 60 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 59 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 54 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM

Seqanswers Leaderboard Ad

Announcement

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News