**chadn737** · 01-26-2012, 07:06 AM

htseq-count

http://www-huber.embl.de/users/anders/HTSeq/doc/count.html

**shuang** · 01-26-2012, 09:30 AM

It was also suggested that I use Bowtie. Bowtie basically outputs an alignment report. How do I then do the counting/statistics part?

**chadn737** · 01-26-2012, 09:34 AM

The program I mentioned above, htseq-count, takes a SAM alignment file and a GTF/GFF file describing your features, and gives you the number of reads aligning to each feature. These counts are then suitable as input for statistical packages such as DESeq and edgeR.
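To make the idea concrete, here is a minimal pure-Python sketch of per-feature counting from SAM records and GTF exon rows. It is not htseq-count's implementation: the function names `parse_gtf_genes` and `count_reads` and the bookkeeping labels are made up for illustration, each read is treated as an ungapped block of `len(SEQ)` bases (CIGAR and strand are ignored), and a read overlapping more than one gene is simply called ambiguous.

```python
from collections import Counter


def parse_gtf_genes(gtf_lines):
    """Collect (chrom, start, end, gene_id) tuples from the exon rows of a GTF."""
    features = []
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "exon":
            continue
        gene_id = None
        # The attribute column looks like: gene_id "G1"; transcript_id "T1";
        for attr in cols[8].split(";"):
            attr = attr.strip()
            if attr.startswith("gene_id "):
                gene_id = attr.split(" ", 1)[1].strip('"')
        if gene_id is not None:
            features.append((cols[0], int(cols[3]), int(cols[4]), gene_id))
    return features


def count_reads(sam_lines, features):
    """Count reads per gene; labels such as 'no_feature' are illustrative only."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):  # skip SAM header lines
            continue
        cols = line.rstrip("\n").split("\t")
        flag, chrom, pos, seq = int(cols[1]), cols[2], int(cols[3]), cols[9]
        if flag & 4:  # SAM flag bit 0x4 marks an unmapped read
            counts["not_aligned"] += 1
            continue
        # Simplification: assume an ungapped alignment spanning len(seq) bases.
        start, end = pos, pos + len(seq) - 1
        hits = {g for c, s, e, g in features
                if c == chrom and s <= end and start <= e}
        if len(hits) == 1:
            counts[hits.pop()] += 1
        elif not hits:
            counts["no_feature"] += 1
        else:
            counts["ambiguous"] += 1
    return counts
```

Called as `count_reads(open("aln.sam"), parse_gtf_genes(open("genes.gtf")))` (both file names are placeholders), this returns a table of gene IDs and read counts of the kind DESeq or edgeR expect as input. For real data the actual tool is the better choice, since it handles spliced reads, strandedness and mapping quality.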

**arvid** · 01-26-2012, 11:56 PM

I like to use RSEM for mapping (uses Bowtie) and isoform expression estimations, followed by DESeq for the differential expression statistics.

If you just want raw read counts, map your reads (take some time to find the right software and options here, as this depends on the sequencing technology, sample and reference) and extract the counts from the BAM file with "samtools idxstats".

**steven** · 01-27-2012, 01:25 AM

"samtools idxstats file.bam" seems to compute the number of reads per reference sequence, typically chromosomes. To get the number of reads per gene, htseq-count is indeed a valid option. I think BEDtools is another one.
BTW, I am curious to know whether one is much faster than the other.
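For reference, "samtools idxstats" prints one tab-separated line per reference sequence: name, sequence length, number of mapped reads, number of unmapped reads, with a final "*" line for reads that have no reference. A small sketch for reading that output in Python follows; the function name `parse_idxstats` is made up for illustration. If the reference is a set of coding sequences rather than chromosomes, these per-sequence counts are effectively per-gene counts.

```python
def parse_idxstats(text):
    """Parse the output of 'samtools idxstats' into a dict keyed by reference name."""
    stats = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        # Columns: reference name, sequence length, mapped reads, unmapped reads
        name, length, mapped, unmapped = line.split("\t")
        stats[name] = {
            "length": int(length),
            "mapped": int(mapped),
            "unmapped": int(unmapped),
        }
    return stats


def mapped_counts(stats):
    """Return {reference: mapped read count}, skipping the '*' summary row."""
    return {name: row["mapped"] for name, row in stats.items() if name != "*"}
```

The text can be captured with something like `subprocess.run(["samtools", "idxstats", "file.bam"], capture_output=True, text=True).stdout`, assuming the BAM file is sorted and indexed.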

**mgogol** · 01-27-2012, 07:22 AM

BEDtools coverageBed, given a BED file of genes, can be used with a BAM file from Bowtie...

**elemenTY** · 01-27-2012, 09:27 AM

Besides Simon's Python-based HTSeq, if you have experience with R and Bioconductor, the summarizeOverlaps function in GenomicRanges is easy to use too. It follows the same pattern defined in HTSeq.

http://www.bioconductor.org/packages/2.10/bioc/vignettes/GenomicRanges/inst/doc/summarizeOverlaps.pdf

countByOverlaps could also do the trick if you can get your data and features into the right form.

**shuang** · 01-27-2012, 12:39 PM

I used bowtie for alignment and samtools idxstats for counting. It works!

However, Bowtie only allows me to set an alignment constraint as a number of mismatches. Can I set a constraint by either percent identity or P-value, with Bowtie or other tools?

**shuang** · 01-27-2012, 02:05 PM

My reads are all about 100 bp. I want to count any alignments that are at least 90% identical. I notice that Bowtie only allows a maximum of 3 mismatches. How do I increase the number of allowed mismatches to 10?

**arvid** · 01-30-2012, 12:07 AM

With Bowtie, the -n option (0-3) applies to the seed only (usually the first 28 bases of the read). If you increase -e (maximum sum of mismatch qualities), more mismatches are allowed across the whole alignment. You could also use the -v option (report end-to-end hits with <= v mismatches; ignore qualities) instead of -e if you want to allow a specific number of mismatches.

The settings in Bowtie2 (currently in beta5) are simplified and might suit your purposes better...

**shuang** · 01-30-2012, 01:48 PM

The average read length of my RNA-seq data is 83. The reference sequences are coding sequences, including genomic, chloroplast, and mitochondrial ones, from the same species and strain.

Ideally, I want to set a threshold of about 90% identity for finding matches. I set the parameters to -n 2 -l 15 -e 10.

However, only about 30% of the reads aligned, while I expected almost 100%. What did I do wrong?

**arvid** · 01-30-2012, 11:42 PM

Not sure whether you'll see massive improvements, but you should set -e much higher (you set it lower than the default of 70). It is measured in mismatch qualities, not bases. Try something extreme like "-e 9999999" to see whether that gives you more alignments...
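To illustrate why "-e 10" is so restrictive, here is a rough sketch of the arithmetic behind -e. It sums the Phred qualities at mismatched positions and compares the total against the limit. By default Bowtie, like Maq, rounds qualities to the nearest 10 and caps them at 30; the exact tie-breaking in the rounding below is my assumption, and the helper names are made up for illustration.

```python
def maq_rounded(q):
    """Round a Phred quality to the nearest 10 and cap it at 30 (approximation)."""
    return min(30, 10 * ((q + 5) // 10))


def mismatch_quality_sum(read, ref, quals, maq_round=True):
    """Sum the (optionally rounded) qualities at positions where read != ref."""
    total = 0
    for base, ref_base, q in zip(read, ref, quals):
        if base != ref_base:
            total += maq_rounded(q) if maq_round else q
    return total


def passes_maqerr(read, ref, quals, e=70):
    """True if the mismatch quality total stays within the -e style limit."""
    return mismatch_quality_sum(read, ref, quals) <= e
```

With this model a single mismatch at a high-quality base (rounded quality 30) already exceeds a limit of 10, so "-e 10" rejects almost any read with a confident mismatch, while the default of 70 tolerates two such mismatches but not three.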

**shuang** · 01-31-2012, 07:50 AM

Thank you. This works!

I have one more question about the alignment. How do I set a threshold on the P-value or on the minimum length of alignments? Basically, I don't want alignments that are too short, such as shorter than 50 bp.
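One possible approach, sketched here rather than taken from any reply in the thread, is to filter the SAM output after alignment. Bowtie 1 reports ungapped end-to-end alignments, so the aligned length there equals the read length; for aligners that clip or gap reads, the aligned length can be read from the CIGAR string. The sketch below assumes each record carries an NM:i edit-distance tag, uses a crude identity of (aligned bases - NM) / aligned bases, and all function names and default thresholds are illustrative.

```python
import re

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def aligned_read_bases(cigar):
    """Number of read bases aligned to the reference (CIGAR ops M, = and X)."""
    return sum(int(n) for n, op in CIGAR_RE.findall(cigar) if op in "M=X")


def keep_alignment(sam_line, min_len=50, min_identity=0.90):
    """Keep a SAM record only if it is long enough and similar enough."""
    cols = sam_line.rstrip("\n").split("\t")
    if int(cols[1]) & 4:  # unmapped read
        return False
    aligned = aligned_read_bases(cols[5])
    # NM:i is the edit distance to the reference; assumed to be present.
    nm = next((int(t[5:]) for t in cols[11:] if t.startswith("NM:i:")), None)
    if nm is None or aligned == 0 or aligned < min_len:
        return False
    identity = (aligned - nm) / aligned
    return identity >= min_identity
```

For an 83 bp read this keeps alignments with up to 8 mismatches (75/83 is about 90.4%) and rejects those with 9 or more.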


counting RNA-seq matches
