SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
multiBamCov or htseq-count to count read per feature ? NicoBxl Bioinformatics 1 07-03-2012 02:05 AM
analysis of single and PE reads with htseq-count and DESeq dglemay Bioinformatics 1 04-26-2012 11:23 PM
htseq-count gets more reads? deepsea Bioinformatics 3 03-29-2012 11:27 AM
paired-end reads mapped to genome.. gene with only one direction of paired-end reads? danwiththeplan Bioinformatics 2 09-22-2011 02:06 AM
How to count Paired-End reads for RNA-Seq read counts epistatic RNA Sequencing 0 08-31-2011 02:44 PM

Reply
 
Thread Tools
Old 06-29-2012, 08:13 AM   #1
roloman
Junior Member
 
Location: Boston

Join Date: Feb 2012
Posts: 1
Post Using htseq-count with paired-end and orphaned reads

Dear all,

I am building a pipeline for 100bp paired-end sequencing and I have a question regarding how htseq-count deals with the output from TopHat and how to best map my reads to features prior to using DESeq.

For each sample I have 2 bam files: one corresponding to alignments of paired-end reads and another containing single reads that didn't have a mate pair. Originally, I merged my 2 bam files, sorted and converted into SAM and ran htseq-count to get gene counts, but I got an "expecting mate pair error". This made sense as single reads coming from merged files lacked mate pairs that htseq-count expected in sorted SAM files.

My questions is: Is running htseq-count separately on these two files and them combining their outputs for each sample biasing my analysis, or does it render the same results as if htseq allowed runs of files containing both paired-end and single reads?

I apologize for the long explanation, and thanks for any help you can give me.
roloman is offline   Reply With Quote
Old 09-16-2013, 08:09 AM   #2
sheltonj
Junior Member
 
Location: usa

Join Date: May 2012
Posts: 3
Default What the difference between the library type single and paired, DeSeq

I am preparing read counts for DeSeq. I have no reference genome for these transcriptomes. I am writing a script to extract counts from sam files produced by bowtie2 https://github.com/i5K-KINBRE-script...anscriptome.pl.

It seems to me that Deseq assumes each pair counts as one alignment for "paired" data. Bowtie2 and many other packages attempt aligning single reads when the pair fails. I also have reads that have no pairs (one set failed my base quality filters).

It seems like addition, useful information may be gained using the pairs with a quality issue in only one end or an unexpected insert length. It also seems that feeding the pairing information to Bowtie2 could increase the quality of alignment for some ambiguously mapping reads. It also seems to me DeSeq makes no use of pairs other than to normalize between samples. Does this sound correct to you? If that is the case why label samples "paired" or "single."

My main questions are:

1) If I use "paired" in DeSeq should I force Bowtie2 to only report pairs where both align as a pair and divide the counts by 2?

2) If I include my single end alignments, should I count each single read as +1 and each pair as +1? Should I use "paired" or "single" in DeSeq and why does this setting matter?

Thanks
sheltonj is offline   Reply With Quote
Old 09-16-2013, 09:17 AM   #3
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

The "paired" or "single" in the DESeq vignette is just an example of (1) how to deal with multiple factors and (2) that single-end and paired-end libraries can produce slightly different results (likely due to mappability). If you just have one type of library, then there's no reason to include that in your model. As you suspected, both read pairs and single reads count as 1, since they each represent a single original molecule.

Perhaps one of the DESeq authors has a different opinion, but I would only really worry about including orphaned reads if the rate of them was very different between samples (you'd see if this is causing an issue in a principal components graph or by doing clustering). This would likely indicate a general difference in overall quality between the two

BTW, your perl script could be replaced by htseq-count, where you make an annotation file wherein each contig is a feature (this has the benefit of not counting multimappers and allowing filtering by MAPQ).
dpryan is offline   Reply With Quote
Old 09-16-2013, 10:17 AM   #4
sheltonj
Junior Member
 
Location: usa

Join Date: May 2012
Posts: 3
Default DeSeq paired-end and orphaned reads

That you for your reply. I have no reference genome. I am assuming you mean that I create a gff where each feature is a contig. If this is correct, what software do you use to create the gff from contigs.

I am adding filtering steps to my script currently:
1) For Bowtie2 column 5 is map quality.

2) Column 7 indicates whether the read has a mate and which contig the mate aligns to. Keeping reads aligning with a value other than "=" or "*" in column 7 would remove pairs with reads that align to different contigs.

3) The optional field "XS:i:<N>" only appears for reads that aligns ambiguously. (Would you recommend filtering a pair if one read aligns ambiguously or only if both align ambiguously?)

4) I could also filter by by problematic pairing using the optional field "YT:Z:<S>" where a value of UU indicates the read was not part of a pair. Value of CP indicates the read was part of a pair and the pair aligned concordantly. Value of DP indicates the read was part of a pair and the pair aligned discordantly. Value of UP indicates the read was part of a pair but the pair failed to aligned either concordantly or discordantly.

Any thoughts on the filters listed above.

Thanks,
Jennifer

Last edited by sheltonj; 09-16-2013 at 11:12 AM.
sheltonj is offline   Reply With Quote
Old 09-16-2013, 10:34 AM   #5
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

Yeah, you'd need to make the GFF or GTF file. A GTF file would be easy to make from the header of one of your SAM/BAM files (since you just need the contig names and lengths). It sounds like the additions to your script should take care of most of the issues )

Regarding (2), that will also filter any read with a secondary alignment (XS score) equivalent to that of the primary alignment and where the primary alignment contains mismatches. Regardless, you'll certainly want to get rid of those!

Regarding (3), the presence of an XS score actually just means that there's a secondary alignment with a score of at least that specified by --score-min. If AS=0 and XS=-60, I wouldn't call the alignment ambiguous. This is generally why it's better to just filter at a MAPQ (say 10 or 20).

Regarding (4), I don't recall off-hand what bowtie2's metric is for calling something unique or not, so I can't advise on the utility of this.

Best of luck,
Devon
dpryan is offline   Reply With Quote
Old 09-17-2013, 03:46 PM   #6
sheltonj
Junior Member
 
Location: usa

Join Date: May 2012
Posts: 3
Default

Hi thanks,

I drafted a perl script to count paired end reads and broken pairs that align to de novo transcriptomes. The script takes Bowtie2 "best match" sam files as input.

The script "count_reads_aligned_to_de_novo_transcriptome.pl" is briefly described at the bottom of this README here https://github.com/i5K-KINBRE-script...ster/README.md.

The script is available here https://github.com/i5K-KINBRE-script...anscriptome.pl

Comments, questions, or suggestions would be appreciated

-Jennifer
sheltonj is offline   Reply With Quote
Reply

Tags
htseq-count, paired-end, rnaseq, sequencing

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 02:18 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2019, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO