SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
problem with HTSeq PhD1990 Bioinformatics 6 01-20-2014 11:12 PM
RIPseq - am I doing this right? Buprestid RNA Sequencing 0 08-22-2013 02:11 PM
Problem with HTSeq-count anikng RNA Sequencing 3 08-16-2013 08:33 PM
HTseq count problem dietmar13 Bioinformatics 0 12-19-2012 02:34 AM
Problem using HTSeq pinki999 Bioinformatics 4 10-23-2012 09:40 PM

Reply
 
Thread Tools
Old 04-16-2014, 08:36 AM   #1
Rivalyn
Member
 
Location: Netherlands

Join Date: Apr 2014
Posts: 14
Default Problem with HTSeq in RIPseq pipeline

Hello all,

I am a second year PhD student who is completely new to NGS, bioinformatics, and Linux. I'm at an institution my boss describes as a "bioinformatics desert" and as such, I am really struggling with my analysis pipeline. In the interest of providing as much information as possible, I'll explain our experiment and the pipeline I plan to use before I get to my specific problem.

We know an RNA-binding protein is essential for control of translation in a certain biological process, and want to identify its mRNA targets. To this end, we transfected our cells with a biotagged version of our protein and performed streptavidin pulldowns on cells both with and without (neg ctrl) the biotagged protein. We then eluted for RNA and submitted our samples for MiSeq sequencing. We have 4 biological replicates.

I intend to analyze these sequences by first trimming the low quality base pairs (<30 Phred quality score), aligning the trimmed sequences with the genome using Tophat, then processing the resulting bam files so that differential expression can be determined with edgeR. Unfortunately, I broke my Ubuntu installation last week by changing the owner of the /usr directory (lesson learned), so I performed the quality trimming and Tophat alignment with Galaxy while I tried to fix Linux. (I ended up reinstalling Ubuntu.)

I realized later on that instead of filtering just the low quality base pairs like I intended, I had actually discarded all sequences which contain any base pairs with a Phred score lower than 30. I'm mentioning this because I think it might have to do with the problem I encountered with HTSeq.

To convert my bam files into read counts for edgeR, I sorted all the bam files by name with samtools sort, converted to sam with samtools view, then made read counts with HTSeq. These are the commands I used:

#sort the bam files by name
for f in *.bam; do samtools sort -n "$f" "${f%.*}".sorted; done
#convert bam to sam
for f in *sorted.bam; do samtools view "$f" > "${f%.*}".sam; done
#convert sam to read counts
for f in *.sam; do python -m HTSeq.scripts.count "$f" Mus_musculus.GRCm38.75.gtf > "${f%.*}".readcount.txt ; done

Unfortunately, the according to HTSeq, I have no read counts anywhere. The output looks like this:

ENSMUSG00000000001 0
ENSMUSG00000000003 0
ENSMUSG00000000028 0
ENSMUSG00000000031 0
ENSMUSG00000000037 0
(etc)
__no_feature 1375882
__ambiguous 0
__too_low_aQual 0
__not_aligned 0
__alignment_not_unique 1018679

The same is true of all of my biological replicates, including the negative controls. Is this because I filtered out too many reads? Or am I doing something else wrong?
Rivalyn is offline   Reply With Quote
Old 04-16-2014, 09:41 AM   #2
blancha
Senior Member
 
Location: Montreal

Join Date: May 2013
Posts: 367
Default

Removing reads with any base pairs with a Phred score lower than 30 could definitely be a problem. There are often low quality reads at the end of the reads.

I would redo the trimming. You can also examine the quality of the reads before and after trimming with FASTQC.

You could also examine the BAM files in IGV. You could thus determine if the problem is with htseq-count or further upstream.
blancha is offline   Reply With Quote
Old 04-16-2014, 02:03 PM   #3
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

Quote:
Originally Posted by Rivalyn View Post
I intend to analyze these sequences by first trimming the low quality base pairs (<30 Phred quality score),
This is crazy conservative, try just trimming off adapters and low quality (Phred scores <=5) from the ends of reads. You can do that with trimmomatic or trim_galore or a large number of other programs.

Quote:
#sort the bam files by name
for f in *.bam; do samtools sort -n "$f" "${f%.*}".sorted; done
#convert bam to sam
for f in *sorted.bam; do samtools view "$f" > "${f%.*}".sam; done
#convert sam to read counts
for f in *.sam; do python -m HTSeq.scripts.count "$f" Mus_musculus.GRCm38.75.gtf > "${f%.*}".readcount.txt ; done
You can actually just:

Code:
for f in *.bam
do
    htseq-count -f bam -r pos "$f" Mus_musculus.GRCm38.75.gtf > "${f%.*}".readcount.txt
done
at least with the newer versions. BTW, remember that the defaults assume a stranded library, which you didn't mention having (so you might also try with "-s no")

Quote:
Unfortunately, the according to HTSeq, I have no read counts anywhere. The output looks like this:
Did you align to mm10 from UCSC? This sort of thing normally happens when there's a chromosome name mismatch between your reference genome (the fasta file) and the annotation file you use (the GTF from Ensembl in your case). If you aligned to mm10 then the chromosome names in your alignments will be of the form: "chr1", "chr10", "chr11", ... . In the annotation from Ensembl, however, the same chromosomes would be: "1", "10", "11", ... . This is a common cause of problems and it's best to not mix the two (stick to Ensembl, the annotations are MUCH better). Just post the output of "samtools view -H blah.bam" on one of your alignments and we can confirm if this is the case. If that is the case then simply reheadering the alignments will solve the problem (I can show you how should the need arise).
dpryan is offline   Reply With Quote
Old 04-17-2014, 03:13 AM   #4
Rivalyn
Member
 
Location: Netherlands

Join Date: Apr 2014
Posts: 14
Default

Thank you both for your prompt responses.

Code:
samtools view -H BirA_rep1_Galaxy.bam | head

@HD	VN:1.0	SO:coordinate
@SQ	SN:chr1	LN:195471971
@SQ	SN:chr10	LN:130694993
@SQ	SN:chr11	LN:122082543
@SQ	SN:chr12	LN:120129022
@SQ	SN:chr13	LN:120421639
@SQ	SN:chr14	LN:124902244
@SQ	SN:chr15	LN:104043685
@SQ	SN:chr16	LN:98207768
@SQ	SN:chr17	LN:94987271
Looks like dpryan was right about the GTF files. I ran the initial alignment on Galaxy and used their built-in reference, which was mm10. I'm told that one should avoid Galaxy and do as much as possible via the console, but since my Linux installation was borked at the time, I thought it would be better than nothing.

Because the QC trimming needs to be redone, I will rerun tophat rather than simply reheadering. However, I'm curious as to how you would go about reheadering should the need arise, so if you can explain how to do that, I would be grateful.

Quote:
BTW, remember that the defaults assume a stranded library, which you didn't mention having (so you might also try with "-s no")
Forgive me, but what is a "stranded library"?

Quote:
You could also examine the BAM files in IGV. You could thus determine if the problem is with htseq-count or further upstream.
IGV requires Java, and Java has been... difficult... ever since we got this computer. This is also why I haven't used Trimmomatic, but tried to do the QC trimming with the Fastx toolkit. Fastx has been a disaster, so I've put a significant amount of time into fixing (or trying to) my Java installation. I think I succeeded, but I won't know for sure until I install Trimmomatic and make sense of its documentation (which I will start on right after writing this).
Rivalyn is offline   Reply With Quote
Old 04-17-2014, 03:25 AM   #5
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

Which distro have you been using that java has given such issues?

Regarding stranded libraries, perhaps you know them as "directional" or "strand-specific". In short, the sequencing libraries are created in was such that the strand of origin ("+" or "-") of each sequenced fragment is preserved. Some sequencing facilities will use such a protocol be default but others (ours is one example) won't unless you specifically ask for it. There are also 2 basic types of strand-specific libraries, the first where the orientation of read1's alignment denotes the strand and the second where the opposite is true...so if you used the former but specified the latter then you'd get a lot of aberrant 0 counts.

For reheadering the file, the easiest route would be to simply:
Code:
samtools view -H BirA_rep1_Galaxy.bam > foo.header
Then edit "foo.header" to remove the "chr" prefix and rename "M" to "MT". That's normally sufficient. You can then simply:

Code:
samtools reheader foo.header BirA_rep1_Galaxy.bam > BirA_rep1_Galaxy.Ensembl.bam
dpryan is offline   Reply With Quote
Old 04-17-2014, 03:26 AM   #6
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

I should note that you should never change the order of chromosomes when reheadering...that will completely bork the alignments (you'll end up changing what chromosome everything aligns to!).
dpryan is offline   Reply With Quote
Old 04-17-2014, 06:15 AM   #7
Rivalyn
Member
 
Location: Netherlands

Join Date: Apr 2014
Posts: 14
Default

Thanks for explaining reheadering. I'll archive that for future reference.

I am reasonably certain that my reads are strand-specific. Is there any way to check that easily? How do I tell what kind of strand-specific library I have?

Quote:
Which distro have you been using that java has given such issues?
I'm using Ubuntu 13.10. I don't want to get too off topic here, but we had a real mess trying to dual boot Ubuntu and Windows earlier this month. I've had a long standing problem with making files executable. chmod 755, chmod +x etc all run without errors but don't actually have any impact on the executable status of the file in question. After a lot of googling, it looks like somehow the partition that contains the Ubuntu system files was mounted as NTFS instead of... whatever the Linux file system is called. IF this is really the problem, then I'm surprised Ubuntu runs at all.

No idea if this is also causing the issues I'm having with java, but it's my best guess at the moment. I've had a lot of things behave oddly ever since we set up our dual-boot.

I gather that the solution is normally to unmount and remount the afflicted partition as the right kind of file system with some arcane Linux console wizardry. But somehow I don't think it's a good idea to unmount the partition that contains all the Linux system files. That seems like it would be... bad.

Last edited by Rivalyn; 04-17-2014 at 06:21 AM.
Rivalyn is offline   Reply With Quote
Old 04-17-2014, 06:22 AM   #8
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

The easiest way is to simply too at things in IGV. If you color reads according to being read#1 or read#2 (there's an option for that) and look at a random gene or two then it should be immediately apparent what sort of library type you have.

Yeah, dual booting can screw things up nicely. The default fs is ext4 (at least on my ubuntu work station) and I too am surprised that it ran at all with everything being ntfs. You might ply a local IT person with copious amounts of booze to make the issues disappear...
dpryan is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 09:00 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2019, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO