Hello all,
I am a second year PhD student who is completely new to NGS, bioinformatics, and Linux. I'm at an institution my boss describes as a "bioinformatics desert" and as such, I am really struggling with my analysis pipeline. In the interest of providing as much information as possible, I'll explain our experiment and the pipeline I plan to use before I get to my specific problem.
We know an RNA-binding protein is essential for control of translation in a certain biological process, and want to identify its mRNA targets. To this end, we transfected our cells with a biotagged version of our protein and performed streptavidin pulldowns on cells both with and without (neg ctrl) the biotagged protein. We then eluted for RNA and submitted our samples for MiSeq sequencing. We have 4 biological replicates.
I intend to analyze these sequences by first trimming the low quality base pairs (<30 Phred quality score), aligning the trimmed sequences with the genome using Tophat, then processing the resulting bam files so that differential expression can be determined with edgeR. Unfortunately, I broke my Ubuntu installation last week by changing the owner of the /usr directory (lesson learned), so I performed the quality trimming and Tophat alignment with Galaxy while I tried to fix Linux. (I ended up reinstalling Ubuntu.)
I realized later on that instead of filtering just the low quality base pairs like I intended, I had actually discarded all sequences which contain any base pairs with a Phred score lower than 30. I'm mentioning this because I think it might have to do with the problem I encountered with HTSeq.
To convert my bam files into read counts for edgeR, I sorted all the bam files by name with samtools sort, converted to sam with samtools view, then made read counts with HTSeq. These are the commands I used:
#sort the bam files by name
for f in *.bam; do samtools sort -n "$f" "${f%.*}".sorted; done
#convert bam to sam
for f in *sorted.bam; do samtools view "$f" > "${f%.*}".sam; done
#convert sam to read counts
for f in *.sam; do python -m HTSeq.scripts.count "$f" Mus_musculus.GRCm38.75.gtf > "${f%.*}".readcount.txt ; done
Unfortunately, the according to HTSeq, I have no read counts anywhere. The output looks like this:
ENSMUSG00000000001 0
ENSMUSG00000000003 0
ENSMUSG00000000028 0
ENSMUSG00000000031 0
ENSMUSG00000000037 0
(etc)
__no_feature 1375882
__ambiguous 0
__too_low_aQual 0
__not_aligned 0
__alignment_not_unique 1018679
The same is true of all of my biological replicates, including the negative controls. Is this because I filtered out too many reads? Or am I doing something else wrong?
I am a second year PhD student who is completely new to NGS, bioinformatics, and Linux. I'm at an institution my boss describes as a "bioinformatics desert" and as such, I am really struggling with my analysis pipeline. In the interest of providing as much information as possible, I'll explain our experiment and the pipeline I plan to use before I get to my specific problem.
We know an RNA-binding protein is essential for control of translation in a certain biological process, and want to identify its mRNA targets. To this end, we transfected our cells with a biotagged version of our protein and performed streptavidin pulldowns on cells both with and without (neg ctrl) the biotagged protein. We then eluted for RNA and submitted our samples for MiSeq sequencing. We have 4 biological replicates.
I intend to analyze these sequences by first trimming the low quality base pairs (<30 Phred quality score), aligning the trimmed sequences with the genome using Tophat, then processing the resulting bam files so that differential expression can be determined with edgeR. Unfortunately, I broke my Ubuntu installation last week by changing the owner of the /usr directory (lesson learned), so I performed the quality trimming and Tophat alignment with Galaxy while I tried to fix Linux. (I ended up reinstalling Ubuntu.)
I realized later on that instead of filtering just the low quality base pairs like I intended, I had actually discarded all sequences which contain any base pairs with a Phred score lower than 30. I'm mentioning this because I think it might have to do with the problem I encountered with HTSeq.
To convert my bam files into read counts for edgeR, I sorted all the bam files by name with samtools sort, converted to sam with samtools view, then made read counts with HTSeq. These are the commands I used:
#sort the bam files by name
for f in *.bam; do samtools sort -n "$f" "${f%.*}".sorted; done
#convert bam to sam
for f in *sorted.bam; do samtools view "$f" > "${f%.*}".sam; done
#convert sam to read counts
for f in *.sam; do python -m HTSeq.scripts.count "$f" Mus_musculus.GRCm38.75.gtf > "${f%.*}".readcount.txt ; done
Unfortunately, the according to HTSeq, I have no read counts anywhere. The output looks like this:
ENSMUSG00000000001 0
ENSMUSG00000000003 0
ENSMUSG00000000028 0
ENSMUSG00000000031 0
ENSMUSG00000000037 0
(etc)
__no_feature 1375882
__ambiguous 0
__too_low_aQual 0
__not_aligned 0
__alignment_not_unique 1018679
The same is true of all of my biological replicates, including the negative controls. Is this because I filtered out too many reads? Or am I doing something else wrong?
Comment