Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Problem with HTSeq in RIPseq pipeline

    Hello all,

    I am a second year PhD student who is completely new to NGS, bioinformatics, and Linux. I'm at an institution my boss describes as a "bioinformatics desert" and as such, I am really struggling with my analysis pipeline. In the interest of providing as much information as possible, I'll explain our experiment and the pipeline I plan to use before I get to my specific problem.

    We know an RNA-binding protein is essential for control of translation in a certain biological process, and want to identify its mRNA targets. To this end, we transfected our cells with a biotagged version of our protein and performed streptavidin pulldowns on cells both with and without (neg ctrl) the biotagged protein. We then eluted for RNA and submitted our samples for MiSeq sequencing. We have 4 biological replicates.

    I intend to analyze these sequences by first trimming the low quality base pairs (<30 Phred quality score), aligning the trimmed sequences with the genome using Tophat, then processing the resulting bam files so that differential expression can be determined with edgeR. Unfortunately, I broke my Ubuntu installation last week by changing the owner of the /usr directory (lesson learned), so I performed the quality trimming and Tophat alignment with Galaxy while I tried to fix Linux. (I ended up reinstalling Ubuntu.)

    I realized later on that instead of filtering just the low quality base pairs like I intended, I had actually discarded all sequences which contain any base pairs with a Phred score lower than 30. I'm mentioning this because I think it might have to do with the problem I encountered with HTSeq.

    To convert my bam files into read counts for edgeR, I sorted all the bam files by name with samtools sort, converted to sam with samtools view, then made read counts with HTSeq. These are the commands I used:

    #sort the bam files by name
    for f in *.bam; do samtools sort -n "$f" "${f%.*}".sorted; done
    #convert bam to sam
    for f in *sorted.bam; do samtools view "$f" > "${f%.*}".sam; done
    #convert sam to read counts
    for f in *.sam; do python -m HTSeq.scripts.count "$f" Mus_musculus.GRCm38.75.gtf > "${f%.*}".readcount.txt ; done

    Unfortunately, the according to HTSeq, I have no read counts anywhere. The output looks like this:

    ENSMUSG00000000001 0
    ENSMUSG00000000003 0
    ENSMUSG00000000028 0
    ENSMUSG00000000031 0
    ENSMUSG00000000037 0
    (etc)
    __no_feature 1375882
    __ambiguous 0
    __too_low_aQual 0
    __not_aligned 0
    __alignment_not_unique 1018679

    The same is true of all of my biological replicates, including the negative controls. Is this because I filtered out too many reads? Or am I doing something else wrong?

  • #2
    Removing reads with any base pairs with a Phred score lower than 30 could definitely be a problem. There are often low quality reads at the end of the reads.

    I would redo the trimming. You can also examine the quality of the reads before and after trimming with FASTQC.

    You could also examine the BAM files in IGV. You could thus determine if the problem is with htseq-count or further upstream.

    Comment


    • #3
      Originally posted by Rivalyn View Post
      I intend to analyze these sequences by first trimming the low quality base pairs (<30 Phred quality score),
      This is crazy conservative, try just trimming off adapters and low quality (Phred scores <=5) from the ends of reads. You can do that with trimmomatic or trim_galore or a large number of other programs.

      #sort the bam files by name
      for f in *.bam; do samtools sort -n "$f" "${f%.*}".sorted; done
      #convert bam to sam
      for f in *sorted.bam; do samtools view "$f" > "${f%.*}".sam; done
      #convert sam to read counts
      for f in *.sam; do python -m HTSeq.scripts.count "$f" Mus_musculus.GRCm38.75.gtf > "${f%.*}".readcount.txt ; done
      You can actually just:

      Code:
      for f in *.bam
      do
          htseq-count -f bam -r pos "$f" Mus_musculus.GRCm38.75.gtf > "${f%.*}".readcount.txt
      done
      at least with the newer versions. BTW, remember that the defaults assume a stranded library, which you didn't mention having (so you might also try with "-s no")

      Unfortunately, the according to HTSeq, I have no read counts anywhere. The output looks like this:
      Did you align to mm10 from UCSC? This sort of thing normally happens when there's a chromosome name mismatch between your reference genome (the fasta file) and the annotation file you use (the GTF from Ensembl in your case). If you aligned to mm10 then the chromosome names in your alignments will be of the form: "chr1", "chr10", "chr11", ... . In the annotation from Ensembl, however, the same chromosomes would be: "1", "10", "11", ... . This is a common cause of problems and it's best to not mix the two (stick to Ensembl, the annotations are MUCH better). Just post the output of "samtools view -H blah.bam" on one of your alignments and we can confirm if this is the case. If that is the case then simply reheadering the alignments will solve the problem (I can show you how should the need arise).

      Comment


      • #4
        Thank you both for your prompt responses.

        Code:
        samtools view -H BirA_rep1_Galaxy.bam | head
        
        @HD	VN:1.0	SO:coordinate
        @SQ	SN:chr1	LN:195471971
        @SQ	SN:chr10	LN:130694993
        @SQ	SN:chr11	LN:122082543
        @SQ	SN:chr12	LN:120129022
        @SQ	SN:chr13	LN:120421639
        @SQ	SN:chr14	LN:124902244
        @SQ	SN:chr15	LN:104043685
        @SQ	SN:chr16	LN:98207768
        @SQ	SN:chr17	LN:94987271
        Looks like dpryan was right about the GTF files. I ran the initial alignment on Galaxy and used their built-in reference, which was mm10. I'm told that one should avoid Galaxy and do as much as possible via the console, but since my Linux installation was borked at the time, I thought it would be better than nothing.

        Because the QC trimming needs to be redone, I will rerun tophat rather than simply reheadering. However, I'm curious as to how you would go about reheadering should the need arise, so if you can explain how to do that, I would be grateful.

        BTW, remember that the defaults assume a stranded library, which you didn't mention having (so you might also try with "-s no")
        Forgive me, but what is a "stranded library"?

        You could also examine the BAM files in IGV. You could thus determine if the problem is with htseq-count or further upstream.
        IGV requires Java, and Java has been... difficult... ever since we got this computer. This is also why I haven't used Trimmomatic, but tried to do the QC trimming with the Fastx toolkit. Fastx has been a disaster, so I've put a significant amount of time into fixing (or trying to) my Java installation. I think I succeeded, but I won't know for sure until I install Trimmomatic and make sense of its documentation (which I will start on right after writing this).

        Comment


        • #5
          Which distro have you been using that java has given such issues?

          Regarding stranded libraries, perhaps you know them as "directional" or "strand-specific". In short, the sequencing libraries are created in was such that the strand of origin ("+" or "-") of each sequenced fragment is preserved. Some sequencing facilities will use such a protocol be default but others (ours is one example) won't unless you specifically ask for it. There are also 2 basic types of strand-specific libraries, the first where the orientation of read1's alignment denotes the strand and the second where the opposite is true...so if you used the former but specified the latter then you'd get a lot of aberrant 0 counts.

          For reheadering the file, the easiest route would be to simply:
          Code:
          samtools view -H BirA_rep1_Galaxy.bam > foo.header
          Then edit "foo.header" to remove the "chr" prefix and rename "M" to "MT". That's normally sufficient. You can then simply:

          Code:
          samtools reheader foo.header BirA_rep1_Galaxy.bam > BirA_rep1_Galaxy.Ensembl.bam

          Comment


          • #6
            I should note that you should never change the order of chromosomes when reheadering...that will completely bork the alignments (you'll end up changing what chromosome everything aligns to!).

            Comment


            • #7
              Thanks for explaining reheadering. I'll archive that for future reference.

              I am reasonably certain that my reads are strand-specific. Is there any way to check that easily? How do I tell what kind of strand-specific library I have?

              Which distro have you been using that java has given such issues?
              I'm using Ubuntu 13.10. I don't want to get too off topic here, but we had a real mess trying to dual boot Ubuntu and Windows earlier this month. I've had a long standing problem with making files executable. chmod 755, chmod +x etc all run without errors but don't actually have any impact on the executable status of the file in question. After a lot of googling, it looks like somehow the partition that contains the Ubuntu system files was mounted as NTFS instead of... whatever the Linux file system is called. IF this is really the problem, then I'm surprised Ubuntu runs at all.

              No idea if this is also causing the issues I'm having with java, but it's my best guess at the moment. I've had a lot of things behave oddly ever since we set up our dual-boot.

              I gather that the solution is normally to unmount and remount the afflicted partition as the right kind of file system with some arcane Linux console wizardry. But somehow I don't think it's a good idea to unmount the partition that contains all the Linux system files. That seems like it would be... bad.
              Last edited by Rivalyn; 04-17-2014, 06:21 AM.

              Comment


              • #8
                The easiest way is to simply too at things in IGV. If you color reads according to being read#1 or read#2 (there's an option for that) and look at a random gene or two then it should be immediately apparent what sort of library type you have.

                Yeah, dual booting can screw things up nicely. The default fs is ext4 (at least on my ubuntu work station) and I too am surprised that it ran at all with everything being ntfs. You might ply a local IT person with copious amounts of booze to make the issues disappear...

                Comment

                Latest Articles

                Collapse

                • seqadmin
                  Essential Discoveries and Tools in Epitranscriptomics
                  by seqadmin




                  The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
                  04-22-2024, 07:01 AM
                • seqadmin
                  Current Approaches to Protein Sequencing
                  by seqadmin


                  Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                  04-04-2024, 04:25 PM

                ad_right_rmr

                Collapse

                News

                Collapse

                Topics Statistics Last Post
                Started by seqadmin, Today, 08:47 AM
                0 responses
                11 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, 04-11-2024, 12:08 PM
                0 responses
                60 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, 04-10-2024, 10:19 PM
                0 responses
                59 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, 04-10-2024, 09:21 AM
                0 responses
                53 views
                0 likes
                Last Post seqadmin  
                Working...
                X