Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Isolating non coding RNA using bedtools/samtools/htseq-count

    I am trying to isolate ncRNA from a bam file.

    I have a accepted_hits.bam file that has been produced by tophat, which to the best of my knowledge contains all reads that have mapped to the reference genome, be it coding or non-coding.
    I also have a bed file that I have made using getting the lengths of all the chromosomes, into a basic bed file, with only the required fields, i.e:
    Chr1 0 17971566
    Chr10 0 15490421
    Chr11 0 17082262
    Chr12 0 13002763
    Chr13 0 17372821
    Chr14 0 12031045
    Chr15 0 19001909
    Chr16 0 11476499
    Chr17 0 14675105
    Chr18 0 14814751
    Chr19 0 11145676
    Chr2 0 15003700
    Chr20 0 14375345
    Chr21 0 14287830
    Chr22 0 12982530
    Chr23 0 20688064
    Chr24 0 17702303
    Chr25 0 14321294
    Chr26 0 17153587
    Chr27 0 10917718
    Chr28 0 16406049
    Chr29 0 14180785
    Chr3 0 20519224
    Chr4 0 13065296
    Chr5 0 18825850
    Chr6 0 18208465
    Chr7 0 18704015
    Chr8 0 21083321
    Chr9 0 11598160
    Unknow 0 177361715
    I then used bedtools subtract with the above bed file and the gff file associated with the genome that I aligned to. My reasoning being that the gff file has all the features (genes) in it, so if I subtract those from the lengths of the various chromosomes, I'll be be left with everything outside of a coding region.
    My plan was to then use
    Code:
    samtools view -L nonCodingRegions.bed accepted_hits.bam > nonCoding.bam
    This, for some reasons gives me the same file. i.e. it outputs accepted_hits.bam into nonCoding.bam. Which isn't right. Am I using the -L option correctly or should I be using something else?

    My other thought was to get the reads that are identified with the gff features using the -o option in htseq-count:
    Code:
    htseq-count -f bam -r name -o coding -t gene -i ID accepted_hits.bam
    so that I cold then try and figure out some way of subtracting the sam file output by that from the accepted_hits.bam. This doesn't appear to work either as it only outputs a list of features that the reads have hit, with no way of identifying which reads hit that feature, like this:
    XF:Z:__no_feature
    XF:Z:Achn175571
    XF:Z:Achn175571
    XF:Z:__no_feature
    XF:Z:Achn008681
    XF:Z:__no_feature
    XF:Z:__no_feature
    XF:Z:Achn339401
    XF:Z:Achn339401
    XF:Z:Achn050721
    Is there something I'm missing in my attempt to get htseq-count to output the samfile of reads that map to features in the gff file?

    Or alternatively, is there a better way to draw out the ncRNA?

    Cheers
    Ben.

  • #2
    Firstly, don't blindly accept anything that this method produces as being an ncRNA. Most of what you'll be getting are repeat regions that have some background expression (i.e., uninteresting noise).

    Secondly, your samtools view -L ... command is correct, so it's likely a problem with your BED file. Have a look at that in IGV or another browser and make sure it seems OK.

    As an aside, there are programs that have been written to find many classes of ncRNAs. You'd be strongly encouraged to look into them before trying to roll your own.

    Comment


    • #3
      Again, thanks for the answer.

      Don't worry, I wasn't going to accept it all
      Looking at my BED file, not sure how to figure out if something is/what is wrong.

      These other programs, would you have any recommendations? I've come across a number of papers that haven't (that I've seen) linked to any software that they've used. There's a couple, like CPAT that I'm currently trying to get running, but if there are other options, I'd be very interested in having a look at them.

      Cheers
      Ben.

      Comment


      • #4
        piPipe for piRNAs, mirdeep2 for miRNAs. There are a few for lincRNAs as well, but I can't come up with the names off-hand.

        Comment

        Latest Articles

        Collapse

        • seqadmin
          Strategies for Sequencing Challenging Samples
          by seqadmin


          Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
          03-22-2024, 06:39 AM
        • seqadmin
          Techniques and Challenges in Conservation Genomics
          by seqadmin



          The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

          Avian Conservation
          Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
          03-08-2024, 10:41 AM

        ad_right_rmr

        Collapse

        News

        Collapse

        Topics Statistics Last Post
        Started by seqadmin, Yesterday, 06:37 PM
        0 responses
        8 views
        0 likes
        Last Post seqadmin  
        Started by seqadmin, Yesterday, 06:07 PM
        0 responses
        8 views
        0 likes
        Last Post seqadmin  
        Started by seqadmin, 03-22-2024, 10:03 AM
        0 responses
        49 views
        0 likes
        Last Post seqadmin  
        Started by seqadmin, 03-21-2024, 07:32 AM
        0 responses
        66 views
        0 likes
        Last Post seqadmin  
        Working...
        X