I am trying to isolate ncRNA from a bam file.
I have a accepted_hits.bam file that has been produced by tophat, which to the best of my knowledge contains all reads that have mapped to the reference genome, be it coding or non-coding.
I also have a bed file that I have made using getting the lengths of all the chromosomes, into a basic bed file, with only the required fields, i.e:
I then used bedtools subtract with the above bed file and the gff file associated with the genome that I aligned to. My reasoning being that the gff file has all the features (genes) in it, so if I subtract those from the lengths of the various chromosomes, I'll be be left with everything outside of a coding region.
My plan was to then use
This, for some reasons gives me the same file. i.e. it outputs accepted_hits.bam into nonCoding.bam. Which isn't right. Am I using the -L option correctly or should I be using something else?
My other thought was to get the reads that are identified with the gff features using the -o option in htseq-count:
so that I cold then try and figure out some way of subtracting the sam file output by that from the accepted_hits.bam. This doesn't appear to work either as it only outputs a list of features that the reads have hit, with no way of identifying which reads hit that feature, like this:
Is there something I'm missing in my attempt to get htseq-count to output the samfile of reads that map to features in the gff file?
Or alternatively, is there a better way to draw out the ncRNA?
Cheers
Ben.
I have a accepted_hits.bam file that has been produced by tophat, which to the best of my knowledge contains all reads that have mapped to the reference genome, be it coding or non-coding.
I also have a bed file that I have made using getting the lengths of all the chromosomes, into a basic bed file, with only the required fields, i.e:
Chr1 0 17971566
Chr10 0 15490421
Chr11 0 17082262
Chr12 0 13002763
Chr13 0 17372821
Chr14 0 12031045
Chr15 0 19001909
Chr16 0 11476499
Chr17 0 14675105
Chr18 0 14814751
Chr19 0 11145676
Chr2 0 15003700
Chr20 0 14375345
Chr21 0 14287830
Chr22 0 12982530
Chr23 0 20688064
Chr24 0 17702303
Chr25 0 14321294
Chr26 0 17153587
Chr27 0 10917718
Chr28 0 16406049
Chr29 0 14180785
Chr3 0 20519224
Chr4 0 13065296
Chr5 0 18825850
Chr6 0 18208465
Chr7 0 18704015
Chr8 0 21083321
Chr9 0 11598160
Unknow 0 177361715
Chr10 0 15490421
Chr11 0 17082262
Chr12 0 13002763
Chr13 0 17372821
Chr14 0 12031045
Chr15 0 19001909
Chr16 0 11476499
Chr17 0 14675105
Chr18 0 14814751
Chr19 0 11145676
Chr2 0 15003700
Chr20 0 14375345
Chr21 0 14287830
Chr22 0 12982530
Chr23 0 20688064
Chr24 0 17702303
Chr25 0 14321294
Chr26 0 17153587
Chr27 0 10917718
Chr28 0 16406049
Chr29 0 14180785
Chr3 0 20519224
Chr4 0 13065296
Chr5 0 18825850
Chr6 0 18208465
Chr7 0 18704015
Chr8 0 21083321
Chr9 0 11598160
Unknow 0 177361715
My plan was to then use
Code:
samtools view -L nonCodingRegions.bed accepted_hits.bam > nonCoding.bam
My other thought was to get the reads that are identified with the gff features using the -o option in htseq-count:
Code:
htseq-count -f bam -r name -o coding -t gene -i ID accepted_hits.bam
XF:Z:__no_feature
XF:Z:Achn175571
XF:Z:Achn175571
XF:Z:__no_feature
XF:Z:Achn008681
XF:Z:__no_feature
XF:Z:__no_feature
XF:Z:Achn339401
XF:Z:Achn339401
XF:Z:Achn050721
XF:Z:Achn175571
XF:Z:Achn175571
XF:Z:__no_feature
XF:Z:Achn008681
XF:Z:__no_feature
XF:Z:__no_feature
XF:Z:Achn339401
XF:Z:Achn339401
XF:Z:Achn050721
Or alternatively, is there a better way to draw out the ncRNA?
Cheers
Ben.
Comment