I'm trying to annotate all small ncRNAs in an Illumina GAIIx sequencing sample. All reads are initially 36 bp long. I've used Galaxy to trim the 3' adapters, trim bases with quality < 20 from the 3' end, and align the reads to the genome using Bowtie.
My idea is to download BED files from UCSC containing rRNA, tRNA, miRNA, piRNA, snRNA, snoRNA, exons, introns, RepeatMasker, etc., into Galaxy and then intersect these with my alignment BED file.
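To make clear what I mean by intersecting, here is a minimal sketch in plain Python of the operation I have in mind, not what Galaxy actually runs. The function names `parse_bed` and `annotate_reads` and the toy coordinates are made up for illustration, strand is ignored, and the scan is naive (fine for a toy example, too slow for millions of reads).

```python
from collections import defaultdict

def parse_bed(lines):
    """Yield (chrom, start, end, name) from BED lines (0-based, half-open)."""
    for line in lines:
        line = line.strip()
        # Skip blank lines and UCSC header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        name = fields[3] if len(fields) > 3 else "."
        yield fields[0], int(fields[1]), int(fields[2]), name

def annotate_reads(read_lines, annotation_lines):
    """Return {read_name: sorted list of annotation names the read overlaps}."""
    by_chrom = defaultdict(list)
    for chrom, start, end, name in parse_bed(annotation_lines):
        by_chrom[chrom].append((start, end, name))
    hits = {}
    for chrom, start, end, read in parse_bed(read_lines):
        # Half-open intervals overlap when each starts before the other ends.
        hits[read] = sorted(
            {n for s, e, n in by_chrom[chrom] if s < end and start < e}
        )
    return hits

# Toy data (hypothetical coordinates, not from my sample).
reads = ["chr1\t100\t122\tread1", "chr1\t500\t521\tread2", "chr2\t10\t32\tread3"]
annotations = ["chr1\t90\t160\trRNA", "chr1\t110\t130\tmiRNA", "chr2\t32\t80\ttRNA"]
result = annotate_reads(reads, annotations)
print(result)
```

In this toy case `read1` falls in both an rRNA and a miRNA interval, which is exactly the ambiguity behind question 2 below, and `read3` ends where the tRNA interval starts, so it does not count as overlapping under BED's half-open convention.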
1. Is intersecting tracks a good way to annotate reads? Or is it better to collapse the sequences and then align them to a FASTA file of miRNAs, etc. (though that is slower than using track intersections)?
2. Out of around 5 million alignments in my BED file, 3.5 million intersect with the rRNA intervals from the RepeatMasker track and around 0.5 million with tRNAs. How do I know whether these are really rRNAs and tRNAs and not, e.g., miRNAs or piRNAs processed from rRNAs and tRNAs?
3. Has anyone used the various ncRNA tracks at this UCSC mirror (http://www.ncrna.org/glocal/cgi-bin/hgGateway)? It seems that some sequences are annotated twice with different identifiers. Also, the rRNA track, for example, does not seem to contain all the rRNA annotations that are part of the RepeatMasker track.
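For the alternative in question 1, by "collapsing" I mean merging identical read sequences into one entry with a count before aligning. A rough sketch of that step, assuming the trimmed reads are available as plain sequence strings; the function names and the `>seqN_xCOUNT` header format are my own invention, not a Galaxy or Bowtie convention, and the sequences are toy examples:

```python
from collections import Counter

def collapse_reads(sequences):
    """Merge identical sequences; return (sequence, count) pairs, most abundant first."""
    counts = Counter(seq.strip().upper() for seq in sequences if seq.strip())
    return counts.most_common()

def to_fasta(collapsed):
    """Write collapsed reads as FASTA, storing the read count in the header."""
    return "".join(
        f">seq{i}_x{count}\n{seq}\n"
        for i, (seq, count) in enumerate(collapsed, start=1)
    )

# Toy reads (made up for illustration).
reads = ["TGAGGTAGTAGGTTGTATAGTT", "tgaggtagtaggttgtatagtt", "ACGTACGTACGTACGTACGT"]
collapsed = collapse_reads(reads)
fasta = to_fasta(collapsed)
print(fasta)
```

Each unique sequence is then aligned only once, and the count in the header can be used afterwards to recover per-class read totals.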
Thank you.