Hello,
I am relatively new to NGS analysis, and have recently been put on a project analyzing samples that have been sequenced by Illumina-Solexa's small RNA protocol.
I have reviewed this forum and found much advice on various tools/pipelines including miRtools, miRanalyzer, miRexpress, miRdeep, etc. I just wanted to get some feedback/advice on the type of results I am getting using these various public domain pipelines.
1. Using the s_*_sequence.txt output from the sequencer pipeline, I have about 22 million reads per sample. After removing redundancy, trimming the adapter sequences, and removing reads where the adapter sequence shows up at the beginning or middle, I pared this down to ~300,000 unique reads (+/- 25,000 depending on the sample).
Is this typical?
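To make the preprocessing in point 1 concrete, here is a minimal Python sketch of this kind of trim-and-collapse step. It is only an illustration, not the exact procedure of any of the pipelines named above. The function names, the `min_len` cutoff of 15, and the adapter string in the example are all hypothetical placeholders, and the sketch assumes exact adapter matches and drops reads in which no adapter is found.

```python
from collections import Counter

def trim_adapter(read, adapter, min_len=15):
    """Return the insert before the 3' adapter, or None if the read is rejected.

    Assumptions (illustrative only): exact adapter match, reads without
    any adapter are dropped, and min_len=15 is an arbitrary cutoff.
    """
    idx = read.find(adapter)
    if idx == -1:
        return None          # no adapter found: dropped in this sketch
    if idx < min_len:
        return None          # adapter at the beginning or middle: insert too short
    return read[:idx]

def collapse_reads(reads, adapter, min_len=15):
    """Collapse trimmed reads into unique sequences with their read counts."""
    counts = Counter()
    for read in reads:
        insert = trim_adapter(read, adapter, min_len)
        if insert is not None:
            counts[insert] += 1
    return counts

# Toy example with a placeholder adapter (not a real kit sequence).
ADAPTER = "TCGTATGC"
toy_reads = [
    "A" * 20 + ADAPTER + "GG",   # kept, insert of 20 nt
    "A" * 20 + ADAPTER + "CC",   # kept, same insert as above
    ADAPTER + "A" * 20,          # adapter at the start: rejected
    "C" * 30,                    # no adapter: rejected
]
unique = collapse_reads(toy_reads, ADAPTER)
print(len(unique), "unique sequences from", sum(unique.values()), "kept reads")
```

The number of keys in the resulting counter corresponds to the "unique reads" figure, and the sum of its values to the number of reads that survived trimming.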
2. When analyzing the length distributions of my unique reads, I am seeing a clear peak around 20-21 nucleotides, but I am also seeing some minor peaks around 29 and 32 nucleotides. When analyzing the length distributions of the total read counts, I am seeing a peak at 29 nucleotides. Please see attached plots.
What could this possibly mean? Shouldn't I be seeing peaks around 20-21 nucleotides only? Has anyone seen this type of data before? The idea of these sequences being fragments of mRNA, piRNA, snoRNA, tRNA, etc. has been suggested.
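The two length distributions in point 2 count different things, which is one way the peaks can differ. A minimal Python sketch, assuming the collapsed reads are held as a mapping from sequence to read count (the function name and the toy numbers are made up for illustration):

```python
from collections import Counter

def length_distributions(seq_counts):
    """Return (unique_by_len, total_by_len) for a {sequence: read_count} mapping."""
    unique_by_len = Counter()   # each distinct sequence counted once
    total_by_len = Counter()    # each sequence weighted by its read count
    for seq, n in seq_counts.items():
        unique_by_len[len(seq)] += 1
        total_by_len[len(seq)] += n
    return unique_by_len, total_by_len

# Toy data: two distinct 21-mers with few reads, one 29-mer with many reads.
toy_counts = {"A" * 21: 5, "C" * 21: 3, "G" * 29: 100}
unique_by_len, total_by_len = length_distributions(toy_counts)

unique_peak = max(unique_by_len, key=unique_by_len.get)
total_peak = max(total_by_len, key=total_by_len.get)
print("unique peak:", unique_peak, "total peak:", total_peak)
```

In this toy case the unique-read peak is at 21 nt while the total-read peak is at 29 nt, because a single highly abundant 29 nt sequence dominates the read counts. Whether that is what is happening in the real data would need to be checked by looking at the most abundant sequences at each length.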
3. I aligned my unique reads to the human genome using Bowtie. Approximately 45-60% of the unique reads align to the genome, while the rest do not. I used the command:
./bowtie -n 1 -l 17 -k 200 --best --chunkmbs 128
So what does this mean for the remaining sequences that are not aligning? Why would there be so many unaligned sequences in my data?
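One detail worth keeping in mind for the percentage in point 3: with -k 200 a single read can be reported on many output lines, so the aligned fraction has to be computed over distinct read names rather than over output lines. A minimal Python sketch, assuming Bowtie's default tab-delimited output with the read name in the first column (the function name and the toy lines are hypothetical):

```python
def aligned_fraction(bowtie_lines, n_input_reads):
    """Fraction of input reads with at least one reported alignment.

    Assumes one alignment per line, tab-delimited, read name in column 1.
    """
    aligned = set()
    for line in bowtie_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        aligned.add(line.split("\t")[0])   # the same read may appear many times
    return len(aligned) / n_input_reads

# Toy output: read r1 aligns twice, r2 once, and 4 reads were given as input.
toy_lines = [
    "r1\t+\tchr1\t100\tACGT\tIIII\t1\t",
    "r1\t-\tchr2\t50\tACGT\tIIII\t1\t",
    "r2\t+\tchr3\t7\tGGGG\tIIII\t0\t",
]
print(aligned_fraction(toy_lines, 4))
```

To look at the reads that did not align, Bowtie's --un option should write them to a separate file, which can then be checked against other references or inspected by hand.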
-----------------------
Thank you all for any insight you can provide me with. SEQAnswers has been very helpful for me thus far in getting introduced to NGS data analysis!