I am about as new as a newbie can get, so to be honest I'm a little reluctant to post. However, while looking at the raw Illumina reads I found a result that I cannot readily explain this early in my RNA-seq workflow, and with my limited knowledge at this point I thought I would ask for an opinion on the results below.
I am particularly interested in a hypothetical unannotated paralog or alternative splice form that may not align with my genome, so I am spending quite a bit of time perusing the raw Illumina reads to take a close look at some of the more conserved regions of my research proteins, looking for paralogs etc. (and to get a 'feel' for the raw reads and how they behave on manual queries). After I finish this preliminary analysis I have a few hundred hours of learning ahead of me before I can comment with any confidence on sequence assembly matters. I enjoy computers, but I am far from a Linux wizard.
Below is a result from high-quality reads (>Q30) that suggests an alternative splice. The code box below contains three lines:
Line 1: partial exon 2 of one of my research proteins
Line 2: raw Illumina reads linked by grep query, all have >Q30, and all cross the putative splice site
Line 3: partial exon 7 of the same protein in Line 1
The <....> bracket indicates the beginning of an intronic sequence at the end of exon 2.
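For anyone who wants to repeat the Line 2 search, here is a minimal Python sketch of the same idea as the grep query: keep only reads that contain a query string and whose bases all score above Q30. The function names, the Phred+33 quality encoding, and the reading of ">Q30" as "every base above Q30" are my assumptions for illustration, not details from my actual workflow.

```python
from typing import Iterable, List, Tuple

PHRED_OFFSET = 33  # assumption: Sanger / Illumina 1.8+ quality encoding


def min_phred(qual: str, offset: int = PHRED_OFFSET) -> int:
    """Lowest per-base Phred score in a FASTQ quality string (0 if empty)."""
    return min((ord(c) - offset for c in qual), default=0)


def junction_reads(lines: Iterable[str], query: str,
                   min_q: int = 30) -> List[Tuple[str, str]]:
    """Return (header, sequence) for reads containing `query` with all bases > min_q.

    Assumes well-formed 4-line FASTQ records; an incomplete trailing
    record is silently dropped by zip().
    """
    query = query.upper()
    stripped = (line.rstrip("\n") for line in lines)
    hits = []
    # Group the stream into (header, sequence, '+', quality) records.
    for header, seq, _plus, qual in zip(*[stripped] * 4):
        if query in seq.upper() and min_phred(qual) > min_q:
            hits.append((header, seq))
    return hits


if __name__ == "__main__":
    # Tiny made-up example; 'I' encodes Q40 and '#' encodes Q2 under Phred+33.
    demo = [
        "@r1", "ACGTACGTAA", "+", "IIIIIIIIII",
        "@r2", "ACGTACGTAA", "+", "IIII#IIIII",
        "@r3", "TTTTTTTTTT", "+", "IIIIIIIIII",
    ]
    for header, seq in junction_reads(demo, "GTACGT"):
        print(header, seq)
```

Like a plain grep, this only matches the query on the strand as written in the file, so reads from the opposite strand would need a reverse-complemented query.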
The result in the code box appears to be an alternative splice. However, I was wondering whether it may be an error generated during RNA-seq preparation of the experimental material, i.e., two pieces of DNA randomly cut and joined. There were about 10 copies of the middle sequence (Line 2), all yielding high-quality reads and all crossing an apparent splice site.
Q: What is the likelihood that this result is real and not a machine artifact?
Code:
                             ********** ***::****:*
e2                        ...RGHTGLFAGG<ASTYQVGLELC...>
    ...GHALLFRTSVMAKVEIQAVSTCRGHTGLFAGG<ASTFHVGLEAC...>
e7  ...GHALLYRTTVMAKLEIQAVSTCR...      <--- intron --->
       *****:**:****:*********
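The `*` and `:` marker rows in the alignment can be double-checked with a few lines of Python. This is only a sketch: `match_line` is a name I made up, and it assumes that `*` marks an identical residue and `:` marks any mismatch, which is how the rows above appear to be drawn (it is not the Clustal meaning of `:`).

```python
def match_line(a: str, b: str) -> str:
    """Marker row for two pre-aligned, equal-length peptide strings.

    Assumption: '*' = identical residue, ':' = any mismatch.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be pre-aligned to equal length")
    return "".join("*" if x == y else ":" for x, y in zip(a, b))


if __name__ == "__main__":
    # Read vs. exon 7 (left block of the alignment)
    print(match_line("GHALLFRTSVMAKVEIQAVSTCR", "GHALLYRTTVMAKLEIQAVSTCR"))
    # Read vs. exon 2 (exonic part, then the start of the intronic part)
    print(match_line("RGHTGLFAGG", "RGHTGLFAGG"))
    print(match_line("ASTFHVGLEAC", "ASTYQVGLELC"))
```

Run on the segments shown in the code box, this reproduces the marker rows as posted, which at least confirms the alignment was transcribed consistently.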