SEQanswers

Old 07-27-2009, 04:28 AM   #1
Sheila
Member
 
Location: Europe

Join Date: Jun 2009
Posts: 17
Default miRNA analysis

Hi,
I'm wondering what people are using for miRNA analysis. Does anyone use other tools apart from AB's?

Does anyone know the percentage of mappable reads one should get per sample with barcodes? Is it 50% of the total amount of usable reads or less than that?

Regards,

S
Old 07-27-2009, 09:43 PM   #2
OneManArmy
Member
 
Location: Sydney, Australia

Join Date: Jul 2009
Posts: 13
Default

We've been using AB's rna2map for some miRNA analysis.

From our experience in barcoded samples, we've been getting ~50-60% of usable reads that are mappable.
Old 07-27-2009, 10:06 PM   #3
OneManArmy
Member
 
Location: Sydney, Australia

Join Date: Jul 2009
Posts: 13
Default

On that note, I find that the mapping depends heavily on the number of mismatches you set. When increasing the number of mismatches on such short reads, we tend to get multiple mappings for some of the reads. Does anyone have an idea of what to set the mismatches to?

The 50-60% of usable reads was obtained using up to 3 mismatches for seeding and 6 mismatches after extension; for the genomic step, 2 mismatches for seeding and 5 mismatches after extension.
Old 07-28-2009, 03:42 AM   #4
Sheila
Member
 
Location: Europe

Join Date: Jun 2009
Posts: 17
Default

Hi,
Thanks for your reply.
Do you mask the last positions of the reads and then use up to 6mm, or do you use all positions (no masking)?
In addition, did you get a homogeneous distribution of barcodes for all your samples? For us the number of tags per barcode ranged from 14 to 29M (usable tags, not mappable ones). Not sure if that's normal.

Regards.

S.
Old 07-28-2009, 11:29 AM   #5
fishtank
Junior Member
 
Location: seattle

Join Date: Jul 2009
Posts: 8
Default

Quote:
Originally Posted by OneManArmy View Post
On that note, I find that the mapping highly depends on the number of mismatches you set. When increasing the number of mismatches on such short reads, we tend to get multiple mappings for some of the reads. Anyone have any idea on what to set the mismatches to?

50-60% of usable reads is obtained using up to 3 mismatches for seeding, 6 mismatches after extension, and for genomic step, 2 mismatches for seeding and 5 mismatches after extension.
I am trying out rna2map. I am wondering what fraction of total reads map to known miRNA. How does setting 0 mismatches affect that?
I am assuming "multiple mapping for some reads" means the same read maps to 2 different miRNAs. In such cases, does the read get assigned to both miRNAs? How do you get the statistics for such cases?
Are 6 mismatches considered high for short reads? Since ABI uses colorspace coding, am I right to say 1 mismatch is generally a sequencing error that can be corrected, while 2 mismatches are a base substitution and so could be a SNP? Do the mismatches have to be adjacent in such cases?
So how do 3 mismatches in colorspace compare to the number of mismatches in basespace?
If I set it to 1 mismatch, is this almost equivalent to 0 mismatches in base space?
Thanks for your help.
Old 07-28-2009, 12:54 PM   #6
westerman
Rick Westerman
 
Location: Purdue University, Indiana, USA

Join Date: Jun 2008
Posts: 1,104
Default

Quote:
Originally Posted by fishtank View Post
I am trying out rna2map. I am wondering what fraction of total reads map to known miRNA. How does setting 0 mismatches affect that?
Any reads with sequencing errors would get discarded. With luck that won't be a significant number. Also you will discard any reads with SNPs. Unless your reference is really evolutionarily close to your experimental sample you could discard some miRNA. This may bias your results.
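To get a feel for how many reads a 0-mismatch setting could throw away, here is a rough sketch. It assumes sequencing errors are independent and equally likely at every color position, which is a simplification, and the 1% error rate and 35-color read length are hypothetical numbers chosen only for illustration.

```python
def error_free_fraction(per_color_error_rate, read_length):
    """Probability that a read of read_length colors contains no error,
    assuming errors are independent and equally likely at each position."""
    return (1.0 - per_color_error_rate) ** read_length

# Hypothetical numbers, not measured values: 1% error per color, 35 colors.
fraction = error_free_fraction(0.01, 35)
print(f"Fraction of reads with no sequencing error: {fraction:.2f}")
```

Under those assumptions, only the error-free fraction could survive a 0-mismatch filter, before even considering SNPs.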

Quote:
I am assuming "multiple mapping for some reads" means the same reads maps to 2 different miRNA. In such cases, does the read gets assigned to both miRNA. How do you get the statistics for such cases?
I am not familiar with the miRNA pipeline but in the "normal" SNP-calling pipeline only unique reads are considered in the end results. If this holds true in miRNA/transcriptome counting then reads with multiple hits will eventually be discarded.

Quote:
Is the 6 mismatch considered high for short reads?
For 50-mers the answer is 'no, 6 mismatch is ok although 5mm would be better'. In part it depends on the distance of your reference from the experiment. For the shorter miRNA reads there could be problems. For SNP calling the general recommendation is 2mm for 25mers, 3mm for 35mers. For transcriptome calling the program has different mismatch parameters for the 5' part of the 50mer and for the 3' part.

Quote:
Since ABI uses colorspace coding, am I right to say 1 mismatch is generally a sequencing error that can be corrected, 2 mismatch is a base substitution so could be a SNP. Does the mismatch have to be adjacent in such cases?
Yes, a single isolated mismatch is a sequencing error. Always. Two adjacent mismatches are most likely a SNP although they can, obviously, also be 2 sequencing errors in a row, an indel, etc. For a SNP the two mismatches have to be adjacent.

Quote:
So how does 3 mismatches compares to ? number of mismatch in basespace?
if I set to 1 mismatch, is this almost equivalent to 0 mismatch in base space?
Thanks for your help.
Yes, 1mm in CS is 0mm in BS. 3mm in CS can be 1 SNP plus 1 error. Or 3 errors. Or something else. Base space usually has quality values and thus I am not sure that the questions can be definitively answered.
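A small sketch of why a SNP shows up as two adjacent color mismatches. It uses the standard 2-base encoding in which the color of a dinucleotide is the XOR of 2-bit base codes (A=0, C=1, G=2, T=3); the example sequences are made up for illustration.

```python
# 2-bit base codes; the SOLiD color of a dinucleotide is the XOR of its two bases.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def to_colors(seq):
    """Return the list of colors (0-3), one per adjacent base pair in seq."""
    return [CODE[a] ^ CODE[b] for a, b in zip(seq, seq[1:])]

def color_mismatches(ref, read):
    """Indices at which the color strings of two equal-length sequences differ."""
    pairs = zip(to_colors(ref), to_colors(read))
    return [i for i, (x, y) in enumerate(pairs) if x != y]

ref = "ACGTACGT"
snp = "ACGAACGT"  # made-up read with one base substitution (T -> A at index 3)

print(to_colors(ref))
print(to_colors(snp))
print(color_mismatches(ref, snp))  # the two colors flanking the changed base
```

Each interior base contributes to two colors, so changing one base changes both of them. A lone color mismatch therefore cannot come from a substitution, which is why it is read as a sequencing error.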
Old 07-28-2009, 01:18 PM   #7
fishtank
Junior Member
 
Location: seattle

Join Date: Jul 2009
Posts: 8
Default

Quote:
Originally Posted by westerman View Post
Any reads with sequencing errors would get discarded. With luck that won't be a significant number. Also you will discard any reads with SNPs. Unless your reference is really evolutionarily close to your experimental sample you could discard some miRNA. This may bias your results.
I am puzzled: if 1 mismatch is always a sequencing error that can be corrected, why would any reads with sequencing errors get discarded?
Does the ABI rna2map pipeline discard reads as a result of sequencing errors before the alignment?
Old 07-28-2009, 01:30 PM   #8
westerman
Rick Westerman
 
Location: Purdue University, Indiana, USA

Join Date: Jun 2008
Posts: 1,104
Default

Quote:
Originally Posted by fishtank View Post

I am puzzled: if 1 mismatch is always a sequence error that can be corrected. Why would any reads with sequencing errors get discarded?
Does the abi rna2map pipeline discards reads as a result of sequencing error before the alignment?
That is a good point. With 1 or more mismatch the errors should get corrected and the reads retained.

Of course if you set mismatch to 0 (which is the part of the post I was responding to) then the reads will be discarded. But mismatch 1 or greater should not have the reads being discarded ... just corrected.

Last edited by westerman; 07-28-2009 at 01:33 PM. Reason: Posted faster than my brain thought things through. Corrected this.
Old 07-28-2009, 02:47 PM   #9
OneManArmy
Member
 
Location: Sydney, Australia

Join Date: Jul 2009
Posts: 13
Default

Quote:
Originally Posted by westerman View Post
I am not familiar with the miRNA pipeline but in the "normal" SNP-calling pipeline only unique reads are considered in the end results. If this holds true in miRNA/transcriptome counting then reads with multiple hits will eventually be discarded.
The rna2map pipeline counts multiple mappings in the end result - they are not discarded. Thus setting the mismatch threshold too high can yield ambiguous results.
Old 07-28-2009, 02:52 PM   #10
OneManArmy
Member
 
Location: Sydney, Australia

Join Date: Jul 2009
Posts: 13
Default

Quote:
Originally Posted by Sheila View Post
Do you mask the last positions of the reads and then use up to 6mm, or do you use all positions (no masking)?
Yes, that is how the pipeline does it. It starts by mapping 18-mers and then extends the mapping, allowing up to 6mm.
Quote:
Originally Posted by Sheila View Post
In addition, did you get a homogeneous distribution of barcodes for all your samples? For us the number of tags per barcode ranged from 14 to 29M (usable tags, not mappable ones). Not sure if that's normal.
No, the barcode distribution will vary depending on how accurately you combined the barcoded samples in the wet lab, lab conditions, etc. However, I am not too familiar with this part.
With 14-29M tags per barcode, how many barcodes were you running? It really depends on how many beads you loaded on the slide. From ABI's docs it seems that 300M is the number of beads you're supposed to load - if you are using the full 10 barcodes in SREK that seems about right.
Old 07-29-2009, 01:12 AM   #11
Sheila
Member
 
Location: Europe

Join Date: Jun 2009
Posts: 17
Default

Quote:
Originally Posted by OneManArmy View Post
The rna2map pipeline counts multiple mappings in the end result - they are not discarded. Thus setting the mismatch threshold too high can yield ambiguous results.
In the configuration file you can choose between "all" or "unique".
all = all mapping positions
unique= unique mapping positions
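
To illustrate the difference in the resulting counts, here is a minimal sketch. It is not rna2map's actual code; the function name, read IDs, and miRNA names are hypothetical, and it only assumes that "all" credits every mapping position while "unique" keeps reads with exactly one mapping.

```python
from collections import Counter

def count_hits(read_hits, mode="unique"):
    """Count reads per reference.

    read_hits maps a read ID to the list of references it mapped to.
    mode="all" credits every mapping position; mode="unique" keeps only
    reads that mapped to exactly one reference.
    """
    if mode not in ("all", "unique"):
        raise ValueError(f"unknown mode: {mode}")
    counts = Counter()
    for refs in read_hits.values():
        if mode == "unique" and len(refs) != 1:
            continue  # multi-mapped read is dropped in unique mode
        counts.update(refs)
    return counts

# Hypothetical hits: read2 maps to two different miRNAs.
hits = {"read1": ["miR-A"], "read2": ["miR-A", "miR-B"], "read3": ["miR-B"]}
print(count_hits(hits, "all"))
print(count_hits(hits, "unique"))
```

With "all", the ambiguous read inflates both miRNA counts; with "unique", it contributes to neither.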

S.
Old 07-29-2009, 01:20 AM   #12
Sheila
Member
 
Location: Europe

Join Date: Jun 2009
Posts: 17
Default

Quote:
Originally Posted by westerman View Post
For 50-mers the answer is 'no, 6 mismatch is ok although 5mm would be better'. In part it depends on the distance of your reference from the experiment. For the shorter miRNA reads there could be problems. For SNP calling the general recommendation is 2mm for 25mers, 3mm for 35mers. For transcriptome calling the program has different mismatch parameters for the 5' part of the 50mer and for the 3' part.
I'd rather use 35nt for miRNAs since their size varies between 19 and 25nt (Human).
6 mismatches seem quite a lot, but bear in mind that the last bases of the miRNA, which are close to the adaptor, have a high error rate. I wouldn't use 0 mismatches.

S.
Old 07-29-2009, 12:14 PM   #13
fishtank
Junior Member
 
Location: seattle

Join Date: Jul 2009
Posts: 8
Default

Quote:
Originally Posted by westerman View Post
For 50-mers the answer is 'no, 6 mismatch is ok although 5mm would be better'. In part it depends on the distance of your reference from the experiment. For the shorter miRNA reads there could be problems. For SNP calling the general recommendation is 2mm for 25mers, 3mm for 35mers. For transcriptome calling the program has different mismatch parameters for the 5' part of the 50mer and for the 3' part.
Can you clarify what you mean by the distance of your reference from the experiment?
Has anyone tried comparing different mismatch settings for rna2map? I still can't get a sense of what is optimal for the miRNA transcriptome. Does it make sense to set the seed mm to 0 and the extension mm to 3?

Quote:
Originally Posted by Sheila View Post
I'd rather use 35nt for miRNAs since their size varies between 19 and 25nt (Human).
6 mismatches seem quite a lot, but bear in mind that the last bases of the miRNA, which are close to the adaptor, have a high error rate. I wouldn't use 0 mismatches.

S.
What are the reasons for not using 0 or 1 mismatches, besides the low counts and thus missing low-abundance miRNAs? If you are comparing differential expression, these miRNAs wouldn't be statistically significant most of the time. On the other hand, if you boost the counts for such low-abundance miRNAs by allowing more mismatches, the accuracy of the method becomes questionable.

Also how is "usable reads" defined? Where do I get statistics for those?

From my runs, it says 64540015 total beads but the uniquely placed beads are 0.31% for 0 mismatches. Is that low? It reports up to 2.04% for up to 6 mismatches.
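
For scale, those percentages can be turned into approximate bead counts. This is only a back-of-the-envelope sketch using the figures quoted above; the reported percentages are rounded, so the counts are approximate, and the function name is made up.

```python
def placed_beads(total_beads, percent_placed):
    """Approximate number of uniquely placed beads for a reported percentage."""
    return round(total_beads * percent_placed / 100)

total_beads = 64_540_015
for label, pct in (("0 mismatches", 0.31), ("up to 6 mismatches", 2.04)):
    print(f"{label}: ~{placed_beads(total_beads, pct):,} uniquely placed beads")
```

So this corresponds to roughly 0.2 million beads at 0 mismatches and roughly 1.3 million at up to 6 mismatches.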
Old 07-29-2009, 01:00 PM   #14
westerman
Rick Westerman
 
Location: Purdue University, Indiana, USA

Join Date: Jun 2008
Posts: 1,104
Default

Quote:
Originally Posted by fishtank View Post

Can you clarify what you mean on the distance of your reference from the experiments?
Evolutionary distance (or divergence) in millions of years.

Many people, especially in the SOLiD camp, do experiments of DNA versus known and well annotated genomes. E.g., human DNA vs. the known human genome reference. When you do this type of experiment you can get away with low mismatch requirements because you expect your sequenced DNA to be very close to the reference. 2 mismatches is great for SNP discovery since any given read is unlikely to have more than 1 SNP in it. Anything else can be discarded as error.

On the other hand some of us have to deal with DNA from species only partially related to our known (and often incomplete) reference sequence. We then use larger mismatch parameters and are thankful for what information we do get back.

When I talk to ABI they are always thinking in "perfect human reference" terms. Thus I try to be careful to couch my answers in terms of evolutionary distance. I.e., what works for me in the rough-and-ready world of plant genomics may not be strictly applicable to you if you are working in human genomics.


Quote:
From my runs, it says 64540015 total beads but the uniquely placed beads are 0.31% for 0 mismatches. Is that low? It reports up to 2.04% for up to 6 mismatches.
That is low but it depends on your reference and your DNA and your organism, which I do not think you have stated. But given this thread I presume that your reference is genomic and your DNA is microRNA. In that case you have to ask yourself, "how much of the genome do I expect to be miRNA versus other RNA, genes, and structural?" If the answer is that you expect only 0.3% of your genome to be miRNA then your mapping is fine.

microRNA are so newly discovered -- i.e., since I've been out of school -- that I am not sure how much of a genome should be miRNA. I could tell you roughly how much of a genome should be gene and thus how much coverage an mRNA experiment should have, but not for miRNAs.
Old 07-29-2009, 02:46 PM   #15
fishtank
Junior Member
 
Location: seattle

Join Date: Jul 2009
Posts: 8
Default

Quote:
Originally Posted by Sheila View Post
I'd rather use 35nt for miRNAs since their size varies between 19 and 25nt (Human).
6 mismatches seem quite a lot, but bear in mind that the last bases of the miRNA, which are close to the adaptor, have a high error rate. I wouldn't use 0 mismatches.
S.
I am wondering where you came to the conclusion that last bases of the miRNA that are close to the adaptor have a high error rate. Could these be due to miRNA editing?
Old 07-29-2009, 04:33 PM   #16
fishtank
Junior Member
 
Location: seattle

Join Date: Jul 2009
Posts: 8
Default

Quote:
Originally Posted by westerman View Post
That is low but it depends on your reference and your DNA and your organism. Which I do not think you have stated. But given this thread I presume that your reference is genomic and your DNA is microRNA. In that case you have to ask yourself, "how much of the genome do I expect to be miRNA as versus other RNA, genes, and structural?" If the answer is that you expect only 0.3% of your genome to be miRNA then your mapping is fine.

microRNA are so newly discovered -- i.e., since I've been out of school -- that I am not sure how much of a genome should be miRNA. I could tell you roughly how much of genome should be gene and thus how much a mRNA experiment should have have as coverage but not for miRNAs.
But according to this ABI document, they are getting 50% reads mapped to miRNA so 0.3% is worrying. And it is already enriched for small RNA. I am skeptical of the 50% claim though, wonder what other people are getting?
Attached Files
File Type: pdf cms_057560.pdf (617.4 KB, 51 views)
Old 07-29-2009, 05:25 PM   #17
nilshomer
Nils Homer
 
Location: Boston, MA, USA

Join Date: Nov 2008
Posts: 1,285
Default

Quote:
Originally Posted by westerman View Post
2 mismatches is great for SNP discovery since any given read is unlikely to have more than 1 SNP in it. Anything else can be discarded as error.
The fraction of possible 50bp reads with X SNPs (from hg18 and dbsnp) is:

SNPs per read   Fraction of reads
0               84.08%
1               13.02%
2               2.30%
3               0.40%
4               0.10%
5               0.03%
...

so make your own judgment.
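
To make that judgment a bit more concrete, here is a quick tally of the figures listed above. The values are copied from the list; the elided tail beyond 5 SNPs is not filled in, so the second total only covers the listed rows.

```python
# Percent of possible 50bp reads containing a given number of SNPs,
# copied from the figures above (the tail beyond 5 SNPs is not listed).
percent_by_snp_count = {0: 84.08, 1: 13.02, 2: 2.30, 3: 0.40, 4: 0.10, 5: 0.03}

at_most_one = round(percent_by_snp_count[0] + percent_by_snp_count[1], 2)
two_or_more_listed = round(
    sum(pct for n, pct in percent_by_snp_count.items() if n >= 2), 2
)

print(f"Reads with at most 1 SNP: {at_most_one}%")
print(f"Reads with 2 to 5 SNPs (listed rows only): {two_or_more_listed}%")
```

Since a SNP costs two color mismatches, a 2-mismatch limit can at best recover the reads with at most 1 SNP, and only when those reads carry no sequencing error as well.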

Quote:
On the other hand some of us have to deal DNA from species only partially related to our known (and often incomplete) reference sequence. We then use larger mismatch parameters and are thankful for what information we do get back.
I think that, if possible, you should align with the greatest sensitivity you can, since you will recover the most data. SOLiD color error rates are non-trivial, but the errors can be easily corrected (when done properly with dynamic programming rather than valid-adjacent rules). I would recommend allowing somewhere around 10% color differences (in most cases a SNP counts as two, a color error as one).
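As a rough sketch of what a 10% budget means for common read lengths: rounding down to a whole number of mismatches is an assumption made here, and the function names are made up for the example.

```python
def color_budget(read_length, percent=10):
    """Allowed color differences for a read, rounded down (an assumption)."""
    return read_length * percent // 100

def fits_budget(n_snps, n_color_errors, read_length):
    """True if the read stays within budget, counting each SNP as two
    color differences and each color error as one."""
    return 2 * n_snps + n_color_errors <= color_budget(read_length)

for length in (25, 35, 50):
    print(length, color_budget(length))

print(fits_budget(1, 1, 35))  # 1 SNP + 1 error costs 3 color differences
print(fits_budget(2, 0, 35))  # 2 SNPs cost 4 color differences
```

With this rounding, the budgets for 25-mers, 35-mers, and 50-mers come out the same as the 2mm, 3mm, and 5mm figures mentioned earlier in the thread.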
Old 07-29-2009, 05:31 PM   #18
OneManArmy
Member
 
Location: Sydney, Australia

Join Date: Jul 2009
Posts: 13
Default

Quote:
Originally Posted by Sheila View Post
In the configuration file you can choose between "all" or "unique".
all = all mapping positions
unique= unique mapping positions
Thanks. Even with this, the pipeline still discards the reads that map to multiple places - even though a read may map to a reference with 0 mismatches and another one with 2 mismatches.
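One possible way to rescue such reads would be a "best hit" rule: keep a multi-mapped read when one reference matches with strictly fewer mismatches than all the others. The sketch below is a hypothetical post-processing step, not something rna2map is known to offer; the function and miRNA names are invented.

```python
def best_unique_hit(hits):
    """hits is a list of (reference, mismatches) tuples for one read.

    Return the reference with strictly the fewest mismatches, or None
    if there are no hits or the best score is shared by several references.
    """
    if not hits:
        return None
    best = min(mm for _, mm in hits)
    winners = [ref for ref, mm in hits if mm == best]
    return winners[0] if len(winners) == 1 else None

# Hypothetical read that maps with 0 mismatches to one miRNA and 2 to another.
print(best_unique_hit([("miR-A", 0), ("miR-B", 2)]))
# A genuine tie stays ambiguous and is dropped.
print(best_unique_hit([("miR-A", 1), ("miR-B", 1)]))
```

This keeps the 0-mismatch versus 2-mismatch case while still discarding true ties.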
Old 07-31-2009, 02:17 AM   #19
Sheila
Member
 
Location: Europe

Join Date: Jun 2009
Posts: 17
Default

Quote:
Originally Posted by fishtank View Post
I am wondering where you came to the conclusion that last bases of the miRNA that are close to the adaptor have a high error rate. Could these be due to miRNA editing?
Hi,
It is known that the last bases close to the adaptor have a higher error rate, so I would not use 0 mismatches: first, because you would not detect any isomiR with a 1nt difference (polymorphic or not), and second, because of the higher error rate at the end of the sequences.
I'm still playing with the parameters; it's hard to define what's best.

S.
Old 07-31-2009, 01:21 PM   #20
fishtank
Junior Member
 
Location: seattle

Join Date: Jul 2009
Posts: 8
Default

I am trying to figure out how the *.csfasta_extend.counts.35.6 file gets generated from .csfasta_extend.ma.35.6. In the .csfasta_extend.ma.35.6 file, what does

>1_17_829_F3,220_-79.6.21
T13100202312110020020101102011303111

mean? I saw some documents that say it should be
>TAG_ID,LOCATION,MISMATCHES.

So 1_17_829_F3 is the TAG_ID.
Is 6 the number of mismatches? But how do I decode the location part?

Thanks.
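
For what it's worth, here is a small sketch that only splits such a header into its pieces. The splitting follows the >TAG_ID,LOCATION,MISMATCHES layout quoted above; reading the location as reference index plus signed position (negative meaning reverse strand) is an assumption based on the usual SOLiD .ma convention, not something confirmed for rna2map. The trailing 21 is not explained anywhere in this thread and is left uninterpreted.

```python
def split_ma_header(header):
    """Split a '>TAG_ID,HIT' header into tag ID, location, and extra fields.

    The hit is split on dots: the first piece is treated as the location and
    the remaining pieces are returned as-is, without interpretation.
    """
    tag_id, hit = header.lstrip(">").split(",", 1)
    location, *extra_fields = hit.split(".")
    return tag_id, location, extra_fields

tag_id, location, extra = split_ma_header(">1_17_829_F3,220_-79.6.21")
# Assumption: location is "<reference index>_<signed position>".
ref_index, signed_position = location.split("_", 1)

print(tag_id)                      # tag ID
print(ref_index, signed_position)  # assumed reference index and position
print(extra)                       # remaining fields, meaning not confirmed
```

Under that assumption the hit would be on reference sequence 220 at position 79 on the reverse strand, with 6 as the mismatch count.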