Hi,
I hope this question has not already been posted here; at least I haven't found it. I would like to generate a kind of replicate dataset from an existing 454 Junior NGS dataset, based on known error rates. The problem is that my reads come from a project that aims to analyse the B-cell repertoire, which means I have no real reference. (For those who are not familiar, in short: the reads essentially represent a combination of different genes that are mixed during an infection and additionally affected by mutations, so neither an alignment nor the generation of synthetic reads from a reference genome helps me.)
I would like to do something quite simple:
- take the reads I have
- build some kind of error model with the respective error rates
- generate a new NGS dataset that represents the original one, but modified according to those error rates.
Is there anything out there that can do this already? I had a look at ART, but I don't think it helps me here.
Alternatively, I could implement something myself... but then I was wondering how to practically apply, for example, an indel error rate of 0.38/100 bp. I have reads that are ~220 bp long --> error rate ≈ 0.84/read
I am wondering what I can do with this number... is it valid to say:
0.84 indel errors / read --> 84 indel error events / 100 reads
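In expectation, at least, this scaling should hold: with independent errors, the expected count is simply rate × length × number of reads. A quick sanity check with the figures above:

```python
indel_rate = 0.38 / 100   # indels per bp (454 Junior figure from above)
read_len   = 220          # approximate read length
n_reads    = 20_000       # approximate dataset size

per_read = indel_rate * read_len    # ~0.836 indels per read
total    = per_read * n_reads       # ~16,720 indel events in total
print(per_read, total)
```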
My dataset has ~20,000 reads, so I would have to perform ~16,700 indel events. I could select reads at random, check each for homopolymers (if there is more than one homopolymer, pick one at random), and then insert or delete, say, between 2 and 4 nt. Substitution errors I would also add randomly, at a very low rate. A rough sketch of what I have in mind is below.
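Here is a minimal Python sketch of that procedure. The homopolymer threshold (runs of ≥3), the 50/50 insertion/deletion split, and the substitution rate are all placeholder assumptions that would have to be replaced with measured 454 values:

```python
import random

BASES = "ACGT"
INDEL_RATE_PER_BP = 0.38 / 100   # indel rate from above
SUBST_RATE_PER_BP = 0.01 / 100   # assumed low substitution rate (placeholder)

def homopolymer_starts(read, min_run=3):
    """Return start indices of homopolymer runs (>= min_run identical bases)."""
    starts, i = [], 0
    while i < len(read):
        j = i
        while j < len(read) and read[j] == read[i]:
            j += 1
        if j - i >= min_run:
            starts.append(i)
        i = j
    return starts

def inject_indel(read):
    """Insert or delete 2-4 nt, preferring a homopolymer site if one exists."""
    sites = homopolymer_starts(read)
    pos = random.choice(sites) if sites else random.randrange(len(read))
    n = random.randint(2, 4)
    if random.random() < 0.5:
        # insertion: extend the base at pos (mimics a 454 homopolymer overcall)
        return read[:pos] + read[pos] * n + read[pos:]
    # deletion: drop n bases starting at pos (undercall)
    return read[:pos] + read[pos + n:]

def make_replicate(reads):
    """Return a copy of the dataset with indels and substitutions injected."""
    reads = list(reads)
    n_indels = round(INDEL_RATE_PER_BP * sum(len(r) for r in reads))  # ~16,700
    for _ in range(n_indels):
        i = random.randrange(len(reads))          # pick a read at random
        reads[i] = inject_indel(reads[i])
    # substitutions: independent per-base coin flip at the low rate
    for i, r in enumerate(reads):
        reads[i] = "".join(
            random.choice(BASES.replace(b, "")) if random.random() < SUBST_RATE_PER_BP
            else b
            for b in r
        )
    return reads
```

Instead of distributing a fixed total number of events, one could also draw each read's indel count from a Poisson distribution with mean rate × read length; for a rough robustness check the difference should be negligible.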
My aim is to get a rough idea of the robustness of my analysis (and to convince the biologist that a technical replicate or a control sample might make sense for his particular research question). I know this is not 100% correct, since my data already contain errors.
I would be happy about any suggestions.
Thanks in advance.