Seqanswers Leaderboard Ad

**nilshomer** · 04-21-2012, 08:23 AM

Originally posted by chintanspy View Post

I will try and let you know the output.

The following reads map on to opposite strand
CCCAAAACGTTTTTTCAGATTTTGCATTTT
TCCCAAAACGTTTTTTCAGATTTTGCATTT
TTCCCAAAACGTTTTTTCAGATTTTGCATT
TTTCCCAAAACGTTTTTTCAGATTTTGCAT
TTTTCCCAAAACGTTTTTTCAGATTTTGCA
TTTTTCCCAAAACGTTTTTTCAGATTTTGC
TTTTTTCCCAAAACGTTTTTTCAGATTTTG
TTTTTTTCCCAAAACGTTTTTTCAGATTTT
TTTTTTTTCCCAAAACGTTTTTTCAGATTT
TTTTTTTTTCCCAAAACGTTTTTTCAGATT
TTTTTTTTTTCCCAAAACGTTTTTTCAGAT

Can you show the pairs in FASTQ format? I need to see the read names as well.

**chintanspy** · 04-22-2012, 08:43 PM

Hi nils,

I tried the latest code with commit "6c33e0e5c64c816a5b95c7eac155eef2b4b8155c".
I have attached the outputs of the old and latest code

each zip file consists of 2 files:
1. *.bwa.read1.fastq
2. *.sam

I am still getting the same output as the previous code.

Attached Files

**nilshomer** · 04-23-2012, 10:43 PM

You attached only the first read in each pair. The second end is found in "test_new_v.bwa.read2.fastq".

The "-S" option controls how pairs are related (same strand/opposite strand). This doesn't preclude the individual reads to mapping to the reverse strand. Perhaps this is where the confusion lies?

**chintanspy** · 04-23-2012, 11:25 PM

Originally posted by nilshomer View Post

You attached only the first read in each pair. The second end is found in "test_new_v.bwa.read2.fastq".

The "-S" option controls how pairs are related (same strand/opposite strand). This doesn't preclude the individual reads to mapping to the reverse strand. Perhaps this is where the confusion lies?

Yes, you are right...mapping is done on both the strands.

my *.bwa.read2.fastq file is empty as I had used "-2 0".

So "-S" option says "default (opposite strand for Illumina, same strand for SOLiD/Ion Torrent)". For SOLiD/Ion Torrent, reads have to be from same strand, right?

The command that I used is
dwgsim -1 30 -e 0 -E 0 -r 0 -X 0 -y 0 -R 0 -2 0 -c 2 -B -f ATGC test.fa ionData/test

From the above command I expect
1. no errors
2. no mutations (no indels)
3. no reads from opposite strand

I get reads without errors and mutations but there are reads from opposite strand.
My question is why am I getting the reads from negative strand ?

**nilshomer** · 04-23-2012, 11:42 PM

You are getting reads mapping to the negative strand since the simulator draws reads from either strand. The "-S" option does not affect this behavior. Again, there is no way to specify to have individual templates maps only to one strand or the other. This is drawn randomly (0.5 probability).

The "-S" option specifies the strand relationship between the first end and second end. In this case you have no second end ("-2 0"), so this option has no effect. In the case you had "-2 100" or the like, the "-S" option would say are the first end and second end always on the same strand? For example, if the first end maps to the reverse strand then with "-S 1" the second end would also map to the reverse strand. Perhaps you should search the forum on the meaning of paired ends and mate pairs, since they are used in the description of "-S 1" and "-S 2".

**chintanspy** · 04-24-2012, 12:29 AM

I got the term 'same strand' misinterpreted. I ignored the 'mate-pair' term.

Now its clear.

Thanks a lot for answering patiently.

**chintanspy** · 04-25-2012, 01:23 AM

Evaluating mapping

In 'Evaluating mapping - dwgsim_eval' there are columns without "single quote" (i.e. mc, mi, mu, um, uu ) and with "single quote" (i.e mc', mi', mu', um', uu')

I saw that these are cumulative values used for calculating sensitivity, FDR, positive predictive value.

At a given threshold,

sensitivity = True Positives(TP) / True Positives(TP) + False Negatives(FN)

so TP = mc , right ?
FN = um , right ?

can you please elaborate on the fields with "single quote" ?

**nilshomer** · 04-26-2012, 08:05 AM

Originally posted by chintanspy View Post

In 'Evaluating mapping - dwgsim_eval' there are columns without "single quote" (i.e. mc, mi, mu, um, uu ) and with "single quote" (i.e mc', mi', mu', um', uu')

I saw that these are cumulative values used for calculating sensitivity, FDR, positive predictive value.

At a given threshold,

sensitivity = True Positives(TP) / True Positives(TP) + False Negatives(FN)

so TP = mc , right ?
FN = um , right ?

can you please elaborate on the fields with "single quote" ?

Let me know if the documentation answers the question. If not, lets work on improving the documentation.

dnaa

http://sourceforge.net/apps/mediawiki/dnaa/index.php?title=Whole_Genome_Simulation#Output_Table

Download dnaa for free. DNAA is the DNA analysis package, for analyzing next-generation post-alignment whole genome resequencing data. Specifically, DNAA is able to find structural variation, SNP and indel variants, as well as evaluating the mapping and data quality.

**chintanspy** · 04-26-2012, 09:12 PM

Originally posted by nilshomer View Post

Let me know if the documentation answers the question. If not, lets work on improving the documentation.
http://sourceforge.net/apps/mediawik...n#Output_Table

There is an error on the link you provided

"There was an error processing your request ...

Group not found for name: dnaa "

**nilshomer** · 04-26-2012, 09:17 PM

Originally posted by chintanspy View Post

There is an error on the link you provided

"There was an error processing your request ...

Group not found for name: dnaa "

Looks like sourceforge is having problems, here's another link:

Home

https://github.com/nh13/DWGSIM/wiki/Evaluating-Mappings-from-DWGSIM

Whole Genome Simulator for Next-Generation Sequencing - nh13/DWGSIM

**chintanspy** · 04-26-2012, 09:48 PM

My understanding :

True Positive (TP) Number of reads mapped correctly that should be mapped
False Positive (FP) Number of reads mapped incorrectly that should be mapped at simulation co-ordinate
False Positive (FP) Number of reads mapped that should be unmapped
False Negative (FN) Number of reads unmapped that should be mapped
True Negative (TN) Number of reads unmapped that should be unmapped

Now if I want to calculate sensitivity and specificity at given threshold (for example, at MAPQ 37)

I would calulate:

sensitivity at given threshold = mc / (mc + mu) = TP / (TP + FN)

specificity at given threshold = uu / (mi + um + uu) = TN / (FP + TN)

But, your formula for sensitivity at the threshold is (mc / (mc' + mi' + mu'))

So, I am not sure how much I am correct ?

**nilshomer** · 04-26-2012, 10:00 PM

These values don't fall into TP/FP categories, as you would need four values for a contingency table. We have five, with "mi" not able to be grouped with any other. For example, "mi" is not a false positive. If it was a false positive, that means it should have been a negative. But wait, a negative means it should not be able to be mapped, but in fact, the definition of "mi" is that it should be mapped. This is the critical issue. On the other hand, it is not a false negative, since it was mapped. The issue is that it lies in between: it was mapped (positive), just not to the correct location (so not a true positive).

Again, we can't associate "mi" with a group in the contingency table.

Sensitivity should always be about how often do you map correctly when the read should have mapped correctly.

Specificity should be similar to the positive predictive value, namely, how often do you map it correctly when you map a read.

Edit: I wanted to say that it took me a while to reconcile this fact too, and you can see that in the git repository log, where I changed my definitions to come to this final conclusion during the course of the development of this software.

**chintanspy** · 04-26-2012, 11:16 PM

Originally posted by nilshomer View Post

These values don't fall into TP/FP categories, as you would need four values for a contingency table. We have five, with "mi" not able to be grouped with any other. For example, "mi" is not a false positive. If it was a false positive, that means it should have been a negative. But wait, a negative means it should not be able to be mapped, but in fact, the definition of "mi" is that it should be mapped. This is the critical issue. On the other hand, it is not a false negative, since it was mapped. The issue is that it lies in between: it was mapped (positive), just not to the correct location (so not a true positive).

Again, we can't associate "mi" with a group in the contingency table.

I agree with your explanation. There should be someway to resolve this. If I could have the counts of the read with same sequence will that be helpful ? So we could have probability the read 'X' could map to its simulation position with probability 'n' . But I guess it will be complicated. And it will be same case while identifying variations, right?

My other question was why mc', mi', mu' have be used for calculating sensitivity at the threshold instead of mc, mi, mu ?

**nilshomer** · 04-27-2012, 07:06 AM

Originally posted by chintanspy View Post

My other question was why mc', mi', mu' have be used for calculating sensitivity at the threshold instead of mc, mi, mu ?

This is another subtlety, how do we actually know at which threshold a read should map? The aligner is the one that determines the mapping quality, but I don't know of a good way to have the "true" mapping quality.

**NextGenSeb** · 05-10-2012, 01:24 AM

Hi Nils,
just stumbled across dwgsim in my search for a read simulator, love the fact that I can generate Illumina, Solid and Ion-torrent on the same tool as I have data from all platforms. Nice touch as well on including the galaxy wrappers

At the risk of sounding ungrateful, is there an ETA on the dwgsim_pileup_eval.pl? You'd solve all my problems in one go.
Thanks for sharing,
Cheers
Seb

Topics	Statistics	Last Post
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 59 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 57 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 51 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM
Evolutionary Dynamics of Centromeres: A Comparative Genomic Analysis by seqadmin Started by seqadmin, 04-04-2024, 09:00 AM	0 responses 56 views 0 likes	Last Post by seqadmin 04-04-2024, 09:00 AM

Seqanswers Leaderboard Ad

Announcement

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News