**lh3** · 02-16-2012, 05:04 AM

I think you should map your reads to the assembly and then do SNP calling. SAMtools should in principle work, but I have not tried.
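A minimal sketch of the map-then-call route suggested above. It only builds the command lines and does not run them. The choice of `bwa mem` as the mapper, the file names, and the helper name `mapping_snp_commands` are assumptions for illustration; in current releases the variant-calling step that used to ship with SAMtools lives in `bcftools`.

```python
import shlex


def mapping_snp_commands(assembly, reads_1, reads_2, sample):
    """Build shell commands: index the draft assembly, map reads, call SNPs.

    All file names are hypothetical. Nothing is executed here; the
    strings are meant to be inspected or passed to a shell by the caller.
    """
    q = shlex.quote
    bam = f"{sample}.sorted.bam"
    vcf = f"{sample}.vcf.gz"
    return [
        # Index the scaffolds/contigs so the mapper can use them as a reference.
        f"bwa index {q(assembly)}",
        # Map paired-end reads and coordinate-sort the alignments.
        f"bwa mem {q(assembly)} {q(reads_1)} {q(reads_2)}"
        f" | samtools sort -o {q(bam)} -",
        f"samtools index {q(bam)}",
        # Pile up against the assembly and emit variant sites only.
        f"bcftools mpileup -f {q(assembly)} {q(bam)}"
        f" | bcftools call -mv -Oz -o {q(vcf)}",
    ]


if __name__ == "__main__":
    for cmd in mapping_snp_commands("scaffolds.fa", "ind01_1.fq", "ind01_2.fq", "ind01"):
        print(cmd)
```

With a highly fragmented assembly, reads spanning scaffold ends will map poorly, so calls near scaffold edges deserve extra scrutiny.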

**Zam** · 02-16-2012, 05:17 AM

Re Cortex:
1. You have much more than 30x coverage if you have many samples at 20x
2. It's not as simple as "you need 30x" for Cortex. But you are absolutely right that an assembly approach will be less sensitive to SNPs.

Re what to do
- it depends what you want to achieve. Do you want a conservative small set of SNPs for building a genetic map, or a big sensitive set for some other purpose?

If you have the time, then try both methods (mapping/assembly) and compare. If you are doing population genetic studies, then experience suggests that you will need to be very careful with SNP calls based on an assembly that is not high quality, as it is easy for assembly artefacts to look like interesting scientific finds in your SNPs.

**fcr** · 02-16-2012, 07:10 AM

Hi Zam,

Thanks a lot.

Cortex:
1. True, the distribution of coverage will include regions above 30x.
2. What are the computational needs for 10 individuals with a 2.9 Gbp genome? You stated "10 humans on a 256Gb RAM server". How long does this take? Would it be possible to call SNPs with less RAM?

What to do:
This is a 60X coverage genome. I would assume that many of the scaffolds are bona fide and that many of the changes (adding more libraries) are going to affect mainly the connections among scaffolds rather than disrupt them... but I might be wrong and shouldn't guess. The main interests are: 1. develop a genome-wide set of markers and 2. do some population inferences by estimating Fst, Pi and Ne.

So you think it is too risky to use scaffolds?

Cheers,
Fernando
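For readers less familiar with the statistics mentioned in this post, here is a rough sketch of per-site nucleotide diversity (Pi) and a Hudson-style Fst computed from allele counts. The function names and the input layout are assumptions for illustration, and the Fst part assumes the samples can be split into two populations, which the thread does not state. With 10 diploid individuals, `n_chrom` would be 20.

```python
def nucleotide_diversity(alt_count, n_chrom):
    """Unbiased per-site heterozygosity: 2*p*(1-p) * n/(n-1)."""
    p = alt_count / n_chrom
    return 2.0 * p * (1.0 - p) * n_chrom / (n_chrom - 1)


def hudson_fst(sites):
    """Hudson-style Fst as a ratio of averages over sites.

    Each site is a tuple (alt_1, n_1, alt_2, n_2): alternate-allele count
    and number of sampled chromosomes in population 1 and population 2.
    Sites must be polymorphic overall, otherwise the denominator is zero.
    """
    numerator = 0.0
    denominator = 0.0
    for alt_1, n_1, alt_2, n_2 in sites:
        p1 = alt_1 / n_1
        p2 = alt_2 / n_2
        # Squared frequency difference, corrected for finite sample size.
        numerator += (
            (p1 - p2) ** 2
            - p1 * (1.0 - p1) / (n_1 - 1)
            - p2 * (1.0 - p2) / (n_2 - 1)
        )
        denominator += p1 * (1.0 - p2) + p2 * (1.0 - p1)
    return numerator / denominator
```

Both quantities are computed from called genotypes, so systematic miscalls caused by misassembled regions feed directly into them.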

**Zam** · 02-16-2012, 07:26 AM

Hi Fernando

>True, the distribution of coverage will include regions above 30x.
One of the examples in our paper is of SNP calling in 10 samples each sampled to 6x, for example.

2. Actually, you could call on 10 individuals with much less than 256Gb of RAM. You need 256Gb to hold ALL of their genomes at the same time. But lots of the genome is either monomorphic, or doesn't consist of things Cortex can call. So you could do those 10 samples in ~80Gb of RAM (for comparison I've just done 85 humans in 320 Gb of RAM).
The trick is to call on the joint graph (1 colour, probably needs 80Gb RAM) and then pull out just the variants and make a graph just of the variants. Then "multicolourise" the graph and make a 10-colour graph of the variants only, and genotype everyone in that.
Uses far less memory.
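A toy illustration of why the variants-only multi-colour graph is so much smaller. This is not Cortex's algorithm: Cortex finds variants by walking the joint graph, whereas this sketch simply drops k-mers present in every sample and keeps a per-sample presence tuple (one "colour" per sample) for the rest. The function names and example sequences are made up.

```python
def kmers(seq, k):
    """Return the set of k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def variant_only_colours(samples, k):
    """Map each non-monomorphic k-mer to a presence tuple, one entry per sample.

    samples: dict of sample name -> sequence (toy stand-in for reads).
    Colours are ordered by sorted sample name.
    """
    names = sorted(samples)
    per_sample = {name: kmers(samples[name], k) for name in names}
    # One-colour "joint" set: every k-mer seen in any sample, no per-sample data.
    joint = set().union(*per_sample.values())
    # k-mers shared by all samples carry no variation and are dropped.
    monomorphic = set.intersection(*per_sample.values())
    return {
        kmer: tuple(int(kmer in per_sample[name]) for name in names)
        for kmer in joint - monomorphic
    }


if __name__ == "__main__":
    toy = {"a": "ACGTACGGA", "b": "ACGTTCGGA"}  # differ at one base
    for kmer, colours in sorted(variant_only_colours(toy, 3).items()):
        print(kmer, colours)
```

Only the k-mers around the differing base need per-sample colours; everything shared is stored once or not at all, which is where the memory saving comes from.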

How much coverage do your 10 samples have? Is the 60x individual a different sample?

I'm not saying it is too risky with scaffolds, just that if you find something really exciting, you need to do some work making sure it's not an artefact. I've seen people have to work very hard to avoid problems with the chimp genome.

best

Zam

**lh3** · 02-16-2012, 07:33 AM

With 60X, you should be able to get an assembly decent enough for most analyses. This is true for human. Nonetheless, Zam is right that misassembly may cause artifacts. You have to live with it. If you are careful enough, you can greatly reduce the effect of that. Also beware that there will be reference bias when estimating population statistics (i.e. individuals closer to the reference will be mapped better).

**Zam** · 02-16-2012, 07:37 AM

Just to clarify one thing (and agree with Heng) - my understanding is that Fernando doesn't want to have to wait until his assembly is finished (I mean done/completed, not finished by manual finishers), and wants to get on with it and start calling now. That's what got me nervous about artefacts.

**fcr** · 02-17-2012, 08:56 AM

Hi,

Yes, Zam got it right. I want to start calling SNPs now. The assembly is unfinished and it's going to take time polishing it (~1000,000 scaffolds now). In response to Zam, the assembly is based on an individual, and the estimated coverage is 60X.

The other 10 individuals have 20X coverage and I want to use them for SNP calling and perhaps "pilot" genotype calling. I think it is worth advancing on this, even if in the future a second calling based on the assembly will help to verify/reject candidate regions of interest.

lh3: Thanks for your comment about the reference bias when estimating the population statistics...I will keep that in mind.

Cheers,
Fernando

**rururara** · 09-21-2012, 01:35 AM

De novo SNP calling in absence of complete reference assembly

Hi all,

What about an incomplete reference genome, like papaya's? The available information for papaya is scaffolds and contigs. Is it possible to use the papaya scaffolds as a reference to align my reads against? In my case, the objective is to discover SNPs.

**Zam** · 09-21-2012, 01:49 AM

Hi Rururara
Are you working on the same project as Fernando or a different one? If different, how many samples are you trying to discover SNPs in, what are their depths of coverage, and with what technology? Finally, sorry for my ignorance, but what is the ploidy of papaya?
regards
Zam

**fcr** · 09-21-2012, 01:57 AM

Hi Zam,

Rururara is not working on the same project as me. If papaya is a diploid, he could probably use the papaya scaffolds with the "Coordinates Only" option during the calling with cortex_var (actually an accompanying script called runcalls.pl). Right?

Cheers,
Fernando

**Zam** · 09-21-2012, 02:06 AM

Yes, and to explain that in more detail:
Rururara:

1. If you have one diploid sample you can de novo discover variants using Cortex, and then use your contigs/scaffolds to assign them coordinates. This is what Fernando meant by "CoordinatesOnly", an option for Cortex's new wrapper script.

2. If you have several samples, then you can do two things
a) You can also use the Cortex "population filter" to classify putative variants as repeat/error/polymorphism - this method is robust to reference assembly errors - it catches missing or collapsed repeats in the reference - and this will give you a high quality set of variants
b) you could use this method to look into the quality of the reference and annotate regions which you trust and do not trust.
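To make "use your contigs/scaffolds to assign them coordinates" concrete, here is a deliberately naive sketch. It places a variant by exact-matching its 5' flank against the scaffolds and only accepts a unique hit. This is an assumption-laden toy, not what the Cortex wrapper script does internally; real pipelines align flanks with a read mapper that tolerates mismatches. The function name and sequences are hypothetical.

```python
def assign_coordinates(flank, scaffolds):
    """Place a variant using its 5' flank.

    Returns (scaffold_name, position), where position is the 1-based
    coordinate of the first base after the flank, or None when the flank
    is absent or occurs more than once (ambiguous, e.g. inside a repeat).
    """
    hits = []
    for name, seq in scaffolds.items():
        start = seq.find(flank)
        while start != -1:
            hits.append((name, start + len(flank) + 1))
            # Continue searching one base on, so overlapping hits are counted.
            start = seq.find(flank, start + 1)
    return hits[0] if len(hits) == 1 else None


if __name__ == "__main__":
    scaffolds = {"scaffold_1": "TTTTACGTACCGGA", "scaffold_2": "GGGGCCCC"}
    print(assign_coordinates("ACGTACC", scaffolds))
```

Rejecting multi-hit flanks is a crude way to avoid placing variants in repeats, which is also where draft assemblies are least reliable.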

Zam

**rururara** · 09-21-2012, 02:11 AM

Hi Zam & fcr,

Yup, we are not in the same team. Hehe. Papaya is diploid. I have 3 samples and one of the samples is a parental line. I'm not sure yet of the depth of coverage as I have not received any sequencing information from the company yet, but soon I will. Papaya was sequenced using the HiSeq platform.

**Zam** · 09-21-2012, 02:26 AM

Hi there- when you say one of the samples is parental, does that mean you have two parents and 2 F1 samples, and you have sequenced one parent and both progeny?
Zam

**rururara** · 09-21-2012, 02:31 AM

Definitely yes. Is there any concern about that? Do you mind sharing? Anyway, I would like to try this approach, whereby I assemble the parental reads with the scaffolds and use that as a reference sequence to align the other two progeny against. What do you think?
