SEQanswers

Old 10-07-2011, 08:55 AM   #1
Peitx
Junior Member
 
Location: Spain

Join Date: Jan 2011
Posts: 3
GSMapper trimming

Hello everyone,

I have some SNP-containing sequences, obtained some months ago from a fish species, and we have now obtained a genome from a species close to our fish. I'm using GSmapper to map them to that genome, but I don't know why the program drops 6 of our 15 sequences from the analysis. I've specified that I don't want a trimming step, so I can't understand why the program is doing this. The documentation didn't help either.

It's a very silly question, but I can't find the solution. Any help from experienced people?

Thanks in advance!
Old 10-07-2011, 11:15 PM   #2
ketan_bnf
Member
 
Location: India

Join Date: Oct 2010
Posts: 59

Hi Peitx,

The sequences may be of low quality and/or too short (<20 bp by default). Not all sequences will necessarily be used for mapping to the genome.

Regards,
Old 10-09-2011, 08:43 AM   #3
sklages
Senior Member
 
Location: Berlin, DE

Join Date: May 2008
Posts: 628

Have a look at 454ReadStatus.txt

Read            Mapping   Mapped       % of Read  Ref    Ref        Ref        Ref
Accno           Status    Accuracy(%)  Mapped     Accno  Start      Stop       Strand
G5FF2WU01DTSD6  Full      95           100        chr2   227896723  227896852  -
G5FF2WU01CKAXT  Full      97           100        chr10  73453619   73453688   +
G5FF2WU01BP3ZV  Full      98           100        chr12  48373154   48373213   +
G5FF2WU01CMIB1  Full      99           100        chr14  76948381   76948530   -
G5FF2WU01ARMHW  TooShort
G5FF2WU01EVYYN  Repeat
G5FF2WU01EL8WA  Repeat
[...]


It should at least answer your question why your reads are not mapped.
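
For a larger run, the status column of 454ReadStatus.txt can be tallied with a few lines of Python. This is only a sketch: it assumes the whitespace-separated layout shown above (two header lines, status in the second column), and `tally_read_status` and `example_lines` are names made up for this example, not part of the Roche software.

```python
from collections import Counter


def tally_read_status(lines):
    """Count reads per mapping status.

    Assumes whitespace-separated columns with the read accession first
    and the mapping status second, as in the excerpt above.
    """
    counts = Counter()
    for line in lines:
        fields = line.split()
        # Skip blank lines, the "[...]" marker and the two header lines.
        if len(fields) < 2 or fields[0] in ("Read", "Accno"):
            continue
        counts[fields[1]] += 1
    return counts


# The excerpt from the post above, used as sample input.
example_lines = """\
Read Mapping Mapped % of Read Ref Ref Ref
Accno Status Accuracy(%) Mapped Accno Start Stop Strand
G5FF2WU01DTSD6 Full 95 100 chr2 227896723 227896852 -
G5FF2WU01CKAXT Full 97 100 chr10 73453619 73453688 +
G5FF2WU01BP3ZV Full 98 100 chr12 48373154 48373213 +
G5FF2WU01CMIB1 Full 99 100 chr14 76948381 76948530 -
G5FF2WU01ARMHW TooShort
G5FF2WU01EVYYN Repeat
G5FF2WU01EL8WA Repeat
[...]
""".splitlines()

for status, n in sorted(tally_read_status(example_lines).items()):
    print(status, n)
```

With a real file, the same function could be fed `open("454ReadStatus.txt")` instead of the sample lines.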

cheers,
Sven
Old 10-10-2011, 11:44 AM   #4
Peitx
Junior Member
 
Location: Spain

Join Date: Jan 2011
Posts: 3

Thanks to both of you for the replies.

Ketan, I'm sure that the length is more than 20 bp (the minimum is 146, as you can see below). I don't have quality scores, but it seems that this is not a problem, because the program accepts the sequences (and I'm not interested in variant calling).


Accno             Trimpoints Used  Used Trimmed Length  Orig Trimpoints  Orig Trimmed Length  Raw Length
ADR_F.            1-146            146                  1-146            146                  146
ATROP_F.          151-151          1                    1-341            341                  341
CITOCHROME-C_F.   105-105          1                    1-280            280                  280
CITRATO3.         42-42            1                    1-645            645                  645
CITRATO5.         43-43            1                    1-551            551                  551
GNRH3-1_F.        1-197            197                  1-197            197                  197
HGFL_R.           1-558            558                  1-558            558                  558
HIF2-3_F.         20-20            1                    1-575            575                  575
INTFGP_F.         1-307            307                  1-307            307                  307
INTRAOPCO2_F.     1-306            306                  1-306            306                  306
L12_F.            1-551            551                  1-551            551                  551
LACDB_F.          37-37            1                    1-368            368                  368
LYS2_F.           1-591            591                  1-591            591                  591
MTF_F.            1-636            636                  1-636            636                  636
S7-2_F.           1-605            605                  1-605            605                  605

I've checked the positions where the trimming is applied, and in some cases I've found an IUPAC ambiguity code (e.g. Y). In other sequences the problem is an N nucleotide. The cause differs from one sequence to another, so I can't arrive at a single explanation. I've been searching the documentation for these issues, but without success...

sklages, this is my file:
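
One way to compare the trim points with the ambiguity codes systematically is to list every non-ACGT position in each sequence and see whether the reported trim point (e.g. 151 for ATROP_F.) is among them. The sketch below only does that listing; it does not show that gsMapper trims at such positions. The function name and the short test sequence are made up for illustration.

```python
# IUPAC nucleotide codes other than the four unambiguous bases.
IUPAC_AMBIGUOUS = set("RYSWKMBDHVN")


def ambiguous_positions(seq):
    """Return (1-based position, code) for every ambiguity code in seq."""
    return [
        (pos, base)
        for pos, base in enumerate(seq.upper(), start=1)
        if base in IUPAC_AMBIGUOUS
    ]


# Made-up sequence, not one of the sequences from this thread.
print(ambiguous_positions("ACGTYACGNA"))
```

Positions are reported 1-based so that they can be compared directly with the trim points in the table.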

Read           Mapping   Mapped       % of Read  Ref                            Ref    Ref   Ref
Accno          Status    Accuracy(%)  Mapped     Accno                          Start  Stop  Strand
ADR_F.         Unmapped
GNRH3-1_F.     Unmapped
HGFL_R.        Repeat
INTFGP_F.      Unmapped
INTRAOPCO2_F.  Unmapped
L12_F.         Partial   94           99         clc_genomicrefv1_contig102970  4336   4884  +
LYS2_F.        Unmapped
MTF_F.         Unmapped
S7-2_F.        Full      94           100        clc_genomicrefv1_contig88520   6032   6633  +

As you can see, most of the reads are unmapped, but my problem is that some reads are trimmed and I don't know why.

I've tried mapping using only 40 bp up- and downstream of the SNP (to avoid IUPAC nucleotides and to check for different mappings) and I've found differences:

ADR_F.           Unmapped
ATROP_F.         Unmapped
CITOCHROME-C_F.  Full      93  100  clc_genomicrefv1_contig152775  2837  2917  +
CITRATO3.        Unmapped
CITRATO5.        Unmapped
GNRH3-1_F.       Unmapped
HGFL_R.          Unmapped
HIF2-3_F.        Unmapped
INTFGP_F.        Unmapped
INTRAOPCO2_F.    Unmapped
L12_F.           Full      99  100  clc_genomicrefv1_contig102970  4717  4797  +
LACDB_F.         Unmapped
LYS2_F.          Unmapped
MTF_F.           Unmapped
S7-2_F.          Full      96  100  clc_genomicrefv1_contig88520   6304  6384  +

Now all the sequences are accepted and I get one more mapped sequence! Do you know what is happening? I know that the sequences are too long in the first case, but I also thought that the mapper would "split" the sequences into smaller parts, using the seed value. Am I wrong? This would definitely clear up some of my doubts...

Thanks for helping with these silly questions. I'm new to this field and I want to learn.
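
For reference, the 40 bp up- and downstream window described above can be cut out roughly as follows. This is a sketch under assumptions: the SNP position is taken to be known and 1-based, the window is simply clipped at the sequence ends, and `snp_window` is a hypothetical helper name. With full flanks the window is 81 bp long, which matches the spans of the three mapped hits in the listing (e.g. 2837 to 2917).

```python
def snp_window(seq, snp_pos, flank=40):
    """Return the SNP base plus up to `flank` bases on each side.

    snp_pos is 1-based; the window is clipped at the sequence ends.
    """
    if not 1 <= snp_pos <= len(seq):
        raise ValueError("snp_pos lies outside the sequence")
    start = max(0, snp_pos - 1 - flank)   # 0-based, inclusive
    end = min(len(seq), snp_pos + flank)  # 0-based, exclusive
    return seq[start:end]


# Made-up sequence with an ambiguity code (Y) at position 101.
toy = "A" * 100 + "Y" + "C" * 100
window = snp_window(toy, 101)
print(len(window), window[40])
```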
Old 10-10-2011, 11:58 AM   #5
sklages
Senior Member
 
Location: Berlin, DE

Join Date: May 2008
Posts: 628

These are pre-assembled contigs, not reads. gsMapper won't split the large sequences into smaller chunks.
What are you mapping against? A finished (contiguous) genome or a draft (multiple contigs)? Why don't you map your reads directly against your reference genome instead of pre-assembling and mapping afterwards?

Maybe you should give blast or blat a try (you don't have too many contigs) for mapping/positioning your contigs on your reference.

my 2p,
Sven
Old 10-10-2011, 12:14 PM   #6
Peitx
Junior Member
 
Location: Spain

Join Date: Jan 2011
Posts: 3

Sorry for the misinformation, sklages.

What I'm trying to map are Sanger sequences, not NGS reads, to a draft genome (constructed by HiSeq sequencing + assembly). So, as I suspected, the reads are too long for mapping and are definitely not split. Now my approach of using the 40 bp up- and downstream makes more sense.

I'll also try blat, but I have to install it and I have no experience with it. Do you think it is worth it after doing the 80 bp approach, taking into account that my only objective is to find out whether my sequences are in the reference genome?

Can you give me your address so I can send you some cookies for the help? :P
Old 10-10-2011, 12:23 PM   #7
sklages
Senior Member
 
Location: Berlin, DE

Join Date: May 2008
Posts: 628

OK, Sanger reads on CLC-assembled contigs ... as you don't have any NGS reads, there is no need to use gsMapper. 'Blast'/'Blast+' should do the job for your handful of sequences; have a look at NCBI's software archive. You could also use 'blat' (have a look at UCSC) or even CLC Genomics WB, if you have access to that software (which is commercial).
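
If the BLAST+ route is taken with tabular output (`-outfmt 6`), deciding whether each sequence is present in the reference comes down to picking its best hit. A minimal sketch, assuming the default 12 tab-separated columns of that format (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore); `best_hits` is a made-up name and the two sample lines are invented, not real results for these sequences.

```python
def best_hits(tabular_lines):
    """Keep the highest-bitscore hit per query from BLAST tabular output."""
    best = {}
    for line in tabular_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.rstrip("\n").split("\t")
        query, subject = fields[0], fields[1]
        pident, bitscore = float(fields[2]), float(fields[11])
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, pident, bitscore)
    return best


# Invented example lines in the 12-column layout.
sample = [
    "query1\tcontigA\t93.0\t81\t5\t0\t1\t81\t500\t580\t1e-25\t110",
    "query1\tcontigB\t99.0\t81\t1\t0\t1\t81\t900\t980\t1e-35\t145",
]
print(best_hits(sample))
```

Queries that never appear in the output have no hit at all, so comparing the keys of the result with the list of input sequence names shows which ones are missing from the reference.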

Do you have a usable N50 size of your genome assembly?
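
In case the assembler did not report it: N50 is the contig length such that contigs of that length or longer contain at least half of all assembled bases. It can be computed from the list of contig lengths as in this sketch (the function name and the toy lengths are only illustrative):

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths."""
    if not lengths:
        raise ValueError("no contig lengths given")
    total = sum(lengths)
    running = 0
    # Walk from the longest contig down until half the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Toy lengths, not taken from the assembly discussed here.
print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))
```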

cheers, Sven

All times are GMT -8.