Hi
I am a total newbie when it comes to working with next-gen sequencing data and only have limited bioinformatics skills, so please bear with me...
About the project:
I have several bacterial strains with increased thermal tolerance. They were generated through several rounds of random mutagenesis (chemical mutagens & UV irradiation) and selective growth. The first of these (monoclonal) strains has now been sequenced on an Ion Torrent PGM (314 chip), with the rest to follow shortly. The run itself had some technical issues and will be repeated; this (and the sequencing of the other strains) is on hold until the issues can be resolved.
I do however have about 6x coverage (20x was expected), which I want to use to generate some preliminary results and to establish the workflow for the data analysis.
What I have tried so far:
For the first approach, reads were mapped to the wt reference with GS Mapper, and SNPs were then called with samtools.
After some tests to remove false positives caused by homopolymers, I got good results using the following commands:
Code:
samtools mpileup -d 10000 -L 1000 -Q 7 -h 50 -o 10 -e 17 -m 4 -uf Reference.fna IonTorrentContigs.bam | bcftools view -bvcg - > var-woh.raw.bcf
bcftools view var-woh.raw.bcf | vcfutils.pl varFilter -D100 > var-woh.flt.vcf
The output looks good and does not appear to contain false SNPs caused by homopolymers. However, I am not merely interested in which positions contain mutations, but in determining the affected genes. So I wrote a quick Perl script to parse the output file and look up which genes are affected based on the position of each mutation (and also to classify them by function).
The final list contained some 280 mutations in various genes.
However, some relevant information is still missing, such as which amino acids were exchanged (if any!). I could add this functionality to my script, but I do not wish to reinvent the wheel, especially if good tools with the needed functionality are available.
I asked a friend with some experience with next-gen sequencing data, and he suggested MIRA, as it assigns genes to mutations, checks for amino acid exchanges, and has nice output options, such as an HTML file.
I assembled again with MIRA and generated the output files:
Code:
mira --project=c5k --job=mapping,genome,accurate,iontor -AS:nop=1 -SB:bsn=DH10B_wt:bft=gbf:bbq=30 IONTOR_SETTINGS -ASSEMBLY:mrpc=100 -SB:ads=yes:dsn=DH10B_mut COMMON_SETTINGS -GENERAL:not=4 | tee log_assembly.txt
convert_project -f caf -t asnp c5k_out.caf output
convert_project -f caf -t hsnp c5k_out.caf output_html
The format of the output is nice (it includes the amino acid exchange, for example), but it contains a ton of false positives. Is there an easy way to filter this data with a parameter when generating the output? I didn't find anything in the MIRA documentation...
Or was my first approach better? Maybe I should be using different tools altogether?
Any help or even a nudge in the right direction would be much appreciated.
Cheers,
Uli