Seqanswers Leaderboard Ad

**koadman** · 11-29-2011, 05:41 PM

Getting a good assembly can be surprisingly challenging, even for bacteria. Downsampling your data is probably a good thing to keep compute requirements manageable but may cause you to lose coverage in regions of extreme GC content, resulting in a more fragmented assembly.

Our group has had good luck with IDBA, and we have cobbled together an assembly pipeline called A5 that you could try. The A5 pipeline is supposed to be able to do bacterial-size assemblies directly from sequencer output by automating the adapter trimming, quality filtering, error correction, contigging, scaffolding, and some assembly QC. See here for more info: http://code.google.com/p/ngopt/

It's still very much beta, and we could really use some feedback on whether it's working outside our lab. In our hands it's generating pretty good assemblies, as measured by comparison to a reference genome. On your data you would almost certainly have to downsample to get it to work with the A5 pipeline, unless you have some big iron to run it on.

**vamosia** · 11-29-2011, 06:32 PM

It seems to me that your depth is too high. Since errors accumulate, the more depth means more chances of introducing errors, hence breaking the assembly. I would suggest try lowing the depth and see if there is any improvements.

**Bgansw** · 11-29-2011, 08:38 PM

Thank you Koadman and vamosia. Yes, I'll reduce the dept and try again.

@Koadman: I've downloaded your A5 pipe line. Thanks. I'll keep you'll updated with the result.

Bgansw

**Bgansw** · 11-30-2011, 04:22 PM

Hi Koadman,

I've installed the A5 pipeline in my bin folder n have tried to install it.

The a5_pipeline.pl is blinking, and its highlighted in red n not working.

How do I rectify this?

**koadman** · 11-30-2011, 04:59 PM

blinking and highlighted in red? I don't know where to begin.

Are you using a mac? linux? is it blinking in your terminal window when you run `ls`? Or in some GUI file manager?

To start, it's not necessary to move any of the files in the ngopt pipeline, it can and should be run in place. If you move everything in a5's bin to a bin directory somewhere else (but still in your $PATH) it probably will run but will not be able to find the adapters fasta file used for adapter trimming. We should probably change this to look for adapters in $prefix/share/a5 to be more unixy.

To be specific, you should only have to do the following to get an assembly (instructions for linux):

Code:

wget http://ngopt.googlecode.com/files/ngopt_a5pipeline_linux-x64_20111128.tar.gz
tar xvzf ngopt_a5pipeline_linux-x64_20111128.tar.gz
ngopt_a5pipeline_linux-x64/bin/a5_pipeline.pl <read 1 fastq> <read 2 fastq> <my_assembly>

where <read 1 fastq> and <read 2 fastq> are the filesystem paths to your paired read fastq files (one file per read in a pair, do not include the <>! it is just a convention to denote a required argument) and my_assembly is the base name or path you want to give your assembly files.

Instructions for mac (which doesn't ship with wget's holy awesomeness):

Code:

curl http://ngopt.googlecode.com/files/ngopt_a5pipeline_macOS-x64_20111128.tar.gz > ngopt_a5pipeline_macOS-x64_20111128.tar.gz
tar xvzf ngopt_a5pipeline_macOS-x64_20111128.tar.gz
ngopt_a5pipeline_macOS-x64/bin/a5_pipeline.pl <read 1 fastq> <read 2 fastq> my_assembly

**Bgansw** · 12-04-2011, 09:07 PM

Hi Koadman,

The A5 pipeline is running on our cluster right now. So far all good!

I'll keep you updated with the results. Hopefully, it'll all run smoothly..

Thanx a ton.

**Bgansw** · 12-05-2011, 07:52 PM

Hi Koadman,

Am getting this error message, with the pipeline. I suspect that its because the new Hiseq data has been processed by casava 1.8, and generates reads in sanger format (also known as illumina 1.9 format). Most of our scripts were designed for reads in illumina 1.3 format (cause we got the 3 genomes sequenced using the GA2 and not the hiseq).

I had to convert the files to illumina 1.3 to get the pipeline to work on them. Have you used the A5_pipeline on Hiseq data before? If yes, do you know what format your hiseq data is in?

This is the error I'm getting. I've only pasted the last bit of the file. I've attached the file, "nohup.out" as well, so you can see what ran and where it failed. I would really really appreciate your help.

nohup.out
[bgansw@bioinfo ngopt_a5pipeline_linux-x64]$ head nohup.out
[a5] Found the following libraries:
raw1:
id=raw1
p1=8RS_CAGATC_L005_R1_illuminaformat_sed.fastq
p2=8RS_CAGATC_L005_R2_illumina_sed.fastq
[a5] Cleaning reads with SGA
148311540
[a5] Found 1 libraries
[a5] Starting pipeline at step 1
[a5] Cleaning reads with SGA
[bgansw@bioinfo ngopt_a5pipeline_linux-x64]$ tail nohup.out
at java.security.SecureClassLoader.defineClass(libgcj.so.7rh)
at java.net.URLClassLoader.findClass(libgcj.so.7rh)
at java.lang.ClassLoader.loadClass(libgcj.so.7rh)
at java.lang.ClassLoader.loadClass(libgcj.so.7rh)
at org.halophiles.assembly.InsertSizeExporter.main(InsertSizeExporter.java:36)
Use of uninitialized value in numeric gt (>) at ./bin/a5_pipeline.pl line 1205, <IN> line 55454.
Use of uninitialized value in concatenation (.) or string at ./bin/a5_pipeline.pl line 1211, <IN> line 55454.
[a5] Discarding estimate. Not enough data points:
Use of uninitialized value in multiplication (*) at ./bin/a5_pipeline.pl line 1212, <IN> line 55454.
[a5_preproc] Insert estimate for raw1 not reliable.[a5_preproc] No insert estimate for raw1. Please provide one in a library file. Exiting
[bgansw@bioinfo ngopt_a5pipeline_linux-x64]$

Attached Files

nohup.out.txt (13.1 KB, 23 views)

**koadman** · 12-06-2011, 02:43 PM

Thanks for sending the nohup.out, really helpful!
Reading through it I can see that your input files seem to be lacking the canonical /1 and /2 paired read identifiers in their names. a5 tries to add them since bwa insists on their presence when generating a paired-end sam file. a bit further down we see:

Warning, FASTQ quality string is not the same length as the sequence string for read @HWI-ST705:113:C026NACXX:5:1208:5169:188022/2

My guess is that a5's perl code to add the /1 and /2 identifiers is broken, and added a /2 to that read's quality line. The problem will probably go away if you are able to add the /1 and /2 identifiers to the reads yourself and rerun. In the meantime will go fix this part of a5.

**swbarnes2** · 12-06-2011, 03:41 PM

Originally posted by koadman View Post

Thanks for sending the nohup.out, really helpful!
Reading through it I can see that your input files seem to be lacking the canonical /1 and /2 paired read identifiers in their names. a5 tries to add them since bwa insists on their presence when generating a paired-end sam file.

Umm, that's not true at all. I run bwa all the time, and my reads don't have read1 or read2 identifiers on them at all. In fact, some programs, like Picard's MarkDupliactes won't realize that the two reads are paired together unless their names are identical.

bwa just assumes that the first read of fastq1 is paried with the first read of fastq2.

**koadman** · 12-06-2011, 09:29 PM

Indeed you are correct! I must have been distracted or written that too hastily. It's not BWA that has that requirement, but rather SGA which shares a little bit of algorithm with BWA. My deepest apologies for the confusion. SGA's preprocess step, at least the version of the code embedded in the a5 pipeline, when run in paired-end mode, insists on having identifiers of the form /1 /2 if any are used. And lo, I just discovered it will happily run without any identifiers. So the little code fragment which mangled your file by trying to add an identifier does appear to be unnecessary. Deleting code is a win in my book, thank you.

I've posted a revised a5_pipeline.pl to the source code repository. You should be able to drop this in place of the existing a5_pipeline.pl and hopefully there won't be any more hiccups on your data.

Error 404 (Not Found)!!1

http://ngopt.googlecode.com/svn/trunk/bin/a5_pipeline.pl

**jvanleuven** · 12-07-2011, 07:25 AM

How do you calculate the 30MB? In every assembly you could have excess contigs that you don't want to use. If the genome sequence is similar to an already published genome, you can use nucmer or promer to pick out big contigs that map to your genome. If they are much different, then I use a combination of binning my GC content, coverage, and tblastx hits.

jt

**Rockx** · 12-07-2011, 09:02 PM

How would you deal with gene transfer in these instances? Horizontally acquired genes would have a different GC content, which would be eliminated using GC and blast binning.

**jvanleuven** · 12-07-2011, 09:34 PM

I do inclusive searches. First take all contigs with similar coverage, then add all contigs with close GC, then add contigs that hit by blast, ect.

Although I am also new at this and still learning, my goal is to not exclude contigs. However, I work with samples that have multiple bacterial genomes and a large amount of non-bacterial sequences.

Does this sound reasonable? As I mentioned, I am still learning.

**koadman** · 12-07-2011, 09:35 PM

well, the largest currently known bacterial genomes are around 10Mbp, and most seem to be under 5Mbp so there's a good chance that 30Mbp is way too large. One simple method to eliminate material not part of your isolate is a coverage-based filter on contigs. For contigs that are sufficiently large, the variance in coverage due to library PCR bias and bridge PCR bias can be averaged out, and sequences that are part of the isolate will show much higher average coverage than contaminant or noisy contigs. This is the approach employed by IDBA and many other tools (the a5 pipeline includes this). I have seen this approach fail to produce the desired isolate genome in the past, and every time it turned out the lab had accidentally cocultured one or more other organisms with the target isolate.

I would be extremely wary of separating on the basis of GC (or higher order k-mer) content since as the previous poster pointed out, lateral transfer can introduce regions with unusual GC and you don't want to lose these!

Topics	Statistics	Last Post
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 18 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 22 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 17 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM
Evolutionary Dynamics of Centromeres: A Comparative Genomic Analysis by seqadmin Started by seqadmin, 04-04-2024, 09:00 AM	0 responses 48 views 0 likes	Last Post by seqadmin 04-04-2024, 09:00 AM

Seqanswers Leaderboard Ad

Announcement

Genome assembly -hiseq illumina data

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News