SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
Insert size distribution for SV detection BnaT Bioinformatics 1 11-12-2014 07:59 AM
BWA insert size distribution too large etwatson Bioinformatics 1 10-31-2013 12:16 PM
custom ssDNA library with strange size distribution of qPCR products pyridine Sample Prep / Library Generation 4 07-17-2013 03:18 AM
Bowtie2 insert size=0 manore Bioinformatics 2 12-26-2012 01:03 AM
Bimodal insert size distribution Pepe Bioinformatics 1 03-03-2010 04:10 AM

Reply
 
Thread Tools
Old 02-04-2015, 09:04 AM   #1
M4TTN
Member
 
Location: UK

Join Date: Jan 2014
Posts: 75
Default NexteraXT MiSeq Bowtie2 strange insert size distribution

Dear all, we are trying to reproducibly determine the size distribution of our inserts obtained from NexteraXT-processed MiSeq (2x300bp) multiplexed samples.

We have used Bowtie2 for alignment and SeqMonk for visualising the apparent insert lengths.

For unknown reasons we observe aberrant peaks at around 300bp, and another at around 550 bp, the latter of which disappears if we alter the insert length cutoff (-X option in bowtie) to 1000 (as was suggested elsewhere). The default being 500bp.

However, the peak at approx 300 bp persists. Any idea what this is?

In the following image, the left image is mapped with default Bowtie parameters. The right image with -X 1000.

https://www.dropbox.com/s/brwnp2onpf...s1000.png?dl=0

We suspect it to be some sort of aberration during mapping due to the 300bp read length (possibly from the heavily overlapping reads?), but we can't really make sense of it.

Last edited by M4TTN; 02-04-2015 at 09:15 AM.
M4TTN is offline   Reply With Quote
Old 02-04-2015, 05:56 PM   #2
mikep
Member
 
Location: Singapore

Join Date: Feb 2011
Posts: 45
Default

How much QC have you done on the fragmentation prior to sequencing? It could be the kit, and I say that because a flaky nextera kit on Miseq has turned into a living nightmare for us on a genotyping project.
mikep is offline   Reply With Quote
Old 02-05-2015, 12:17 AM   #3
M4TTN
Member
 
Location: UK

Join Date: Jan 2014
Posts: 75
Default

Personally I don't think this has anything explicitly to do with the fragmentation. However good/bad it is, I cannot see how this could create a peak in insert size of almost exactly 300bp.

Nevertheless, to give the details: we are using NexteraXT which only uses the inbuilt library normalisation steps. Starting DNA amount is 1-1.5ng, cut at some point with AMPure and finally with Illumina's NexteraXT library normalisation beads (possibly also just Ampure?) prior to pooling for MiSeq (2x300bp). Multiplexing 16-24 samples.

We see the same peaks in all of the samples we've analysed so far.

Overall genome coverage is OK (averages about 30x, but with some regions of very low coverage, and some very high).

I really think this has something to do with the way Bowtie analyses/reports fragments where the two reads fully overlap (which will be the case for any fragment shorter than 300bp).

Does anyone know how Bowtie deals with such a situation?
i.e. where the start position of Read1 = end position of Read2?

What about where the insert is <300bp long, but each read is actually >300 bp long?

i.e. for inserts that are 290-299 bp long, yet the read read length remains as 300bp in each in the FastQ file due to an inability for Illumina software to remove the 10 bp of putative adapter. Would this enable Bowtie to record the insert lengths as 300 (because the read lengths in the FastQ are 300bp long), even thought the actual insert was slightly shorter than 300bp.

Indeed, any information about how Illumina recognises adapter sequences at the ends and trims them when making the FastQ? Presumably it does trim, but requires a certain length of adapter sequence before determining that it can be trimmed off?

Last edited by M4TTN; 02-05-2015 at 02:05 AM.
M4TTN is offline   Reply With Quote
Old 02-05-2015, 03:47 AM   #4
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

You can get a reference-free insert size distribution histogram with BBMerge, like this:

bbmerge.sh in1=r1.fq in2=r2.fq ihist=ihist.txt loose

That will be based on read overlapping rather than mapping coordinates. It's very fast, and you can cap it at the first 1m pairs with the flag "reads=1000000" to get it to finish in a few seconds, which is plenty for a good histogram.

You can generate insert size histograms based on mapping with BBMap, again with the ihist flag. I have no idea how Bowtie deals with inserts shorter than read length, but both BBMerge and BBMap process them correctly; though with the default parameters and no "local" flag BBMap won't map them when the insert size is under roughly 75% of the read length. This can be fixed by doing adapter trimming prior to mapping, which I highly recommend for any aligner. Since adapter trimming (for fragment libraries) only removes bases on the right end of reads, it strictly increases the accuracy of insert size calculations, when done correctly.

Incidentally, running BBDuk with the "tbo" (trim by overlap) flag will allow the removal of adapters down to 1bp, because that removes adapters based on apparent insert size as calculated by overlap, rather than matching adapter sequences.

Last edited by Brian Bushnell; 02-05-2015 at 03:49 AM.
Brian Bushnell is offline   Reply With Quote
Old 02-05-2015, 04:04 AM   #5
M4TTN
Member
 
Location: UK

Join Date: Jan 2014
Posts: 75
Default

Thanks Brian. Please forgive my Unix ignorance, but how would I go about installing/using BBMerge?
M4TTN is offline   Reply With Quote
Old 02-05-2015, 04:14 AM   #6
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

You just need to unzip and untar it, and then it will work as long as you have Java installed. You can extract it in one command with tar:

tar -zxvf BBMap_34.41.tar.gz

Then just run bbmerge.sh, which will be in the bbmap directory. For example:

/path/to/bbmap/bbmerge.sh in1=r1.fq in2=r2.fq ihist=ihist.txt loose

...where "/path/to/" is the location you extracted BBMap, and your working directory is wherever the reads are located (though you can provide absolute or relative paths to the read files, too).

Last edited by Brian Bushnell; 02-05-2015 at 04:24 AM.
Brian Bushnell is offline   Reply With Quote
Old 02-05-2015, 04:18 AM   #7
M4TTN
Member
 
Location: UK

Join Date: Jan 2014
Posts: 75
Default

But how does one run the bbmerge.sh file?

If I double-click it opens in Xcode.

Edit: OK. Got it!

Last edited by M4TTN; 02-05-2015 at 04:27 AM.
M4TTN is offline   Reply With Quote
Old 02-05-2015, 04:33 AM   #8
M4TTN
Member
 
Location: UK

Join Date: Jan 2014
Posts: 75
Default

OK. I am in the bbmap directory, and I can see the bbmerge.sh file.

But when I enter something similar to your code (but with my fastq files replacing r1.fq and r2.fq, I just get:

-bash: bbmerge.sh: command not found
M4TTN is offline   Reply With Quote
Old 02-05-2015, 04:35 AM   #9
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,054
Default

You may need to add execute permissions to the shell scripts

Code:
$ chmod u+x bbmerge.sh
Code:
$ ./bbmerge.sh in1=r1.fq in2=r2.fq ihist=ihist.txt loose
GenoMax is offline   Reply With Quote
Old 02-05-2015, 04:39 AM   #10
M4TTN
Member
 
Location: UK

Join Date: Jan 2014
Posts: 75
Default

ls -l the bbmap directory, gives:


-rwxr-xr-x 1 mn98 staff 5286 23 Jan 18:15 bbmerge.sh

And when trying to run the code, I still get:

-bash: bbmerge.sh: command not found
M4TTN is offline   Reply With Quote
Old 02-05-2015, 04:46 AM   #11
M4TTN
Member
 
Location: UK

Join Date: Jan 2014
Posts: 75
Default

OK. Adding ./ ahead of bbmerge.sh etc

Now gives me this output:


Exception in thread "main" java.lang.NoClassDefFoundError: 2/jni/
Caused by: java.lang.ClassNotFoundException: 2.jni.
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)

[Unfortunately I have next to no experience with Terminal/Unix...]
M4TTN is offline   Reply With Quote
Old 02-05-2015, 04:49 AM   #12
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,054
Default

Just to make sure. You have not moved any of BBMap files/directories after you uncompressed the tar file? Are you running the command from within the BBMap directory?
GenoMax is offline   Reply With Quote
Old 02-05-2015, 04:53 AM   #13
M4TTN
Member
 
Location: UK

Join Date: Jan 2014
Posts: 75
Default

I did not move the files after download.

In terminal I typed:

cd

Then dragged in the bbmap folder and pressed return.

I can check the content of this directory with ls -l

(That is about the limit of my Unix...)

To attempt to execute the command, I type it in as written by brian, but starting directly at "bbmerge.sh" and dragged fast.q files in where the r1.fq and r2.fq were.

I added the reads=1000000 before the hist command.

To get the second (longer) error message I started command with ./bbmerge.sh
M4TTN is offline   Reply With Quote
Old 02-05-2015, 04:57 AM   #14
M4TTN
Member
 
Location: UK

Join Date: Jan 2014
Posts: 75
Default

OK..something happening now...all 8 cores active...

OUTPUT:


BBMerge version 5.5
Finished reading
Total time: 20.117 seconds.

Pairs: 1000000
Joined: 789709 78.971%
Ambiguous: 66486 6.649%
No Solution: 143805 14.381%
Too Short: 0 0.000%
Avg Insert: 279.9
Standard Deviation: 111.5
Mode: 249

Insert range: 36 - 590
90th percentile: 437
75th percentile: 348
50th percentile: 267
25th percentile: 202
10th percentile: 145
M4TTN is offline   Reply With Quote
Old 02-05-2015, 05:00 AM   #15
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,054
Default

Success!

Looks like 79% of your reads could be merged.
GenoMax is offline   Reply With Quote
Old 02-05-2015, 05:04 AM   #16
M4TTN
Member
 
Location: UK

Join Date: Jan 2014
Posts: 75
Default

Here is the histogram:

https://www.dropbox.com/s/honw7b6rzr...reads.png?dl=0

So the aberrant inserts of approx 300bp are clearly caused by an error in how Bowtie classifies insert length.

I do think this might be for those inserts in the 280-300bp range that fail to have adapters trimmed in the FastQ. What do you think?

If so, it would explain why there is a dip in apparent insert lenghts in the 280-300bp range, an the anomalous peak at 300bp.
M4TTN is offline   Reply With Quote
Old 02-05-2015, 05:09 AM   #17
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,054
Default

Since you now have BBMap installed you can easily check with BBDuk to see how many reads still have adapter contamination. I assume you have not done any trimming on these reads. Specify appropriate adapter file when you trim. Standard illumina adapter files are in "/path_to/bbmap-xx.xx/bbmap/resources/".

Last edited by GenoMax; 02-05-2015 at 05:11 AM.
GenoMax is offline   Reply With Quote
Old 02-05-2015, 05:19 AM   #18
M4TTN
Member
 
Location: UK

Join Date: Jan 2014
Posts: 75
Default

Trimming was performed automatically by Illumina prior to FASTQ download, but I suspect there will be short tails left on a subset of fragments.
M4TTN is offline   Reply With Quote
Old 02-05-2015, 05:34 AM   #19
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,054
Default

No harm in trying a pass through BBDuk to verify (specially if you want to get rid of those tails).
GenoMax is offline   Reply With Quote
Old 02-05-2015, 06:07 AM   #20
M4TTN
Member
 
Location: UK

Join Date: Jan 2014
Posts: 75
Default

Initial:
Memory: free=1039m, used=21m

Added 16767 kmers; time: 0.225 seconds.
bbduk output:

Memory: free=1032m, used=28m

Input is being processed as paired
Started output streams: 0.010 seconds.
Processing time: 14.997 seconds.

Input: 2546118 reads 645049609 bases.
KTrimmed: 2543559 reads (99.90%) 633504163 bases (98.21%)
Result: 467188 reads (18.35%) 11545446 bases (1.79%)

Time: 15.238 seconds.
Reads Processed: 2546k 167.09k reads/sec
Bases Processed: 645m 42.33m bases/sec
M4TTN is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 01:39 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO