SEQanswers

Old 03-27-2015, 05:54 AM   #41
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,083
Default

Do this to avoid the Excel part:

Code:
$ samtools view -f 0x4 your_file.sam | wc -l
That should just spit a number out.
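With a reasonably recent samtools, `samtools flagstat` is an alternative sketch that reports mapped and unmapped totals in one pass (the filename is a placeholder):

```shell
# One-pass summary: total reads, mapped, properly paired, singletons, etc.
samtools flagstat your_file.sam
```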

If R1/R2 reads are overlapping, then you should consider using bbmerge.sh (or FLASH) to get a collapsed representation of the fragment before doing the mapping.
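A minimal bbmerge.sh invocation might look like this sketch (file names are placeholders; option names should be checked against your BBTools version's help):

```shell
# Merge overlapping R1/R2 pairs into single fragments; unmerged
# pairs are written separately, and an insert-size histogram is
# produced as a by-product.
bbmerge.sh in1=r1.fq in2=r2.fq out=merged.fq outu1=unmerged1.fq outu2=unmerged2.fq ihist=ihist.txt
```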

Total number of reads in your original/trimmed file can be easily found by FastQC (detailed stats should also be in the output of the bbduk.sh log).
Old 03-27-2015, 06:01 AM   #42
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,083
Default

Quote:
Originally Posted by M4TTN View Post
I just used SAMTools and it spat out a huge file. Copied it into Excel (for lack of a better method!) and counted the rows: 103,163
But if I open the file in Qualimap, it states: 923,010 (with 103,136 stated as unmapped).
This is for one sample? That does not look too bad, then. You have about 90% of reads mapping and 10% unmapped.
Old 03-27-2015, 07:35 AM   #43
M4TTN
Member
 
Location: UK

Join Date: Jan 2014
Posts: 75
Default

Here are the stats from Qualimap:

https://www.dropbox.com/s/s6aqmqheme...mmary.png?dl=0

If I am reading this correctly, the total mapped reads includes those without a matched pair (listed as singletons = 24,393).

What surprises me somewhat is why 10% are not mapping at all (neither Read1 nor Read2 mapping). Would quality-based trimming of the file help? Or would slicing into smaller reads help?

I am also really surprised by the mapped duplication rate of 24.57%. This literally means that a quarter of our clusters were made of identical molecules? I am amazed given that this is whole genome sequencing of Nextera fragmented DNA.

Or is it due to the complete overlap of 25% of our reads? This seems more likely, so I hope it is that!
Old 03-27-2015, 07:43 AM   #44
M4TTN
Member
 
Location: UK

Join Date: Jan 2014
Posts: 75
Default

I just had a thought: presumably, if I arbitrarily trim 1-2 bases off the 3' end of EVERY read, there will be no pairs that fully overlap, and thus the TLEN column in the SAM file should no longer be broken/misinterpreted by Qualimap.

Would that work do you think?
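If fixed-length 3' trimming were the route taken, recent BBTools builds expose a force-trim option for exactly this; a sketch (file names are placeholders, and the ftr2 flag should be verified against the installed version):

```shell
# forcetrimright2 (ftr2) removes this many bases from the 3' end
# of every read; running both files together keeps pairing intact.
bbduk.sh in1=r1.fq in2=r2.fq out1=trim1.fq out2=trim2.fq ftr2=2
```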
Old 03-27-2015, 09:14 AM   #45
M4TTN
Member
 
Location: UK

Join Date: Jan 2014
Posts: 75
Default

Hmmm.. a quick manual BLAST alignment of half a dozen of the unmapped sequences spat out by SamTools reveals that they map to the yeast 2 micron plasmid!

If this is a representative sample, it suggests that we have about 200 copies of the 2 micron plasmid contaminating our yeast genomic DNA preps!

Going forward, we're going to try to add the 2 micron sequence to our reference file for mapping, so that we can at least monitor it correctly. But ideally, it would be good to rid our gDNA preps of the plasmid as much as possible, so I'll look into some protocols to see if it is possible to selectively partition the gDNA and 2 micron.

Finally: I repeated the alignment using BWA (versus Bowtie2). The number of aligned reads was much lower; however, it was very clear that the odd peaks around 250, 350, and 500 bp are not present in the BWA-aligned file. I assume this is due to the reduced stringency of alignment by Bowtie2 allowing alignment of reads with adapters/mismatches in them.

Bowtie2 is at the top, BWA at the bottom; both were aligned using UGene from bbduk-trimmed fastq files.

https://www.dropbox.com/s/0sh3rehjv8...ottom.png?dl=0

What I still don't really understand is why bbduk is not successfully removing any trace of the adapters from the 3' ends of the reads. Thoughts?

Last edited by M4TTN; 03-27-2015 at 09:16 AM.
Old 03-27-2015, 09:26 AM   #46
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,083
Default

Quote:
Originally Posted by M4TTN View Post
But ideally, it would be good to rid our gDNA preps of the plasmid as much as possible, so I'll look into some protocols to see if it is possible to selectively partition the gDNA and 2 micron.
This should be very simple to do informatically with BBSplit.
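A BBSplit run along these lines could be sketched as follows (reference and file names are placeholders; options should be checked against bbsplit.sh's help):

```shell
# Bin read pairs by best-matching reference: % in the basename is
# replaced by each reference's name; pairs matching neither
# reference go to the outu files.
bbsplit.sh in1=r1.fq in2=r2.fq ref=genome.fa,2micron.fa basename=out_%.fq outu1=clean1.fq outu2=clean2.fq
```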

BTW: Does UGENE have the latest versions of all software (bwa/bowtie2)? Are you able to tweak parameters? (I have not used UGENE.)

It is normal to have a fraction of your reads not map at all. You do have a high percentage of duplicates. Are they PCR-type duplicates (identical start and end when mapped)?

Last edited by GenoMax; 03-27-2015 at 09:54 AM.
Old 03-27-2015, 10:30 AM   #47
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

The tlen field is determined by the aligner, so different aligners will give you different numbers there. BBMap does produce correct insert size histograms for overlapping reads, without having to process the sam file, so I'd be interested to see what its histogram looks like. You can generate it like this:

bbmap.sh ref=reference.fa in=reads.fq ihist=ihist.txt
Old 03-31-2015, 05:02 AM   #48
M4TTN
Member
 
Location: UK

Join Date: Jan 2014
Posts: 75
Default

Quote:
Originally Posted by GenoMax View Post
This should be very simple to do informatically with BBSplit.

BTW: Does UGENE have the latest versions of all software (bwa/bowtie2)? Are you able to tweak parameters? (I have not used UGENE.)

It is normal to have a fraction of your reads not map at all. You do have a high percentage of duplicates. Are they PCR-type duplicates (identical start and end when mapped)?
We are making progress: at least in some of the samples, the 10% unmapped reads are coming from the 2 micron plasmid (approx. 200 copies per cell, it seems) and also from the mitochondrial genome, which, although 10x larger than the 2 micron plasmid, is present at only about 1-2 copies per cell, so is not a big deal.

Interestingly we have some other samples (with 25% unmapped initially), that still have 15-20% unmapped after including 2 micron and mtDNA in the reference file.

These were slow-growing clones, so I am assuming we have some seriously contaminated cultures - perhaps bacteria or some other yeast species was competing more efficiently than normal. I'm about to do some BLAST searches with the unmapped reads.

Regarding the duplicate rate: I really think this is due to so many pairs of reads being flagged as duplicates because the insert size is 300 bp or shorter: read 1 and read 2 then perfectly overlap.

In more recent samples, we used fewer Ampure beads to purify, which should size-select for larger inserts. Indeed, in these samples, the duplicate rate is only 5-10%. Disregarding the potential for PCR duplicates, this suggests that only 5-10% of molecules were 300 bp or shorter.

However, the best test, I guess, would be a single-end alignment, the prediction being that our duplication rate would pretty much disappear (other than PCR duplicates). If that makes sense, I can try that.
Old 03-31-2015, 05:03 AM   #49
M4TTN
Member
 
Location: UK

Join Date: Jan 2014
Posts: 75
Default

Quote:
Originally Posted by Brian Bushnell View Post
The tlen field is determined by the aligner, so different aligners will give you different numbers there. BBMap does produce correct insert size histograms for overlapping reads, without having to process the sam file, so I'd be interested to see what its histogram looks like. You can generate it like this:

bbmap.sh ref=reference.fa in=reads.fq ihist=ihist.txt
bbmap is taking quite a long time... (8-core Mac mini).

In the absence of a growing output file, is there any way to tell how far it may have got? It's been running for at least 15 minutes now at 800% CPU usage. Each fastq has about 460,000 reads.

Edit: After all that, it appears to have failed:
Code:
BBMap version 34.72
Set insert size histogram output to /Desktop/bbmap_analysis/bbmap_aligner/ihist.txt
Retaining first best site only for ambiguous mappings.
No output file.
NOTE:	Ignoring reference file because it already appears to have been processed.
NOTE:	If you wish to regenerate the index, please manually delete ref/genome/1/summary.txt
Set genome to 1

Loaded Reference:	0.179 seconds.
Loading index for chunk 1-1, build 1
Generated Index:	1.947 seconds.
Analyzed Index:   	3.244 seconds.
Cleared Memory:    	0.218 seconds.
Processing reads in paired-ended mode.
Started read stream.
Started 8 mapping threads.
Exception in thread "Thread-14" java.lang.AssertionError: 
There appear to be different numbers of reads in the paired input files.
The pairing may have been corrupted by an upstream process.  It may be fixable by running repair.sh.
	at stream.ConcurrentGenericReadInputStream.pair(ConcurrentGenericReadInputStream.java:492)
	at stream.ConcurrentGenericReadInputStream.readLists(ConcurrentGenericReadInputStream.java:358)
	at stream.ConcurrentGenericReadInputStream.run(ConcurrentGenericReadInputStream.java:195)
	at java.lang.Thread.run(Thread.java:695)
Indeed, opening them in FastQC shows that read 1 has 461,558 reads and read 2 has 461,540 reads.

Or did something go wrong do you think?

Edit: actually, I don't think the Terminal has finished yet - I don't have a bash prompt back. (But CPU has dropped to minimal usage.)

Last edited by M4TTN; 03-31-2015 at 05:50 AM.
Old 03-31-2015, 12:04 PM   #50
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Looks like your input files are corrupt - the reads are no longer correctly paired. Go ahead and kill BBMap; it won't terminate (I'll try to add a way for it to exit gracefully in that situation).

You need to either start over with the original input data and re-trim, making sure you always use both files as input at the same time (I apologize if I did not make that clear: with paired input files, when running bbduk, always use in1= and in2= rather than processing the files independently), or fix them with repair.sh. Re-trimming from the original files would be better, though, because that allows you to use the "tbo" flag, which won't work otherwise.
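A paired-end bbduk.sh adapter-trimming command along these lines (file names are placeholders; the k/mink values are the commonly suggested defaults, and an adapters.fa file ships with BBTools):

```shell
# Trim adapters from both files as a pair: ktrim=r clips 3' of a
# kmer hit, tbo also trims by pair overlap, and tpe trims both
# reads of a pair to the same length.
bbduk.sh in1=raw1.fq in2=raw2.fq out1=trimmed1.fq out2=trimmed2.fq ref=adapters.fa ktrim=r k=23 mink=11 hdist=1 tbo tpe
```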
Old 03-31-2015, 01:21 PM   #51
M4TTN
Member
 
Location: UK

Join Date: Jan 2014
Posts: 75
Default

Quote:
Originally Posted by Brian Bushnell View Post
Looks like your input files are corrupt - the reads are no longer correctly paired. Go ahead and kill BBMap; it won't terminate (I'll try to add a way for it to exit gracefully in that situation).

You need to either start over with the original input data and re-trim, making sure you always use both files as input at the same time (I apologize if I did not make that clear: with paired input files, when running bbduk, always use in1= and in2= rather than processing the files independently), or fix them with repair.sh. Re-trimming from the original files would be better, though, because that allows you to use the "tbo" flag, which won't work otherwise.
Thanks Brian. I need to check when I'm back at work tomorrow, but I don't think I used fastq files that had been processed with bbduk for this bbmap attempt. I think they were the raw files supplied by Illumina from BaseSpace.

Is it normal that Bowtie2 accepts fastq files that are not "paired"? It didn't throw any errors when using the same fastq files as input.

We are also experiencing some oddities with the files from a 350+250 bp sequencing run. On manual inspection, it appears that a significant number of unmapped reads contain adapter sequences embedded in the middle of them - even though the "read" is either 350 or 250 bp long. I don't quite understand how the MiSeq can have sequenced so far past the adapter (over 150 bp beyond it in some molecules!). Perhaps it starts calling bases from the adjacent clusters?

For these aberrant reads, only the sequence to the "left" of the adapter maps to the reference genome.

Since we haven't noticed this problem before, and we don't normally process the fastq files in any way prior to Bowtie2 alignment, I can only conclude that the automated Illumina trimming algorithm fails (for unknown reasons) when the reads are not identical in length.

Thoughts?

Last edited by M4TTN; 03-31-2015 at 02:04 PM.
Old 03-31-2015, 01:47 PM   #52
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Unlike, say, the PacBio platform, Illumina platforms do not know or care how long the molecule is that they are sequencing. They identify the locations of clusters in the first X cycles (~20, I think?) and then call the base at each cluster location according to the dominant light color at each cycle. If sequencing goes off the end of a molecule, it will keep calling bases from stray light noise, which, as you say, is probably from nearby clusters. The quality is likely to be low, though, so quality-trimming may sometimes get rid of these bases. And yes, with short inserts from normal fragment libraries, the sequence to the left of the adapter is genomic, and to the right is just noise.

For determining whether your files are correctly paired according to Illumina naming conventions, you can run "reformat.sh in=r1.fq in2=r2.fq vpair" which will examine the names. Not sure about Bowtie2 - some programs allow mismatched read files, and some don't; I try to do a lot of checking and make my programs crash if the input is corrupt so that people will be forced to stop and fix the input, rather than end up with incorrect results later.

Unfortunately, I don't know anything about the built-in Illumina trimming routines; we always keep them disabled, as that's just another variable that we don't have control over which could cause problems. But if you are sure the left and right reads came from the same lane and run (which is another thing to check), then you should fix them with repair.sh, then adapter trim them (as pairs), then try mapping again:

repair.sh in1=r1.fq in2=r2.fq out1=fixed1.fq out2=fixed2.fq outs=singletons.fq

Also, it might help if you could post the first few headers from each file.
Old 03-31-2015, 02:11 PM   #53
M4TTN
Member
 
Location: UK

Join Date: Jan 2014
Posts: 75
Default

The files are correctly paired, since the resulting sam files we generate clearly have pairs of reads in sequential lines that map to the same genomic location (and most of the time overlap significantly with each other - sometimes completely, i.e. where the insert is shorter than the read length).

I'm not sure why the fastq files are apparently of different lengths, though - I mean, why they contain different numbers of reads.

Good point about the way Illumina works. I had assumed that once the end of the molecule was reached, the lack of light emission was interpreted as the end of the molecule, but I guess it is simpler to just keep all the data (no matter how noisy it has become) and then trim it back down to the adapter location.

Regarding trimming: I wasn't aware there was a method to disable auto-trimming of the FASTQ files. We'll have to look into this in the future now that we know there may be some problems.

We've only just started doing NGS (this is only our 4th MiSeq run), so there is a lot to learn, and a lot ways to improve the efficiency of our analysis pipeline. Our main stumbling block is that we are a wet lab with no dedicated bioinformatics support, so we are needing to learn a lot.

Thanks for your help!

Edit: Question: if there are one or more sequencing errors in the adapter sequence, will bbduk still recognise and trim it?

Last edited by M4TTN; 03-31-2015 at 02:21 PM.
Old 03-31-2015, 03:03 PM   #54
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,083
Default

Quote:
Originally Posted by M4TTN View Post
Regarding trimming: I wasn't aware there was a method to disable auto-trimming of the FASTQ files.
Look into doing a Fastq Only/GenerateFASTQ run.

Quote:
Edit: Question: if there are one or more sequencing errors in the adapter sequence, will bbduk still recognise and trim it?
Yes. That is where the hdist option comes in: add hdist=1 (Hamming distance) to allow one mismatch.
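For instance, a sketch of a kmer-based adapter trim tolerating one mismatch per kmer (file names are placeholders):

```shell
# hdist=1 lets adapter kmers match with up to one substitution
bbduk.sh in=reads.fq out=clean.fq ref=adapters.fa ktrim=r k=23 hdist=1
```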
Old 03-31-2015, 03:09 PM   #55
M4TTN
Member
 
Location: UK

Join Date: Jan 2014
Posts: 75
Default

As far as I know, we do just run a FASTQ-only workflow on the MiSeq. But my understanding was that the output sequences are already pre-trimmed.

Thanks for the bbduk options explanation - I am slowly getting there. I see that Brian also gave me this info earlier.
Old 04-01-2015, 03:56 AM   #56
M4TTN
Member
 
Location: UK

Join Date: Jan 2014
Posts: 75
Default

It turns out that on the last run (350+250), adapter trimming had not been enabled on the MiSeq. We are requeuing the analysis with Illumina so that at least the output fastqs will have been processed the same way as all our others.

This is presumably why our mapping dropped to 75%.
Old 04-01-2015, 04:28 AM   #57
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,083
Default

Is there a reason you are doing odd-cycle runs (350 + 250)? I am guessing that the 50 extra cycles on read 1 are resulting in crappy qualities towards the end of that read, with increasing chances that you are reading into adapter (and past it) at that point.
Old 04-01-2015, 06:43 AM   #58
M4TTN
Member
 
Location: UK

Join Date: Jan 2014
Posts: 75
Default

Quote:
Originally Posted by GenoMax View Post
Is there a reason you are doing odd-cycle runs (350 + 250)? I am guessing that the 50 extra cycles on read 1 are resulting in crappy qualities towards the end of that read, with increasing chances that you are reading into adapter (and past it) at that point.
Actually, I am not sure that quality drops off much at all. It appears to in the MiSeq Reporter software, but this is just because so many inserts are shorter than the read length - and as Brian pointed out, once the adapter has been sequenced, quality tanks rapidly.

The motivation was actually based on the fact that the average quality of read 1 was greater than that of read 2, and thus that it might be better to read more bases in read 1 than in read 2.

It was also based on the (probably mistaken) assumption that having differing read lengths might prevent misinterpretation of the insert length (TLEN), since we'd have fewer paired reads that fully overlap. In actual fact, on thinking about this more, it probably won't make much difference: there are still lots of inserts shorter than 250 bp, and these will fully overlap, causing the TLEN column to report "odd" insert lengths (negative etc.).
Old 04-01-2015, 06:55 AM   #59
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,083
Default

This thread has become so long that I don't remember if you ever showed us what the Q-scores look like (FastQC report) for this data. Of course this would not be useful if you are pre-trimming your data on MiSeq since the result file would have lost the bad Q-score data.
Old 04-01-2015, 08:30 AM   #60
M4TTN
Member
 
Location: UK

Join Date: Jan 2014
Posts: 75
Default

Hi GenoMax. Actually, we are really quite perplexed, and are starting to wonder if our last run was particularly poor quality (for as-yet-unknown reasons).

On the first run (3 months ago; 2x300), the Median Q-score as reported by MiSeq reporter stayed high until about cycle 250. In read 2, the overall quality tailed off sooner. The quality on the bottom surface of the flowcell was marginally worse than the top surface, but it is pretty similar.

Now, on our most recent run (350+250), the Median Q-score as reported by MiSeq reporter is much worse, dropping off at around cycle 200 in read1, and there is a rather dramatic difference between top and bottom surfaces.

Specifically: on the top surface, the quality starts to drop off after about cycle 220 on read 1 and cycle 500 (i.e. 150 bp) on read 2.

On the bottom surface, the quality starts to drop off after cycle 150 in read 1 and cycle 450 (i.e. 100 bp) in read 2.

One thing I don't understand is how the Q-score is calculated. Without really thinking about it, I assumed the main reason the quality dropped off was due to sequencing into adapters and off the end of molecules. Indeed, I am pretty sure this is why the % ATGC content becomes skewed with increasing read length (and why it is not seen in the most recent run, in which we size-selected for larger inserts).

However, if I open up trimmed FASTQ files in FastQC, it is clear that the Q-scores are almost the same as those reported by MiSeq Reporter for the untrimmed data (MiSeq Reporter gives this info live and thus is not doing any trimming at that point, as far as I am aware). Yet the % ATGC content is dramatically improved (now flat across the entire read length rather than becoming skewed).

Also, if I open up trimmed (bbduk) versus untrimmed fastq files in FastQC for the most recent run, there is only a very subtle change in the Q-score distribution. I do not have the untrimmed fastq files for the first run, since those were automatically trimmed by Illumina.

So what is going on? Did we just have a bad run? Is the machine damaged?

Cluster density in the first (good) run was 33 million, with 30 M passing filter (above spec). The new run was 18.4 M with 14.5 M passing. Is that too low?

Was it because we size-selected for larger inserts by using less Ampure beads? And clustering was both inefficient and resulted in low quality clusters?

The overall % Q>=30 as reported by Illumina was 80.3% for run 1 and 67.3% for the recent run.
I attach some images here from FastQC and MiSeq Reporter for the first and most recent runs.

Even the two index reads were a lot worse: 94.6% and 97.8% in run 1, but 83.5% and 83.4% in the recent run.

I guess these may all be things I need to discuss with Illumina tech support.

https://www.dropbox.com/s/8d67vtnz0a...rison.pdf?dl=0

https://www.dropbox.com/s/b1txn9a8f1...orter.pdf?dl=0

Last edited by M4TTN; 04-01-2015 at 08:44 AM.