SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Old 09-28-2015, 04:37 AM   #1
vang42
Member
 
Location: Denmark

Join Date: Mar 2010
Posts: 10
Default Collapse PCR duplicates to a consensus read

We are working on methods to detect very low-frequency mutations in cancer patients. By drawing a blood sample and isolating the plasma, we are able to detect very small amounts of circulating tumor DNA (ctDNA). We want to be able to detect mutations at 1/100 or perhaps 1/1000 of the germline DNA fraction, so we are using a targeted approach and sequencing to a depth of more than 10,000×. To tell true mutations apart from sequencing errors, we add a 6 bp unique identifier during the PCR reaction in library preparation. We then want to find the PCR duplicates for each unique identifier (UID) and collapse them into a consensus sequence.

Code:
UID1          ------G--------------------
UID1          ---------------------------
UID1          ---------------------------
UID2           -------G----------C--------
UID2           -------G-------------------
UID2           -------G------A--------T---
UID2                  ---------------------------
UID2                  ---------------------------
UID3                  ---------------------------
UID3                      ---------------------T-----
UID3                      ---------------------------
As seen here, we have a REF->G mutation in UID2, but we want to weed out the other mutations found within the same UID. By taking the consensus sequence within each UID group, we get this picture:

Code:
UID1          ---------------------------
UID2           -------G-------------------
UID2                  ---------------------------
UID3                  ---------------------------
UID3                      ---------------------N-----
We would also like the number of reads from which each consensus sequence is made to be stored, either as a tag in the BAM file or reflected in the quality scores of the consensus sequence.

Could you guide me on how this could be accomplished, or suggest the best tools with which to program it?
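For the collapsing step itself, here is a minimal sketch in Python. The function name is hypothetical, and the strict-majority rule (emit N on a tie) is my assumption, inferred from the N in the UID3 consensus above:

```python
from collections import Counter

def consensus(reads):
    """Collapse a group of equal-length PCR duplicates (same UID, same
    start position) into one consensus string. Positions without a
    strict majority become 'N'. Returns (consensus, number_of_reads)."""
    cons = []
    for column in zip(*reads):
        base, count = Counter(column).most_common(1)[0]
        # strict majority: the winning base must cover >50% of the reads
        cons.append(base if count * 2 > len(reads) else "N")
    return "".join(cons), len(reads)

# The UID3 case above: two duplicates, one carrying a T -- a 1:1 tie,
# so the consensus emits an ambiguous N at that position.
print(consensus(["----------", "------T---"]))   # → ('------N---', 2)
print(consensus(["--G-", "--G-", "----"]))       # → ('--G-', 3)
```

The read count returned alongside the consensus is what would go into the per-consensus tag or quality score mentioned above.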

Last edited by vang42; 09-30-2015 at 10:21 AM. Reason: The drawing was malformed
Old 10-30-2015, 06:37 PM   #2
C9r1y
Junior Member
 
Location: San Francisco, USA

Join Date: Jan 2013
Posts: 8
Question Collapsing PCR dupes to consensus seq to find low frequency mutations

I am interested in finding a solution to the exact same question. Does anyone have ideas?
Old 10-30-2015, 10:44 PM   #3
blancha
Senior Member
 
Location: Montreal

Join Date: May 2013
Posts: 367
Default

4^6 = 4096 possible UIDs.

Either separate the FASTQ file into 4096 FASTQ files, by UID, before aligning, or align together, and then separate the BAM file into 4096 temporary BAM files.

Run samtools mpileup separately on each individual BAM file.
Determine the consensus sequence from the pileup results.

Combine the 4096 pileup results, being careful to preserve the information about the individual UIDs while also computing the sums across all UIDs.

If your starting point is the BAM file, you could write a C++ program, using the OpenMP library for parallelism, and the samtools API for the pileup.

It would then be a single command-line program. It's probably simpler to just use samtools mpileup as a standalone binary and write some scripts to parse the output, but it sounds more impressive to write a program.
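Assuming the UID is the first 6 bases of each read (an assumption; it depends on the library design), the binning step could be sketched like this in pure Python, with simple tuples standing in for FASTQ records:

```python
from collections import defaultdict

def bin_by_uid(records, uid_len=6):
    """Group FASTQ records by UID, assumed here to be the first uid_len
    bases of each read; the UID (and its quality values) are trimmed off
    so the remainder can be aligned normally."""
    bins = defaultdict(list)
    for name, seq, qual in records:
        bins[seq[:uid_len]].append((name, seq[uid_len:], qual[uid_len:]))
    return bins

records = [("read1", "ACGTACTTTT", "IIIIIIIIII"),
           ("read2", "ACGTACTTGT", "IIIIIIIIII"),
           ("read3", "GGGGGGTTTT", "IIIIIIIIII")]
bins = bin_by_uid(records)
print(sorted(bins))           # → ['ACGTAC', 'GGGGGG']
print(len(bins["ACGTAC"]))    # → 2
```

A real pipeline would stream the FASTQ with a proper parser and write one file per UID rather than holding everything in memory.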
Old 10-31-2015, 02:32 AM   #4
vang42
Member
 
Location: Denmark

Join Date: Mar 2010
Posts: 10
Default

Thank you blancha.

As I see it, your method would collapse the last three UID3 reads together, which is not the goal. I need to group together only the reads that would be defined as PCR duplicates: all reads with the same 5' position (or pair of positions, for paired-end sequencing) within each UID group.
Old 10-31-2015, 04:52 AM   #5
blancha
Senior Member
 
Location: Montreal

Join Date: May 2013
Posts: 367
Default

Quote:
As I see it, your method would collapse the last three UID3 reads together, which is not the goal. I need to group together only the reads that would be defined as PCR duplicates: all reads with the same 5' position (or pair of positions, for paired-end sequencing) within each UID group.
Yes, you are right.
Well, let's just make a small adjustment to the algorithm, then.

After having generated the 4096 BAM files for each UID, and sorted them by coordinates, if necessary, let's cycle through each BAM file, and then treat each group of reads with the same start coordinate individually. First, keep a record of the number of reads with the same start and end position, for your custom tag. Second, count the number of each nucleotide at each position, and pick the most common nucleotide at each position. Use this information to generate a new entry in a new BAM file, with the collapsed reads, with the custom tag for the number of reads. Let's call this optional tag XU.
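A sketch of that grouping-and-collapsing step, with toy (uid, start, sequence) tuples standing in for BAM records (the function name is hypothetical; a plurality vote is used here, and with pysam the read count could be written to the XU tag via `set_tag`):

```python
from collections import Counter, defaultdict

def collapse_bam_groups(reads):
    """reads: iterable of (uid, start, seq) from a coordinate-sorted BAM.
    Groups reads by (uid, start) and emits one plurality-consensus
    record per group, carrying the group size for the XU count tag."""
    groups = defaultdict(list)
    for uid, start, seq in reads:
        groups[(uid, start)].append(seq)
    collapsed = []
    for (uid, start), seqs in sorted(groups.items()):
        # most common nucleotide at each position
        cons = "".join(Counter(col).most_common(1)[0][0]
                       for col in zip(*seqs))
        collapsed.append((uid, start, cons, len(seqs)))  # len(seqs) -> XU
    return collapsed

reads = [("UID2", 100, "AAGA"), ("UID2", 100, "AAGA"),
         ("UID2", 100, "AACA"), ("UID2", 150, "TTTT")]
for rec in collapse_bam_groups(reads):
    print(rec)
# → ('UID2', 100, 'AAGA', 3)
#   ('UID2', 150, 'TTTT', 1)
```

This assumes all reads in a group have equal length; real data would need handling for indels, clipping, and quality-weighted votes.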

You haven't mentioned the size of the library or the number of reads. If it's significant, I would write this program in C++, possibly using OpenMP for parallelism. Otherwise, I would write the program in Python.

I'm assuming you are a programmer yourself, or at least, have access to qualified programmers. Otherwise, this is the kind of program that I could write myself, if there really is a demand for this program, and I was provided a sample BAM file. It would also be important to know the number and length of the reads, and also if a Unix cluster is available to process the reads. It's a simple and amusing task for a programmer.

There are a few additional details that are not clear, that a programmer would have to know, and a couple of options in how to process the data. Is the UID in the sequence read, or was it sequenced separately? Instead of splitting the BAM file, it could also be treated as a whole and simply sorted first by coordinate and then by UID; the reads with the same UID and start coordinate could then be processed together. I'm not sure yet which approach is more efficient. One must also verify whether sequencing errors in the UID are possible, which again depends on where the UID was inserted.

Last edited by blancha; 10-31-2015 at 04:58 AM. Reason: Added questions about details.
Old 10-31-2015, 05:29 AM   #6
blancha
Senior Member
 
Location: Montreal

Join Date: May 2013
Posts: 367
Default

I'll make a small adjustment to my algorithm: I'll consider reads to be duplicates if they have the same start position, without looking at the end position. Initially, I was thinking of accounting for reads aligned at the same position that were hard-clipped or soft-clipped differently. Upon reflection, it isn't really useful to take this possibility into account. If reads have the same UID and start position, my algorithm will consider them duplicates.
Old 10-31-2015, 12:49 PM   #7
vang42
Member
 
Location: Denmark

Join Date: Mar 2010
Posts: 10
Default

I am sure that many would actually use such a program if you wrote and released it to the community.

So far I have used a mixture of my own scripts with the consolidate.py script from this package https://github.com/aryeelab/umi

If you wrote something fast and flexible, it would be valuable to many labs, since ultra-deep sequencing is becoming more widely needed in cancer genomics. It might be difficult to make a one-size-fits-all solution, since there are so many experimental procedures that would change the input FASTQ files.

I would be happy to collaborate and provide example BAM files for you. The library size is typically up to 10^9 reads.
Old 10-31-2015, 08:21 PM   #8
blancha
Senior Member
 
Location: Montreal

Join Date: May 2013
Posts: 367
Default

Sure.
Just send me a private message through the forum, so we can exchange email addresses.
I'll make the program and its code publicly available on GitHub when finished, if there is demand for it.

For the sample files, there are a few options.
If you can give me the location where the files are, I can download them myself.
I also have my own FTP server at my research institute, though it is occasionally unreliable.
There is also a service called Globus Connect, in Canada, which allows one to read and write files directly on one of Compute Canada's servers. You'd have to open an account with Globus Connect first, though, and use their software.
I don't currently have much capacity available on my Dropbox account.
Old 10-31-2015, 08:33 PM   #9
blancha
Senior Member
 
Location: Montreal

Join Date: May 2013
Posts: 367
Default

Sent you a private message, with my contact information.
Not sure if it worked though, since the message doesn't appear in the sent messages list.
Old 11-01-2015, 11:48 AM   #10
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

I do not believe you can confidently identify 1/1000 mutations with Illumina reads, even with 10,000× coverage. The error rate is too high and too nonrandom. It would obviously be possible with sufficient coverage and a perfectly random error model, but Illumina's errors are clearly nonrandom, as you can see from base-frequency histograms. This is unfortunate, as the most glaring biases would be trivial to correct using the real-time mapping data from spiked-in PhiX, but Illumina makes no effort to do so.
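A back-of-the-envelope calculation illustrates the point, assuming a typical Q30 per-base error rate of 1e-3 (my assumption, not a figure from this thread):

```python
# At 10,000x coverage, a 1/1000 mutation yields ~10 mutant reads at a
# site, while a Q30 error rate (1e-3, an assumed typical value) also
# yields ~10 erroneous reads at the same site -- signal equals noise,
# which is why raw depth alone cannot resolve 1/1000 variants.
coverage = 10_000
allele_fraction = 1 / 1000
error_rate = 1e-3  # assumed Q30 per-base error rate

mutant_reads = coverage * allele_fraction
expected_errors = coverage * error_rate
print(mutant_reads, expected_errors)  # → 10.0 10.0
```

This is exactly the gap that UID-based consensus calling is meant to close: random errors are voted out within each duplicate group, while a true mutation appears in every read of its group.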
Old 11-02-2015, 08:46 AM   #11
C9r1y
Junior Member
 
Location: San Francisco, USA

Join Date: Jan 2013
Posts: 8
Default Follow-up: PCR dupes without UIDs

Thanks vang and blancha for the insightful dialogue. I too would be interested in a streamlined program, if made available.

I also have the problem of working with aligned data that contains many PCR duplicates (as defined by position) but lacks UIDs. Do either of you have ideas on how to tackle that problem? I am a confident command-line user but am still developing my programming skills. I realize that collapsing large libraries will take some parallelization, which I am less familiar with.
Old 11-02-2015, 09:08 AM   #12
blancha
Senior Member
 
Location: Montreal

Join Date: May 2013
Posts: 367
Default

If you don't have UIDs, and just want to mark or remove the PCR duplicates, there are many tools available. No programming skills are required.
Picard tools MarkDuplicates is my favorite.

The case described by vang42 is a novel experimental approach to identify PCR artefacts in ultra-deep sequencing. @Brian Bushnell raises an interesting question about the validity of ultra-deep sequencing, but I won't get into that debate. Given the novelty of the approach, no off-the-shelf software exists for it.

Removing or marking duplicates without ultra-deep sequencing or UIDs is just a run-of-the-mill operation that can be performed by multiple existing programs, notably Picard's MarkDuplicates.
Old 11-02-2015, 09:44 AM   #13
C9r1y
Junior Member
 
Location: San Francisco, USA

Join Date: Jan 2013
Posts: 8
Default

Thanks, blancha. I am familiar with MarkDuplicates, but my understanding is that it assesses the duplicate reads and retains the one with the highest quality score. I am looking for a way to generate a consensus from PCR-duplicate reads by collapsing them, such that:


-----------A-------C----
-------------------C----
-------------------C----
---T---------------C----
---------G--------------

Becomes:

-------------------C----

based on a majority vote with a programmatically set threshold (i.e. the base that is present in >50% of reads becomes part of the consensus).
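A minimal sketch of such a threshold vote in pure Python (the function name and the `threshold` parameter are hypothetical; here '-' counts as a base, so fully concordant reference positions stay '-'):

```python
from collections import Counter

def vote(reads, threshold=0.5):
    """Call a consensus base only where its fraction of the duplicate
    reads exceeds `threshold`; everything else becomes 'N'."""
    out = []
    for col in zip(*reads):
        base, count = Counter(col).most_common(1)[0]
        out.append(base if count / len(reads) > threshold else "N")
    return "".join(out)

# The example above: only the C (4 of 5 reads, 80%) clears the 50% bar.
dupes = ["-----------A-------C----",
         "-------------------C----",
         "-------------------C----",
         "---T---------------C----",
         "---------G--------------"]
print(vote(dupes))  # → -------------------C----
```

Raising `threshold` makes the call more conservative (more N positions); the singleton A, T, and G variants are outvoted at any threshold above 20%.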

Anyone have any ideas??

Last edited by C9r1y; 11-06-2015 at 12:44 PM.