Hi, long-time reader, first-time poster. Our lab has been banging our collective heads against some bizarre results for the past several weeks, and we're hoping the good folks here will be interested in helping us figure out what's going on.
Some background: we have four cell lines, prepped for sequencing using the SOLiD Whole Transcriptome Analysis Kit (SWTAK). Each cell line was prepped with both Adapter Mix A and Adapter Mix B, and each of the resulting eight libraries has its own barcode. Reads were mapped using Bowtie and are being analyzed using SAMtools and IGV.
Sequencing was done by a facility using SOLiD for the first time, and the samples were prepped by someone using that kit for the first time, so there's a chance that errors could have been introduced at just about any point in the workflow. We're trying to figure out exactly what happened.
I've attached a picture that encapsulates more or less all the problems we've discovered with the data. Specifically:
1. In almost all areas, reads from the A and B libraries of the same cell line map to the same strand, either + or -. In areas of low coverage, there's occasional mixing of strand assignment, but in high-coverage areas (>200-fold, often thousands-fold), it's always the same strand. Shouldn't they map to opposite strands?
2. We have evidence that some kind of mis-barcoding might have happened: we observe SNPs occurring in, for example, the B library of one cell line and again in the B library of another cell line, but not in their A libraries. It would be straightforward if it were always the same pair of cell lines, but it's not. Sometimes 654RMB will match JHPB, and other times 701054A will match JHPA. We've observed this in a number of locations, always with high coverage.
3. Though we have eight libraries, areas of high coverage only show two patterns of coverage. The 654RM and 701054 cell lines appear similar for both their A and B libraries, and the FD123 and JHP libraries do the same with each other. Similarly, we find SNPs common to both the A and B libraries of both 654RM and 701054 cell lines quite often, and similarly with the FD123 and JHP libraries.
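To put numbers on point 1 rather than eyeballing IGV, here is a minimal sketch of how the strand split could be tallied per library. It assumes each library has been exported as plain-text SAM (e.g. with `samtools view`) and relies only on the standard SAM FLAG bits (0x4 unmapped, 0x10 reverse strand). The function name and the optional `region` argument are made up for illustration.

```python
from collections import Counter


def strand_counts(sam_lines, region=None):
    """Tally mapped reads per strand from SAM text lines.

    region is an optional (rname, start, end) tuple, 1-based inclusive,
    matched against each read's leftmost mapped position (SAM POS field).
    """
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):  # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:  # not a complete alignment record
            continue
        flag = int(fields[1])
        if flag & 0x4:  # read is unmapped
            continue
        if region is not None:
            rname, start, end = region
            if fields[2] != rname or not (start <= int(fields[3]) <= end):
                continue
        # Bit 0x10 set means the read aligned to the reverse strand.
        counts["-" if flag & 0x10 else "+"] += 1
    return counts
```

Running this over the same high-coverage region for the A and B library of one cell line should show whether the two really do share a strand, and by what margin.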
As you might guess, this has been driving us crazy. The possibilities for the errors are endless, but also self-contradictory. If the samples had been mis-barcoded, why do the SNPs mismatch in some places but not others? If the samples were mixed or otherwise contaminated, why do they appear to have such common coverage patterns?
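One way we could make the SNP comparison less anecdotal is to compute how much every pair of libraries overlaps in its SNP calls. The sketch below assumes the calls for each library have already been reduced to a set of (chromosome, position, alternate base) tuples. The input layout and function name are hypothetical, and the Jaccard index is just one reasonable overlap measure.

```python
from itertools import combinations


def pairwise_jaccard(snps_by_library):
    """Return {(lib_a, lib_b): Jaccard index} for every pair of libraries.

    snps_by_library maps a library name to a set of hashable SNP records,
    e.g. (chrom, pos, alt) tuples.
    """
    result = {}
    for a, b in combinations(sorted(snps_by_library), 2):
        shared = snps_by_library[a] & snps_by_library[b]
        union = snps_by_library[a] | snps_by_library[b]
        # Two empty call sets are treated as having no overlap.
        result[(a, b)] = len(shared) / len(union) if union else 0.0
    return result
```

If the barcodes were simply swapped, a mislabeled library should pair most strongly with a different cell line across the board. If the samples were mixed, I'd expect intermediate overlap with several libraries instead.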
I found a conversation in this thread (link) pointing out that the B mix can introduce errors in the first hexamer sequenced, so I'm currently thinking that a valid way of detecting whether an A or B mix had been used would be to do a frequency analysis of errors in the reads. Does this sound like a reasonable place to start?
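Roughly what I have in mind for the frequency analysis is a per-position mismatch rate along the read, computed separately for each library. The sketch below is only that, a sketch. It assumes plain-text SAM input carrying the optional `MD:Z:` tag, it only uses reads whose CIGAR is a single match block (no indels or clipping), and it counts mismatches in the decoded nucleotide alignment rather than in color space. The function name is made up.

```python
import re
from collections import Counter

MD_TOKEN = re.compile(r"(\d+)|(\^[A-Za-z]+)|([A-Za-z])")
SIMPLE_CIGAR = re.compile(r"^(\d+)M$")


def mismatch_profile(sam_lines):
    """Return {read_position: mismatch_rate}, positions 0-based from the 5' end."""
    mismatches = Counter()
    coverage = Counter()
    for line in sam_lines:
        if line.startswith("@"):  # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue
        flag = int(fields[1])
        cigar = SIMPLE_CIGAR.match(fields[5])
        if flag & 0x4 or not cigar:  # unmapped, or has indels/clipping
            continue
        md = next((f[5:] for f in fields[11:] if f.startswith("MD:Z:")), None)
        if md is None:  # assumption: the mapper wrote MD tags
            continue
        length = int(cigar.group(1))
        reverse = bool(flag & 0x10)
        for i in range(length):
            coverage[i] += 1
        pos = 0  # offset within the alignment, in reference orientation
        for num, _deletion, base in MD_TOKEN.findall(md):
            if num:
                pos += int(num)  # run of matching bases
            elif base:
                # Reverse-strand reads are stored reverse-complemented,
                # so flip the offset back to the original read orientation.
                mismatches[length - 1 - pos if reverse else pos] += 1
                pos += 1
    return {i: mismatches[i] / coverage[i] for i in sorted(coverage)}
```

If the B mix really does introduce errors in the first hexamer, I'd expect the B libraries to show an elevated rate over positions 0 to 5 relative to the A libraries. Libraries that don't fit that pattern would then be candidates for mislabeling.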
Thanks for your time and consideration!
Bob