  • What to do with Illumina 76bp data with no reference?

    Sorry for the length of the post; I'm a long-time bioinformatics person but a total noob with NGS.

    I am trying to help a lab that sequenced human fecal samples from individuals with two distinct phenotypes. They contracted a company that used an Illumina GA to generate 76bp single-end reads, and I have the data as 4 pools: 2 pools for phenotype A and 2 for phenotype B. Each pool is in fact a pool of RNA extracted from 4 individuals. The sequences are actually only 72bp long, as the company removed an individual-identifying tag.

    From what I can gather from communicating with the company, the lab jumped the gun and went straight to metabolome analyses without doing any of the prerequisite work, such as 16S profiling and paired-end "deep" sequencing. (After the company's cleaning, the current data files contain between 3 and 4 million sequences per pool.)

    I can match about 10% of the data to human RNAs (using UCSC known genes or the genome assemblies as a reference) but only 3-5% against NCBI bacterial genomes. I have also downloaded tons of data from other metagenomic bacterial sequencing projects, and while I can get more hits (still < 10%), those sequences carry no annotation to speak of, so matching against these publicly available scaffolds has not helped matters.

    My goal is to do something with the data, at the very least a dendrogram showing that pools 1 and 2 are separable from pools 3 and 4. But since the majority of the data is just raw short sequences that I can neither align to a reference nor assemble, I do not really know what I am supposed to do with it.

    Can someone please point me in the right direction? I have read quite a bit of this forum today and there is a ton of info, maybe too much in terms of the different programs available, but my problem just does not appear to fit in any of the categories and does not seem to be solvable based on the descriptions of the 10-15 programs I have read about.

    Thanks in advance

  • #2
    Eight controls and eight cases, just like an early microarray experiment.
    Consult a statistician to assess how much power the experiment had.
    Then report the species of the bacterial genomes in the 3-5% of reads that aligned to NCBI and that the statistician lets you call significant.

    Comment


    • #3
      I'm not sure if this would lead toward your goal, but you can BLAST all unmapped reads against the nt database, then use something like MEGAN to taxonomically evaluate the BLAST results. I use this approach for samples with unknown contents. It is computationally taxing to BLAST millions of queries against such a large database, but it works. This paper describes the technique: http://www.plosone.org/article/info%...l.pone.0010256
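
      Since the BLAST step is the bottleneck, a common workaround is to split the query FASTA into chunks and run one blastn-vs-nt job per chunk in parallel. A minimal sketch in Python, assuming well-formed FASTA input; file names and chunk size are placeholders:

      Code:
      # Split a large FASTA of unmapped reads into fixed-size chunks so
      # that one blastn job against nt can be run per chunk in parallel.
      # Assumes well-formed FASTA; names and chunk size are placeholders.
      def split_fasta(path, reads_per_chunk=50000, prefix="chunk"):
          chunk_idx, n_reads, out = 0, 0, None
          with open(path) as fh:
              for line in fh:
                  if line.startswith(">"):
                      if n_reads % reads_per_chunk == 0:
                          if out:
                              out.close()
                          chunk_idx += 1
                          out = open("%s_%04d.fa" % (prefix, chunk_idx), "w")
                      n_reads += 1
                  out.write(line)
          if out:
              out.close()
          return chunk_idx

      if __name__ == "__main__":
          n = split_fasta("unmapped_reads.fa")
          print("wrote %d chunks; submit one blastn job per chunk" % n)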

      Comment


      • #4
        I am very interested in how many more human sequences the authors managed to remove by blastn against human, and in exactly what they mean by "human ambiguous sequences" in "a BLAST homology search against human genomic DNA and human ambiguous sequences extracted from the nt database".

        But I have yet to receive a reply.

        MEGAN looks interesting; I will explore that!
        http://kevin-gattaca.blogspot.com/

        Comment


        • #5
          Thanks for the replies so far. I have set up a Mosaik run to pass each pool through human cDNA, then human genomic, and then the NCBI bacterial genomes, collecting and filtering the results. But I know I am going to end up with > 60% of the data in the nowhere bin. I guess I can quality-trim and then BLAST against nr, but deciphering that output sounds like a nightmare at those expect values.

          I will definitely need to look at MEGAN (thanks for the info)

          The first thing I did was merge all the full-length (72-mer) sequences into one database, stored in a hash with one instance per sequence, and count the number of times each sequence was found in each pool (an exact-match sketch of this follows below). What I would really like is a program that could use the quality information, allow mismatches, and generate a rough estimate of frequencies per pool, so that I could see whether there is a clear separation between pools 1 & 2 versus pools 3 & 4.
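
          For reference, a minimal sketch of the exact-match version, assuming FASTA input and ignoring quality; pool file names are placeholders:

          Code:
          from collections import defaultdict

          COMP = str.maketrans("ACGTN", "TGCAN")

          def canonical(seq):
              # Collapse a read and its reverse complement to one key.
              rc = seq.translate(COMP)[::-1]
              return min(seq, rc)

          def read_fasta(path):
              # Yield sequences from a FASTA file (headers ignored).
              seq = []
              with open(path) as fh:
                  for line in fh:
                      if line.startswith(">"):
                          if seq:
                              yield "".join(seq)
                          seq = []
                      else:
                          seq.append(line.strip().upper())
              if seq:
                  yield "".join(seq)

          # counts[sequence] = [count in pool 1, pool 2, pool 3, pool 4]
          pools = ["pool1.fa", "pool2.fa", "pool3.fa", "pool4.fa"]
          counts = defaultdict(lambda: [0] * len(pools))
          for i, path in enumerate(pools):
              for seq in read_fasta(path):
                  counts[canonical(seq)][i] += 1

          From the per-pool count vectors you can compute pairwise distances between pools and feed them into any hierarchical clustering to get the dendrogram; mismatch tolerance and quality weighting would, as noted, need an aligner.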

          If anyone knows of such a tool, or a way to use an alignment program such as Mosaik or something to accomplish this goal, your suggestions would be appreciated.

          Comment


          • #6
            jiaco,

            I'm curious: how many counts of redundant sequences did you get? (And they were perfectly redundant?) Theoretically you get very few of these with Illumina sequencing, so I don't think the strategy you described, looking for similar reads without assembling, would be an effective one...

            MOSAIK is a good choice for what you're trying to do compared to maq.

            Comment


            • #7
              The hashing program checks the hash; if the sequence has not been seen, it reverse-complements and rechecks; if still not seen, it inserts. Out of the 4 × ~3 million reads, the highest count is only about 5. I actually thought that strange, but I am happy to hear that it is normal.

              For Mosaik, should I make a reference out of all the pools combined and then align that same sequence set to itself? Would that be a valid way to find "equivalent" reads with sequencing errors? I have read the Mosaik docs (excellent by the way) but do not see any downstream application that I could use to make the next step (after sorting the alignment results). Any suggestions?

              Comment


              • #8
                Do you have the total number of duplicate reads found, rather than the maximum number of duplicates found for any one read? I'm curious now.

                Interesting idea with mapping reads to reads. That wouldn't take too long to run if you want to try it. A few thoughts. You have "long" reads, so I'd additionally look for reads that are identical but shifted a few bases in either direction. You could do this a few ways. One would be to use the full-length reads as your artificial chromosomes, then map trimmed reads (shave just a few bases from either end) to those. Another would be to write your own script that searches pairwise through all the reads for longest common substrings of length at least x with an edit distance of x-n; basically the same thing (a seed-based sketch of this follows below). Finally, an assembler will do much the same thing. I believe the output of Velvet is a FASTA file of contigs with a header that lists the number of reads used to make each contig: in essence, a measure of how many related sequences you have.
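
                For the pairwise-search idea, all-against-all comparison of millions of reads is infeasible, so one cheap trick is to index a k-mer seed from each read at a few nearby offsets and only group reads that share a seed. A rough sketch, with k and the shift window as arbitrary choices:

                Code:
                from collections import defaultdict

                def shifted_candidates(reads, k=32, max_shift=4):
                    # Index a k-mer seed from near the middle of each read, at
                    # every offset within max_shift of the midpoint.  Reads that
                    # are the same sequence shifted a few bases will share a seed
                    # and land in the same bucket; buckets with more than one
                    # entry are candidate groups to verify (filtering out
                    # self-hits of the same read id is left to the caller).
                    index = defaultdict(list)  # seed -> [(read id, offset)]
                    for rid, seq in enumerate(reads):
                        mid = (len(seq) - k) // 2
                        lo = max(0, mid - max_shift)
                        hi = min(len(seq) - k, mid + max_shift)
                        for off in range(lo, hi + 1):
                            index[seq[off:off + k]].append((rid, off))
                    return [hits for hits in index.values() if len(hits) > 1]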

                I wouldn't underestimate using BLAST (dc-megablast) with the nr/nt database. It is well annotated; each sequence has a taxon id associated with it that you can use to parse your data. Handling data from GenBank is sometimes a pain, though. The tools in BLAST+ are useful (blastdbcmd lets you make queries that, in conjunction with awk commands, will e.g. extract all sequences for a certain taxon id), though you can do the same things online.

                Hope that helps. I'm dealing with a similar issue, except I'll be handling this data on a regular basis. I've thought about this a lot, but don't necessarily have the best answers yet. Please let me know how this goes for you.

                One other thought. The reads I'm working with now are from libraries that were poorly prepared: the insert lengths were much shorter than the read length, resulting in sequencing into the adapter sequences, which prevents mapping by standard algorithms. You might grep your reads for the first ~12 bases of the Illumina adapter sequence in case they are present (a quick check like the one below), since it sounds like this company used somewhat of a modified/homebrew protocol.
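
                A minimal sketch of that check for FASTQ input; AGATCGGAAGAG is the start of the standard Illumina adapter read-through, but with a custom prep like this one you should substitute whatever sequences the provider reports:

                Code:
                # Count FASTQ reads containing the first 12 bases of an adapter.
                # AGATCGGAAGAG is the start of the standard Illumina adapter;
                # with a custom prep, substitute the provider's sequence.
                ADAPTER_PREFIX = "AGATCGGAAGAG"

                def count_adapter_hits(fastq_path, prefix=ADAPTER_PREFIX):
                    hits = total = 0
                    with open(fastq_path) as fh:
                        for i, line in enumerate(fh):
                            if i % 4 == 1:  # sequence line of a 4-line record
                                total += 1
                                if prefix in line:
                                    hits += 1
                    return hits, total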

                Comment


                • #9
                  Thanks zbjorn, you have given me lots to think about.

                  I have spent a lot of time on these adapters so far and can summarize that for you now.
                  The company gave me these sequences:
                  >5'-end adapter
                  5'-CTCTGGACCTTGGCTGTCACTCAGTT-3'
                  >3'-end adapter
                  5'-CCTTGGCTGTCACTCACTGCGA-(dT25)-3'
                  dT25 means the adapter is followed by 25 Ts

                  We first got "raw" data, then a few days later "cleaned" reads. While I had the raw data, I wrote a program to take out reads containing the adapters, allowing mismatches; it removed them on either strand. When I got the cleaned data, I re-ran this program on that too and was still masking/removing lots of sequence. I found many instances where the 5'-end adapter was on the reverse-complement strand exactly adjacent to an instance of the 3'-end adapter on the sense strand. I emailed them with some examples and also asked why they do not remove adapter instances with mismatches, which seem to me obviously adapters with sequencing errors (when I searched the NCBI bacterial genomes for these adapter sequences allowing up to 5 mismatches, I could not find them). After a couple of days they replied and sent back a third set of files, claiming that their cleaning program is not very customizable. (A sketch of the mismatch-tolerant masking step follows below.)
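
                  For reference, a minimal, quality-unaware sketch of that masking step, using the adapter sequences quoted above and a simple Hamming-distance threshold on both strands (slow but transparent):

                  Code:
                  COMP = str.maketrans("ACGTN", "TGCAN")

                  # Adapter sequences as reported by the company (dT25 tail omitted).
                  ADAPTER_5 = "CTCTGGACCTTGGCTGTCACTCAGTT"
                  ADAPTER_3 = "CCTTGGCTGTCACTCACTGCGA"

                  def mask_adapter(read, adapter, max_mismatch=3):
                      # Replace any window matching the adapter, or its reverse
                      # complement, within max_mismatch substitutions with Ns.
                      read = list(read)
                      for adap in (adapter, adapter.translate(COMP)[::-1]):
                          alen = len(adap)
                          for i in range(len(read) - alen + 1):
                              mism = sum(1 for a, b in zip(read[i:i + alen], adap)
                                         if a != b)
                              if mism <= max_mismatch:
                                  read[i:i + alen] = ["N"] * alen
                      return "".join(read)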

                  This experience has made me quite suspicious of the company. Having discovered this forum, I have begun to get acquainted with the tools of the trade, and right now I am more interested in tools like Mosaik that take the quality data into account than in my blast/blat pipeline: even this third round of cleaned data still has lots of sequences with big stretches of 2s in the quality file, and I now have the impression that what I was doing initially, treating identical sequences as identical without taking the quality information into consideration, was wrong.

                  I will read up on Velvet today as that is a program I have not yet seen.

                  Meanwhile, maybe you could take a look at this question too:



                  Thanks again.

                  Comment


                  • #10
                    Hi there,

                    I'm a little unclear on what you mean about the 5' adapter adjacent to the 3' one. Do you mean you had 5'-3' adapter heterodimers? If the libraries were size-selected properly, I think these are not supposed to be present. Or there's some modification on the oligo(s) to keep them from forming; I forget.

                    Ideally you should have *no* adapter sequences in your reads. If you do, depending on your application you should trim them off. That's all I meant, as a possible reason why your reads aren't assembling or mapping as well as you'd like. It's also an indicator of whether or not your libraries were prepared well.

                    If I understand you correctly, you're removing entire reads based on the presence of mismatches in the adapter sequences you're seeing in your reads? I wouldn't use errors in the adapter sequence as an indicator of total read quality. Read quality decreases with cycle number, so any errors are more likely to be at the 3' end of the read, which is where you'd see the adapter read-through. You can map your PhiX control (hopefully the company ran one) all the way out to 76 bp instead of just the default 25 to get an idea of how the flow cell performed on the whole as you get into deeper cycles.

                    ----------
                    This is in response to the question in the link you sent (moderators, this is a related question anyway, so I hope it's not an issue posting it in this thread). I'm having trouble logging in to that branch of the site, and my reply is more discussion-like than definitive-answer-like. (I'll reply to the post above in a bit.)
                    ----------
                    I have a similar question in with Illumina tech support right now (mine was more open-ended: do you evaluate reads somehow and filter on that metric?). I don't filter reads at all after running the Illumina pipeline (which does some of its own filtering), since the most commonly used alignment algorithms (maq being the pioneer) take the quality scores into consideration when mapping and generating the consensus, as you mentioned. This hasn't caused any noticeable problems for me, but I can't say whether filtering would improve your analysis. For sure, if you're writing your own algorithm that doesn't consider per-base or per-read quality scores, then filtering your input file would be wise. It may be that reads with an average score of 2 are effectively a T followed by 75 Ns, which algorithms should not do anything with anyway. (And yes, 40 is the max on the Illumina scale; -5 is the lowest.) I'd say your parameters are reasonable.

                    I forget the Mosaik workflow... check that you are processing the Illumina quality scores properly in your first step. I usually go Illumina --> FASTQ first; if the quality scores are still on the Illumina scale, that will cause problems (a sketch of the conversion follows below). However, this sounds like a separate problem.
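
                    A minimal sketch of that conversion, assuming Illumina 1.3+ Phred+64 scores; the older Solexa scale (the one with the -5 floor) needs an extra log-odds step, included as well:

                    Code:
                    import math

                    def phred64_to_phred33(qual):
                        # Illumina 1.3+ (Phred scores, ASCII offset 64)
                        # -> Sanger FASTQ (Phred scores, ASCII offset 33).
                        return "".join(chr(ord(c) - 64 + 33) for c in qual)

                    def solexa64_to_phred33(qual):
                        # Old Solexa scale (log-odds scores, ASCII offset 64,
                        # floor of -5) -> Sanger Phred+33, via the standard
                        # log-odds-to-Phred conversion.
                        out = []
                        for c in qual:
                            q_solexa = ord(c) - 64
                            q_phred = round(10 * math.log10(10 ** (q_solexa / 10.0) + 1))
                            out.append(chr(int(q_phred) + 33))
                        return "".join(out)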

                    You might check with this company how they are "cleaning" the reads. Are they using something other than the Illumina pipeline? You can't easily use the Illumina pipeline in an iterative manner... If they are using something other than the Illumina pipeline, I would question it; that thing was well engineered. Then again, there are open-source alternatives that are purportedly better, at least at base calling, but I'm not familiar with them at all.

                    It's a little late here; if I missed anything, don't hesitate to restate.

                    Best,
                    Zach

                    Comment
