Seqanswers Leaderboard Ad

**GenoMax** · 09-28-2015, 09:26 AM

Can you provide information about size of this sequence and the reads you have?

**ErikFas** · 09-28-2015, 01:17 PM

The read lengths I have are 100 and 125 bp, and the full sequence I'm interested in is ~5kb, but I'm also interested in smaller fragments of that sequence, all still longer than the read lengths.

**GenoMax** · 09-28-2015, 01:47 PM

Have you tried to use that 5kb sequence as a reference for mapping the reads against it? You could also try to blat your reads against that sequence.

I am a bit puzzled by your statement: "a sequence that I believe would not be successfully mapped to the reference". What exactly do you mean by that and why can it not be mapped? That sequence is part of the reference, correct?

**ErikFas** · 09-28-2015, 11:05 PM

Ah, no, I haven't tried that... Good idea!

No, that's why I'm asking, the sequence is not part of the reference (it's an insert that's part of a silencing of a gene of interest), and I would like to see if that sequence is part of the data without having run Tophat all over again after adding the sequence to the reference manually (computing time, etc...)

**Brian Bushnell** · 09-28-2015, 11:37 PM

Since your reference of interest is small... you either can map to it, or do kmer-matching. For example -

bbduk.sh in=reads.fq outm=matching.fq ref=sequence.fa k=25 edist=2

That will give you all the reads that share a 25-mer with the reference, allowing up to 2 edits. You could subsequently assemble those reads to find a consensus. BBDuk is very fast in this kind of situation.

**ErikFas** · 09-28-2015, 11:54 PM

Originally posted by Brian Bushnell View Post

bbduk.sh in=reads.fq outm=matching.fq ref=sequence.fa k=25 edist=2

So, the input here would then be the raw FASTQ reads from the sequencing, and the reference just my sequence of interest? I don't know that much about sizes of k-mers, but are there other altneratives / reasons to use something other than k=25, such as longer ones (approaching the read lengths, for example)?

[EDIT]

I used the following command to run BBDuk:

bbduk.sh in1=sample_1.fastq.gz in2=sample_2.fastq.gz outm1=matching1.fq outm2=matching2.fq ref=sequence.fa k=25 edist=2

... which gave me ~36000 matching k-mers in matching1.fq. I'm not sure what this means... I find 36000 k-mers of length 25 in the reads of my experiment matching k-mers of the same length in the reference sequence? How does this tell me if the sequence exists in my data, or is that something I have to do downstream (you mention assembling the reads to a consensus)?

**GenoMax** · 09-29-2015, 04:14 AM

The reads that matched at least share a 25 bp sequence (with up to 2 edits) with your "reference 5kb". You could take the reads in matching files and try to assemble them.

You could also lengthen the k parameter (and reduce edit distance) to see if you are able to match longer lengths.

**ErikFas** · 09-29-2015, 04:23 AM

Ok, I'll do that then. I have never tried assembling a transcriptome before, so I don't really know how to go about doing it. Is it Trinity that's used, or is there a more simpler solution for my smaller data set?

**GenoMax** · 09-29-2015, 04:43 AM

Is the actual inactivating "insert" 5kb or is that the entire region you are looking at?

Since you have a potential set of reads identified how about just trying to align them (use BBMap) to the 5kb reference. At 25bp original match that is only about a 1/4th of the read so the rest of the read may not align well/map at all.

**ErikFas** · 09-29-2015, 05:49 AM

It's the whole region that we're looking at, at least for this question.

Oh, you can align with BBmap? How do I do that, which functions, etc?

Originally posted by GenoMax View Post

At 25bp original match that is only about a 1/4th of the read so the rest of the read may not align well/map at all.

I'm not sure I understand what you're saying. Are you saying that a 25 bp k-mer is too short to do this, and that I should use a longer one?

Also, can do I this for fastq-data from either RNA-seq or DNA-seq, or is it specific (or requires different parameters) for one of them?

**GenoMax** · 09-29-2015, 06:40 AM

Originally posted by ErikFas View Post

It's the whole region that we're looking at, at least for this question.

So most of the sequence should be from the real reference with a part (size?) coming from the foreign insert?

Oh, you can align with BBmap? How do I do that, which functions, etc?

Absolutely. Something along the lines of (provided samtools is available in PATH and your R1/R2 files match in read order)

Code:

$ bbmap.sh in1=matching1.fq in2=matching2.fq ref=sequence.fa out=alignments.bam ambig=all

I'm not sure I understand what you're saying. Are you saying that a 25 bp k-mer is too short to do this, and that I should use a longer one?

I was thinking that a read got selected because at least 25 bp (from 100/125 total) matched the reference. Rest of the read may or may not map at all so when you try the alignment. In that process you may end up losing some/many of the 36000 reads that were originally selected by BBduk.

**ErikFas** · 09-29-2015, 10:45 PM

When I use your command I do get an alignment, but I don't know exactly what's happened, as I get as many reads as the total amount from the two input files, and it seems odd. There's 36416 k-mers from matching1.fq / matching2.fq each, and I get 73832 aligned reads in the output SAM file. It would make more sense if, like you said, I lose some of the reads in the alignment because the reads above and beyond the k-mer size of 25 might not align.

Disregarding the weird result, how do I actually look in this file for my sequence? I mean, it's not a consensus sequence (as in a single, long sequence), right? Even so, just having something aligned to the sequence I'm interested in should tell me that the sequence is in the data...?

**GenoMax** · 09-30-2015, 04:11 AM

You should visually inspect the alignment (e.g. use IGV and 5kb reference) and check what the alignments look like. It is possible that you do have good alignments. Can you post the alignment summary statistic for BBMap?

**Brian Bushnell** · 09-30-2015, 10:46 AM

Originally posted by ErikFas View Post

When I use your command I do get an alignment, but I don't know exactly what's happened, as I get as many reads as the total amount from the two input files, and it seems odd. There's 36416 k-mers from matching1.fq / matching2.fq each, and I get 73832 aligned reads in the output SAM file. It would make more sense if, like you said, I lose some of the reads in the alignment because the reads above and beyond the k-mer size of 25 might not align.

Are you sure you have 73832 aligned reads? By default, BBMap includes both aligned and unaligned reads in the output. That said, it's also possible that all of the reads aligned. As GenoMax said, the statistics output from BBMap (what it prints to the screen) would be helpful.

Topics	Statistics	Last Post
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 25 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 28 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 24 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM
Evolutionary Dynamics of Centromeres: A Comparative Genomic Analysis by seqadmin Started by seqadmin, 04-04-2024, 09:00 AM	0 responses 52 views 0 likes	Last Post by seqadmin 04-04-2024, 09:00 AM

Seqanswers Leaderboard Ad

Announcement

Searching for a sequence in WGS / RNA-seq data

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News