SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
Supplementing WGS variant calls with RNA-Seq chrismit Bioinformatics 2 05-15-2013 12:01 PM
Searching for available CML RNA seq data! Karenj Bioinformatics 0 01-18-2013 02:10 AM
Sorting fastq by primers, then searching by sequence (with mismatches) jme Bioinformatics 0 01-18-2012 09:25 AM
RNA-Seq: A Powerful and Flexible Approach to the Analysis of RNA Sequence Count Data. Newsbot! Literature Watch 0 08-04-2011 02:00 AM
PubMed: SeqWare Query Engine: storing and searching sequence data in the cloud. Newsbot! Literature Watch 0 05-28-2011 11:13 AM

Reply
 
Thread Tools
Old 09-28-2015, 09:16 AM   #1
ErikFas
Member
 
Location: Sweden

Join Date: Jun 2014
Posts: 86
Default Searching for a sequence in WGS / RNA-seq data

I am interested in searching for a specific sequence in both my RNA-seq and WGS data, and the sequence is quite a bit above the read lengths for either experiment. I have access to all BAM files, some VCF files for WGS, raw fastq files, and everything else you can imagine coming from the sequencing. I want to see if a sequence is present in the data, and if it is, if it's present in the aligned or unaligned BAM files.

The background to my question would be that the sequence in question is a sequence that I believe would not be successfully mapped to the reference, but might still exist in the data/reads. I am unsure of how to go about this, or if it's even something that can be done.

My initial idea was to create some kind of consensus sequence from the RNA-seq BAM-files (both unaligned and aligned), and simply search the resulting sequencing against my sequence of interest. This, however, has proven to be hard, as there seems to be numerous ways of doing it according to Google, and none being the best (the "best" of which involving vcftools, which I for the life of me I cannot get to install on my Mac; no make files, although the documentation says there should be!)

In essence, I just want to find my sequence in my data. How do I do this?
ErikFas is offline   Reply With Quote
Old 09-28-2015, 09:26 AM   #2
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,982
Default

Can you provide information about size of this sequence and the reads you have?
GenoMax is offline   Reply With Quote
Old 09-28-2015, 01:17 PM   #3
ErikFas
Member
 
Location: Sweden

Join Date: Jun 2014
Posts: 86
Default

The read lengths I have are 100 and 125 bp, and the full sequence I'm interested in is ~5kb, but I'm also interested in smaller fragments of that sequence, all still longer than the read lengths.
ErikFas is offline   Reply With Quote
Old 09-28-2015, 01:47 PM   #4
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,982
Default

Have you tried to use that 5kb sequence as a reference for mapping the reads against it? You could also try to blat your reads against that sequence.

I am a bit puzzled by your statement: "a sequence that I believe would not be successfully mapped to the reference". What exactly do you mean by that and why can it not be mapped? That sequence is part of the reference, correct?
GenoMax is offline   Reply With Quote
Old 09-28-2015, 11:05 PM   #5
ErikFas
Member
 
Location: Sweden

Join Date: Jun 2014
Posts: 86
Default

Ah, no, I haven't tried that... Good idea!

No, that's why I'm asking, the sequence is not part of the reference (it's an insert that's part of a silencing of a gene of interest), and I would like to see if that sequence is part of the data without having run Tophat all over again after adding the sequence to the reference manually (computing time, etc...)
ErikFas is offline   Reply With Quote
Old 09-28-2015, 11:37 PM   #6
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Since your reference of interest is small... you either can map to it, or do kmer-matching. For example -

bbduk.sh in=reads.fq outm=matching.fq ref=sequence.fa k=25 edist=2

That will give you all the reads that share a 25-mer with the reference, allowing up to 2 edits. You could subsequently assemble those reads to find a consensus. BBDuk is very fast in this kind of situation.
Brian Bushnell is offline   Reply With Quote
Old 09-28-2015, 11:54 PM   #7
ErikFas
Member
 
Location: Sweden

Join Date: Jun 2014
Posts: 86
Default

Quote:
Originally Posted by Brian Bushnell View Post
bbduk.sh in=reads.fq outm=matching.fq ref=sequence.fa k=25 edist=2
So, the input here would then be the raw FASTQ reads from the sequencing, and the reference just my sequence of interest? I don't know that much about sizes of k-mers, but are there other altneratives / reasons to use something other than k=25, such as longer ones (approaching the read lengths, for example)?

[EDIT]

I used the following command to run BBDuk:

bbduk.sh in1=sample_1.fastq.gz in2=sample_2.fastq.gz outm1=matching1.fq outm2=matching2.fq ref=sequence.fa k=25 edist=2

... which gave me ~36000 matching k-mers in matching1.fq. I'm not sure what this means... I find 36000 k-mers of length 25 in the reads of my experiment matching k-mers of the same length in the reference sequence? How does this tell me if the sequence exists in my data, or is that something I have to do downstream (you mention assembling the reads to a consensus)?

Last edited by ErikFas; 09-29-2015 at 12:50 AM.
ErikFas is offline   Reply With Quote
Old 09-29-2015, 04:14 AM   #8
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,982
Default

The reads that matched at least share a 25 bp sequence (with up to 2 edits) with your "reference 5kb". You could take the reads in matching files and try to assemble them.

You could also lengthen the k parameter (and reduce edit distance) to see if you are able to match longer lengths.
GenoMax is offline   Reply With Quote
Old 09-29-2015, 04:23 AM   #9
ErikFas
Member
 
Location: Sweden

Join Date: Jun 2014
Posts: 86
Default

Ok, I'll do that then. I have never tried assembling a transcriptome before, so I don't really know how to go about doing it. Is it Trinity that's used, or is there a more simpler solution for my smaller data set?
ErikFas is offline   Reply With Quote
Old 09-29-2015, 04:43 AM   #10
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,982
Default

Is the actual inactivating "insert" 5kb or is that the entire region you are looking at?

Since you have a potential set of reads identified how about just trying to align them (use BBMap) to the 5kb reference. At 25bp original match that is only about a 1/4th of the read so the rest of the read may not align well/map at all.
GenoMax is offline   Reply With Quote
Old 09-29-2015, 05:49 AM   #11
ErikFas
Member
 
Location: Sweden

Join Date: Jun 2014
Posts: 86
Default

It's the whole region that we're looking at, at least for this question.

Oh, you can align with BBmap? How do I do that, which functions, etc?

Quote:
Originally Posted by GenoMax View Post
At 25bp original match that is only about a 1/4th of the read so the rest of the read may not align well/map at all.
I'm not sure I understand what you're saying. Are you saying that a 25 bp k-mer is too short to do this, and that I should use a longer one?

Also, can do I this for fastq-data from either RNA-seq or DNA-seq, or is it specific (or requires different parameters) for one of them?
ErikFas is offline   Reply With Quote
Old 09-29-2015, 06:40 AM   #12
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,982
Default

Quote:
Originally Posted by ErikFas View Post
It's the whole region that we're looking at, at least for this question.
So most of the sequence should be from the real reference with a part (size?) coming from the foreign insert?

Quote:
Oh, you can align with BBmap? How do I do that, which functions, etc?
Absolutely. Something along the lines of (provided samtools is available in PATH and your R1/R2 files match in read order)
Code:
$ bbmap.sh in1=matching1.fq in2=matching2.fq ref=sequence.fa out=alignments.bam ambig=all
Quote:
I'm not sure I understand what you're saying. Are you saying that a 25 bp k-mer is too short to do this, and that I should use a longer one?
I was thinking that a read got selected because at least 25 bp (from 100/125 total) matched the reference. Rest of the read may or may not map at all so when you try the alignment. In that process you may end up losing some/many of the 36000 reads that were originally selected by BBduk.

Last edited by GenoMax; 09-29-2015 at 06:43 AM.
GenoMax is offline   Reply With Quote
Old 09-29-2015, 10:45 PM   #13
ErikFas
Member
 
Location: Sweden

Join Date: Jun 2014
Posts: 86
Default

When I use your command I do get an alignment, but I don't know exactly what's happened, as I get as many reads as the total amount from the two input files, and it seems odd. There's 36416 k-mers from matching1.fq / matching2.fq each, and I get 73832 aligned reads in the output SAM file. It would make more sense if, like you said, I lose some of the reads in the alignment because the reads above and beyond the k-mer size of 25 might not align.

Disregarding the weird result, how do I actually look in this file for my sequence? I mean, it's not a consensus sequence (as in a single, long sequence), right? Even so, just having something aligned to the sequence I'm interested in should tell me that the sequence is in the data...?
ErikFas is offline   Reply With Quote
Old 09-30-2015, 04:11 AM   #14
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,982
Default

You should visually inspect the alignment (e.g. use IGV and 5kb reference) and check what the alignments look like. It is possible that you do have good alignments. Can you post the alignment summary statistic for BBMap?
GenoMax is offline   Reply With Quote
Old 09-30-2015, 10:46 AM   #15
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Quote:
Originally Posted by ErikFas View Post
When I use your command I do get an alignment, but I don't know exactly what's happened, as I get as many reads as the total amount from the two input files, and it seems odd. There's 36416 k-mers from matching1.fq / matching2.fq each, and I get 73832 aligned reads in the output SAM file. It would make more sense if, like you said, I lose some of the reads in the alignment because the reads above and beyond the k-mer size of 25 might not align.
Are you sure you have 73832 aligned reads? By default, BBMap includes both aligned and unaligned reads in the output. That said, it's also possible that all of the reads aligned. As GenoMax said, the statistics output from BBMap (what it prints to the screen) would be helpful.
Brian Bushnell is offline   Reply With Quote
Old 09-30-2015, 09:27 PM   #16
ErikFas
Member
 
Location: Sweden

Join Date: Jun 2014
Posts: 86
Default

Ok, here is what BBMap stdout:

Code:
  ------------------   Results   ------------------   

Genome:                	1
Key Length:            	13
Max Indel:             	16000
Minimum Score Ratio:  	0.56
Mapping Mode:         	normal
Reads Used:           	72832	(9176832 bases)

Mapping:          	2.021 seconds.
Reads/sec:       	36044.03
kBases/sec:      	4541.55


Pairing data:   	pct reads	num reads 	pct bases	   num bases

mated pairs:     	  0.1016% 	       37 	  0.1016% 	        9324
bad pairs:       	  0.0000% 	        0 	  0.0000% 	           0
insert size avg: 	  219.24


Read 1 data:      	pct reads	num reads 	pct bases	   num bases

mapped:          	  0.1181% 	       43 	  0.1181% 	        5418
unambiguous:     	  0.1181% 	       43 	  0.1181% 	        5418
ambiguous:       	  0.0000% 	        0 	  0.0000% 	           0
low-Q discards:  	  0.0137% 	        5 	  0.0137% 	         630

perfect best site:	  0.0000% 	        0 	  0.0000% 	           0
semiperfect site:	  0.0000% 	        0 	  0.0000% 	           0
rescued:         	  0.0220% 	        8

Match Rate:      	      NA 	       NA 	 84.0716% 	        4555
Error Rate:      	100.0000% 	       43 	 15.8915% 	         861
Sub Rate:        	100.0000% 	       43 	 15.2270% 	         825
Del Rate:        	  0.0000% 	        0 	  0.0000% 	           0
Ins Rate:        	 16.2791% 	        7 	  0.6645% 	          36
N Rate:          	  4.6512% 	        2 	  0.0369% 	           2


Read 2 data:      	pct reads	num reads 	pct bases	   num bases

mapped:          	  0.1428% 	       52 	  0.1428% 	        6552
unambiguous:     	  0.1428% 	       52 	  0.1428% 	        6552
ambiguous:       	  0.0000% 	        0 	  0.0000% 	           0
low-Q discards:  	  1.6751% 	      610 	  1.6751% 	       76860

perfect best site:	  0.0000% 	        0 	  0.0000% 	           0
semiperfect site:	  0.0000% 	        0 	  0.0000% 	           0
rescued:         	  0.0275% 	       10

Match Rate:      	      NA 	       NA 	 84.1270% 	        5512
Error Rate:      	100.0000% 	       52 	 15.8730% 	        1040
Sub Rate:        	100.0000% 	       52 	 15.8730% 	        1040
Del Rate:        	  0.0000% 	        0 	  0.0000% 	           0
Ins Rate:        	  0.0000% 	        0 	  0.0000% 	           0
N Rate:          	  0.0000% 	        0 	  0.0000% 	           0
ErikFas is offline   Reply With Quote
Old 10-01-2015, 03:05 AM   #17
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,982
Default

That does not look promising. Few reads are mapped.

You are going to have to spend sometime playing with BBMap options to see if you can improve the alignments. Checking some of the individual reads would be needed to see if you have chimeric reads and/or just random hits based on the initial selection of k=25. Have you tried using higher k values to see if you can narrow down the read set?
GenoMax is offline   Reply With Quote
Old 10-01-2015, 03:53 AM   #18
ErikFas
Member
 
Location: Sweden

Join Date: Jun 2014
Posts: 86
Default

Actually for this data (i.e. RNA-seq) the sequence shouldn't be there, so aligning few reads is actually good for me! I will run through the workflow again with another k-mer, maybe 35 or 50?

I looked at the BAM file in IGV, and it looks like the attached image. If I'm interpreting this correctly, there are a bunch of reads that map to the sequence, but they all have a bunch of mutations, which would then correspond to the allowed mismatched in the BBDuk step? Should I then rerun with fewer (or no) allowed mismatches, or am I misinterpreting the data?

I would also like to do this on my whole genome sequencing data, would I just do the exact same procedure for that (starting at the fastq-files from DNA sequencing)?
Attached Images
File Type: png Screen Shot 2015-10-01 at 13.40.54.png (70.2 KB, 4 views)
ErikFas is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 03:55 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2019, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO