Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Searching for a sequence in WGS / RNA-seq data

    I am interested in searching for a specific sequence in both my RNA-seq and WGS data, and the sequence is quite a bit above the read lengths for either experiment. I have access to all BAM files, some VCF files for WGS, raw fastq files, and everything else you can imagine coming from the sequencing. I want to see if a sequence is present in the data, and if it is, if it's present in the aligned or unaligned BAM files.

    The background to my question would be that the sequence in question is a sequence that I believe would not be successfully mapped to the reference, but might still exist in the data/reads. I am unsure of how to go about this, or if it's even something that can be done.

    My initial idea was to create some kind of consensus sequence from the RNA-seq BAM-files (both unaligned and aligned), and simply search the resulting sequencing against my sequence of interest. This, however, has proven to be hard, as there seems to be numerous ways of doing it according to Google, and none being the best (the "best" of which involving vcftools, which I for the life of me I cannot get to install on my Mac; no make files, although the documentation says there should be!)

    In essence, I just want to find my sequence in my data. How do I do this?

  • #2
    Can you provide information about size of this sequence and the reads you have?

    Comment


    • #3
      The read lengths I have are 100 and 125 bp, and the full sequence I'm interested in is ~5kb, but I'm also interested in smaller fragments of that sequence, all still longer than the read lengths.

      Comment


      • #4
        Have you tried to use that 5kb sequence as a reference for mapping the reads against it? You could also try to blat your reads against that sequence.

        I am a bit puzzled by your statement: "a sequence that I believe would not be successfully mapped to the reference". What exactly do you mean by that and why can it not be mapped? That sequence is part of the reference, correct?

        Comment


        • #5
          Ah, no, I haven't tried that... Good idea!

          No, that's why I'm asking, the sequence is not part of the reference (it's an insert that's part of a silencing of a gene of interest), and I would like to see if that sequence is part of the data without having run Tophat all over again after adding the sequence to the reference manually (computing time, etc...)

          Comment


          • #6
            Since your reference of interest is small... you either can map to it, or do kmer-matching. For example -

            bbduk.sh in=reads.fq outm=matching.fq ref=sequence.fa k=25 edist=2

            That will give you all the reads that share a 25-mer with the reference, allowing up to 2 edits. You could subsequently assemble those reads to find a consensus. BBDuk is very fast in this kind of situation.

            Comment


            • #7
              Originally posted by Brian Bushnell View Post
              bbduk.sh in=reads.fq outm=matching.fq ref=sequence.fa k=25 edist=2
              So, the input here would then be the raw FASTQ reads from the sequencing, and the reference just my sequence of interest? I don't know that much about sizes of k-mers, but are there other altneratives / reasons to use something other than k=25, such as longer ones (approaching the read lengths, for example)?

              [EDIT]

              I used the following command to run BBDuk:

              bbduk.sh in1=sample_1.fastq.gz in2=sample_2.fastq.gz outm1=matching1.fq outm2=matching2.fq ref=sequence.fa k=25 edist=2

              ... which gave me ~36000 matching k-mers in matching1.fq. I'm not sure what this means... I find 36000 k-mers of length 25 in the reads of my experiment matching k-mers of the same length in the reference sequence? How does this tell me if the sequence exists in my data, or is that something I have to do downstream (you mention assembling the reads to a consensus)?
              Last edited by ErikFas; 09-29-2015, 12:50 AM.

              Comment


              • #8
                The reads that matched at least share a 25 bp sequence (with up to 2 edits) with your "reference 5kb". You could take the reads in matching files and try to assemble them.

                You could also lengthen the k parameter (and reduce edit distance) to see if you are able to match longer lengths.

                Comment


                • #9
                  Ok, I'll do that then. I have never tried assembling a transcriptome before, so I don't really know how to go about doing it. Is it Trinity that's used, or is there a more simpler solution for my smaller data set?

                  Comment


                  • #10
                    Is the actual inactivating "insert" 5kb or is that the entire region you are looking at?

                    Since you have a potential set of reads identified how about just trying to align them (use BBMap) to the 5kb reference. At 25bp original match that is only about a 1/4th of the read so the rest of the read may not align well/map at all.

                    Comment


                    • #11
                      It's the whole region that we're looking at, at least for this question.

                      Oh, you can align with BBmap? How do I do that, which functions, etc?

                      Originally posted by GenoMax View Post
                      At 25bp original match that is only about a 1/4th of the read so the rest of the read may not align well/map at all.
                      I'm not sure I understand what you're saying. Are you saying that a 25 bp k-mer is too short to do this, and that I should use a longer one?

                      Also, can do I this for fastq-data from either RNA-seq or DNA-seq, or is it specific (or requires different parameters) for one of them?

                      Comment


                      • #12
                        Originally posted by ErikFas View Post
                        It's the whole region that we're looking at, at least for this question.
                        So most of the sequence should be from the real reference with a part (size?) coming from the foreign insert?

                        Oh, you can align with BBmap? How do I do that, which functions, etc?
                        Absolutely. Something along the lines of (provided samtools is available in PATH and your R1/R2 files match in read order)
                        Code:
                        $ bbmap.sh in1=matching1.fq in2=matching2.fq ref=sequence.fa out=alignments.bam ambig=all
                        I'm not sure I understand what you're saying. Are you saying that a 25 bp k-mer is too short to do this, and that I should use a longer one?
                        I was thinking that a read got selected because at least 25 bp (from 100/125 total) matched the reference. Rest of the read may or may not map at all so when you try the alignment. In that process you may end up losing some/many of the 36000 reads that were originally selected by BBduk.
                        Last edited by GenoMax; 09-29-2015, 06:43 AM.

                        Comment


                        • #13
                          When I use your command I do get an alignment, but I don't know exactly what's happened, as I get as many reads as the total amount from the two input files, and it seems odd. There's 36416 k-mers from matching1.fq / matching2.fq each, and I get 73832 aligned reads in the output SAM file. It would make more sense if, like you said, I lose some of the reads in the alignment because the reads above and beyond the k-mer size of 25 might not align.

                          Disregarding the weird result, how do I actually look in this file for my sequence? I mean, it's not a consensus sequence (as in a single, long sequence), right? Even so, just having something aligned to the sequence I'm interested in should tell me that the sequence is in the data...?

                          Comment


                          • #14
                            You should visually inspect the alignment (e.g. use IGV and 5kb reference) and check what the alignments look like. It is possible that you do have good alignments. Can you post the alignment summary statistic for BBMap?

                            Comment


                            • #15
                              Originally posted by ErikFas View Post
                              When I use your command I do get an alignment, but I don't know exactly what's happened, as I get as many reads as the total amount from the two input files, and it seems odd. There's 36416 k-mers from matching1.fq / matching2.fq each, and I get 73832 aligned reads in the output SAM file. It would make more sense if, like you said, I lose some of the reads in the alignment because the reads above and beyond the k-mer size of 25 might not align.
                              Are you sure you have 73832 aligned reads? By default, BBMap includes both aligned and unaligned reads in the output. That said, it's also possible that all of the reads aligned. As GenoMax said, the statistics output from BBMap (what it prints to the screen) would be helpful.

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Current Approaches to Protein Sequencing
                                by seqadmin


                                Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                                04-04-2024, 04:25 PM
                              • seqadmin
                                Strategies for Sequencing Challenging Samples
                                by seqadmin


                                Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                03-22-2024, 06:39 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, 04-11-2024, 12:08 PM
                              0 responses
                              25 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-10-2024, 10:19 PM
                              0 responses
                              28 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-10-2024, 09:21 AM
                              0 responses
                              24 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-04-2024, 09:00 AM
                              0 responses
                              52 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X