  • Where to start processing Illumina HiSeq aptamer data?

    Not a bioinformatician in the slightest, but need to start wearing that hat.

    I had my last round of SELEX sequenced on an Illumina HiSeq, and now I have no idea how to go about processing the data.

    I know the expected length and the end sequences (the primers), but I don't have a reference sequence to align to, and alignment is what most of the help I've been able to find talks about.

    I guess I need to trim and filter the reads so I can pick the top __ most abundant (depending on how many individual sequences were found)? Just looking for some direction.

    Any help most appreciated!

  • #2
    As a first step, scan your data and trim it to remove adapter sequences. It sounds like you expect the reads to contain adapters. One of the better options for this is Trimmomatic: http://www.usadellab.org/cms/?page=trimmomatic.

    Another simple option would be bbduk (Post #4: http://seqanswers.com/forums/showthread.php?t=41976).

    If you truly want to find a non-redundant dataset then take a look at this software: http://bioinformatics.oxfordjournals...7/18/2502.full

    • #3
      It seems you are working on this type of data: http://en.wikipedia.org/wiki/Systema...ial_Enrichment ???

      How long are the sequences between the adapters, and how long are your reads? Perhaps you should first select reads by the presence of adapter fragments, assuming those are the reads of interest?

      • #4
        If you know your adapter sequences, one quick and dirty look would be to do this at your Unix command line:

        grep ADAPTERSEQ1 file_name | grep ADAPTERSEQ2 | cut -c 10-20

        The two grep commands find the lines that contain both adapter sequences and pass them to the cut command, which extracts the enriched sequence between the adapters (10-20 is just an example; you'd have to start and end at the proper positions, and this only works if the adapters sit at the same position in every read).

        The resulting sequences could be pasted into a motif detection tool like DREME http://meme.nbcr.net/meme/cgi-bin/dreme.cgi

        I'm not that current with tools for motif detection from ChIP-Seq data, so there may be better ones out there.

        Perhaps even easier:
        grep ADAPTERSEQ1 file_name | grep ADAPTERSEQ2 | cut -c 10-20 | sort | uniq -c | sort -rn

        This takes the enriched sequences and sorts them so that identical sequences end up next to each other (uniq only collapses adjacent duplicate lines). It then passes them to uniq -c, which collapses the list to one example of each unique sequence along with the number of times it occurs, and finally sorts on that number, largest first. Most SELEX data get particular sequences enriched plus some variants, and you'd see that pretty quickly in the resulting list.
        Last edited by SNPsaurus; 04-08-2014, 06:07 PM.
        Providing nextRAD genotyping and PacBio sequencing services. http://snpsaurus.com
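A possible refinement of the pipeline in this post, as a minimal sketch rather than a tested workflow: instead of cutting fixed columns, cut each insert out by matching its two flanking sequences, and look only at the sequence line of each FASTQ record. The function name count_between is hypothetical, the input is assumed to be a plain four-line-per-record FASTQ stream with uppercase bases, and grep -o is assumed to be available (GNU and BSD grep both provide it).

```shell
# Hypothetical helper: count_between LEFT RIGHT < reads.fastq
# Prints "count<TAB>insert" for each distinct insert, most abundant first.
count_between() {
    left=$1
    right=$2
    awk 'NR % 4 == 2' |                            # keep only the sequence line of each FASTQ record
    grep -o "${left}[ACGTN]*${right}" |            # keep the span from the left flank to the right flank
    sed -e "s/^${left}//" -e "s/${right}\$//" |    # strip both flanks, leaving only the insert
    LC_ALL=C sort | uniq -c |                      # sort first so identical inserts are adjacent for uniq
    LC_ALL=C sort -k1,1nr -k2,2 |                  # descending count, ties broken alphabetically
    awk '{ print $1 "\t" $2 }'                     # normalise the padded output of uniq -c
}
```

It would be called as `count_between ADAPTERSEQ1 ADAPTERSEQ2 < file_name`, or with `gzip -dc reads.fastq.gz |` in front for compressed input. Reads where the flanks are missing or contain sequencing errors are simply dropped, so the counts are a lower bound.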

        • #5
          If I understood what you were trying to accomplish, I might be able to help. Can you rephrase it as though you were talking to someone who had no idea what you were doing, as verbosely as possible?

          • #6
            Originally posted by GenoMax:
            As a first step start scanning your data/trimming it to remove adapter sequences. It sounds like you expect the reads to contain adapters. One of the better options to do this is trimmomatic: http://www.usadellab.org/cms/?page=trimmomatic.

            Another simple option would be bbduk (Post #4: http://seqanswers.com/forums/showthread.php?t=41976).

            If you truly want to find a non-redundant dataset then take a look at this software: http://bioinformatics.oxfordjournals...7/18/2502.full
            Great, thank you! bbduk might be the right option for me.

            • #7
              Originally posted by luc:
              It seems you are working on this type of data: http://en.wikipedia.org/wiki/Systema...ial_Enrichment ???

              How long are the sequences between the adapters and how long are your reads? It seems like you perhaps would select reads first for the presence of adapter fragments - assuming these are the reads of interest??
              Yes, that's exactly what I'm doing. The sequence is 60 bp between adapters, and reads will be 100 bp (the centre said to choose a length that encapsulates my sequence, not including adapters).

              If I'm not mistaken, all reads should have adapter fragments so this might not be the best way to filter the data...?
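One way to check that assumption before deciding how to filter is to count how many reads actually contain both flanking sequences in the expected order. The sketch below is only illustrative: flank_hit_rate is a hypothetical name, the two flanks are passed in as arguments, and the input is assumed to be a plain four-line-per-record FASTQ stream.

```shell
# Hypothetical helper: flank_hit_rate LEFT RIGHT < reads.fastq
# Reports how many reads contain LEFT followed (anywhere later) by RIGHT.
flank_hit_rate() {
    awk -v left="$1" -v right="$2" '
        NR % 4 == 2 {
            total++
            i = index($0, left)
            # the right flank must start after the left flank has ended
            if (i > 0 && index(substr($0, i + length(left)), right) > 0) both++
        }
        END {
            printf "%d of %d reads (%.1f%%) contain both flanks\n",
                   both, total, total ? 100 * both / total : 0
        }'
}
```

If the reported percentage is far below 100, filtering on the flanks is doing real work; if it is close to 100, the filter mostly confirms that the library looks as expected.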

              • #8
                Originally posted by SNPsaurus:
                If you know your adapter sequences, one quick and dirty look would be to do this at your Unix command line:

                grep ADAPTERSEQ1 file_name | grep ADAPTERSEQ2 | cut -c 10-20

                The two grep commands find the lines that contain both adapter sequences and pass them to the cut command, which extracts the enriched sequence between the adapters (10-20 is just an example; you'd have to start and end at the proper positions, and this only works if the adapters sit at the same position in every read).

                The resulting sequences could be pasted into a motif detection tool like DREME http://meme.nbcr.net/meme/cgi-bin/dreme.cgi

                I'm not that current with tools for motif detection from ChIP-Seq data, so there may be better ones out there.

                Perhaps even easier:
                grep ADAPTERSEQ1 file_name | grep ADAPTERSEQ2 | cut -c 10-20 | sort | uniq -c | sort -rn

                This takes the enriched sequences and sorts them so that identical sequences end up next to each other (uniq only collapses adjacent duplicate lines). It then passes them to uniq -c, which collapses the list to one example of each unique sequence along with the number of times it occurs, and finally sorts on that number, largest first. Most SELEX data get particular sequences enriched plus some variants, and you'd see that pretty quickly in the resulting list.
                Thanks, this looks useful for once I get a bit further ahead with the data processing!

                • #9
                  Originally posted by Brian Bushnell:
                  If I understood what you were trying to accomplish, I might be able to help. Can you rephrase it as though you were talking to someone who had no idea what you were doing, as verbosely as possible?
                  Okay, I'm going to give it my best shot, leaving out the boring SELEX details.
                  _________________________________________
                  I started with a pool of DNA that looked like this: GTTGACTGTAGGTCA - N30 - GAGCATCGGACAACG. That's A TON of (~10^17) different sequences. I manipulated the DNA several times over to (hopefully) get rid of many of those sequences.

                  I now need to figure out which sequences are left and their relative amounts (i.e., #1 = 20%, #2 = 2%, #3 onward = remaining 78%). Based on a recommendation, I sent my pool off to be sequenced on an Illumina HiSeq.

                  I amplified my sequence with indexes F1 & R1 (no staggering, final size 212 bp). The sample was spiked with 50% phiX for HiSeq paired end (100) sequencing.

                  I received a couple of fastq.gz files I now need to work with, but don't really know how to start! With direction I'm a fast learner, but I have never done anything like this before.
                  _____________________________________________
                  That's all the info w/out writing a novel, thanks for trying to help!
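For the construct described in this post (15 nt constant region, N30, 15 nt constant region), the relative amounts could be sketched roughly as below. This is an illustrative sketch, not a validated SELEX pipeline: aptamer_fractions is a hypothetical name, only exact matches to both constant regions are counted, the random region is assumed to be exactly 30 nt, and the input is assumed to be a four-line-per-record FASTQ stream from read 1.

```shell
FWD=GTTGACTGTAGGTCA   # 5' constant region quoted in the post
REV=GAGCATCGGACAACG   # 3' constant region quoted in the post

# Hypothetical helper: aptamer_fractions < reads.fastq
# Prints "percent<TAB>count<TAB>insert", most abundant insert first.
aptamer_fractions() {
    awk 'NR % 4 == 2' |                       # sequence lines only
    grep -oE "${FWD}[ACGT]{30}${REV}" |       # exact 15 + 30 + 15 nt construct
    cut -c 16-45 |                            # columns 16-45 are the 30 nt random region
    LC_ALL=C sort | uniq -c |                 # count identical inserts
    awk '{ n[$2] = $1; total += $1 }
         END { for (s in n) printf "%.2f\t%d\t%s\n", 100 * n[s] / total, n[s], s }' |
    LC_ALL=C sort -k2,2nr -k3,3               # descending count, ties broken alphabetically
}
```

Usage would be something like `gzip -dc sample_R1.fastq.gz | aptamer_fractions | head -n 20`, where the file name is a placeholder. The percentages are relative to reads that matched the full construct, not to all reads in the file, so phiX and error-containing reads do not enter the denominator. Read 2 of a pair may carry the reverse complement of the construct and would need reverse-complementing before the same match applies.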

                  • #10
                    Google for papers on NGS analysis of SELEX data, and see what software or methods they used to analyse their reads.

                    This might be helpful:

                    Background: SELEX is an iterative process in which highly diverse synthetic nucleic acid libraries are selected over many rounds to finally identify aptamers with desired properties. However, little is understood as how binders are enriched during the selection course. Next-generation sequencing offers the opportunity to open the black box and observe a large part of the population dynamics during the selection process.

                    Methodology: We have performed a semi-automated SELEX procedure on the model target streptavidin starting with a synthetic DNA oligonucleotide library and compared results obtained by the conventional analysis via cloning and Sanger sequencing with next-generation sequencing. In order to follow the population dynamics during the selection, pools from all selection rounds were barcoded and sequenced in parallel.

                    Conclusions: High affinity aptamers can be readily identified simply by copy number enrichment in the first selection rounds. Based on our results, we suggest a new selection scheme that avoids a high number of iterative selection rounds while reducing time, PCR bias, and artifacts.

                    • #11
                      Originally posted by mastal:
                      Google for papers on NGS analysis of SELEX data, and see what software or methods they used to analyse their reads.

                      This might be helpful:

                      http://www.plosone.org/article/info%...0029604-Weese1
                      Yes! I've been going through publications and that's one of the papers I've read. Thanks!

                      • #12
                        Start with FastQC and look at big picture stats (# of reads available). FastQC is probably going to flag many sections of the report as failed but you can ignore those for now. If you want to post the Q-score plot/Nucleotide distribution here, feel free. It sounds like all of your reads are going to have more or less the same sequence at the beginning and end and bbduk should help address those. Run another FastQC analysis after the trimming to see the difference.

                        Are you familiar with unix? If not it may be time to start learning.
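For a first look at the big-picture numbers mentioned in this post (how many reads there are and how long they are), a few lines of shell can be enough. This is a rough sketch and not a substitute for the FastQC report: fastq_summary is a hypothetical name, and the input is assumed to be a gzip-compressed FASTQ file with four lines per record.

```shell
# Hypothetical helper: fastq_summary FILE.fastq.gz
# Prints one "length<TAB>bases<TAB>reads" line per read length (ascending),
# followed by a final "reads<TAB>total" line.
fastq_summary() {
    gzip -dc "$1" |
    awk 'NR % 4 == 2 { print length($0) }' |   # length of every sequence line
    LC_ALL=C sort -n | uniq -c |               # tally reads per length
    awk '{ total += $1; print "length\t" $2 "\t" $1 }
         END { print "reads\t" total + 0 }'
}
```

Running it on the file before and after trimming should show the read lengths shifting once the constant flanking sequences have been removed.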
