**GenoMax** · 04-08-2014, 02:40 PM

As a first step, scan your data and trim it to remove adapter sequences; it sounds like you expect the reads to contain adapters. One of the better tools for this is Trimmomatic: http://www.usadellab.org/cms/?page=trimmomatic

Another simple option would be bbduk (Post #4: http://seqanswers.com/forums/showthread.php?t=41976).

If you truly want to find a non-redundant dataset then take a look at this software: http://bioinformatics.oxfordjournals...7/18/2502.full

**luc** · 04-08-2014, 03:06 PM

Are you working with this type of data? http://en.wikipedia.org/wiki/Systema...ial_Enrichment

How long are the sequences between the adapters, and how long are your reads? You would probably want to first select reads that contain the adapter fragments, assuming those are the reads of interest.

**SNPsaurus** · 04-08-2014, 06:03 PM

If you know your adapter sequences, one quick and dirty look would be to do this at your Unix command line:

grep ADAPTERSEQ1 file_name | grep ADAPTERSEQ2 | cut -c 10-20

The two grep commands keep only the lines that contain both adapter sequences and pass them to cut, which extracts the enriched sequence between the adapters by character position (10-20 is just an example; you would need to start and end at the proper positions).

The resulting sequences could be pasted into a motif detection tool like DREME: http://meme.nbcr.net/meme/cgi-bin/dreme.cgi

I'm not that current with tools for motif detection from ChIP-Seq data, so there may be better ones out there.

Perhaps even easier:
grep ADAPTERSEQ1 file_name | grep ADAPTERSEQ2 | cut -c 10-20 | sort | uniq -c | sort -rn

This takes the enriched sequences and sorts them so that identical sequences end up next to each other (uniq only collapses adjacent duplicates). It then passes them to uniq -c, which collapses the list to one example of each unique sequence along with the number of times it occurs, and finally sorts on that number, highest first. Most SELEX data get particular sequences enriched plus some variants, and you'd see that pretty quickly in the resulting list.
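For what it's worth, here is a rough sketch of the same idea that does not depend on fixed character positions. It looks only at the sequence line of each FASTQ record and prints whatever sits between the two adapters, then counts and ranks the results. The function name count_inserts is made up for this example, ADAPTERSEQ1 and ADAPTERSEQ2 are still placeholders, and it assumes an uncompressed 4-line-per-record FASTQ with both adapters on the read in the orientation given (plain A/C/G/T strings, no regex characters).

```shell
#!/bin/sh
# count_inserts: hypothetical helper, not part of any existing tool.
# Usage: count_inserts LEFT_ADAPTER RIGHT_ADAPTER < reads.fastq
# Assumes 4-line FASTQ records and adapters made of plain A/C/G/T letters.
count_inserts() {
    left=$1
    right=$2
    # Keep only the sequence line of each record (line 2 of every 4).
    awk 'NR % 4 == 2' |
        # Print the text between the two adapters; reads lacking either are dropped.
        sed -n "s/.*${left}\(.*\)${right}.*/\1/p" |
        # Sort so identical inserts are adjacent, count them, rank by count.
        sort | uniq -c | sort -k1,1nr -k2,2 |
        # Normalise the padded uniq output to "count<TAB>sequence".
        awk '{ print $1 "\t" $2 }'
}
```

Something like count_inserts ADAPTERSEQ1 ADAPTERSEQ2 < file_name would then give one line per distinct insert, most frequent first.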

**Brian Bushnell** · 04-08-2014, 06:20 PM

If I understood what you were trying to accomplish, I might be able to help. Can you rephrase it as though you were talking to someone who had no idea what you were doing, as verbosely as possible?

**sciash** · 04-09-2014, 09:38 AM

In reply to GenoMax:

Great, thank you! bbduk might be the right option for me.

**sciash** · 04-09-2014, 09:56 AM

In reply to luc:

Yes, that's exactly what I'm doing. The sequence is 60 bp between adapters, and reads will be 100 bp (the centre said to choose a length that encapsulates my sequence, not including adapters).

If I'm not mistaken, all reads should have adapter fragments so this might not be the best way to filter the data...?

**sciash** · 04-09-2014, 09:58 AM

In reply to SNPsaurus:

Thanks, this looks useful for once I get a bit further ahead with the data processing!

**sciash** · 04-09-2014, 10:21 AM

In reply to Brian Bushnell:

Okay, I'm going to give it my best shot, leaving out the boring SELEX details.
_________________________________________
I started with a pool of DNA that looked like this: GTTGACTGTAGGTCA - N30 - GAGCATCGGACAACG. That's A TON of (~10^17) different sequences. I manipulated the DNA several times over to (hopefully) get rid of many of those sequences.

I now need to figure out which sequences are left, and their relative amounts (i.e., #1 = 20%, #2 = 2%, #3 onward = remaining 78%). Based on a recommendation, I sent my pool off to be sequenced on an Illumina HiSeq.

I amplified my sequence with indexes F1 & R1 (no staggering, final size 212 bp). The sample was spiked with 50% phiX for HiSeq paired-end (100 bp) sequencing.

I received a couple of fastq.gz files I now need to work with, but don't really know how to start! With direction I'm a fast learner, but I have never done anything like this before.
_____________________________________________
That's all the info without writing a novel, thanks for trying to help!
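As a rough illustration of the goal described above (which sequences are left, and in what proportion), here is a minimal sketch that pulls out whatever lies between the two constant regions given in the post and reports each distinct insert as a percentage. The function name insert_fractions and the output layout are assumptions made for this example. It only looks at reads in the forward orientation (read 2 of a pair would carry the reverse complement), it does not check that the insert is exactly 30 nt, and it does no quality filtering. Percentages are relative to reads where both constant regions were found, so the phiX spike-in drops out on its own.

```shell
#!/bin/sh
# Constant regions as given in the post; the N30 insert sits between them.
LEFT=GTTGACTGTAGGTCA
RIGHT=GAGCATCGGACAACG

# insert_fractions: hypothetical helper, not part of any existing tool.
# Usage: insert_fractions reads.fastq.gz
# Prints "percent<TAB>count<TAB>insert", most abundant insert first.
insert_fractions() {
    gzip -dc < "$1" |
        awk -v left="$LEFT" -v right="$RIGHT" '
            NR % 4 == 2 {                      # sequence line of each record
                start = index($0, left)
                if (start == 0) next           # no left constant region
                rest = substr($0, start + length(left))
                stop = index(rest, right)
                if (stop == 0) next            # no right constant region
                count[substr(rest, 1, stop - 1)]++
                total++
            }
            END {
                for (seq in count)
                    printf "%.2f\t%d\t%s\n", 100 * count[seq] / total, count[seq], seq
            }' |
        # Rank by count (column 2), ties broken by sequence.
        sort -t "$(printf '\t')" -k2,2nr -k3,3
}
```

Running insert_fractions on the read 1 file and piping the output to head would show the most enriched sequences and their share of the pool.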

**mastal** · 04-09-2014, 12:16 PM

Google for papers on NGS analysis of SELEX data, and see what software or methods they used to analyse their reads.

This might be helpful:

Probing the SELEX Process with Next-Generation Sequencing

http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0029604#pone.0029604-Weese1

Background: SELEX is an iterative process in which highly diverse synthetic nucleic acid libraries are selected over many rounds to finally identify aptamers with desired properties. However, little is understood as how binders are enriched during the selection course. Next-generation sequencing offers the opportunity to open the black box and observe a large part of the population dynamics during the selection process. Methodology: We have performed a semi-automated SELEX procedure on the model target streptavidin starting with a synthetic DNA oligonucleotide library and compared results obtained by the conventional analysis via cloning and Sanger sequencing with next-generation sequencing. In order to follow the population dynamics during the selection, pools from all selection rounds were barcoded and sequenced in parallel. Conclusions: High affinity aptamers can be readily identified simply by copy number enrichment in the first selection rounds. Based on our results, we suggest a new selection scheme that avoids a high number of iterative selection rounds while reducing time, PCR bias, and artifacts.

**sciash** · 04-09-2014, 12:22 PM

In reply to mastal:

Yes! I've been going through publications and that's one of the papers I've read. Thanks!

**GenoMax** · 04-09-2014, 02:27 PM

Start with FastQC and look at the big-picture stats (number of reads available). FastQC will probably flag many sections of the report as failed, but you can ignore those for now. If you want to post the Q-score plot or nucleotide distribution here, feel free. It sounds like all of your reads will have more or less the same sequence at the beginning and end, and bbduk should help trim those. Run FastQC again after trimming to see the difference.

Are you familiar with Unix? If not, it may be time to start learning.
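If it helps as a starting point on the command line, here is a small sketch of two sanity checks you can run on a fastq.gz file before and after trimming: how many reads it holds, and which sequences the reads most often start with. Both function names are invented for this example, and both assume the usual 4 lines per FASTQ record. The trimming itself is left to bbduk or Trimmomatic.

```shell
#!/bin/sh
# count_reads: hypothetical helper. Prints the number of records in a
# gzipped FASTQ file. A non-integer result hints at a truncated file.
count_reads() {
    gzip -dc < "$1" | awk 'END { print NR / 4 }'
}

# top_prefixes: hypothetical helper. Prints the five most common read
# starts as "count<TAB>prefix". The prefix length defaults to 15 bases.
top_prefixes() {
    gzip -dc < "$1" |
        awk -v n="${2:-15}" 'NR % 4 == 2 { print substr($0, 1, n) }' |
        sort | uniq -c | sort -k1,1nr -k2,2 | head -n 5 |
        awk '{ print $1 "\t" $2 }'
}
```

If nearly every read starts with the same 15 bases, the constant region is still on the reads. A large second group with unrelated starts is most likely the phiX spike-in.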

Where to start processing Illumina Hi-Seq aptamer data?