SEQanswers

07-18-2013, 06:21 AM   #1
Marina_P
Junior Member
 
Location: USA

Join Date: Jul 2013
Posts: 6
RepeatExplorer

Hi everyone, I'm new to this forum!
I'd love some guidance from you!

I've started using RepeatExplorer to find highly repeated sequences in Illumina MiSeq data. We don't have reference genomes, because we are working on parasites whose genomes have not been sequenced yet (only the mitochondrial genome).
We need the repeats to develop a diagnostic kit, so we don't care much about assembling the sequence.

After running for hours and hours, RepeatExplorer outputs clusters and graphs (so many graphs), and I'm a bit confused.
Has anyone had the same experience?

Thank you very much!

08-14-2013, 02:36 PM   #2
htetre
Member
 
Location: US

Join Date: Jul 2013
Posts: 26

Hello Marina_P,

I just started using RepeatExplorer as well. Yes, there are many graphs. I actually came to the forum to see if there was any discussion about the program that would help me understand some of the details of the data. Sadly, I found no threads other than yours.

How is it going? Are you able to identify most of your clusters/graphs as elements? And how do you deal with clusters that have no identification?

I have just started with RepeatExplorer, so maybe with two of us on the forum we can bounce things off each other?

Cheers

10-06-2013, 01:02 PM   #3
Marina_P
Junior Member
 
Location: USA

Join Date: Jul 2013
Posts: 6

Hey htetre,

The thread was idle for about a month, so I stopped checking it and, as you can see, didn't look at it for almost 2 months!

So? Did you figure everything out?
I picked the clusters that looked more homogeneous to me, but not the first good ones, because when I BLASTed them the sequences they contained were mitochondrial, something I want to avoid.
What are you looking for?

Best,
Marina

03-13-2014, 02:23 AM   #4
jimacas
Junior Member
 
Location: Czech Republic

Join Date: Mar 2014
Posts: 9
RepeatExplorer workshop

Hi, you might be interested in a practical course on using RepeatExplorer and interpreting its results:
http://w3lamc.umbr.cas.cz/repeatexplorer/?page_id=14

Jiri

03-13-2014, 05:45 AM   #5
Marina_P
Junior Member
 
Location: USA

Join Date: Jul 2013
Posts: 6

Thanks, jimacas, for your response!
I came across this announcement as well.
Unfortunately, I'm in the US now, so I don't think I will be able to make it. It seems like a great opportunity to figure out what you're looking for or to interpret your results, though.
Thank you again!

Have a nice day!
Marina

03-13-2014, 06:53 AM   #6
jimacas
Junior Member
 
Location: Czech Republic

Join Date: Mar 2014
Posts: 9

Hi Marina,

It is a pity you cannot make it to the course. You can at least have a look at some of our presentations from the previous workshop; I made them available here: http://w3lamc.umbr.cas.cz/repeatexplorer/?page_id=125

Best, Jiri

03-13-2014, 09:46 AM   #7
Marina_P
Junior Member
 
Location: USA

Join Date: Jul 2013
Posts: 6

This is extremely helpful, Jiri, thank you so much for that!

I hope I can be as helpful to you in the future.

:-) I'll go through your work and will come back if I have any questions!

Thanks a million!

All the best,
Marina

03-21-2014, 01:32 PM   #8
dsenalik
Carrot Scientist
 
Location: Madison WI USA

Join Date: Nov 2009
Posts: 41

I am posting a solution for a different problem here for potential Google searchers; it took me some time to track down.

Using the command-line RepeatExplorer with the --sq_rename parameter, the following error occurred:
Code:
Calculating graph layouts
2014-03-21 09:56

 reading .cls file
original cluster CL 1 was above threshold!, sample of graph is used
original cluster CL 2 was above threshold!, sample of graph is used
Error in { : task 1 failed - "line 1 did not have 3 elements"
Calls: %dopar% -> <Anonymous>
Execution halted
exit status:1
This error is ultimately caused by a '#' character in the read names, as is found in some Illumina reads, e.g. >XXX2XX4ACXX:1:1101:1441:2408#CAAGGAGCA/1

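More specifically, in
repeatexplorer/umbr_programs/seqclust/programs/clusters2graphs.R
the command
gd=read.table(file=ncolfile,sep='\t',header=F,as.is=T,col.names=c(1,2,'weight'))
fails if there are '#' characters in the read names. By default, R's read.table treats '#' as a comment character (comment.char = "#"), so the rest of the line is discarded and fewer than 3 fields remain.

My solution was simply to remove '#' from the read names.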
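
For anyone who wants to script that cleanup, here is a minimal Python sketch of one way to do it. It is not part of RepeatExplorer; the function name `sanitize_fasta_headers` and its `replacement` parameter are hypothetical, and it assumes plain FASTA input where header lines start with '>'. Only header lines are touched, so the sequence lines pass through unchanged.

```python
def sanitize_fasta_headers(lines, replacement=""):
    """Yield FASTA lines with '#' removed (or replaced) in header lines only.

    `lines` is any iterable of text lines, e.g. an open file handle.
    `replacement` is a hypothetical convenience parameter; the default
    mirrors the fix described above, which simply removes the '#'.
    """
    for line in lines:
        if line.startswith(">"):
            yield line.replace("#", replacement)
        else:
            yield line


# Small in-memory demonstration using the read name quoted above.
demo = [">XXX2XX4ACXX:1:1101:1441:2408#CAAGGAGCA/1\n", "ACGTACGT\n"]
for cleaned in sanitize_fasta_headers(demo):
    print(cleaned, end="")
```

To apply it to a real file, pass an open input handle to the function and write each yielded line to a new output file before running RepeatExplorer on the result.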

03-21-2014, 04:08 PM   #9
Marina_P
Junior Member
 
Location: USA

Join Date: Jul 2013
Posts: 6

Dear dsenalik,

What were you trying to do on the command line?

Something with the graphs and the repeat layouts?

Thanks for that; I'm sure a lot of people have run into the same struggle.

:-)

M.

03-22-2014, 08:38 AM   #10
dsenalik
Carrot Scientist
 
Location: Madison WI USA

Join Date: Nov 2009
Posts: 41

Dear Marina_P,
I have about 30 genotypes I want to analyze, and it is easier to run them on my own server than on the Galaxy server; I also don't want to overload it! Well, easier only once everything is installed properly: there were a number of dependencies to install or configure.
It might help someone else, so here are my installation notes.

My plan is to see if all genotypes have a particular repeat cluster of interest.
To do this, I have put sequences from that cluster from an initial analysis into a custom RepeatMasker database, and I hope to see if a corresponding cluster shows up annotated in the other genotypes. It will take some time to run all of these...

05-06-2014, 08:37 AM   #12
AleixArnau
Junior Member
 
Location: Barcelona

Join Date: Mar 2014
Posts: 7
RepeatExplorer singlet reads

Dear all,

I've been working with RepeatExplorer for the last month and I would like to get a FASTA file with all the singlet reads. It reports the number of singlet reads that aren't in any cluster, but I don't know how to get these reads (or even whether it's possible). Does anyone know if that is possible, or how I can get them?

Thanks in advance!

05-06-2014, 10:47 AM   #13
dsenalik
Carrot Scientist
 
Location: Madison WI USA

Join Date: Nov 2009
Posts: 41

The file that will list all reads in all clusters is
Code:
MyREoutputdir/seqClust/clustering/hitsort_PID90_LCOV55.cls
This file lists all reads in all clusters, even those too small for the summary HTML output. The numbers of clusters and of reads will match those in the summary graph at the top of the HTML output.
The format is a FASTA-style header line with the cluster number and the number of reads, followed by a second long line with all reads in that cluster,
e.g.
Code:
...
>CL13980 3
I01405774f I01340829r I01263003f
>CL13981 3
I01149129r I01499415r I01202179f
...
Doing what you want would take some programming or clever shell scripting: exclude any read whose ID is in this file, and what is left are the unclustered reads.

One way that might work:

1. Make a file with list of IDs to exclude
Code:
grep -v ">" MyREoutputdir/seqClust/clustering/hitsort_PID90_LCOV55.cls | tr " " "\n" > Myexcludelist.txt
The renamed input sequences in FASTA format can be found at
Code:
MyREoutputdir/seqClust/sequences/seqClust
You could then use biopieces to exclude these reads
Code:
read_fasta -i MyREoutputdir/seqClust/sequences/seqClust | grab -i -E Myexcludelist.txt | write_fasta -xo Mysinglecopy.fasta
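
If biopieces is not installed, the same exclusion can be sketched in plain Python. This is only an illustration under stated assumptions: the function names are hypothetical, the .cls layout is taken to be exactly as shown above (a '>' header line followed by a line of space-separated read IDs), and the read IDs in the renamed FASTA file are assumed to match those in the .cls file.

```python
def clustered_ids(cls_lines):
    """Collect every read ID from .cls lines; lines starting with '>' are cluster headers."""
    ids = set()
    for line in cls_lines:
        if not line.startswith(">"):
            ids.update(line.split())
    return ids


def read_id(header):
    """Return the first whitespace-delimited token after '>' (empty string if none)."""
    parts = header[1:].split()
    return parts[0] if parts else ""


def singlet_records(fasta_lines, exclude):
    """Yield (header, sequence) for FASTA records whose ID is not in `exclude`."""
    header, seq = None, []
    for line in fasta_lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None and read_id(header) not in exclude:
                yield header, "".join(seq)
            header, seq = line, []
        elif header is not None:
            seq.append(line)
    # Emit the final record, which has no following header to trigger it.
    if header is not None and read_id(header) not in exclude:
        yield header, "".join(seq)


# In-memory demonstration; I09999999f is a made-up ID that is in no cluster.
cls = [">CL13980 3", "I01405774f I01340829r I01263003f"]
fasta = [">I01405774f", "ACGT", ">I09999999f", "GGCC", "TTAA"]
for header, seq in singlet_records(fasta, clustered_ids(cls)):
    print(header)
    print(seq)
```

For real data, the two lists would be replaced by open handles on the .cls file and the renamed FASTA file named above. Holding the clustered IDs in a set keeps each lookup fast even with millions of reads.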

05-07-2014, 03:46 AM   #14
AleixArnau
Junior Member
 
Location: Barcelona

Join Date: Mar 2014
Posts: 7

Thanks very much dsenalik!

You have solved my problem!

06-16-2014, 11:48 AM   #15
dsenalik
Carrot Scientist
 
Location: Madison WI USA

Join Date: Nov 2009
Posts: 41
Telomeres not clustered by RepeatExplorer

(I am posting this here for lack of a better place, just for information.)

I discovered that I had reads that consisted entirely of the classic Arabidopsis telomere repeat, i.e.
AGGGTTT
But despite adequate abundance, none of these reads show up in any clusters. However, a smaller number of reads that consist about two-thirds of this motif did get clustered.
It is probably some aspect of the clustering process that can't handle a 7-nucleotide repeat motif.
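
To check how many reads are affected independently of the clustering, one could count reads that are pure copies of the motif. The following is a minimal Python sketch, not anything RepeatExplorer provides: the function names are hypothetical, only exact matches are counted (no tolerance for sequencing errors), and both strands and any starting phase of the motif are accepted.

```python
def revcomp(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def is_pure_repeat(read, motif="AGGGTTT"):
    """True if `read` consists only of tandem copies of `motif` on either strand.

    The read may start and end at any position within the motif, because it is
    tested as a substring of a sufficiently long run of motif copies.
    """
    read = read.upper()
    if not read:
        return False
    copies = len(read) // len(motif) + 2  # enough copies to cover every phase
    return any(read in unit * copies for unit in (motif, revcomp(motif)))


# In-memory demonstration with made-up reads.
reads = ["GGGTTTAGGGTTTAGG", "CCCTAAACCCTAAA", "AGGGTTTAGCGTTT"]
print(sum(is_pure_repeat(r) for r in reads), "of", len(reads), "reads are pure telomere repeat")
```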

06-25-2014, 06:24 AM   #16
AleixArnau
Junior Member
 
Location: Barcelona

Join Date: Mar 2014
Posts: 7

Hi,

Does anyone know how I can get the "consensus sequence" of each cluster provided by RepeatExplorer? I mean, RepeatExplorer provides clusters of repeat elements, and I want to know how to get the consensus sequence of each cluster that is used to identify the repeat element through RepeatMasker.

I'm using RepeatExplorer with unmapped reads from the Begonia genome. The main clusters are not identified by RepeatMasker, so I would like to get the consensus sequence of each cluster to see what they look like, because we think they are probably repeat elements specific to Begonia.

Does anyone know?

Thanks!

06-25-2014, 08:04 AM   #17
dsenalik
Carrot Scientist
 
Location: Madison WI USA

Join Date: Nov 2009
Posts: 41

RepeatExplorer assembles the clustered reads of each cluster with CAP3.
These contigs are in a directory for each cluster, for example
seqClust/clustering/clusters/dir_CL0123/contigs_CL123
Some clusters are simple, but often there are many, many contigs!
I have tried MUSCLE to align the contigs, but you first need to reverse-complement the half of the contigs that are, by chance, in the opposite orientation.
You can also find out how many reads went into each CAP3 contig from the ACE file
seqClust/clustering/clusters/dir_CL0123/ACE_CL123.ace
and maybe select the contigs with the most reads.
It's not easy!
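
As a rough aid for that selection step, the read counts can be pulled from the ACE file with a few lines of Python. This is a minimal sketch with hypothetical function names. It relies on the ACE convention that each contig starts with a line of the form `CO <name> <bases> <reads> <segments> <U|C>`; the contig names in the demonstration are made up.

```python
def reads_per_contig(ace_lines):
    """Map contig name to read count, taken from the 'CO' header lines of an ACE file."""
    counts = {}
    for line in ace_lines:
        fields = line.split()
        # CO <contig name> <# bases> <# reads> <# base segments> <U or C>
        if len(fields) >= 4 and fields[0] == "CO":
            counts[fields[1]] = int(fields[3])
    return counts


def top_contigs(counts, n=5):
    """Return the n contigs with the most reads as (name, read_count) pairs."""
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)[:n]


# In-memory demonstration with made-up contig names and sizes.
demo = [
    "AS 2 5",
    "",
    "CO Contig1 300 3 10 U",
    "ACGTACGT",
    "",
    "CO Contig2 450 2 7 U",
    "GGCCGGCC",
]
for name, n_reads in top_contigs(reads_per_contig(demo)):
    print(name, n_reads)
```

Picking the contigs with the most reads only narrows the choice; the orientation problem described above still has to be handled before aligning them.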

08-06-2014, 01:22 PM   #18
jimacas
Junior Member
 
Location: Czech Republic

Join Date: Mar 2014
Posts: 9
RE: Telomeres not clustered by RepeatExplorer

Quote:
Originally posted by dsenalik:
(I am posting this here for lack of a better place, just for information.)

I discovered that I had reads that consisted entirely of the classic Arabidopsis telomere repeat, i.e.
AGGGTTT
But despite adequate abundance, none of these reads show up in any clusters. However, a smaller number of reads that consist about two-thirds of this motif did get clustered.
It is probably some aspect of the clustering process that can't handle a 7-nucleotide repeat motif.

Hi Doug, this is caused by masking simple sequence repeats (low complexity regions) during mgblast search, resulting in no similarity hits from reads entirely made of telomeric motifs. Unfortunately, there is no simple solution to this problem because disabling this masking option would probably generate many non-specific hits in read comparisons, leading to bad clustering results.

Best, Jiri

11-02-2016, 10:42 PM   #19
Heena_2002
Junior Member
 
Location: India

Join Date: Nov 2016
Posts: 4

Hi all,

I'm new to RepeatExplorer as well as to bioinformatics. Recently I ran RepeatExplorer on my low-pass genome sequencing data on the Galaxy server. However, the result files generated are empty, and I am not able to figure out what the problem is, as there was no indication of any error during the run.

If anyone has any details about this and can guide me, I'll be grateful.

Thanks!

11-03-2016, 02:11 AM   #20
jimacas
Junior Member
 
Location: Czech Republic

Join Date: Mar 2014
Posts: 9

Hello, could you provide more details about your run: how many reads did you submit, what tool did you use (probably RepeatExplorer / clustering?), and what exactly was the result of your run (did the history items turn green or red?)? What is the content of the analysis log (the item named "Log information of clustering...")?

Thanks,

Jiri Macas

Tags
clusters, diagnostic applications, miseq output, repeatexplorer
