SEQanswers

Old 05-25-2012, 06:17 AM   #1
flyyuan
Junior Member
 
Location: China

Join Date: Nov 2010
Posts: 3
Default question about Repeatmasker

I have tried to use RepeatMasker to analyze the repeat elements of the soybean genome; however, the percentage of masked sequence is only 30%, much less than the 60% reported in a published paper. Also, the README file for RepeatMasker records the repeat percentage for Hg18 as 35%, whereas other papers describe it as 50%. Could anyone tell me why there is such a difference?
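
One way to double-check a masked percentage independently of the summary table is to count masked bases directly in the masked FASTA output. The following is a minimal sketch, assuming hard masking with N or X, or soft masking with lowercase letters; the function name `masked_fraction` and the sample record are made up for illustration. Note that N runs already present in the assembly (gaps) are counted as masked here, so the figure can overstate the repeat content.

```python
import io


def masked_fraction(fasta_handle, soft=False):
    """Return the fraction of bases that are masked in a FASTA stream.

    Hard masking counts N/X characters; soft masking (soft=True) counts
    lowercase bases instead.
    """
    total = 0
    masked = 0
    for line in fasta_handle:
        line = line.strip()
        if not line or line.startswith(">"):
            continue  # skip blank lines and header lines
        total += len(line)
        if soft:
            masked += sum(1 for base in line if base.islower())
        else:
            masked += sum(1 for base in line if base in "NnXx")
    return masked / total if total else 0.0


# Made-up record: 4 of the 10 bases are hard-masked.
example = io.StringIO(">chr_demo\nACGTNNNNAC\n")
print(f"{masked_fraction(example):.0%}")
```

For a real run, the same function can be given an open file handle on the masked output instead of the in-memory example.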

Another strange thing is that when I run RepeatMasker on chr10 or chr15 of Hg19 (I only tested these two chromosomes of Hg19), the program always stops at the output-processing stage (at cycle 9), but it works well on soybean chromosome sequences and the whole genome sequence.

Thanks in advance.
Old 05-25-2012, 07:01 AM   #2
mike.t
Member
 
Location: Spain

Join Date: Mar 2010
Posts: 36
Default

Quote:
Originally Posted by flyyuan
I have tried to use RepeatMasker to analyze the repeat elements of the soybean genome; however, the percentage of masked sequence is only 30%, much less than the 60% reported in a published paper. Also, the README file for RepeatMasker records the repeat percentage for Hg18 as 35%, whereas other papers describe it as 50%. Could anyone tell me why there is such a difference?
Are you using RepBase? They probably used their own library of repeats which may have more repeat families than are in RepBase.

Quote:
Originally Posted by flyyuan
Another strange thing is that when I run RepeatMasker on chr10 or chr15 of Hg19 (I only tested these two chromosomes of Hg19), the program always stops at the output-processing stage (at cycle 9), but it works well on soybean chromosome sequences and the whole genome sequence.
What do you mean by 'stop'? Does it crash and give you an error message?
Old 05-25-2012, 07:56 AM   #3
flyyuan
Junior Member
 
Location: China

Join Date: Nov 2010
Posts: 3
Default

Quote:
Originally Posted by mike.t
Are you using RepBase? They probably used their own library of repeats which may have more repeat families than are in RepBase.
I used the latest RepeatMasker library downloaded from the RepBase site. The repeat elements for soybean were only just added in the latest version, so it may not include all the repeat families. That is why I tested the program on Hg19: to make sure I was using it correctly.
Quote:
Originally Posted by mike.t
What do you mean by 'stop'? Does it crash and give you an error message?
The program hangs at "cycle 9"; it does not exit and there is no error message. The screen shows the following:
-----------------------------------------------
cycle7.....................................................
cycle8......................................................
cycle9
------------------------------------------------
The program would finish once cycle 10 completed, but it does not get that far. I checked the thread information and found that it consumed about 10 GB of memory, which looks strange.
Old 05-27-2012, 11:51 PM   #4
gprakhar
Member
 
Location: India

Join Date: Aug 2010
Posts: 78
Default

Quote:
Originally Posted by flyyuan
I have tried to use RepeatMasker to analyze the repeat elements of the soybean genome; however, the percentage of masked sequence is only 30%, much less than the 60% reported in a published paper. Also, the README file for RepeatMasker records the repeat percentage for Hg18 as 35%, whereas other papers describe it as 50%. Could anyone tell me why there is such a difference?
The soybean community does have a custom repeat library for it. I got it from here: http://SoyTEdb.org
I use this command:
$ RepeatMasker -s -nolow -norna -no_is -gff -lib Glycine_max_TE_lib.fa <sequence-file.fa>

Quote:
Originally Posted by flyyuan
Another strange thing is that when I run RepeatMasker on chr10 or chr15 of Hg19 (I only tested these two chromosomes of Hg19), the program always stops at the output-processing stage (at cycle 9), but it works well on soybean chromosome sequences and the whole genome sequence.
Thanks in advance.
What exact command did you use for this analysis?


--
pg

Last edited by gprakhar; 05-27-2012 at 11:54 PM.
Old 05-28-2012, 07:16 AM   #5
rahularjun86
Member
 
Location: Frankfurt(M), Germany

Join Date: Jan 2011
Posts: 58
Default

Hi,
Have you tried RepeatScout (http://bix.ucsd.edu/repeatscout/)? It uses RepeatMasker in one of its 6 steps. I used it and got nice results.
Best wishes,
Rahul
__________________
Rahul Sharma,
Ph.D
Frankfurt am Main, Germany
Old 05-29-2012, 12:52 PM   #6
Artem
Junior Member
 
Location: Vancouver, BC

Join Date: May 2012
Posts: 6
Default

Repeats in the genome are not rigorously defined. The commonly quoted figure that repetitive elements make up 45% of the human genome comes from the original human genome paper in 2001, which used a relatively liberal definition of repeats (to account for sequence divergence).

RepeatMasker uses a slightly modified RepBase library to do its searching and, I believe, applies a cutoff of 30% divergence for annotation, which is on the conservative side. What you are getting, then, is an annotation of all elements that are within 30% divergence of the consensus sequence for that element.

The biological hypothesis, on the other hand (supported by some more recent papers looking for deep homology), is that repeat-derived elements make up somewhere on the order of 70% of bulk human DNA. The difference between 70% and 45% consists of sequence that is highly divergent and difficult to recognize, or whose common ancestor cannot be reconstructed. Further, as a repetitive sequence becomes older, the probability that it acquires an endogenous function increases, so some very old elements (and even some young ones) become necessary for an organism to survive. Syncytin-A in mice is the textbook example: an endogenous retroviral envelope gene was 'exapted' in early placenta formation to form the syncytiotrophoblast and is necessary for survival.

So, in answer to your question: if you want to increase the percentage of repeats masked, you can increase the divergence cutoff, but you risk masking informative endogenous sites. The cutoffs are what 'works', and if you want to do a detailed study of something like gene-regulation sites, it may even be beneficial to include repeats, as they can greatly affect gene expression.

I'm new to the repeats field, so if I made any mistakes above please forgive me, but that's the gist of it.

From the RepeatMasker Manual:

"The program can be run at three levels of speed or sensitivity. The only difference between these settings is the minimum match or word length in the initial (not quite) hashing step of the cross_match program (see the cross_match/phrap documentation). The "slow" setting will take about 3 times longer and will find and mask 0-5% more repetitive DNA sequences than the default setting. The "quick" settings miss 5-10% of the sequences masked by default, but will be 3 to 6 times faster. The alignments may extend more or be somewhat more accurate in the more sensitive settings as well.

At the sensitive settings RepeatMasker currently finds, on average, 47% of human genomic DNA to be derived from interspersed repeats. RepeatMasker is very sensitive in comparison with other programs, although comparison to some is skewed because of the use of much smaller databases."

Try using the slower settings.
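
To get a feel for how much a divergence cutoff like the one discussed above changes the masked total, the annotation table can be tabulated after the run. This is only a sketch under stated assumptions: it assumes the whitespace-separated .out table with the score in column 1, percent divergence in column 2, and query begin/end in columns 6 and 7; the function name `masked_bp_by_divergence` and the sample rows are invented for illustration. It filters hits that were already reported, so it cannot recover repeats the search never found.

```python
def masked_bp_by_divergence(out_lines, max_div):
    """Sum query bases covered by hits with percent divergence <= max_div.

    Overlapping hits are not merged, so the total can overcount slightly.
    """
    covered = 0
    for line in out_lines:
        fields = line.split()
        # Data rows start with an integer score; header and blank lines do not.
        if len(fields) < 11 or not fields[0].isdigit():
            continue
        divergence = float(fields[1])
        q_begin, q_end = int(fields[5]), int(fields[6])
        if divergence <= max_div:
            covered += q_end - q_begin + 1
    return covered


# Invented rows laid out like an annotation table (not real results).
rows = [
    "   SW   perc perc perc  query     position in query    matching  repeat",
    "score   div. del. ins.  sequence  begin end   (left)   repeat    class/family",
    "",
    " 1524 12.3 1.0 0.5 chr22 1001 1300 (5000) + AluY SINE/Alu 1 300 (11) 1",
    "  310 31.8 4.0 2.1 chr22 2001 2100 (4200) C L2a LINE/L2 (50) 3300 3201 2",
]
for cutoff in (20, 40):
    print(cutoff, masked_bp_by_divergence(rows, cutoff))
```

Dividing the returned count by the sequence length gives a percentage that can be compared across cutoffs.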

Last edited by Artem; 05-29-2012 at 12:55 PM. Reason: Addendum
Old 06-06-2012, 09:33 PM   #7
mege
Junior Member
 
Location: Marseille

Join Date: Jun 2012
Posts: 1
Default

I am repeating the second question of flyyuan, because I have the same problem and there is no answer to this so far.
I have used RepeatMasker for Drosophila, Caenorhabditis and Arabidopsis genomes without any problem. However, when I run RepeatMasker on the human genome with the 'nolow' option, it does not finish. There is no problem running it with the 'noint' option. With the 'nolow' option RepeatMasker gets through most of the jobs, but when processing repeats it blocks at cycle 9 (no error message, but nothing happens for 2-3 hours). I tried cutting the genome into smaller pieces (I suspected memory issues). It works fine for Chr19 (59 Mb) or ChrY, but not for Chr18 (78 Mb). So I cut all the chromosomes into 50 Mb pieces and tried to run RepeatMasker on only one fragment. This did not help; RepeatMasker still blocks at cycle 9. Do you have any idea where the problem comes from and how to fix it? The command used is as follows:
RepeatMasker -pa 4 -nolow -spec homo -dir temp input_file.fas
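
For anyone wanting to repeat the chunking test described above, here is a minimal sketch of cutting a sequence into fixed-size pieces. The function name `split_sequence` and the header naming scheme are assumptions for illustration, not anything RepeatMasker requires. Repeats that span a chunk boundary will be split, and as noted in this post, chunking did not resolve the cycle 9 hang.

```python
def split_sequence(name, sequence, chunk_size=50_000_000):
    """Yield (header, chunk) pairs of at most chunk_size bases.

    Headers carry 1-based start-end coordinates so that hits on a chunk
    can be mapped back to the original chromosome.
    """
    for start in range(0, len(sequence), chunk_size):
        chunk = sequence[start:start + chunk_size]
        yield f"{name}:{start + 1}-{start + len(chunk)}", chunk


# Toy example: a 10-base "chromosome" cut into 4-base chunks.
for header, chunk in split_sequence("chr18", "ACGTACGTAC", chunk_size=4):
    print(f">{header}\n{chunk}")
```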

Another question again relates to the 'nolow' option. I would like to know whether, with this option, RepeatMasker skips the detection of low-complexity regions and simple sequences altogether, or whether it first detects these regions, then TEs, and simply does not print the low-complexity and simple-sequence hits. In the first case I would expect false TE hits in low-complexity regions to be reported, while in the second this is less likely.
Old 06-06-2012, 09:52 PM   #8
mike.t
Member
 
Location: Spain

Join Date: Mar 2010
Posts: 36
Default

Have you tried contacting the developers directly?

It looks like the problem is in the Perl script ProcessRepeats. Setting $DEBUG = 1; in that file may provide you with some more helpful output.

This part of the script seems to be finding and removing hits that are nested within other hits. You might check your repeat library to ensure there are no duplicates or nested elements.
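
A rough way to run the library check suggested above is to look for identical entries and for entries wholly contained in a longer one. The sketch below is an assumption-laden illustration: `read_fasta` and `find_redundant` are hypothetical helper names, the sample library is invented, and only exact same-strand matches are detected, whereas real nested elements are usually diverged and would need an alignment-based comparison. The pairwise check is quadratic, so it is only practical for modest library sizes.

```python
from collections import defaultdict


def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, parts = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(parts)
            name, parts = line[1:].strip(), []
        elif line:
            parts.append(line.upper())
    if name is not None:
        yield name, "".join(parts)


def find_redundant(records):
    """Return (groups of identical entries, (inner, outer) containment pairs)."""
    records = list(records)
    by_seq = defaultdict(list)
    for name, seq in records:
        by_seq[seq].append(name)
    duplicates = [names for names in by_seq.values() if len(names) > 1]
    nested = [
        (short_name, long_name)
        for short_name, short_seq in records
        for long_name, long_seq in records
        if len(short_seq) < len(long_seq) and short_seq in long_seq
    ]
    return duplicates, nested


# Invented three-entry library: TE1 and TE2 are identical, TE3 sits inside both.
library = """>TE1#LTR/Gypsy
ACGTACGTTT
>TE2#LTR/Gypsy
ACGTACGTTT
>TE3#DNA/hAT
GTACG
""".splitlines()

duplicates, nested = find_redundant(read_fasta(library))
print("identical:", duplicates)
print("contained:", nested)
```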

You might try censor instead of RepeatMasker:
http://www.girinst.org/downloads/software/censor/
Old 07-08-2012, 03:03 PM   #9
A.N.Other
Member
 
Location: London, UK

Join Date: Feb 2012
Posts: 25
Default

I too am having the same issue as above, with RepeatMasker stopping at cycle 9 after identifying the elements in the input sequence. In fact, googling 'repeatmasker cycle 9' brought me to this thread.

I've had RepeatMasker quite happily do the entire mouse genome, but it won't manage any of the human chromosomes I've tried. After 2-3 days (over a weekend), the process is much as described above: ~10-12 GB RAM, 1-2% CPU usage, without progressing a single dot on the progress bar for cycle 9. This is using the latest downloads of RepeatMasker, TRF, RM-BLAST and the library from RepBase. The computer certainly isn't a limiting factor for my issues: 128 GB RAM, 64 CPU cores, etc.

Have either of you solved this issue, by any chance? I've set the debug flags to 1 for the RepeatMasker Perl script itself and for the ProcessRepeats script, but it doesn't give me anything to work with; nothing at all, in fact. I haven't tried to get more useful output from the scripts yet, but will next week.

I've emailed the developers, but I understand they've probably got other things on their minds, so if we can sort it out ourselves, it's probably better.
Old 07-09-2012, 03:37 AM   #10
A.N.Other
Member
 
Location: London, UK

Join Date: Feb 2012
Posts: 25
Default

Setting the debug flags for cycle 9 within the ProcessRepeats script doesn't shed any light on the issue for me. Execution stops after the following, using human chr22 for a fast run:

Code:
cycle 8 ..............................................................................
cycle 9 Cycle9: Considering: 1524 12.3 1.0 14.2 chr22                     0        0 30218631 + DELETE_ME#SINE/Alu        2      120      192   2 13072 b364s2i1       
Unhide Inserts:
 --vs-->1462 17.7 0.3 17.7 chr22                     0        0 20842511 + DELETE_ME#SINE/Alu        1      122      190   2 37457 b526s2i0       
 --vs-->1452 14.4 0.0 1.8 chr22                     0        0 11929977 + DELETE_ME#SINE/Alu      286      289       23   1 62701 b680s1i11
To avoid complication, settings were simply '-s -pa 16 -species "homo sapiens"'.
Old 09-11-2012, 11:40 AM   #11
rhubley
Member
 
Location: Washington

Join Date: Sep 2012
Posts: 10
Default

A user just pointed out this forum to us. We are sorry to hear that people have been having problems running RepeatMasker. Luckily, with this user's help we tracked down the problem and have a fix for everyone. We will be releasing a new version of RepeatMasker soon; until then, please download a new ProcessRepeats script from http://www.repeatmasker.org/ProcessRepeats.gz , place it in your RepeatMasker directory and uncompress it with "gunzip ProcessRepeats.gz". Please let us know if you have any further problems using the feedback page on our website: http://www.repeatmasker.org.
Old 01-20-2014, 01:20 AM   #12
amitbik
Member
 
Location: Italy

Join Date: May 2013
Posts: 50
Default Repeatmasker

I am trying RepeatScout to find plant repeats. There are 6 steps, and one of them uses RepeatMasker with the command

./RepeatMasker -s -lib your_repeats_filtered1.fasta yourgenome.fasta

However, on the command line it reports checking for E. coli insertion elements, not any plant repeats.
Do I have to make a custom library for this?

Thanks
Old 01-20-2014, 01:43 AM   #13
mike.t
Member
 
Location: Spain

Join Date: Mar 2010
Posts: 36
Default

The file your_repeats_filtered1.fasta IS your custom library. There is also a flag to enable/disable the E. coli insertion element check (the -no_is option used in the command earlier in this thread skips it), and I usually disable it.
Old 01-20-2014, 02:33 AM   #14
amitbik
Member
 
Location: Italy

Join Date: May 2013
Posts: 50
Default Repeatmasker

Thanks mike.t

Actually, I read it here: http://seqanswers.com/forums/archive...hp/t-5448.html

My your_repeats_filtered1.fasta file is derived from my genome.fasta file, processed through several steps.

So do I have to make a repeat file by merging all the files into one and then run it against my genome.fasta file?

Thanks.
Old 04-10-2014, 03:04 PM   #15
lwhitmore
Member
 
Location: washington

Join Date: Aug 2013
Posts: 70
Default

Hey everyone, I am new to RepeatMasker, so this may be a basic question, but does anyone know if this program can align sequences that contain IUPAC codes (such as R, K, and M)? I cannot seem to find a definitive answer anywhere.

Thanks for your help
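
The thread does not settle whether RepeatMasker handles IUPAC ambiguity codes, but it is easy to measure how much of an input is affected before running it. The following is a minimal sketch; the function name `count_ambiguity_codes` and the demo record are made up, and N is deliberately excluded from the counted set on the assumption that it is treated as an ordinary unknown base.

```python
from collections import Counter

# IUPAC nucleotide ambiguity codes other than N.
IUPAC_AMBIGUITY = set("RYSWKMBDHV")


def count_ambiguity_codes(fasta_lines):
    """Count IUPAC ambiguity codes (excluding N) across all sequence lines."""
    counts = Counter()
    for line in fasta_lines:
        if line.startswith(">"):
            continue  # header line
        counts.update(
            base for base in line.strip().upper() if base in IUPAC_AMBIGUITY
        )
    return counts


# Made-up record containing R, K and M in mixed case.
demo = [">read1", "ACGRTTKACM", "acgtnnrr"]
print(dict(count_ambiguity_codes(demo)))
```

If the counts are small, one option is to test a short sequence with and without the ambiguous positions and compare the annotations.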