SEQanswers

Old 10-02-2012, 10:39 AM   #21
sunhh
Member
 
Location: Ithaca, NY

Join Date: Jun 2012
Posts: 18
Default

Quote:
Originally Posted by mike.t View Post
I haven't run RepeatScout in a while so I'm afraid I can't help you. You may want to try another de novo repeat finding program. Try piler or RepeatModeler. piler usually works pretty well on fungi, although I am using the REPET pipeline these days.
Hi mike.t,
I am using RepeatModeler, but it has spent 151 hours in "Round-5" with a sample size of 81 Mb.
The program has been running for over a week, and I cannot estimate when it will finish.
Is that normal? My input is the soybean genome, which is ~973 Mb.

Thanks!
Old 10-02-2012, 05:07 PM   #22
DFJ111
Member
 
Location: Auckland

Join Date: Aug 2012
Posts: 20
Default

Quote:
Originally Posted by tnguyen View Post
Hi DFJ111 and mike.t,

I followed the suggestions from you both, and the repeat library was built successfully.

When I ran the first filter, the results said:

14184 deleted. 14185 saved. 111 skipped for length.

but the output file (contigs65fullQC2.filtered.fa.gt1k.fa.repeatscout.filter1) was empty.
Code:
cat /group/aquaculture/mussels/sequencing/MUSSEL1/repeatscout/contigs65fullQC2.filtered.fa.gt1k.fa.repeatscout | ./filter-stage-1.prl > contigs65fullQC2.filtered.fa.gt1k.fa.repeatscout.filter1
Do you have any idea why?

Thanks,
TN
If the problem is occurring when using
Code:
filter-stage-1.prl
check that TRF and nseg are properly installed and on your PATH. I had the same problem, but I can't remember how I solved it. It is solvable, though.
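A minimal sketch of such a PATH check, written in C to match the RepeatScout sources discussed later in this thread. It assumes a POSIX system and that the executables are literally named trf and nseg (local installs often use versioned names, so adjust as needed). on_path is a hypothetical helper, not part of RepeatScout.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* Returns 1 if a regular, executable file called `name` is found in one of
 * the colon-separated directories of `path_list`, otherwise 0. */
int on_path(const char *name, const char *path_list)
{
    const char *start = path_list;

    if (name == NULL || path_list == NULL)
        return 0;

    for (;;) {
        const char *end = strchr(start, ':');
        size_t dir_len = end ? (size_t)(end - start) : strlen(start);
        char candidate[4096];
        struct stat st;
        int n;

        /* An empty PATH entry conventionally means the current directory. */
        if (dir_len == 0)
            n = snprintf(candidate, sizeof candidate, "./%s", name);
        else
            n = snprintf(candidate, sizeof candidate, "%.*s/%s",
                         (int)dir_len, start, name);

        /* Accept only regular files that the current user may execute. */
        if (n > 0 && (size_t)n < sizeof candidate &&
            stat(candidate, &st) == 0 && S_ISREG(st.st_mode) &&
            access(candidate, X_OK) == 0)
            return 1;

        if (end == NULL)
            return 0;
        start = end + 1;
    }
}
```

Calling on_path("trf", getenv("PATH")) and on_path("nseg", getenv("PATH")) from your own main (getenv is declared in <stdlib.h>) should both return 1 if the tools are reachable the way the advice above expects.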

Last edited by DFJ111; 10-02-2012 at 05:11 PM.
Old 10-02-2012, 08:17 PM   #23
mike.t
Member
 
Location: Spain

Join Date: Mar 2010
Posts: 36
Default

Quote:
Originally Posted by sunhh View Post
Hi mike.t,
I am using RepeatModeler, but it has spent 151 hours in "Round-5" with a sample size of 81 Mb.
The program has been running for over a week, and I cannot estimate when it will finish.
Is that normal? My input is the soybean genome, which is ~973 Mb.

Thanks!
I believe the "Round-5" is coming from one of the programs RepeatModeler uses. It could be RepeatRunner or RECON; I don't remember. In any case, if you're using only 81 Mb, then this is not normal behavior.

There are repeats for soybean already in RepBase. Just run RepeatMasker with them. If you suspect that there are new repeats in your genome that aren't in RepBase, then run RepeatModeler or some other program on the masked genome made by RepeatMasker. It will be a lot faster and possibly won't hang.
Old 10-03-2012, 05:31 AM   #24
sunhh
Member
 
Location: Ithaca, NY

Join Date: Jun 2012
Posts: 18
Default

Quote:
Originally Posted by mike.t View Post
I believe the "Round-5" is coming from one of the programs RepeatModeler uses. It could be RepeatRunner or RECON; I don't remember. In any case, if you're using only 81 Mb, then this is not normal behavior.

There are repeats for soybean already in RepBase. Just run RepeatMasker with them. If you suspect that there are new repeats in your genome that aren't in RepBase, then run RepeatModeler or some other program on the masked genome made by RepeatMasker. It will be a lot faster and possibly won't hang.
Thanks, mike.t.
Yes, I read the script, and I think these "rmblastn" runs are launched for RECON. Although my input genome is ~973 Mb, the sample size in this round is only 82500254 bp. I think the main problem is rmblastn: most of the time it uses only 1 CPU instead of the 20 I assigned with "-num_threads 20"!
I have to run a de novo search because I will have other genomes to deal with. Also, I did run RepeatMasker on the soybean genome, but it found only 24.93% LTR elements, whereas the genome paper reports that LTR elements cover 41.99%.
Old 10-03-2012, 06:01 AM   #25
sunhh
Member
 
Location: Ithaca, NY

Join Date: Jun 2012
Posts: 18
Default

I found another thread on SEQanswers where someone had a problem similar to mine.
His BLAST+ alignment always dropped to 1 thread no matter what value he gave "-num_threads".
Someone said this is because the query sequences are too short (only the word-matching step is multithreaded), but in my case a batch sequence in RepeatModeler (for RECON) is 40 kb. Is that still not large enough?
Old 11-27-2012, 07:32 PM   #26
Lyn Hsiong
Member
 
Location: China

Join Date: Sep 2011
Posts: 14
Default

Quote:
Originally Posted by sunhh View Post
Thanks, mike.t.
Yes, I read the script, and I think these "rmblastn" runs are launched for RECON. Although my input genome is ~973 Mb, the sample size in this round is only 82500254 bp. I think the main problem is rmblastn: most of the time it uses only 1 CPU instead of the 20 I assigned with "-num_threads 20"!
I have to run a de novo search because I will have other genomes to deal with. Also, I did run RepeatMasker on the soybean genome, but it found only 24.93% LTR elements, whereas the genome paper reports that LTR elements cover 41.99%.
Hi, my RepeatModeler run is very slow too, and the input genome is 300 Mb. Maybe ABBlast also uses only 1 CPU by default, so I tried to assign 10 with "-num_threads 10" as you did; however, RepeatModeler has no such option. Could you please tell me how to set this parameter in RepeatModeler/ABBlast?
Thank you very much!
Lyn
Old 12-08-2012, 08:01 AM   #27
sunhh
Member
 
Location: Ithaca, NY

Join Date: Jun 2012
Posts: 18
Default

Quote:
Originally Posted by Lyn Hsiong View Post
Hi, my RepeatModeler run is very slow too, and the input genome is 300 Mb. Maybe ABBlast also uses only 1 CPU by default, so I tried to assign 10 with "-num_threads 10" as you did; however, RepeatModeler has no such option. Could you please tell me how to set this parameter in RepeatModeler/ABBlast?
Thank you very much!
Lyn
Changing the threads value for rmblast helps little, but you can do it in the .pm file (you can find that file by running grep for "threads" on the .pm files). Otherwise just wait: in less than two weeks you will get the final result.
Good luck!
Old 12-11-2012, 12:38 AM   #28
Lyn Hsiong
Member
 
Location: China

Join Date: Sep 2011
Posts: 14
Default

Quote:
Originally Posted by sunhh View Post
Changing the threads value for rmblast helps little, but you can do it in the .pm file (you can find that file by running grep for "threads" on the .pm files). Otherwise just wait: in less than two weeks you will get the final result.
Good luck!
Thank you very much! But I don't know how to deal with the .pm file (I suppose you meant the file "RepModelConfig.pm"). That file only contains the paths of the pre-installed programs (perl, RECON, RepeatMasker and so on), so where can I set the threads value? And could you please tell me what "grep threads" means exactly? Thank you!
Old 01-22-2014, 09:45 PM   #29
amitbik
Member
 
Location: India

Join Date: May 2013
Posts: 50
Default Repeatmodeler error in building database

I have installed RepeatModeler, but when I build a database with

./BuildDatabase -name test test.fa

it shows the error below, and the RepModelConfig.pm file is empty:

RepModelConfig.pm did not return a true value at ./BuildDatabase line 146.
BEGIN failed--compilation aborted at ./BuildDatabase line 146.

Can anyone help me find the cause of this error?

Thanks.

Last edited by amitbik; 01-22-2014 at 10:18 PM.
Old 01-22-2014, 10:00 PM   #30
amitbik
Member
 
Location: India

Join Date: May 2013
Posts: 50
Default

Hi DFJ111,

I followed your steps and it worked fine, but in the .tbl file I am getting this output:

file name: file.fa
sequences:        336145
total length:  330872632 bp  (330872632 bp excl N/X-runs)
GC level:          39.43 %
bases masked:  199587278 bp ( 60.32 %)
==================================================
               number of       length   percentage
               elements*     occupied  of sequence
--------------------------------------------------
SINEs:                 0            0 bp    0.00 %
   ALUs                0            0 bp    0.00 %
   MIRs                0            0 bp    0.00 %

LINEs:                 0            0 bp    0.00 %
   LINE1               0            0 bp    0.00 %
   LINE2               0            0 bp    0.00 %
   L3/CR1              0            0 bp    0.00 %

LTR elements:          0            0 bp    0.00 %
   ERVL                0            0 bp    0.00 %
   ERVL-MaLRs          0            0 bp    0.00 %
   ERV_classI          0            0 bp    0.00 %
   ERV_classII         0            0 bp    0.00 %

DNA elements:          0            0 bp    0.00 %
   hAT-Charlie         0            0 bp    0.00 %
   TcMar-Tigger        0            0 bp    0.00 %

Unclassified:     866174    216405375 bp   65.40 %

Total interspersed repeats: 216405375 bp   65.40 %


Small RNA:             0            0 bp    0.00 %

Satellites:            0            0 bp    0.00 %
Simple repeats:    51195      2109015 bp    0.64 %
Low complexity:        0            0 bp    0.00 %
==================================================

* most repeats fragmented by insertions or deletions
  have been counted as one element


The query species was assumed to be homo
RepeatMasker version open-4.0.3 , sensitive mode

run with rmblastn version 2.2.27+
The query was compared to unclassified sequences in ".../repeats_1.fa"
RepBase Update 20130422, RM database version 20130422

Can you tell me why most of the output shows 0?

Thanks in advance...
Old 02-05-2014, 12:34 AM   #31
amitbik
Member
 
Location: India

Join Date: May 2013
Posts: 50
Default

Please help me, guys. Any reply would be appreciated.
Old 02-05-2014, 03:16 AM   #32
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,794
Default

It may be a good idea to try a subset of your data (select a few large contigs and/or a known sequence with the right repeats) before you start running a large genome file through some of these tools. Depending on the size of the data set, run times can increase dramatically.
Old 02-05-2014, 03:27 AM   #33
amitbik
Member
 
Location: India

Join Date: May 2013
Posts: 50
Default

Thank you, GenoMax.

I did that and got the result. I have one more problem: I have installed RepeatModeler, but building a database fails with an error.

./BuildDatabase -name test test.fa

RepModelConfig.pm did not return a true value at ./BuildDatabase line 146.
BEGIN failed--compilation aborted at ./BuildDatabase line 146.

Can you tell me why this error occurs?
Old 04-16-2014, 10:24 PM   #34
sunnyseq
Junior Member
 
Location: germany

Join Date: Apr 2014
Posts: 1
Default

Quote:
Originally Posted by tnguyen View Post
Hi Rahul,

How large was your genome? How much memory was needed for your run? I received this error message at the start of Step 2:

"Could not allocate space for sequence"
Please change the code in build_repeat_families.c

    sequence = (char *) malloc( (2 * MAXLENGTH + 3 * PADLENGTH) * sizeof(char) );
    if( NULL == sequence ) {
        fprintf(stderr, "Could not allocate space for sequence\n");
        exit(1);
    }

to

    sequence = (char *) malloc( (2 * (size_t)MAXLENGTH + 3 * (size_t)PADLENGTH) * sizeof(char) );
    if( NULL == sequence ) {
        fprintf(stderr, "Could not allocate space for sequence\n");
        exit(1);
    }

Otherwise the size calculation for big inputs (files larger than about 1 GB) is not correct and results in a much, much bigger memory allocation request than necessary. I have seen this under FreeBSD, Linux and Solaris. The change helped me get past the allocation error. It is currently running under FreeBSD :-)
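To illustrate why the cast matters, here is a small sketch of the arithmetic. It assumes MAXLENGTH and PADLENGTH are 32-bit ints in the original source (an assumption on my part), and the function names are hypothetical. Signed overflow is undefined behaviour in C, so the sketch models the 32-bit result with uint32_t, whose wrap-around is well defined, and then shows what a 64-bit malloc would receive if that bit pattern were a negative int.

```c
#include <stddef.h>
#include <stdint.h>

/* 2*maxlength + 3*padlength carried out in 32 bits, modelling the
 * bit pattern that 32-bit int arithmetic typically produces. */
uint32_t wrapped_32bit(uint32_t maxlength, uint32_t padlength)
{
    return (uint32_t)(2u * maxlength + 3u * padlength);
}

/* Reads a 32-bit pattern as a signed int and converts it to size_t,
 * which is what happens when a negative int is passed to malloc. */
size_t malloc_arg_from_int32(uint32_t bits)
{
    int64_t as_int = (bits > (uint32_t)INT32_MAX)
                         ? (int64_t)bits - 4294967296LL
                         : (int64_t)bits;
    return (size_t)as_int; /* negative values wrap to just below SIZE_MAX+1 */
}

/* The same formula with the size_t casts from the patch above. */
size_t bytes_as_size_t(int maxlength, int padlength)
{
    return 2 * (size_t)maxlength + 3 * (size_t)padlength;
}
```

For example, with a hypothetical maxlength of 1500000000 and padlength of 11000, the sum is 3000033000, which is above INT_MAX (2147483647). Read as a 32-bit int that pattern is negative, so malloc_arg_from_int32 returns a value close to the top of the size_t range on a 64-bit system, while bytes_as_size_t returns the intended 3000033000.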

Cheers, sunnyseq
Old 07-27-2014, 03:23 AM   #35
solidether
Junior Member
 
Location: Austria

Join Date: Jul 2014
Posts: 2
Default

Hi guys, I still have the same problem that people in this list previously had.

I followed the suggestions above and here is my command for running the step 2 of the RepeatScout:

RepeatScout
-sequence genome.fasta
-output genome_repeat.fasta
-freq genome.freq
-l 14

I get this error : "Could not allocate space for sequence" .

I ran the test file and it runs, so the installation is not the problem. I did notice that the genome.fasta file in the test contains only one consensus FASTA sequence, whereas my genome.fasta is an assembly containing multiple contigs (still in FASTA format). I should also add that the machine has plenty of memory, so I doubt that is the problem.

Does anybody have a suggestion?

Thanks a lot, Solidether
Old 09-18-2014, 04:52 AM   #36
mke
Junior Member
 
Location: The Netherlands

Join Date: Feb 2013
Posts: 1
Default

Quote:
Originally Posted by solidether View Post
Hi guys, I still have the same problem that people in this list previously had.

I followed the suggestions above and here is my command for running the step 2 of the RepeatScout:

RepeatScout
-sequence genome.fasta
-output genome_repeat.fasta
-freq genome.freq
-l 14

I get this error : "Could not allocate space for sequence" .

I ran the test file and it runs, so the installation is not the problem. I did notice that the genome.fasta file in the test contains only one consensus FASTA sequence, whereas my genome.fasta is an assembly containing multiple contigs (still in FASTA format). I should also add that the machine has plenty of memory, so I doubt that is the problem.

Does anybody have a suggestion?

Thanks a lot, Solidether
I have the same experience. It happens with genomes bigger than roughly 2 GB. The problem, I guess, is with the allocation within RepeatScout itself. You can give it as much RAM as you want, but I think one of the variables is wrongly declared, so it cannot hold a larger value. So I guess it's a bug.
Old 09-30-2014, 01:02 AM   #37
solidether
Junior Member
 
Location: Austria

Join Date: Jul 2014
Posts: 2
Default The error message "Could not allocate space for sequence"

The error message "Could not allocate space for sequence":
The cause of this error is in the RepeatScout software itself.

In the source file "build_repeat_families.c" there are two places where memory is allocated with the call:
malloc( (2 * MAXLENGTH + 3 * PADLENGTH) * sizeof(char) )

This call tries to allocate an amount of memory based on the size of your input file. However, for some reason the allocation fails when the input file is larger than 2 GB.

I don't know enough about C programming to say why there is this 2 GB limit. Anyhow, for testing purposes I created a modified RepeatScout version (RepeatScout_fixmem) in which the allocation is always 5 GB: malloc( 5000000000 ).

After these modifications I was able to run the repeatscout analysis.
Old 09-30-2014, 09:20 AM   #38
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

It's probably because (2 * MAXLENGTH + 3 * PADLENGTH) is evaluated as a signed int before it is multiplied by sizeof(char), so it overflows for large inputs. I suspect casting the terms to 64-bit integers would work.
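A rough sketch of what that could look like, with the size computed in 64-bit arithmetic and checked before malloc is called. It assumes the two quantities are plain ints, as the quoted malloc line suggests; sequence_buffer_size and alloc_sequence are hypothetical names, not functions from RepeatScout.

```c
#include <stdint.h>
#include <stdlib.h>

/* Computes 2*maxlength + 3*padlength in 64-bit unsigned arithmetic.
 * Returns 1 and stores the byte count in *out on success; returns 0 if
 * an input is negative or the total does not fit in size_t. */
int sequence_buffer_size(int maxlength, int padlength, size_t *out)
{
    uint64_t total;

    if (maxlength < 0 || padlength < 0 || out == NULL)
        return 0;

    /* Two ints can contribute at most 5 * 2^31, so this cannot overflow. */
    total = 2 * (uint64_t)maxlength + 3 * (uint64_t)padlength;
    if (total > SIZE_MAX)
        return 0;

    *out = (size_t)total;
    return 1;
}

/* Allocates the buffer, or returns NULL if the size is invalid or
 * malloc itself fails. The caller frees the result. */
char *alloc_sequence(int maxlength, int padlength)
{
    size_t nbytes;

    if (!sequence_buffer_size(maxlength, padlength, &nbytes))
        return NULL;
    return malloc(nbytes);
}
```

Separating the size calculation from the allocation makes it possible to report "size overflow" and "out of memory" as different errors, which the single message in the original code cannot do.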
Old 11-05-2014, 07:35 PM   #39
bryantd
Junior Member
 
Location: San Diego

Join Date: Nov 2014
Posts: 1
Default

Quote:
Originally Posted by solidether View Post
The error message "Could not allocate space for sequence":
The cause of this error is in the RepeatScout software itself.

In the source file "build_repeat_families.c" there are two places where memory is allocated with the call:
malloc( (2 * MAXLENGTH + 3 * PADLENGTH) * sizeof(char) )

This call tries to allocate an amount of memory based on the size of your input file. However, for some reason the allocation fails when the input file is larger than 2 GB.

I don't know enough about C programming to say why there is this 2 GB limit. Anyhow, for testing purposes I created a modified RepeatScout version (RepeatScout_fixmem) in which the allocation is always 5 GB: malloc( 5000000000 ).

After these modifications I was able to run the repeatscout analysis.
I've changed three instances of this allocation, two in build_repeat_families.c and one in build_lmer_table. While I no longer see the allocation error, build_lmer_table finishes almost immediately, with:

Done allocating headptr
Done building headptr
There are 0 l-mers
Done sorting headptr
OOPS no good lmers

Any ideas?
Old 07-17-2018, 03:12 AM   #40
bioinfo441
Junior Member
 
Location: Malaysia

Join Date: Jul 2018
Posts: 3
Default

Hello everyone, I get an error when I run the second RepeatScout command. If anyone has an idea, please share.

Quote:
$ ./RepeatScout -sequence Ca_dromedarius_kacst.fna -output output_repeats -freq output -l 14

RepeatScout(9531,0x7fff9faf2380) malloc: *** mach_vm_map(size=18446744073479073792) failed (error code=3)
*** error: can't allocate region
*** set a breakpoint in malloc_error_break to debug
Could not allocate space for sequence
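
A note on reading that number: 18446744073479073792 is just below 2^64 = 18446744073709551616, which is how a negative signed value looks once it has been converted to a 64-bit unsigned size. The sketch below works through the arithmetic. The helper names are hypothetical, and the allocator may have rounded the request to a page boundary, so the recovered value is approximate.

```c
#include <stdint.h>

/* Returns 2^64 - size, i.e. the magnitude of the negative value that
 * converts to `size`, using well-defined unsigned wrap-around. */
uint64_t negative_magnitude(uint64_t size)
{
    return (uint64_t)0 - size;
}

/* Returns the low 32 bits of the negative value -magnitude, i.e. the
 * unsigned result that 32-bit arithmetic would have held. */
uint32_t low_32_bits_of_negative(uint64_t magnitude)
{
    return (uint32_t)((uint64_t)0 - magnitude);
}
```

Here negative_magnitude(18446744073479073792ULL) is 230477824, so the request was probably a negative int near -230477824, and low_32_bits_of_negative(230477824) is 4064489472. That looks like the same overflow of 2 * MAXLENGTH + 3 * PADLENGTH discussed in posts #34 and #38, so the size_t cast from post #34 may be worth trying.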
All times are GMT -8. The time now is 10:50 AM.

