SEQanswers

Old 11-13-2015, 05:49 PM   #281
Bioinform
Member
 
Location: US

Join Date: May 2013
Posts: 17

Thanks, Brian. Have a great weekend.
Old 11-13-2015, 05:54 PM   #282
Bioinform
Member
 
Location: US

Join Date: May 2013
Posts: 17

Are you referring to particular lines? My FASTQ file does have the + symbol.

Here is the start of my FASTQ file:

@DJB77P1:527:H7RDTADXX:1:1101:2830:2226 1:N:0:
AATTTCCGTCACCCTTTTAAGTCCCCCAGGCGGGGGGCTCGACGAGAGCGACAGACCTTGTGTGTAGAAGTTTCAAAATGCTTATGCATCAAGAGACAGTGCCCTGCCCGAAGATATTTACATTCGGTGTGCCTTGGGCGTATA
+
FFHHHHHJJJJJJJJJJJJICHIIJJIJJIJJJJDDBDDDDBDDBBDDDDDDDDDDDDDDCC??BCCDDDACDDEDDDDDDDEDDDDEECDDDDDDCDDCCDDDCBDBDD<@B<ACDEEDEDDEDED?BBDDDC@>CBBDBB@C
@DJB77P1:527:H7RDTADXX:1:1101:2830:2226 3:N:0:
AGCTCGCCTTACGATGCTCGCACACGAAGACCGCAGAATTAACCAATAACTCCCTGCCTCAAGCCTTGGGCGTATACGCCCAAGGCACACCGAATGTAAATATCTTCGGGCAGGGCACTGTCTCTTGATGCATAAGCATTTTGAAACT
+
=DFFFFHHHHHJJJJJIJJJJJJJJJJJGIJJJJJJJJHHHHHHFFEFFDDEEDEDDDDDDDDDDDDCCDDDDDDDEDDDDDDDDDDDDDDDDDDDCDEEEEDDEEEBDDDDDDDDDDDDDDDDDDDDCDDDDDDDCDDDDECDDDDD
@DJB77P1:527:H7RDTADXX:1:1101:4220:2204 1:N:0:
CGACGCCTGCATAAGGGCTTTCGCTGATTGCCGCGCCCAGGCCCAGGCTGAAATCGCCCGGTTCAGCGTGGGCGAACGGCAACGACAGGGCAGGCAAAGCAACAACGGTCAGTGCGGGAAGGAAAAAGTGTTTCATGACGGCGG
+
DDHHGHGJIJJJJJHIJJJJJJIIIJJJEIIIGIEIJHHEFDFDDDDDDDDDDDDDDDD@B5;<ACC9@BD@?BD<BDDDBDBDBBDBDB<?BBDDBDCDDDBDDDDD@@BD@ACDDDDDDBDDDDDD:::@CC@C@CDBBBBD
@DJB77P1:527:H7RDTADXX:1:1101:4220:2204 3:N:0:
AGCTTATACACGCCAGAAAGATCACTCAGAGAGCCGCCGTCATGAAACACTTTTTCCTTCCCGCAC
+
=DDDFFHHHGGIIIIJJJJJIJIIIIIJFIIIJJJIJIJIFGIJIJJGGIHGHGHGFFFEFFDDDD
@DJB77P1:527:H7RDTADXX:1:1101:4290:2210 1:N:0:
CTGTACTCGATCGGCAAGGATTCGGCCGTGATGCTGCACCTGGCGCGCAAGGCTTTCTTCCCCGGCAAACTGCCATTCCCTGTGATGCATGTCGATACCCGCTGGAAATTCCAGGAGATGTATCGCTTTCGCGACCAGATGGTC
+
FFHFDHHJIBGIGIGIJJJIJJIIJIGIDGBHIIIGEAHIGJGGAHFDDD?BDDDC>CACCDCDDDBBDDDDDDDCCADDDDCDEECDDEC@CAB?ABDABD<BBDCCDCDDDCDBDBCCC>@BBB@D0><9<>BDDD8@ACCC
@DJB77P1:527:H7RDTADXX:1:1101:4290:2210 3:N:0:
TCGACATGCATCACAGGGAATGGCAGTTTGCCGGGGAAGAAAGCCTTGCGCGCCAGGTGCAGCATCACGGCCGAATCCT
+
?BDA@BBCCDEDDDDDDDDDDDDDDDDDEDDDDDDDBBDDDDDDDDDDDDDDDDDDD:A>CCABDDDDDDDDDDDDDCC
@DJB77P1:527:H7RDTADXX:1:1101:6926:2244 1:N:0:
ATATCGTATTTCCAGAGTGCCACTGGCTGATAAAGAAAGAGGCCGGGCATATGCTGATTCTGGCTATTTACCTTGACGCTCGCAGCAGCGATATCGACGGATCAGGGTTTAAATCATCCTCGTATTTAACTTTGGGAAAACAAC
+
FFGHHHHJJJJJJJJJJFHIJJJJJIJJJIJJJIIIJJGIIJIJIJJJIIJJJJIHHHHGHHFFFFFEEEEEEEDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDEDDDDDDDD<BDDEDEDDEDDCDDDDDDDDD
@DJB77P1:527:H7RDTADXX:1:1101:6926:2244 3:N:0:
GGAGCTACCTTCCTGAACCAGGCGCATCACGTGTGTTTTTGACACCGATAGTTGTTTTCCCAAAGTTAAATACGAGGATGATTTAAACCCTGATCCGTCGATATCGCTGCTGCGAGCGTCAAGGTAAATAGCCAGAATCAGCATATGCCC
+
CCCFFFFFGHHHHJJJJJJIGIJJJJIIJJJHHGGHIJJJJJJJIIJGIIFHHHHFFFFFDEEEDEDDEDDECDDDDDDDDEEEDEDDDDDDDDDDDBDDDDBDDDDDDDDDDDDDDDBDDDDD>>CC@DACDDDDDDDDDDDCCDECDD
@DJB77P1:527:H7RDTADXX:1:1101:9747:2206 1:N:0:
TGCTCGTCAGGTACTCGCTGAGGTTGTACAACTGCGCGATGTCGTTGCCATCGGCCATCAACACGCTACTGCGTACGTAGCCGCCGTTGAGCCACTCAAAGCTGTCGTTACCGAAGCTTGCGCGTATTTCTCCACGAACTTCGC
+
FFHHHHHJJIJIHIJJJJJJJJJJJJJJIJJJJJGJJJJJJIJIJHEHHHFFFDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDBDDDDDDBDDDBDDDEEEDDDDDDDDDDD?B
@DJB77P1:527:H7RDTADXX:1:1101:9747:2206 3:N:0:
AACGTCTTCGAGCTTTCTGGCGGCAGCATTATCAACAATCTGGTCGCCGGGTTTGGCCGGGACACGATTACAGTGTCGG
+
8@<@<ACBDDDDBDDDDDCDCDDBDDDDCDCCCDDCCDC?ACC@CBDDDDDBBD?B>CDDD>BDBD3<?<@AC>>@@BB
@DJB77P1:527:H7RDTADXX:1:1101:12008:2199 1:N:0:
CACTGCGGCCGCCTGCGATGAAGTTCTGGCCGAGAATGTTAGGCATCGGTATCTCCTGTTGTTTGATGTTGGAGGCGGGCGCCGAAGGGATCCGGACCGTGCGCTCCTTATGGGTTGCCGAGTTGCTTTGCGGGGTGGCACTGG
+
FFHHHHHJJJJJJJJJJHJJJJJJJJJJJJJJHJIHHHHHHFFFFFEDDDDDDFDDDDDDDDDDDCDDDDDDCDDDDCDBBDDB9B9BDB<<C<B>@@.5?B>@B>@CC>C>CD3<9<A9<B@BCA>ACAC<>BD9BD@A@<C@
@DJB77P1:527:H7RDTADXX:1:1101:12008:2199 3:N:0:
AGCTTAAACAGCAGTGGCGACATAAAGGCTAAACGCTGTCTGGCAGTTGCGTCAGCAGCGAATCAA
+
=DDFFFHHHHHJIIIJJJJJJJJJJJJJGIIJJJJIIJGIGIIIEIIIIJJJHFHFFFFDDDDDDD
@DJB77P1:527:H7RDTADXX:1:1101:14057:2199 1:N:0:
AGCCGGAAATAAGCAACGTCCGATCAATCCTGTCACAGGGTTTGGTGGCAGTGATCGCGCTGGCTCTCGATGGTGAAGGTTCAATGAGGCATTTGAAGATCAAGAGCCTTTCCCTTACCGGACATGACGGACCCGGTATACGCG
+
FFHHHHHJIJJJJJJJJFHIIJIJJJJJJJJJJJJJJJJJHIJJJIHHHHFFFFFEDDDDDDDDDDDDDDDEDACDDDDCCDDEEDDDDDDDDDEEDDDDDDDDDDDDDDDDDDDDDDDDDDBDDDDDDDDDDDDDBDDEEDDD
@DJB77P1:527:H7RDTADXX:1:1101:14057:2199 3:N:0:
AGCTATTGATATTCCAAATCTTGGGTAAGCATCACGATCTGATTGGATCGGATCTTCTCGCGTATA
+
=DDFFFHHHHHJJJJJJJJJJJJJJHHIJJJJJ@GIJIJJJIJJJJIJHIJJJJIJJJIIJJHHFF
@DJB77P1:527:H7RDTADXX:1:1101:14176:2213 1:N:0:
CCTTGGCTTCGGCGTGCGCCGAGAAGGCGGGGATCATCAAGGCCAGAAAAATGGCATGAAATTTCATCGGTGCTCCTGCGGATGAATATGACTTCATTGGATCCAAAATTTACTCCAGTCGCGTGCCTCTGACTAGCGGGTGCC
+
FFHHHHHJJJJJJJIJJJJJJJJJJJJJJHDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDEEEDDBBDDDDDDDDDBDDDDEEEEEDDDDDEEDDDDDDDDDDDDDDEDDDDDDDDDDDDDDDDDDDDDCCDDDDBBBD
@DJB77P1:527:H7RDTADXX:1:1101:14176:2213 3:N:0:
AGCTAGAACAACAGCGATTGCCGGAATCAACCCGCGCCAGGCAAGGTCAGCCAACATGACAACCCGACGAAAGGCCACTGGCACCCGCTAGTCAGAGGCACGCGACTGGAGTAAATTTTGGATCCAATGAAGTCATATTCATCCGCAG
+
=DDFFFHHHHHJJIJIJJJJJJJJJJJJJJJJJJJJJJJJHHHHEF@DFEEEEEDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDCDEDDDDDBDDDDDDDDDDD>CCDDEDDDDDDDDDDDDDDDCDEEEEEEEEEDDDDC
@DJB77P1:527:H7RDTADXX:1:1101:15053:2181 1:N:0:
CAGCTCAGGAGGATTTTCCCGCGTGTCGGCCACAGCAAAGGACACGCCCCGGTTCGCCAGGAAGCGAACCAGGGACATGCCACTCTTGCCAAGCCCCACAACAATGCGGAAGCGGTCGGAAACGATCAGGGACACTCGTTCTAC
+
FFHHHHHJJJJJJJJJJJJJIJJJJJJJJJJJJJJJJJJIJJJHHHFFDDBD?BDDDDDDDDDDDDDDDDDBDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDCDDDDDDDDBDDDBDDDDDDDDDDDDDDDCDDDDDDED
@DJB77P1:527:H7RDTADXX:1:1101:15053:2181 3:N:0:
CAGTGGTCATTCAGGTTGCGTCTTTCAA
+
<@@?@?@?@???????????????????
@DJB77P1:527:H7RDTADXX:1:1101:15413:2202 1:N:0:
GTGCAATTGCCGACCTGCTGGTGGTGGACGGCGACCCGCTGAGCGATATCAGCTGCCTGGTGGGCCAGGGCGAGCAGTTGGCGATGATTGTTCAGGGTGGTCATGTGCACAAGAACACCCTGGCCTGATCAGCCGTCAGTCGGC
+
FDHGHHHIIIJJIJJIJJIJGFHIDHIHIIJJIJJHFEDDD@DDDDDDDDEEDDDDCDDD@CDDDDDDDDBDDDDDDCCDDDBBDDDEEDDDDEDDDCDD>CCDDCDDEDDDDDDDDDDDDDDDDDDDDDEDDDDBBBDDDBBB
@DJB77P1:527:H7RDTADXX:1:1101:15413:2202 3:N:0:
CAGGCCAGGGTGTTCTTGTGCACATGACCACCCTGAACAATCATCGCCAACTGCTCGCCCTGGCCCACCAGGCAGCTGA
+
?BDDDDDDDD<?@DDEDDDDDEDDDDDDDDBDDDCDDDCCDCDDDDBDDDBDCDDDBBBDBBDDDDDDDDBBDDDDDCC
@DJB77P1:527:H7RDTADXX:1:1101:15955:2211 1:N:0:
GTGGCTTCCGTTCCGCCTCAAGGTGGTCGCAAGCATCCGCAGCAGGAATTCCTGCAGGTGGACACGCGCAACATCCTCTTCATATGCGGTGGCGCCTTCTCGGGCTTGGAAAAGGTCATTCAGAACCGTTCCACCCGTGGCGGC
+
FFHHGHHIGIIIJHIIJIJJJIICGGHIIIJIIIIJJJJJJIGIFHGHFFEFFEEEEEEDDDDDDDDDDDDDBDCDDDDDDEEEEEDDDBDBDDB>BDDDDDDDDDDDDDDCCDD@CDDEEECDDDDDBBDDDDDDD<@BDBDD
@DJB77P1:527:H7RDTADXX:1:1101:15955:2211 3:N:0:
AGCTCTCCTTGCTGCGAACTTCGGCGTTGAAACCGATGCCGCCACGGGTGGAACGGTTCTGAATGACCTTTTCCAAGCCCGAGAAGGCGCCACCGCATATGAAGAGGATGTTGCGCGTGTCCACCTGCAGGAATTCCTGCTGCGGATG
+
=DDFFEHFHHHJJJJJGHIIJIIIJJJJJIIJIJJIGIJJJHFFFDDD9?B?CCBD?BBCDDDDDDDDDDDDDCEDDDDDDDDDDDDDDDDDDDDDDDDACDEDABDBDDDDDCBDDDDDDDDDDDDDDDDBDDDDDDDDDDDDDDDD
@DJB77P1:527:H7RDTADXX:1:1101:16246:2219 1:N:0:
CTATATAGACCTGCGCTCTTGGCCAAGTCGATGCCTTGCCGCTGTCGCTCTCGCCGGTCCTCGAAGTCGTCACGCGCTATTTGCAGTGCCACACGCAACAGCATGTCTTGTACGCCTTGCAACACCACCTTGGCGACGCCACTG
+
FFHHHHHJJJJJJJJJJJJJIIIJJJJJJJJJJJJJJJJJJJJJJJJJJJJHHHFDDDDDDDDDDDDDDDDDDDDDDDDDDEDDEDDDDDDDDDDDDDDDDDDDDDDEDDDDEDDDDDDDDDDDDDDDDDDCDCDDDDDDDDCC
@DJB77P1:527:H7RDTADXX:1:1101:16246:2219 3:N:0:
GGAGCTGTGAAGTCGTCATCGCTGAGAAGATCGACCGAATCAGCCGTCTTCCCTTGGTCGAAGCAGAAAGGCTTGTGGACGCGATCAAGGCTAAGGGCGCACGCTTGGCGGTGCCAGGCATCGTCGATTTATCGGAATTGGCCGAGGCAT
+
CC@FFFFFHHHHHIJJJJJIJJJJJJJJJJJJJJJJJJJJIJJJJJJJJHGHHHFFFFFDDDDDDDDDDDDDDDDBDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDBDDDBDDDDDDDDDCDBDD>BDDCDCBD?B?CC@CDD<BBDBC
@DJB77P1:527:H7RDTADXX:1:1101:18918:2216 1:N:0:
GGGCCGTCCGGGGCAGGAAAAGAAGTCATCGCCCAGCGCTTGCATCGGCTTGGGGTGAACCCGCTCCATCCCTTTGTGGATATCAACTGCGCTGCATTGCCAGCACACCTCATTGAAGCAGAGCTATTCGGTCATAGCCGAGGT
+
FFHHHHDHHHIJJJJJJIIJJJJJJGIJJJJJJJHHHFFCDDCCDDDDDDDDDDD)5?C@BD5955;::>AB:ACC@?B:CDDCDCC:>C<><<>>CDACC>@?9???BCA:>@:4@>AC9?8AACCC>9A8<5>B@CC959@9
Old 11-13-2015, 05:59 PM   #283
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,079

That offending sequence may be somewhere around #32. Look further down in the file. Otherwise use BBDuk to retrim your data as suggested by Brian to ensure there are no zero length reads.
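
For anyone who would rather locate such a record with a script than by eye, here is a minimal Python sketch. It assumes strict four-line FASTQ records (no wrapped sequence lines), and the function name find_bad_fastq_records is made up for this example; it is not part of BBTools.

```python
import io


def find_bad_fastq_records(handle):
    """Yield (record_number, header, problem) for suspicious 4-line FASTQ records."""
    record_number = 0
    while True:
        header = handle.readline()
        if not header:
            break  # end of file
        seq = handle.readline().rstrip("\r\n")
        plus = handle.readline().rstrip("\r\n")
        qual = handle.readline().rstrip("\r\n")
        header = header.rstrip("\r\n")
        record_number += 1
        if not header.startswith("@"):
            yield record_number, header, "header does not start with '@'"
        elif not plus.startswith("+"):
            yield record_number, header, "third line does not start with '+'"
        elif len(seq) == 0:
            yield record_number, header, "zero-length read"
        elif len(seq) != len(qual):
            yield record_number, header, "sequence/quality length mismatch"


# Tiny invented input: record 2 is empty, record 3 has a short quality string.
example = io.StringIO(
    "@read1 1:N:0:\nACGT\n+\nFFFF\n"
    "@read2 1:N:0:\n\n+\n\n"
    "@read3 1:N:0:\nACG\n+\nFF\n"
)
problems = list(find_bad_fastq_records(example))
for number, header, problem in problems:
    print(number, header, problem)
```

To check a real file, pass an open file handle (or gzip.open(path, "rt") for compressed input) instead of the StringIO object.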
Old 12-08-2015, 01:33 PM   #284
babine
Junior Member
 
Location: Gatersleben

Join Date: Dec 2011
Posts: 7
Guidance for mapping with BBMap?

Dear Brian, dear users,

I have reads from an exon-based sequence capture and I'm trying to map them to not-so-accurate reference sequences. I started using BBMap because I hoped it would have the feature I'm looking for.

My problem is that I'm getting misaligned reads, especially at the read ends, where two "stacks" of reads come into contact. They should not touch, but my reference sequence artificially brings them close together. I would gladly inspect the alignments for those regions and introduce gaps in the reference, but I cannot do that for more than a few loci. So I was wondering whether there is a cleverer way, by tweaking the parameters.

I experimented a bit with minid, k, and padding, but without much success.

Do you have any suggestions on which parameter would have the most impact? Any particular strategy?
Or let me know if you think another algorithm would do it.

Any comment is appreciated...

Cheers
Old 12-08-2015, 02:02 PM   #285
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707

Would it be possible to post an image from IGV, or some kind of text diagram? I can't quite visualize what you're describing.
Old 12-09-2015, 02:07 AM   #286
babine
Junior Member
 
Location: Gatersleben

Join Date: Dec 2011
Posts: 7

Sure, Brian, here it is! A screenshot from Geneious of an alignment obtained with BBMap, run as:

Executing align2.BBMap [in1=177291_Brachy_300_Brachypodium_distachyon_2x_R1.fastq, in2=177291_Brachy_300_Brachypodium_distachyon_2x_R2.fastq, out=Bdi_1H_AK250130_mapped_bis.sam,
ref=.\resources\Brachypodium_distachyon_2x_AH_AK250130.fasta, K=9, build=3, maxindel=30, pairlen=600, minid=0.9, padding=20, overwrite=t]

"A picture is worth a thousand words"

Best
Attached Images
File Type: png mapping_problem.PNG (62.8 KB, 17 views)
Old 12-09-2015, 06:52 AM   #287
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,079

@babine: What do you mean by a "not-so-accurate" reference? Is the reference as depicted in the picture wrong (i.e., two ends are joined that should not be together)? Note: I don't use Geneious, but I assume one of the lines at the top represents the reference (and the other a consensus?).
Old 12-09-2015, 06:55 AM   #288
babine
Junior Member
 
Location: Gatersleben

Join Date: Dec 2011
Posts: 7

@GenoMax: Yes, that's exactly what I assume from the read profile. And yes, the reference is the sequence with the yellow background just below the consensus.
Old 12-09-2015, 07:09 AM   #289
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,079

I wonder if you should manually put a break (a stretch of Ns at that position) to force the two ends apart. While that may resolve this read-end issue, I am not sure that is how things are in the real genome.

Are you certain your reads (R1/R2) do not overlap? If they do, you could merge them before alignment.

Are the reads in the image corresponding pairs (R1/R2)?
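
The manual break suggested above can be sketched in a few lines of Python. This is only an illustration: the function name insert_gap, the gap length, and the toy sequence are all invented here, and the right position and gap size for a real locus have to come from inspecting the alignment.

```python
def insert_gap(sequence, position, gap_length=100, gap_char="N"):
    """Return sequence with gap_length copies of gap_char inserted before 0-based position."""
    if not 0 <= position <= len(sequence):
        raise ValueError("position is outside the sequence")
    return sequence[:position] + gap_char * gap_length + sequence[position:]


# Toy reference; a real one would be read from the FASTA file.
reference = "ACGTACGTAC"
gapped = insert_gap(reference, position=5, gap_length=4)
print(gapped)
```

Note that every coordinate downstream of the insertion point shifts by the gap length, so any annotation tied to the original reference would need the same offset.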
Old 12-09-2015, 10:58 AM   #290
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707

I wonder if maybe a polishing program like Quiver could automatically fix the reference for you. That would be the ideal solution. The problem, of course, is that there are misassemblies or major structural variations with respect to the reference. Hmmm...

I think there are 3 main options.

1) Map with the "local" flag. This will soft-clip the part of the read that extends across the misassembly, so it will not result in spurious variations being called. That's certainly the easiest approach.
2) Try polishing the assembly with something like Quiver, to automatically fix these things.
3) Make a new assembly, if this genome differs substantially enough from the reference that this issue is common.
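
To see how much of each read gets soft-clipped under option 1, one could inspect the CIGAR strings in the SAM output. Below is a minimal Python sketch, assuming standard SAM CIGAR operators; the helper name soft_clipped_bases is hypothetical and the example CIGAR strings are invented, not taken from any real BBMap run.

```python
import re

# One CIGAR element: a length followed by a standard SAM operator.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def soft_clipped_bases(cigar):
    """Return (leading, trailing) soft-clipped base counts from a SAM CIGAR string."""
    ops = [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]
    # Hard clips (H) can sit outside soft clips, so ignore them.
    core = [item for item in ops if item[1] != "H"]
    leading = core[0][0] if core and core[0][1] == "S" else 0
    trailing = core[-1][0] if len(core) > 1 and core[-1][1] == "S" else 0
    return leading, trailing


for cigar in ("100M48S", "5S90M", "3S90M7S", "150M"):
    print(cigar, soft_clipped_bases(cigar))
```

Reads with a large clipped tail that pile up at the same reference position would be a hint that the reference is joined incorrectly there.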
Old 12-09-2015, 11:56 AM   #291
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,079

@babine: Are you only "borrowing" the reference, i.e., you do not have raw/original data for the "not-so-accurate" reference?
Old 12-18-2015, 05:17 PM   #292
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707

Hi everyone,

I wanted to mention that there is now some additional documentation for BBTools in the form of guides/tutorials, in the /docs/ folder. Currently there are guides for BBDuk, BBMerge, Seal, Tadpole, Reformat, Dedupe, BBNorm, and Taxonomy, and I plan to add more in the near future. There's also an overview of the general usage of all BBTools (UsageGuide.txt) and a list of all the commonly-used tools with a brief description (ToolDescriptions.txt).

I hope these will be useful, and please let me know if anything is unclear or needs to be expanded, or there is a common use that the guides don't address.

-Brian
Old 12-18-2015, 06:35 PM   #293
JulesWinchester
Junior Member
 
Location: Houston

Join Date: Dec 2015
Posts: 2

Hello, I am working with mammalian genomic data. I have been aligning raw reads to all the genes within an organism's chromosomes. While working with this data set, I noticed some odd values when calculating coverage with BBTools' built-in pileup feature. I would really appreciate it if anyone could help me interpret them.

Here is the issue:

Data being used:
-CDS (W/ Introns, not spliced) of genes extracted from a Chromosome Genbank file (found @ NCBI ftp://ftp.ncbi.nlm.nih.gov/genomes/P...odytes/CHR_01/).
-Raw reads collected from ENA (Adapter free WGS sequences)

BBMap command line used to calculate coverage:
bbmap.sh in1=Chimp_1.fastq in2=Chimp_2.fastq ref=ChimpChrom1.fa local=t nodisk covstats=Chrom1Stats.txt covhist=Chrom1Hist.txt

Results for the chimp. The results for all other mammals I've worked with are very similar.
Coverage - 43.97
Standard Deviation - 507.47

I then looked at the statistics BBMap produced to see whether specific gene sequences were driving the SD. Many genes had a median fold of 0, others exceeded 1,000, and some even had negative values (-1). Why does this happen? Is it normal? If not, is there a way to fix it?


I have attached the stats file produced by BBMap to this post for anyone who would like to see what the output looks like. I took the liberty of filtering the file by median fold (100 or more), keeping only the gene ID and Median Fold columns. In a follow-up post, I will attach a text file containing the low median fold values (anything below 100).
Note: The coverage and SD tendency remains, even when using a reference fasta containing only the longest isoform per gene (1 isoform per gene).
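
The kind of filtering described above, and the way a handful of very deep genes can push the SD far above the mean, can be sketched in Python as follows. The column names (#ID, Avg_fold, Median_fold) are assumptions for this example and should be checked against the header of the actual stats file; the function name classify_by_median and all values are invented for illustration.

```python
import csv
import io
import statistics


def classify_by_median(handle, id_col="#ID", median_col="Median_fold", high=100.0):
    """Group gene IDs of a tab-delimited coverage table by negative, zero, or high median fold."""
    groups = {"negative": [], "zero": [], "high": []}
    for row in csv.DictReader(handle, delimiter="\t"):
        median = float(row[median_col])
        if median < 0:
            groups["negative"].append(row[id_col])
        elif median == 0:
            groups["zero"].append(row[id_col])
        elif median >= high:
            groups["high"].append(row[id_col])
    return groups


# Invented values, for illustration only.
table = io.StringIO(
    "#ID\tAvg_fold\tMedian_fold\n"
    "geneA\t12.5\t11\n"
    "geneB\t0.4\t0\n"
    "geneC\t2150.0\t1800\n"
    "geneD\t3.2\t-1\n"
)
groups = classify_by_median(table)
print(groups)

# One very deep gene among shallow ones gives an SD larger than the mean.
folds = [12.5, 0.4, 2150.0, 3.2]
mean_fold = statistics.mean(folds)
sd_fold = statistics.pstdev(folds)
print(mean_fold, sd_fold)
```

Listing the genes in the "negative" group separately makes it easier to report exactly which records show the -1 values.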
Attached Files
File Type: txt HighCovID.txt (9.3 KB, 3 views)

Last edited by JulesWinchester; 12-19-2015 at 02:46 AM.
Old 12-18-2015, 06:49 PM   #294
JulesWinchester
Junior Member
 
Location: Houston

Join Date: Dec 2015
Posts: 2

Here are the files I promised. I also attached the whole statistical output, in case anyone wants to see.

Thanks in advance!
Attached Files
File Type: gz LowCovID.txt.gz (16.0 KB, 3 views)
File Type: gz Stats.txt.gz (121.7 KB, 3 views)
Old 01-13-2016, 07:02 PM   #295
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,079

@Brian: I am having trouble using dedupe.sh with paired end reads (using BBMap v.35.59).

Code:
Unknown parameter out1=sampleID_L001_R1_001.fastq.gz
     at jgi.Dedupe.<init>(Dedupe.java:383)
     at jgi.Dedupe.main(Dedupe.java:80)

The command I am using is in the format you had posted:

Code:
$ dedupe.sh in1=read1.fq in2=read2.fq out1=x1.fq out2=x2.fq ac=f
Single-end files work fine.
Old 01-13-2016, 09:45 PM   #296
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707

Thanks for finding that! Looks like a typo crept in; I'll fix that in the next release. In the meantime, you can use "out=" instead of "out1=".
Old 01-13-2016, 09:50 PM   #297
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707

Quote:
Originally Posted by JulesWinchester
Hello, I am working with genomic data belonging to mammals. I have been aligning raw reads to all the genes within an organism's chromosomes. While working with this data set, I have noticed some odd values when calculating coverage with BBTools' built-in pileup feature. I would really appreciate it if anyone knew how to interpret this.
I don't see anything odd in general - high variances are not really unusual. But the -1 values are definitely problematic and should not happen. I'll investigate and get back to you.
Old 01-14-2016, 04:38 AM   #298
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,079

Quote:
Originally Posted by Brian Bushnell
Thanks for finding that! Looks like a typo crept in; I'll fix that in the next release. In the meantime, you can use "out=" instead of "out1=".
For now I only want an estimation of duplicates for some HS4000 paired end data so I will just omit "out=". I assume a single "out=" will result in an interleaved file (for PE input) that would have to be resolved afterwards?
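
If an interleaved file does need to be split back into R1/R2 afterwards, the operation itself is simple. Here is a minimal Python sketch, assuming strict four-line FASTQ records with each R1 immediately followed by its mate; the function name deinterleave and the toy reads are invented for this example.

```python
import io
from itertools import islice


def deinterleave(interleaved, out1, out2):
    """Write alternating 4-line FASTQ records to out1 and out2; return the number of pairs."""
    pairs = 0
    while True:
        first = list(islice(interleaved, 4))
        if not first:
            break  # end of input
        second = list(islice(interleaved, 4))
        if len(first) != 4 or len(second) != 4:
            raise ValueError("input does not contain complete read pairs")
        out1.writelines(first)
        out2.writelines(second)
        pairs += 1
    return pairs


# Two invented read pairs, interleaved as R1, R2, R1, R2.
interleaved = io.StringIO(
    "@pair1 1:N:0:\nACGT\n+\nFFFF\n"
    "@pair1 2:N:0:\nTTGA\n+\nFFFF\n"
    "@pair2 1:N:0:\nGGCC\n+\nFFFF\n"
    "@pair2 2:N:0:\nAATT\n+\nFFFF\n"
)
r1, r2 = io.StringIO(), io.StringIO()
n_pairs = deinterleave(interleaved, r1, r2)
print(n_pairs)
```

With real data, the three StringIO objects would be replaced by open file handles.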
Old 01-14-2016, 07:08 AM   #299
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707

Quote:
Originally Posted by GenoMax
For now I only want an estimation of duplicates for some HS4000 paired end data so I will just omit "out=". I assume a single "out=" will result in an interleaved file (for PE input) that would have to be resolved afterwards?
"out" is synonymous with "out1". So if the input is paired, "out=x.fq" or "out1=x.fq" will produce interleaved output; "out=x1.fq out2=x2.fq" or "out1=x1.fq out2=x2.fq" would both produce dual-file output (once Dedupe parses "out1" correctly).
Old 01-28-2016, 12:58 AM   #300
Shini Sunagawa
Junior Member
 
Location: Germany

Join Date: Jan 2016
Posts: 8
dedupe.sh

Dear Brian,

I have been looking for a tool that would quickly dereplicate (100% containments) nucleotide sequences and track for each unique sequence the identifiers of the removed duplicates.

Something like:

dedupe.sh in=in.fa out=out.fa outd=outd.fa mid=100 mop=100

where:

in.fa:
seq1
seq2 (contained in seq1)
seq3 (contained in seq1)
seq4

out.fa:
seq1
seq4

outd.fa:
seq2
seq3

I am interested in:
seq1<tab>seq2,seq3
seq4

dedupe.sh does a fantastic job of returning out and outd, but I cannot find any option that would return the information I am interested in. Am I missing something? If not, I believe this would be a great feature, since dedupe is so much faster than the other tools that return this information.

Best,
Shini
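
To make the requested output concrete, here is a naive Python sketch that produces the "container<tab>contained,contained" table for exact containments. It is an O(n^2) illustration on toy sequences and not how dedupe.sh works internally; the function names are invented, and counting reverse-complement matches as containments is a choice made in this sketch.

```python
def reverse_complement(seq):
    """Return the reverse complement of an uppercase or lowercase DNA string."""
    return seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]


def containment_clusters(records):
    """Map each retained sequence ID to the IDs of sequences fully contained in it."""
    # Visit longest sequences first so containers are seen before what they contain.
    ordered = sorted(records.items(), key=lambda item: len(item[1]), reverse=True)
    clusters = {}
    kept = []  # (id, uppercase sequence) of retained sequences
    for seq_id, seq in ordered:
        query = seq.upper()
        rc_query = reverse_complement(query)
        for kept_id, kept_seq in kept:
            if query in kept_seq or rc_query in kept_seq:
                clusters[kept_id].append(seq_id)
                break
        else:
            kept.append((seq_id, query))
            clusters[seq_id] = []
    return clusters


# Toy input: seq2 is a substring of seq1, seq3 matches seq1 as a reverse complement.
records = {
    "seq1": "ACGTTGCAAGGCTT",
    "seq2": "GTTGCA",
    "seq3": "AAGCC",
    "seq4": "TTTTCCCCGG",
}
clusters = containment_clusters(records)
lines = [
    key + "\t" + ",".join(members) if members else key
    for key, members in clusters.items()
]
print("\n".join(lines))
```

For large data sets this pairwise scan would be far too slow, which is exactly why having the mapping reported by a fast tool would be valuable.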

Tags
bbmap, metagenomics, rna-seq aligners, short read alignment

All times are GMT -8.

