**Brian Bushnell** · 04-30-2014, 08:29 AM

How long are your reads?

**jpummil** · 04-30-2014, 09:07 AM

As 454 data is written in binary .sff format, how does one tell how long the reads are? Standard tools for ASCII text such as grep, awk, wc, etc. can't be used...

**MurielGB** · 04-30-2014, 09:10 AM

My data are in fast format.
It seems that the lengths range from 60 to 900 bp...

I haven't trimmed on quality or length; maybe I should? Which tool should I use? I used Sickle for another project with Illumina data, but I heard it's not the best for 454. Is that true?
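For the read-length question raised above, here is a minimal sketch of one way to get the shortest and longest read from a FASTQ file with awk. It assumes every record is exactly four lines (no wrapped sequence lines), and the function name `read_length_range` is made up for illustration. Binary .sff files would need converting to FASTQ first.

```shell
# Print the shortest and longest read length in a FASTQ file.
# Assumption: each record is exactly four lines, so the sequence
# is always line 2 of every group of 4.
read_length_range() {
    awk 'NR % 4 == 2 {
             n = length($0)
             if (seen == 0 || n < min) min = n
             if (seen == 0 || n > max) max = n
             seen = 1
         }
         END { if (seen) print min, max }' "$1"
}
```

Calling `read_length_range reads.fq` prints two numbers, the minimum and maximum read length, separated by a space.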

**Brian Bushnell** · 04-30-2014, 09:18 AM

Originally posted by MurielGB

My data are in fast format.
It seems that the lengths range from 60 to 900 bp...

You could run this with BBMap, which is more error-tolerant than Tophat. For reads over 500bp, though, the command line would be a little different, as it needs to run in PacBio mode:

(index)
mapPacBio8k.sh ref=reference.fasta -Xmx23g

(map)
mapPacBio8k.sh in=reads.fq out=mapped.sam xstag=unstranded maxindel=200000 qin=33 intronlen=10 -Xmx23g

The "-Xmx23g" sets the amount of memory to use; it should be set to around 85% of physical memory.
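As a rough illustration of the 85% rule of thumb above, the sketch below turns a total-memory figure into a `-Xmx` flag. The helper names `xmx_flag` and `total_mem_kb` are hypothetical, and reading `/proc/meminfo` assumes a Linux host; on other systems the total would have to come from elsewhere.

```shell
# Suggest a Java heap flag of roughly 85% of physical memory.
# Argument: total memory in kB (the unit MemTotal uses in /proc/meminfo).
# Integer arithmetic rounds down to whole gigabytes.
xmx_flag() {
    echo "-Xmx$(( $1 * 85 / 100 / 1024 / 1024 ))g"
}

# Read total physical memory in kB (assumption: Linux with /proc/meminfo).
total_mem_kb() {
    awk '/^MemTotal:/ { print $2 }' /proc/meminfo
}
```

On a Linux machine, `xmx_flag "$(total_mem_kb)"` prints a flag that can be appended to the index and map commands in place of `-Xmx23g`.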

I haven't trimmed on quality or length; maybe I should? Which tool should I use? I used Sickle for another project with Illumina data, but I heard it's not the best for 454. Is that true?

Don't trim unless you have problems with the mapping rates. But I'm not sure how a trimming tool designed for Illumina data would perform on 454 data, anyway.

**GenoMax** · 04-30-2014, 09:19 AM

Give BBMap a try: http://seqanswers.com/forums/showthread.php?t=41057

What state is your reference sequence in (multiple contigs)?

Edit: Brian beat me to it

Brian: The reads here are in fasta format. I assume BBMap will accept them.

**MurielGB** · 04-30-2014, 09:20 AM

OK, thank you. I didn't know this software; I'll have a look at it!

I wanted to use Tophat because I also have Illumina RNA-seq data that I mapped with Tophat, and I thought it would be nice to use the same tool...

**Brian Bushnell** · 04-30-2014, 09:22 AM

Originally posted by GenoMax

Edit: Brian beat me to it

Brian: The reads here are in fasta format. I assume BBMap will accept them.

Yes, it will.

**MurielGB** · 04-30-2014, 09:22 AM

Sorry, I meant FastQ format...

**Brian Bushnell** · 04-30-2014, 09:24 AM

Originally posted by MurielGB

Sorry, I meant FastQ format...

No problem, it supports input of fasta, fastq, scarf, and even sam.

**GenoMax** · 04-30-2014, 09:24 AM

Originally posted by MurielGB

Sorry, I meant FastQ format...

In that case, Brian's command line is valid as written.

**MurielGB** · 04-30-2014, 09:26 AM

What is the output of BBMap?
I then want to use Cufflinks; is that OK?

**GenoMax** · 04-30-2014, 09:28 AM

Originally posted by MurielGB

What is the output of BBMap?
I then want to use Cufflinks; is that OK?

In the example command line above, the output is a SAM file.

**MurielGB** · 05-01-2014, 02:03 AM

If I understand correctly, I need to split my fastq files, because I will use bbmap.sh for the short reads and mapPacBio8k.sh for the longest ones?
I checked, and I have reads from 45 to 1201 bp.

If this is the case, what should the maximum length be for the short-read file used with bbmap.sh?
And what is the minimum length for mapPacBio8k.sh?

Do you know a tool to do that?

Since this is RNA-seq data, I guess I should index the reference this way:
bbmap.sh ref=ref.fasta k=14

But how do I choose midpad? My genome is 460 Mb long and there are 8,000 scaffolds.
I guess I could calculate the length of the longest gap in my reference and use a higher value?

My longest expected read is 1201 bp, so should I leave the default value (8,000) for startpad and stoppad? Or maybe I can set them to 1,300?
Can you explain how this will affect the rest of the analysis?

Finally, if I use bbmap.sh for short reads and mapPacBio8k.sh for long reads, should I build two different indexes, one using bbmap.sh and the other using mapPacBio8k.sh?

Thank you,

Muriel

**Brian Bushnell** · 05-01-2014, 08:44 AM

Muriel,

Just use mapPacBio8k.sh for all reads; it will work fine. It has a default midpad of 6000, which will also be fine for those reads. The value of midpad is not very important, as long as it is at least as long as your longest read; it's just there to prevent reads from mapping across two adjacent scaffolds.

Since you brought it up, it is worth noting that BBMap has a default of k=13, while the PacBio version has a default of k=12 because of PacBio's high error rate. And yes, a longer k can help with RNA-seq data that has long introns, so it's probably worth using k=13 or k=14. To set k=14, you need to include that flag both while indexing and while mapping.
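The point about passing k in both steps can be sketched as a small dry-run helper that only prints the two commands. The function name `bbmap_commands` is hypothetical; every flag other than `k` is copied from the commands posted earlier in this thread, and file names and `-Xmx23g` should be adjusted to the actual data and machine.

```shell
# Print (do not run) the index and map commands with the same k in both.
# Argument: the k-mer length to use, e.g. 14.
bbmap_commands() {
    k=$1
    # Index step: build the reference index with the chosen k.
    echo "mapPacBio8k.sh ref=reference.fasta k=$k -Xmx23g"
    # Map step: the same k must be given again so it matches the index.
    echo "mapPacBio8k.sh in=reads.fq out=mapped.sam xstag=unstranded maxindel=200000 qin=33 intronlen=10 k=$k -Xmx23g"
}
```

Running `bbmap_commands 14` prints both command lines for review before they are run by hand.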

Map 454 RNA-seq single-end reads on a genome: UPDATE