SEQanswers

Go Back   SEQanswers > Applications Forums > De novo discovery



Similar Threads
Thread Thread Starter Forum Replies Last Post
RNA-Seq and mouse reference genome ChristmasSunflower Bioinformatics 3 06-25-2014 11:23 PM
custom reference genome for RNA-seq enelkinsan Bioinformatics 2 01-05-2013 02:45 AM
RNA-seq with No Reference Genome taylormjeffery Bioinformatics 1 06-06-2012 08:32 AM
the downstream analysis of RNA-seq Xi Wang RNA Sequencing 18 04-15-2011 07:43 AM
RNA-seq assembly and reference genome lfaino Bioinformatics 3 04-13-2011 07:05 AM

Reply
 
Thread Tools
Old 08-21-2013, 07:06 AM   #1
nareshvasani
Member
 
Location: NC

Join Date: Apr 2013
Posts: 57
Smile Downstream RNA-seq analysis without reference genome

Hello all Seqanswer community Users,

I am biologist. Learning bioinformatics from scratch. Performing RNA-seq analysis for first time:

I got fastq files from ion proton instrument, it has single end read, 50-340 sequence length.
I don't have reference genome.
Here is how I did my analysis:
1]Fastqc
2]Trimmmimg some read using fastx tool:
a]First used fastq_qulaity trimmer -Q33 -t 20 -l 50 -i -o, because some of sequence has quality less than 18
b]fastx_trimmer = to trim off few reads from end of seq

For de novo assembly:
3]Velveth with kmer 31
4]velvetg
5] bowtie-build== to build reference index from contig.fa file created from velvetg
6] Mapping my fastq read with above build reference index

So my questions are:
1] Am i doing trimimg in right manner
2] On what basis you select parameter of velvetg and velveth
2] Which kmer value to select
3] Am I running bowtie in correct manner?
4] If yes, how do i confirm assembly created using velvet contain preserved input information and it's accuracy.

Hope you all can help me out.
Thanks a bunch in advance.

Naresh
nareshvasani is offline   Reply With Quote
Old 08-21-2013, 10:49 AM   #2
westerman
Rick Westerman
 
Location: Purdue University, Indiana, USA

Join Date: Jun 2008
Posts: 1,104
Default

I would use a tool designed to put RNAseq reads together. It has a been a while since I used Velvet but as far as I know it is designed to assemble genomes not transcripts. My favorite RNAseq tool is 'Trinity'.

The above assumes that your reads are from transcripts and not from the entire genome.

I know that the above advise does not answer your specific questions. However in case you did start down a poor path then I wanted to concentrate on correcting that instead of specifics. I suppose I could answer #1 -- trimming. Seems ok. Not sure why you want to trim off the end of the sequence after quality trimming but it won't hurt.
westerman is offline   Reply With Quote
Old 08-21-2013, 12:14 PM   #3
nareshvasani
Member
 
Location: NC

Join Date: Apr 2013
Posts: 57
Smile Hi Westerman,

Hi,

You are right, I have transcripts read. I forgot to mention I have also used oases: it is post assembly processor for velvet, it work as transcriptome assembler.

I trimmed some base from end to improve per base GC content and per base sequence content.
If you don't mind can you please explain me in detail about:
fastq_qulaity trimmer -Q33 -t 20 -l 50 -i -o
upto my understanding it remove nucleotides having quality score less lower than 20 from the ends of the read. Furthermore, any trimmed reads having length less than 50 nt are discarded altogether.

Trinity is also good for transcriptome assembly. But I have never used that.
Can you please help with parameter for trinity command line.

Trinity.pl -SeqType Fq -min_contig_length 150 -JM 10G -single inputfilename -CPU 2 -output output_filename

Which other option do i need to consider for running Trinity like butterfly, inchworm, kmer and Chrysalis, etc

Thanks for your input.
I would really appreciate your feedback

Naresh


Quote:
Originally Posted by westerman View Post
I would use a tool designed to put RNAseq reads together. It has a been a while since I used Velvet but as far as I know it is designed to assemble genomes not transcripts. My favorite RNAseq tool is 'Trinity'.

The above assumes that your reads are from transcripts and not from the entire genome.

I know that the above advise does not answer your specific questions. However in case you did start down a poor path then I wanted to concentrate on correcting that instead of specifics. I suppose I could answer #1 -- trimming. Seems ok. Not sure why you want to trim off the end of the sequence after quality trimming but it won't hurt.
nareshvasani is offline   Reply With Quote
Old 08-21-2013, 12:30 PM   #4
westerman
Rick Westerman
 
Location: Purdue University, Indiana, USA

Join Date: Jun 2008
Posts: 1,104
Default

Your understanding of fastq_quality_trimmer is correct. BTW, if you ever get paired-end sequences then use 'trimmomatic' instead since it works much better with PE reads.

The other parameters to Trinity will depend on the size of your computer system -- e.g., how much memory, how many CPUs -- but these parameters are not required and what you have should be good enough. I suggest running Trinity with the parameters you have and see what happens. In the end you should get a 'Trinity.fasta' file. The other files can be discarded.

Once you get a 'Trinity.fasta' file then you can use bowtie2 to back-map your reads or, perhaps better, the Trinnotate annotation pipeline described on the Trinity web site.
westerman is offline   Reply With Quote
Old 08-21-2013, 12:37 PM   #5
nareshvasani
Member
 
Location: NC

Join Date: Apr 2013
Posts: 57
Smile Hi Westerman,

Hi,

Thanks for your prompt reply.
I really appreciate your suggestion.
Do you think if I put some more input option for butterfly, inchworm, kmer and Chrysalis, it will give me better contig file?

Thanks,
Naresh


Quote:
Originally Posted by westerman View Post
Your understanding of fastq_quality_trimmer is correct. BTW, if you ever get paired-end sequences then use 'trimmomatic' instead since it works much better with PE reads.

The other parameters to Trinity will depend on the size of your computer system -- e.g., how much memory, how many CPUs -- but these parameters are not required and what you have should be good enough. I suggest running Trinity with the parameters you have and see what happens. In the end you should get a 'Trinity.fasta' file. The other files can be discarded.

Once you get a 'Trinity.fasta' file then you can use bowtie2 to back-map your reads or, perhaps better, the Trinnotate annotation pipeline described on the Trinity web site.
nareshvasani is offline   Reply With Quote
Old 08-21-2013, 12:46 PM   #6
westerman
Rick Westerman
 
Location: Purdue University, Indiana, USA

Join Date: Jun 2008
Posts: 1,104
Default

No. The options to inchworm and chrysalis are really performance related. Butterfly has some non-performance related options but I would stick with the defaults unless you get something that seems weird. Really the only useful extra non-performance option is ' --jaccard_clip' which is used on high-gene density genomes.
westerman is offline   Reply With Quote
Old 08-21-2013, 12:51 PM   #7
nareshvasani
Member
 
Location: NC

Join Date: Apr 2013
Posts: 57
Post

Hi,

Thanks a lot.


Naresh
Quote:
Originally Posted by westerman View Post
No. The options to inchworm and chrysalis are really performance related. Butterfly has some non-performance related options but I would stick with the defaults unless you get something that seems weird. Really the only useful extra non-performance option is ' --jaccard_clip' which is used on high-gene density genomes.
nareshvasani is offline   Reply With Quote
Old 08-22-2013, 05:37 AM   #8
Cofactor Genomics
Registered Vendor
 
Location: St. Louis

Join Date: Jan 2010
Posts: 52
Default

One metric you can look at to assess your assembly is the percentage of reads that align back to your transcriptome assembly.
Cofactor Genomics is offline   Reply With Quote
Old 08-22-2013, 06:01 AM   #9
nareshvasani
Member
 
Location: NC

Join Date: Apr 2013
Posts: 57
Post Hi Cofactor,

HI,

Can you please ellobarate how can I perform that?

Thanks,
Naresh

Quote:
Originally Posted by Cofactor Genomics View Post
One metric you can look at to assess your assembly is the percentage of reads that align back to your transcriptome assembly.
nareshvasani is offline   Reply With Quote
Old 08-22-2013, 06:12 AM   #10
Cofactor Genomics
Registered Vendor
 
Location: St. Louis

Join Date: Jan 2010
Posts: 52
Default

Well, your newly formed transcriptome assembly is your reference, the raw read data used to generate the assembly is your data and you treat it like a RNA-seq project. In this manner, you align the raw data (trimming does not matter here, this is just a QA check) to the assembly and divide the number of reads aligning to the assembly by the total number of reads that went into the assembly. This is just a rough check and one could argue that you will miss things, however it is good for a rough check.

From these alignments, you may find some surprising results in that the percentage of reads are pretty low. Non-transcriptome assemblers do not like to see large differences in coverage in an assembly, assuming these are repetitive areas that are piling, but this type of data is inherent with RNA data.

Did you perform any manipulations during library prep to treat the RNA for the assembly process, like double-stranded nuclease treatment (to compress the dynamic range of the sample)? This can greatly help an assembly if one is not to heavy handed in the treatment.

It is hard to tell you what percentages are good and bad since I am not sure how the material was treated prior to sequencing or what your goals are for the assembly.

Hope this helps.

Jon Armstrong
Cofactor Genomics is offline   Reply With Quote
Old 08-22-2013, 07:07 AM   #11
nareshvasani
Member
 
Location: NC

Join Date: Apr 2013
Posts: 57
Post

Hi,
Thanks for your prompt reply.
I didn't perform any manipulation during library prep.

Thanks in advance,
Naresh

Quote:
Originally Posted by Cofactor Genomics View Post
Well, your newly formed transcriptome assembly is your reference, the raw read data used to generate the assembly is your data and you treat it like a RNA-seq project. In this manner, you align the raw data (trimming does not matter here, this is just a QA check) to the assembly and divide the number of reads aligning to the assembly by the total number of reads that went into the assembly. This is just a rough check and one could argue that you will miss things, however it is good for a rough check.

From these alignments, you may find some surprising results in that the percentage of reads are pretty low. Non-transcriptome assemblers do not like to see large differences in coverage in an assembly, assuming these are repetitive areas that are piling, but this type of data is inherent with RNA data.

Did you perform any manipulations during library prep to treat the RNA for the assembly process, like double-stranded nuclease treatment (to compress the dynamic range of the sample)? This can greatly help an assembly if one is not to heavy handed in the treatment.

It is hard to tell you what percentages are good and bad since I am not sure how the material was treated prior to sequencing or what your goals are for the assembly.

Hope this helps.

Jon Armstrong
nareshvasani is offline   Reply With Quote
Old 08-22-2013, 07:21 AM   #12
nareshvasani
Member
 
Location: NC

Join Date: Apr 2013
Posts: 57
Post Hi westerman,

Hi,

I am trying to run read alignment of my fastq file with Trinity.fa file using trinity's script that is alignreads.pl

I used below cmd:
#### /bin/util/alignReads.pl -seqType fq -single inputfile_name -target Trinity.fasta -aligner bowtie2 -p 4 -retain_intermediate_files -num_top_hits 20 -output align_bowtie_output

but i am getting below error:
Must specify target_db and it must exist at that location at /bin/util/alignReads.pl line 180

I don't know what does that mean, as I am good with reading script.

Hope you can help me out.
Thanks,
Naresh


Quote:
Originally Posted by westerman View Post
Your understanding of fastq_quality_trimmer is correct. BTW, if you ever get paired-end sequences then use 'trimmomatic' instead since it works much better with PE reads.

The other parameters to Trinity will depend on the size of your computer system -- e.g., how much memory, how many CPUs -- but these parameters are not required and what you have should be good enough. I suggest running Trinity with the parameters you have and see what happens. In the end you should get a 'Trinity.fasta' file. The other files can be discarded.

Once you get a 'Trinity.fasta' file then you can use bowtie2 to back-map your reads or, perhaps better, the Trinnotate annotation pipeline described on the Trinity web site.
nareshvasani is offline   Reply With Quote
Old 08-22-2013, 12:13 PM   #13
westerman
Rick Westerman
 
Location: Purdue University, Indiana, USA

Join Date: Jun 2008
Posts: 1,104
Default

Well, off-hand I would say that you need to give the whole path to the Trinity.fasta file. I suspect that it is not located in the directory in which you are located.

In these 'file not found' cases it is always helpful to the rest of us if you can include the results of:

pwd

and a

ls -l
westerman is offline   Reply With Quote
Old 08-22-2013, 12:24 PM   #14
nareshvasani
Member
 
Location: NC

Join Date: Apr 2013
Posts: 57
Post Hi westerman,

Hi westerman,

Thanks for you reply.
fullpath was mising in my cmd line.

Above cmd worked but with bowtie not with bowtie2.

With below cmd:
/bin/util/alignReads.pl --seqType fq --single CombineIonXpressRNA_010_NareshPool_Chip1_2_WT2_fastxtrimmer_from_quality_trimmer.fastq --target /media/DATAPART3/Combine_Files/Velvetoptimiser_27_37/trinity_output/Trinity.fasta --aligner bowtie2 --retain_intermediate_files --num_top_hits 20 --output align_bowtie_output1

Following error:
which: no tophat2 in (/root/perl5/bin:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/root/bin:/root/Trinity/util/../trinity-plugins/rsem/sam/)
Error, path to required tophat2 cannot be found at /bin/util/alignReads.pl line 234.

#If I used bowtie instead of bowtie 2, it work fine with problem i.e. sam file has no header.

Hope you can help me out.
Thanks in advance.
Naresh


Quote:
Originally Posted by westerman View Post
Well, off-hand I would say that you need to give the whole path to the Trinity.fasta file. I suspect that it is not located in the directory in which you are located.

In these 'file not found' cases it is always helpful to the rest of us if you can include the results of:

pwd

and a

ls -l
nareshvasani is offline   Reply With Quote
Old 08-23-2013, 07:45 AM   #15
westerman
Rick Westerman
 
Location: Purdue University, Indiana, USA

Join Date: Jun 2008
Posts: 1,104
Default

I am surprised that Trinity would be looking for tophat2. However bowtie2 is closely associated with tophat2. I suggest installing tophat2. You may never need it but that should be the way to get Trinity to use bowtie2.
westerman is offline   Reply With Quote
Old 08-23-2013, 07:59 AM   #16
nareshvasani
Member
 
Location: NC

Join Date: Apr 2013
Posts: 57
Post

Thanks,

If I use bowtie, does it make any difference in alignment.
When I am running cmd with bowtie, while processing it shows below message:
CMD: bowtie -k 20 -S --sam-nohead -q target single_fa > single_fa.pre.sam

Upto my understanding it says sam has no header. Am I correct?

when i run stat.pl cmd to obtain read alignment statistics:
result shows:
read_type count pct
single 6874526 100.00

but no information about proper and improper pairing.

Hope you can shed lights on this.

Thanks a bunch.
Naresh



Quote:
Originally Posted by westerman View Post
I am surprised that Trinity would be looking for tophat2. However bowtie2 is closely associated with tophat2. I suggest installing tophat2. You may never need it but that should be the way to get Trinity to use bowtie2.
nareshvasani is offline   Reply With Quote
Old 08-23-2013, 08:09 AM   #17
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,978
Default

Bowtie 2 supports gapped, local, and paired-end alignment modes. That may be a reason to prefer it over bowtie.

Do you have paired-end data? Otherwise how will you get information about pairing?
GenoMax is offline   Reply With Quote
Old 08-23-2013, 08:35 AM   #18
nareshvasani
Member
 
Location: NC

Join Date: Apr 2013
Posts: 57
Post Hi Genomax,

Thanks for feedback.

I don't have paired end data.
I was confused about output of alignReads.pl script. i.e. how to confirm percentage of alignment. That's why asked about pairing. Sorry about that.
Can you please give me some feedback about how to check preservation of assembly and accurary of assembled contigs.

Thanks,
Naresh



Quote:
Originally Posted by GenoMax View Post
Bowtie 2 supports gapped, local, and paired-end alignment modes. That may be a reason to prefer it over bowtie.

Do you have paired-end data? Otherwise how will you get information about pairing?
nareshvasani is offline   Reply With Quote
Old 08-23-2013, 09:23 AM   #19
nareshvasani
Member
 
Location: NC

Join Date: Apr 2013
Posts: 57
Post Hi

Hi everyone,

I run below cmd for alignment.
bin/util/alignReads.pl --seqType fq --single CombineIonXpressRNA_010_NareshPool_Chip1_2_WT2_fastxtrimmer_from_quality_trimmer.fastq --target /media/DATAPART3/Combine_Files/Velvetoptimiser_27_37/trinity_output/Trinity.fasta --aligner bowtie --retain_intermediate_files --output align_bowtie_output1

##output in terminla show as follow:

CMD: bowtie -k 20 -S --sam-nohead -q target single_fa > single_fa.pre.sam
# reads processed: 18001999
# reads with at least one reported alignment: 3881611 (21.56%)
# reads that failed to align: 14120388 (78.44%)
Reported 6874538 alignments to 1 output stream(s)
CMD: touch single_fa.pre.sam.finished
CMD: /root/Trinity/util/../util/SAM_filter_out_unmapped_reads.pl single_fa.pre.sam > single_fa.sam
-filtered 14120388 of 20994926 reads as unaligned = 67.26% unaligned reads
CMD: touch single_fa.sam.finished
CMD: sort -T . -S 2G -k 1,1 -k 3,3 single_fa.sam > single_fa.nameSorted.sam
CMD: touch single_fa.nameSorted.sam.finished
-child alignment process completed.

## Alignment steps succeeded.

CMD: sort -T . -S 2G -k 3,3 -k 4,4n single_dir/single_fa.nameSorted.sam > single_dir/single.coordSorted.sam
CMD: touch single_dir/single.coordSorted.sam.finished
CMD: cp single_dir/single.coordSorted.sam align_bowtie_output1.pre.coordSorted.sam
CMD: touch align_bowtie_output1.pre.coordSorted.sam.finished
CMD: cp align_bowtie_output1.pre.coordSorted.sam align_bowtie_output1.coordSorted.spliceAdjust.sam
CMD: touch align_bowtie_output1.coordSorted.spliceAdjust.sam.finished
CMD: cp /media/DATAPART3/Combine_Files/Velvetoptimiser_27_37/trinity_output/align_bowtie_output1/align_bowtie_output1.coordSorted.spliceAdjust.sam align_bowtie_output1.coordSorted.sam
CMD: touch align_bowtie_output1.coordSorted.sam.finished
CMD: samtools view -bt target.fa.fai -S align_bowtie_output1.coordSorted.sam | samtools sort - align_bowtie_output1.coordSorted
[bam_header_read] EOF marker is absent. The input is probably truncated.

[sam_header_read2] 102085 sequences loaded.
[bam_sort_core] merging from 3 files...
CMD: touch align_bowtie_output1.coordSorted.bam.finished
CMD: sort -T . -S 2G -k 1,1 -k 3,3 align_bowtie_output1.coordSorted.sam > align_bowtie_output1.nameSorted.sam
CMD: touch align_bowtie_output1.nameSorted.sam.finished
CMD: samtools view -bt target.fa.fai align_bowtie_output1.nameSorted.sam > align_bowtie_output1.nameSorted.bam
[sam_header_read2] 102085 sequences loaded.
CMD: samtools index align_bowtie_output1.coordSorted.bam

As per above ouput:
20994926 reads where used for alignment. But in actually I have only 18,001,199 reads.
# 67.26% was unaligned and 32.74% was aligned

So Questions are:
1] Why it used more amout of input reads i.e. is 20 million instead of 18M.
2] 32% is cosidered as good or bad alignment? If bad, how do i improve alignment %
3] What does this mean == EOF marker is absent. The input is probably truncated


Thanks in advance.
Naresh
nareshvasani is offline   Reply With Quote
Old 05-19-2014, 05:05 AM   #20
sabbir_barj
Junior Member
 
Location: Bangladesh

Join Date: Feb 2014
Posts: 4
Default

Hi nareshvasani,

I am curious about your assembly result with trinity. I have a instrument of Ion proton data and I am trying to assembly the transcriptome data. But I did not get good result with MIRA4, Trinity and Velvet assembler. Can you tell me your command line for trinity and general information about your raw read and assembly file? it will help me a lot.Thanks


sabbir




Quote:
Originally Posted by nareshvasani View Post
Hi,

Thanks for your prompt reply.
I really appreciate your suggestion.
Do you think if I put some more input option for butterfly, inchworm, kmer and Chrysalis, it will give me better contig file?

Thanks,
Naresh
sabbir_barj is offline   Reply With Quote
Reply

Tags
bioinformatic analaysis, ion torrent, rnaseq

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 05:44 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2019, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO