SEQanswers

10-27-2015, 04:20 AM   #1
Sajna
Member
Location: INDIA
Join Date: Oct 2015
Posts: 14

TCGA : RNASeq version 1 pipeline

The following documentation describes the pipeline in good detail:
https://confluence.broadinstitute.or...=1363806109000

However, on visiting the project website at http://seqware.sf.net, I could not find any data under the Files menu. I also browsed the UNC website using the UNC IDs, with no results.

Could you please guide me? Where can I obtain the RNASeq version 1 pipeline?

10-27-2015, 04:33 AM   #2
GenoMax
Senior Member
Location: East Coast USA
Join Date: Feb 2008
Posts: 6,550

Are you looking for the TCGA data from UNC? That is available from the TCGA data portal: https://tcga-data.nci.nih.gov/tcga/

10-27-2015, 09:16 PM   #3
Sajna

I have a few Sequence Read Archive (SRA) studies. I want to perform gene quantification for these studies using the TCGA RNASeq version 1 pipeline.

I need the script that runs the entire RNASeq version 1 pipeline on my SRA files.

As I mentioned earlier, the following link describes how to obtain the pipeline. However, on visiting the GitHub page, there is no data (RNASeq version 1 pipeline) listed! I do not want to use RNASeq version 2 right now; I want to reuse the TCGA RNASeq V1 pipeline!

https://confluence.broadinstitute.or...=1363806109000

Please advise.

Last edited by Sajna; 10-27-2015 at 09:22 PM.

10-27-2015, 09:21 PM   #4
Sajna

And as I mentioned earlier, the following link describes how to obtain the pipeline. However, on visiting the GitHub page, there is no data (RNASeq version 1 pipeline) listed! I do not want to use RNASeq version 2 right now; I want to reuse the TCGA RNASeq V1 pipeline!

https://confluence.broadinstitute.or...=1363806109000

10-28-2015, 04:08 AM   #5
GenoMax

At this point, trying to run SeqWare and version 1 of the TCGA RNAseq pipeline would at best be an exercise in futility. You may be better off using new versions of bwa and MapSplice.

That said, this file has additional details about the software used in v.1 and v.2: https://tcga-data.nci.nih.gov/tcgafi...ESCRIPTION.txt

All the data that was submitted under TCGA was reprocessed using v.2 of the pipeline, and that is what should be considered current, based on communication from the UNC TCGA folks.

10-28-2015, 05:08 AM   #6
Sajna

Thanks, GenoMax. I will look into the details of version 2 and process the data using BWA, or I will consider MapSplice for quantification.

Last edited by Sajna; 10-29-2015 at 10:45 AM.

10-29-2015, 04:59 AM   #7
Sajna
TCGA Mapsplice RNASeqV2 pipeline : Error: check reads format failed

Hi All,

I am running MapSplice (v2.0). My FASTQ files are in Sanger/Illumina 1.9 format. I replaced the blank spaces in the read headers with ':' and removed 'length=', so the head of the file now looks like this:

head ERR519523_1.fastq
@ERR519523.1:1:100
CAAACCAATGGCTCCACCCGTACCTGGCTCTGCCTCTACCCACCGACATTGCTCCTGTGGTCCTACTCAGAAGTAGTTCAGCACTCAGGACAGCTTCCAC
+ERR519523.1:1:100
CCCFFFFFHHHHHJJIJJJJGHIJGIIJIIJIIIGIGIHIIJJJJGHJJJIJFJIHHHHHDFFFFECCCEEDD>[email protected]>
@ERR519523.2:2:65
TGCATAGAGATAGAAACAGAAAATAGAATGGTGGTTGCAGGGTCTGGAAAGAGAGGAGGAGCGCA
+ERR519523.2:2:65
@@@[email protected]@@C<EEEHCFHH)?FDC<DF9BDHG9B9B;D=BF=FG;C(:5'
@ERR519523.3:3:100
GGACGCATAAGAGTTACAGGCTCTATACACAGGGACTTTCCTTCCTGGAAACCCGGTAGGAAATCCCATTATGGCTGCCTGTTTGCCAAACTATTCCCTT


When I run the mapsplice.py script with the command below, I get this error:

"pairend read name not end with /1 or /2 the 1th read in /ERR519523/ERR519523_1.fastq
@ERR519523.1:1:100
[FAILED]
Error: check reads format failed"

COMMAND :
python /opt/MapSplice_multi_threads_2.0.1.9/mapsplice.py -c /hg19_chromosomes/ -x /ebwt/humanchridx_M_rCRS -1 /ERR519523_1.fastq -2 ERR519523_2.fastq
[Thu Oct 29 17:31:33 2015] Preparing output location mapsplice_out/

[Thu Oct 29 17:31:33 2015] Beginning Mapsplice run (v2.0)
-----------------------------------------------
bin directory: [/opt/MapSplice_multi_threads_2.0.1.9/bin/]
[Thu Oct 29 17:31:33 2015] Checking for files or directory
[Thu Oct 29 17:31:33 2015] Checking for files or directory
[Thu Oct 29 17:31:33 2015] Checking for files or directory
[Thu Oct 29 17:31:33 2015] Checking for Bowtie index files
[Thu Oct 29 17:31:33 2015] reads all chromo sizes
[Thu Oct 29 17:31:42 2015] check reads format
ERR519523_1.fastq is fastq format
pairend read name not end with /1 or /2
the 1th read in /ERR519523/ERR519523_1.fastq
@ERR519523.1:1:100
[FAILED]
Error: check reads format failed

Please help!!

Last edited by Sajna; 10-29-2015 at 10:45 AM.

10-29-2015, 05:04 AM   #8
GenoMax

When you extracted the reads from the SRA file, did you use the -F/--origfmt switch to preserve the Illumina read ID?

10-29-2015, 05:10 AM   #9
Sajna

I converted the .sra files to FASTQ format using the latest sratoolkit version, with the command fastq-dump srafilenames.sra --split-3, since the data was paired-end.

No other options were specified.

10-29-2015, 05:15 AM   #10
Sajna

When I converted the SRA file to FASTQ using fastq-dump, it looked like this:

@ERR519523.1 1 length=100
CAAACCAATGGCTCCACCCGTACCTGGCTCTGCCTCTACCCACCGACATTGCTCCTGTGGTCCTACTCAGAAGTAGTTCAGCACTCAGGACAGCTTCCAC
+ERR519523.1 1 length=100
CCCFFFFFHHHHHJJIJJJJGHIJGIIJIIJIIIGIGIHIIJJJJGHJJJIJFJIHHHHHDFFFFECCCEEDD>[email protected]>
@ERR519523.2 2 length=65
TGCATAGAGATAGAAACAGAAAATAGAATGGTGGTTGCAGGGTCTGGAAAGAGAGGAGGAGCGCA
+ERR519523.2 2 length=65
@@@[email protected]@@C<EEEHCFHH)?FDC<DF9BDHG9B9B;D=BF=FG;C(:5'
@ERR519523.3 3 length=100
GGACGCATAAGAGTTACAGGCTCTATACACAGGGACTTTCCTTCCTGGAAACCCGGTAGGAAATCCCATTATGGCTGCCTGTTTGCCAAACTATTCCCTT

Then I replaced the blank spaces with ':' and removed 'length=', and passed the FASTQ files to MapSplice, but I got the following error:

"pairend read name not end with /1 or /2 the 1th read in /ERR519523/ERR519523_1.fastq
@ERR519523.1:1:100
[FAILED]
Error: check reads format failed"

Please help...

10-29-2015, 05:24 AM   #11
GenoMax

You should have used --split-files. Re-extract your data from the SRA file.

Edit: Let me look at that SRA#.

Edit 2: It appears that the submitters modified the original Illumina FASTQ read headers in this submission (or the headers were never submitted to SRA, as the -F option only generates a number). After you split the files with just "--split-files", you will have to add /1 and /2 at the end of the FASTQ headers, since MapSplice expects them to be present.
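
The header fix described above could be sketched in Python roughly as follows. This is only an illustration and not part of MapSplice or the TCGA pipeline: the function name add_mate_suffix is made up here, and the sketch assumes the four-line FASTQ records that fastq-dump writes (header, sequence, '+' line, qualities), with the read ID being the first whitespace-separated token of the header.

```python
import io


def add_mate_suffix(src, dst, mate):
    """Copy 4-line FASTQ records from src to dst so every header ends in /1 or /2.

    Hypothetical helper for illustration; `mate` is 1 for the first file
    of a pair and 2 for the second.
    """
    suffix = "/%d" % mate
    for index, line in enumerate(src):
        line = line.rstrip("\n")
        position = index % 4  # 0 header, 1 sequence, 2 '+' line, 3 qualities
        if position == 0:
            # Keep only the read ID, dropping fields such as " 1 length=100".
            read_id = line.split()[0]
            if not read_id.endswith(suffix):
                read_id += suffix
            dst.write(read_id + "\n")
        elif position == 2:
            # The '+' line may legally omit the repeated read name.
            dst.write("+\n")
        else:
            dst.write(line + "\n")


# Small in-memory demonstration with a shortened, made-up record.
example = io.StringIO(
    "@ERR519523.1 1 length=8\nCAAACCAA\n+ERR519523.1 1 length=8\nCCCFFFFF\n"
)
converted = io.StringIO()
add_mate_suffix(example, converted, 1)
print(converted.getvalue(), end="")
```

Headers are identified by their position in the record rather than by a leading '@', because a quality line can itself start with '@'. For real data the same function would be called once per file with ordinary file handles, using mate=1 for the _1 file and mate=2 for the _2 file.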

Last edited by GenoMax; 10-29-2015 at 05:39 AM.

10-29-2015, 10:43 AM   #12
Sajna

Otherwise, I tried the tool that the MapSplice pipeline uses (UNC ubu.jar) to prepare FASTQ files for MapSplice. The command to format the FASTQ is as follows:

java -Xmx512M -jar ubu.jar fastq-format --phred33to64 --strip --suffix /1 --in raw_1.fastq --out working/prep_1.fastq > working/mapsplice_prep1.log

I tried that; however, I get the error: Fastq format not recognizable...

I will try out what you suggested tomorrow morning at work, and hopefully that will work. Let's see.

Last edited by Sajna; 10-29-2015 at 10:55 AM.

10-29-2015, 10:45 AM   #13
GenoMax

That is correct.

11-02-2015, 12:20 AM   #14
Sajna

GenoMax, it worked!!!! Many thanks and good day to you.

11-23-2015, 10:24 PM   #15
Sajna
TCGA RSEM_ref files

I have used MapSplice to align all the SRA FASTQ samples successfully, and used the bedtools coverage function to retrieve the raw read counts. The next task was to combine level 3 data from TCGA with the MapSplice-aligned SRA samples for differential expression analysis. Having done that, I noticed that the number of DE genes is very high. Looking back, I understood that the "raw counts" reported by TCGA are expected counts from the RSEM software. Although the RSEM paper mentions that edgeR and DESeq can process RSEM counts, it appears that edgeR requires integers as input. So I have now decided to run RSEM on the SRA SAM/BAM files.
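
For what it's worth, when a downstream tool does insist on integer input, one common workaround is to round the RSEM expected counts to the nearest integer before building the count matrix. A minimal Python sketch of that idea follows. The function name and the column names gene_id and raw_count are assumptions made for illustration and should be checked against the header of the actual file.

```python
import csv
import io


def round_expected_counts(handle, id_column="gene_id", count_column="raw_count"):
    """Read a tab-delimited table and return {gene: integer count}.

    Column names are assumptions for illustration; check the real header.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    counts = {}
    for row in reader:
        expected = float(row[count_column])
        # Round half up; expected counts are non-negative, so int() truncation is safe.
        counts[row[id_column]] = int(expected + 0.5)
    return counts


# Made-up values, only to show the shape of the input and output.
table = io.StringIO("gene_id\traw_count\ngeneA\t10.37\ngeneB\t1532.81\ngeneC\t0.00\n")
print(round_expected_counts(table))
```

Rounding only changes the number type. It does not make counts produced by two different quantification methods comparable, so it is a stopgap rather than a substitute for processing all samples the same way.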

The TCGA mRNA_Seq pipeline detailed at the following URL requires the hg19_M_rCRS_ref.transcripts.fa file for running rsem-calculate-expression and for the "translate to transcriptome coords" step.
https://webshareex.bioinf.unc.edu/pu...eq_summary.pdf


However, the file that should be available from the following URL is missing:

https://webshare.bioinf.unc.edu/publ...transcripts.fa

I also require the reference mapping file to run RSEM: https://webshare.bioinf.unc.edu/publ...ownToLocus.txt

The file is truncated on GitHub as well.

Where can I access the files?

Last edited by Sajna; 11-23-2015 at 10:27 PM.

12-03-2015, 07:22 AM   #16
GenoMax

Files referred to in the last post by @Sajna are once again available at the same link.

12-05-2015, 02:13 AM   #17
Sajna

Following the RNASeq version 2 pipeline in the link below, I have been able to obtain outputs up to step 11, i.e. the transcriptome_alignments_filtered.bam output. The next step is to run RSEM.

https://cghub.ucsc.edu/docs/tcga/UNC...eq_summary.pdf

I ran RSEM using the following command: /opt/rsem-1.1.13/rsem-calculate-expression --paired-end --bam --estimate-rspd -p 8 transcriptome_alignments_filtered.bam /home/group_sh/TCGA+SRA_datamergerquantification_RNASeqV2/RSEM_ref_from_TCGA_webshare/hg19_M_rCRS_ref ERR519523.rsem > rsem.log_2 > rsem.log

I get this error in the log file:

rsem-parse-alignments failed! Please check if you provide correct parameters/options for the pipeline!

What is wrong? I do not understand. I am running exactly the same command!!!

Kindly help.

12-05-2015, 12:12 PM   #18
GenoMax

Are you using a different version of RSEM than the one in the PDF?

12-06-2015, 04:42 AM   #19
Sajna

No. I am using the same version mentioned in the UNC mRNASeq version2 document.

12-06-2015, 12:29 PM   #20
GenoMax

Have you done the first part of step 4 in this document: https://tcga-data.nci.nih.gov/tcgafi...ESCRIPTION.txt