SEQanswers

Old 08-03-2015, 06:54 PM   #1
copacetik
Junior Member
 
Location: Boston

Join Date: Aug 2015
Posts: 3
Paired end reads with different lengths

Hello All,

I am very new to bioinformatics. I am a wet-lab biologist trying to teach myself about RNA-seq.

Using the SRA Toolkit, I looked at an RNA-seq study in the GEO database. I downloaded the data as FASTQ files using "fastq-dump --split-files thedata".

I ended up with thedata_1.fastq and thedata_2.fastq. When I ran these through FastQC, the _1 file had a sequence length of 75, while the _2 file had a sequence length of 25.

Is this a mistake I made? I couldn't find any previous topics that covered this. I assumed using --split-files would show me whether the data were paired-end reads and that, if so, both files would have the same read length.
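
Editor's sketch: the read lengths reported above can also be tallied directly from the FASTQ files, which is a quick way to confirm that every read in a file has the same length. This is only a minimal sketch; it assumes plain 4-line-per-record FASTQ, and the function name read_length_counts and the in-memory example records are made up for illustration.

```python
from collections import Counter
from typing import Iterable


def read_length_counts(fastq_lines: Iterable[str]) -> Counter:
    """Count reads by sequence length in a 4-line-per-record FASTQ stream."""
    counts = Counter()
    for i, line in enumerate(fastq_lines):
        if i % 4 == 1:  # the sequence is the second line of each record
            counts[len(line.rstrip("\r\n"))] += 1
    return counts


# Tiny in-memory stand-ins for thedata_1.fastq and thedata_2.fastq
mate1 = ["@r1/1", "A" * 75, "+", "I" * 75]
mate2 = ["@r1/2", "C" * 25, "+", "I" * 25]
print(read_length_counts(mate1))  # Counter({75: 1})
print(read_length_counts(mate2))  # Counter({25: 1})
```

With real files, an open file handle can be passed in place of the list, for example read_length_counts(open("thedata_1.fastq")).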

If not, is the data still usable? Since I am just trying to teach myself how to work with this data, it would not be a big deal to abandon it, but that would also not exactly fulfill the goal.

Thanks for any help/advice
Old 08-03-2015, 11:43 PM   #2
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480

That would seem unusual, but then again sometimes people upload weird stuff to GEO. What was the accession number? With that info I or someone else could just double check for you.

If that turns out to be the correct data then it should still be usable. You can still map reads like that and, even if not, you could always just ditch read #2 (though 25bp should suffice).
Old 08-04-2015, 01:31 AM   #3
Michael.Ante
Senior Member
 
Location: Vienna

Join Date: Oct 2011
Posts: 123

AFAIK, SOLiD reads came in a 50+25 read pair. I think they extended it later (75+35 or so).
Just have a look at whether the reads are coded in base-space (ACGT) or in color-space (0123).
So if you found ABI SOLiD reads, you can use them, but beware of the color-space coding. TopHat1 could handle those, maybe TopHat2 by now; you need to build the index in color-space and align the color-space reads.
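
Editor's sketch: the base-space versus color-space check suggested above can be done by looking at the characters in a sequence line. The snippet below is a rough sketch only. It assumes color-space reads look like an optional leading primer base followed by digits 0-3 (with "." for a missing call); the function name encoding_of is hypothetical.

```python
import re

# Base-space: only nucleotide letters. Color-space: optional primer base,
# then color calls 0-3 ('.' marks a missing call). Both are assumptions
# about how the reads were dumped, not a format guarantee.
BASE_SPACE = re.compile(r"^[ACGTN]+$", re.IGNORECASE)
COLOR_SPACE = re.compile(r"^[ACGT]?[0-3.]+$", re.IGNORECASE)


def encoding_of(seq: str) -> str:
    """Classify one sequence line as base-space, color-space, or unknown."""
    seq = seq.strip()
    if BASE_SPACE.match(seq):
        return "base-space"
    if COLOR_SPACE.match(seq):
        return "color-space"
    return "unknown"


print(encoding_of("ACGTTGCA"))   # base-space
print(encoding_of("T0120331."))  # color-space
```

Checking the first few sequence lines of each file this way is usually enough to tell which kind of data was downloaded.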
Old 08-04-2015, 03:54 AM   #4
copacetik
Junior Member
 
Location: Boston

Join Date: Aug 2015
Posts: 3

Thank you for the help. The accession number is SRP061544, and I looked at two samples, SRR2125888 and SRR2125889; both gave me that same 75+25 situation.

It gave the sequencer as a HiSeq 2500.
Old 08-04-2015, 04:03 AM   #5
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480

Well, you did everything correctly and got what they uploaded. The only problem is that what they uploaded is questionable. A HiSeq 2500 produces paired-end reads of the same length, so it won't produce this dataset. My guess is that they screwed up creating the SRA file and that they actually have 2x50bp reads rather than 75+25. You'll probably be able to tell if this is the case when you align the data, since if I'm correct the alignment metrics will be very weird (i.e., a low alignment rate with lots of soft-clipping of bases 51-75 of read #1).
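
Editor's sketch: the soft-clipping check described above can be read off the CIGAR strings in the aligner's SAM output. The snippet below is a minimal sketch, not a full SAM parser. The function names are hypothetical; the only SAM facts it relies on are that "S" marks soft-clipped bases and that FLAG bit 0x10 means the read is stored reverse-complemented, so the original 3' end of the read then sits at the start of the CIGAR.

```python
import re
from typing import Tuple

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def soft_clipped(cigar: str) -> Tuple[int, int]:
    """Return (leading, trailing) soft-clipped base counts for a CIGAR string."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    ops = [o for o in ops if o[1] != "H"]  # hard clips sit outside soft clips
    lead = ops[0][0] if ops and ops[0][1] == "S" else 0
    trail = ops[-1][0] if len(ops) > 1 and ops[-1][1] == "S" else 0
    return lead, trail


def clipped_at_3prime(cigar: str, flag: int) -> int:
    """Soft-clipped bases at the original 3' end of the read."""
    lead, trail = soft_clipped(cigar)
    # Reverse-strand alignments (0x10) store the read reverse-complemented,
    # so the read's 3' end is the leading clip.
    return lead if flag & 0x10 else trail


print(clipped_at_3prime("50M25S", 0))   # 25
print(clipped_at_3prime("25S50M", 16))  # 25
```

If most read #1 alignments show roughly 25 clipped bases at the 3' end, that would be consistent with the mis-split hypothesis; it would not prove it.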
Old 08-04-2015, 04:04 AM   #6
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,080

SRA listing does indicate this as an asymmetric submission (75bp+25bp). Perhaps there are some clues in the associated publication (if any).
Old 08-04-2015, 04:14 AM   #7
HESmith
Senior Member
 
Location: Bethesda MD

Join Date: Oct 2009
Posts: 509

Quote:
Originally Posted by dpryan
A HiSeq 2500 produces paired-end reads of the same length, so it won't produce this dataset.
Actually, the instrument can be programmed for paired ends of different read lengths. For example, one user engineered his barcode on the wrong end of his amplicon library, so we ran 50+10 cycles to allow demultiplexing.
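
Editor's sketch: in a 50+10 layout like the one described above, the short second read carries the barcode, so each pair can be assigned to a sample from the first bases of read #2. The snippet below only illustrates that idea. The barcode sequences, sample names, mismatch tolerance, and the function name assign_sample are all made up; the thread does not say how the demultiplexing was actually done.

```python
from typing import Dict, Optional


def assign_sample(read2_seq: str, barcodes: Dict[str, str],
                  max_mismatches: int = 1) -> Optional[str]:
    """Assign a pair to a sample by matching the start of read #2 to a barcode.

    Returns None when no barcode matches or when the best match is tied.
    """
    hits = []
    for barcode, sample in barcodes.items():
        prefix = read2_seq[:len(barcode)]
        if len(prefix) == len(barcode):
            mismatches = sum(a != b for a, b in zip(prefix, barcode))
            if mismatches <= max_mismatches:
                hits.append((mismatches, sample))
    if not hits:
        return None
    hits.sort()
    if len(hits) > 1 and hits[0][0] == hits[1][0]:
        return None  # ambiguous: two barcodes match equally well
    return hits[0][1]


# Hypothetical 6-base barcodes read in a 10-cycle second read
barcodes = {"ACGTAC": "sample_1", "TTGCAA": "sample_2"}
print(assign_sample("ACGTACGGTT", barcodes))  # sample_1
print(assign_sample("GGGGGGGGGG", barcodes))  # None
```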
Old 08-04-2015, 04:17 AM   #8
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480

Quote:
Originally Posted by HESmith
Actually, the instrument can be programmed for paired ends of different read lengths. For example, one user engineered his barcode on the wrong end of his amplicon library, so we ran 50+10 cycles to allow demultiplexing.
Good point. Hopefully whoever uploaded the data can shed some light on things (assuming there's no publication yet).
Old 08-04-2015, 07:11 AM   #9
copacetik
Junior Member
 
Location: Boston

Join Date: Aug 2015
Posts: 3

Alright, so at least it was not an error on my part. There is no associated publication yet, and since I'm only using it for training purposes I probably won't contact the lab. Just a couple of follow-up questions regarding this scenario:
1. Many of the preprocessing tutorials I've read suggest removing reads below a certain length. Should I still attempt that? If so, what would be an appropriate length? Most of them suggest removing anything below ~35, which of course would not work here. The initial QC shows the data has a lot of adapter reads, skewed GC content, etc., so it seems like this would be appropriate.
2. GenoMax, you mentioned the SRA listing pointing this out. Where can I find that? I did not see it anywhere on the site.
3. It seems to me, although maybe I'm misunderstanding this, that since using paired-end reads relies on matching the pairs, it will only be as good (on average) as the shorter read. Is that right? Would it make sense to just use the forward reads and ignore the 25bp reads?
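
Editor's sketch: for the length-filtering question in item 1, one option with asymmetric pairs is to give each mate its own minimum length and to drop a pair whenever either mate falls short, so the two files stay in sync. The snippet below is a sketch under that assumption. The thresholds 35 and 20 are purely illustrative and are not a recommendation from the thread; the function name filter_pairs is hypothetical.

```python
from typing import Iterable, Iterator, Tuple

Record = Tuple[str, str, str]  # (name, sequence, quality)


def filter_pairs(pairs: Iterable[Tuple[Record, Record]],
                 min_len1: int, min_len2: int
                 ) -> Iterator[Tuple[Record, Record]]:
    """Keep a pair only if both mates meet their own minimum length."""
    for mate1, mate2 in pairs:
        if len(mate1[1]) >= min_len1 and len(mate2[1]) >= min_len2:
            yield mate1, mate2


# In-memory example: r2 has a short mate 1, r3 has a short mate 2
pairs = [
    (("r1", "A" * 75, "I" * 75), ("r1", "C" * 25, "I" * 25)),
    (("r2", "A" * 30, "I" * 30), ("r2", "C" * 25, "I" * 25)),
    (("r3", "A" * 75, "I" * 75), ("r3", "C" * 12, "I" * 12)),
]
kept = list(filter_pairs(pairs, min_len1=35, min_len2=20))
print([mate1[0] for mate1, _ in kept])  # ['r1']
```

Dedicated trimming tools typically handle this pair bookkeeping themselves; the point of the sketch is only that a single cutoff of ~35 does not have to be applied to both mates.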

Thanks again for the help.
Old 08-04-2015, 07:40 AM   #10
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,080

Quote:
Originally Posted by copacetik
Alright, so at least it was not an error on my part. There is no associated publication yet, and since I'm only using it for training purposes I probably won't contact the lab. Just a couple of follow-up questions regarding this scenario:
1. Many of the preprocessing tutorials I've read suggest removing reads below a certain length. Should I still attempt that? If so, what would be an appropriate length? Most of them suggest removing anything below ~35, which of course would not work here. The initial QC shows the data has a lot of adapter reads, skewed GC content, etc., so it seems like this would be appropriate.
2. GenoMax, you mentioned the SRA listing pointing this out. Where can I find that? I did not see it anywhere on the site.
3. It seems to me, although maybe I'm misunderstanding this, that since using paired-end reads relies on matching the pairs, it will only be as good (on average) as the shorter read. Is that right? Would it make sense to just use the forward reads and ignore the 25bp reads?

Thanks again for the help.
I am attaching a screenshot of SRA run browser that shows the length of the two reads graphically.

If this dataset is not correctly uploaded (i.e. it is actually 50x50 but uploaded/parsed as a 75x25) you will start finding spurious/no alignments (as Devon mentioned above).
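
Editor's sketch: if the hypothesis above holds, meaning each spot really is 2x50 but was split at position 75, then joining the two dumped mates and cutting again at position 50 would recover the intended reads. The snippet below sketches only that arithmetic (75 + 25 = 100 = 50 + 50). The function name resplit and the split point are assumptions taken from the hypothesis, not something the thread confirms about this dataset.

```python
from typing import Tuple

Mate = Tuple[str, str]  # (sequence, quality)


def resplit(seq1: str, qual1: str, seq2: str, qual2: str,
            split_at: int = 50) -> Tuple[Mate, Mate]:
    """Rejoin two dumped mates into one spot and cut it at split_at."""
    spot_seq = seq1 + seq2
    spot_qual = qual1 + qual2
    return ((spot_seq[:split_at], spot_qual[:split_at]),
            (spot_seq[split_at:], spot_qual[split_at:]))


# A made-up spot dumped as 75 + 25 whose true layout is 50 + 50
dumped1 = "A" * 50 + "C" * 25
dumped2 = "C" * 25
(new1, _), (new2, _) = resplit(dumped1, "I" * 75, dumped2, "I" * 25)
print(len(new1), len(new2))  # 50 50
```

Aligning a small subset both ways (as dumped and re-split) and comparing alignment rates would be one way to test the hypothesis.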

There should be plenty of other datasets to select from that look more normal for training.
Attached image: sra.PNG