  • Converting RNA-Seq BAM to FASTQ

    Hi everyone!
    I have to convert some RNA-Seq BAM files into the corresponding paired-end FASTQ files.
    I tried to use "samtools view" and Picard "SamToFastq":
    Code:
    samtools view -h -o sample.sam sample.bam
    Code:
    java -jar SamToFastq.jar INPUT=sample.sam FASTQ=sample_1.fastq SECOND_END_FASTQ=sample_2.fastq
    It resulted in this error:
    Code:
    Error parsing text SAM file. MRNM not specified but flags indicate mate mapped
    and empty fastq files.

    This is the sample.sam
    Code:
    @HD	VN:1.0	SO:unsorted
    @SQ	SN:chr1	LN:249250621
    @SQ	SN:chr10	LN:135534747
    @SQ	SN:chr11	LN:135006516
    @SQ	SN:chr12	LN:133851895
    @SQ	SN:chr13	LN:115169878
    @SQ	SN:chr14	LN:107349540
    @SQ	SN:chr15	LN:102531392
    @SQ	SN:chr16	LN:90354753
    @SQ	SN:chr17	LN:81195210
    @SQ	SN:chr18	LN:78077248
    @SQ	SN:chr19	LN:59128983
    @SQ	SN:chr2	LN:243199373
    @SQ	SN:chr20	LN:63025520
    @SQ	SN:chr21	LN:48129895
    @SQ	SN:chr22	LN:51304566
    @SQ	SN:chr3	LN:198022430
    @SQ	SN:chr4	LN:191154276
    @SQ	SN:chr5	LN:180915260
    @SQ	SN:chr6	LN:171115067
    @SQ	SN:chr7	LN:159138663
    @SQ	SN:chr8	LN:146364022
    @SQ	SN:chr9	LN:141213431
    @SQ	SN:chrM_rCRS	LN:16569
    @SQ	SN:chrX	LN:155270560
    @SQ	SN:chrY	LN:59373566
    @RG	ID:110624_UNC14-SN744_0134_AD0CVTABXX_8_	PL:illumina	PU:barcode	LB:TruSeq	SM:110624_UNC14-SN744_0134_AD0CVTABXX_8_
    UNC14-SN744_134:8:2102:15138:99673/2	147	chr7	99998918	69	42M2357N8M	=	99998683	-2642	CCAAGGCCTTGCTCTGGGGAGCTTTAAATTTTTTCTTAGGGCTGTTTTCT	IIIGHGGGIJIIIGHAHGHH@JIGJJJIHEJJIJJJJHHHHHFFFFFCCC	XF:Z:CTAC,	RG:Z:110624_UNC14-SN744_0134_AD0CVTABXX_8_	IH:i:1	HI:i:1	NM:i:0	XS:A:-
    UNC14-SN744_134:8:2101:1447:161692/2	147	chr7	99998918	69	42M2357N8M	=	99998797	-2528	CCAAGGCCTTGCTCTGGGGAGCTTTAAATTTTTTCTTAGGGCTGTTTTCT	HGHGHDCHGGGIIDGIIHIHEIHGGJIGGHIIIJJJJHHGHHFFFFF@C@	XF:Z:CTAC,	RG:Z:110624_UNC14-SN744_0134_AD0CVTABXX_8_	IH:i:1	HI:i:1	NM:i:0	XS:A:-
    UNC14-SN744_134:8:2207:13624:39322/2	147	chr7	99998920	69	40M2357N10M	=	99998689	-2638	AAGGCCTTGCTCTGGGGAGCTTTAAATTTTTTCTTAGGGCTGTTTTCTCT	@HF<JIGIJIHCCGD9CIGIHGGJIGDJIGJJJJJJJHHGHHEDDDFCB@	XF:Z:CTAC,	RG:Z:110624_UNC14-SN744_0134_AD0CVTABXX_8_	IH:i:1	HI:i:1	NM:i:0	XS:A:-
    UNC14-SN744_134:8:2108:11461:118679/2	163	chr7	99998929	60	31M2357N19M	=	100001809	2930	CTTTGGGGAGCTTTAAATTTTTTCTTAGGGCTGTTTTCTCTCCTTCCTCC	CCCFFFFFFHHHHJJJJJJJJJIJJIJJJJJJIHIJJIJJIJJJJIJDIH	XF:Z:CTAC,	RG:Z:110624_UNC14-SN744_0134_AD0CVTABXX_8_	IH:i:1	HI:i:1	NM:i:1	XS:A:-
    UNC14-SN744_134:8:1107:2904:31086/1	99	chr7	99998929	60	31M2357N19M	=	100001809	2930	CTTTGGGGAGCTTTAAATTTTTTCTTAGGGCTGTTTTCTCTCCTTCCTCC	BCCFFFFFFHHHHJJJJJJJJJJJJJJIJIJJGHHJJJJJJJJJJJJJJJ	XF:Z:CTAC,	RG:Z:110624_UNC14-SN744_0134_AD0CVTABXX_8_	IH:i:1	HI:i:1	NM:i:1	XS:A:-
    UNC14-SN744_134:8:2107:8382:2405/1	83	chr7	99998936	69	24M2357N26M	=	99998696	-2647	GAGCTTTAAATTTTTTCTTAGGGCTGTTTTCTCTCCTTCCTCCTTTTCCA	JJJIIJJJJIGGJJJJJIJJJJJJJIJJJIHHJIJJJHHHFHFFFFDCCB	XF:Z:CTAC,	RG:Z:110624_UNC14-SN744_0134_AD0CVTABXX_8_	IH:i:1	HI:i:1	NM:i:0	XS:A:-
    UNC14-SN744_134:8:2106:3457:77846/1	83	chr7	99999623	69	42M474N8M	=	99998870	-1277	TCCTGCCTCGGCCATCTGCTGTGCCTGCATCACCCCCAAGCCCTCTTGGC	DDDDDFHJJJJJJJJJJJJJJIJJJJJIGGD?JJJJJHHHHHFFFFFCCC	XF:Z:CTAC,	RG:Z:110624_UNC14-SN744_0134_AD0CVTABXX_8_	IH:i:1	HI:i:1	NM:i:0	XS:A:-
    UNC14-SN744_134:8:2107:3652:145199/2	163	chr7	99999624	69	41M474N9M	=	100001398	2216	CCTGCCTCGGCCATCTGCTGTGCCTGCATCACCCCCAAGCCCTCTTGGCT	CCCFFFFFFGHHHJJJIJJJIJJJJJJJJGHIJGIEIGHIJJJIJIJGIG	XF:Z:CTAC,	RG:Z:110624_UNC14-SN744_0134_AD0CVTABXX_8_	IH:i:1	HI:i:1	NM:i:0	XS:A:-
    UNC14-SN744_134:8:1201:13771:91534/2	163	chr7	99999624	69	41M474N9M	=	100001333	1759	CCTGCCTCGGCCATCTGCTGTGCCTGCATCACCCCCAAGCCCTCTTGGCT	BCCFFFFFHHGHHJHJFIJIGHIIHGIIHIIEIHHHIIJJIJIIGCGIIG	XF:Z:CTAC,	RG:Z:110624_UNC14-SN744_0134_AD0CVTABXX_8_	IH:i:1	HI:i:1	NM:i:0	XS:A:-
    UNC14-SN744_134:8:2103:11276:160481/1	83	chr7	99999642	69	23M474N27M	=	99998948	-1218	TGTGCCTGCATCACCCCCAAGCCCTCTTGGCTTGGTTTTTTGGGTCTGTA	DEBFFFFHFHEB;IIIIIJJIJGGEIIJJIJIJJJIFHHHGHFFFFFCCC	XF:Z:CTAC,	RG:Z:110624_UNC14-SN744_0134_AD0CVTABXX_8_	IH:i:1	HI:i:1	NM:i:0	XS:A:-
    I understand that some lines have no MRNM specified, such as:
    Code:
    UNC14-SN744_134:8:2206:10660:87358/2	145	chr7	100001077	60	50M	*	0	0	ATCCGCTTCCCTCGGCCTCCCAAAGTGCTGGGATCACAGGCGTGAGCCAC	9:BBAF@5'HEAIJGIGEHF<HEBA;D@?HHGGBCA@AD<?4;FFFF@BB	RG:Z:110624_UNC14-SN744_0134_AD0CVTABXX_8_	IH:i:1	HI:i:1	NM:i:1
    but I don't understand why I cannot retrieve the other, correctly formed reads in the output fastq files.
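The Picard message comes down to a mismatch between two SAM fields: the FLAG says the read is paired and its mate is mapped (bit 0x1 set, bit 0x8 clear), yet the mate reference column (RNEXT, called MRNM in older specs) is "*". A minimal sketch for listing such records is below. The function name `find_mate_mismatch` and the three demo records are made up for illustration; only the flag values 145 and 99 are taken from the lines above.

```shell
# Sketch: list SAM records whose FLAG says "paired, mate mapped"
# (bit 0x1 set, bit 0x8 clear) while RNEXT (column 7) is "*".
# Reads SAM text on stdin; header lines (starting with "@") are skipped.
find_mate_mismatch() {
    awk -F '\t' '
        /^@/ { next }                           # skip header lines
        {
            flag = $2
            paired        = int(flag / 1) % 2   # bit 0x1: read is paired
            mate_unmapped = int(flag / 8) % 2   # bit 0x8: mate is unmapped
            if (paired && !mate_unmapped && $7 == "*")
                print $1 "\t" flag              # read name and its FLAG
        }'
}

# Demo on three placeholder records (names and sequences are hypothetical).
# printf reuses the 11-field format once per record.
printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
    'readA/2' 145 chr7 100001077 60 50M '*' 0 0 ACGT IIII \
    'readB/1' 99  chr7 99998929  60 50M '=' 100001809 2930 ACGT IIII \
    'readC/1' 73  chr7 100417813 52 50M '*' 0 0 ACGT IIII \
  | find_mate_mismatch
```

On a real file the same function can be fed from `samtools view -h sample.bam`. Only readA/2 is reported in the demo: flag 99 has a proper RNEXT, and flag 73 has bit 0x8 set, so an RNEXT of "*" is consistent there.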

    I also tried to include these two options in Picard SamToFastq
    Code:
    INCLUDE_NON_PF_READS=TRUE VALIDATION_STRINGENCY=SILENT
    and it resulted in all reads unpaired and empty fastq files.

    Then I tried another tool, TopHat2 bam2fastx.
    First, I sorted sample.bam by read name
    Code:
    samtools sort -n sample.bam sample_sn
    resulting in
    Code:
    @HD	VN:1.0	SO:unsorted
    @SQ	SN:chr1	LN:249250621
    @SQ	SN:chr10	LN:135534747
    @SQ	SN:chr11	LN:135006516
    @SQ	SN:chr12	LN:133851895
    @SQ	SN:chr13	LN:115169878
    @SQ	SN:chr14	LN:107349540
    @SQ	SN:chr15	LN:102531392
    @SQ	SN:chr16	LN:90354753
    @SQ	SN:chr17	LN:81195210
    @SQ	SN:chr18	LN:78077248
    @SQ	SN:chr19	LN:59128983
    @SQ	SN:chr2	LN:243199373
    @SQ	SN:chr20	LN:63025520
    @SQ	SN:chr21	LN:48129895
    @SQ	SN:chr22	LN:51304566
    @SQ	SN:chr3	LN:198022430
    @SQ	SN:chr4	LN:191154276
    @SQ	SN:chr5	LN:180915260
    @SQ	SN:chr6	LN:171115067
    @SQ	SN:chr7	LN:159138663
    @SQ	SN:chr8	LN:146364022
    @SQ	SN:chr9	LN:141213431
    @SQ	SN:chrM_rCRS	LN:16569
    @SQ	SN:chrX	LN:155270560
    @SQ	SN:chrY	LN:59373566
    @RG	ID:110624_UNC14-SN744_0134_AD0CVTABXX_8_	PL:illumina	PU:barcode	LB:TruSeq	SM:110624_UNC14-SN744_0134_AD0CVTABXX_8_
    UNC14-SN744_134:8:1101:1284:144798/1	83	chr7	100276741	21	50M	=	100276627	-164	ATTTTTATTATATTTTCAGTTTTTCCATAAAGGAGCCAATTCCAACNCTG	###############################################CC@	RG:Z:110624_UNC14-SN744_0134_AD0CVTABXX_8IH:i:1	HI:i:1	NM:i:1
    UNC14-SN744_134:8:1101:1284:144798/2	163	chr7	100276627	59	50M	=	100276741	164	CAGGAGGCCCTCATCCTTCTGCTGCCCTGGCGTTGGGGCCTCACCCCTCT	BCCFFFFFHHHHHJJJJJJJJJJJJJJJJJJJIJJHHIIHIJJJJJJJJJ	RG:Z:110624_UNC14-SN744_0134_AD0CVTABXX_8IH:i:1	HI:i:1	NM:i:1
    UNC14-SN744_134:8:1101:1295:171825/1	99	chr7	100210452	69	50M	=	100210588	383	GTCCGGGGCCCCCTGGGCGGGGGTCCCGGGGCGCCCCTCCTCCCTTGGGA	@@BFF>DFHHHGHIJJIJJJJDD7@BBDDBBDBBBDDDDDDDDD8@CCD8	RG:Z:110624_UNC14-SN744_0134_AD0CVTABXX_8IH:i:1	HI:i:1	NM:i:0
    UNC14-SN744_134:8:1101:1295:171825/2	147	chr7	100210588	69	32M197N18M	=	100210452	-383	TAACCCCACAGGAACTGCGCTTCGCTTCCGAGTCCTGTGCACAGCACCTG	AHGIIHHFGIIJJIJJJIIFCAJGHGGJJIGGGIIGGAHHHHDDD=F@@B	XF:Z:GTAG,	RG:Z:110624_UNC14-SN744_0134_AD0CVTABXX_8_	IH:i:1	HI:i:1	NM:i:0	XS:A:+
    UNC14-SN744_134:8:1101:1296:110092/1	65	chr7	100417813	52	50M	*	0	0	CGGCACTGGCAGACGGCTGATCCAATGGTGTTAGAGTGGCTAATAGCTGG	@@@DDDDDHHHHFGADG@AGCBH*?:9D*::B>DHGBFHD9?B#######	RG:Z:110624_UNC14-SN744_0134_AD0CVTABXX_8_	IH:i:1	HI:i:1	NM:i:2
    UNC14-SN744_134:8:1101:1296:110092/2	129	chr7	100417873	57	50M	*	0	0	CAGGACCCTTCTCCTGACAGGGGCTTGAAGGTGCCCTGGGCACTGGCAGG	CCCFFFFFHHHHHJJJGHIJJJJJJIA>GDH?BBHHBDGGB>B98B####	RG:Z:110624_UNC14-SN744_0134_AD0CVTABXX_8_	IH:i:1	HI:i:1	NM:i:3
    UNC14-SN744_134:8:1101:1298:165228/1	83	chr7	100463356	69	50M	=	100459519	-3887	ACACGTTGGTCCTAGGTTTCTACGATGACGCTCCACCGCAGGACCATTTC	IGGJJJJIJJJIJIJJJJGJJIIJJJJJJJJIIHEIIHHHHHFFFFF@@B	RG:Z:110624_UNC14-SN744_0134_AD0CVTABXX_8IH:i:1	HI:i:1	NM:i:0
    UNC14-SN744_134:8:1101:1298:165228/2	163	chr7	100459519	69	15M769N35M	=	100463356	3887	CCCTGGGAGACCTCGACTCCCTGCCCTCGGACCCTGTACAGCCGCAGTAT	CCCFFFFFHHHHHJIIJJJJIIJJJIJJJJJJJJJJHIHIJCHJIHIHHE	XF:Z:GTAG,	RG:Z:110624_UNC14-SN744_0134_AD0CVTABXX_8_	IH:i:1	HI:i:1	NM:i:0	XS:A:+
    UNC14-SN744_134:8:1101:1306:60600/1	99	chr7	100417799	69	50M	=	100419893	2144	GGAAGTACCCGACGCGGCACTGGCAGACGGCTGATCCAATGGTGTTAGAG	BCCFFDDFHHHFHJJJJJJJGIJIIJ;F@FA@B=ACH;B;@C);.;;>C>	RG:Z:110624_UNC14-SN744_0134_AD0CVTABXX_8IH:i:1	HI:i:1	NM:i:0
    UNC14-SN744_134:8:1101:1306:60600/2	147	chr7	100419893	69	50M	=	100417799	-2144	CTCGGCACTTGGTGTTCCCCTCAGCTGCCTCGAACCCCGGAGCACAGCTG	<B>HHECHFIIIHCHGIIIGGEIIJIIJJIJIHFJJJHHHHHFDFFFCCC	RG:Z:110624_UNC14-SN744_0134_AD0CVTABXX_8IH:i:1	HI:i:1	NM:i:0
    and then I used TopHat2 bam2fastx
    Code:
    bam2fastx -q -A -o sample.fastq -P -N sample_sn.bam
    resulting in this error
    Code:
    Error: couldn't retrieve both reads for pair UNC14-SN744_134:8:1101:1284:144798/1. Perhaps the input file is not sorted by name?
    (using 'samtools sort -n' might fix this)
    Could someone explain this issue? Do you have any suggestions?
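One thing that may be worth checking here is whether every read name actually has both mates present once the "/1" and "/2" suffixes are stripped. This is an assumption about where the pairing breaks, not something the error message confirms. A minimal sketch follows; the function name `pair_census` is made up for illustration.

```shell
# Sketch: count read names that have both a first-in-pair (bit 0x40)
# and a second-in-pair (bit 0x80) primary record, after stripping a
# trailing "/1" or "/2" from the name. Reads SAM text on stdin.
pair_census() {
    awk -F '\t' '
        /^@/ { next }                                   # skip header lines
        {
            flag = $2
            # ignore secondary (0x100) and supplementary (0x800) alignments
            if (int(flag / 256) % 2 || int(flag / 2048) % 2) next
            name = $1
            sub(/\/[12]$/, "", name)                    # drop /1 or /2 suffix
            if (int(flag / 64) % 2)  first[name]  = 1   # first in pair
            if (int(flag / 128) % 2) second[name] = 1   # second in pair
            seen[name] = 1
        }
        END {
            complete = 0; incomplete = 0
            for (n in seen) {
                if ((n in first) && (n in second)) complete++
                else incomplete++
            }
            print "complete=" complete " incomplete=" incomplete
        }'
}
```

Usage would be along the lines of `samtools view sample_sn.bam | pair_census`. A non-zero "incomplete" count means some names have only one mate in the file, which a strict paired converter may reject.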

    Thanks!

  • #2
    If this is TCGA RNA-seq data from UNC then the following would work. Send me a PM if you have any problems.


    In certain circumstances, a small fraction of the sequences and quality scores in these reads are rearranged such that they cannot perfectly reconstruct the original fastq record. To remedy this error we have provided fastq files to CGHUB.

    OR

    A sam2fastq option is available in UBU version 1.2. It has only been properly tested against MapSplice paired-end alignments.

    Sample usage:

    Code:
    $ java -Xmx512M -jar ubu.jar sam2fastq --in sorted_by_name.bam --fastq1 1.fastq --fastq2 2.fastq --end1 /1 --end2 /2
    The input BAM should be sorted by name, i.e. with "samtools sort -n".

    The standalone jar file ubu-1.2-jar-with-dependencies.jar is available from the UBU downloads page:


    • #3
      Thanks GenoMax! You are right!
      They are TCGA RNA-Seq data, and ubu sam2fastq worked!


      • #4
        UBU likes only paired reads in the BAM files

        In case this helps anyone else: when I was converting TCGA RNA-seq reads to fastq format UBU complained about the presence of unpaired reads. The following was my workaround.
        1. Split paired and unpaired bam records.
          Code:
          samtools  view -b -U unpaired.bam -o paired.bam  \
                  -@ 3  -f 1 \
                  $BAM
        2. Sort paired reads by name.
          Code:
          samtools sort \
                  -n -o namesort.bam  -T namesort_pre -@ 3 -m 3G -O bam \
                  paired.bam
        3. Run UBU sam2fastq on the paired, name-sorted reads, outputting --fastq1 and --fastq2
          Code:
          java -jar -Xmx512m ubu-1.3-SNAPSHOT-jar-with-dependencies.jar sam2fastq \
                  --in namesort.bam \
                  --fastq1 r1.fastq \
                  --fastq2 r2.fastq \
                  --mapsplice
        4. Run UBU sam2fastq on unpaired reads, outputting --fastq1 only into an unpaired fastq file.
          Code:
          java -jar -Xmx512m ubu-1.3-SNAPSHOT-jar-with-dependencies.jar sam2fastq \
                  --in  unpaired.bam  \
                  --fastq1 fu.fastq \
                  --mapsplice
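After a workflow like the one above, a quick sanity check is that the two mate files hold the same number of records. The sketch below assumes plain, unwrapped FASTQ (exactly four lines per record); the function names `fastq_records` and `check_mates` are made up for illustration.

```shell
# Sketch: compare record counts of two FASTQ files.
# Assumes the plain four-lines-per-record FASTQ layout (no wrapped sequences).
fastq_records() {
    lines=$(wc -l < "$1")        # total number of lines in the file
    echo $(( lines / 4 ))        # four lines make one record
}

check_mates() {
    n1=$(fastq_records "$1")
    n2=$(fastq_records "$2")
    if [ "$n1" -eq "$n2" ]; then
        echo "OK: $n1 pairs"
    else
        echo "MISMATCH: $n1 vs $n2 records"
        return 1
    fi
}
```

With the file names from step 3 this would be called as `check_mates r1.fastq r2.fastq`. Equal counts do not prove the mates are in the same order, so this is only a first check.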
