We recently sequenced a specific mouse strain. The sequencing data were generated on the 5500 XL platform from a single mate-pair library prepared from the liver of a single male mouse. The sequencing was done on three flowchips, using 6, 6, and 3 lanes respectively, for a total of 15 lanes of data.
I have a few doubts about how terms such as sample and read group ID apply to my experiment. Below is what I have understood so far. Please correct me if I am wrong.
1) The six lanes in a flowchip are independent, which means that beads in different lanes may have the same bead IDs. I noticed this in the output csfasta files: two csfasta files (two lanes), whether from the same flowchip or from different flowchips, can contain the same csfasta header or tag ID (e.g. >96_579_1392) for reads with different sequences.
What I have understood from other resources is that each lane must be assigned a different read group ID in the BAM file. That way, even if we later merge BAM files generated from independent lanes, the read group ID resolves the ambiguity, as shown below:
96_579_1392 115 10 ....... RG:Z:lane1 NH:i:1 CM:i:5 NM:i:0 CQ:Z:>;6?@@@==@@@@;@@@?.@@=--@@8=*8@@8*@?@ CS:Z:T1113323122311310213123332020212001
96_579_1392 131 5 ....... RG:Z:lane2 NH:i:0 CM:i:2 NM:i:0 CQ:Z:>;@@;@@@?.@@=--@@8=*8@@8*@?@@0/@@@5;@@ CS:Z:T1131233320202113323122311310212001
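To make my understanding concrete, here is a minimal Python sketch of the idea. It does not parse real SAM files; the records are simplified tuples (read name, read group, color-space sequence) copied from the two lines above, and the function name index_by is just something I made up for illustration.

```python
# Simplified stand-ins for the two SAM records above:
# (read name, read group ID, color-space sequence from the CS tag)
records = [
    ("96_579_1392", "lane1", "T1113323122311310213123332020212001"),
    ("96_579_1392", "lane2", "T1131233320202113323122311310212001"),
]


def index_by(records, use_read_group):
    """Group sequences by read name alone, or by (read group, read name)."""
    index = {}
    for name, read_group, sequence in records:
        key = (read_group, name) if use_read_group else name
        index.setdefault(key, []).append(sequence)
    return index


by_name = index_by(records, use_read_group=False)
by_rg_and_name = index_by(records, use_read_group=True)

# Keys that map to more than one read are ambiguous after merging.
ambiguous_without_rg = [k for k, v in by_name.items() if len(v) > 1]
ambiguous_with_rg = [k for k, v in by_rg_and_name.items() if len(v) > 1]

print("Ambiguous by name only:", ambiguous_without_rg)
print("Ambiguous by (read group, name):", ambiguous_with_rg)
```

If I have this right, the read name alone collides after a merge, while the pair (read group, read name) stays unique.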
Can you tell me whether my understanding of this concept is correct?
2) My second question concerns the sample (SM) and library (LB) tags in the SAM format. As I understand it, the major organizational units for NGS analysis are lane < library < sample < multiple samples. In other words, multiple libraries (PE, SE, or different insert sizes) can be made for the same sample, and each can be sequenced on one or more lanes. In our case, we have 1 sample (the mouse strain), 1 library (mate pair), and 15 lanes of data. This means that my 15 SAM/BAM files should share the same library and sample ID but have different read group IDs.
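As a rough sketch of what I think the headers should look like, the Python below builds one @RG line per lane for the 6, 6, 3 layout, with a shared SM and LB and a unique ID. The sample name, library name, and the fc<chip>_lane<n> ID scheme are placeholders I invented; only the tag names (ID, SM, LB, PL) and the PL value SOLID come from the SAM format.

```python
# Lanes used on each of the three flowchips.
lanes_per_flowchip = [6, 6, 3]

# Hypothetical labels: one sample and one library shared by every lane.
SAMPLE = "mouse_strain"
LIBRARY = "matepair_lib1"


def build_rg_lines(lanes_per_flowchip, sample, library):
    """Return one tab-separated @RG header line per lane."""
    lines = []
    for chip, n_lanes in enumerate(lanes_per_flowchip, start=1):
        for lane in range(1, n_lanes + 1):
            rg_id = f"fc{chip}_lane{lane}"  # assumed naming scheme, unique per lane
            fields = ["@RG", f"ID:{rg_id}", f"SM:{sample}", f"LB:{library}", "PL:SOLID"]
            lines.append("\t".join(fields))
    return lines


rg_lines = build_rg_lines(lanes_per_flowchip, SAMPLE, LIBRARY)
for line in rg_lines:
    print(line)
```

This gives 15 header lines that differ only in the ID field, which is how I picture the 15 files being labelled before merging.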
Am I correct?
Thanks a lot for your time.