Identifying nonlinear RNAs from single-molecule CAGEscan data (#181)
Nonlinear RNAs, defined here as transcripts with atypical exon order (e.g. trans-splicing or exon scrambling), have been identified in both tumor and normal cells. They have exciting potential as biomarkers or functional molecules, but they are not well characterized by typical transcriptome measurements.
Here, we used single-molecule CAGEscan to characterize nonlinear RNAs in four cervical cancer cell lines (CaSki, HeLa, c33a and SiHa), a normal immortalised keratinocyte cell line (HaCaT) and a cervical epithelial cell line (W12E). This technique attaches a unique molecular identifier (UMI), a random sequence of bases, to the 5’ end of each RNA. The tagged molecules are then PCR-amplified and paired-end sequenced, so that each read pair covers the 5’ end together with a random downstream segment. The UMIs enable us to group read pairs deriving from the same original molecule. To reconstruct the original RNAs, we created a computational workflow with the following steps: grouping reads by UMI and 5’ end (allowing for sequencing errors), assembling the grouped reads (producing one or more contigs per molecule), mapping the contigs to a reference genome while allowing for cis- and trans-splicing, and joining the overlapping contigs for each transcript molecule.
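The first workflow step, grouping by UMI and 5’ end with tolerance for sequencing errors, can be illustrated with a minimal sketch. This is not the workflow used in this study: the function names, the input layout (read id, UMI, 5’ end sequence) and the greedy one-mismatch clustering rule are all assumptions made for illustration.

```python
from collections import Counter


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def group_reads(reads, max_mismatches=1):
    """Group read pairs that plausibly derive from one original molecule.

    reads: list of (read_id, umi, five_prime_seq) tuples (hypothetical layout).
    Keys are clustered greedily, most abundant first, so that a rare key
    within max_mismatches of an abundant one is treated as a sequencing
    error of it (an assumed rule, not the authors' method).
    """
    counts = Counter((umi, five) for _, umi, five in reads)
    centres = []      # representative (umi, five_prime_seq) keys
    assignment = {}   # observed key -> representative key
    for key, _ in counts.most_common():
        tag = key[0] + key[1]
        for centre in centres:
            same_shape = (len(centre[0]) == len(key[0])
                          and len(centre[1]) == len(key[1]))
            if same_shape and hamming(centre[0] + centre[1], tag) <= max_mismatches:
                assignment[key] = centre
                break
        else:
            # No existing centre is close enough: start a new group.
            centres.append(key)
            assignment[key] = key

    groups = {}
    for read_id, umi, five in reads:
        groups.setdefault(assignment[(umi, five)], []).append(read_id)
    return groups


example_reads = [
    ("r1", "ACGTAC", "GGGAT"),
    ("r2", "ACGTAC", "GGGAT"),
    ("r3", "ACGTAA", "GGGAT"),  # one UMI mismatch relative to r1 and r2
    ("r4", "TTTTTT", "GGGAT"),  # unrelated UMI
]
example_groups = group_reads(example_reads)
```

In this toy input, r3 joins the group of r1 and r2 while r4 forms its own group. Each resulting group would then be passed to the assembly step.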
To minimize false discoveries due to library construction and sequencing errors, we focused on instances that were present in at least two of the three replicate samples for each cell line. Preliminary results indicate that the number of nonlinear RNAs was similar in the four cervical cancer cell lines and the normal immortalised keratinocyte cell line. Interestingly, the number was twofold higher in the normal cervical epithelial cell line. Intra-chromosomal nonlinear transcripts (involving exons from the same chromosome) were more common than inter-chromosomal nonlinear transcripts (involving exons from different chromosomes). In HeLa cells, we observed a transcript with atypical exon order in which one part derived from a lincRNA and the other from the adjacent protein-coding gene CDH13. These preliminary results suggest that nonlinear RNAs can be identified from single-molecule CAGEscan data.
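The replicate filter and the intra- versus inter-chromosomal distinction described above can be sketched as follows. The junction representation (chromosome and position on each side) and the requirement of exact coordinate matches between replicates are assumptions for illustration, not details reported here; the coordinates in the example are invented.

```python
from collections import defaultdict


def reproducible_junctions(replicates, min_replicates=2):
    """Keep junctions seen in at least min_replicates replicate samples.

    replicates: one collection of junctions per replicate, where a junction
    is a (chrom_a, pos_a, chrom_b, pos_b) tuple (hypothetical layout).
    """
    seen = defaultdict(int)
    for junctions in replicates:
        for junction in set(junctions):  # count each replicate at most once
            seen[junction] += 1
    return {j for j, n in seen.items() if n >= min_replicates}


def classify(junction):
    """Label a junction by whether both sides lie on the same chromosome."""
    chrom_a, _, chrom_b, _ = junction
    return "intra-chromosomal" if chrom_a == chrom_b else "inter-chromosomal"


# Invented toy coordinates for three replicates of one cell line.
rep1 = [("chr16", 1000, "chr16", 5000), ("chr1", 200, "chr7", 900)]
rep2 = [("chr16", 1000, "chr16", 5000)]
rep3 = [("chr1", 200, "chr7", 900), ("chr2", 10, "chr2", 50)]

kept = reproducible_junctions([rep1, rep2, rep3])
labels = {junction: classify(junction) for junction in kept}
```

In practice some tolerance on junction positions may be preferable to exact matching; that choice is left open here.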