High-throughput genomic profiling studies have demonstrated the enormous diversity of mutation profiles in cancers. Although mutations are thought to play driver roles in cancer development and progression, it is not easy to distinguish driver mutations from the huge number of mutations present in genomic data. Large-scale public databases such as The Cancer Genome Atlas (TCGA) have recently been released, providing genomic landscapes of sequence variations in numerous cancer types. Such large-scale data collection inevitably introduces batch effects, which arise from differences in processing at various stages from sample collection to data generation. However, batch effects on sequence variation and their characteristics have not been studied extensively. Here, in part 1, I evaluated batch effects on somatic sequence variation in pan-cancer TCGA data. In part 2, to delineate the driver mutations in liver cancer, I analyzed RNA-Seq data from liver cancer patients. By comparing the mutations and transcriptomes between primary and recurrent tumors, I sought to identify driver mutations that might be responsible for the recurrence of liver cancer.
Part 1. I systematically evaluated batch effects on somatic sequence variations in pan-cancer TCGA data, revealing 999 somatic variants that were batch-biased with statistical significance (P < 0.00001, Fisher’s exact test, false discovery rate ≤ 0.0027). Most of the batch-biased variants were associated with specific sample plates. The batch-biased variants, which had a unique mutational spectrum with frequent indel-type mutations, preferentially occurred at sites prone to sequencing errors, e.g., in long homopolymer runs. Non-indel-type batch-biased variants were frequent at splice sites with the unique consensus motif sequence ‘TTDTTTAGTT’. Furthermore, some batch-biased variants occurred in known cancer genes, potentially causing misinterpretation of mutation profiles.
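As a rough illustration of the kind of test described in part 1, the sketch below applies a two-sided Fisher’s exact test to a 2x2 table of variant carriers versus non-carriers, inside versus outside one sample plate. It is a minimal sketch under stated assumptions, not the pipeline used in this study: the function names, the variant labels, and all counts are hypothetical, and the Benjamini-Hochberg adjustment is shown only as one common way to control the false discovery rate, since the exact procedure is not specified here. Only the P-value cutoff (0.00001) is taken from the text above.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    a, b: carriers / non-carriers of the variant on the plate of interest
    c, d: carriers / non-carriers of the variant on all other plates
    """
    row1, row2, col1 = a + b, c + d, a + c
    total_tables = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of seeing x carriers on the plate.
        return comb(row1, x) * comb(row2, col1 - x) / total_tables

    p_observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of every table at least as unlikely as the observed one.
    p_value = sum(
        p for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-9)
    )
    return min(1.0, p_value)


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted P values (assumed FDR procedure)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical counts for one plate: (a, b, c, d) as defined above.
tables = {
    "variant_A": (12, 78, 3, 907),   # enriched on the plate
    "variant_B": (2, 88, 20, 890),   # spread evenly across plates
}
P_CUTOFF = 1e-5  # cutoff quoted in the abstract

names = list(tables)
pvalues = [fisher_exact_two_sided(*tables[name]) for name in names]
qvalues = benjamini_hochberg(pvalues)
for name, p, q in zip(names, pvalues, qvalues):
    label = "batch-biased" if p < P_CUTOFF else "not flagged"
    print(f"{name}: P={p:.3g}, BH-adjusted={q:.3g}, {label}")
```

In a real analysis the test would be repeated for every variant and batch, so the number of hypotheses entering the multiple-testing correction would be far larger than in this two-variant example.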
Part 2. Recurrence of hepatocellular carcinoma (HCC), even after curative resection, leads to dismal patient outcomes. To delineate the genomic and transcriptional alterations that drive HCC recurrence, I performed RNA-Seq profiling of paired primary and recurrent tumors from two patients with intrahepatic HCC. By comparing the mutational and transcriptomic profiles, I identified somatic mutations acquired during HCC recurrence, including novel mutants of GOLGB1 (E2721V) and SF3B3 (H804Y). Through experimental evaluation using siRNA-mediated knockdown and overexpression constructs, I demonstrated that the mutants of GOLGB1 and SF3B3 can promote cell proliferation, colony formation, migration, and invasion of liver cancer cells. Transcriptome analysis also revealed that the recurrent HCCs reprogram their transcriptomes to acquire aggressive phenotypes. Network analysis revealed CXCL8 (IL-8) and SOX4 as common downstream targets of the mutants. These results indicate that the mutations of GOLGB1 and SF3B3 are potential key drivers for the acquisition of an aggressive phenotype in recurrent HCC.
In summary, based on the two studies above, I suggest that mutation analysis with careful consideration of systematic biases is needed for correct interpretation of large-scale genomic data, and that establishing appropriate study designs and analysis strategies is important for identifying driver mutations from cancer genome data.