Development of cancer pathology data model and natural language processing based data conversion methodology

Alternative Title
암 병리 데이터 모델 및 자연어처리기반 데이터 변환 방법론 개발
Author(s)
신다혜
Alternative Author(s)
Da Hye Shin
Advisor
박래웅
Department
Graduate School, Department of Medicine
Publisher
The Graduate School, Ajou University
Publication Year
2020-08
Language
eng
Keyword
Cancer pathology report; Common data model; Convolutional neural network; Data modeling; Distributed research network; Named entity recognition; Natural language processing
Alternative Abstract
According to cancer statistics, the total number of cancer patients in Korea as of 2017 was 232,255, which is 1,019 more than in 2016. Various tests are conducted to diagnose and treat cancer; consequently, a large amount of unstructured clinical data, including text, images, and videos, is produced. The cancer pathology report is an extremely important source of guidance for cancer diagnosis and treatment because it contains information on cancer type, characteristics, and stage. However, the report is usually written as free text, which must be converted into structured form before machines can process it. Therefore, this study aims to develop a cancer pathology data model for creating structured data and, based on that data model, a cancer pathology data conversion methodology that uses a natural language processing (NLP) model for information extraction. For this purpose, we collected overseas and domestic pathology reports and documents related to breast, thyroid, colorectal, and gastric cancers, which have a high incidence in Korea. We then analyzed the data system and established a dictionary of the vocabulary used in these materials. Accordingly, we developed a cancer pathology data model comprising four tables: specimen basic information, specimen common observation information, specimen-specific observation information, and immunohistochemical test information. To extract information from the unstructured text of cancer pathology reports, two types of models were developed. The first is a named entity recognition (NER) model that applies the convolutional neural network (CNN) algorithm using spaCy, an NLP library. The second is a hybrid model created by adding a rule-based algorithm to the first model.
A total of 1,200 reports were randomly selected from Ajou University Hospital’s pathology reports for the four types of cancer, and entity annotation was then performed. The model was trained on 960 training data sets, and its performance was evaluated on 240 test data sets. To assess the model’s generalizability, 200 cancer pathology reports from external institution B and 30 sets of online data were prepared and used. The data conversion methodology was developed in two steps: first, the cancer pathology data model was populated by applying the final selected NER model to the four types of cancer pathology reports from Ajou University Hospital; then, an extract, transform, and load (ETL) process was designed to convert the data into the OMOP Common Data Model (CDM). To verify the methodology, the cancer pathology data model and the common data model were built, and 400 randomly extracted sets were manually reviewed for accuracy by one researcher. The comparison of the two NER models shows that the single CNN-based model has an F1-score of 0.965, which is 0.111 higher than that of the hybrid model. In the evaluation on external data, F1-scores of 0.711 and 0.854 were obtained, demonstrating possible application to external institutions. Using the cancer pathology data model built with the newly developed data conversion methodology, it was confirmed that the data can be generalized regardless of cancer type. Moreover, the manual review of the data demonstrated an accuracy rate of 96.91%. The cancer pathology data model and data conversion methodology proposed in this study are highly effective for extracting, storing, and utilizing large amounts of varied data from different types of cancer pathology reports. Together with existing clinical data, this is expected to contribute to promoting precision research for cancer treatment.
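The evaluation described in the abstract (a 960/240 train/test split and an F1-score over recognized entities) can be illustrated with a minimal sketch. It assumes exact-match scoring of (start, end, label) entity spans, which the abstract does not specify; the function names, the label names ("ORGAN", "HISTOLOGY", "STAGE", "SIZE"), and the toy annotations are hypothetical and are not taken from the thesis.

```python
import random
from typing import Iterable, List, Set, Tuple

# An entity is a (start_char, end_char, label) span; labels here are hypothetical.
Entity = Tuple[int, int, str]


def split_reports(reports: Iterable, n_test: int, seed: int = 0):
    """Randomly hold out n_test reports; the rest are used for training."""
    shuffled = list(reports)
    random.Random(seed).shuffle(shuffled)
    return shuffled[n_test:], shuffled[:n_test]


def entity_f1(gold: List[Set[Entity]], pred: List[Set[Entity]]):
    """Micro-averaged precision, recall, and F1 with exact span-and-label matching."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        tp += len(g & p)   # predicted spans that match a gold span exactly
        fp += len(p - g)   # predicted spans with no gold counterpart
        fn += len(g - p)   # gold spans the model missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# 1,200 report identifiers split into 960 training and 240 test items.
train_ids, test_ids = split_reports(range(1200), n_test=240)

# Toy gold and predicted annotations for two reports (illustrative only).
gold = [{(0, 6, "ORGAN"), (10, 19, "HISTOLOGY")}, {(3, 8, "STAGE")}]
pred = [{(0, 6, "ORGAN")}, {(3, 8, "STAGE"), (12, 15, "SIZE")}]
precision, recall, f1 = entity_f1(gold, pred)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Exact matching is the strictest common convention; a partial-overlap criterion would give higher scores on the same predictions, so reported figures are only comparable when the matching rule is the same.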
URI
https://dspace.ajou.ac.kr/handle/2018.oak/19720
Appears in Collections:
Graduate School of Ajou University > Department of Medicine > 4. Theses(Ph.D)

Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.
