Ahmed, Irfan
Graduate School of Information and Communication Technology, Department of Information and Communication Engineering
The Graduate School, Ajou University
Publication Year
Alternative Abstract
A file type (such as MP3, DOC, or AVI) represents an encoding scheme that must be known in order to interpret the information in a file's contents. Classifying contents into their respective file types is a non-trivial task, and many security applications depend on it to perform their functions effectively. For instance, email attachment filtering requires blocking the types of inbound attachments that may contain malicious content. The current practice of identifying the encoding scheme relies on metadata: file extensions (the file type appended to the file name after a period), magic numbers (file type information kept in the file header), and type information stored by the file system. However, this metadata is susceptible to tampering and corruption; for instance, a file extension can easily be spoofed and magic numbers can be obfuscated. A more reliable approach may be to analyze the content itself for type identification. Because content-based schemes rely on statistical and data mining techniques, however, they are less accurate and slower than metadata-based techniques. In this dissertation, we propose several techniques to improve the accuracy and detection speed of content-based type identification. First, we propose a divide-and-conquer approach to improve classification accuracy. We decompose the identification procedure into two steps. In the first step, files that are similar in terms of byte pattern frequencies are grouped into several clusters. In the second step, any cluster that contains more than one file type is fed to a neural network for finer classification. The experiments showed that clustering followed by classification leads to higher accuracy. Second, we propose a feature selection technique to reduce the number of features, since current schemes use all byte patterns as features.
The feature selection technique uses a subset of frequently occurring byte patterns as features, on the assumption that they are sufficient to build a representative model of a file type and to classify files. To evaluate its effectiveness, we applied it to the six most popular classification algorithms (neural network, linear discriminant analysis, K-means, K-nearest neighbor, decision tree, and support vector machine). On average, the K-nearest neighbor method achieved the best accuracy, 90%, using only 40% of the byte patterns, which reduces computation time by 55%. Furthermore, we propose using the cosine distance as the similarity metric for comparing file content, rather than the Mahalanobis distance that is popular in related approaches. We show that cosine similarity, unlike the Mahalanobis distance, retains classification accuracy on a small number of highly frequent byte patterns, which leads to a smaller model and faster detection. Third, we propose a content sampling technique, since current schemes process a whole file to obtain its byte frequency distribution. The technique uses only a small portion of a file to obtain this distribution. To evaluate its effectiveness, we sample in two ways: 1) initial contiguous bytes, where the frequency distribution is generated from the first few consecutive bytes of a file, and 2) a few small blocks taken from random locations in a file. The scheme is effective for large files, where a relatively small sample can generate a representative byte frequency distribution. For instance, it reduces the sample size for MP3 files from 5 MB to 400 KB without compromising accuracy. This is a 15-fold size reduction. Furthermore, since the content sampling technique cannot classify small contents (such as packet payloads), we propose a signature-free content classification scheme that identifies executable content in incoming packets.
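The feature selection and cosine similarity ideas in the paragraph above can be illustrated with a minimal Python sketch. It is not the dissertation's implementation. It assumes that a "byte pattern" is a single byte value (a 1-gram), that a file-type model is simply a byte frequency distribution, and that features are chosen by ranking the model's frequencies. The function names and the default `fraction=0.4` (chosen to mirror the 40% figure quoted above) are hypothetical.

```python
import math
from collections import Counter


def byte_frequencies(data: bytes) -> list[float]:
    """Return the normalized frequency of each of the 256 byte values."""
    counts = Counter(data)
    total = len(data) or 1  # avoid division by zero on empty input
    return [counts.get(b, 0) / total for b in range(256)]


def top_byte_values(model: list[float], fraction: float = 0.4) -> list[int]:
    """Select the most frequent byte values of a file-type model.

    Assumption: features are ranked by their frequency in the model.
    """
    k = max(1, round(256 * fraction))
    ranked = sorted(range(256), key=lambda b: model[b], reverse=True)
    return sorted(ranked[:k])


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_to_model(data: bytes, model: list[float],
                        selected: list[int]) -> float:
    """Compare content with a model using only the selected byte values."""
    sample = byte_frequencies(data)
    return cosine_similarity([sample[b] for b in selected],
                             [model[b] for b in selected])
```

In this sketch, a file would be assigned to the type whose model gives the highest similarity. Restricting both vectors to the selected byte values is what keeps the model small.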
For accurate detection, the proposed scheme analyzes the packet payload in two steps. It first analyzes the payload to determine whether it contains multimedia-type data (binary content other than executables, such as avi, wmv, and jpg). If not, the second step classifies the payload as either text-type (txt, jsp, asp, etc.) or executable-type data. We propose a two-step scheme because we found that the characteristics of multimedia-type data differ enough from those of text-type and executable-type data, whereas the difference between text and executable (code) data is smaller and requires a different technique to distinguish the two. To evaluate the proposed scheme, we transfer files of different types (executable, multimedia, and text) through an FTP server and classify them into their respective types. The scheme yields a false positive rate of 2.53% and a false negative rate of 4.69%.
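The two content sampling strategies described in the abstract (initial contiguous bytes, and a few small blocks at random locations) can be sketched in Python as follows. This is an illustration under stated assumptions, not the dissertation's code: the function names, the block size, the number of blocks, and the choice to let random blocks overlap are all hypothetical.

```python
import random
from collections import Counter
from typing import Optional


def byte_frequencies(data: bytes) -> list[float]:
    """Return the normalized frequency of each of the 256 byte values."""
    counts = Counter(data)
    total = len(data) or 1  # avoid division by zero on empty input
    return [counts.get(b, 0) / total for b in range(256)]


def sample_initial(data: bytes, sample_size: int) -> bytes:
    """Strategy 1: take the first `sample_size` contiguous bytes."""
    return data[:sample_size]


def sample_random_blocks(data: bytes, block_size: int, num_blocks: int,
                         seed: Optional[int] = None) -> bytes:
    """Strategy 2: concatenate a few small blocks from random offsets.

    Assumption: blocks are drawn independently, so they may overlap.
    Content too small to sample is returned whole.
    """
    if len(data) <= block_size * num_blocks:
        return data
    rng = random.Random(seed)
    last_start = len(data) - block_size
    starts = sorted(rng.randrange(last_start + 1) for _ in range(num_blocks))
    return b"".join(data[s:s + block_size] for s in starts)
```

Either sample can then be passed to `byte_frequencies` in place of the whole file, so the cost of building the distribution depends on the sample size rather than the file size.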

Appears in Collections:
Special Graduate Schools > Graduate School of Information and Communication Technology > Department of Information and Communication > 3. Theses(Master)
