ANALYSIS OF MACHINE LEARNING TECHNIQUES FOR DETECTING MALICIOUS PDF FILES USING WEKA

Farida Keter; Andrew Mangle

NABET, NABET 2020 CONFERENCE

Last modified: 2021-02-15

Abstract

The expansion of cloud and connected software and hardware has increased the attack surface of the modern enterprise. This growth in quantity and quality has created more opportunities for system vulnerabilities that lead to exploits. One threat vector attackers use is embedding malware in Portable Document Format (PDF) files. The popularity and flexibility of the format make PDFs an ideal vehicle for targeting unaware users. Malicious PDFs contain executable code that attackers use to steal company information or disrupt normal business operations. Adobe Acrobat and Reader users can view, create, manipulate, print, and manage shared PDF files, which increases the risk. The National Technology Security Coalition report (2020) shows that 68% of data breaches occurred through email and that 5% of successful attacks occurred through PDF files. In 2019, CVE recorded 17,306 software vulnerabilities on the Adobe Acrobat PDF reader. These vulnerabilities may allow unauthorized users to take control of the system, resulting in the execution of malicious programs, unauthorized access, and modification of confidential data. The attacker may also delete data or create user accounts undetected. This study seeks to identify threats and to detect, classify, and create awareness of PDF malware in emails. This paper presents and compares different WEKA machine learning algorithms for malicious PDF detection and proposes the best classifier among the analyzed algorithms.

Keywords

Machine Learning, Waikato Environment for Knowledge Analysis (WEKA), Portable Document Format (PDF)