A Comparative Study of Malicious URL Detection Model: CNN vs. Logistic Regression and Gated Recurrent Units

Danie, Umejuru; Augustine, Ugbari

Danie, Umejuru and Augustine, Ugbari (2025) A Comparative Study of Malicious URL Detection Model: CNN vs. Logistic Regression and Gated Recurrent Units. International Journal of Innovative Science and Research Technology, 10 (7): 25jul1306. pp. 3599-3611. ISSN 2456-2165

Abstract

Malicious URLs pose a serious threat to World Wide Web users everywhere, and the problems they cause are numerous and worrying. This continues to drive the development of newer models in the cybersecurity space. These notification and detection models are being developed in order to close existing gaps and curb the harm caused by unknowingly clicking on, or navigating further through, a malicious URL. This study aims to develop a novel malicious URL detection and notification model using a CNN, further incorporating a penalty term on the kernel, weight and bias in order to increase the model's detection accuracy, reduce time complexity, and address misclassification and poor prediction accuracy. The CNN with penalty term is evaluated against Logistic Regression (LR) and Gated Recurrent Units (RNN-GRU); it showed increased resilience and improved classification predictions. The evaluation metrics used for the proposed model are accuracy, confusion matrix, precision, recall, F1 score, and AUC-ROC. The study outlines a novel method capable of identifying malicious URLs using features obtained primarily from phishing and legitimate URL addresses. A temporal tokenizer was built for URL text processing; it scans the URL and recognizes characters, symbols and redundant tokens. This made it easier to separate specific features from the URL address and return them as a list, while also identifying directories, keyword arguments, and extensions. The experimental results show that the proposed CNN with penalty term (98.2%) outperformed the LR and RNN-GRU approaches, which achieved prediction accuracies of 89.85% and 91.5%, respectively.
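
The abstract does not give the tokenizer's implementation, so the following is only an illustrative sketch of the kind of URL splitting it describes: separating directories, keyword arguments, and extensions, and returning the remaining tokens as a list. The function name `tokenize_url`, the returned field names, and the choice to split on non-alphanumeric characters are all assumptions made for this example, not details taken from the study.

```python
import re
from urllib.parse import urlsplit, parse_qsl


def tokenize_url(url: str) -> dict:
    """Hypothetical helper: split a URL into coarse feature groups."""
    # urlsplit only recognizes the host when a scheme is present,
    # so prepend one for bare inputs such as "example.com/a".
    parts = urlsplit(url if "://" in url else "http://" + url)

    # Non-empty path segments, e.g. "/a/b/login.php" -> ["a", "b", "login.php"].
    segments = [s for s in parts.path.split("/") if s]

    # Assumption: a dot in the last segment marks a file extension.
    extension = ""
    if segments and "." in segments[-1]:
        extension = segments[-1].rsplit(".", 1)[1].lower()
    directories = segments[:-1] if extension else segments

    # Keyword arguments from the query string as (key, value) pairs.
    arguments = parse_qsl(parts.query, keep_blank_values=True)

    # Flat token list: split on symbols, then drop empty and repeated
    # tokens while preserving first-seen order.
    raw = re.split(r"[^A-Za-z0-9]+", url.lower())
    tokens = list(dict.fromkeys(t for t in raw if t))

    return {
        "host": parts.hostname or "",
        "directories": directories,
        "extension": extension,
        "arguments": arguments,
        "tokens": tokens,
    }


if __name__ == "__main__":
    print(tokenize_url("http://example.com/a/b/login.php?user=1&id=2"))
```

A token list of this kind could then be mapped to integer indices and fed to a sequence model; how the study actually encodes tokens for the CNN is not stated in the abstract.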