Research Article

A Hybrid Structural and TF-IDF-based Machine Learning Framework for Large-Scale Phishing URL Detection

by Handayani, Ety Sutanty, Esti Setiyaningsih
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 187 - Number 90
Year of Publication: 2026
Authors: Handayani, Ety Sutanty, Esti Setiyaningsih
DOI: 10.5120/ijca2026926600

Handayani, Ety Sutanty, Esti Setiyaningsih. A Hybrid Structural and TF-IDF-based Machine Learning Framework for Large-Scale Phishing URL Detection. International Journal of Computer Applications. 187, 90 (Mar 2026), 52-59. DOI=10.5120/ijca2026926600

@article{ 10.5120/ijca2026926600,
author = { Handayani and Ety Sutanty and Esti Setiyaningsih },
title = { A Hybrid Structural and TF-IDF-based Machine Learning Framework for Large-Scale Phishing URL Detection },
journal = { International Journal of Computer Applications },
issue_date = { Mar 2026 },
volume = { 187 },
number = { 90 },
month = { Mar },
year = { 2026 },
issn = { 0975-8887 },
pages = { 52-59 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume187/number90/a-hybrid-structural-and-tf-idf-based-machine-learning-framework-for-large-scale-phishing-url-detection/ },
doi = { 10.5120/ijca2026926600 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2026-03-20T22:55:35.631789+05:30
%A Handayani
%A Ety Sutanty
%A Esti Setiyaningsih
%T A Hybrid Structural and TF-IDF-based Machine Learning Framework for Large-Scale Phishing URL Detection
%J International Journal of Computer Applications
%@ 0975-8887
%V 187
%N 90
%P 52-59
%D 2026
%I Foundation of Computer Science (FCS), NY, USA
Abstract

Phishing attacks continue to pose significant cybersecurity risks by exploiting deceptive URLs to obtain sensitive user information, which makes accurate and scalable automated detection necessary. This study proposes a machine learning–based approach to phishing URL classification that integrates structural URL feature extraction with Natural Language Processing (NLP) techniques, specifically Term Frequency–Inverse Document Frequency (TF-IDF). The dataset comprises 822,010 labeled URLs (52% legitimate, 48% phishing) and was validated beforehand to confirm that it contains no missing values. Feature engineering followed two complementary strategies: handcrafted structural features (URL length, domain length, number of digits, number of special characters, suspicious keywords, HTTPS usage, and number of subdomains) and a TF-IDF-based textual representation using unigram, bigram, and trigram tokenization. The combined feature set was used to train a Random Forest classifier with optimized hyperparameters, and the model was evaluated with Stratified 5-Fold Cross Validation to preserve the class distribution across training and testing subsets. Performance was assessed using the confusion matrix, precision, recall, and F1-score. The experimental findings indicate that integrating structural and textual features significantly improves classification effectiveness and enables robust, balanced detection of phishing and legitimate URLs, which supports the practical applicability of the proposed method for large-scale real-world deployment.
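
The two feature strategies named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the keyword list, the rule for counting subdomains, and the tokenization rule are all assumptions made for illustration, since the abstract does not specify them.

```python
import re
from urllib.parse import urlparse

# Hypothetical keyword list; the paper's actual list is not given in the abstract.
SUSPICIOUS_KEYWORDS = ("login", "verify", "secure", "account", "update")


def structural_features(url: str) -> dict:
    """Compute simple handcrafted features for one URL string."""
    # urlparse only fills in the hostname when a scheme is present,
    # so a placeholder scheme is prepended for scheme-less input.
    parsed = urlparse(url if "://" in url else "http://" + url)
    host = parsed.hostname or ""
    labels = [part for part in host.split(".") if part]
    lowered = url.lower()
    return {
        "url_length": len(url),
        "domain_length": len(host),
        "num_digits": sum(ch.isdigit() for ch in url),
        # Assumption: any character that is not a letter or digit is "special".
        "num_special_chars": sum(not ch.isalnum() for ch in url),
        "has_suspicious_keyword": int(any(k in lowered for k in SUSPICIOUS_KEYWORDS)),
        "uses_https": int(parsed.scheme == "https"),
        # Naive assumption: host labels beyond the last two are subdomains.
        "num_subdomains": max(len(labels) - 2, 0),
    }


def url_ngrams(url: str, max_n: int = 3) -> list:
    """Split a URL into alphanumeric tokens and emit 1-grams up to max_n-grams."""
    tokens = re.findall(r"[a-z0-9]+", url.lower())
    grams = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


if __name__ == "__main__":
    example = "https://secure-login.example.com/account?id=123"
    print(structural_features(example))
    print(url_ngrams(example))
```

If scikit-learn is used for the remaining steps (an assumption; the abstract does not name a library), `TfidfVectorizer(ngram_range=(1, 3))` would weight the n-grams, and `RandomForestClassifier` together with `StratifiedKFold(n_splits=5)` would cover training and evaluation.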

Index Terms

Computer Science
Information Sciences

Keywords

Phishing URL Detection, Random Forest, TF-IDF, URL Feature Extraction, Stratified K-Fold Cross Validation, Ensemble Learning