EFFICIENT DEEP FEATURES LEARNING FOR VULNERABILITY DETECTION USING CHARACTER N-GRAM EMBEDDING


(Received: 19-Aug.-2020, Revised: 2-Oct.-2020 and 28-Oct.-2020, Accepted: 5-Nov.-2020)
Deep Learning (DL) techniques have been successfully applied to challenging problems in the field of Natural Language Processing (NLP). Because source code and natural-language text share several similarities, text classification techniques such as word embedding have been adopted to build DL-based Automatic Vulnerabilities Prediction (AVP) approaches. Although the results obtained were promising, they fell short of those achieved in NLP. In this paper, we propose an improved DL-based AVP approach based on character n-gram embedding. We evaluate the proposed approach on four types of vulnerabilities using a large open-source C/C++ codebase. The results show that our approach achieves very good performance and outperforms previous approaches.
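
To make the idea of character n-gram embedding concrete, the following is a minimal sketch and not the implementation evaluated in this paper. It assumes a subword scheme in which each code token is wrapped in the boundary markers `<` and `>`, split into character n-grams of length 3 to 6, and represented by the average of its n-gram vectors. The function names, the vector dimension and the untrained pseudo-random vectors are all hypothetical and serve only as an illustration.

```python
import random
import zlib


def char_ngrams(token, n_min=3, n_max=6):
    """Return the character n-grams of a token wrapped in '<' and '>' markers."""
    wrapped = f"<{token}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams


def ngram_vector(gram, dim=8):
    """Toy stand-in for a learned n-gram vector (assumption: untrained values).

    The vector is derived deterministically from a CRC32 of the n-gram so that
    the same n-gram always maps to the same vector.
    """
    rng = random.Random(zlib.crc32(gram.encode("utf-8")))
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


def token_vector(token, dim=8, n_min=3, n_max=6):
    """Represent a token as the average of its character n-gram vectors."""
    grams = char_ngrams(token, n_min, n_max)
    if not grams:
        return [0.0] * dim
    total = [0.0] * dim
    for gram in grams:
        for k, value in enumerate(ngram_vector(gram, dim)):
            total[k] += value
    return [t / len(grams) for t in total]


if __name__ == "__main__":
    # Related identifiers share n-grams, so they share part of their representation.
    shared = set(char_ngrams("strcpy", 3, 3)) & set(char_ngrams("strncpy", 3, 3))
    print(sorted(shared))
    print(token_vector("strncpy"))
```

Because a token's vector is composed from its n-grams, an identifier that never appeared during training still receives a representation, which is one plausible reason such embeddings suit source code with its large, open-ended vocabulary of names.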
