A REVIEW ON THE SIGNIFICANCE OF MACHINE LEARNING FOR DATA ANALYSIS IN BIG DATA

(Received: 2-Aug-2019, Revised: 26-Oct-2019, Accepted: 16-Nov-2019)

Authors Vishnu Vandana Kolisetty, Dharmendra Singh Rajput*

Keywords #Big data #Machine learning #Data analysis #Big data implications #Big data challenges.

Abstract The big data revolution is changing how people work and think by improving insight discovery and decision-making. The technical dilemma of big data science, however, is that no established body of knowledge exists for managing and analyzing large, continuously growing volumes of data and extracting valuable information from them. As data around the world grows rapidly and is increasingly distributed and processed in real time, traditional tools for automated machine learning have become inadequate. Conventional machine learning (ML) approaches have been extended to meet the needs of other applications, but growing data volumes and large knowledge bases pose significant challenges for ML algorithms in big data analysis. This paper aims to clarify the importance of ML in the analysis of big data. It examines the implications and challenges of big data, namely computational complexity, classification imperfection and data heterogeneity. It discusses the capability to mine value from large-scale data for decision-making and predictive analysis through data transformation and knowledge extraction. It also considers the impact of big data on real-time data analysis and the extent to which ML can be used to analyze big data. Finally, it outlines the significance of, and opportunities for, future research in ML using big data.
