HAML-IRL: OVERCOMING THE IMBALANCED RECORD-LINKAGE PROBLEM USING HYBRID ACTIVE MACHINE LEARNING


(Received: 14-Sep.-2024, Revised: 14-Nov.-2024, Accepted: 24-Nov.-2024)
Traditional active machine-learning (AML) methods employed in Record Linkage (RL) or Entity Resolution (ER) tasks often suffer from model instability, slow convergence and poor handling of imbalanced data. Our study introduces a novel hybrid AML approach to RL that addresses the challenges of limited labeled data and imbalanced classes. The approach combines and balances informativeness, which selects record pairs that reduce model uncertainty, with representativeness, which ensures that the chosen pairs reflect the overall patterns of the dataset. Our hybrid approach, called Hybrid Active Machine Learning for Imbalanced Record Linkage (HAML-IRL), achieves an average 12% improvement in F1-score across eleven real-world datasets, including structured, textual and dirty data, when compared to state-of-the-art AML methods. It also requires up to 60-85% fewer labeled samples, depending on the dataset, accelerates model convergence and offers superior stability across iterations, making it a robust and efficient solution for real-world record-linkage tasks.
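
To make the idea of balancing informativeness and representativeness concrete, the following is a minimal illustrative sketch and not the HAML-IRL algorithm itself. It assumes that each candidate record pair is described by a similarity feature vector and that some classifier supplies a match probability for each pair. The function names, the cosine-based representativeness measure and the weight `alpha` are all hypothetical choices made for illustration.

```python
import numpy as np


def informativeness(proba):
    """Uncertainty of a binary match probability: 1 at p = 0.5, 0 at p = 0 or 1."""
    return 1.0 - 2.0 * np.abs(proba - 0.5)


def representativeness(features):
    """Mean cosine similarity of each candidate pair to the whole unlabeled pool."""
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.clip(norms, 1e-12, None)  # avoid division by zero
    return (unit @ unit.T).mean(axis=1)


def hybrid_scores(proba, features, alpha=0.5):
    """Weighted blend of both criteria; alpha is an assumed balancing weight."""
    return alpha * informativeness(proba) + (1.0 - alpha) * representativeness(features)


def select_batch(proba, features, batch_size=2, alpha=0.5):
    """Return indices of the record pairs with the highest hybrid score."""
    scores = hybrid_scores(proba, features, alpha)
    return np.argsort(-scores)[:batch_size]


# Toy pool of four record pairs: match probabilities and similarity vectors.
proba = np.array([0.5, 0.9, 0.1, 0.55])
features = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.1]])
print(select_batch(proba, features, batch_size=2, alpha=0.5))
```

Under these assumptions, setting `alpha` close to 1 favors pairs near the decision boundary, while setting it close to 0 favors pairs that resemble most of the pool; an intermediate value trades the two off.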
