CHARACTERIZATION OF SHARED-MEMORY MULTI-CORE APPLICATIONS


(Received: 2015-11-29, Revised: 2016-01-21, Accepted: 2016-02-01)
Multicore processor architectures have gained popularity in recent years. However, many available applications cannot take full advantage of these architectures. Researchers have therefore developed characterization techniques to help programmers understand the behavior of these applications on multicore platforms and tune them for better efficiency. This paper proposes an on-the-fly, configuration-independent approach for characterizing the inherent characteristics of multicore applications. The approach is fast because it does not depend on the details of any specific machine configuration and does not require repeating the characterization for every target configuration. It only tracks memory accesses and the cores that perform them, by piping memory traces on-the-fly to the analysis tool. We applied this approach to characterize eight applications drawn from the SPLASH-2 and PARSEC benchmark suites. This paper presents the inherent characteristics of these applications, including memory access instructions, communication patterns, sharing degree, invalidation degree, communication slack and communication locality. The results show that two of the studied applications, Cholesky and Fluidanimate, have high parallelization overhead. The results also indicate that the studied SPLASH-2 applications have higher communication rates than the studied PARSEC applications and that these rates generally increase with the number of threads. Most sharing and invalidation occurs at small degrees. However, two of the SPLASH-2 applications have a significant fraction of communication with high sharing degrees involving four or more threads. Most of the applications have some uniform communication component, and the initial thread is generally involved in more communication than the other threads.
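To make the idea of tracking memory accesses and the threads that perform them more concrete, the following is a minimal Python sketch, not the tool described in this paper. It assumes a trace of (thread_id, op, address) records and a 64-byte block granularity, and it uses simplified working definitions that are assumptions made for illustration only: a communication event is the first read of a block by a thread other than its last writer, the sharing degree is the number of distinct threads that consume a written value before it is overwritten, and the invalidation degree is the number of other threads holding a copy when a write occurs. The function name `characterize` and the constant `LINE_SIZE` are hypothetical.

```python
from collections import Counter, defaultdict

LINE_SIZE = 64  # assumed block granularity in bytes (hypothetical choice)


def characterize(trace, line_size=LINE_SIZE):
    """Summarize a trace of (thread_id, op, address) records, op in {'R', 'W'}.

    Uses simplified, assumed definitions of communication, sharing degree
    and invalidation degree; no cache size or coherence protocol is modeled.
    """
    last_writer = {}             # block -> thread that last wrote it
    readers = defaultdict(set)   # block -> threads that read it since that write
    comm = Counter()             # (producer, consumer) -> communication events
    sharing = Counter()          # sharing degree -> number of occurrences
    invalidation = Counter()     # invalidation degree -> number of occurrences

    for tid, op, addr in trace:
        block = addr // line_size
        writer = last_writer.get(block)
        if op == 'R':
            # First read of another thread's value counts as communication.
            if writer is not None and writer != tid and tid not in readers[block]:
                comm[(writer, tid)] += 1
            readers[block].add(tid)
        elif op == 'W':
            # Other threads holding a copy lose it on this write.
            holders = set(readers[block])
            if writer is not None:
                holders.add(writer)
            holders.discard(tid)
            if holders:
                invalidation[len(holders)] += 1
            # Record how many threads consumed the value being overwritten.
            if writer is not None:
                consumers = readers[block] - {writer}
                if consumers:
                    sharing[len(consumers)] += 1
            last_writer[block] = tid
            readers[block] = set()

    # Values still live at the end of the trace also have a sharing degree.
    for block, writer in last_writer.items():
        consumers = readers[block] - {writer}
        if consumers:
            sharing[len(consumers)] += 1

    return {"comm": comm, "sharing": sharing, "invalidation": invalidation}


if __name__ == "__main__":
    demo = [
        (0, 'W', 0x1000),  # thread 0 produces a value
        (1, 'R', 0x1000),  # thread 1 consumes it
        (2, 'R', 0x1008),  # thread 2 reads the same 64-byte block
        (1, 'W', 0x1000),  # thread 1 overwrites; threads 0 and 2 lose copies
        (0, 'R', 0x1000),  # thread 0 consumes thread 1's value
    ]
    for name, counts in characterize(demo).items():
        print(name, dict(counts))
```

Because the sketch depends only on the order of accesses and the identity of the accessing threads, a single pass over the trace suffices, which mirrors the configuration-independent idea described above.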
