3rd Workshop on Managed Runtime Environments
Sunday, March 20, 2005
Hotel Valencia Santana Row, San Jose, California
in conjunction with the IEEE/ACM International Symposium on Code Generation and Optimization (CGO'05)
Managed Runtime Environments (MREs) (aka virtual execution environments or simply runtime systems) have evolved in functionality and complexity for over 40 years. MREs, such as the JVM and CLI, have absorbed functionality once only available from the operating system, and at the same time MREs support diverse and highly dynamic application configurations. While current MREs are clearly effective for many applications, opportunities remain to improve their design and broaden their applicability. In my talk, I focus on current issues with runtime systems and consider where hardware and software trends are likely to take these systems in the future. I consider MREs from the perspectives of performance, reliability, and ease of use, drawing on published experiences from both the CLI and Java communities.
These experiences suggest important design directions for future MREs, including ways to improve support for modularity, error handling, concurrency, and componentization. One of the important future challenges for MREs is to demonstrate that they are up to the task of implementing the lowest-level system software, a domain where they are needed. The Singularity Project, at Microsoft Research, is investigating this challenging problem.
Most virtual machine implementations employ generational garbage collection to manage dynamically allocated memory. Studies have shown that these generational schemes work efficiently in desktop-like applications where most objects are short-lived. The performance of generational collectors, however, has not been studied in the context of distributed systems.
Given the increasing popularity of such systems and the remote objects they introduce, insights into their memory allocation behavior could have a large impact on the design of future garbage collection techniques, and on the implementation of such systems as well.
This work presents the first step in such a research effort. First, we introduce a profiling and analysis technique to identify local and remote objects through remotable connectivity. Second, we show the results of applying this technique to a sample program to assess the lifespans of its objects in the presence of generational garbage collection. Last, we use these early findings to highlight opportunities for garbage collection optimization in distributed environments.
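As a rough illustration of the kind of analysis described above, the sketch below classifies profiled objects as "remote" if they are reachable from remotely exported roots and "local" otherwise, then reports what fraction of each group outlives a nursery of a given size. The trace format (`ObjRecord`, with allocation and death times measured in bytes allocated) and the functions `remotely_reachable` and `survival_rates` are hypothetical, and using "lifespan exceeds nursery size" as the survival test is a crude proxy, not the authors' method.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ObjRecord:
    """One profiled object (hypothetical trace format; times are bytes allocated so far)."""
    oid: int
    alloc_time: int
    death_time: int
    refs: list[int] = field(default_factory=list)  # ids of objects this one references


def remotely_reachable(objs: dict[int, ObjRecord], remote_roots: set[int]) -> set[int]:
    """Breadth-first search from remotely exported roots over object references."""
    seen = {r for r in remote_roots if r in objs}
    queue = deque(seen)
    while queue:
        oid = queue.popleft()
        for child in objs[oid].refs:
            if child in objs and child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def survival_rates(objs: dict[int, ObjRecord], remote_roots: set[int],
                   nursery_bytes: int) -> dict[str, float]:
    """Fraction of remote and local objects whose lifespan exceeds the nursery size."""
    remote = remotely_reachable(objs, remote_roots)
    stats = {"remote": [0, 0], "local": [0, 0]}  # kind -> [survivors, total]
    for oid, rec in objs.items():
        kind = "remote" if oid in remote else "local"
        stats[kind][1] += 1
        if rec.death_time - rec.alloc_time > nursery_bytes:
            stats[kind][0] += 1
    return {kind: (s / t if t else 0.0) for kind, (s, t) in stats.items()}
```

If remote objects turned out to survive the nursery much more often than local ones, that would suggest, for example, pretenuring them; whether that holds is exactly what the profiling study is meant to establish.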
Garbage collection has been around for decades, and for nearly as long people have been complaining about GC pauses. Making a fast and concurrent (i.e., low-pause) GC that is also effective and non-fragmenting (i.e., precise and relocating) is fiendishly difficult. Despite decades of GC research and years of diligent engineering, no mainstream JVM has a concurrent collector as the default GC. Yet all that is lacking is a simple hardware read barrier, an instruction executed by regular Java threads to enforce the GC invariants. With a hardware read barrier, Azul has built a fast, low-pause, precise, relocating and, above all, simple GC. This talk presents Azul's GC algorithm and hardware, and shows its effectiveness on a variety of benchmarks.
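To make the role of a read barrier concrete, here is a single-threaded software simulation of the general idea: the collector relocates an object and records a forwarding entry, and every reference load by the mutator passes through a barrier that detects a stale reference, redirects it to the new copy, and heals the source slot so later loads take the fast path. The classes `Obj` and `Heap` and the forwarding table are illustrative assumptions; this is not Azul's barrier instruction or algorithm, and it ignores the concurrency a real collector must handle.

```python
class Obj:
    """A heap object with named reference fields (simulation only)."""

    def __init__(self, name, fields=None):
        self.name = name
        self.fields = dict(fields or {})


class Heap:
    def __init__(self):
        self.forwarding = {}   # old object -> relocated copy (keyed by identity)
        self.barrier_hits = 0  # number of loads the barrier had to fix up

    def relocate(self, obj):
        """Collector side: copy obj and record forwarding; stale references may remain."""
        copy = Obj(obj.name, obj.fields)
        self.forwarding[obj] = copy
        return copy

    def load_ref(self, holder, field):
        """Mutator side: every reference load goes through this read barrier."""
        ref = holder.fields[field]
        fwd = self.forwarding.get(ref)
        if fwd is not None:            # invariant violated: ref points at the old copy
            self.barrier_hits += 1
            holder.fields[field] = fwd  # self-heal the slot so the next load is fast
            ref = fwd
        return ref
```

For example, after `moved = heap.relocate(child)`, the first `heap.load_ref(parent, "next")` triggers the barrier and rewrites `parent.fields["next"]` to `moved`; a second load returns `moved` without touching the forwarding table.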
Advances in software and hardware technology and the recent trends towards increased virtualization and standardization are rapidly adding to the complexity and the number of layers in the execution stack. Although performance problems that result from the interactions among the execution layers are commonly observed, it is still not well understood exactly how these layers interact and how their interactions can be optimized for added performance. The goal of our research on continuous program optimization (CPO) is to provide a unifying framework to support a whole-system approach to program optimization that cuts across all layers of the execution stack, opening up new optimization opportunities. At the core of our CPO framework is a performance and environment monitoring (PEM) infrastructure that gathers data across the entire execution stack, from the hardware, through the operating system, runtime and middleware layer, to the application. Consuming this vertically integrated PEM data, a CPO agent that runs across the layers can then effect changes and improve performance. We have designed a platform-independent monitoring API (PEMAPI) to support efficient customization and aggregation of the event stream for both on-line and off-line consumption of the data by CPO agents.
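The customization and aggregation described above can be pictured with a small sketch: events from different layers share one record type, a consumer-supplied predicate selects the events of interest, and accepted events are summed per time window, layer, and event name. The `Event` record, the layer names, and the `aggregate` function are hypothetical illustrations and do not reflect the actual PEMAPI interface.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Event:
    layer: str      # e.g. "hw", "os", "jvm", "app" (hypothetical layer names)
    name: str       # e.g. "dtlb_miss"
    timestamp: int  # time unit is an assumption (cycles or ns)
    value: int


def aggregate(events: Iterable[Event],
              keep: Callable[[Event], bool],
              window: int) -> dict[tuple[int, str, str], int]:
    """Sum event values per (time window, layer, name) for events accepted by `keep`."""
    totals: dict[tuple[int, str, str], int] = defaultdict(int)
    for ev in events:
        if keep(ev):
            bucket = ev.timestamp // window  # index of the time window
            totals[(bucket, ev.layer, ev.name)] += ev.value
    return dict(totals)
```

Pushing the filter and the windowing close to the event source is one plausible way to keep the stream small enough for on-line consumption; an off-line consumer could instead store raw events and aggregate later.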
This talk presents a prototype implementation of the CPO framework and demonstrates its effectiveness for tuning application performance through two clients. The first client is a performance visualizer that displays the integrated event stream, letting the user explore performance data gathered across the entire system. The second client is a CPO agent that optimizes the memory behavior of an application by automatically making large page decisions. Using PEM events germane to large pages, the CPO agent performs a cost-benefit analysis to predict which arenas of the application's data benefit most from being mapped to large pages. We show that our CPO agent can automatically achieve the performance improvements from large pages that until now required manual programmer intervention and in-depth knowledge of both the application and the operating system's large page policies.
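The cost-benefit step can be sketched under a deliberately simple, hypothetical cost model: the benefit of mapping an arena to large pages is its observed TLB misses times a fixed miss penalty (optimistically assuming large pages remove essentially all of them), and the cost is the padding wasted by rounding the arena up to whole large pages, charged at a fixed rate per byte. Arenas with positive net benefit are then chosen greedily within a budget of free large pages. `Arena`, `net_benefit`, `choose_arenas`, and all default constants are assumptions, not the CPO agent's actual model.

```python
from dataclasses import dataclass


@dataclass
class Arena:
    name: str
    size_bytes: int
    tlb_misses: int  # observed TLB misses attributed to this arena


def pages_needed(size_bytes: int, large_page: int) -> int:
    return -(-size_bytes // large_page)  # ceiling division


def net_benefit(a: Arena, large_page: int, miss_penalty_cycles: int,
                cycles_per_wasted_byte: float) -> float:
    """Hypothetical model: cycles saved on TLB misses minus a charge for padding waste."""
    wasted = pages_needed(a.size_bytes, large_page) * large_page - a.size_bytes
    saved = a.tlb_misses * miss_penalty_cycles  # assumes large pages remove all misses
    return saved - wasted * cycles_per_wasted_byte


def choose_arenas(arenas, large_page, free_large_pages,
                  miss_penalty_cycles=30, cycles_per_wasted_byte=0.01):
    """Greedily pick the highest positive-benefit arenas that fit in the free large pages."""
    scored = sorted(
        ((net_benefit(a, large_page, miss_penalty_cycles, cycles_per_wasted_byte), a)
         for a in arenas),
        key=lambda t: t[0],
        reverse=True,
    )
    chosen, budget = [], free_large_pages
    for score, a in scored:
        need = pages_needed(a.size_bytes, large_page)
        if score > 0 and need <= budget:
            chosen.append(a.name)
            budget -= need
    return chosen
```

A greedy pass is only a heuristic for what is essentially a knapsack problem; it can skip a valuable large arena (such as `cache` below when only five pages are free) in favor of a smaller one that still fits.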
The BEA JRockit Java(TM) VM uses hardware feedback from the Intel(R) Itanium(R) 2 processor to enable dynamic profile-guided optimization in the JRockit JIT compiler. In this talk we will describe the JRockit DPGO implementation, which includes the Itanium 2 processor performance monitoring unit (PMU), a software layer that dynamically collects and delivers PMU data to the JVM, creation of profiles from PMU data, and use of profiles in JIT optimizations.
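One basic step in turning PMU data into a profile is attributing sampled instruction addresses to JIT-compiled methods. The sketch below maps each sampled address to a method via binary search over sorted code ranges, counts samples per method, and reports methods above a hotness threshold as candidates for further optimization. `CodeMap`, `hot_methods`, and the threshold policy are hypothetical; they illustrate the general technique and are not the JRockit DPGO implementation.

```python
import bisect
from collections import Counter


class CodeMap:
    """Maps instruction addresses to JIT-compiled methods (sorted, non-overlapping ranges)."""

    def __init__(self, ranges):
        # ranges: iterable of (start, end_exclusive, method_name)
        self.ranges = sorted(ranges)
        self.starts = [r[0] for r in self.ranges]

    def lookup(self, ip):
        i = bisect.bisect_right(self.starts, ip) - 1  # last range starting at or before ip
        if i >= 0:
            start, end, name = self.ranges[i]
            if ip < end:
                return name
        return None  # sample fell outside JIT code (e.g. VM runtime or OS)


def hot_methods(samples, code_map, min_fraction):
    """Return methods receiving at least `min_fraction` of all samples, hottest first."""
    counts = Counter()
    for ip in samples:
        name = code_map.lookup(ip)
        if name is not None:
            counts[name] += 1
    total = len(samples)
    return [m for m, c in counts.most_common() if total and c / total >= min_fraction]
```

Samples that land outside JIT code still count toward the total here, so a method's fraction reflects its share of all sampled execution rather than only of compiled code.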
Improvements in virtual machine (VM) and just-in-time (JIT) compiler technology for running Java programs have produced tremendous performance gains, at least if one measures by the commonly referenced SPECjbb2000 and SPECjvm98 benchmarks. These huge performance improvements are one of the reasons behind the widespread adoption of the Java programming language by the programming community. But these standard benchmarks are now 5 and 7 years old, respectively, and do not reflect the coding styles employed by many Java developers designing middleware applications running on application servers. In particular, the object-oriented design features made easily available in the Java language are exercised to a far greater degree in middleware applications than in older benchmarks. While the optimization frameworks developed to target programs like SPECjbb2000 and the SPECjvm98 suite are still beneficial and necessary for good general performance, the widespread use of object-oriented features in Java, as well as the highly multi-threaded environment in which middleware applications typically execute, have made optimization at the VM and JIT compiler level much more challenging. In my presentation, I will elaborate on some of these aspects and explain why common middleware application designs make optimization more difficult for VMs and JIT compilers. I will also describe some implementation-level challenges that must be addressed in any optimization targeting these types of applications.