
Compiler and Architecture Seminar 2002



Abstracts
Design of a C compiler for EZchip Network Processor (Abstract)
Gil Timnat, EZchip

EZchip provides a network processor, called NP1, which integrates several high-speed processors on one chip. Four types of programmable processors called TOPs (Task Optimized Processors) are used to perform the four main tasks of packet processing, i.e. parse, search, resolve and modify.

TOPparse identifies and extracts various headers, fields, and keywords in the frame. TOPsearch uses the parsed fields as keys for performing lookups in the relevant routing, classification, and policy tables. TOPresolve makes forwarding and QoS decisions. TOPmodify modifies packet contents, performing overwrite, add, or insert operations anywhere in the packet. Each TOP performs its particular task and passes its results (e.g., messages, keys, headers, and pointers) to the next TOP stage for further processing. Multiple instances of each TOP type process frames at each pipeline stage, enabling packet processing at 10-Gigabit wire speed. All TOPs of the same type execute the same program.
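
The following toy program models this four-stage hand-off in plain C. It is only a sketch of the pipeline structure described above; the frame fields, stage names, and stage bodies are invented for illustration and do not reflect EZchip's actual interfaces.

    /* Toy model of the parse -> search -> resolve -> modify hand-off. */
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    typedef struct {
        uint8_t  data[64];    /* raw frame bytes (truncated for the sketch) */
        uint32_t key;         /* lookup key extracted by the parse stage    */
        uint32_t result;      /* table-lookup result from the search stage  */
        int      out_port;    /* forwarding decision from the resolve stage */
    } frame_t;

    static void top_parse(frame_t *f)   { f->key = f->data[0]; }          /* extract a field  */
    static void top_search(frame_t *f)  { f->result = f->key % 4; }       /* toy table lookup */
    static void top_resolve(frame_t *f) { f->out_port = (int)f->result; } /* pick output port */
    static void top_modify(frame_t *f)  { f->data[1] = 0xFF; }            /* rewrite a byte   */

    int main(void)
    {
        frame_t f;
        memset(&f, 0, sizeof f);
        f.data[0] = 42;
        /* A frame flows through the four stages in order; on the NP1,
           multiple instances of each stage work on different frames in
           parallel to sustain wire speed. */
        top_parse(&f); top_search(&f); top_resolve(&f); top_modify(&f);
        printf("frame forwarded to port %d\n", f.out_port);
        return 0;
    }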

Programs can be written using EZchip's assembler (EZasm) or C (EZc). Performance is crucial: to maintain the required wire speed, a TOP has a budget of fewer than 100 clock cycles to process a typical frame. This means that a typical path through the program is short and the entire program is small. It also means that the C compiler must emit efficient code, or its output will be of very little use.

Also, since the TOPs are designed for a specific purpose, EZc must support that purpose well. For example, it provides no floating-point arithmetic at all, but offers extensive support for bit manipulation.
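
To make that kind of support concrete, the snippet below shows, in plain C, the bit-level field extraction a parse stage performs constantly. It is a generic illustration, not EZc syntax; EZc's actual operators and intrinsics may differ.

    /* Generic bit-field extraction of the sort a TOP compiler must make cheap. */
    #include <stdint.h>
    #include <stdio.h>

    /* Extract `width` bits starting at bit `pos` (bit 0 = LSB) of a 32-bit word. */
    static uint32_t extract_bits(uint32_t word, unsigned pos, unsigned width)
    {
        return (word >> pos) & ((1u << width) - 1u);
    }

    int main(void)
    {
        /* First 32 bits of an IPv4 header: version, IHL, TOS, total length. */
        uint32_t w = 0x45001234u;
        printf("version=%u ihl=%u total_len=%u\n",
               extract_bits(w, 28, 4),    /* version   = 4      */
               extract_bits(w, 24, 4),    /* IHL       = 5      */
               extract_bits(w, 0, 16));   /* total len = 0x1234 */
        return 0;
    }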

The compiler is thus constrained to produce efficient code for several CPUs whose very particular architectures and instruction sets are tailored for frame-content manipulation and table lookup. The lecture explains the compiler design, including some of the key design choices that had to be made.


Simple Inter-procedural Register Allocation in a Compiler for a Network Processor (Abstract)
Mostafa Hagog, IBM Haifa Research Lab

The inter-procedural register allocation problem has concerned compiler designers since the early days of optimizing compilers, and many algorithms for it exist. In our work, we introduce a simple and interesting inter-procedural register allocation scheme for the C compiler of the PowerNP network processor. As in any other embedded system, programs for the network processor are compiled as a whole. This makes inter-procedural register allocation a requirement rather than merely a "nice to have" option. We implement inter-procedural register allocation in an intra-procedural compiler (GCC); collecting the global-view information and applying the inter-procedural allocation required only slight changes to the existing GCC. According to the feedback we received from use of our compiler, it competes favorably with hand-written assembly code. We believe this accomplishment is due to the inter-procedural register allocation.
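
The abstract does not spell out the algorithm, so the sketch below illustrates only the general idea behind simple whole-program schemes: walking an acyclic call graph bottom-up and giving each procedure a register range disjoint from those of all procedures it may call, so that calls need no save/restore code. All names and numbers are invented for the example; this is not the PowerNP compiler's algorithm.

    /* Bottom-up whole-program register assignment over an acyclic call graph. */
    #include <stdio.h>

    #define NPROC 4
    #define NCALL 3

    static const char *name[NPROC] = { "main", "route", "hash", "log" };
    static int need[NPROC] = { 3, 4, 2, 2 };              /* registers each procedure needs */
    static int calls[NCALL][2] = { {0,1}, {1,2}, {0,3} }; /* caller -> callee edges         */

    static int base[NPROC];   /* first register assigned to each procedure */

    /* base(p) = max over callees c of base(c) + need(c); leaves start at 0.
       (Subtrees are recomputed; fine for a toy graph this small.) */
    static int assign(int p)
    {
        int b = 0;
        for (int e = 0; e < NCALL; e++)
            if (calls[e][0] == p) {
                int top = assign(calls[e][1]) + need[calls[e][1]];
                if (top > b) b = top;
            }
        return base[p] = b;
    }

    int main(void)
    {
        assign(0);  /* allocate from the root of the call graph */
        for (int p = 0; p < NPROC; p++)
            printf("%-6s r%d..r%d\n", name[p], base[p], base[p] + need[p] - 1);
        return 0;
    }

Since a procedure's range never overlaps a range used anywhere below it in its call chain, no registers need to be saved or restored at call sites.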


Compiler for a DSP Architecture (Abstract)
Ayal Zaks, IBM Haifa Research Lab

The Embedded Low-Power Digital Signal Processor project is an ongoing effort, within IBM Research and Development, to advance the state of the art in power-efficient programmable DSP architectures. This effort grows from the understanding that what matters is an architecture offering a balanced optimization of programmability in a high-level language, power consumption, performance, development cost (hardware and software), and production cost (chip and system). To achieve these usually conflicting optimization goals, the design of the eLite DSP architecture and its implementations covers aspects ranging from algorithms, applications, and the high-level language compiler down to circuit-level technology. The resulting architecture is a multiple-issue, statically scheduled processor with a heterogeneous set of register files spread across specialized units. Parallelism is achieved by executing multiple instructions operating on different registers (VLIW), in conjunction with single instructions operating on multiple disjoint registers (single instruction, multiple disjoint data: SIMdD), as well as single instructions operating on packed data (single instruction, multiple packed data: SIMpD). A novel indirect register addressing mechanism enables the dynamic composition of vectors with four elements selected from a large multiported register file.
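
The snippet below models the indirect register addressing idea in plain C: a vector operation reads its four operands through a "vector pointer" that names four arbitrary registers in a large register file, composing a vector dynamically instead of requiring contiguously packed data. The register-file size, pointer format, and operation names are invented for the sketch and are not taken from the eLite specification.

    /* Toy model of SIMdD-style indirect register addressing. */
    #include <stdint.h>
    #include <stdio.h>

    #define NREGS 64

    static int16_t vrf[NREGS];  /* stand-in for a large multiported register file */

    /* dst[i] = vrf[pa[i]] + vrf[pb[i]] for i = 0..3: the vector pointers
       pa and pb select four scattered registers as each vector operand. */
    static void vadd_indirect(int16_t dst[4],
                              const uint8_t pa[4], const uint8_t pb[4])
    {
        for (int i = 0; i < 4; i++)
            dst[i] = (int16_t)(vrf[pa[i]] + vrf[pb[i]]);
    }

    int main(void)
    {
        for (int r = 0; r < NREGS; r++) vrf[r] = (int16_t)r;
        uint8_t pa[4] = { 0, 7, 13, 42 };  /* scattered operand registers */
        uint8_t pb[4] = { 1, 1, 1, 1 };    /* the same register, reused   */
        int16_t d[4];
        vadd_indirect(d, pa, pb);
        printf("%d %d %d %d\n", d[0], d[1], d[2], d[3]);  /* prints: 1 8 14 43 */
        return 0;
    }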

We will present this diverse design environment, the unique architectural features mentioned above, and describe how an optimizing compiler can exploit these features to generate code that is comparable to hand-optimized code.


Next Generation Mobile Processor (Abstract)
Ronny Ronen, Intel

Intel's Banias processor is the first microprocessor designed specifically for the requirements of tomorrow's mobile PCs. In this presentation we first highlight the general trends in processor microarchitecture and then provide an overview of the Banias processor, describing its power-aware microarchitecture features.


Self-Stabilizing Microprocessor (Abstract)
Yinnon Haviv, Ben-Gurion University

Soft errors are changes in memory values caused by cosmic rays. The computer architecture community has been coping with soft errors since the early 1980s. Recent advances in technology that decrease feature size, power usage, and micro-cycle period increase the influence of soft errors.

Self-stabilizing systems can be started in an arbitrary (possibly corrupted) state. The extensive interest in the area in academia and industry (see, for example, self-healing, automatic recovery, automatic repair, and autonomic computing) is due to the need for robust systems that can cope with faults. The implementation of a self-stabilizing system is only possible when the microprocessor that executes the program is itself self-stabilizing.

In this work, we combine traditional redundancy approaches based on error correction (by space or time redundancy) with a fallback mechanism that copes with soft errors and ensures that the microprocessor stabilizes to a desired behavior.
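
As a concrete example of the space-redundancy building block mentioned above, the snippet below implements triple modular redundancy with bitwise majority voting, which masks a single flipped bit per bit position; the fallback mechanism discussed in the talk is precisely for corruptions such voting cannot catch. This illustrates the classic technique, not the speaker's particular design.

    /* Triple modular redundancy: keep three copies, read back their majority. */
    #include <stdint.h>
    #include <stdio.h>

    /* Each result bit is 1 iff it is 1 in at least two of the three copies,
       so a single soft error in any one copy is masked. */
    static uint32_t vote3(uint32_t a, uint32_t b, uint32_t c)
    {
        return (a & b) | (a & c) | (b & c);
    }

    int main(void)
    {
        uint32_t stored = 0xDEADBEEFu;
        uint32_t a = stored, b = stored, c = stored;
        b ^= 1u << 17;  /* simulate a cosmic-ray bit flip in copy b */
        printf("voted: 0x%08X (recovered=%d)\n",
               vote3(a, b, c), vote3(a, b, c) == stored);
        return 0;
    }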


Architectural Implications of Multiple Clock Domains and Synchronization in High Performance Systems on Chips (SoC) (Abstract)
Ran Ginosar, Technion

Clocking on digital VLSI chips isn't what it used to be. And it's not what we teach in engineering schools! Large chips are broken into smaller "clock domains." Clock frequency and phase differences among the domains make on-chip, inter-domain communications difficult and tricky. We will review the phenomena and the various solutions, including synchronizers and asynchronous interconnects. We will focus on some architectural implications of multiple on-chip clock domains, including Globally Asynchronous / Locally Synchronous (GALS) and asynchronous architectures.


Future Challenges in Computer System Design in the e-Business on Demand Era (Abstract)
Kemal Ebcioglu, IBM T.J. Watson Research Center

As a result of the proliferation of e-Business, and technology improvements in the Internet infrastructure, traditional in-house IT shops are increasingly being replaced by the e-Utility computing model, where computational resources will be rented to businesses over the Internet via massive, geographically distributed server farms. Unlike the current co-location model in today's server farms, where each customer has separate hardware in its own cage, the e-Utility model will need to share hardware pervasively among different customers in a server farm, in order to achieve greater economies of volume and better resource allocation, while maintaining secure isolation for each customer. In future server farms, virtualization technology will not be applied just to servers (in the form of Virtual Machines and Logical Partitions as we know them); instead, entire IT shops -- graphs containing server nodes (web servers, database servers, possibly of different architectures), network nodes (firewalls, load balancers, ...), and storage nodes -- will need to be virtualized and mapped to a highly shared hardware infrastructure, through novel techniques. Also, the computational resources allocated to a customer will need to be increased and decreased on demand, e.g., to adapt to the rapidly changing load on a customer's web site. This talk will discuss some of the future architecture, microarchitecture, and virtualization challenges in achieving goals such as low power, a small footprint, and simple microprocessor design in tomorrow's server farms.


Intrathreads - Techniques for parallelizing sequential code (Abstract)
Alex Gontmakher, Technion

In many cases, code can be parallelized by explicit division into threads, but the mechanisms of thread creation, synchronization, and communication carry an overhead that is higher than the benefit gained from parallelization. We present an architecture that assists in exploiting this parallelism.

The architecture is based on threads that run transparently inside one OS thread context, and it can be regarded simply as an advanced control-flow mechanism. These threads (called intrathreads because they are contained entirely within the processor) use the same register file both to minimize per-thread resources and as a communication medium. In addition, a special set of one-bit condition registers provides a low-latency synchronization mechanism. Owing to its low resource overhead, the architecture is easy to implement on existing microprocessors. Intrathreads are especially well suited to the IA-64 architecture thanks to features such as its large register file and rotating register windows.
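
The snippet below gives a loose, user-level analogy of that synchronization style: a producer writes a value through the shared register file and raises a one-bit condition flag on which the consumer spins. Real intrathreads live inside a single OS thread context in the processor; POSIX threads and C11 atomics are used here purely to make the idea runnable.

    /* Analogy: shared registers plus a one-bit condition flag for sync. */
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    static int shared_regs[8];            /* stand-in for the shared register file   */
    static atomic_bool cond_bit = false;  /* stand-in for a 1-bit condition register */

    static void *producer(void *arg)
    {
        (void)arg;
        shared_regs[3] = 42;                         /* communicate via shared registers */
        atomic_store_explicit(&cond_bit, true,
                              memory_order_release); /* raise the condition bit          */
        return NULL;
    }

    int main(void)
    {
        pthread_t t;
        pthread_create(&t, NULL, producer, NULL);
        while (!atomic_load_explicit(&cond_bit, memory_order_acquire))
            ;                                        /* low-latency spin on the bit      */
        printf("consumer got %d\n", shared_regs[3]);
        pthread_join(t, NULL);
        return 0;
    }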

Several techniques have been developed to apply intrathreads to the optimization of existing serial code. Benchmarks show that nearly linear speedups are possible in specific code sections, and speedups of tens of percent can be achieved on real programs. Ultimately, these techniques will be applied automatically by the compiler in addition to regular code optimizations.


A New Parallel and Fully Recursive Multifrontal Supernodal Sparse Cholesky (Abstract)
Dror Irony, Gil Shklarski, Sivan Toledo, Tel-Aviv University

We describe the design, implementation, and performance of a new parallel sparse Cholesky factorization code. The code uses a supernodal multifrontal factorization strategy. Operations on small dense submatrices are performed using new dense-matrix subroutines that are part of the code, although the code can also use the BLAS and LAPACK. The new code is recursive at both the sparse and dense levels, uses a novel recursive data layout for dense submatrices, and is parallelized using Cilk, an extension of C specifically designed for parallelizing recursive codes. We demonstrate that the new code performs well and scales well on SMPs. The scalability and high performance that the code achieves imply that recursive schedules, blocked data layouts, and dynamic scheduling are effective in the implementation of sparse factorization codes.
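
For readers unfamiliar with the dense kernel at the heart of such codes, the sketch below is a serial, plain-C recursive blocked Cholesky factorization (factor the leading block, solve a triangular system, update the Schur complement, recurse). The authors' code additionally uses a recursive data layout and Cilk spawns for parallelism; none of the code below is theirs.

    /* In-place recursive Cholesky: on return the lower triangle of the
       row-major n x n matrix A (leading dimension lda) holds L, A = L*L^T. */
    #include <math.h>
    #include <stdio.h>

    static void chol(double *A, int n, int lda)
    {
        if (n == 1) { A[0] = sqrt(A[0]); return; }
        int n1 = n / 2, n2 = n - n1;
        double *A21 = A + n1 * lda;        /* rows n1..n-1, cols 0..n1-1 */
        double *A22 = A + n1 * lda + n1;   /* trailing n2 x n2 block     */

        chol(A, n1, lda);                  /* factor leading block: L11  */

        /* Triangular solve L21 = A21 * L11^{-T}, row by row. (In a Cilk
           version, independent rows can be processed in parallel.)      */
        for (int i = 0; i < n2; i++)
            for (int j = 0; j < n1; j++) {
                double s = A21[i * lda + j];
                for (int k = 0; k < j; k++)
                    s -= A21[i * lda + k] * A[j * lda + k];
                A21[i * lda + j] = s / A[j * lda + j];
            }

        /* Symmetric update of the Schur complement: A22 -= L21 * L21^T.  */
        for (int i = 0; i < n2; i++)
            for (int j = 0; j <= i; j++) {
                double s = 0.0;
                for (int k = 0; k < n1; k++)
                    s += A21[i * lda + k] * A21[j * lda + k];
                A22[i * lda + j] -= s;
            }

        chol(A22, n2, lda);                /* recurse on the trailing block */
    }

    int main(void)
    {
        /* Lower triangle of A = L*L^T with L = [[2,0,0],[1,2,0],[1,1,2]]. */
        double A[9] = { 4, 0, 0,   2, 5, 0,   2, 3, 6 };
        chol(A, 3, 3);
        printf("%g\n%g %g\n%g %g %g\n", A[0], A[3], A[4], A[6], A[7], A[8]);
        return 0;                          /* prints L: 2 / 1 2 / 1 1 2   */
    }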


Post-Link Optimization Technology (Abstract)
Gad Haber, IBM Haifa Research Lab

Post-link optimization algorithms perform global analysis on the entire executable file (including statically linked library code) and complement the compiler's optimizations. Since the executable file to be optimized will not be re-linked, compiler and linker conventions need not be preserved, allowing aggressive optimizations that are not available to optimizing compilers. The challenge of post-link optimization lies in the fact that post-link code is less informative, yet very large compared to a single compilation unit. Therefore, a mechanism that allows one to perform lightweight yet aggressive optimizations in the face of incomplete information is valuable.

We will present FDPR (Feedback Directed Program Restructuring), a profile-based post-link optimization tool developed in the IBM Haifa Research Lab. The general operation of FDPR will be described along with some of its latest optimizations. FDPR is suitable for very large programs or DLLs (dynamic link libraries). When used on large subsystems that operate in a multiprocessing setting, FDPR provides significant performance improvements.
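
To illustrate the flavor of profile-based restructuring, the toy below reorders code units by their profile counts so that hot code is packed together, improving instruction-cache and page locality; this is one classic optimization of the genre, shown here at coarse function granularity with invented numbers. FDPR's actual analysis and transformations operate on the basic blocks of the executable itself.

    /* Toy profile-driven layout: hottest code first. */
    #include <stdio.h>
    #include <stdlib.h>

    typedef struct { const char *name; long count; } unit_t;

    /* qsort comparator: higher execution count sorts earlier. */
    static int hotter_first(const void *a, const void *b)
    {
        long d = ((const unit_t *)b)->count - ((const unit_t *)a)->count;
        return (d > 0) - (d < 0);
    }

    int main(void)
    {
        unit_t prof[] = {                /* counts gathered by a training run */
            { "init",          1 },
            { "lookup",   901223 },
            { "log_error",    17 },
            { "checksum",  88012 },
        };
        size_t n = sizeof prof / sizeof prof[0];
        qsort(prof, n, sizeof prof[0], hotter_first);
        puts("proposed layout (hot code first):");
        for (size_t i = 0; i < n; i++)
            printf("  %-10s %ld\n", prof[i].name, prof[i].count);
        return 0;
    }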


Fragmented Line Caches (Abstract)
Bishara Shomar, Intel - Technion

Modern CPUs use cache memories to improve their performance. Modern cache memories tend to increase the cache line size in an effort to exploit more spatial locality and thus improve system performance. As a result, the average cache-area utilization becomes very low (between 28% and 38%). These observations led us to the fragmented cache-line concept. This work introduces a new data-cache structure that has two types of cache lines: a long line and a fragmented line. Long lines are used to handle memory areas with high spatial locality, and fragmented lines are used to handle areas with sparse locality. Each cache line can dynamically alternate between these two line types depending on the program's data behavior. The fragmented cache line is transparent to the user, but using the new cache structure improves, on average, the miss rate by up to 35%, the bus traffic by 20%, and the cache-area utilization by 45%, thus increasing CPU performance by 6%.
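
The declarations below sketch one plausible shape for such a two-mode line: a long line carries a single tag over the whole line, while a fragmented line splits into sectors with independent tags so it can cache scattered addresses. Sizes, field names, and the simplified tag handling are all invented for the sketch; the work's actual organization may differ.

    /* One possible layout for a line that alternates between two modes. */
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define LINE_BYTES  64
    #define NFRAGS       4
    #define FRAG_BYTES  (LINE_BYTES / NFRAGS)

    typedef struct {              /* one sector of a fragmented line */
        uint32_t tag;
        bool     valid;
        uint8_t  data[FRAG_BYTES];
    } fragment_t;

    typedef struct {
        bool long_mode;           /* the line alternates between the two types */
        union {
            struct {              /* long line: one tag covers all 64 bytes    */
                uint32_t tag;
                bool     valid;
                uint8_t  data[LINE_BYTES];
            } l;
            fragment_t frag[NFRAGS]; /* fragmented: 4 independently tagged sectors */
        } u;
    } cache_line_t;

    /* Hit check: one comparison for a long line; for a fragmented line,
       only the sector covering the address must match (tag granularity
       is simplified here). */
    static bool hit(const cache_line_t *ln, uint32_t tag, unsigned offset)
    {
        if (ln->long_mode)
            return ln->u.l.valid && ln->u.l.tag == tag;
        const fragment_t *f = &ln->u.frag[offset / FRAG_BYTES];
        return f->valid && f->tag == tag;
    }

    int main(void)
    {
        cache_line_t ln = { .long_mode = false };
        ln.u.frag[2].valid = true;
        ln.u.frag[2].tag   = 0xABCD;
        printf("%d %d\n", hit(&ln, 0xABCD, 2 * FRAG_BYTES), /* 1: sector 2 hit  */
               hit(&ln, 0xABCD, 0));                        /* 0: sector 0 miss */
        return 0;
    }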

