TOPIC: COARSE-GRAINED SIMD ARCHITECTURE
With the growing demand for higher-quality multimedia playback, especially on portable devices, efficient algorithms for transferring and processing audio and video data have been developed.
These algorithms are data-intensive and computationally complex. For such applications, the two extreme implementation approaches are software running on a general-purpose processor and hardware in the form of an ASIC. A general-purpose processor is flexible enough to support various applications but may not provide sufficient performance to cope with their complexity. An ASIC can be optimized for power and performance, but only for a specific application. A coarse-grained reconfigurable architecture combines the advantages of both approaches: it offers higher performance than a general-purpose processor and wider applicability than an ASIC.
With the increasing interest in reconfigurable computing in recent years, many kinds of coarse-grained reconfigurable architecture have been proposed. Most of them consist of a reconfigurable array and a processor, which together execute the entire application. Data-intensive, regular kernel code segments are executed on the reconfigurable array, while control-intensive, irregular code segments are executed on the processor.
MorphoSys consists of a Tiny_RISC processor, an RC (Reconfigurable Cell) array, a frame buffer, a context memory, and a DMA controller. The Tiny_RISC processor controls the overall system, while the RC array, an 8x8 array of ALUs, performs 16-bit operations based on the SIMD programming model.
The XPP-based Configurable System-on-Chip Architecture consists of an XPP-core (a 4x4 or 8x8 reconfigurable array), one LEON processor, and several SRAM-type memory modules. The AHB bus from ARM is chosen as the main communication bus.
ADRES is an architecture template rather than a fixed architecture. An XML-based architecture description language is used to define the overall topology, supported operation set, resource allocation, timing, and even the internal organization of each RC. ADRES tightly couples a VLIW processor and a reconfigurable matrix. The reconfigurable matrix accelerates dataflow-like kernels in a highly parallel way, whereas the VLIW processor executes the non-kernel code by exploiting instruction-level parallelism.
SIMD versus Loop Pipelining
We can consider two different models for mapping loops onto a coarse-grained reconfigurable architecture: SIMD and loop pipelining. The SIMD computation model is efficient for computation-intensive, data-parallel applications, requiring fewer context words to configure the reconfigurable processing elements. However, since data load and computation are temporally separated in this model, the array elements are not efficiently utilized. With loop pipelining, different operations in a loop can be executed simultaneously in a pipeline. This flexibility allows data load and computation to proceed at the same time, so that all reconfigurable array elements can be used efficiently. In some loops, the performance of pipelining is roughly the same as that of SIMD.
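The difference between the two models can be illustrated with a toy cycle-count estimate. This is only a sketch under simplifying assumptions that are not taken from any of the architectures above: the loop iterations are processed in batches of one iteration per processing element, every batch needs a fixed number of load cycles and compute cycles, and in the pipelined case the load of the next batch fully overlaps the computation of the current one. The function names and parameters are hypothetical.

```python
from math import ceil


def simd_cycles(n_iters, n_pes, load_cycles, compute_cycles):
    """Toy estimate: load and compute phases never overlap (assumed model)."""
    batches = ceil(n_iters / n_pes)  # one iteration per processing element
    return batches * (load_cycles + compute_cycles)


def pipelined_cycles(n_iters, n_pes, load_cycles, compute_cycles):
    """Toy estimate: load of batch k+1 overlaps compute of batch k (assumed model)."""
    batches = ceil(n_iters / n_pes)
    if batches == 0:
        return 0
    # In steady state the slower of the two stages sets the pace.
    stage = max(load_cycles, compute_cycles)
    # First load fills the pipeline, last compute drains it.
    return load_cycles + (batches - 1) * stage + compute_cycles


if __name__ == "__main__":
    # Hypothetical workload: 8 processing elements, 2 load cycles, 4 compute cycles.
    for n_iters in (64, 8):
        simd = simd_cycles(n_iters, 8, 2, 4)
        pipe = pipelined_cycles(n_iters, 8, 2, 4)
        print(f"{n_iters} iterations: SIMD {simd} cycles, pipelined {pipe} cycles")
```

Under these assumptions, 64 iterations take 48 cycles in the SIMD model and 34 cycles with pipelining, because most load cycles are hidden behind computation. With only 8 iterations there is a single batch and nothing to overlap, so both models take 6 cycles, which is one way a loop can show roughly the same performance under either model.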