I recall that the book VLSI Array Processors (《VLSI阵列处理器》) covers this, but it is a fairly difficult read and I did not finish it. In an array processor, each processor is called a processing element (PE). A number of PEs form a processor array, and an algorithm is mapped onto the array for execution. If each element has the coordinate PE(x), the array is one-dimensional; PE(x, y) gives a two-dimensional array; PE(x, y, z) gives a cubic array. Beyond three dimensions, where each PE has coordinates (x, y, z, w, u, v, …), the array is called a hypercube array. For worked examples, see systolic array theory.
The hypercube architecture is based on the hypercube connection of processing elements, each including a processor and a memory, which coordinate their computations by sending messages to each other [167], [12], [74]. It differs from a shared-memory multiprocessor, in which processors are connected to a shared memory through a switching network or a common memory bus. Each processing element of the hypercube architecture operates as an independent computer. The main feature of the hypercube architecture is the way it interconnects processing elements.
The hypercube topology has several interesting properties. One of the most interesting is the communication path length. Any message sent by a processing element can reach its destination in no more than n hops in an n-dimensional hypercube, that is, log2 N hops for N = 2^n processing elements. In other words, the maximum length of the path (the number of edges on the path) a message may follow is n = log2 N. This is a good property for implementing a variety of algorithms that require many communications between processing elements. When many processing elements send messages, several messages may contend for a single edge at one node. All but one of them are then delayed until the edge is free, or they are sent along other edges and take more hops than the theoretical minimum.
Each processing element is an independent computer that runs its own copy of an operating system and operates asynchronously. When programs run on processing elements to solve a single problem, they must synchronize at some points. If there are many such synchronization points, the performance of parallel execution degrades severely.
Because the synchronization overhead is not small in this type of architecture, it suits coarse-grain parallel execution rather than fine-grain parallel execution. To obtain asynchronous coarse-grain parallelism, a program must be written so that the number of synchronizations between processing elements is kept to a minimum while the number of operations between synchronization points is kept as large as possible.
A message passes through intermediate processing elements on its path to the destination, and each of them has to spend some CPU time routing it. The more messages arrive, the more time a processing element spends on routing, which degrades performance. Special hardware can be added to a processing element so that routing does not interrupt the tasks running on it. However, when those tasks reach a synchronization point, they have to send or receive messages, and their performance is still affected by the messages passing through that processing element. There is no way to avoid the performance degradation caused by message routing in this architecture.
From the implementation point of view, an n-dimensional hypercube is limited, for large n, by the number of links each processing element needs. For example, if n = 10, each of the 2^10 = 1024 processing elements has 10 links. If each link is 8, 16, 24, or 32 bits wide, the total number of link bits per processing element is 80, 160, 240, or 320, respectively. Since each connection is one-to-one, there are 1024 × 10 / 2 = 5,120 links, so the total number of wires (one wire per bit) amounts to 40,960 for 8-bit links, 81,920 for 16-bit links, 122,880 for 24-bit links, and 163,840 for 32-bit links. Since some control signals and parity bits are usually associated with each link, the actual amount of wiring exceeds these figures in each case.
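The wiring arithmetic above can be reproduced with a few lines of Python. This is only a sketch that counts data wires (no control or parity lines); the function name `hypercube_wiring` and its dictionary keys are made up for illustration and do not come from the book.

```python
def hypercube_wiring(n, link_width_bits):
    """Count data wires in an n-dimensional hypercube with one-to-one links.

    Assumption: one wire per data bit; control and parity wires are ignored.
    """
    num_pes = 2 ** n                            # processing elements
    links_per_pe = n                            # one link per dimension
    total_links = num_pes * links_per_pe // 2   # each link joins two PEs
    return {
        "pes": num_pes,
        "bits_per_pe": links_per_pe * link_width_bits,
        "total_links": total_links,
        "total_wires": total_links * link_width_bits,
    }


if __name__ == "__main__":
    for width in (8, 16, 24, 32):
        print(width, hypercube_wiring(10, width))
```

For n = 10 this gives 5,120 links, and the `total_wires` values match the figures quoted in the text.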
If we use a single-bit link, the amount of wiring is small, but transferring data takes more time.
In general, the hypercube architecture is expected to offer efficient communication because of the properties of the hypercube connection. However, controlling precise timing is not easy, because of the asynchronous parallel execution and the influence of message routing. Moreover, it is a coarse-grain architecture, which is not suited to arithmetic-level parallelism. It is also very difficult to implement the architecture with multiple-byte links for multiple-byte communication.
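To make the path-length property of the hypercube connection concrete, here is a minimal Python sketch. It assumes the usual binary labeling, which the quoted text does not spell out: each PE of an n-dimensional hypercube gets an n-bit address, and two PEs are linked when their addresses differ in exactly one bit. The routing shown is dimension-ordered ("e-cube") routing, picked here as one simple scheme and not necessarily what any particular machine uses. The names `hop_count` and `ecube_route` are illustrative.

```python
def hop_count(src, dst):
    """Minimum number of hops = number of differing address bits."""
    return bin(src ^ dst).count("1")


def ecube_route(src, dst, n):
    """Return the list of PE addresses visited from src to dst.

    Corrects the differing address bits one dimension at a time,
    starting from the lowest dimension (dimension-ordered routing).
    """
    path = [src]
    node = src
    for dim in range(n):
        if (node ^ dst) & (1 << dim):   # addresses differ in this dimension
            node ^= 1 << dim            # cross the link along this dimension
            path.append(node)
    return path


if __name__ == "__main__":
    n = 3
    print(ecube_route(0b000, 0b101, n))   # [0, 1, 5]
    # The longest shortest path over all PE pairs is n hops.
    print(max(hop_count(a, b) for a in range(2 ** n) for b in range(2 ** n)))
```

Since at most n address bits can differ, no route takes more than n hops, which is log2 of the number of PEs. The sketch ignores contention: when two messages need the same link, one of them has to wait or detour, as described above.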
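Copyright notice: this is an original article by blueplain, released under the CC 4.0 BY-SA license. When reposting, please include a link to the original source and this notice.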