Implementation of turbo encoder in LTE

Abstract: -

LTE (Long Term Evolution) is an upcoming standard on the path towards 4G, designed to increase capacity and throughput performance compared to UMTS and WiMax. Turbo codes are a recent development in the area of forward error correction, and turbo encoders are key elements in present-day communication systems for achieving data reception with few errors. This thesis deals with the implementation of a turbo encoder for the LTE standard on the DSP TMS320C6416 in order to evaluate its throughput performance. The encoder is written in C in the Code Composer Studio IDE and run on the TMS320C6416 DSP Starter Kit (DSK). Clock cycles are determined from the running time and the clock frequencies are calculated; the throughput performance is then computed from these parameters, and the result is evaluated and compared with other standards such as UMTS and WiMax.

Chapter 1: Introduction

Turbo codes are a recent development in the area of Forward Error Correction coding and are widely used in the communication industry. Turbo encoders are key elements in present-day communication systems for achieving data reception with few errors. The basis of turbo coding is to introduce redundancy into the data transmitted across the channel; this redundant data is used to recover the original data from the received data. Turbo codes approach the channel capacity, the theoretical maximum rate at which reliable communication is possible at a given code rate for the channel noise. They are used in satellite communication and other applications that require reliable transfer of information over limited bandwidth in the presence of noise.

The turbo encoder is compatible with three different standards: the 3rd Generation Partnership Project (3GPP), 3GPP2 and the Consultative Committee for Space Data Systems (CCSDS). The 3GPP and 3GPP2 standards are mainly used in WCDMA applications. Each encoder is a distinct entity, with its own interleaver and entirely different control logic. [1]

This project follows the LTE standard. LTE (Long Term Evolution, 3.9G) is a recent development of a high-performance air interface for communication systems, designed to improve the capacity and speed of mobile networks.

The 3GPP LTE turbo encoder implements the turbo convolutional encoding scheme defined in the 3GPP LTE specification. It has a 3GPP LTE interleaver block and supports all 188 block sizes in the range 40-6144 permitted by the specification. It is based on a double-buffered symbol memory scheme for maximum throughput performance, and it simplifies integration into the customer's system architecture by providing flexible control options. The 3GPP LTE turbo encoder fully obeys the 3GPP LTE specification and contains the 3GPP LTE interleaver and a bit-accurate model to speed up simulation. [2]

The turbo encoder is implemented on the DSP TMS320C6416 processor. A DSP is a programmable implementation option suitable for numerically intensive tasks, and it is more flexible than Application-Specific Integrated Circuits (ASICs) and Field Programmable Gate Arrays (FPGAs).

The signals generated by a digital signal processor are complex sums of many individual sine waves. The exact amplitude, frequency and phase of the waveform are calculated with Fourier transforms. The intrinsic parallelism in such operations makes them ideal candidates for a Very Long Instruction Word (VLIW) architecture.

The C6x chips used in this project operate on a very long 256-bit instruction word, which combines eight 32-bit instructions per cycle over two data paths. The chip is fabricated in 0.18 µm CMOS technology and achieves 2000 MIPS in TI's (Texas Instruments) testing, at speeds up to 1 gigaflop. Texas Instruments has met great success particularly in the field of embedded real-time image processing and in wireless communications; the C6x chip allows providers to reduce the size of their wireless base stations by nearly 75%. [3]

The main motivation behind this project is to gain a deeper understanding of turbo codes, to learn how to use appropriate standards and processors, and to produce an optimised implementation of the turbo encoder. An efficient implementation meets the real-time constraints that are the active concern of this project.

1.1 Goal of the thesis

The aim of the thesis is to implement the turbo encoder using DSP TMS320C6416 processor in Long Term Evolution (LTE) standard in order to improve the capacity and throughput performance.

1.2 Objective of the thesis

  • Compare and contrast the implementation of turbo encoding on different standards/processors.
  • Analyse and investigate the capacity and throughput performance of turbo encoders on different models.
  • Implement the code in DSP TMS320C6416 processor.
  • Simulate and test the implemented code.
  • Compare different standards/processors and evaluate pros and cons.

1.3 Organisation of thesis

The thesis is organised as follows.

Chapter 2 deals with classification of standards, description of Long Term Evolution and its specification overview and the performance goals of LTE.

Chapter 3 deals with overview of turbo codes, description of forward error correction and its classification of codes and finally with the design goals of LTE.

Chapter 4 deals with the architecture of LTE turbo encoder and its structure, LTE turbo encoder implementation and illustration of LTE turbo encoder Look Up Table (LUT).

Chapter 5 deals with the general overview of DSP processors, generic architecture of TMS320C6X DSP Processor and its internal bus structure, TMS320C6416 features, block diagram of C6416 DSK, functional overview of TMS320C6416 DSK and its board layout and finally with the DSP applications.

Chapter 6 deals with the Code Composer Studio IDE and its features.

Chapter 7 provides the C coding for the LTE turbo encoder.

Chapter 8 deals with the throughput performance calculation from the parameters such as block size and clock cycles.

Chapter 9 compares the fixed-point processor with the floating-point processor in terms of arithmetic format, data width, power consumption and management, memory organization and speed, followed by a summary; it then compares LTE with WiMax in terms of peak data rate and latency, and finally LTE with UMTS in terms of efficiency, flexibility, latency and cost.

1.4 Literature Review

According to Maurizio Martina, Mario Nicola and Guido Masera, the maximum throughput that UMTS can achieve for a block length of 5114 is 2 Mbps, while the maximum throughput achieved by WiMax with its maximum block length of 2400 is up to 70 Mbps. [4]

LTE can achieve higher throughput performance than UMTS and WiMax even with smaller block sizes, which range from 40 to 6144. One of the most powerful error correction mechanisms to have made a great impact on channel coding in recent years is the turbo code. It outperforms all other coding schemes by achieving near-Shannon-limit error correction using simple component codes and large interleavers.

Nowadays turbo codes achieve performance close to the best theoretical values, whereas convolutional codes, in terms of energy efficiency and channel capacity, perform roughly twice as badly as the theoretical bound suggests.

Channel code design involves a trade-off between bandwidth efficiency and energy efficiency. Lower-rate codes (i.e. with more redundancy) can correct more errors. If more errors are corrected, the communication system can operate with less transmit power, transmit over longer distances, use smaller antennas and transmit at a higher data rate; these make the code energy efficient. On the other hand, lower-rate codes carry a large overhead and consume more bandwidth. Turbo codes gain significantly more than convolutional codes from lowering the code rate. [5]

The fundamental difference between the two codes is that convolutional code performance improves with increasing constraint length, whereas for turbo codes the constraint length is a small value that remains constant. Furthermore, turbo codes achieve significant coding gain at lower coding rates.

The purpose of the interleaver is to randomise burst error patterns so that they can be decoded correctly, and it helps to maximise the distance of the turbo code. Different kinds of interleavers exist, and the choice depends on the use and need. [6]

Chapter 2: Classification of standards and overview of Long Term Evolution

2.1 Classification of standards

0G - Mobile Telephone before cellular

1G - Analogue Cellular

2G - Digital Cellular

2.5G - High Speed Packet Data

3G - WCDMA Multimedia

3.5G - HSPA - Faster than 3G, not as fast as LTE

3.9G - LTE - Ultra Broadband Packet Data

2.2 Description of Long Term Evolution

LTE (Long Term Evolution) is a project of the 3GPP (3rd Generation Partnership Project), operating under one of the trademarked associations within the partnership, the European Telecommunications Standards Institute.

LTE is an upcoming standard on the path towards 4G (4th Generation), designed to increase the capacity and throughput performance of mobile networks. LTE is marked as 3.9G because it does not fully comply with the 4G requirements. The main merits of LTE are low latency, high throughput, an improved end-user experience and a simple architecture that results in low operating costs.

LTE provides service beyond the true 3rd generation (3G) requirements, but at the same time it does not reach the service levels of the 4th generation (4G) requirements; it is therefore also called "beyond 3G".

Nowadays the growth of mobile data usage and the arrival of new applications such as mobile TV, Web 2.0 and streaming content have encouraged the 3rd Generation Partnership Project (3GPP) to work on Long Term Evolution (LTE). It is the current standard after the GSM/EDGE and UMTS/HSPA technologies.

Unlike HSPA (High Speed Packet Access), which was provided within the Release 99 UMTS architecture, 3GPP is specifying a new packet core, the Evolved Packet Core (EPC) network architecture, to sustain E-UTRAN. Through a reduction in the number of network elements, increased redundancy and simpler functionality, especially handover to fixed-line and wireless access technologies, it gives the service provider the ability to deliver a seamless mobility experience.

The aggressive performance targets of LTE depend on physical layer technologies such as Orthogonal Frequency Division Multiplexing (OFDM), Multiple-Input Multiple-Output (MIMO) systems and the correct antennas to attain these levels. The main targets are to reduce system complexity, to provide adaptable spectrum deployment in new or existing frequency spectrum, and to enable coexistence with other 3GPP Radio Access Technologies (RATs). LTE is backed by 3GPP service providers and other interested parties, who aim to finish and agree the E-UTRAN standards and the EPC. [7]

2.3 Performance goals for LTE

E-UTRA supports different kinds of services such as File Transfer Protocol (FTP), online gaming, web browsing, real-time video, VoIP, video streaming, push-to-talk and push-to-view. Hence LTE is designed as a high data rate, low latency system. The UE's bandwidth is expected to be 20 MHz for both transmission and reception, and the service provider may deploy cells in any of the bandwidths.

Apart from the above LTE metrics, it also aims at minimising cost and power consumption while ensuring backward compatibility and cost effectiveness. It also targets enhanced multicast services, enhanced Quality of Service (QoS), and a reduction in the number of options and redundant features in the architecture.

LTE spectral efficiency in the downlink (DL) is 3 to 4 times that of HSDPA Release 6, and in the uplink (UL) it is 2 to 3 times that of HSUPA Release 6. Handovers in LTE are designed to be seamless and are intended to reduce the interruption time.

The evolution to LTE is expected to significantly increase throughput and sector capacity and to reduce latency. LTE supports IP-based traffic with Quality of Service (QoS). [7]

Chapter 3: Forward Error Correction mechanism

3.1 Forward Error Correction

Forward Error Correction (FEC) is a system of error control for data transmission in which the sender appends some redundant data, called an error correction code, to the message. This permits the receiver to detect and correct errors without asking the sender for additional data. The merit of this technique is that retransmission of data can often be avoided, so it is employed in situations where retransmission is costly. The maximum number of correctable errors is determined in advance by the design of the code, and different error correcting codes are suited to different conditions.

FEC is achieved by adding redundant bits to the transmitted data using a predetermined algorithm. Each redundant bit is a complex function of several original information bits, which may or may not appear in the encoded output. Codes whose output includes the unmodified input are systematic, whereas non-systematic codes' outputs do not. [8]

Error Correction Codes: -

  • Convolutional (trellis) codes
  • Block codes
  • Concatenated codes
      • Turbo convolutional code
      • Turbo product code

3.2 Turbo Codes

Turbo codes are high-performance forward error correction codes at a given code rate. They are especially used in deep-space satellite communication and in applications that require reliable transfer of information over communication links in the presence of data-corrupting noise. At present these codes compete with Low Density Parity Check (LDPC) codes, which give similar performance.

The principle of the turbo code permits a near approach to the Shannon limit, which describes the maximum capacity of the channel. The invention of turbo codes and their performance initiated a renaissance of channel coding in practical applications. By finding the right settings for a particular system, the exceptional performance of turbo codes can be improved further. One important property of any turbo code system is the structure of the interleaver, which performs a permutation of the input bits; differential evolution and genetic algorithms can also be used to optimise the interleaver of a turbo code.

A turbo code is implemented as a parallel concatenation of two recursive systematic convolutional (RSC) encoders linked by a pseudo-random permutation (the interleaver). The encoder processes a long frame of information bits; the interleaver permutes this frame to produce the interleaved (permuted) frame. The first encoder, RSC1, encodes the original input, while the interleaved frame is encoded by RSC2. The two encoded parity streams are then merged with the original input bits to produce the output.
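The parallel concatenation described above can be sketched in C. This is a simplified illustration, not the thesis implementation: it uses the LTE 8-state constituent polynomials (feedback g0(D) = 1 + D^2 + D^3, forward g1(D) = 1 + D + D^3), omits trellis termination (tail bits), and the function and variable names are hypothetical.

```c
#include <assert.h>

/* One step of the 8-state RSC constituent encoder:
   feedback g0(D) = 1 + D^2 + D^3, forward g1(D) = 1 + D + D^3.
   *state holds the three shift-register bits. */
static int rsc_step(unsigned *state, int in)
{
    int r1 = (*state >> 2) & 1;     /* D   tap */
    int r2 = (*state >> 1) & 1;     /* D^2 tap */
    int r3 = *state & 1;            /* D^3 tap */
    int fb = in ^ r2 ^ r3;          /* recursive feedback bit */
    int parity = fb ^ r1 ^ r3;      /* forward (parity) output */
    *state = (unsigned)((fb << 2) | (r1 << 1) | r2);
    return parity;
}

/* Rate-1/3 PCCC: systematic copy of the input, parity from RSC1 on
   the original frame, parity from RSC2 on the interleaved frame.
   pi[] is the interleaver permutation. */
void turbo_encode(const int *x, const int *pi, int k,
                  int *sys, int *par1, int *par2)
{
    unsigned s1 = 0, s2 = 0;        /* both encoders start in state 0 */
    int i;
    for (i = 0; i < k; i++) {
        sys[i]  = x[i];
        par1[i] = rsc_step(&s1, x[i]);
        par2[i] = rsc_step(&s2, x[pi[i]]);   /* x'_i = x[pi(i)] */
    }
}
```

With an identity permutation the two parity streams coincide, which is a convenient sanity check for the constituent encoders.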

Turbo product codes are built on block codes, not on convolutional codes; they are constructed from 2-dimensional or 3-dimensional arrays of extended Hamming codes, with a single-iteration encoding process. The minimum distance of a 2-dimensional product code is the square of that of the constituent code, and for a 3-dimensional code it is cubed: for 2 dimensions the minimum distance is 16 and for 3 dimensions it is 64.

An extended Hamming code can correct only a single bit error, but the turbo product code can handle a burst of errors through the product code array. Data scrambling improves burst error performance, tolerating 384 burst errors in every code block, which represents 9.4% of the bits in the block. The 3-dimensional code performs even better, handling 1024 burst errors in every code block, or 25% of the bits in the block.

The turbo product code has exceptional performance at high code rates with no puncturing required. It has low complexity relative to its coding gain, hence lower cost and low power consumption, and it offers an important improvement over concatenated Reed-Solomon codes. Turbo product codes exist as standard products and licensable cores. The low-cost turbo encoder supports code rates from 1/5 to 19/20 with no puncturing required. Changing the code on the fly supports changing channel conditions, and there is zero latency with no tail biting required. It is easily adaptable to many constellations. [9]

3.3 Design Goals of FEC

System Design Goals of FEC are as follows: -

  • Improve data rate
  • Improve data reliability
  • Reduce transmission energy required
  • Reduce required bandwidth
  • Reduce system cost and complexity.

Forward error correction helps to achieve these goals. Forward error correction is the addition of redundancy to the message via encoding prior to transmission. "Code rate is the ratio of data bits to the total bits transmitted." "Error correction capability is characterised by the minimum distance of the error correction code, which determines the code strength." "The error correcting capability, t, of a code is defined as the maximum number of guaranteed correctable errors per codeword."
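The two quoted definitions can be stated directly in code. The following minimal C helpers (hypothetical names, added for illustration) express the code rate R = k/n and the standard relation t = (d_min - 1)/2 between minimum distance and guaranteed correctable errors.

```c
#include <assert.h>

/* Code rate R = k/n: ratio of data bits to total transmitted bits. */
double code_rate(int data_bits, int total_bits)
{
    return (double)data_bits / (double)total_bits;
}

/* Guaranteed correctable errors per codeword, t = (d_min - 1) / 2
   (integer division), for a code of minimum distance d_min. */
int correctable_errors(int d_min)
{
    return (d_min - 1) / 2;
}
```

For example, a rate-1/3 turbo code has R ≈ 0.33, and a code with d_min = 3 guarantees correction of a single error per codeword.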

Error correction is used because designers can choose between increased data reliability, decreased system cost or improved range. The coding gains can [8]

  • Decrease the required bandwidth by 50%, or
  • Improve range by 40%, or
  • Improve data throughput by a factor of 2, or
  • Decrease antenna size by 30%, or
  • Decrease transmitted power by a factor of 2.

Chapter 4: LTE Turbo Encoder

4.1 LTE turbo encoder Architecture

The 3GPP LTE specification uses a Parallel Concatenated Convolutional Code (PCCC) for turbo encoding. There are two constituent encoders, an upper encoder and a lower encoder, separated by an internal interleaver. Three outputs are generated: the systematic output, which is a replica of the input Xk, and the two outputs from the two encoders. One convolutional encoder encodes the information data coming from the input, while the interleaved data bits are encoded by the other convolutional encoder. [10]

The LTE turbo encoder is implemented with two 8-state constituent encoders and one turbo code internal interleaver. The interleaver's function is to permute codewords that have high weight in one encoder into low-weight codewords for the other encoder.

4.2 Structure of LTE Turbo Encoder

The 3GPP turbo encoder complies with the 3GPP LTE specification, and the interleaver block size can be chosen at run time. The code rate used in the 3GPP LTE turbo encoder is 1/3; other rates can be achieved by external rate matching. Double buffering allows the encoder to receive new data while processing the previous data block. For each bit of input data, the RSC encoder produces a systematic bit (the input bit) and two parity bits. [10]

The 8-state constituent code transfer functions for the PCCC are g0(D) = 1 + D^2 + D^3 and g1(D) = 1 + D + D^3. The initial values of the shift registers of the 8-state constituent encoders are all set to zero when encoding of the input bits begins. To allow a new data block to be shifted in while the previous block is being encoded, the data path is double-buffered; this reduces the delay in I/O operations, improves overall performance and keeps the hardware as busy as possible. The output is 3 bits per input bit. The number of clock cycles consumed to encode an entire block of data is known as the encoding delay (D). The loading delay, which is distinct from the encoding delay, requires the same number of clock cycles as the block size K to load the input data into the input buffer; the encoding delay does not include the loading delay. The encoding latency is the time taken by the encoder to encode an entire block. [10]
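The interaction between loading delay, encoding delay and double buffering can be illustrated with a simple timing model. This is an assumed model for illustration only: it takes the loading delay as K cycles, as stated above, and treats the encoding delay as a hypothetical cycles_per_bit * K.

```c
#include <assert.h>

/* Steady-state cycles per block with double buffering: loading of
   block n+1 overlaps encoding of block n, so the time per block is
   the larger of the two delays rather than their sum.  The linear
   encoding-delay model (cycles_per_bit * K) is an assumption. */
int steady_state_cycles(int K, int cycles_per_bit)
{
    int load   = K;                  /* loading delay = K cycles */
    int encode = cycles_per_bit * K; /* assumed encoding delay   */
    return load > encode ? load : encode;
}
```

Without double buffering the per-block cost would be load + encode; the overlap is what hides the loading delay behind the encoding delay.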

4.3 3GPP RSC Turbo Encoder Implementation

For the same input sequence, the recursive encoder produces codewords with higher weight than the non-recursive encoder. This results in fewer codewords with low weight and thus in better performance. For the 3GPP turbo encoder, the main purpose of implementing an RSC encoder is to exploit this recursive nature of the encoders. [11]

The 3GPP LTE turbo encoder contains two constituent encoders separated by an interleaver. The following figure shows the computational diagram of the 3GPP turbo recursive systematic code (RSC) encoder. Each RSC encoder contains a forward path with transfer function 1 + D + D^3 and a feedback path with transfer function 1 + D^2 + D^3.

In the RSC encoder we have k input delay registers. In contrast with non-recursive convolutional (NRC) codes, the input to every register is generated from the current input and from the state of the register. We precompute the interleaver addresses and store them in memory; the stored interleaved addresses are used for encoding as well as decoding. If the codeword size is big, we store the precomputed interleaver addresses in L3 memory; otherwise we store them in L1 memory. For larger codewords we use the window method to encode the bits and use direct memory access to transfer the data needed from L3, such as inputs and interleaver addresses.

From the above diagram, we output one systematic bit Xi and two parity bits Yi and Zi. The parity bit Zi does not depend directly on the actual input bit bi but on b'i, which comes from the interleaver buffer at index i. For a given input message block B of N bits, we either interleave the whole block B at once, or we perform address computation to fetch the interleaved bit b'i from block B for each input bit index, store it in an interleaved block B', and access b'i linearly with index i. For 3GPP, we compute the addresses and place them in memory once for all N bits before starting to encode the multiple message data blocks. In this way we reduce the complexity of the interleaver address generation of 3GPP turbo encoding.
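The precomputation described above can be sketched as follows. The permutation rule shown is the LTE quadratic permutation polynomial (QPP) internal interleaver, pi(i) = (f1·i + f2·i²) mod K, where f1 and f2 come from the specification's per-block-size table; the function name is hypothetical.

```c
#include <assert.h>

/* Precompute the interleaver addresses once and store them, so that
   during encoding b'_i is simply x[addr[i]] instead of re-deriving
   pi(i) per bit.  pi(i) = (f1*i + f2*i*i) mod K is the LTE QPP rule;
   f1 and f2 are the per-block-size constants from the specification. */
void precompute_interleaver(int *addr, int K, int f1, int f2)
{
    long i;                          /* long avoids overflow in f2*i*i */
    for (i = 0; i < K; i++)
        addr[i] = (int)((f1 * i + f2 * i * i) % K);
}
```

A valid QPP is a true permutation of 0..K-1, so every address appears exactly once; this is easy to verify after filling the table.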

For simple coding, we encode bit by bit, producing 3 output bits for each input bit. Normally the input data bits are accessed from memory in 8-bit bytes, since a byte is the smallest unit the processor reads from memory. After coding, we have to pack the coded bits and place them in memory for further processing and transmission.
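The packing step can be sketched in C as follows. MSB-first packing is an assumption made for illustration; the actual bit order must follow the specification.

```c
#include <assert.h>

/* Pack coded bits (one bit per int) into bytes, MSB first, for
   storage and transmission.  n need not be a multiple of 8; the
   final byte is zero-padded. */
void pack_bits(const int *bits, int n, unsigned char *out)
{
    int i;
    for (i = 0; i < n; i++) {
        if ((i & 7) == 0)
            out[i >> 3] = 0;         /* clear each byte before filling */
        out[i >> 3] |= (unsigned char)((bits[i] & 1) << (7 - (i & 7)));
    }
}
```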

A look-up table (LUT) is used here for the interleaver addresses instead of computing them on the fly, so cycles are spent only on the table accesses. It takes three cycles to interleave one data bit. Hence 3GPP encoding takes fewer cycles to perform the operation, which reduces the computational complexity; the LUT increases the computational efficiency. [12]

4.4 LTE turbo encoder Look Up Table (LUT)

The 3GPP turbo encoder is an expensive module at higher bit rates if it is not properly implemented. Here the turbo encoder is split into two parts: the first part deals with encoding of the bits and the second with interleaving of the data bits.

A 5-bit offset into the 32-entry look-up table is computed from the three input bits and the remaining current-state bits. We have to extract the current state from the look-up table output of the previous coding step and shift the state bits accordingly.

We can avoid the shift and extract operations on the state bits by properly designing the look-up table. If two bytes are used per look-up table entry and the state bits are stored in the shifted position, as shown in the following diagram, then we save 50% of the offset-calculation cycles. [12]

The computational load of the turbo code algorithm implementation is relatively modest compared with the memory required for the tasks. With a look-up table for turbo encoding we store the precomputed encoding information, so less data memory is needed for storing the interleaved data. Computing the interleaver addresses on the fly would be costly compared with using the precomputed values. By using the table, we are able to perform turbo encoding using only 18% of the processor's MIPS, where it would otherwise consume more than 55%. Hence a look-up table is an efficient way to store the precomputed values required for computation, reducing the overall memory requirement compared with simpler methods. [12]

Chapter 5: TMS320C6416 DSP processor

5.1 General Overview of DSP processors

DSP processors are microprocessors designed to perform digital signal processing, the mathematical manipulation of digital signals. Digital signal processing is one of the vital technologies in fast-growing application areas such as wireless communication, industrial control, and audio and video processing. The number of DSP-capable processors has grown along with the popularity of DSP applications since the first commercially successful DSP chip was introduced in the early 1980s.

Nowadays DSP processors are sophisticated devices with remarkable capabilities. Many DSP processors share basic features designed to deliver high performance on numerically demanding, repetitive tasks. One of the main features of a DSP is the ability to perform one or more MAC (multiply-accumulate) operations in a single instruction cycle. This is very useful in DSP algorithms that involve computing a vector dot product, such as correlation, digital filters and Fourier transforms. To accomplish a single-cycle MAC, DSPs integrate multiply-accumulate hardware into the core data path of the processor, and to allow multiply-accumulate operations to be performed in parallel, some DSP processors provide two or more multiply-accumulate units. Arithmetic overflow is the generation of numbers larger than the highest value the processor's accumulator can hold. DSP processors permit long series of multiply-accumulate operations to continue without the possibility of arithmetic overflow by providing extra guard bits in the accumulator.
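In C terms, the MAC operation described above is the inner loop of a dot product; a wider accumulator plays the role of the guard bits. A minimal sketch:

```c
#include <assert.h>

/* FIR-style multiply-accumulate: each loop iteration corresponds to
   one MAC on the DSP.  The 64-bit accumulator stands in for the
   guard bits that prevent overflow across many accumulations of
   16-bit products. */
long long dot_product(const short *x, const short *h, int n)
{
    long long acc = 0;
    int i;
    for (i = 0; i < n; i++)
        acc += (long long)x[i] * h[i];
    return acc;
}
```

On a processor with several MAC units, the compiler can unroll this loop so that multiple products are accumulated per cycle.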

A second feature of DSP processors is the ability to complete several memory accesses in a single instruction cycle. This allows the processor to fetch an instruction while fetching operands and/or storing the result of a previous operation to memory. In the vector dot product calculation for an FIR filter, for example, many DSP processors can perform a MAC while simultaneously loading the sample data and coefficient for the next MAC. Several restrictions are imposed on single-cycle multiple memory access; for instance, one of the memory locations accessed must reside on-chip. DSP processors support multiple on-chip buses, sometimes multiple independent memory banks, and multi-ported on-chip memories to provide concurrent access to multiple memory locations.

A third feature used to increase arithmetic throughput on DSP processors is dedicated address generation units. Once the appropriate addressing registers have been configured, the address generation unit works in the background without involving the processor's core data path, creating operand access addresses in parallel with the execution of arithmetic instructions; a general-purpose processor, in contrast, needs additional cycles to generate the addresses used to load operands. The address generation units of a DSP processor provide a selection of addressing modes tailored to DSP applications. Register-indirect addressing with post-increment is used in repetitive computations on data stored sequentially in memory, and modulo addressing is used to simplify the use of circular buffers.
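Modulo addressing for a circular buffer can be emulated in C as follows; on the DSP the wrap-around is performed by the address generation unit in hardware, while here it is an explicit mask. The type and function names are hypothetical, for illustration only.

```c
#include <assert.h>

/* A small delay line with modulo (circular) addressing: the write
   index wraps instead of running off the end of the buffer.  The
   power-of-two size lets the wrap be a single AND, mirroring how
   hardware modulo addressing avoids an explicit compare-and-reset. */
typedef struct {
    short buf[8];
    int idx;
} delay_line;

void delay_push(delay_line *d, short sample)
{
    d->buf[d->idx] = sample;
    d->idx = (d->idx + 1) & 7;      /* modulo-8 wrap */
}
```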

DSP algorithms involve repetitive computation, and many DSP processors provide special support for efficient looping. Repeat or loop instructions permit the programmer to implement a for-next loop without spending any instruction cycles on updating and testing the loop counter or branching back to the top of the loop.

Lastly, to permit low-cost, high-performance input and output, many DSP processors include one or more parallel or serial I/O interfaces and specialised I/O handling mechanisms, such as direct memory access (DMA) and low-overhead interrupts, that allow data transfers to proceed with almost no intervention from the rest of the processor. The popularity of DSP functions such as speech coding and audio processing has led designers to implement DSP on general-purpose processors such as microcontrollers and desktop CPUs, and most general-purpose processor makers are now adding signal processing capabilities to their chips.

Sometimes system designers prefer a general-purpose processor to a DSP processor. Even though it needs many instructions to perform operations that could be done by a single DSP processor instruction, some general-purpose processors run at high clock speeds. At the same time they lack the features that simplify DSP programming, so software development is more difficult than on DSP processors and can result in awkward code that is complicated to maintain. If general-purpose processors are used only for signal processing, they are not cost effective compared with DSP chips designed especially for the task. Hence most system designers continue to use traditional DSP processors for the most demanding DSP applications. [13]

Most components of a signal processing system implemented on a DSP processor are computationally efficient. The choice of DSP processor is application dependent, and depends on factors including performance, power consumption, cost, interfacing capabilities, integration capabilities, time to market and ease of use.

5.2 TMS320C6X DSP Processors

The TMS320C6x processor family manufactured by Texas Instruments is built for speed: the processors are designed to achieve high Million Instructions Per Second (MIPS) rates for intensive applications such as digital video. The versions of the processors in the family differ in speed, packaging, power consumption, memory, timing, peripherals and cost. For example, the fixed-point C6416 version operates at 600 MHz (a 1.67 ns cycle time), delivering a peak performance of 4800 MIPS. Similarly, the C6713-225 version operates at 225 MHz (a 4.4 ns cycle time), delivering a peak performance of 1350 MIPS.
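The quoted MIPS figures follow directly from the clock rate and the eight instructions the VLIW core can issue per cycle; a trivial check:

```c
#include <assert.h>

/* Peak MIPS for a VLIW DSP: clock rate in MHz times instructions
   issued per cycle.  The C64x issues up to 8 instructions/cycle. */
int peak_mips(int clock_mhz, int instr_per_cycle)
{
    return clock_mhz * instr_per_cycle;
}
```

At 600 MHz this gives the 4800 MIPS quoted above, and at 720 MHz the 5760 MIPS quoted later for the C6416.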

The following figure shows the generic block diagram of the C6x architecture. The central processing unit (CPU) of the C6x comprises eight functional units divided into two sides, A and B. Each side has four units: a .M unit (used for multiplication operations), a .S unit (used for branch, bit manipulation and arithmetic operations), a .L unit (used for logical and arithmetic operations) and a .D unit (used for loading, storing and arithmetic operations). Some instructions, such as ADD, can be executed by more than one unit. Each side contains sixteen 32-bit registers; interaction with the CPU must go through these registers. [14]

As shown in the following figure, the internal bus comprises a 32-bit program address bus, a 256-bit program data bus carrying eight 32-bit instructions, two 32-bit load data buses (LD1 and LD2), two 32-bit data address buses and two 32-bit store data buses (ST1 and ST2). In addition, there are a 32-bit direct memory access (DMA) address bus and a 32-bit DMA data bus. Off-chip (external) memory is accessed through a 20-bit address bus and a 32-bit data bus.

The peripherals on a typical C6x processor include External Memory Interface (EMIF), DMA, Boot Loader, Multi-channel Buffered Serial Port (McBSP), and Host Port Interface (HPI), Timer, and Power Down unit. EMIF provides the necessary timing for accessing external memory. DMA allows the movement of data from one place in memory to another place without interfering with the CPU operation. Boot Loader boots the code from off-chip memory or HPI to the internal memory. McBSP provides a high-speed multi-channel serial communication link. HPI allows a host to access the internal memory. Timer provides two 32-bit counters. The Power Down unit is used to save power for durations when the CPU is inactive. [13]

For Pipelined CPU, there are generally three basic steps to perform the instruction. Steps include fetching, decoding and execution. DSP CPUs are designed to pipeline in order to increase the throughput.


A pipelined CPU requires fewer clock cycles to finish the same number of instructions. The C6x architecture is based on the Very Long Instruction Word (VLIW) architecture, in which several instructions are fetched and processed concurrently.

The C64x is a DSP core that achieves higher MIPS by operating at higher clock rates. It operates in the range of 300-1000 MHz, giving a processing power of 2400-8000 MIPS. [14]

The TMS320C6416 is a fixed-point processor belonging to the highest-performance fixed-point DSP generation in the DSP platform developed by Texas Instruments (TI), making these DSPs an outstanding choice for multichannel and multifunction applications. The C64x provides cost-effective solutions to high-performance programming challenges, with performance of up to 5760 million instructions per second (MIPS) at a clock rate of 720 MHz.

These processors combine the numerical capability of array processors with the flexible operation of high-speed controllers. The C64x can produce eight 8-bit multiply-accumulates (MACs) per cycle, for a total of 5760 million MACs per second (MMACS), or four 16-bit MACs per cycle, for a total of 2880 MMACS.

This processor has application-specific hardware logic and on-chip memory, and also two high-performance embedded coprocessors [the Viterbi Decoder Coprocessor (VCP) and the Turbo Decoder Coprocessor (TCP)] which drastically speed up channel-decoding operations on-chip. The TCP is designed to support all polynomials and rates specified by the Third-Generation Partnership Project (3GPP), with a turbo interleaver and fully programmable frame length.

The C64x uses a two-level cache-based architecture and has a powerful and diverse set of peripherals. The Level 1 data cache (L1D) is a 128-Kbit 2-way set-associative cache, and the Level 1 program cache (L1P) is a 128-Kbit direct-mapped cache. The Level 2 memory/cache (L2) comprises an 8-Mbit memory space, which is shared between program and data space. L2 memory may be configured as combinations of cache (up to 256K bytes) and mapped memory.

The peripheral set comprises three multichannel buffered serial ports (McBSPs); an 8-bit Universal Test and Operations PHY Interface for ATM (UTOPIA); a user-configurable 16-bit or 32-bit host-port interface; three 32-bit general-purpose timers; a general-purpose input/output port; a peripheral component interconnect (PCI) port; and two glueless external memory interfaces, both capable of interfacing to synchronous and asynchronous memories and peripherals. The C64x is supported by a full set of development tools, including an advanced C compiler with C64x-specific enhancements and an assembly optimizer that simplifies scheduling and programming. [15]

5.3 TMS320C6416 Features

v Highest-Performance Fixed-Point Digital Signal Processors (DSPs)

  • Instruction cycle time of 1.39-2ns
  • Clock rate of 500-720 MHz
  • Twenty eight operations per cycle
  • Eight 32-bit instructions per cycle
  • 4000-5760 MIPS
  • It is software compatible with C62x
  • It is device pin compatible

v Advanced Very Long Instruction Word (VLIW) DSP Core

  • There are six ALUs; each Supports Single 32-Bit, Dual 16-Bit, or Quad 8-Bit Arithmetic per Clock Cycle
  • Two Multipliers Support Four 16 x 16-Bit Multiplies per Clock Cycle or Eight 8 x 8-Bit Multiplies per Clock Cycle
  • Code size will be reduced by instruction packing
  • Non-Aligned Load-Store Architecture
  • Sixty four 32-Bit General-Purpose Registers
  • All Conditional Instructions

v Instruction Set Features

  • Byte-Addressable (8-/16-/32-/64-Bit Data)
  • 8-Bit Overflow Protection
  • Clear, Set, field extract
  • Bit-Counting, Normalization, Saturation

v Viterbi Decoder Coprocessor (VCP)

  • Supports over 600 7.95-Kbps AMR channels
  • Programmable Code Parameters

v Turbo Decoder Coprocessor (TCP)

  • Supports up to seven 2-Mbps or 43 384-Kbps 3GPP channels (6 iterations)
  • Programmable Turbo Code and Decoding Parameters

v L1/L2 Memory Architecture

  • 128K-Bit (16K-Byte) L1P Program Cache (Direct Mapped)
  • 128K-Bit (16K-Byte) L1D Data Cache (2-Way Set-Associative)
  • 8M-Bit (1024K-Byte) L2 Unified Mapped RAM/Cache (Flexible Allocation)

v Two External Memory Interfaces (EMIFs)

  • One 64-Bit (EMIFA), One 16-Bit (EMIFB)
  • Glueless Interface to Asynchronous Memories (SRAM and EPROM) and Synchronous Memories (SDRAM, SBSRAM, ZBT SRAM). [16]

5.4 Block diagram of C6416 DSK

The C6416 DSK is a standalone, low-cost development platform that enables users to evaluate and develop applications for the TI C64xx DSP family. It also provides a hardware reference design for the TMS320C6416 DSP. Application notes, logic equations and schematics are available to simplify hardware development and reduce time to market. The following figure is the block diagram of the C6416 DSK. [16]

The DSK comes with on-board devices that suit a large variety of application environments. The key features comprise:

  • TI TMS320C6416 DSP which operates at 600MHz
  • Stereo codec AIC23
  • 16 MB of synchronous dynamic random access memory (SDRAM)
  • Non-volatile flash memory of 512 KB
  • DIP switches and four user accessible LEDs
  • Boot configurable options
  • Configuration of software board through registers implemented in Complex Programmable Logic Device (CPLD)
  • Expansion connectors
  • JTAG emulation through an on-board JTAG emulator with a Universal Serial Bus (USB) host interface, or through an external emulator
  • Power supply voltage-single (+5V)

5.5 Functional Overview of TMS320C6416 DSK

The DSP on the 6416 DSK interfaces to on-board peripherals via one of two buses: the 8-bit wide EMIFB and the 64-bit wide EMIFA. The CPLD, flash and SDRAM are each connected to one of the buses. EMIFA is also connected to the daughter-card expansion connectors, which are used for third-party add-in boards. The on-board AIC23 codec allows the DSP to transmit and receive analog signals. McBSP2 is used for data and McBSP1 is used for the codec control interface. Four 3.5 mm audio jacks are used for analog I/O, corresponding to line input, line output, microphone input and headphone output. The codec can select either the line input or the microphone as the active input. The analog output is driven to both the headphone connector (with adjustable gain) and the line-out connector (with fixed gain). McBSP1 and McBSP2 can be re-routed to the expansion connectors in software.

The CPLD is a programmable logic device used to implement the glue logic that ties the board components together. It has a register-based user interface that lets the user configure the board by reading and writing to the CPLD registers. The DSK has 4 LEDs and a 4-position DIP switch, which provide the user with interactive feedback; both are accessed by reading and writing to the CPLD registers. An included 5 V external power supply is used to power the board. On-board switching voltage regulators provide the 1.4 V DSP core voltage and the 3.3 V I/O supplies. The board is held in reset until these supplies are within operating specifications. A separate regulator powers the 3.3 V lines on the expansion interface. Code Composer communicates with the DSK through an embedded JTAG emulator with a USB host interface. The DSK may also be used with an external emulator through the external JTAG connector. [17]

5.6 Applications

DSP processors are used in a variety of applications, from radar systems to consumer electronics. Obviously, one processor cannot meet the needs of all, or even most, applications. Hence, in selecting a processor, the designer's first task is to weigh the relative importance of performance, cost, ease of development, integration, power consumption and other factors for the application at hand.

The largest applications for digital signal processors, in terms of dollar volume, are inexpensive, high-volume embedded systems such as portable digital audio players, cellular telephones and disk drives, in which DSPs are used for servo control. In these applications, cost and integration are vital. Power consumption is critical for portable, battery-powered products. Even though these applications involve developing custom software to run on the DSP, ease of development is usually less important: high manufacturing volumes justify expending greater development effort.

The next important class of applications involves processing huge volumes of data with complex algorithms for specialized needs. In cases such as sonar and seismic exploration, production volumes are small, but the product designs are large and complex and the algorithms are more demanding. Consequently, designers favour processors with high performance, good ease of use, and support for multiprocessor configurations. In some cases, rather than designing their own hardware, designers assemble such systems from off-the-shelf development boards and reduce their software development tasks by using existing function libraries as the basis of their application software. [17]

Chapter 6: Code Composer Studio

6.1 Description

Code Composer Studio is an Integrated Development Environment (IDE) for developing ARM (Advanced RISC (Reduced Instruction Set Computer) Machine) and/or DSP code for the TMS320 DSP processor family and other TI (Texas Instruments) processors such as the MSP430 (Mixed Signal Processor). The current version is based on a customized version of the Eclipse IDE. It comprises precise instruction-set simulators and supports JTAG (Joint Test Action Group) based debugging. It includes a set of tools used to develop and debug embedded applications: a source code editor, project build environment, debugger, profiler, simulators, support for TI's device families and many more features. Code Composer Studio is one of the main components of the eXpressDSP software and development tools, which reduce integration and development time for DSP software. [18] Code Composer Studio IDE v3.x was the first intelligent development environment to support application development for multi-user, multi-processor and multi-site projects. [19]

6.2 Features of Code Composer Studio

The CCS IDE offers a single user interface that takes you through every step of the application development flow. Familiar tools and interfaces allow users to get started faster than ever and to add functionality to their applications. [18]

CCS 4 is based on the Eclipse open-source software framework. The Eclipse framework is used for several different kinds of applications, although it was initially developed as an open framework for producing development tools. It provides an excellent basis for building software development environments and has become a standard framework used by several embedded software vendors. Combining the merits of the Eclipse framework with advanced embedded debug capabilities from TI results in a compelling, feature-rich development environment for embedded developers. [18]

Code Composer Studio integrates all host and target tools in a unified environment to simplify DSP system configuration and application design. This easy-to-use development environment gives DSP designers of all experience levels full access to all phases of the code development process. Its open architecture allows TI and third parties to extend the IDE's functionality by seamlessly plugging in additional specialized tools. The environment integrates conventional tools for editing, building, code profiling, debugging and project management. Further advanced features are also built into the Code Composer Studio IDE user interface, such as multi-processor support, signal probing, system and data visualization, and a C-based scripting language for customization and automated testing. [19]

Chapter 7: C coding for LTE turbo encoder

#include "dsk6416.h"

#include "DSK6416_aic23.h"








/* Defining the packet size of turbo code (the value below is assumed
   from the largest case handled by the switch in SetTurboInterleaver) */
#define MAX_TURBO_PACKET_SIZE 4096

/* turbo interleaver lookup table for n = 3 */
unsigned int turbo_lookup_table_3[32] =
{ 1, 1, 3, 5, 1, 5, 1, 5,
  3, 5, 3, 5, 3, 5, 5, 1,
  3, 5, 3, 5, 3, 5, 5, 5,
  1, 5, 1, 5, 3, 5, 5, 3 };


/* turbo interleaver lookup table for n = 4 */

unsigned int turbo_lookup_table_4[32] =

{ 5, 15, 5, 15, 1, 9, 9, 15,

13, 15, 7, 11, 15, 3, 15, 5,

13, 15, 9, 3, 1, 3, 15, 1,

13, 1, 9, 15, 11, 3, 15, 5};

/* turbo interleaver lookup table for n = 5 */

unsigned int turbo_lookup_table_5[32] =
{ 27,  3,  1, 15, 13, 17, 23, 13,
   9,  3, 15,  3, 13,  1, 13, 29,
  21, 19,  1,  3, 29, 17, 25, 29,
   9, 13, 23, 13, 13,  1, 13, 13 };


/* turbo interleaver lookup table for n = 6 */

unsigned int turbo_lookup_table_6[32] =
{  3, 27, 15, 13, 29,  5,  1, 31,
   3,  9, 15, 31, 17,  5, 39,  1,
  19, 27, 15, 13, 45,  5, 33, 15,
  13,  9, 15, 31, 17,  5, 15, 33 };


/* turbo interleaver lookup table for n = 7 */

unsigned int turbo_lookup_table_7[32] =
{ 15, 127,  89,   1,  31,  15,  61,  47,
 127,  17, 119,  15,  57, 123,  95,   5,
  85,  17,  55,  57,  15,  41,  93,  87,
  63,  15,  13,  15,  81,  57,  31,  69 };


/* for storing the new addresses after turbo interleaving */

int turbo_interleaved_addr[MAX_TURBO_PACKET_SIZE];

/* length of the new addresses after turbo interleaving */

int turbo_interleaved_len;

/* shift register used in first encoder */

int turbo_encoder_shift_reg_1[3];

/* shift register used in second encoder */

int turbo_encoder_shift_reg_2[3];

int SetTurboInterleaver(int packet_size)
{
    unsigned int z;
    unsigned int bit_number;
    unsigned int n_turbo;
    unsigned int n_mask;
    unsigned int count;
    unsigned int cnt_msb_n;
    unsigned int cnt_lsb_5;
    unsigned int cnt_lsb_5_rev;
    unsigned int lookup_bit_n;
    unsigned int *p_lookup_table;
    unsigned int addr_new;

    /* calculate n_turbo and n_mask */
    n_turbo = packet_size - 6;

    switch ( packet_size )
    {
    case 256:
        z = 3;
        n_mask = 0x07;
        p_lookup_table = turbo_lookup_table_3;
        break;
    case 512:
        z = 4;
        n_mask = 0x0F;
        p_lookup_table = turbo_lookup_table_4;
        break;
    case 1024:
        z = 5;
        n_mask = 0x1F;
        p_lookup_table = turbo_lookup_table_5;
        break;
    case 2048:
        z = 6;
        n_mask = 0x3F;
        p_lookup_table = turbo_lookup_table_6;
        break;
    case 4096:
        z = 7;
        n_mask = 0x7F;
        p_lookup_table = turbo_lookup_table_7;
        break;
    default:
        return 0; /* unsupported packet size */
    }


    bit_number = 0;
    count = 0;

    /* Interleaved address generation */
    while ( bit_number < n_turbo )
    {
        /* Getting the 5 LSBs of the counter */
        cnt_lsb_5 = count & 0x1F;

        /* Getting n bits from the lookup table */
        lookup_bit_n = p_lookup_table[cnt_lsb_5];

        /* Reversing the 5 LSBs */
        cnt_lsb_5_rev = ((cnt_lsb_5 & 0x1) << 4) | ((cnt_lsb_5 & 0x2) << 2) | (cnt_lsb_5 & 0x4) | ((cnt_lsb_5 & 0x8) >> 2) | ((cnt_lsb_5 & 0x10) >> 4);

        /* Getting the n MSBs of the counter */
        cnt_msb_n = (count >> 5) & n_mask;

        /* Add 1 to n MSBs and get n LSBs */
        cnt_msb_n = (cnt_msb_n + 1) & n_mask;

        /* Multiply n lookup bits and get n LSBs */
        cnt_msb_n = (cnt_msb_n * lookup_bit_n) & n_mask;

        /* Generating new interleaved address */
        addr_new = (cnt_lsb_5_rev << z) | cnt_msb_n;

        /* Verify the validity of the new address */
        if ( addr_new < n_turbo )
        {
            turbo_interleaved_addr[bit_number] = addr_new;
            bit_number++;
        }

        /* Incrementing the counter */
        count++;
    }

    turbo_interleaved_len = n_turbo;

    return 1;
}


/* Turbo encoder initialization */
int SetTurboEncoder(int pkt_size)   /* packet size - no. of bits in single packet */
{
    int m;

    for (m = 0; m < 3; m++)
    {
        turbo_encoder_shift_reg_1[m] = 0;
        turbo_encoder_shift_reg_2[m] = 0;
    }

    if ( !SetTurboInterleaver(pkt_size) )
    {
        return 0; /* If ok return 1, otherwise return 0 */
    }

    return 1;
}


/* Run the recursive systematic convolutional encoder with one bit input and three bits output */
void RunRecConvEncoder(int shift_reg[], int bit_input, int bit_output[])
{
    int recur_sum;

    recur_sum = bit_input ^ shift_reg[1] ^ shift_reg[2];

    bit_output[0] = bit_input;
    bit_output[1] = recur_sum ^ shift_reg[0] ^ shift_reg[2];
    bit_output[2] = recur_sum ^ shift_reg[0] ^ shift_reg[1] ^ shift_reg[2];

    shift_reg[2] = shift_reg[1];
    shift_reg[1] = shift_reg[0];
    shift_reg[0] = recur_sum;
}


/* Running turbo encoder */
int RunTurboEncoder(int bit_input[], int length_input, int encode_rate, int bit_output[], int *ptr_length_output)
{
    int s;
    int valid_length;
    int length_output;
    int encoded_bit_1[3]; /* For 3 bits generation */
    int encoded_bit_2[3];

    valid_length = length_input - 6;
    length_output = 0;

    /* Using switch case for different code rates */
    switch ( encode_rate )
    {


    /* Encoding rate of 1/2 */
    case 1:
        /* Encode the first valid length bits */
        for (s = 0; s < valid_length; s++)
        {
            RunRecConvEncoder(turbo_encoder_shift_reg_1, bit_input[s], encoded_bit_1);
            RunRecConvEncoder(turbo_encoder_shift_reg_2, bit_input[turbo_interleaved_addr[s]], encoded_bit_2);

            if ( s % 2 == 0 )
            {
                bit_output[length_output++] = encoded_bit_1[0];
                bit_output[length_output++] = encoded_bit_1[1];
            }
            else
            {
                bit_output[length_output++] = encoded_bit_1[0];
                bit_output[length_output++] = encoded_bit_2[1];
            }
        }

        /* Deliver the encoded tail bits of first encoder */
        for (s = 0; s < 3; s++)
        {
            bit_output[length_output++] = turbo_encoder_shift_reg_1[1] ^ turbo_encoder_shift_reg_1[2];
            bit_output[length_output++] = turbo_encoder_shift_reg_1[0] ^ turbo_encoder_shift_reg_1[2];
            turbo_encoder_shift_reg_1[2] = turbo_encoder_shift_reg_1[1];
            turbo_encoder_shift_reg_1[1] = turbo_encoder_shift_reg_1[0];
            turbo_encoder_shift_reg_1[0] = 0;
        }

        /* Deliver the encoded tail bits of second encoder */
        for (s = 0; s < 3; s++)
        {
            bit_output[length_output++] = turbo_encoder_shift_reg_2[1] ^ turbo_encoder_shift_reg_2[2];
            bit_output[length_output++] = turbo_encoder_shift_reg_2[0] ^ turbo_encoder_shift_reg_2[2];
            turbo_encoder_shift_reg_2[2] = turbo_encoder_shift_reg_2[1];
            turbo_encoder_shift_reg_2[1] = turbo_encoder_shift_reg_2[0];
            turbo_encoder_shift_reg_2[0] = 0;
        }
        break;



    /* Encoding rate of 1/3 */
    case 2:
        /* Encode the first valid length bits */
        for (s = 0; s < valid_length; s++)
        {
            RunRecConvEncoder(turbo_encoder_shift_reg_1, bit_input[s], encoded_bit_1);
            RunRecConvEncoder(turbo_encoder_shift_reg_2, bit_input[turbo_interleaved_addr[s]], encoded_bit_2);

            bit_output[length_output++] = encoded_bit_1[0];
            bit_output[length_output++] = encoded_bit_1[1];
            bit_output[length_output++] = encoded_bit_2[1];
        }

        /* Deliver the encoded tail bits of first encoder */
        for (s = 0; s < 3; s++)
        {
            bit_output[length_output++] = turbo_encoder_shift_reg_1[1] ^ turbo_encoder_shift_reg_1[2];
            bit_output[length_output++] = turbo_encoder_shift_reg_1[1] ^ turbo_encoder_shift_reg_1[2];
            bit_output[length_output++] = turbo_encoder_shift_reg_1[0] ^ turbo_encoder_shift_reg_1[2];
            turbo_encoder_shift_reg_1[2] = turbo_encoder_shift_reg_1[1];
            turbo_encoder_shift_reg_1[1] = turbo_encoder_shift_reg_1[0];
            turbo_encoder_shift_reg_1[0] = 0;
        }

        /* Deliver the encoded tail bits of second encoder */
        for (s = 0; s < 3; s++)
        {
            bit_output[length_output++] = turbo_encoder_shift_reg_2[1] ^ turbo_encoder_shift_reg_2[2];
            bit_output[length_output++] = turbo_encoder_shift_reg_2[1] ^ turbo_encoder_shift_reg_2[2];
            bit_output[length_output++] = turbo_encoder_shift_reg_2[0] ^ turbo_encoder_shift_reg_2[2];
            turbo_encoder_shift_reg_2[2] = turbo_encoder_shift_reg_2[1];
            turbo_encoder_shift_reg_2[1] = turbo_encoder_shift_reg_2[0];
            turbo_encoder_shift_reg_2[0] = 0;
        }
        break;



    /* Encoding rate of 1/4 */
    case 3:
        /* Encode the first valid length bits */
        for (s = 0; s < valid_length; s++)
        {
            RunRecConvEncoder(turbo_encoder_shift_reg_1, bit_input[s], encoded_bit_1);
            RunRecConvEncoder(turbo_encoder_shift_reg_2, bit_input[turbo_interleaved_addr[s]], encoded_bit_2);

            if ( s % 2 == 0 )
            {
                bit_output[length_output++] = encoded_bit_1[0];
                bit_output[length_output++] = encoded_bit_1[1];
                bit_output[length_output++] = encoded_bit_1[2];
                bit_output[length_output++] = encoded_bit_2[2];
            }
            else
            {
                bit_output[length_output++] = encoded_bit_1[0];
                bit_output[length_output++] = encoded_bit_1[1];
                bit_output[length_output++] = encoded_bit_2[1];
                bit_output[length_output++] = encoded_bit_2[2];
            }
        }

        /* Deliver the encoded tail bits of first encoder */
        for (s = 0; s < 3; s++)
        {
            bit_output[length_output++] = turbo_encoder_shift_reg_1[1] ^ turbo_encoder_shift_reg_1[2];
            bit_output[length_output++] = turbo_encoder_shift_reg_1[1] ^ turbo_encoder_shift_reg_1[2];
            bit_output[length_output++] = turbo_encoder_shift_reg_1[0] ^ turbo_encoder_shift_reg_1[2];
            bit_output[length_output++] = turbo_encoder_shift_reg_1[0] ^ turbo_encoder_shift_reg_1[1] ^ turbo_encoder_shift_reg_1[2];
            turbo_encoder_shift_reg_1[2] = turbo_encoder_shift_reg_1[1];
            turbo_encoder_shift_reg_1[1] = turbo_encoder_shift_reg_1[0];
            turbo_encoder_shift_reg_1[0] = 0;
        }

        /* Deliver the encoded tail bits of second encoder */
        for (s = 0; s < 3; s++)
        {
            bit_output[length_output++] = turbo_encoder_shift_reg_2[1] ^ turbo_encoder_shift_reg_2[2];
            bit_output[length_output++] = turbo_encoder_shift_reg_2[1] ^ turbo_encoder_shift_reg_2[2];
            bit_output[length_output++] = turbo_encoder_shift_reg_2[0] ^ turbo_encoder_shift_reg_2[2];
            bit_output[length_output++] = turbo_encoder_shift_reg_2[0] ^ turbo_encoder_shift_reg_2[1] ^ turbo_encoder_shift_reg_2[2];
            turbo_encoder_shift_reg_2[2] = turbo_encoder_shift_reg_2[1];
            turbo_encoder_shift_reg_2[1] = turbo_encoder_shift_reg_2[0];
            turbo_encoder_shift_reg_2[0] = 0;
        }
        break;

    default:
        return 0; /* unsupported encoding rate */
    }

    *ptr_length_output = length_output;

    return 1;
}


Chapter 8: Throughput Performance Calculation

Throughput: -

According to Maurizio Martina, Mario Nicola and Guido Masera [4], the maximum throughput that can be achieved by UMTS for the maximum block length of 5114 is 2 Mbps, while the maximum throughput achieved by WiMax with the maximum block length of 2400 is up to 70 Mbps.

In any 3GPP LTE encoder, there should be at least four idle clock cycles at the output because of the time needed for the tail bits.

Clock time is the amount of time needed for the signal to pass through one complete cycle. It is expressed in ns (nanoseconds) and is also called the cycle length.

Clock frequency is the reciprocal of clock time; it is the number of clock cycles that occur every second, expressed in Hz (or MHz, GHz). It is also called the clock speed or clock rate.

In general, the maximum throughput for the given block size K is given by,

Maximum throughput = (K/(K+4)) x clock frequency (bits/s)

[From 3GPP LTE encoder throughput calculation]

K- Block size

From the obtained result, the calculated clock time is 5.12 ns.

Hence clock frequency = 1/clock time

= 1/5.12 ns

= 1/(5.12 x 10^-9)

= 195312500 Hz

= 195.31 MHz

For K=40,

Maximum throughput = (K/(K+4)) x clock frequency

= (40/(40+4)) x 195.31 x 10^6

= 177.55 Mbps

For K=2400,

Maximum throughput = (K/(K+4)) x clock frequency

= (2400/(2400+4)) x 195.31 x 10^6

= 194.98 Mbps

For K=5114,

Maximum throughput = (K/(K+4)) x clock frequency

= (5114/(5114+4)) x 195.31 x 10^6

= 195.15 Mbps

For K=6144,

Maximum throughput = (K/(K+4)) x clock frequency

= (6144/(6144+4)) x 195.31 x 10^6

= 195.18 Mbps

For the clock frequency of 195.31 MHz, the maximum achievable throughput varies from 177.55 Mbps to 195.18 Mbps as the block size grows from the minimum of 40 to the maximum of 6144. Hence these figures show that the maximum throughput achieved by LTE is higher than that of the UMTS and WiMax standards.

Chapter 9: Comparisons

9.1 Comparison of Fixed-Point processors vs. Floating-Point processors

Selecting a DSP processor depends entirely on the application: one processor may perform well for some applications but be a bad choice for others. The features described below, which vary from one DSP to another, should be weighed when choosing a processor.

Arithmetic Format: -

The most basic feature of a programmable DSP processor is the kind of native arithmetic it uses. The TMS320C6416 DSP uses fixed-point arithmetic, in which numbers are represented as integers or as fractions in a predetermined range, normally from -1.0 to +1.0. Other processors, such as the TMS320C6713/45/47, use floating-point arithmetic, in which a value is represented by a mantissa and an exponent as mantissa x 2^exponent. The mantissa is normally a fraction within the range of -1.0 to +1.0, and the exponent is an integer that gives the number of places the binary point must be shifted left or right to obtain the value represented.

Floating-point processors (TMS320C6713/45/47) are more general and more flexible than fixed-point processors. With floating point, application designers have access to a wider dynamic range, so floating-point DSPs are normally easier to program than fixed-point ones; at the same time, they are more expensive and consume more power. The higher cost and power consumption reflect the more complex circuitry that floating point requires, which implies a larger silicon die. With this in mind, the turbo encoder in this work has been programmed on the fixed-point TMS320C6416 to reduce cost and, especially, to lower power consumption. Programmers of floating-point processors usually need not worry about dynamic range and precision, which is the main advantage of floating point; on the fixed-point TMS320C6416, however, the programmer must be careful at many stages of the program to ensure sufficient numeric precision within the restricted dynamic range.

Because of their low cost and low power consumption, many high-volume embedded applications use fixed-point processors. The programmers and algorithm designers must determine the dynamic range and precision requirements of their application, either analytically or through simulation, and then add scaling operations into the code where needed. Floating-point processors have merit when ease of development is more important than unit cost. It is feasible to perform floating-point arithmetic on fixed-point hardware using software routines that emulate the behaviour of a floating-point device, but such routines are usually very costly in terms of processor cycles, so floating-point emulation is rarely used. A more efficient technique to extend a fixed-point processor's numeric range is block floating point, in which a group of numbers with different mantissas but one common exponent is processed as a block of data. Block floating point is usually handled in software, although some processors have hardware features to assist in its implementation.

Data Width: -

All floating-point DSPs such as the TMS320C6713/45/47 use a 32-bit data word; for most fixed-point DSPs, the data word size is 16 bits. The data word size has a major impact on chip size, on the number of package pins required and on the size of the external memory devices connected to the DSP, so system designers try to use the chip with the smallest word size their application can tolerate. As with the choice between fixed and floating point, there is a trade-off between word size and complexity of development. With the TMS320C6416 fixed-point processor, double-precision 32-bit arithmetic operations can be performed by combining instructions. If an application requires high precision for only a small part of the code, the programmer can use double-precision arithmetic selectively; if many parts of the application need high precision, a processor with a larger data word size is a better choice.

Power Consumption and Management: -

Nowadays DSPs are increasingly used in portable applications, such as portable audio players and cellular phones, in which power consumption is a major concern. Hence most processor vendors are reducing power supply voltages and adding power management features that give system programmers greater influence over the processor's power consumption.

Power Management features: -

Reduced-voltage operation: vendors offer low-voltage versions of DSP processors, usually in the range of 1.8-3.3 V. At the same clock rate, these processors use far less power than their 5 V equivalents. Sleep or idle modes: many DSPs feature modes that turn off the processor clock to all but certain sections of the processor, minimizing power consumption.

Sometimes an unmasked interrupt brings the processor back from its sleep mode, and sometimes only specific external interrupt lines can wake it. Some DSPs allow the processor's clock frequency to be varied under software control, so that the minimum clock speed needed for a particular task can be used, and some allow the programmer to disable peripherals that are not in use. Regardless of power management features, it is hard for design engineers to obtain meaningful power consumption figures for DSPs, because a processor's power consumption may vary by a factor of three or more depending on the instructions it executes.

Memory Organization: -

The organization of a processor's memory subsystem has a large impact on its performance. The multiply-accumulate (MAC) operation is fundamental to most signal processing algorithms, and a fast MAC requires fetching one instruction word and two data words from memory in every instruction cycle. There are several ways to achieve this, including separate instruction and data memories (the Harvard architecture), multiple memory accesses per instruction cycle, and instruction caches.

Another concern is the size of the supported memory, both on-chip and off-chip. Many fixed-point DSPs, such as the TMS320C6416, are aimed at the embedded-systems market, where memory requirements are small. Hence these processors tend to have small external data buses and small-to-medium on-chip memories in the range of 4K-64K.

Most fixed-point DSPs have address buses of 16 bits or less, which limits the amount of easily accessible external memory. Some floating-point chips provide little or no on-chip memory but feature large external buses. All of these features (memory size, memory organization and the number of external buses) are strongly application dependent.

Speed: -

A key measure of a processor's suitability for a given application is its execution speed. There are several ways to measure processor speed. The most basic is the processor's instruction cycle time: the amount of time required to execute the fastest instruction on the processor. However, the amount of work done by one instruction varies from one processor to another. Some DSP processors have VLIW architectures, in which multiple instructions are issued and executed in one cycle; these processors use simple instructions that do less work than the instructions of conventional DSP processors. Comparing MIPS ratings between VLIW processors and conventional processors can therefore be misleading because of the fundamental differences in their instruction sets.

Key points of caution on processor speed: -

The first is to be cautious when comparing processors using "millions of operations per second" (MOPS) or "millions of floating-point operations per second" (MFLOPS) figures, because different processor vendors have different ideas of what constitutes an operation. For example, many floating-point processors carry an MFLOPS rating of twice their MIPS rating because they can execute a floating-point multiply operation in parallel with a floating-point addition operation.

The second is to be cautious when comparing processor clock rates. Depending on the processor, the DSP's input clock may be at the same frequency as the processor's instruction rate, or it may be two or more times higher than the instruction rate. Moreover, most DSP chips include clock doublers or phase-locked loops (PLLs) that allow a low-frequency external clock to generate the required high-frequency clock on-chip.

Summary: -

The right DSP relies on the application; an excellent for one application but may be poor for another. The above said criteria such as arithmetic format, data width, speed, cost, power consumption, ease of development etc. are system designers' decision and should be based on their application. Two trends in DSP processor design. One is our expectation to see more DSP processors narrowed for some specific huge volume applications such as portable digital audio players, cellular phones etc.

Such processors will integrate additional system functions, such as LCD controllers and A/D converters, to reduce product cost and parts count. The other trend is that, as EDA (Electronic Design Automation) tools grow in sophistication, system designers will find it easier to modify DSP cores and attach custom peripherals to build highly specialised, cost-effective solutions for high-volume applications. [12]

9.2 Comparison of LTE vs. WiMax

LTE achieves a peak data rate of 160 Mbps in the downlink and 50 Mbps in the uplink. Its latency is 10 ms, which is lower than that of WiMax. [20]

Latency is an important factor for some online services. To deliver such services well, providers must pay close attention to latency: data should arrive at its destination on time so that the exchange feels real-time.

Few examples of online services: -

Online gaming: in competitive gaming, it makes a big difference if one player's latency is 50 ms lower than the others'.

Video conferencing: video conferences should perform well, without delays in the conversation. One solution to high latency is the use of buffers, but LTE does not require such special measures. WiMax, in addition, uses a larger packet overhead, which is worse for services like VoIP.

In terms of power consumption, LTE uses SC-FDMA in the uplink, which requires less power and leads to longer battery life. [21]

9.3 Comparison of LTE vs. UMTS

The main features of LTE are compared with UMTS networks as follows.

Improved air interface permits enhanced data rates: -

LTE is built on a new radio access network based on OFDM technology. As specified in 3GPP Release 8, the LTE air interface uses OFDMA-based multiple access and modulation for the downlink, with SC-FDMA for the uplink. In LTE, OFDMA spectral efficiency is further increased with higher-order modulation schemes and advanced FEC schemes such as tail-biting convolutional coding and turbo coding.

The effect of these radio interface features is considerably increased radio performance, producing more than four times the average throughput of HSPA. HSPA achieves its performance using technologies such as MIMO, whose features are part of 3GPP Release 7. LTE provides further improvements in overall performance and efficiency via complex MIMO and antenna configurations. The capacity of LTE will grow with enhancements announced in forthcoming Releases, allowing advanced LTE to satisfy the requirements of advanced IMT.

High spectral efficiency: -

The higher spectral efficiency of LTE allows operators to serve a larger number of customers within a given spectrum allocation, with a reduced cost of delivery per bit.

Radio planning flexibility: -

LTE provides its best performance in cells of up to 5 km. It can deliver effective performance up to a 30 km radius, with more limited performance up to 100 km.

Minimized Latency: -

LTE provides a more responsive user experience by reducing round-trip times to 10 ms or less, compared with round-trip times of around 40-50 ms for HSPA. This enables real-time interactive services such as multiplayer gaming and high-quality audio/video conferencing.

All-IP environment: -

One of the important features of LTE is its transition to a flat, all-IP core network with open interfaces. Much 3GPP standardisation work targets the migration of the existing core network architecture to an all-IP system. This plan within 3GPP is known as the System Architecture Evolution (SAE), now called the Evolved Packet Core (EPC). It provides highly flexible service provisioning together with simple interworking with fixed and non-3GPP networks.

Since the EPC is built on TCP/IP protocols, it supports services including messaging, rich media, voice and video. This fully routed, all-packet architecture also allows improved interworking with fixed and wireless communication networks.

Existence with legacy standards and systems: -

Even where there is no LTE coverage, LTE users must still be able to make calls from their terminal and to access basic data services. Hence LTE supports seamless service handover to areas covered by WCDMA, HSPA or GPRS/GSM/EDGE. It provides both inter-system and intra-system handovers, as well as inter-domain handovers between circuit-switched and packet-switched sessions.

Additional cost reduction capabilities: -

The introduction of features such as self-optimising networks (SON) and multi-vendor RAN (MVR) should help minimise opex and support the potential to achieve low cost per bit. [22]

  • UMTS needs higher data rates: this is achieved by the air interface of 3GPP LTE
  • UMTS needs a high quality of service: LTE delivers this, significantly reducing control-plane overhead and round-trip delay
  • UMTS needs simpler infrastructure: LTE achieves this with a simple architecture and a reduced number of network elements [24]

Chapter 11: References

[1] Lattice Corporation (2009), "Turbo Encoder".

[2] FPGA Blog (2007-2009), Xilinx 3GPP LTE Turbo Encoder and 3GPP LTE Turbo Decoder

[3] Texas Instruments TMS320C6x, "VLIW: Very Long Instruction Word"

[4] Maurizio Martina, Mario Nicola, Guido Masera, "A Flexible UMTS-WiMax Turbo Decoder Architecture", IEEE Transactions on Circuits and Systems, vol. 55, no. 4, April 2008

[5] Emilia Kasper, "Turbo Codes"

[6] Satyantan Choudhury, "Modelling and Simulation of a Turbo Encoder for wireless Communication Systems"


[8] David Williams "Turbo Product Code Tutorial", 01 May 2000

[9] Shenyang, "Optimization of Turbo Codes by Differential Evolution and Genetic Algorithms", 12 August 2009

[10] ALTERA, "AN 505: 3GPP Turbo Reference Design", Jan 2010

[11] Fu-hua Huang, "Chapter 3. Turbo Code Encoder", 1997

[12] Hazarathaiah Malepati and Yosi Stein, "Turbo encoders boost efficiency of a femtocell's DSP", April 2009.

[13] Berkeley Design Technology, Inc. "Choosing a DSP Processor", 1996-2000

[14] Chapter 8, "DSP Implementation Platform: TMS320C6x Architecture and Software Tools"

[15] Datasheet Directory, Embedded Processing and DSP, "Fixed-Point Digital Signal Processor"

[16] Texas Instruments, "TMS320C6416 Datasheet"

[17] DSP Development System, "TMS320C6416 DSK Technical Reference ", 2003

[18] Texas Instruments, "Code Composer Studio (CCStudio) Integrated Development Environment (IDE)"

[19] SHELDON Instruments, "Code Composer Studio"

[20] Dr.-Ing. Carsten Ball Siemens Networks, "LTE and WiMax Technology and Performance Comparison", April 2007

[21] Mario Eguiluz Alebicto, "Analysis: Wimax & LTE", Feb 2010

[22] A white paper from the UMTS Forum, "Towards Global Mobile Broadband - Standardising the future of mobile communications with LTE (Long Term Evolution)", Feb 2008

[23] Resource and Analysis for electronics Engineers," 3G LTE Tutorial - 3GPP Long Term Evolution"

[24] Shuttgort, Research & Innovation, "UMTS System Performance", June 2006

List of Acronyms

ASIC          Application Specific Integrated Circuit

ARM          Advanced Reduced instruction set computer Machine

CCS          Code Composer Studio

CDMA          Code Division Multiple Access

CCSDS          Consultative Committee for Space Data System

CPU          Central Processing Unit

CPLD          Complex Programmable Logic Device

DRAM          Dynamic Random Access Memory

DSP          Digital Signal Processor

DMA          Direct Memory Access

EDGE          Enhanced Data rates for GSM Evolution

EPC          Evolved Packet Core

EMIF          External Memory Interface

FEC          Forward Error Correction

FTP          File Transfer Protocol

FPGA          Field Programmable Gate Array

FIR          Finite Impulse Response

GSM          Global System for Mobile Communication

GPRS          General Packet Radio Service

HPI          Host Port Interface

HSPA          High Speed Packet Access

HSDPA          High-Speed Downlink Packet Access

HSUPA          High-Speed Uplink Packet Access

JTAG          Joint Test Action Group

L1D          Level 1 Data Cache

L1P          Level 1 Program Cache

LTE          Long Term Evolution

LUT          Look Up Table

LDPC          Low Density Parity Check

MOPS          Millions of Operations Per Second

MFLOPS          Millions of Floating-Point Operations Per Second

MIPS          Million Instructions Per Second

Mbps          Megabits per second

McBSP          Multi-channel Buffered Serial Port

MAC          Multiply Accumulate

MMAC          Million Multiple Accumulates

MIMO          Multiple-Input Multiple-Output

NRC          Non Recursive Code

OFDM          Orthogonal Frequency Division Multiplexing

PCCC          Parallel concatenated convolutional code

PCI          Peripheral Component Interconnect

QoS          Quality of Service

RTT          Round Trip Time

RAT          Radio Access Technologies

RSC          Recursive Systematic Code

RISC          Reduced Instruction Set Computer

SAE          System Architecture Evolution

SDRAM          Synchronous Dynamic Random Access Memory

SC-FDMA          Single-carrier Frequency Division Multiple Access

SON          Self-Optimising Networks

TMS          Texas Management System

TI          Texas Instruments

TCP          Turbo Decoder Coprocessor

TDD          Time Division Duplex

UMTS          Universal Mobile Telecommunications System

USB          Universal Serial Bus

VoIP          Voice over Internet Protocol

VCP          Viterbi Decoder Coprocessor

VLIW          Very Long Instruction Word

WCDMA          Wideband Code Division Multiple Access

WLAN          Wireless Local Area Network

WiMax          Worldwide Interoperability for Microwave Access

Wi-Fi          Wireless Fidelity

3GPP          3rd Generation Partnership Project
