fir filter architectures for fpgas and asics
the filter blocks described on this page all use the same fir filter architectures to implement their algorithms.
these blocks provide filter implementations that make trade-offs between resources and throughput. the filter implementations also consider vendor-specific hardware details of the dsp blocks when adding pipeline registers to the architecture. these differences in pipeline register locations help fit the filter design to the dsp blocks on the fpga. for a filter implementation that matches multipliers, pipeline registers, and pre-adders to the dsp configuration of your fpga vendor, specify your target device when you generate hdl code.
the filter implementations remove multipliers for zero-valued coefficients, such as in half-band filters and hilbert transforms. when you use scalar input data, the filters share multipliers for symmetric and antisymmetric coefficients. frame-based filters do not support symmetry optimization.
the fir filter implementations use efficient complex multiplier architectures and support frame-based input by using polyphase filters that share hardware resources across subfilters.
the architecture diagrams on this page assume a transfer function that has l coefficients (before optimizations that share multipliers for symmetric or antisymmetric coefficients or remove multipliers for zero-valued coefficients). n represents the number of cycles between valid input samples.
filter structure | blocks | settings |
---|---|---|
fully parallel systolic architecture | | |
fully parallel transposed architecture | | set filter structure to direct form transposed. |
partly serial systolic architecture (1 < n < l) | | |
fully serial systolic architecture (n ≥ l) | | |
complex multipliers
if either data or coefficients are complex but not both, the filter blocks implement one filter to calculate the real output and a second filter to calculate the imaginary part. this implementation results in two multipliers for each filter tap.
when both the data and coefficients are complex, the block implements three filters in parallel. the diagram shows the filter implementation for complex input data x = xr + i×xi and complex coefficients w = wr + i×wi.
when you specify coefficients from a parameter, wr + wi and wr - wi are pre-calculated, so this implementation uses 3 dsp blocks for each filter tap, plus the input adder and two output adders. the input to each filter tap multiplier grows by one bit.
when you use programmable coefficients, the filter uses 2 more adders for each filter tap. these adders calculate the coefficients wr + wi and wr - wi.
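The three-multiplier approach can be illustrated with a minimal Python sketch. The exact arrangement of adders and multipliers in the block's diagram is not reproduced on this page, so the grouping below (one input adder, the pre-calculated sums wr + wi and wr - wi, and two output adders) is an assumption that is merely consistent with the description above. The function name `complex_mult_3` is hypothetical.

```python
def complex_mult_3(xr, xi, wr, wi):
    """Multiply (xr + i*xi) by (wr + i*wi) using three real multiplies.

    Assumed arrangement, chosen to match the counts in the text:
    one input adder, three multipliers, two output adders.
    """
    # Coefficient terms. With fixed coefficients these would be
    # pre-calculated; with programmable coefficients they cost
    # two extra adders per tap.
    w_sum = wr + wi
    w_diff = wr - wi

    # Input adder (this is why the multiplier input grows by one bit).
    x_sum = xr + xi

    # Three real multiplications.
    p1 = wr * x_sum      # wr*(xr + xi)
    p2 = w_sum * xi      # (wr + wi)*xi
    p3 = w_diff * xr     # (wr - wi)*xr

    # Two output adders.
    yr = p1 - p2         # xr*wr - xi*wi
    yi = p1 - p3         # xr*wi + xi*wr
    return yr, yi
```

For example, `complex_mult_3(3, 4, 5, -2)` gives the same result as the four-multiplier product (3 + 4i)(5 - 2i).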
frame-based input data
the discrete fir filter, fir decimator, fir interpolator, channelizer, and channel synthesizer blocks accept frame-based input data to support gigasamples-per-second throughput. when you apply frame-based input data, the fir filter implements a polyphase decomposition of your filter coefficients into v subfilters, where v is the size of the input vector. the frame-based filter increases throughput and uses more hardware resources than the scalar-input case. frame-based filters do not implement symmetry optimization.
for a filter with a 1-by-2 input vector, [y0 y1], the diagram shows the polyphase decomposition into two subfilters that implement this equation.
each subfilter takes scalar input and is implemented with the architecture you selected, either direct form systolic or direct form transposed. if the subfilters have different latencies due to different numbers of coefficients, or zero-value coefficient optimization, then the implementation includes internal delays to align the output samples. you cannot use frame-based input with the serial systolic architecture.
when you use frame-based input with programmable coefficients, the output may not match sample-for-sample with the output in scalar mode. this difference is because of the internal timing of applying each sample in the input vector to the subfilters. changes in the input coefficients effectively occur at different individual input samples than they do in scalar mode.
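To make the polyphase idea concrete, here is a small Python sketch for a frame size of two. It is a behavioral model only: the function names are hypothetical, the input length is assumed to be even, and the way the hardware shares resources across subfilters is not modeled. The sketch splits the coefficients into even and odd phases and checks that processing two samples per step gives the same output as scalar filtering.

```python
def fir_scalar(coeffs, x):
    """Reference FIR: y[n] = sum_k coeffs[k] * x[n-k], zero initial state."""
    return [sum(c * x[n - k] for k, c in enumerate(coeffs) if n - k >= 0)
            for n in range(len(x))]


def fir_frame2(coeffs, x):
    """Same filter, consuming two samples per step (len(x) must be even)."""
    if len(x) % 2:
        raise ValueError("this sketch assumes an even number of samples")

    e0, e1 = coeffs[0::2], coeffs[1::2]   # even and odd coefficient phases
    x_even, x_odd = x[0::2], x[1::2]      # first and second element of each frame

    # Each phase is applied to each input stream as a scalar subfilter.
    a = fir_scalar(e0, x_even)
    b = fir_scalar(e1, x_odd)
    c = fir_scalar(e0, x_odd)
    d = fir_scalar(e1, x_even)

    y = []
    for m in range(len(x_even)):
        b_prev = b[m - 1] if m >= 1 else 0   # odd-phase result from the previous frame
        y.append(a[m] + b_prev)              # first output of the frame, y[2m]
        y.append(c[m] + d[m])                # second output of the frame, y[2m+1]
    return y
```

The one-frame delay on `b` plays the role of the internal alignment delays mentioned above; in hardware the equivalent delays also account for subfilter latency differences.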
fully parallel systolic architecture
this filter architecture is a fully parallel systolic architecture with optimizations for symmetry or anti-symmetry and zero-valued coefficients. the latency depends on the coefficient symmetry and is displayed on the block icon.
when symmetric pairs of coefficients have equal absolute values, they share one dsp block. this pair-sharing enables the implementation to use the pre-adder in xilinx® and intel® dsp blocks. the top half of the diagram shows a symmetric filter without the pair coefficient optimization. the bottom half of the diagram shows the architecture using the pair coefficient optimization.
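The pair-sharing idea can be sketched in Python as a behavioral model. This is an illustration under stated assumptions, not the block's implementation: the function names are hypothetical, coefficients are assumed to be exactly symmetric, and pipeline registers and latency are not modeled. Each symmetric pair adds its two delayed samples first (the pre-adder) and then multiplies once.

```python
def fir_direct(coeffs, x):
    """Reference FIR with one multiply per tap and zero initial state."""
    return [sum(c * x[n - k] for k, c in enumerate(coeffs) if n - k >= 0)
            for n in range(len(x))]


def fir_symmetric_shared(coeffs, x):
    """FIR for symmetric coefficients with one multiply per coefficient pair.

    Returns the output samples and the number of multipliers used.
    """
    coeffs = list(coeffs)
    num_taps = len(coeffs)
    if coeffs != coeffs[::-1]:
        raise ValueError("this sketch assumes symmetric coefficients")

    def sample(n, k):
        # Delay-line value x[n-k]; zero before the first input sample.
        return x[n - k] if n - k >= 0 else 0

    multipliers = num_taps // 2 + num_taps % 2
    y = []
    for n in range(len(x)):
        acc = 0
        for k in range(num_taps // 2):
            # Pre-adder: sum the two samples that share a coefficient,
            # then multiply once.
            acc += coeffs[k] * (sample(n, k) + sample(n, num_taps - 1 - k))
        if num_taps % 2:
            # Odd length: the center tap has no partner.
            mid = num_taps // 2
            acc += coeffs[mid] * sample(n, mid)
        y.append(acc)
    return y, multipliers
```

For a five-tap symmetric filter this model uses three multiplies per output instead of five. An antisymmetric filter would subtract in the pre-adder instead of adding.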
fully parallel transposed architecture
the fully parallel transposed architecture minimizes multipliers by sharing multipliers for any two or more coefficients that have equal absolute values. it also removes multipliers for zero-valued coefficients. the latency of the filter is six cycles when you use scalar input. this latency does not change with coefficient values.
the top half of the diagram shows the theoretical architecture for a partly-symmetric filter without the equal-absolute-value coefficient optimization. the bottom half of the diagram shows the transposed architecture as implemented using the equal-value coefficient optimization. if the coefficients are antisymmetric, the output adder becomes a subtraction.
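As a rough behavioral illustration of the equal-absolute-value optimization, the following Python sketch models a transposed-form filter in which each input sample is multiplied once per unique nonzero coefficient magnitude, and the sign is applied at the adder chain. The function name is hypothetical, and the pipeline registers that produce the six-cycle latency are not modeled.

```python
def fir_transposed_shared(coeffs, x):
    """Transposed-form FIR that multiplies once per unique |coefficient|.

    Returns the output samples and the number of multipliers used.
    """
    num_taps = len(coeffs)
    # Zero-valued coefficients need no multiplier at all.
    magnitudes = sorted({abs(c) for c in coeffs if c != 0})
    # Adder-chain registers; the last entry stays 0 and ends the chain.
    state = [0] * num_taps

    def term(c, products):
        # Reuse the shared product; negative coefficients subtract it.
        if c == 0:
            return 0
        return products[abs(c)] if c > 0 else -products[abs(c)]

    y = []
    for sample in x:
        # One multiplication per unique magnitude for this input sample.
        products = {m: m * sample for m in magnitudes}
        y.append(term(coeffs[0], products) + state[0])
        # Update registers in ascending order so each update reads the
        # previous-cycle value of its neighbor.
        for k in range(num_taps - 1):
            state[k] = term(coeffs[k + 1], products) + state[k + 1]
    return y, len(magnitudes)
```

With coefficients `[2, -3, 0, 3, 2, 5]`, this model needs three multipliers (for magnitudes 2, 3, and 5) instead of six.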
partly serial systolic architecture (1 < n < l)
the partly serial implementation uses m = ceil(l/n) systolic cells. each cell consists of a delay line, a coefficient lookup table, and a dsp (multiply-add) block. the coefficients are spread across the m lookup tables. the computation performed by each dsp block is serialized. input samples to the block must be scalar and at least n cycles apart. the latency of the block is m + ceil(l/m) + 5.
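The cell count and latency can be worked through with a short Python helper. The function names are hypothetical, `num_coeffs` stands for l and `n_cycles` for n, and the latency expression assumes the formula reads m + ceil(l/m) + 5, which is a reading of the text above rather than a verified specification.

```python
import math


def partly_serial_cells(num_coeffs, n_cycles):
    """Number of systolic cells, m = ceil(l/n), valid for 1 < n < l."""
    if not 1 < n_cycles < num_coeffs:
        raise ValueError("partly serial architecture requires 1 < n < l")
    return math.ceil(num_coeffs / n_cycles)


def partly_serial_latency(num_coeffs, n_cycles):
    """Latency in cycles, assuming the formula m + ceil(l/m) + 5."""
    m = partly_serial_cells(num_coeffs, n_cycles)
    return m + math.ceil(num_coeffs / m) + 5
```

For a 32-coefficient filter with at least 4 cycles between input samples, this gives 8 cells and, under the assumed formula, a latency of 8 + 4 + 5 = 17 cycles.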
if all the coefficients in the lookup table for a multiplier are zeros or powers of two, the implementation does not include that multiplier. the powers of two multiplications are implemented as shifts.
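The multiplier-removal rule can be sketched as a check over each lookup table. This Python example is illustrative only: the function names are hypothetical, and the assumption that coefficients are assigned to lookup tables in contiguous groups of ceil(l/m) is mine, since the text does not state how they are distributed.

```python
import math


def is_zero_or_power_of_two(c):
    """True if c is 0 or +/- 2**k for some integer k (including k < 0)."""
    if c == 0:
        return True
    # frexp returns a mantissa in [0.5, 1); exact powers of two give 0.5.
    mantissa, _ = math.frexp(abs(c))
    return mantissa == 0.5


def tables_needing_multiplier(coeffs, n_cycles):
    """For each lookup table, report whether its cell needs a multiplier.

    Assumes contiguous assignment of coefficients to m = ceil(l/n) tables.
    """
    num_coeffs = len(coeffs)
    m = math.ceil(num_coeffs / n_cycles)
    depth = math.ceil(num_coeffs / m)
    tables = [coeffs[i * depth:(i + 1) * depth] for i in range(m)]
    return [not all(is_zero_or_power_of_two(c) for c in table)
            for table in tables]
```

For coefficients `[0.5, 0, 3, 1, -0.25, 4]` with n = 2, only the middle table (which contains 3) needs a multiplier; the other two tables hold only zeros and powers of two, which can be handled as shifts.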
the block implements a ram-based delay line that uses fewer resources than a register-based implementation. uninitialized ram locations can result in x values at the start of your hdl simulation. you can avoid x values by initializing the ram from your test bench, or by enabling the initialize all ram blocks configuration parameter. this parameter sets the ram locations to 0 for simulation and is ignored by synthesis tools.
fully serial systolic architecture (n ≥ l)
when you choose a serialization factor such that n ≥ l, the block implements a fully serial systolic architecture. for real coefficients and real input, the filter uses a single dsp (multiply-add) block with a delay line and a lookup table for all l coefficients. input samples must be at least n cycles apart. the latency of the filter is l + 5.
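A behavioral Python sketch of the fully serial case follows. It models a single multiply-add unit that steps through all l coefficients for each input sample, which is why samples must be spaced at least l cycles apart. The function names are hypothetical, pipeline stages are not modeled, and the latency helper assumes the formula reads l + 5.

```python
def fully_serial_fir(coeffs, x, n_cycles):
    """Model one multiply-add unit shared across all coefficients."""
    num_coeffs = len(coeffs)
    if n_cycles < num_coeffs:
        raise ValueError("fully serial architecture requires n >= l")

    delay_line = [0] * num_coeffs
    outputs = []
    for sample in x:
        # Shift the new sample into the delay line.
        delay_line = [sample] + delay_line[:-1]
        acc = 0
        # One multiply-add per clock cycle, reading one coefficient
        # from the lookup table each cycle.
        for cycle in range(num_coeffs):
            acc += coeffs[cycle] * delay_line[cycle]
        outputs.append(acc)
    return outputs


def fully_serial_latency(num_coeffs):
    """Latency in cycles, assuming the formula l + 5."""
    return num_coeffs + 5
```

With three coefficients and inputs spaced at least three cycles apart, each output takes three multiply-add cycles on the shared unit, and the assumed latency is 8 cycles.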