# Resolve picoseconds using FPGA techniques

**Abstract**

A low-area, high-stability time-to-digital converter (TDC) implemented with a shifted-clock sampling technique is proposed. It is tested on the Xilinx Virtex-5 FPGA of an ML507 board. The results demonstrate that the TDC achieves a dynamic range of 10 µs, a resolution of 64 ps, a differential nonlinearity (DNL) of less than 0.3 LSB, and an integral nonlinearity (INL) better than 0.6 LSB.

**Introduction**

Time-to-digital converters are often used in high-energy physics experiments, such as fixed-target and collision experiments that explore fine structure at the subatomic level [1-2], where picosecond-level accuracy is required. Designs are usually implemented as Application Specific Integrated Circuits (ASICs) because of their high precision and stability compared with TDCs implemented on FPGAs [3-4]. However, FPGA-based TDCs can also reach picosecond precision [5]. An FPGA TDC design must take several important features into consideration: quantization step, measurement range, standard measurement uncertainty, nonlinearities, dead time, and readout speed. An important guiding principle and trade-off in FPGA design is the balance between size and speed [6]. We must consider clock quality, the placement of logic resources, and resource consumption, and find ways to reduce nonlinearity effects and calibrate the final results.

We first introduce the basic theory of TDCs and then compare two popular TDC structures in Section II. In Section III we present a low-resource, high-stability TDC based on the shifted clock sampling technique, multi-level latches, and four-level shift register arrays. The results are described in Section IV.

**TDC architecture**

**A. Basic theory of TDC**

**Fig 1 TDC Timing Diagram**

$\Delta T = \Delta t_1 + N \cdot T_{\mathrm{ref}} - \Delta t_2$ (1)

The simplest TDC design is the direct counting method, but it measures only the coarse time interval between the start and stop signals; the intervals ∆t1 and ∆t2 are ignored, which introduces a measurement error of up to ±1 LSB. This scheme is also called a “coarse counter”. Its least significant bit (LSB) is determined by the frequency of the reference clock, namely LSB = 1/f_clk. The maximum clock frequency in an FPGA is limited and generally reaches only about 500 MHz.
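As a minimal sketch of equation (1) and of the error left by coarse counting alone, the following Python snippet evaluates both for one made-up measurement. The function names and the numeric values (a 500 MHz reference clock, 10 counted cycles, ∆t1 = 1500 ps, ∆t2 = 400 ps) are illustrative assumptions, not values from the measurements reported here.

```python
def interval_ps(dt1_ps, n_cycles, t_ref_ps, dt2_ps):
    """Equation (1): dT = dt1 + N * Tref - dt2, all times in picoseconds."""
    return dt1_ps + n_cycles * t_ref_ps - dt2_ps


def coarse_only_ps(n_cycles, t_ref_ps):
    """Direct counting: dt1 and dt2 are ignored."""
    return n_cycles * t_ref_ps


T_REF_PS = 2000  # assumed 500 MHz reference clock, i.e. a 2 ns period (1 LSB)

true_dt = interval_ps(dt1_ps=1500, n_cycles=10, t_ref_ps=T_REF_PS, dt2_ps=400)
coarse_dt = coarse_only_ps(10, T_REF_PS)

print("interval from equation (1):", true_dt, "ps")
print("coarse counter only:       ", coarse_dt, "ps")
print("coarse counting error:     ", true_dt - coarse_dt, "ps")
```

The error of the coarse result stays below one reference clock period, which is why interpolating ∆t1 and ∆t2 is the only way to improve resolution once the clock frequency limit is reached.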

**B. Interpolator TDL and SCS**

Two schemes are commonly used to improve the resolution of TDCs: the tapped delay line (TDL) and shifted clock sampling (SCS). Both interpolate between clock edges to determine ∆t1 and ∆t2.

**Fig 2 Two kinds of TDC basic framework: (L) Tapped Delay Lines; (R) Shifted Clock Sampling**

For the TDL in Fig 2, the delay chain is made up of many delay units such as buffers or inverters. The shortest propagation delay of a logic gate in a Virtex-5 may be only a few picoseconds; however, it is difficult to build a delay chain purely from logic gate resources in an FPGA. Torres et al. [7] use the carry chain of the Xilinx CARRY4 block (its MUXCY elements) as the delay element and achieve an accuracy better than 100 ps.

The fine interpolation range must cover at least one reference clock period of the coarse counter. As a result, if the delay of one stage is 100 ps and the coarse reference clock runs at 100 MHz (a 10 ns period), 100 delay stages are needed to implement the TDL. Moreover, calibration and temperature compensation are needed because the propagation delay is not ideal and is sensitive to PVT (process, voltage, temperature) variations. The TDL also uses more resources than the SCS scheme.

The core of the SCS scheme in an FPGA is the generation of phase-shifted clocks. A fixed clock is fed to the CMT (clock management tile: DCM and PLL), which generates multiple clock signals whose phases are spaced 2π/N apart, where N is the number of sampling clocks. The input signal is sampled in parallel by all of these phase-shifted clocks. SCS with an 8-phase clock is shown in Fig 3.

**Fig 3 Using SCS, the input signal is synced to the 225° phase.**

Therefore, each reference clock period is divided equally into N time slices. In Fig 3, the leading edge of the input signal falls between the 180° and 225° phase clocks; the sampled output is ‘0’ for the phases before 225° and ‘1’ from 225° onward, and the result is finally resampled by the reference clock at 0° (output = “11100000”).
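The sampling pattern described above can be reproduced with a small behavioural model. This is only an idealised sketch: the function name `scs_sample` is hypothetical, the input is assumed to stay high for the rest of the period, and setup/hold times and metastability are ignored.

```python
def scs_sample(edge_fraction, n_phases=8):
    """Idealised SCS sampling of one clock period.

    edge_fraction: position of the input's rising edge within the period,
                   as a fraction in [0, 1) (0.5 corresponds to 180 degrees).
    Phase k has its sampling edge at k / n_phases of the period and captures
    '1' if the input is already high at that moment.
    The result is written with the highest phase first (315 degrees ... 0 degrees).
    """
    bits = [1 if k / n_phases >= edge_fraction else 0 for k in range(n_phases)]
    return "".join(str(b) for b in reversed(bits))


# Rising edge midway between the 180 degree and 225 degree phase clocks.
edge = (180 + 225) / 2 / 360
print(scs_sample(edge))  # 11100000
```

Only the 225°, 270°, and 315° clocks see the input high, which gives the three leading ones of the pattern.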

If the fine counting clock runs at 400 MHz and adjacent clock phases are shifted by 45° (dividing each clock period into 8 time slices), the sampling interval is 1 / 400 MHz / 8 = 312.5 ps. This scheme cannot reach the roughly 10 ps precision of a TDL, but it uses fewer resources. If the coarse counting clock is 100 MHz, a TDL with the same bin width needs 10 ns / 312.5 ps = 32 registers, whereas SCS needs only 8. It is also easier to implement multi-channel sampling TDCs by means of the incremental design flow in the PlanAhead tools [8].
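The arithmetic behind this comparison can be checked with a few lines of Python. The helper names are hypothetical; the frequencies and the 100 ps TDL stage delay are the ones quoted in the text.

```python
def scs_lsb_ps(f_fine_hz, n_phases):
    """Bin width of an SCS TDC: one fine clock period divided by the phase count."""
    return 1e12 / f_fine_hz / n_phases


def bins_per_coarse_period(f_coarse_hz, bin_ps):
    """Number of bins (or TDL delay stages) needed to span one coarse period."""
    return round(1e12 / f_coarse_hz / bin_ps)


lsb = scs_lsb_ps(400e6, 8)
print("SCS bin width:", lsb, "ps")
print("TDL stages at the same bin width:", bins_per_coarse_period(100e6, lsb))
print("TDL stages at 100 ps per stage:  ", bins_per_coarse_period(100e6, 100.0))
print("SCS sampling registers:          ", 8)
```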

**Implementation of SCS**

**A. Shifted clock generation**

The proposed TDC is designed with the shifted clock sampling technique in the XC5VFX70T FPGA on the Virtex-5 ML507 board. This FPGA is built on a 65 nm copper process technology and contains 11,200 slices (each slice contains 4 storage elements, 4 function generators/arithmetic logic gates, large multiplexers, and a fast carry look-ahead chain), 296 18 Kb block RAMs (equivalent to 148 36 Kb blocks), 128 DSP slices, 6 CMTs, and global-clock multiplexer buffers, which make it well suited to high-performance designs [9].

The fine counter module consists of four parts: the shifted clock generator, the latch chain and encoding circuit, data storage and address matching, and data collection and analysis. A high-quality clock is the key to achieving good linearity in the signal sampling and quantization process.

There are 32 global clock lines in the Virtex-5 device that can clock all sequential resources on the whole device (CLBs, block RAM, CMTs, and I/O); they can also drive logic signals. Any 10 of these 32 global clock lines can be used in any region. Global clock lines are driven only by global clock buffers, which can also be used as clock enable circuits or glitch-free multiplexers. A global clock buffer is often driven by a Clock Management Tile (CMT) to eliminate the clock distribution delay or to adjust its delay relative to another clock [10].

Given the limit on global clock lines, we use the DCM and PLL to output four phase-shifted clocks; four more apparent phases are obtained by sampling on the falling edges.
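The following sketch shows how falling-edge sampling doubles the number of apparent phases. It assumes a 50% duty cycle, so that each falling edge lies 180° after the rising edge of the same clock, and it assumes the four generated clocks sit at 0°, 45°, 90°, and 135°; the text does not list the exact phases, so these are illustrative.

```python
def apparent_phases(rising_phases_deg):
    """Sampling phases available when both edges of each clock are used.

    Assumes a 50% duty cycle: the falling edge of a clock with phase p
    behaves like a rising edge at phase p + 180 degrees.
    """
    falling = [(p + 180) % 360 for p in rising_phases_deg]
    return sorted(rising_phases_deg + falling)


print(apparent_phases([0, 45, 90, 135]))
# [0, 45, 90, 135, 180, 225, 270, 315]
```

Four global clock lines therefore provide the eight evenly spaced sampling points of an 8-bin TDC.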

**Fig 4 Schematic of phase clock generator of 8-bin TDC**

**B. Latch chain and encoding circuit**

The most important part of the SCS TDC design is the latch chain and encoding circuit, which latches the input signal and locates the transition points.

**Fig 5 Multi-level latch architecture**

When asynchronous input signals are sampled, metastability must be taken into account. As shown in Fig 5, cascaded multi-level registers are used, and CLK0 resamples the eight latched signals into a single clock domain.

Once eight bits of data have been latched, e.g., “11100000”, “00000000”, or “11111111”, the transition point within a period is not easy to find because of the high-speed data flow. In addition, the dead time increases, and the fine count and coarse count data are not synchronized.

The coarse reference clock is 100 MHz and the fine count reference clock is 400 MHz. We propose a four-level buffer structure in which four cycles of fine count data are assembled into a 32-bit word that is then sampled by the coarse count clock. The input signals are sampled by the phase-shifted clocks through the latch chain, which outputs eight bin signals (Q0, Q45…Q315). These eight signals are then shifted through a four-level buffer array as follows. First, the eight signals are sampled in Floor1 and stored in bits 31~24. Next, this data is shifted by Floor2 while eight new signals are stored in bits 31~24. Finally, the data from four periods are sampled by the coarse clock and stored in a 32-bit register.
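The buffering steps above can be sketched as a software model of the shift array. The function name `shift_in` is hypothetical, and the shift direction (older samples move toward bit 0 while the newest sample occupies bits 31~24) is an assumption inferred from the description rather than taken from the HDL.

```python
def shift_in(word, sample, sample_bits=8, word_bits=32):
    """Move the stored samples down by one level and put the newest on top.

    word:   current contents of the buffer array
    sample: newest sample_bits-wide sampling result (highest phase = MSB)
    """
    return (word >> sample_bits) | (sample << (word_bits - sample_bits))


# Four consecutive fine-clock cycles within one coarse period.
# The input rises during the third cycle (pattern "11100000" = 0xE0)
# and stays high during the fourth (0xFF).
word = 0
for sample in [0x00, 0x00, 0xE0, 0xFF]:
    word = shift_in(word, sample)

# This 32-bit word is what the coarse clock would capture.
print(f"{word:032b}")  # 11111111111000000000000000000000
```

With this ordering, bit 0 is the earliest time slice of the coarse period and bit 31 the latest, so the lowest set bit (bit 21 here) marks the rising edge.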

**Fig 6 Schematic of sampling data buffer unit**

The 32-bit data sequence divides a coarse period equally into 32 (N) time slices. Whenever a rising edge of the input signal arrives, the transition from ‘0’ to ‘1’ is detected and the position of the leading edge within the coarse sampling clock period is determined using the edge detection formula (2):

$B_i^{\mathrm{new}} = B_i \wedge \lnot B_{i-1}, \quad 0 < i \le N-1$ (2)

To avoid missing input signals that arrive at i = 0, namely when the leading edge of the input signal is synchronous with CLK0, the same formula is used, but B(i−1) is replaced by the highest bit of the previous period (B(N−1)) to check whether a transition occurred.
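A minimal Python sketch of this edge detection, including the wrap-around case at i = 0, is given below. The function name and the 16-slice test pattern (input high in slices 5 to 11) are illustrative assumptions; the hardware evaluates all bits in parallel rather than in a loop.

```python
def detect_rising_edges(bits, prev_last=0):
    """Apply formula (2) to one period of sampled data.

    bits[i]   : sample of time slice i (0 or 1), i = 0 .. N-1
    prev_last : highest bit B(N-1) of the previous period, used for i = 0
    Returns a list in which 1 marks a '0' -> '1' transition.
    """
    out = []
    for i, b in enumerate(bits):
        before = prev_last if i == 0 else bits[i - 1]
        out.append(b & (1 - before))  # B_i AND NOT B_(i-1)
    return out


samples = [0] * 5 + [1] * 7 + [0] * 4   # high during slices 5..11
decoded = detect_rising_edges(samples)
print(decoded)
print("leading edge at slice", decoded.index(1))  # 5

# Edge aligned with the start of the period: only detected because the
# last bit of the previous period was 0.
print(detect_rising_edges([1, 1, 0, 0], prev_last=0))  # [1, 0, 0, 0]
print(detect_rising_edges([1, 1, 0, 0], prev_last=1))  # [0, 0, 0, 0]
```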

Table 1 shows an example of the relationship between the time slices in a coarse period, their time values, and their positions in a 16-bit word. A data string is transformed by formula (2); a decoded value of ‘1’ marks the location of a transition from ‘0’ to ‘1’.

The pulse starts at position 5 and ends at position 11; all we need to detect is the transition from ‘0’ to ‘1’.

The table shows that the fine time difference between the hit’s leading edge and the reference clock’s leading edge is 5 time slice widths (the pulse width is 7 time slice widths). A time slice is the LSB of the fine count module, and its value is Tfineref/N, where N is the number of phase-shifted output clocks.

Whenever a leading edge is detected, the time information consists of the current coarse counter value and the current transition point within the reference clock period (the fine counter value). A trigger management unit handles stamp matching, so that only the time stamps related to the input signal are kept. The rising edges of the input signal are counted, and the count is turned into the RAM write address and write enable signals. The measurement system needs three memory modules: the fine and coarse counter units each need a RAM, and the counter values stored in the RAMs then pass through a FIFO for output and analysis, for example display on the LCD1602 on the carrier board or transfer over a connection (RS232, USB 2.0) to a computer for DNL and INL analysis.

The data are stored in the RAMs in address order, one address per transition point, so the time interval between any two transition points is easily obtained using formula (3):

$\Delta T = \Delta N_{\mathrm{coarse}} \cdot T_{\mathrm{coref}} - \Delta n_{\mathrm{fine}} \cdot \mathrm{LSB}$ (3)

The coarse reference clock period Tcoref and the fine counter bin width LSB are determined by the system clock; all we need to do is take the differences of the values stored in the coarse RAM (∆Ncoarse) and the fine RAM (∆nfine), respectively.
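Formula (3) can be sketched in Python as follows. The function name, the tuple layout of a time stamp, and the example stamp values are hypothetical. The sign of the fine term is taken from formula (3) as written, which corresponds to a fine value that plays the role of ∆t1 and ∆t2 in equation (1); a design that counts the fine value in the opposite direction would use the opposite sign.

```python
def interval_from_stamps_ps(start, stop, t_coref_ps, lsb_ps):
    """Formula (3): dT = dN_coarse * Tcoref - dn_fine * LSB.

    start, stop: (coarse_count, fine_count) pairs read from the two RAMs.
    Differences are taken as stop minus start.
    """
    d_coarse = stop[0] - start[0]
    d_fine = stop[1] - start[1]
    return d_coarse * t_coref_ps - d_fine * lsb_ps


T_COREF_PS = 10_000.0   # 100 MHz coarse reference clock
LSB_PS = 156.25         # 2.5 ns fine period divided by 16 phases

# Illustrative stamps only: 3 coarse periods apart, fine values 10 and 4.
dt = interval_from_stamps_ps(start=(100, 10), stop=(103, 4),
                             t_coref_ps=T_COREF_PS, lsb_ps=LSB_PS)
print(dt, "ps")  # 30937.5 ps
```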

To obtain more reliable time interval measurements, additional design effort is required. Even though the clock phase-shift variations are well controlled by the CMT, the routing skew of the input signal must be taken into consideration. Static timing analysis (STA), appropriate timing and area constraints, and place and route (P&R) in PlanAhead are used to limit the skew and the delay of the input signal to the registers.

**Fig 7 The SCS TDC**

**Results and Discussions**

For an eight-bin TDC, the time bin (LSB) is 312.5 ps when the fine reference clock is 400 MHz, so we finally implemented 16 phase-shifted clocks, which gives an LSB of 156 ps. Eight clocks are generated by two PLLs, and falling-edge-triggered flip-flops sampling the input signal effectively provide another eight phase-shifted clocks. The total measurement range is 10 µs with a 1K RAM.

Nonlinearity is determined with a statistical code density test. Because SCS consumes few resources, a self-oscillating source can be built on chip, with its frequency easily controlled by a delay chain made up of MUXCY elements. A large number of random pulses are passed through the fine counter module, and the counts falling in each time bin are histogrammed. Fig 8 shows that the worst DNL is about 0.3 LSB and the INL is smaller than 0.6 LSB for 16 phase-shifted clocks extended to 64 bin slices.
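The usual way to turn a code density histogram into DNL and INL values is sketched below. This is the standard textbook calculation, not necessarily the exact processing used for Fig 8, and the four-bin histogram is made-up illustrative data.

```python
def dnl_inl(histogram):
    """DNL and INL (in LSB) from a code density histogram.

    With uniformly distributed random hits, every bin of an ideal TDC
    collects the same number of counts. DNL is the relative deviation of
    each bin from that ideal count; INL is the running sum of the DNL.
    """
    expected = sum(histogram) / len(histogram)
    dnl = [h / expected - 1.0 for h in histogram]
    inl = []
    acc = 0.0
    for d in dnl:
        acc += d
        inl.append(acc)
    return dnl, inl


hist = [1000, 1100, 900, 1000]   # illustrative counts per time bin
dnl, inl = dnl_inl(hist)
print("DNL:", [round(d, 3) for d in dnl])
print("INL:", [round(v, 3) for v in inl])
print("worst |DNL|:", round(max(abs(d) for d in dnl), 3))
```

A bin that is 10% wider than ideal collects 10% more hits and therefore shows a DNL of +0.1 LSB.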

**Fig 8 Differential Nonlinearity and Integral Nonlinearity**

To determine the time resolution, we used a cable delay test. A fixed-frequency square wave signal is fed into the latch chain, and the distribution of the measured differences is collected. The time resolution is taken as the RMS value: for a given delay length, the measured values are compared with the theoretical value to obtain the standard deviation. According to reference [11], the TDC precision is approximately 1/√6 of the LSB; namely, the RMS of a single time delay measurement is 0.408 LSB ≈ 64 ps.
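The factor 1/√6 can be motivated by a standard quantization-noise argument, sketched here under the assumption that the start and stop edges carry independent quantization errors, each uniformly distributed over one LSB (standard deviation LSB/√12). This is the textbook derivation and may differ in detail from the treatment in [11].

```latex
\sigma_{\Delta T}
  = \sqrt{\sigma_q^2 + \sigma_q^2}
  = \sqrt{2}\,\frac{\mathrm{LSB}}{\sqrt{12}}
  = \frac{\mathrm{LSB}}{\sqrt{6}}
  \approx 0.408 \times 156.25\ \mathrm{ps}
  \approx 63.8\ \mathrm{ps}
```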

**Conclusion**

The TDC with the shifted clock sampling technique has been implemented successfully in the XC5VFX70T on the ML507 carrier board. Sixteen phase-shifted clocks were generated by the CMT, the latch chain used multi-level latches to reduce the probability of metastable states, and a 64-bit four-level buffer array was used for data caching. This reduced the dead time and the clock rate, so that the coarse counter reference clock frequency could be 100 MHz and a large dynamic range of around 10 µs was achieved. Additionally, a trigger management unit was added for stamp matching. The fine counter clock frequency was extended to 400 MHz, resulting in a TDC LSB of 156 ps. The worst DNL was less than 0.3 LSB, the INL was below 0.6 LSB, and the single-shot RMS resolution, measured with a fixed cable delay test, was about 64 ps.

**References:**

[1] J.P. Jansson, A. Mantyniemi, J. Kostamovaara, A CMOS time-to-digital converter with better than 10 ps single-shot precision. IEEE J. Solid-State Circuits, 41(6): 1286–1296, 2006.

[2] K. Karadamoglou, N. Paschalidis, N. Stamatopoulos, et al., A CMOS time to digital converter for space science instruments. Proceedings of the 28th European Solid-State Circuits Conference (ESSCIRC 2002), pp. 707–710, 2002.

[3] B. Markovic, S. Tisa, F.A. Villa, et al., A High-Linearity, 17 ps Precision Time-to-Digital Converter Based on a Single-Stage Vernier Delay Loop Fine Interpolation. IEEE Trans. Circuits Systems I Regular Papers, 60(3): 557-569, 2013.

[4] J.P. Jansson, V. Koskinen, A. Mantyniemi, et al., A Multichannel High-Precision CMOS Time-to-Digital Converter for Laser-Scanner-Based Perception Systems. IEEE Trans. Instrumentation Measurement, 61(9): 2581-2590, 2012.

[5] J. Wang, S. Liu, L. Zhao, et al., The 10-ps Multi-time Measurements Averaging TDC Implemented in an FPGA. IEEE Trans. Nuclear Science, 58(4): 2011-2018, 2011.

[6] B. So, M.W. Hall, P.C. Diniz, A Compiler Approach to Fast Hardware Design Space Exploration in FPGA-based Systems. ACM SIGPLAN Notices, 37(5): 165–176, 2002.

[7] J. Torres, A. Aguilar, R. G. Olcina, et al., Time-to-Digital Converter Based on FPGA With Multiple Channel Capability. IEEE Trans. Nuclear Science, 61(1): 107-114, 2014.

[8] M. Büchele, H. Fischer, M. Gorzellik, et al., A 128-channel Time-to-Digital Converter (TDC) inside a Virtex-5 FPGA on the GANDALF module, Journal of Instrumentation, 7(03): C03008, 2012.

[9] Xilinx, Inc., DS100: Virtex-5 Family Overview (2009).

[10] Xilinx, Inc., UG190: Virtex-5 FPGA User Guide (2009).

[11] F. Baronti, L. Fanucci, D. Lunardini, et al., On the differential nonlinearity of time-to-digital converters based on delay-locked-loop delay lines. IEEE Trans. Nuclear Science, 48(6):2424 – 2431, 2001.