# A Clock Distribution and Synchronization Scheme Over Optical Links for Large-Scale Physics Experiments

Yang, Yonggang Wang, Member, IEEE, Jie Kuang, and Bo Wu

Abstract— In large-scale physics experiments, there is a trend to implement clock distribution and synchronization over multi-Gigabit serial links on optical networks. The accuracy of clock synchronization in such a fashion is mainly determined by two aspects: the stability of the clock distribution over the links and the mechanism employed for clock synchronization. To achieve a synchronization accuracy of tens of picoseconds, this article proposes a new optical fiber link and synchronization scheme based on optical circulators and serial transceivers embedded in field-programmable gate arrays (FPGAs). Benefiting from the single-wavelength fiber connection, high-precision time-to-digital converters (TDCs), and the phase interpolator (PI) sub-block in state-of-the-art Serializer–Deserializer (SerDes) transceivers on FPGAs, the offset of the distributed clocks can be measured on-line and precisely compensated. The time synchronization performance is evaluated on a prototype system with a three-level structure. Through multiple power or reset cycles and long-term operation tests, the clock synchronization accuracy over a 5-km fiber connection is measured to be better than 15 ps, and the recovered clocks at nodes on different levels show comparable jitter, no higher than 4.2 ps. In addition, the achieved performance is shown to be insensitive to the operating temperature and the connection distance.

*Index Terms*— Clock distribution and synchronization, field-programmable gate array (FPGA), optical circulator, Serializer–Deserializer (SerDes), time-to-digital converter (TDC).

## I. INTRODUCTION

IN TIME-RELATED large-scale physics experiments or control systems, a large number of distributed nodes need to be synchronized, which requires a network that broadcasts the central clock event so that it reaches all nodes after a fixed travel time. Unlike conventional solutions that use a dedicated network just for time synchronization, several projects, such as the Large Hadron Collider (LHC) experiments [1], the Compressed Baryonic Matter (CBM) experiment [2], the Large High Altitude Air Shower Observatory (LHAASO) experiment [3], and even positron emission tomography (PET) scanners [4], have long focused on embedding the clock distribution into the fiber-link-based data stream network. Using a single bidirectional fiber link for all network functions satisfies some special demands of these experiments, for example, limited hardware space and long-distance connections.

Manuscript received November 16, 2020; revised January 28, 2021 and March 19, 2021; accepted March 30, 2021. Date of publication April 2, 2021; date of current version June 14, 2021. This work was supported in part by the National Natural Science Foundation of China (NSFC) under Grant 11735013 and in part by the Chinese Academy of Science under Grant GJJSTD20200008.

The authors are with the State Key Laboratory of Particle Detection and Electronics, University of Science and Technology of China, Hefei 230026, China, and also with the Modern Physics Department, University of Science and Technology of China, Hefei 230026, China (e-mail: wangyg@ustc.edu.cn).

Color versions of one or more figures in this article are available at https://doi.org/10.1109/TNS.2021.3070589.

Digital Object Identifier 10.1109/TNS.2021.3070589

Such a unified interconnection network must efficiently support four types of communication: clock distribution, time synchronization, control messages, and data streams. The popular choice for constructing such networks is currently off-the-shelf high-speed Serializer–Deserializer (SerDes) transceivers embedded in field-programmable gate arrays (FPGAs). Because these transceivers are typically designed for applications that can tolerate latency variations, using them for clock distribution, which requires fixed latency, has led to studies of the timing properties of transceivers and of the techniques needed to keep the data-path latency constant, in particular after a reset, a loss of lock, or a power cycle [5], [6]. Recent work [1] proposed new solutions to achieve picosecond-level phase stability with Xilinx UltraScale transceivers.

The bidirectional packet transmission capability of a serial network makes time synchronization possible regardless of the distance between nodes. In fact, on general-purpose networks such as LANs, dedicated packet-based protocols have already been developed to synchronize nodes as accurately as possible. For example, the network time protocol (NTP) [7] allows time synchronization in the range of some microseconds, and the precision time protocol (PTP, IEEE 1588) [8] allows more precise synchronization. The White Rabbit extension of PTP (WR-PTP) [9], initially developed to serve accelerators at CERN, has become a widely used high-accuracy synchronization protocol in scientific installations [10]. Based on standard Ethernet, a White Rabbit network can provide sub-nanosecond accuracy and synchronization precision of tens of picoseconds [11], [12]. However, the conventional WR method uses lasers of two different wavelengths for duplex transmission over a single-mode fiber. Because of chromatic dispersion, the uplink and downlink transmission delays differ. This asymmetry complicates the calibration process and results in an uncertainty of hundreds of picoseconds in time distribution over a 5-km fiber [13].

The next-generation particle accelerators with high energy and luminosity require detectors with unprecedented time and position resolutions [14], [15]. To achieve a system time resolution of 20–30 ps in these new detector systems, the contribution from clock distribution and synchronization must be at the level of 10 ps, which can only be achieved with special hardware support and a more precise synchronization scheme. In this article, we present a new fiber link architecture and a time synchronization mechanism to meet this requirement. State-of-the-art SerDes transceivers in Xilinx UltraScale FPGAs and optical circulators are employed in this scheme. The former offer higher phase determinism and can fine-tune the link skew with a resolution of about 1 ps, and the latter significantly improve the symmetry of the transmission path, which allows the transmission latency to be measured, monitored, and compensated precisely in real time. The time synchronization accuracy and precision are evaluated experimentally on a prototype system with up to three levels of hierarchy, demonstrating that the proposed method is effective for achieving high-performance clock distribution.

0018-9499 © 2021 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.

See https://www.ieee.org/publications/rights/index.html for more information.

Fig. 1. Composition of the system distributing the clock from a master board to a slave board over an optical fiber link.

## II. LINK ARCHITECTURE AND SYNCHRONIZATION PRINCIPLE

### A. Link Architecture

The link architecture and system composition are explained below in conjunction with the prototype implementation, as shown in Fig. 1. The main clock of the master board is sent to the slave board over an optical-circulator-based fiber link. On the master board, a Si570 crystal oscillator (Silicon Laboratories) generates a 156.25-MHz clock whose jitter is further reduced by a narrow-band phase-locked loop (PLL), the Si5344, before it enters the FPGA (Xilinx Kintex UltraScale XCKU040) as the master main clock. Inside the transmitter (Tx) module of the FPGA, the master main clock is embedded into the 3.125-Gb/s serial data stream transmitted to the slave board through the fiber link. The Small Form-factor Pluggable (SFP, 10Gtek-1310 nm-10 km) transceivers on the two boards perform electrical-to-optical and optical-to-electrical conversion, modulating and demodulating the data bits on light of the same wavelength (1310 nm). On the slave board, the embedded clock is recovered from the received data stream by the receiver (Rx) module. This clock, called the recovered clock of the slave board, is output by the FPGA to a Si5344 jitter cleaner, which feeds the purified clock back to the FPGA, where it is used as the main clock of the slave board. To synchronize the master and slave main clocks, the slave main clock is transmitted back to the master board in the opposite direction. Similarly, the clock recovered by the Rx module on the master board is called the recovered clock of the master board.

The optical connection between the SFPs on the two boards consists of two optical circulators connected by a single-mode fiber (G.652). The optical circulator is a passive three-port optical device with fixed transmission paths, as shown in Fig. 1: port 1 is input only, port 3 is output only, and port 2 is bidirectional. Light entering port 1 exits only from port 2, and light entering port 2 exits only from port 3. The circulators allow the uplink and downlink to use the same optical wavelength over a single fiber, which guarantees identical propagation delays in both directions. Keeping the transmission paths in the two directions symmetric is essential for high synchronization accuracy, as discussed further in Section III-A.

The time-to-digital converter (TDC) module on each board measures the phase difference between the recovered clock and the main clock of that board. Using the average of 20 000 phase-difference measurements, the phase interpolator (PI) in the Tx module of the master board is adjusted to compensate for the one-way transmission delay and thereby align the phases of the two main clocks. During the calibration process, an oscilloscope is required to monitor the phase difference between the two main clocks.

The principle of the FPGA-based TDC can be briefly explained as a clock counter combined with time interpolation within each clock cycle by a tapped delay line (TDL) [16]. The hit signal propagates along the TDL, and the TDL state is sampled at every clock edge. With the system clock as the reference, the counter readout is the coarse timestamp of the hit, and the position of the hit signal on the TDL is its fine timestamp; combining the two gives the final measurement. Because modern FPGAs provide abundant predefined logic resources, TDCs based on a simple TDL structure are adequate to reach the required performance. Using an UltraScale FPGA, the TDC modules in this article are implemented with 4-ps rms precision and a 4-ns measurement dead time. Furthermore, the TDL-based TDC has been shown to port well, both between FPGA families [17] and between FPGA vendors [18].
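The coarse/fine combination can be illustrated with a minimal Python sketch. It is not the authors' firmware: it assumes a uniform tap delay (the hypothetical constant `TAP_PS`; real TDL taps are nonuniform and are normally calibrated bin by bin) and assumes that the fine code counts how many taps the hit has already traversed when the clock edge samples the line, so the hit occurred that much time before the edge.

```python
T_PERIOD_PS = 6400.0  # 156.25-MHz main clock period, in ps
TAP_PS = 5.0          # hypothetical uniform tap delay; real taps are nonuniform


def tdl_timestamp(coarse_count: int, taps_traversed: int,
                  tap_ps: float = TAP_PS,
                  t_period_ps: float = T_PERIOD_PS) -> float:
    """Combine coarse and fine timestamps into a hit time in ps.

    coarse_count:   counter value at the clock edge that sampled the TDL
    taps_traversed: taps the hit had propagated through at that edge,
                    i.e., how long before the edge the hit arrived
    """
    if not 0 <= taps_traversed * tap_ps < t_period_ps:
        raise ValueError("fine code must span less than one clock period")
    return coarse_count * t_period_ps - taps_traversed * tap_ps


# A hit that had crossed 100 taps when the edge with count 42 sampled the line
print(tdl_timestamp(42, 100))  # 268300.0 (= 42 * 6400 - 100 * 5)
```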

To measure the phase difference between the main clock and the recovered clock on both boards, each TDC module has two identical TDC channels: one channel receiving the on-board main clock and the other receiving the on-board recovered clock. The difference between the two measured timestamps is the phase difference between the two clocks.
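A sketch of how the two-channel phase difference and the 20 000-sample average of Section II-A might be computed follows. Because a phase close to 0 can also read as a value close to $T_{\text{period}}$, a plain arithmetic mean can be off by half a period; the sketch therefore uses a circular mean. This wrap-aware averaging is our assumption, not a detail given in the text, and the sample values are hypothetical.

```python
import cmath
import math

T_PERIOD_PS = 6400.0  # 156.25-MHz main clock period


def phase_diff(t_main_ps: float, t_rec_ps: float,
               t_period_ps: float = T_PERIOD_PS) -> float:
    """Recovered-clock phase relative to the main clock, reduced modulo T_period."""
    return (t_rec_ps - t_main_ps) % t_period_ps


def circular_mean(phases_ps, t_period_ps: float = T_PERIOD_PS) -> float:
    """Mean of phases treated as points on a circle of circumference T_period."""
    z = sum(cmath.exp(2j * math.pi * p / t_period_ps) for p in phases_ps)
    if abs(z) < 1e-9:
        raise ValueError("phases are spread evenly; the mean is undefined")
    return (cmath.phase(z) / (2 * math.pi) * t_period_ps) % t_period_ps


# Hypothetical readings scattered around the 0 / T_period wrap point
samples = [6395.0, 6398.0, 2.0, 5.0]
print(sum(samples) / len(samples))  # 3200.0: the naive mean lands half a period off
print(circular_mean(samples))       # approximately 0 ps (equivalently 6400 ps)
```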

In the FPGAs, each SerDes transceiver includes independent Tx and Rx blocks. The Tx block receives the parallel data sequence clocked by the on-board main clock and converts it into a high-speed serial data stream for transmission. The Rx block receives the serial data stream and extracts the embedded bit clock from it. The serial data stream is converted back into a parallel data sequence, and the recovered clock is generated from the extracted bit clock [19]. In the Tx path of its GTH (Gigabit Transceiver H) transceivers, Xilinx has introduced a sub-block called the PI. The Tx PI is a dynamically controllable, precise phase shifter with two main modes of operation: 1) a continuous update mode, which can be used to receive data at a given frequency and re-emit it at a slightly different frequency with a precisely known ppm offset from the Rx, and 2) a stepping mode, which allows a controllable phase relationship between the reference clock and the Tx output of the GTH transceiver [20]. In our design, the stepping mode is used to adjust the phase difference between the on-board main clock and the transmitted bit clock. The Tx PI has a minimum adjustment step of 1.25 ps, which plays an important role in our synchronization scheme (see Section III-A). The phase lock modules on both boards monitor the phase difference between the main clock and the transmitted clock (the clock after the Tx PI) and keep it constant through reset and initialization, as described in detail in Section III-B.

Selecting an appropriate clock source is important for high-performance clock distribution and synchronization. The Si570 from Silicon Laboratories is a low-cost, low-jitter oscillator with programmable frequency, well suited to this research and development (R&D) work. The rms phase jitter of the Si570 output clock is less than 1 ps [21]. Jitter added along the clock distribution path degrades performance and must be limited. The Si5344 is a clock multiplier that can provide multiple clocks at different frequencies with jitter attenuation for applications requiring high clock performance; the rms phase jitter of its output clock can be less than 200 fs [22]. In our design, the Si5344 is configured as a jitter cleaner in zero-delay mode, that is, its input and output clocks are aligned with zero phase offset.

### B. Coarse Timestamp Synchronization

The coarse counters on each board are driven by that board's main clock to generate the coarse timestamps. Any time $t$ on either board can be expressed as

$$t = N T_{\text{period}} + T_{\text{phase}} \tag{1}$$

where $N$ is the coarse timestamp, $T_{\text{period}}$ is the main clock period, which is 6.4 ns in our case, corresponding to a frequency of 156.25 MHz, and $T_{\text{phase}}$ is the phase of $t$ within the current clock period.

Fig. 2. SYNC packet exchange process.

The coarse timestamp alignment is the first step of the clock synchronization process. The offset of the coarse counters on the two boards is unknown initially. It needs to be determined through a SYNC packet exchange process (Fig. 2). The master board sends a SYNC packet to the slave board and records the counter value as  $N_1$ . When the SYNC packet is detected on the slave board,  $N_2$  is recorded. Similarly, two coarse timestamps  $N_3$  and  $N_4$  can be obtained when the SYNC packet is sent back to the master board from the slave board.

The round-trip coarse time delay can be expressed as

$$\text{delay}_{mm} = (N_4 - N_1)T_{\text{period}} - (N_3 - N_2)T_{\text{period}} = K \times T_{\text{period}}. \tag{2}$$

Because the transmission link and the hardware circuits are symmetric, the delays in the two directions are equal, and the one-way coarse time delay is

$$\text{delay}_{ms} = \frac{1}{2}\,\text{delay}_{mm} = \frac{1}{2} K \times T_{\text{period}}. \tag{3}$$

*K* should be an even number because of the symmetry of the transmission link. But if the accumulation of the phase delay reaches  $T_{\text{period}}$ ,  $N_3$  or  $N_4$  will increase by 1, and *K* will be an odd number. The offset of the coarse counters on the two boards is calculated as

$$\text{offset} = N_2 - N_1 - \left\lfloor \frac{K}{2} \right\rfloor \tag{4}$$

where $\lfloor x \rfloor$ denotes the integral part of $x$.
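Equations (2)–(4) reduce to a few integer operations. The Python sketch below assumes a hypothetical exchange in which the slave counter leads the master counter by 1000 counts, the one-way delay is exactly four clock periods, and the slave returns the SYNC packet six counts after receiving it.

```python
T_PERIOD_NS = 6.4  # main clock period


def coarse_sync(n1: int, n2: int, n3: int, n4: int):
    """Coarse-counter alignment from one SYNC round trip, eqs. (2)-(4).

    n1: master count when SYNC leaves the master
    n2: slave count when SYNC arrives at the slave
    n3: slave count when SYNC is sent back
    n4: master count when SYNC arrives back at the master
    Returns (K, offset, k_is_odd).
    """
    k = (n4 - n1) - (n3 - n2)   # round-trip delay in clock periods, eq. (2)
    if k < 0:
        raise ValueError("negative round-trip delay: check counter values")
    offset = n2 - n1 - k // 2   # eq. (4); k // 2 is the integral part of K/2
    return k, offset, k % 2 == 1


k, offset, odd = coarse_sync(n1=100, n2=1104, n3=1110, n4=114)
print(k, offset, odd)         # 8 1000 False
print(k * T_PERIOD_NS / 2)    # one-way coarse delay of eq. (3): 25.6 ns
```

An odd $K$ (for example, `n4=115` in the same scenario) yields the same offset but sets the flag that triggers the one-period correction of $T_m$ in Section II-C.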

### C. Phase Synchronization

After the coarse timestamp is calibrated, the phase difference between the main clock and the recovered clock on each board is measured with the TDC for phase synchronization. With the definitions of time parameters as shown in Fig. 3, the readout of the TDC on the master board can be expressed as

$$T_m = (O + \delta_{ms} + T_s + \delta_{sm}) \bmod T_{\text{period}}. \tag{5}$$



Fig. 3. Propagation delay model.

Here, $O = T_{\text{pim}} + O_{txm} + O_{rxs} + T_{\text{pis}} + O_{txs} + O_{rxm}$ represents all the delays inside the SerDes transceivers over the round trip, $\delta_{ms}$ and $\delta_{sm}$ are the fiber propagation delays from master to slave and from slave to master, respectively, and $T_s$ is the readout of the TDC on the slave board. If $K$ obtained during the coarse timestamp synchronization is odd, one extra clock period $T_{\text{period}}$ is added to the measured $T_m$.
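The role of the odd-$K$ correction can be seen from a short calculation. Assuming, as an idealization, that the coarse round trip $K$ counts exactly the whole clock periods in the round-trip delay $R$ and that $T_m = R \bmod T_{\text{period}}$, we have $R = K T_{\text{period}} + T_m$; halving it gives $\lfloor K/2 \rfloor T_{\text{period}} + (T_m + T_{\text{period}})/2$ when $K$ is odd. Adding one period to $T_m$ for odd $K$ therefore keeps the fine half-delay consistent with the coarse $\lfloor K/2 \rfloor$ of (4). The numbers below are hypothetical.

```python
T_PERIOD_PS = 6400.0  # main clock period


def half_round_trip(k: int, t_m_ps: float, t_period_ps: float = T_PERIOD_PS) -> float:
    """One-way delay (ps) from coarse K and fine TDC value T_m.

    Assumes R = K * T_period + T_m with 0 <= T_m < T_period (idealized counters).
    """
    t_m_corr = t_m_ps + t_period_ps if k % 2 else t_m_ps  # odd K: add one period
    return (k // 2) * t_period_ps + t_m_corr / 2


# Compare with the direct half of R for both parities of K
for k in (7, 8):
    r = k * T_PERIOD_PS + 1000.0
    assert half_round_trip(k, 1000.0) == r / 2
print(half_round_trip(7, 1000.0))  # 22900.0
```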

To determine the contribution of these delay parameters to the round trip, the two boards are first connected with a 3-m fiber. The two TDC values in this case are measured as $T_{m,3m}$ and $T_{s,3m}$, respectively. Since the fiber propagation delays in the two directions are equal, that is, $\delta_{ms,3m} = \delta_{sm,3m}$, we get

$$T_{m,3m} - T_{s,3m} = 2\delta_{ms,3m} + O. \tag{6}$$

Next, the PI value in the Tx of the master board is adjusted until the clock skew measured by the oscilloscope is 0, which means that the two main clocks are exactly synchronized. The two TDC values under this condition are recorded as $T'_{m,3m}$ and $T'_{s,3m}$, respectively, and (6) is rewritten as

$$T'_{m,3m} - T'_{s,3m} = \Delta T_{\text{pim},3m} + 2\delta_{ms,3m} + O. \tag{7}$$

$\Delta T_{\text{pim},3m}$ is the adjustment of the PI. Since the two main clocks are now exactly in phase, the one-way phase delay has been completely compensated by $\Delta T_{\text{pim},3m}$, that is,

$$O_{txm} + T_{\text{pim}} + \Delta T_{\text{pim},3m} + \delta_{ms,3m} + O_{rxs} + T'_{s,3m} = 0. \tag{8}$$

From (7) and (8), $T'_{m,3m}$ can be expressed as

$$T'_{m,3m} = O_{txs} + \delta_{ms,3m} + O_{rxm} + T_{\text{pis}}. \tag{9}$$

Now, the two boards are connected by a fiber of length $x$, as in the actual application. We first carry out the coarse timestamp synchronization introduced in Section II-B and then perform the phase synchronization as follows.

With the measured TDC values $T_{m,x}$ and $T_{s,x}$, we have

$$T_{m,x} - T_{s,x} = 2\delta_{ms,x} + O. \tag{10}$$

The phase synchronization is completed by adjusting the Tx PI on the master board. After this adjustment, the readout of the TDC on the master board, $T'_{m,x}$, should satisfy

$$\begin{aligned}
T'_{m,x} &= O_{txs} + \delta_{ms,x} + O_{rxm} + T_{\text{pis}} \\
&= T'_{m,3m} + \delta_{ms,x} - \delta_{ms,3m} \\
&= T'_{m,3m} + \frac{1}{2}\left(T_{m,x} - T_{s,x} - O\right) - \frac{1}{2}\left(T_{m,3m} - T_{s,3m} - O\right) \\
&= T'_{m,3m} + \frac{1}{2}\left[\left(T_{m,x} - T_{s,x}\right) - \left(T_{m,3m} - T_{s,3m}\right)\right].
\end{aligned} \tag{11}$$

Equation (11) shows that $T'_{m,x}$ can be determined entirely from measured TDC values. Therefore, the phase synchronization can be implemented by adjusting the Tx PI on the master board until the measured TDC result equals $T'_{m,x}$.
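The calibration and field steps of (6)–(11) can be checked numerically with a toy model of Fig. 3. In this Python sketch, the fixed transceiver and jitter-cleaner delays are random hypothetical values, the fiber delay per meter (`PS_PER_M`) is an assumed figure, the oscilloscope null of the 3-m calibration is emulated directly, and phases are kept unwrapped; that is, whole clock periods are assumed to be resolved by the coarse step and the odd-$K$ rule. A Tx PI shift is assumed to move $T_m$ one-to-one, and the required shift is rounded to whole 1.25-ps PI steps.

```python
import random

T_PERIOD = 6400.0  # ps, 156.25-MHz main clock period
PI_STEP = 1.25     # ps, minimum Tx PI step (Section III-A)
PS_PER_M = 4.9e3   # hypothetical fiber group delay, about 4.9 ns/m


def wrap(x: float, t: float = T_PERIOD) -> float:
    """Map a phase to the interval [-t/2, t/2)."""
    return (x + t / 2) % t - t / 2


def readouts(hw: dict, fiber_ps: float, pim_shift: float = 0.0):
    """Unwrapped TDC readouts (T_m, T_s) following eq. (5); the fiber is symmetric."""
    t_s = hw["jc"]  # slave TDC sees the off-chip jitter-cleaner delay
    o = hw["pim"] + hw["txm"] + hw["rxs"] + hw["pis"] + hw["txs"] + hw["rxm"]
    return pim_shift + o + 2 * fiber_ps + t_s, t_s


def slave_phase(hw: dict, fiber_ps: float, pim_shift: float) -> float:
    """Slave main clock phase relative to the master main clock (LHS of eq. (8))."""
    return hw["txm"] + hw["pim"] + pim_shift + fiber_ps + hw["rxs"] + hw["jc"]


# Hypothetical fixed delays of the two boards (ps), deliberately unequal
rng = random.Random(1)
hw = {k: rng.uniform(0.0, 3000.0)
      for k in ("pim", "txm", "rxs", "pis", "txs", "rxm", "jc")}

# 3-m calibration: the oscilloscope null is emulated by cancelling slave_phase
f3 = 3 * PS_PER_M
tm3, ts3 = readouts(hw, f3)
dt3 = -wrap(slave_phase(hw, f3, 0.0))  # Delta T_pim,3m
tm3_p, _ = readouts(hw, f3, dt3)       # T'_m,3m

# Field link of length x: target master TDC value from eq. (11)
fx = 5000 * PS_PER_M
tmx, tsx = readouts(hw, fx)
target = tm3_p + 0.5 * ((tmx - tsx) - (tm3 - ts3))
dtx = round(wrap(target - tmx) / PI_STEP) * PI_STEP  # whole PI steps only

residual = wrap(slave_phase(hw, fx, dtx))
print(f"residual skew: {residual:.3f} ps")  # bounded by PI_STEP / 2 in this model
```

In this idealized model the fixed electronic delays need not be equal, because the 3-m calibration on the same pair of boards absorbs them; what the derivation relies on is $\delta_{ms} = \delta_{sm}$ for the fiber. Drifts of the electronic delays with temperature or restarts, discussed in Section III, are not modeled.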

## III. KEY PREREQUISITES

To obtain high time synchronization performance, two prerequisites must be met, and the corresponding solutions are discussed below.

### A. Symmetry of Transmission Link

The synchronization scheme introduced in Section II assumes that the signal transmission delays in the two directions are equal, that is,

$$T_{\text{pim}} + O_{txm} + O_{rxs} + \delta_{ms} = T_{\text{pis}} + O_{txs} + O_{rxm} + \delta_{sm}. \tag{12}$$

First, the optical signal transmission path and the transmission delay between the two SFPs should be as symmetric as possible. For this purpose, two optical circulators are used with the fiber connection, as shown in Fig. 1. Because both directions share the same fiber, the physical path length from the master to the slave is the same as that from the slave to the master. In addition, since the light wavelengths used in the two directions are the same, the transmission delays of the two paths are naturally equal, that is, $\delta_{ms} = \delta_{sm}$.

Second, the master board and the slave board must use the same type of FPGA, which includes the PI block. The electronic circuits inside the FPGA, as well as the routing on the printed circuit board (PCB), are also made as symmetric as possible to satisfy $O_{txm} + O_{rxs} = O_{txs} + O_{rxm}$. In addition, the phase lock modules on the two boards (Section III-B) guarantee that $T_{\text{pim}} = T_{\text{pis}}$. Because electronic transmission delays and TDC measurements are sensitive to the ambient temperature and the power supply, both must be kept as stable as possible. If the two boards operate at different temperatures, the symmetry is impaired to some extent, which may introduce synchronization errors.
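The effect of violating (12) can be quantified with a one-line model of any two-way scheme: the round trip is measured and half of it is taken as the one-way delay, so a forward/backward mismatch leaves a residual offset of half the difference. The Python sketch below illustrates this; the delay values are hypothetical.

```python
def sync_error_ps(forward_ps: float, backward_ps: float) -> float:
    """Clock offset left after compensating half the measured round trip.

    A two-way scheme measures forward + backward and assumes each direction
    takes half of it, so the slave ends up off by (forward - backward) / 2.
    """
    estimated_one_way = (forward_ps + backward_ps) / 2
    return forward_ps - estimated_one_way


# A 20-ps uncompensated asymmetry on a ~25-us (5-km) link
print(sync_error_ps(25_000_020.0, 25_000_000.0))  # 10.0
```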

Third, since the recovered clock on the slave board is routed out of the FPGA for jitter cleaning, the symmetry of the circuits and signal traces is slightly broken. Adding a TDC module on the slave board can partially correct this asymmetry. The TDC value $T_m$ represents the round-trip accumulation of phase delay, and $T_s$ represents the extra delay of the jitter cleaning path. Subtracting $T_s$ from $T_m$ makes the calculation of the one-way phase delay more symmetric. The measurement precision of the TDCs limits the synchronization accuracy.

Last, the integrated PI in the SerDes transceivers of the FPGA provides picosecond-level phase adjustment, which greatly simplifies the system design because no extra phase-adjustment circuit is needed. The phase-adjustment step size is determined by the period of the reference clock and the width of the parallel data entering the parallel-in-serial-out (PISO) module in the Tx path:

$$\text{STEP\_SIZE} = \frac{\text{TXPIPPMSTEPSIZE}[3{:}0]}{64 \times \text{TXOUT\_DIV}} \times \frac{T}{\text{PARALLEL\_DATA\_WIDTH}} \tag{13}$$

where TXPIPPMSTEPSIZE[3:0] is the input control parameter, TXOUT\_DIV is a fixed value determined during the SerDes IP core instantiation [20], and $T$ is the reference clock period. In our configuration, $T$ equals $T_{\text{period}}$ = 6.4 ns and the parallel data width is 20; therefore, the minimum phase-adjustment step can be set as low as 1.25 ps, which allows the phase shift to be compensated very precisely.
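Equation (13) and the rounding of a desired phase shift to whole PI steps are sketched below. $T_{\text{period}}$ = 6.4 ns and the 20-bit parallel width come from the text; TXOUT\_DIV = 4 with TXPIPPMSTEPSIZE = 1 is an assumed setting, not stated in the article, chosen because it reproduces the 1.25-ps step. The function names are hypothetical and do not correspond to any Xilinx API.

```python
def pi_step_size_ps(txpippmstepsize: int, txout_div: int,
                    t_ref_ps: float, parallel_width: int) -> float:
    """Tx PI phase step per eq. (13), in ps."""
    if not 0 <= txpippmstepsize <= 15:
        raise ValueError("TXPIPPMSTEPSIZE[3:0] is a 4-bit value")
    return txpippmstepsize / (64 * txout_div) * t_ref_ps / parallel_width


def pi_steps_for(shift_ps: float, step_ps: float) -> int:
    """Nearest whole number of PI steps for a desired shift (residual <= step/2)."""
    return round(shift_ps / step_ps)


# TXOUT_DIV = 4 is an assumption; T = 6.4 ns and a 20-bit word are from the text
step = pi_step_size_ps(1, 4, 6400.0, 20)
print(step)                      # 1.25
print(pi_steps_for(37.0, step))  # 30 steps, i.e., a 37.5-ps shift
```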

### B. Determination of Transceiver Delay

The second prerequisite of the scheme is the stability of the clock distribution; namely, the latency of every part of the transmission path must stay fixed under various conditions. Since the SerDes transceivers of FPGAs are typically designed for applications that can tolerate latency variations, further techniques are needed besides bypassing the internal FIFO of the transceivers [23] to keep the transceiver latency unchanged, in particular across multiple system restarts.

On both boards, the phase lock module shown in Fig. 1 is specifically designed to fix the phase difference between the main clock and the transmitted bit clock in the Tx path. Inside the Tx, as shown in Fig. 4, the main clock serves as the reference clock of the PLL, which generates the bit clock that passes through the Tx PI and the D-divider. The bit clock frequency is further divided by N to form the word clock driving the PISO register. Since the phase difference between the main clock and the transmitted bit clock may be uncertain when the transceiver is initialized, the extra phase lock module monitors it, and the Tx PI compensates it to a fixed default value. The realization of the phase lock module is straightforward, as shown in Fig. 4.

If the phase difference measured with the two TDC channels does not match the preset value before the synchronization process, the phase lock module controls the Tx PI to compensate for the phase offset. Compared with the method in [1], which uses a FIFO as an early-late phase detector, our TDC scheme is more flexible and reliable without losing accuracy. For example, we found that when multiple SerDes channels in one FPGA are used for clock synchronization, shifting the Tx PI of one channel may cause phase variations in other channels. Therefore, fixing the phase difference only at start-up is not enough; our TDC scheme monitors and compensates it on-line.
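The phase lock behavior can be sketched as a simple bang-bang loop. `read_phase_ps` and `shift_pi` are hypothetical callbacks standing in for the TDC readout and the Tx PI control; the toy plant below assumes that one PI step moves the measured phase by exactly 1.25 ps and is noise-free, whereas real firmware would average TDC readings before each decision.

```python
from typing import Callable

T_PERIOD_PS = 6400.0
PI_STEP_PS = 1.25


def wrap(x: float, t: float = T_PERIOD_PS) -> float:
    """Map a phase error to [-t/2, t/2)."""
    return (x + t / 2) % t - t / 2


def lock_phase(read_phase_ps: Callable[[], float],
               shift_pi: Callable[[int], None],
               preset_ps: float,
               step_ps: float = PI_STEP_PS,
               max_steps: int = 10_000) -> bool:
    """Step the Tx PI until the measured phase is within step_ps/2 of the preset.

    read_phase_ps: returns the (averaged) TDC phase between main clock and Tx clock
    shift_pi:      applies n PI steps; assumed to move the phase by n * step_ps
    """
    for _ in range(max_steps):
        err = wrap(read_phase_ps() - preset_ps)
        if abs(err) <= step_ps / 2:
            return True
        shift_pi(-1 if err > 0 else 1)  # one step toward the preset
    return False


class ToyPlant:
    """Stand-in for the transceiver: the phase moves by n * PI_STEP_PS per shift."""

    def __init__(self, phase_ps: float) -> None:
        self.phase_ps = phase_ps

    def read(self) -> float:
        return self.phase_ps

    def shift(self, n: int) -> None:
        self.phase_ps += n * PI_STEP_PS


plant = ToyPlant(123.4)  # hypothetical start-up phase
ok = lock_phase(plant.read, plant.shift, preset_ps=50.0)
print(ok, round(plant.phase_ps, 2))  # True 49.65
```

Because the loop re-reads the phase before every step, calling it periodically also covers the on-line case described above, where a shift in one channel disturbs another.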

In the Rx path, the phase of the recovered clock also needs to stay constant through system restart and initialization. When the Rx is receiving data, the clock and data recovery (CDR) module recovers the bit clock from the received high-speed serial data stream and converts it to a word clock while the serial data are converted into parallel data. In our case, the serial bit clock frequency is 3.125 GHz, which is 20 times the 156.25-MHz word clock frequency, so the word clock can take any of 20 different phases at random. To ensure that the recovered word clock has a fixed phase without unit interval (UI) jumps after each system restart, a phase alignment logic module such as the one introduced in [19] must be designed inside the FPGA. In addition, the CDR module contains an equalizer that compensates for transmission channel losses. While the decision feedback equalizer (DFE) is used in the Rx by default, the equalizer includes adaptive blocks controlled by slow feedback loops. The steady state of these loops can differ slightly between startups, which affects the phase of the recovered clock. Therefore, we configure the equalizer in low-power mode (LPM), which uses a continuous-time linear equalizer, to effectively reduce the sub-UI phase uncertainty [1].

Fig. 4. Phase lock module and Tx architecture.
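The 20-fold word-clock ambiguity can be illustrated with a toy model. The sketch assumes one common strategy, re-initializing the Rx until a phase measurement places the recovered word clock in a chosen UI bin; the phase alignment logic of [19] may work differently, and `ToyRx`, `align_rx`, and the noise figures are hypothetical.

```python
import random

BIT_RATE_GBPS = 3.125
WORD_CLK_MHZ = 156.25
UI_PS = 1e3 / BIT_RATE_GBPS                           # 320 ps per bit
N_PHASES = round(BIT_RATE_GBPS * 1e3 / WORD_CLK_MHZ)  # 20 candidate word-clock phases


def ui_bin(phase_ps: float) -> int:
    """UI bin (0 .. N_PHASES-1) of a measured recovered-word-clock phase."""
    return round(phase_ps / UI_PS) % N_PHASES


def align_rx(reset_rx, measure_phase_ps, target_bin: int, max_tries: int = 1000) -> int:
    """Re-initialize the Rx until the word clock lands in target_bin; return resets used."""
    for tries in range(max_tries):
        if ui_bin(measure_phase_ps()) == target_bin:
            return tries
        reset_rx()
    raise RuntimeError("word clock never reached the target UI bin")


class ToyRx:
    """Hypothetical Rx model: each reset picks a random bit position plus small noise."""

    def __init__(self, seed: int) -> None:
        self.rng = random.Random(seed)
        self.phase_ps = 0.0
        self.reset()

    def reset(self) -> None:
        self.phase_ps = self.rng.randrange(N_PHASES) * UI_PS + self.rng.gauss(0.0, 5.0)


rx = ToyRx(seed=7)
resets = align_rx(rx.reset, lambda: rx.phase_ps, target_bin=3)
print(f"aligned after {resets} resets, bin {ui_bin(rx.phase_ps)}")
```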

## IV. TEST AND RESULTS

To evaluate the performance of the proposed clock distribution and synchronization scheme, three identical electronics boards, each with the composition and functions described in Section II-A, are built and connected with fibers via optical circulators, as shown in Fig. 5. The three boards are arranged in a three-level cascade. The board at the first level generates the system clock. The board at the second level receives the system clock and sends its on-board main clock to the board at the third level. Depending on the test, the boards are connected with 3-m, 1-km, or 5-km fibers.

### A. Clock Jitter

The three boards are connected with two rolls of 5-km fiber via four optical circulators. The system clock is generated by the first board and distributed to the other boards in a cascaded manner. Using the jitter and eye diagram analysis tools of the oscilloscope (Tektronix MSO 72004C), the time interval error (TIE) jitter of the clocks on the first, second, and third boards is measured as 4.2, 3.3, and 3.8 ps, respectively. Thanks to the Si5344 jitter cleaner, the on-board main clock adds no extra jitter, and all the clocks show equivalent jitter performance.

Fig. 5. Photograph of the prototype setup and illustration of connection.

Fig. 6. Histogram of skews between different clocks.

### B. Synchronization Accuracy

With the aid of the 3-m calibration described in Section II-C, the clock distribution system with 5-km fibers is synchronized. When any two of the three clock levels are fed to the oscilloscope, their skew is measured as a histogram; a typical example is shown in Fig. 6. The mean value of 5000 samples is the skew between the two clocks, and the rms value is their relative TIE jitter. The measured relative jitter is consistent with the clock jitter results in Section IV-A. The measured clock skew is taken as the clock synchronization accuracy.
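The histogram statistics, and what consistency with Section IV-A would mean numerically, can be sketched as follows. If the jitter of the two clocks were uncorrelated, the rms of their skew would be the quadrature sum of the individual TIE jitters; jitter that the downstream board tracks (positively correlated) makes the relative value smaller. This uncorrelated-jitter assumption is ours, for illustration only.

```python
import math
import statistics


def skew_stats(skews_ps):
    """Return (mean skew, rms spread) of oscilloscope skew samples, in ps."""
    return statistics.fmean(skews_ps), statistics.pstdev(skews_ps)


def expected_relative_jitter(j1_ps: float, j2_ps: float) -> float:
    """rms jitter of the skew between two clocks whose jitter is uncorrelated."""
    return math.hypot(j1_ps, j2_ps)


# Board-level TIE jitter from Section IV-A (first and second boards)
print(round(expected_relative_jitter(4.2, 3.3), 2))  # 5.34
```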

To observe the stability of the synchronization accuracy, we first measure the clock skew between the first and second boards across multiple system power-on or reset operations. With the two boards connected by a 3-m fiber and by a 5-km fiber, the measured peak-to-peak clock skew variations are 12.4 and 14.1 ps, respectively, as shown in Fig. 7. These results demonstrate that the clock synchronization accuracy stays within 15 ps across multiple system initializations, regardless of the length of the connecting fiber.

The temperature dependence of the synchronization accuracy is evaluated with two boards connected by a 1-km fiber. The fiber is placed in a climate chamber, and the temperature is varied from 30 °C to 70 °C. Repeating the clock synchronization process at different temperatures, the clock skews measured by the oscilloscope vary within a range of 7.8 ps, as shown in Fig. 8. For reference, results of two other methods are listed here. The standard WR system with a 5-km G.652 fiber connection has been measured over the temperature range from 12.5 °C to 85 °C with a phase variation of about 200 ps [24]. Another measurement of temperature dependence is that of the TClink for the HL-LHC experiments [25], which uses two separate 100-m fibers for the two directions with one wavelength to ensure symmetry. With the fibers in a climate chamber over the range from 30 °C to 60 °C, their published plot indicates a phase variation of about 15 ps.

Fig. 7. Skew measurement after multiple restarts with (a) 3-m and (b) 5-km fiber.

Fig. 8. Skew measurement under different temperatures with a 1-km fiber.

### C. Long-Term Stability

The long-term stability of the synchronization accuracy is evaluated over 12 h of continuous operation in a laboratory environment. As shown in Fig. 9, the master board is connected to two slave boards via 5-km optical fibers, and the main clock is distributed to the two slave boards through two SerDes channels. Each slave board contains one FPGA TDC channel [16]. The recovered main clocks on the two slave boards clock the TDCs through the Si5344 frequency multiplier. The two separate hit signals feeding the two TDCs are generated by the two channels of an arbitrary waveform generator (Tektronix AWG5012C), whose phase difference is adjustable with a precision of about 1 ps. The time difference measured by the two TDCs is the time interval between the hit signals. With this interval held constant, any variation of the measured time difference represents the phase difference between the two TDC driving clocks. During the test, from 10 PM to 10 AM the next day, the time difference was read out every 40 s, and the ambient temperature was monitored to vary within 3 °C. The measured time differences range from −6.4 to 5.8 ps, as shown in Fig. 10, a variation range of 12.2 ps.

Fig. 9. Diagram of the long-term stability test.

Fig. 10. Time difference of the two TDCs during the long-term stability test.
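The reduction of the long-term data can be sketched as below: subtracting the fixed hit interval from each reading leaves the clock phase drift, and its range is the reported variation. The 10-ns hit interval and the three readings are hypothetical, chosen so that the extremes match the −6.4 and 5.8 ps reported above.

```python
def clock_drift_ps(measured_diffs_ps, hit_interval_ps):
    """Clock phase drift: measured TDC time difference minus the fixed hit interval."""
    return [d - hit_interval_ps for d in measured_diffs_ps]


def peak_to_peak(values):
    """Variation range (max - min) of a series."""
    return max(values) - min(values)


# Hypothetical readings around a 10-ns hit interval
drift = clock_drift_ps([9993.6, 10005.8, 10001.0], 10_000.0)
print(round(peak_to_peak(drift), 1))  # 12.2
print(12 * 3600 // 40)                # 1080 read-out intervals in 12 h at 40 s
```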

## V. CONCLUSION

Compared with using a separate dedicated network for clock distribution, integrating the clock distribution into the fiber-based serial data link has a number of advantages. The technical challenges of realizing such a clock distribution lie in two aspects: the stability of the distribution and a synchronization scheme that ensures high synchronization accuracy. Although state-of-the-art SerDes transceivers in Xilinx FPGAs are not specifically designed for clock distribution, this article has explored a number of techniques that guarantee stable and consistent latency in every part of the transmission link, in particular after multiple system restarts. The proposed optical-circulator-based fiber link effectively enhances the symmetry of the transmission link, which is the basis for a packet-based synchronization protocol to achieve high synchronization accuracy. In a prototype implementation and performance evaluation, the distributed clock has shown a synchronization accuracy better than 15 ps, which can meet the requirements of most current large-scale physics experiments.

Since modern mainstream FPGAs from both Xilinx and Intel provide basically the same predefined logic resources in similar architectures, the key modules described in this article should also be implementable on Intel FPGAs. In addition, it is preferable to choose an FPGA with a built-in PI block, so that the clock offset can be compensated without external devices. Adding optical circulators to the fiber connection also brings no major disadvantages to the system, even one with a large number of distributed nodes, because the optical circulator is a compact (typically a cylinder 0.5 cm in diameter and 5 cm long), low-insertion-loss (around 1 dB), and low-cost (about \$100 each) passive device. Moreover, the device can be designed to be compatible with single-mode or multimode fiber at different wavelengths. Therefore, the proposed scheme is a practical solution for implementing high-performance, large-scale clock distribution.
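The claim that circulators cost little in optical power can be checked with a simple link budget. This is a back-of-the-envelope sketch, not a figure from the article: it assumes one circulator at each end of a 5-km single-mode link, takes the roughly 1-dB insertion loss quoted above, and uses hypothetical values for the launch power and receiver sensitivity, plus an assumed fiber attenuation of 0.35 dB/km (a typical figure near 1310 nm).

```python
def received_power_dbm(tx_dbm, fiber_km, fiber_db_per_km, n_circulators, circ_loss_db):
    """Optical power at the receiver after fiber attenuation and circulator losses."""
    return tx_dbm - fiber_km * fiber_db_per_km - n_circulators * circ_loss_db


def margin_db(rx_dbm, sensitivity_dbm):
    """Power margin left above the receiver sensitivity (positive means the link closes)."""
    return rx_dbm - sensitivity_dbm


if __name__ == "__main__":
    TX_DBM = -1.0            # hypothetical transmitter launch power
    SENSITIVITY_DBM = -14.0  # hypothetical receiver sensitivity
    FIBER_DB_PER_KM = 0.35   # assumed single-mode attenuation near 1310 nm
    for n_circ in (0, 2):    # no circulators vs. one at each end (assumption)
        rx = received_power_dbm(TX_DBM, 5.0, FIBER_DB_PER_KM, n_circ, 1.0)
        print(n_circ, round(rx, 2), round(margin_db(rx, SENSITIVITY_DBM), 2))
```

Under these assumptions the two circulators consume 2 dB of a margin of about 11 dB, which is consistent with the statement that they bring no major disadvantage; a real design should use the values from the chosen transceiver's data sheet.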

#### REFERENCES

- [1] E. Mendes, S. Baron, C. Soos, J. Troska, and P. Novellini, "Achieving picosecond-level phase stability in timing distribution systems with Xilinx UltraScale transceivers," *IEEE Trans. Nucl. Sci.*, vol. 67, no. 3, pp. 473–481, Mar. 2020.
- [2] F. Lemke, D. Slogsnat, N. Burkhardt, and U. Bruening, "A unified DAQ interconnection network with precise time synchronization," *IEEE Trans. Nucl. Sci.*, vol. 57, no. 2, pp. 412–418, Apr. 2010.
- [3] L. Zhao *et al.*, "Precise clock synchronization in the readout electronics of WCDA in LHAASO," *IEEE Trans. Nucl. Sci.*, vol. 62, no. 6, pp. 3249–3255, Dec. 2015.
- [4] R. J. Aliaga, J. M. Monzo, M. Spaggiari, N. Ferrando, R. Gadea, and R. J. Colom, "PET system synchronization and timing resolution using high-speed data links," *IEEE Trans. Nucl. Sci.*, vol. 58, no. 4, pp. 1596–1605, Aug. 2011.
- [5] I. Papakonstantinou *et al.*, "A fully bidirectional optical network with latency monitoring capability for the distribution of timing-trigger and control signals in high-energy physics experiments," *IEEE Trans. Nucl. Sci.*, vol. 58, no. 4, pp. 1628–1640, Aug. 2011.
- [6] R. Giordano and A. Aloisio, "Fixed-latency, multi-gigabit serial links with Xilinx FPGAs," *IEEE Trans. Nucl. Sci.*, vol. 58, no. 1, pp. 194–201, Feb. 2011.
- [7] D. L. Mills, "Network time protocol version 4 reference and implementation guide," Univ. Delaware, NTP Working Group, Newark, DE, USA, Tech. Rep. 06-6-1, 2006.
- [8] IEEE Standard for a Precision Clock Synchronization Protocol for Networked Measurement and Control Systems, Standard 1588-2008, 2008.
- [9] J. Serrano *et al.*, "The White Rabbit project," in *Proc. IBIC*, Oxford, U.K.: THB, vol. 2, Sep. 2013, pp. 936–942.
- [10] M. Lipinski *et al.*, "White Rabbit applications and enhancements," in *Proc. IEEE Int. Symp. Precis. Clock Synchronization Meas., Control, Commun. (ISPCS)*, Sep. 2018, pp. 1–7.
- [11] M. Lipinski, T. Wlostowski, J. Serrano, and P. Alvarez, "White Rabbit: A PTP application for robust sub-nanosecond synchronization," in *Proc. IEEE Int. Symp. Precis. Clock Synchronization Meas., Control Commun.*, Sep. 2011, pp. 25–30.
- [12] P. Moreira, J. Serrano, T. Wlostowski, P. Loschmidt, and G. Gaderer, "White Rabbit: Sub-nanosecond timing distribution over Ethernet," in *Proc. Int. Symp. Precis. Clock Synchronization Meas., Control Commun.*, Oct. 2009, pp. 1–5.
- [13] X. Y. Xu Yuan and B. W. Bo Wang, "Using single wavelength light to improve the synchronization accuracy of the White Rabbit system," *Chin. Opt. Lett.*, vol. 15, no. 10, 2017, Art. no. 101202.
- [14] Z. Li *et al.*, "Test and simulation of a Cherenkov picosecond timing counter," *Nucl. Instrum. Methods Phys. Res. A, Accel. Spectrom. Detect. Assoc. Equip.*, vol. 966, Jun. 2020, Art. no. 163838.
- [15] C. Allaire, "A high-granularity timing detector in ATLAS: Performance at the HL-LHC," *Nucl. Instrum. Methods Phys. Res. A, Accel. Spectrom. Detect. Assoc. Equip.*, vol. 924, pp. 355–359, Apr. 2019.
- [16] Y. Wang, J. Kuang, C. Liu, and Q. Cao, "A 3.9-ps RMS precision time-to-digital converter using ones-counter encoding scheme in a Kintex-7 FPGA," *IEEE Trans. Nucl. Sci.*, vol. 64, no. 10, pp. 2713–2718, Oct. 2017.
- [17] Y. Wang, J. Kuang, C. Liu, Q. Cao, and D. Li, "A flexible 32-channel time-to-digital converter implemented in a Xilinx Zynq-7000 field programmable gate array," *Nucl. Instrum. Methods Phys. Res. A, Accel. Spectrom. Detect. Assoc. Equip.*, vol. 847, pp. 61–66, Mar. 2017.
- [18] J. Kuang and Y. Wang, "A 5.5 ps time-interval RMS precision time-to-digital convertor implemented in Intel Arria 10 FPGA," 2018, arXiv:1805.12390. [Online]. Available: http://arxiv.org/abs/1805.12390
- [19] Y. Hu, Y. Wang, L. Wang, Q. Cao, and J. Kuang, "Performance evaluation of time distribution over SerDes-based interconnections for PET system," in *Proc. IEEE Nucl. Sci. Symp. Med. Imag. Conf. (NSS/MIC)*, Nov. 2018, pp. 1–2.
- [20] (2018). UltraScale Architecture GTH Transceivers User Guide, Xilinx Inc. Accessed: Dec. 31, 2019. [Online]. Available: https://www.xilinx.com/support/documentation/user\_guides/ug576-ultrascale-gth-transceivers.pdf
- [21] (2018). Si570/Si571 10 MHz To 1.4 GHz I2C Programmable XO/VCXO. Accessed: Dec. 31, 2019. [Online]. Available: https://www.silabs.com/documents/public/data-sheets/si570.pdf
- [22] (2018). Si5345/44/42 Rev D Data Sheet. Accessed: Dec. 31, 2019. [Online]. Available: https://www.silabs.com/documents/public/data-sheets/Si5345-44-42-D-DataSheet.pdf
- [23] E. Mendes, S. Baron, D. Kolotouros, C. Soos, and F. Vasey, "The 10G TTC-PON: Challenges, solutions and performance," *J. Inst.*, vol. 12, no. 2, p. 2041, Feb. 2017.
- [24] G. Daniluk, "White Rabbit PTP core, the sub-nanosecond time synchronization over Ethernet," M.S. thesis, Dept. Electron. Inf. Technol., Warsaw Univ. Technol., Warsaw, Poland, 2012. [Online]. Available: http://www.ohwr.org/attachments/1368/GD\_mgr.pdf
- [25] E. B. De Souza Mendes, S. Baron, and M. Taylor, "TCLink: A timing compensated high-speed optical link for the HL-LHC experiments," in *Proc. Topical Workshop Electron. Part. Phys. PoS(TWEPP)*, Mar. 2020, p. 57.