Packet loss when streaming adjacent slots #371
-
Hi all, I'm having a packet drop issue at site and thought I'd post here to try to get more eyes on it. Our SMuRF is configured with slots 2, 3, and 5. Communication between the SMuRF and the slots seems good, but I'm seeing high data drop rates when trying to run all slots at once. In particular, I'm seeing:

It also seems like the fraction of dropped packets is independent of flux-ramp rate, which to me rules out software and points to something wrong with the fiber connections. I've tried:

I'm starting to run out of ideas, so I wanted to know if anyone (@yuhanwyhan, @swh76, @msilvafe) has seen anything like this or has any more ideas of things to check.
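For anyone wanting to repeat the comparison across flux-ramp rates, here is a minimal sketch of how a dropped-packet fraction can be estimated from a stream of frame counters. It assumes each received frame carries a counter that increments by one per transmitted frame, wraps at a known modulus, and arrives in order. The function name `dropped_fraction` and the 32-bit default are illustrative assumptions, not part of the SMuRF software.

```python
from typing import Iterable


def dropped_fraction(counters: Iterable[int], modulus: int = 2**32) -> float:
    """Estimate the fraction of frames lost, given the counters of received frames.

    Assumes (hypothetically) that the counter increments by one per sent frame,
    wraps at `modulus`, and that frames arrive in order.
    """
    received = 0
    missing = 0
    prev = None
    for c in counters:
        received += 1
        if prev is not None:
            # Distance from the previous counter, tolerant of wraparound.
            step = (c - prev) % modulus
            # A step of 1 means nothing was lost; a step of n means n - 1 frames
            # are missing. A repeated counter (step 0) is not counted as a loss.
            missing += max(step - 1, 0)
        prev = c
    total = received + missing
    return missing / total if total else 0.0
```

Running this separately on the counters recorded at each flux-ramp rate gives one number per rate. If those numbers agree within the scatter, that supports the observation above that the drop fraction does not depend on the rate.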
Replies: 5 comments 4 replies
-
Is it just 2,3 or do 3,4 and 4,5 show the same behavior? Are you sure that the SFPs are installed in the correct SFP cages on the Vadatech switch and that they're all installed properly? I would suspect the interface on that side, given that you cleaned and swapped out all of this different hardware.
-
Ok, I swapped to a brand new QSFP and am still seeing the same issues, but somehow not as symmetric as before? This might just be randomness in the packet drops, but before, slots 2 and 3 were dropping about the same number of packets, and now 2 is dropping much more than 3. I think it might be an issue with the PCIe port... For now, I'm going to run with slots [2, 5, 7] and see if that avoids interference. Hopefully this is just an issue with the PCIe and we don't see the same type of thing with smurf-srv20.
-
Alright, I didn't know what else I could swap out, so I'm thinking this might be an issue with the PCIe port. Running with slots 2, 5, and 7, we are under spec for the packet drop rates:
-
Working with Larry and Ryan (SLAC engineers), we think we have found and solved this problem. It was due to a bug in the PCIe card firmware that fortunately has been fixed since we froze the SMuRF PCIe firmware version. Larry was able to build us a new version, https://github.com/slaclab/smurf-pcie/releases/tag/v3.0.1, with the fix, and at least on SAT1-Crate1 with four carriers we are seeing no drops streaming all four slots simultaneously, 2000 channels at 10 kHz. This is a big fix!

When prompted, enter the number corresponding to the d89c87b mcs file. The script will then program the PCIe card FPGA and reboot the server. You should see this:

It says "please reboot the computer" but will actually reboot the computer for you. Once the server is back up, ssh back in to confirm the firmware has been properly loaded like this:

You should confirm that the driver version is v5.7.0 and the git hash is d89c87b6b29ddc5dab8a3be374e749725d598399 (the new firmware version).

I've already upgraded the two SAT1 servers (smrfsrv20-satp1 and smrfso3-satp1) and the LAT server (smrfso6-lat).

Caveat emptor: we should watch for any issues with this new version. So far we have only checked whether or not it drops frames.
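If several servers need checking, the confirmation step above can be scripted. The sketch below only searches a block of captured text for the expected git hash and driver version. The thread does not show the actual output format of the version query, so the function name `firmware_ok`, the plain substring matching, and the sample text in the usage note are all assumptions for illustration.

```python
# Values quoted in the post above.
EXPECTED_GIT_HASH = "d89c87b6b29ddc5dab8a3be374e749725d598399"
EXPECTED_DRIVER = "v5.7.0"


def firmware_ok(report: str) -> bool:
    """Return True if both the expected git hash and driver version appear in `report`.

    `report` is whatever text the version query printed. The real output format
    is not shown in this thread, so this uses a case-insensitive substring search
    rather than parsing specific fields.
    """
    text = report.lower()
    return EXPECTED_GIT_HASH in text and EXPECTED_DRIVER.lower() in text
```

For example, `firmware_ok(captured_text)` could be called on the saved terminal output from each server. A `False` result only means one of the two strings was not found verbatim, so it is worth reading the output by eye before concluding the upgrade failed.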
-
Another new feature of the SMuRF software stack: while tracking down this issue, SLAC added a new ability to poll and inspect registers in the PCIe card firmware space. This capability is present starting with the new v2.1.1 version of the smurf-pcie-docker. Here's how to pull that down to your SMuRF server:

Then you can start an EPICS server to serve up registers in the PCIe card like this:

Leave the docker running, and you will be able to poll PCIe registers anywhere you can access EPICS (e.g. in the SMuRF client dockers, utils docker, etc.):

This particular register, for instance, records a certain kind of packet corruption that was useful to monitor to track down the issue in this ticket.
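To watch a register like this over time rather than reading it once, a small polling loop is enough. The sketch below is deliberately independent of EPICS: it takes any zero-argument function that returns the current counter value and reports how much the counter grew between reads. The name `watch_counter` and its parameters are hypothetical. With pyepics installed, the reader could be something like `lambda: epics.caget(pv_name)`, where `pv_name` is the register's PV name (not reproduced in this thread).

```python
import time
from typing import Callable, List


def watch_counter(read: Callable[[], int], samples: int = 5,
                  period_s: float = 1.0) -> List[int]:
    """Poll a counter `samples` times and return the increment between successive reads.

    `read` is any zero-argument callable returning the current counter value,
    for example a wrapper around an EPICS read of a PCIe register.
    """
    deltas = []
    last = read()
    for _ in range(samples):
        time.sleep(period_s)
        now = read()
        deltas.append(now - last)  # non-zero means new events since the last read
        last = now
    return deltas
```

A list of all zeros while streaming would indicate that no new events of that kind were recorded during the polling window. Note that an EPICS read can return `None` on timeout, so a real reader should handle that case before passing the value on.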
Larry reports that on his test bench at SLAC he was able to stream at 100 kHz with no drops. Here are instructions for how to upgrade the firmware (we are still using the same kernel driver version, v5.7.0):
S…