-
Notifications
You must be signed in to change notification settings - Fork 199
shelldriver: fixes and improvements #945
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
shelldriver: fixes and improvements #945
Conversation
702d088
to
62a15ff
Compare
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #945 +/- ##
========================================
- Coverage 62.7% 62.6% -0.2%
========================================
Files 159 159
Lines 11658 11687 +29
========================================
+ Hits 7319 7321 +2
- Misses 4339 4366 +27 ☔ View full report in Codecov by Sentry. |
f3f4366
to
32c250b
Compare
There was `post_login_settle_time` variable added in commit 0758753 ("shelldriver: Add post-login "settle" time") in order to improve prompt detection on a chatty consoles. For some reason, that feature is currently only available on systems with user logins. Chatty consoles might pose problems on all systems, so let's make this feature available to everyone. For example in my current case, I'm testing OpenWrt system, which has passwordless root login by default and labgrid is not able to inject run() wrapper reliably (modified for brevity): DEBUG: Write 60 bytes: b'run() { echo -n "$MARKER"; sh -c "$@"; echo "$MARKER $?"; }\n' (with 0.010000s txdelay) DEBUG: Read 404 bytes: b'6.093178] kmodloader: done loading kernel modules from /etc/modules.d/*\r\n){-$AE"; sh -c "$@" ; echo "$MARKER $?"; ', timeout 30.00, requested size 1 DEBUG: Read 25 bytes: b'}\r\n/bin/ash: syntax error', timeout 30.00, requested size 1 With post_login_settle_time=10 it works much better. Cc: Joshua Watt <[email protected]> Signed-off-by: Petr Štetiar <[email protected]>
I've noticed increased test failures on my OpenWrt based systems, which have passwordless root login by default due to more chatty kernel console, which interleaves command outputs. I've found out, that there is no `dmesg -n 1` command run anymore and tracked it down to the changes done in commit 4015426 ("shelldriver: Silence kernel earlier"). This change wrongly assumes, that `dmesg -n 1` should be run only on consoles which have login enabled. So fix it, by removing that wrong assumption and thus run kernel silencer on all consoles. Cc: Joshua Watt <[email protected]> Fixes: 4015426 ("shelldriver: Silence kernel earlier") Signed-off-by: Petr Štetiar <[email protected]>
Current fire and forget approach doesn't work reliably in all corner cases for various reasons, like for example serial console bitflip or lost byte issues due to RX fifo overruns. Those issues mostly happen frequently during early boot process, where there is a lot of output going on. So let's make it more reliable by actually checking the kernel log level setting via /proc/sys/kernel/printk and retry if it's not set to desired log level. Following example is from GL.iNet GL-B1300 device: root@(none):/# mg- /bin/ash: mg-: not found root@(none):/# coRC;ct/rcsy/kerne/rnk;eh QZE[ 6.365640] kmodloader: loading kernel modules from /etc/modules.d/* ... dmesg -n 1 /bin/ash: coRC: not found /bin/ash: ct/rcsy/kerne/rnk: not found /bin/ash: eh: not found root@(none):/# echo BZXI; cat /proc/sys/kernel/printk ; echo AUTHSA BZXI 7 4 1 7 AUTHSA root@(none):/# dmesg -n 1 root@(none):/# echo XYTH; cat /proc/sys/kernel/printk ; echo PNCWGM XYTH 1 4 1 7 PNCWGM Signed-off-by: Petr Štetiar <[email protected]>
Current fire and forget approach doesn't work reliably in all corner cases for various reasons, like for example serial console bitflip or lost byte issues due to RX fifo overruns. Those issues mostly happen frequently during early boot process, where there is a lot of output going on. So let's make it more reliable by actually checking, if the run wrapper was injected into the target system correctly. Signed-off-by: Petr Štetiar <[email protected]>
Variable which is used by the driver is self._status. Signed-off-by: Petr Štetiar <[email protected]>
32c250b
to
65d0269
Compare
Tested this with a Turris Omnia and it fixes the ShellDriver issues as described. |
rf"{marker[:4]}(.*){marker[4:]}\s+{self.prompt}", | ||
timeout=timeout | ||
) | ||
data = self.re_vt100.sub('', match.group(1).decode(codec, decodeerrors)).split('\r\n') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
data = self.re_vt100.sub('', match.group(1).decode(codec, decodeerrors)).split('\r\n') | |
data = re_vt100.sub('', match.group(1).decode(codec, decodeerrors)).split('\r\n') |
This seems to have changed in 6743c69
While this fixes "real world" devices, it break the QEMU driver for me, muting all shell output. I'll investigate further and give more details once I got them. |
self.status = 1 | ||
self._status = 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's already fixed with 8bc7858.
def _inject_run(self): | ||
self.console.sendline( | ||
'''run() { echo -n "$MARKER"; sh -c "$@"; echo "$MARKER $?"; }''' | ||
) | ||
self.console.expect(self.prompt) | ||
def _inject_run(self, inject_timeout=10.0): | ||
"""Tries to inject run() command wrapper into shell on the console.""" | ||
timeout = Timeout(inject_timeout) | ||
|
||
while True: | ||
self.console.sendline( | ||
'''run() { echo -n "$MARKER"; sh -c "$@"; echo "$MARKER $?"; }''' | ||
) | ||
self.console.expect(self.prompt) | ||
try: | ||
data = self._raw_run("type run") | ||
if data[2] and "function" in data[2]: | ||
return | ||
except TIMEOUT: | ||
pass | ||
|
||
if timeout.expired: | ||
raise TIMEOUT(f"Timeout of {inject_timeout} seconds exceeded during waiting for run injection") # pylint: disable=line-too-long |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If your serial console does not work reliably (even after specifying a certain settle time), I don't think retrying is the correct approach. Even if this works after the nth try, too many other things will fail in unexpected ways later. Better make your serial console work reliably or prefer the SSHDriver in these situations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If your serial console does not work reliably (even after specifying a certain settle time)
Ok, fair enough, but my goal is to make it generic and reusable across different use cases, instead of fiddling with variable settle time, fighting crazy applications logging into /dev/console
directly etc.
If your serial console does not work reliably (even after specifying a certain settle time), I don't think retrying is the correct approach.
Well, this approach works reliably for over than 2 years, with a mix of dozen different devices, bootloaders, kernels/drivers and operating systems. So the idea here is to get as fast (and reliably) as possible into the SSH and profit.
SSHDriver in these situations.
Indeed, its the only reliable solution if there is no dedicated serial console available.
def _raw_run(self, cmd, timeout=2, codec="utf-8", decodeerrors="strict"): | ||
"""Runs command without run() wrapper as it's not yet available""" | ||
marker = gen_marker() | ||
self.console.sendline(f"echo {marker[:4]}; {cmd} ; echo {marker[4:]}") | ||
_, _, match, _ = self.console.expect( | ||
rf"{marker[:4]}(.*){marker[4:]}\s+{self.prompt}", | ||
timeout=timeout | ||
) | ||
data = self.re_vt100.sub('', match.group(1).decode(codec, decodeerrors)).split('\r\n') | ||
if data and not data[-1]: | ||
del data[-1] | ||
return data | ||
|
||
def _silence_kernel(self, timeout, kernel_log_level=1, codec="utf-8", decodeerrors="strict"): | ||
"""Tries to set kernel log level to KERN_ALERT log level to make it less chatty.""" | ||
|
||
while True: | ||
self.console.sendline(f"dmesg -n {kernel_log_level}") | ||
try: | ||
data = self._raw_run("cat /proc/sys/kernel/printk") | ||
if data[2] and data[2].startswith(f'{kernel_log_level}\t'): | ||
return | ||
|
||
except TIMEOUT: | ||
pass | ||
|
||
if timeout.expired: | ||
raise TIMEOUT(f"Timeout of {self.login_timeout} seconds exceeded during waiting for kernel silencer") # pylint: disable=line-too-long | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure we want more retry code for certain devices behaving strangely. Shouldn't these devices wait until input can be handled reliably before starting login/shell?
From the commit message:
Those issues mostly happen frequently during early boot process, where there is a lot of output
going on.
Could await_login_timeout
help here instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure we want more retry code for certain devices behaving strangely.
Ok, fair enough.
Shouldn't these devices wait until input can be handled reliably before starting login/shell?
Sure, if you have complete control over the build/settings of bootloader, kernel, drivers and operating system, then this is certainly doable.
For example, in OpenWrt project I'm testing 3 releases, this means 3-4 different kernel versions, 5.10, 5.15, 6.1 and 6.6. I've already learned during the years, that each of this kernel versions changed the behavior of the device, driver probing/loading and thus influence the serial console behavior. I don't want to handle this differences manually, thus I'm silencing the kernel ASAP, then continue bringing the device into SSH over LAN network and profit.
On some slow devices there might be 10s silence on the console and then ath11k is being probed and if you're lucky enough, exactly at the point when you're trying to detect LAN network availability of the device. In some builds, there are strange 3rd party applications which are not afraid to output debug logging into /dev/console
directly etc.
Could await_login_timeout help here instead?
Its quite some time, but I guess, that I've tried all available features/config options, indeed. I'll rebase, retest and see how it goes today.
@Bastian-Krause thanks a lot for a review, really appreciate it! I'll rebase and retest soon. |
Hi @ynezz Since I faced the same issue and gave this another try. I found that the simplest fix was to drop the Below is my working example which did not require any code modifications: targets:
main:
features:
- wifi
- online
resources:
RemotePlace:
name: !template '$LG_PLACE'
drivers:
NetworkPowerDriver: {}
SerialDriver:
txdelay: 0.05
ShellDriver:
prompt: 'root@[\w()]+:[^ ]+ '
login_prompt: Please press Enter to activate this console.
await_login_timeout: 15
login_timeout: 120
post_login_settle_time: 5
username: root
UBootDriver:
prompt: '=> '
init_commands:
- setenv kernel_addr_r 0x2000000
- setenv bootargs earlyprintk console=ttyS0,115200
boot_command: >
dhcp $kernel_addr_r turris-omnia/openwrt-mvebu-cortexa9-cznic_turris-omnia-initramfs-kernel.bin;
bootm $kernel_addr_r
TFTPProviderDriver: {}
UBootStrategy: {} |
I am also testing OpenWRT (specifically Gluon) devices with labgrid. In my use case using labgrid/labgrid/driver/shelldriver.py Line 171 in 1ab0abd
Cherrypicking 1209a65 onto current master at least fixes my issue. But looking at the code, as it is now, I think I would prefer to set |
Description
Checklist
Cc: @JoshuaWatt