Skip to content

shelldriver: fixes and improvements #945

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

ynezz
Copy link
Contributor

@ynezz ynezz commented Jun 25, 2022

Description

  • shelldriver: make run wrapper injecting more reliable
  • shelldriver: make kernel silencing more reliable
  • shelldriver: fix missing kernel silencing on password less consoles
  • shelldriver: make post_login_settle_time generally available

Checklist

  • PR has been tested

Cc: @JoshuaWatt

@ynezz ynezz force-pushed the upstream/shelldriver-fixes-improvements branch from 702d088 to 62a15ff Compare June 26, 2022 07:29
@codecov
Copy link

codecov bot commented Jun 26, 2022

Codecov Report

Attention: Patch coverage is 8.10811% with 34 lines in your changes are missing coverage. Please review.

Project coverage is 62.6%. Comparing base (69fd553) to head (65d0269).
Report is 295 commits behind head on master.

Files Patch % Lines
labgrid/driver/shelldriver.py 8.1% 34 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff            @@
##           master    #945     +/-   ##
========================================
- Coverage    62.7%   62.6%   -0.2%     
========================================
  Files         159     159             
  Lines       11658   11687     +29     
========================================
+ Hits         7319    7321      +2     
- Misses       4339    4366     +27     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@ynezz ynezz force-pushed the upstream/shelldriver-fixes-improvements branch 2 times, most recently from f3f4366 to 32c250b Compare July 12, 2022 08:06
ynezz added 5 commits May 17, 2023 14:02
There was `post_login_settle_time` variable added in commit 0758753
("shelldriver: Add post-login "settle" time") in order to improve prompt
detection on a chatty consoles.

For some reason, that feature is currently only available on systems
with user logins. Chatty consoles might pose problems on all systems, so
let's make this feature available to everyone.

For example in my current case, I'm testing OpenWrt system, which has
passwordless root login by default and labgrid is not able to inject
run() wrapper reliably (modified for brevity):

  DEBUG: Write 60 bytes: b'run() { echo -n "$MARKER"; sh -c "$@"; echo "$MARKER $?"; }\n' (with 0.010000s txdelay)
  DEBUG: Read 404 bytes: b'6.093178] kmodloader: done loading kernel modules from /etc/modules.d/*\r\n){-$AE";
sh -c "$@"
; echo "$MARKER $?"; ', timeout 30.00, requested size 1
  DEBUG: Read 25 bytes: b'}\r\n/bin/ash: syntax error', timeout 30.00, requested size 1

With post_login_settle_time=10 it works much better.

Cc: Joshua Watt <[email protected]>
Signed-off-by: Petr Štetiar <[email protected]>
I've noticed increased test failures on my OpenWrt based systems, which
have passwordless root login by default due to more chatty kernel
console, which interleaves command outputs. I've found out, that there
is no `dmesg -n 1` command run anymore and tracked it down to the
changes done in commit 4015426 ("shelldriver: Silence kernel
earlier"). This change wrongly assumes, that `dmesg -n 1` should be run
only on consoles which have login enabled. So fix it, by removing that
wrong assumption and thus run kernel silencer on all consoles.

Cc: Joshua Watt <[email protected]>
Fixes: 4015426 ("shelldriver: Silence kernel earlier")
Signed-off-by: Petr Štetiar <[email protected]>
Current fire and forget approach doesn't work reliably in all corner
cases for various reasons, like for example serial console bitflip or
lost byte issues due to RX fifo overruns. Those issues mostly happen
frequently during early boot process, where there is a lot of output
going on.

So let's make it more reliable by actually checking the kernel log level
setting via /proc/sys/kernel/printk and retry if it's not set to desired
log level.

Following example is from GL.iNet GL-B1300 device:

 root@(none):/# mg-
 /bin/ash: mg-: not found
 root@(none):/# coRC;ct/rcsy/kerne/rnk;eh QZE[    6.365640] kmodloader:
 loading kernel modules from /etc/modules.d/*
 ...
 dmesg -n 1
 /bin/ash: coRC: not found
 /bin/ash: ct/rcsy/kerne/rnk: not found
 /bin/ash: eh: not found
 root@(none):/# echo BZXI; cat /proc/sys/kernel/printk ; echo AUTHSA
 BZXI
 7       4       1       7
 AUTHSA
 root@(none):/# dmesg -n 1
 root@(none):/# echo XYTH; cat /proc/sys/kernel/printk ; echo PNCWGM
 XYTH
 1       4       1       7
 PNCWGM

Signed-off-by: Petr Štetiar <[email protected]>
Current fire and forget approach doesn't work reliably in all corner
cases for various reasons, like for example serial console bitflip or
lost byte issues due to RX fifo overruns. Those issues mostly happen
frequently during early boot process, where there is a lot of output
going on.

So let's make it more reliable by actually checking, if the run wrapper
was injected into the target system correctly.

Signed-off-by: Petr Štetiar <[email protected]>
Variable which is used by the driver is self._status.

Signed-off-by: Petr Štetiar <[email protected]>
@Emantor Emantor force-pushed the upstream/shelldriver-fixes-improvements branch from 32c250b to 65d0269 Compare May 17, 2023 12:03
@aparcar
Copy link
Contributor

aparcar commented Apr 26, 2024

Tested this with a Turris Omnia and it fixes the ShellDriver issues as described.

rf"{marker[:4]}(.*){marker[4:]}\s+{self.prompt}",
timeout=timeout
)
data = self.re_vt100.sub('', match.group(1).decode(codec, decodeerrors)).split('\r\n')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
data = self.re_vt100.sub('', match.group(1).decode(codec, decodeerrors)).split('\r\n')
data = re_vt100.sub('', match.group(1).decode(codec, decodeerrors)).split('\r\n')

This seems to have changed in 6743c69

@aparcar
Copy link
Contributor

aparcar commented Apr 29, 2024

While this fixes "real world" devices, it break the QEMU driver for me, muting all shell output. I'll investigate further and give more details once I got them.

@Bastian-Krause Bastian-Krause added the needs rebase Needs a rebase onto the master branch, maintainter could probably not push to submitter branch. label Apr 29, 2024
self.status = 1
self._status = 1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's already fixed with 8bc7858.

Comment on lines -239 to +256
def _inject_run(self):
self.console.sendline(
'''run() { echo -n "$MARKER"; sh -c "$@"; echo "$MARKER $?"; }'''
)
self.console.expect(self.prompt)
def _inject_run(self, inject_timeout=10.0):
"""Tries to inject run() command wrapper into shell on the console."""
timeout = Timeout(inject_timeout)

while True:
self.console.sendline(
'''run() { echo -n "$MARKER"; sh -c "$@"; echo "$MARKER $?"; }'''
)
self.console.expect(self.prompt)
try:
data = self._raw_run("type run")
if data[2] and "function" in data[2]:
return
except TIMEOUT:
pass

if timeout.expired:
raise TIMEOUT(f"Timeout of {inject_timeout} seconds exceeded during waiting for run injection") # pylint: disable=line-too-long
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If your serial console does not work reliably (even after specifying a certain settle time), I don't think retrying is the correct approach. Even if this works after the nth try, too many other things will fail in unexpected ways later. Better make your serial console work reliably or prefer the SSHDriver in these situations.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If your serial console does not work reliably (even after specifying a certain settle time)

Ok, fair enough, but my goal is to make it generic and reusable across different use cases, instead of fiddling with variable settle time, fighting crazy applications logging into /dev/console directly etc.

If your serial console does not work reliably (even after specifying a certain settle time), I don't think retrying is the correct approach.

Well, this approach works reliably for over than 2 years, with a mix of dozen different devices, bootloaders, kernels/drivers and operating systems. So the idea here is to get as fast (and reliably) as possible into the SSH and profit.

SSHDriver in these situations.

Indeed, its the only reliable solution if there is no dedicated serial console available.

Comment on lines +113 to +141
def _raw_run(self, cmd, timeout=2, codec="utf-8", decodeerrors="strict"):
"""Runs command without run() wrapper as it's not yet available"""
marker = gen_marker()
self.console.sendline(f"echo {marker[:4]}; {cmd} ; echo {marker[4:]}")
_, _, match, _ = self.console.expect(
rf"{marker[:4]}(.*){marker[4:]}\s+{self.prompt}",
timeout=timeout
)
data = self.re_vt100.sub('', match.group(1).decode(codec, decodeerrors)).split('\r\n')
if data and not data[-1]:
del data[-1]
return data

def _silence_kernel(self, timeout, kernel_log_level=1, codec="utf-8", decodeerrors="strict"):
"""Tries to set kernel log level to KERN_ALERT log level to make it less chatty."""

while True:
self.console.sendline(f"dmesg -n {kernel_log_level}")
try:
data = self._raw_run("cat /proc/sys/kernel/printk")
if data[2] and data[2].startswith(f'{kernel_log_level}\t'):
return

except TIMEOUT:
pass

if timeout.expired:
raise TIMEOUT(f"Timeout of {self.login_timeout} seconds exceeded during waiting for kernel silencer") # pylint: disable=line-too-long

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure we want more retry code for certain devices behaving strangely. Shouldn't these devices wait until input can be handled reliably before starting login/shell?

From the commit message:

Those issues mostly happen frequently during early boot process, where there is a lot of output
going on.

Could await_login_timeout help here instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure we want more retry code for certain devices behaving strangely.

Ok, fair enough.

Shouldn't these devices wait until input can be handled reliably before starting login/shell?

Sure, if you have complete control over the build/settings of bootloader, kernel, drivers and operating system, then this is certainly doable.

For example, in OpenWrt project I'm testing 3 releases, this means 3-4 different kernel versions, 5.10, 5.15, 6.1 and 6.6. I've already learned during the years, that each of this kernel versions changed the behavior of the device, driver probing/loading and thus influence the serial console behavior. I don't want to handle this differences manually, thus I'm silencing the kernel ASAP, then continue bringing the device into SSH over LAN network and profit.

On some slow devices there might be 10s silence on the console and then ath11k is being probed and if you're lucky enough, exactly at the point when you're trying to detect LAN network availability of the device. In some builds, there are strange 3rd party applications which are not afraid to output debug logging into /dev/console directly etc.

Could await_login_timeout help here instead?

Its quite some time, but I guess, that I've tried all available features/config options, indeed. I'll rebase, retest and see how it goes today.

@ynezz
Copy link
Contributor Author

ynezz commented Apr 29, 2024

@Bastian-Krause thanks a lot for a review, really appreciate it! I'll rebase and retest soon.

@aparcar
Copy link
Contributor

aparcar commented Jun 13, 2024

Hi @ynezz

Since I faced the same issue and gave this another try. I found that the simplest fix was to drop the console_ready parameter and instead use logIn_prompt. It's a little bit mystical to me but it turned out to work fine. I assume Labgrids internal state machine for silencing the Kernel got confused by OpenWrt not requiring a proper login.

Below is my working example which did not require any code modifications:

targets:
  main:
    features:
      - wifi
      - online
    resources:
      RemotePlace:
        name: !template '$LG_PLACE'
    drivers:
      NetworkPowerDriver: {}
      SerialDriver:
        txdelay: 0.05
      ShellDriver:
        prompt: 'root@[\w()]+:[^ ]+ '
        login_prompt: Please press Enter to activate this console.
        await_login_timeout: 15
        login_timeout: 120
        post_login_settle_time: 5
        username: root
      UBootDriver:
        prompt: '=> '
        init_commands:
          - setenv kernel_addr_r 0x2000000
          - setenv bootargs earlyprintk console=ttyS0,115200
        boot_command: >
          dhcp $kernel_addr_r turris-omnia/openwrt-mvebu-cortexa9-cznic_turris-omnia-initramfs-kernel.bin;
          bootm $kernel_addr_r
      TFTPProviderDriver: {}
      UBootStrategy: {}

@SmithChart
Copy link
Member

I am also testing OpenWRT (specifically Gluon) devices with labgrid.
And I can report, that for me silencing of the log-messages with dmesg -n 1 is currently broken.

In my use case using console_ready means that i am directly logged in afterwards. So I would either assume that we should set did_login = True after

self.console.sendline("")
or use 1209a65 .

Cherrypicking 1209a65 onto current master at least fixes my issue.

But looking at the code, as it is now, I think I would prefer to set did_login = True after logging in with index == 4. This way we make sure, that we only issue the dmesg -n 1 a single time - even if we maybe extend the state machine later on.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs rebase Needs a rebase onto the master branch, maintainter could probably not push to submitter branch.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants