shelldriver: fixes and improvements #945

ynezz · 2022-06-25T15:12:01Z

Description

shelldriver: make run wrapper injecting more reliable
shelldriver: make kernel silencing more reliable
shelldriver: fix missing kernel silencing on password less consoles
shelldriver: make post_login_settle_time generally available

Checklist

PR has been tested

Cc: @JoshuaWatt

codecov · 2022-06-26T07:35:43Z

Codecov Report

Attention: Patch coverage is 8.10811% with 34 lines in your changes are missing coverage. Please review.

Project coverage is 62.6%. Comparing base (69fd553) to head (65d0269).
Report is 295 commits behind head on master.

Files	Patch %	Lines
labgrid/driver/shelldriver.py	8.1%	34 Missing ⚠️

Additional details and impacted files

@@           Coverage Diff            @@
##           master    #945     +/-   ##
========================================
- Coverage    62.7%   62.6%   -0.2%     
========================================
  Files         159     159             
  Lines       11658   11687     +29     
========================================
+ Hits         7319    7321      +2     
- Misses       4339    4366     +27

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

labgrid/driver/shelldriver.py

There was `post_login_settle_time` variable added in commit 0758753 ("shelldriver: Add post-login "settle" time") in order to improve prompt detection on a chatty consoles. For some reason, that feature is currently only available on systems with user logins. Chatty consoles might pose problems on all systems, so let's make this feature available to everyone. For example in my current case, I'm testing OpenWrt system, which has passwordless root login by default and labgrid is not able to inject run() wrapper reliably (modified for brevity): DEBUG: Write 60 bytes: b'run() { echo -n "$MARKER"; sh -c "$@"; echo "$MARKER $?"; }\n' (with 0.010000s txdelay) DEBUG: Read 404 bytes: b'6.093178] kmodloader: done loading kernel modules from /etc/modules.d/*\r\n){-$AE"; sh -c "$@" ; echo "$MARKER $?"; ', timeout 30.00, requested size 1 DEBUG: Read 25 bytes: b'}\r\n/bin/ash: syntax error', timeout 30.00, requested size 1 With post_login_settle_time=10 it works much better. Cc: Joshua Watt <[email protected]> Signed-off-by: Petr Štetiar <[email protected]>

I've noticed increased test failures on my OpenWrt based systems, which have passwordless root login by default due to more chatty kernel console, which interleaves command outputs. I've found out, that there is no `dmesg -n 1` command run anymore and tracked it down to the changes done in commit 4015426 ("shelldriver: Silence kernel earlier"). This change wrongly assumes, that `dmesg -n 1` should be run only on consoles which have login enabled. So fix it, by removing that wrong assumption and thus run kernel silencer on all consoles. Cc: Joshua Watt <[email protected]> Fixes: 4015426 ("shelldriver: Silence kernel earlier") Signed-off-by: Petr Štetiar <[email protected]>

Current fire and forget approach doesn't work reliably in all corner cases for various reasons, like for example serial console bitflip or lost byte issues due to RX fifo overruns. Those issues mostly happen frequently during early boot process, where there is a lot of output going on. So let's make it more reliable by actually checking the kernel log level setting via /proc/sys/kernel/printk and retry if it's not set to desired log level. Following example is from GL.iNet GL-B1300 device: root@(none):/# mg- /bin/ash: mg-: not found root@(none):/# coRC;ct/rcsy/kerne/rnk;eh QZE[ 6.365640] kmodloader: loading kernel modules from /etc/modules.d/* ... dmesg -n 1 /bin/ash: coRC: not found /bin/ash: ct/rcsy/kerne/rnk: not found /bin/ash: eh: not found root@(none):/# echo BZXI; cat /proc/sys/kernel/printk ; echo AUTHSA BZXI 7 4 1 7 AUTHSA root@(none):/# dmesg -n 1 root@(none):/# echo XYTH; cat /proc/sys/kernel/printk ; echo PNCWGM XYTH 1 4 1 7 PNCWGM Signed-off-by: Petr Štetiar <[email protected]>

Current fire and forget approach doesn't work reliably in all corner cases for various reasons, like for example serial console bitflip or lost byte issues due to RX fifo overruns. Those issues mostly happen frequently during early boot process, where there is a lot of output going on. So let's make it more reliable by actually checking, if the run wrapper was injected into the target system correctly. Signed-off-by: Petr Štetiar <[email protected]>

Variable which is used by the driver is self._status. Signed-off-by: Petr Štetiar <[email protected]>

aparcar · 2024-04-26T15:06:01Z

Tested this with a Turris Omnia and it fixes the ShellDriver issues as described.

aparcar · 2024-04-26T15:04:21Z

labgrid/driver/shelldriver.py

+            rf"{marker[:4]}(.*){marker[4:]}\s+{self.prompt}",
+            timeout=timeout
+        )
+        data = self.re_vt100.sub('', match.group(1).decode(codec, decodeerrors)).split('\r\n')


Suggested change

data = self.re_vt100.sub('', match.group(1).decode(codec, decodeerrors)).split('\r\n')

data = re_vt100.sub('', match.group(1).decode(codec, decodeerrors)).split('\r\n')

This seems to have changed in 6743c69

aparcar · 2024-04-29T09:30:23Z

While this fixes "real world" devices, it break the QEMU driver for me, muting all shell output. I'll investigate further and give more details once I got them.

Bastian-Krause · 2024-04-29T12:41:27Z

labgrid/driver/shelldriver.py

-                    self.status = 1
+                    self._status = 1


That's already fixed with 8bc7858.

Bastian-Krause · 2024-04-29T12:54:12Z

labgrid/driver/shelldriver.py

-    def _inject_run(self):
-        self.console.sendline(
-            '''run() { echo -n "$MARKER"; sh -c "$@"; echo "$MARKER $?"; }'''
-        )
-        self.console.expect(self.prompt)
+    def _inject_run(self, inject_timeout=10.0):
+        """Tries to inject run() command wrapper into shell on the console."""
+        timeout = Timeout(inject_timeout)
+
+        while True:
+            self.console.sendline(
+                '''run() { echo -n "$MARKER"; sh -c "$@"; echo "$MARKER $?"; }'''
+            )
+            self.console.expect(self.prompt)
+            try:
+                data = self._raw_run("type run")
+                if data[2] and "function" in data[2]:
+                    return
+            except TIMEOUT:
+                pass
+
+            if timeout.expired:
+                raise TIMEOUT(f"Timeout of {inject_timeout} seconds exceeded during waiting for run injection")  # pylint: disable=line-too-long


If your serial console does not work reliably (even after specifying a certain settle time), I don't think retrying is the correct approach. Even if this works after the nth try, too many other things will fail in unexpected ways later. Better make your serial console work reliably or prefer the SSHDriver in these situations.

If your serial console does not work reliably (even after specifying a certain settle time)

Ok, fair enough, but my goal is to make it generic and reusable across different use cases, instead of fiddling with variable settle time, fighting crazy applications logging into /dev/console directly etc.

If your serial console does not work reliably (even after specifying a certain settle time), I don't think retrying is the correct approach.

Well, this approach works reliably for over than 2 years, with a mix of dozen different devices, bootloaders, kernels/drivers and operating systems. So the idea here is to get as fast (and reliably) as possible into the SSH and profit.

SSHDriver in these situations.

Indeed, its the only reliable solution if there is no dedicated serial console available.

Bastian-Krause · 2024-04-29T13:12:30Z

labgrid/driver/shelldriver.py

+    def _raw_run(self, cmd, timeout=2, codec="utf-8", decodeerrors="strict"):
+        """Runs command without run() wrapper as it's not yet available"""
+        marker = gen_marker()
+        self.console.sendline(f"echo {marker[:4]}; {cmd} ; echo {marker[4:]}")
+        _, _, match, _ = self.console.expect(
+            rf"{marker[:4]}(.*){marker[4:]}\s+{self.prompt}",
+            timeout=timeout
+        )
+        data = self.re_vt100.sub('', match.group(1).decode(codec, decodeerrors)).split('\r\n')
+        if data and not data[-1]:
+            del data[-1]
+            return data
+
+    def _silence_kernel(self, timeout, kernel_log_level=1, codec="utf-8", decodeerrors="strict"):
+        """Tries to set kernel log level to KERN_ALERT log level to make it less chatty."""
+
+        while True:
+            self.console.sendline(f"dmesg -n {kernel_log_level}")
+            try:
+                data = self._raw_run("cat /proc/sys/kernel/printk")
+                if data[2] and data[2].startswith(f'{kernel_log_level}\t'):
+                    return
+
+            except TIMEOUT:
+                pass
+
+            if timeout.expired:
+                raise TIMEOUT(f"Timeout of {self.login_timeout} seconds exceeded during waiting for kernel silencer")  # pylint: disable=line-too-long
+


I'm not sure we want more retry code for certain devices behaving strangely. Shouldn't these devices wait until input can be handled reliably before starting login/shell?

From the commit message:

Those issues mostly happen frequently during early boot process, where there is a lot of output
going on.

Could await_login_timeout help here instead?

I'm not sure we want more retry code for certain devices behaving strangely.

Ok, fair enough.

Shouldn't these devices wait until input can be handled reliably before starting login/shell?

Sure, if you have complete control over the build/settings of bootloader, kernel, drivers and operating system, then this is certainly doable.

For example, in OpenWrt project I'm testing 3 releases, this means 3-4 different kernel versions, 5.10, 5.15, 6.1 and 6.6. I've already learned during the years, that each of this kernel versions changed the behavior of the device, driver probing/loading and thus influence the serial console behavior. I don't want to handle this differences manually, thus I'm silencing the kernel ASAP, then continue bringing the device into SSH over LAN network and profit.

On some slow devices there might be 10s silence on the console and then ath11k is being probed and if you're lucky enough, exactly at the point when you're trying to detect LAN network availability of the device. In some builds, there are strange 3rd party applications which are not afraid to output debug logging into /dev/console directly etc.

Could await_login_timeout help here instead?

Its quite some time, but I guess, that I've tried all available features/config options, indeed. I'll rebase, retest and see how it goes today.

ynezz · 2024-04-29T14:19:39Z

@Bastian-Krause thanks a lot for a review, really appreciate it! I'll rebase and retest soon.

aparcar · 2024-06-13T13:00:38Z

Hi @ynezz

Since I faced the same issue and gave this another try. I found that the simplest fix was to drop the console_ready parameter and instead use logIn_prompt. It's a little bit mystical to me but it turned out to work fine. I assume Labgrids internal state machine for silencing the Kernel got confused by OpenWrt not requiring a proper login.

Below is my working example which did not require any code modifications:

targets:
  main:
    features:
      - wifi
      - online
    resources:
      RemotePlace:
        name: !template '$LG_PLACE'
    drivers:
      NetworkPowerDriver: {}
      SerialDriver:
        txdelay: 0.05
      ShellDriver:
        prompt: 'root@[\w()]+:[^ ]+ '
        login_prompt: Please press Enter to activate this console.
        await_login_timeout: 15
        login_timeout: 120
        post_login_settle_time: 5
        username: root
      UBootDriver:
        prompt: '=> '
        init_commands:
          - setenv kernel_addr_r 0x2000000
          - setenv bootargs earlyprintk console=ttyS0,115200
        boot_command: >
          dhcp $kernel_addr_r turris-omnia/openwrt-mvebu-cortexa9-cznic_turris-omnia-initramfs-kernel.bin;
          bootm $kernel_addr_r
      TFTPProviderDriver: {}
      UBootStrategy: {}

SmithChart · 2024-08-26T19:40:32Z

I am also testing OpenWRT (specifically Gluon) devices with labgrid.
And I can report, that for me silencing of the log-messages with dmesg -n 1 is currently broken.

In my use case using console_ready means that i am directly logged in afterwards. So I would either assume that we should set did_login = True after

labgrid/labgrid/driver/shelldriver.py

Line 171 in 1ab0abd

self.console.sendline("")

or use 1209a65 .

Cherrypicking 1209a65 onto current master at least fixes my issue.

But looking at the code, as it is now, I think I would prefer to set did_login = True after logging in with index == 4. This way we make sure, that we only issue the dmesg -n 1 a single time - even if we maybe extend the state machine later on.

ynezz force-pushed the upstream/shelldriver-fixes-improvements branch from 702d088 to 62a15ff Compare June 26, 2022 07:29

JoshuaWatt suggested changes Jun 27, 2022

View reviewed changes

labgrid/driver/shelldriver.py Show resolved Hide resolved

labgrid/driver/shelldriver.py Show resolved Hide resolved

ynezz force-pushed the upstream/shelldriver-fixes-improvements branch 2 times, most recently from f3f4366 to 32c250b Compare July 12, 2022 08:06

Bastian-Krause assigned jluebbe Sep 15, 2022

ynezz added 5 commits May 17, 2023 14:02

shelldriver: fix bogus self.status variable assigment

65d0269

Variable which is used by the driver is self._status. Signed-off-by: Petr Štetiar <[email protected]>

Emantor force-pushed the upstream/shelldriver-fixes-improvements branch from 32c250b to 65d0269 Compare May 17, 2023 12:03

aparcar suggested changes Apr 26, 2024

View reviewed changes

Bastian-Krause added the needs rebase Needs a rebase onto the master branch, maintainter could probably not push to submitter branch. label Apr 29, 2024

Bastian-Krause reviewed Apr 29, 2024

View reviewed changes

labgrid/driver/shelldriver.py

self.status = 1

self._status = 1

Copy link

Member

Bastian-Krause Apr 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's already fixed with 8bc7858.

Bastian-Krause reviewed Apr 29, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

shelldriver: fixes and improvements #945

shelldriver: fixes and improvements #945

Uh oh!

ynezz commented Jun 25, 2022 •

edited

Loading

Uh oh!

codecov bot commented Jun 26, 2022 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

aparcar commented Apr 26, 2024

Uh oh!

aparcar Apr 26, 2024

Uh oh!

aparcar commented Apr 29, 2024

Uh oh!

Bastian-Krause Apr 29, 2024

Uh oh!

Bastian-Krause Apr 29, 2024

Uh oh!

ynezz Apr 29, 2024

Uh oh!

Bastian-Krause Apr 29, 2024

Uh oh!

ynezz Apr 29, 2024

Uh oh!

ynezz commented Apr 29, 2024

Uh oh!

aparcar commented Jun 13, 2024

Uh oh!

SmithChart commented Aug 26, 2024

Uh oh!

Uh oh!

	data = self.re_vt100.sub('', match.group(1).decode(codec, decodeerrors)).split('\r\n')
	data = re_vt100.sub('', match.group(1).decode(codec, decodeerrors)).split('\r\n')

shelldriver: fixes and improvements #945

Are you sure you want to change the base?

shelldriver: fixes and improvements #945

Uh oh!

Conversation

ynezz commented Jun 25, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

codecov bot commented Jun 26, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Uh oh!

Uh oh!

aparcar commented Apr 26, 2024

Uh oh!

aparcar Apr 26, 2024

Choose a reason for hiding this comment

Uh oh!

aparcar commented Apr 29, 2024

Uh oh!

Bastian-Krause Apr 29, 2024

Choose a reason for hiding this comment

Uh oh!

Bastian-Krause Apr 29, 2024

Choose a reason for hiding this comment

Uh oh!

ynezz Apr 29, 2024

Choose a reason for hiding this comment

Uh oh!

Bastian-Krause Apr 29, 2024

Choose a reason for hiding this comment

Uh oh!

ynezz Apr 29, 2024

Choose a reason for hiding this comment

Uh oh!

ynezz commented Apr 29, 2024

Uh oh!

aparcar commented Jun 13, 2024

Uh oh!

SmithChart commented Aug 26, 2024

Uh oh!

Uh oh!

ynezz commented Jun 25, 2022 •

edited

Loading

codecov bot commented Jun 26, 2022 •

edited

Loading