Feature #11888
Improve the remote shell's performance by switching to a virtio channel
% Done: 10%
Description
Currently the serial link-based transfer of data between host and guest is so slow (400 Kbaud) that I think we’ve consciously avoided adding some useful features that depend on fast file IO, e.g. dumping the output of tails-debugging-info as a scenario failure artifact. Also, the data transfer delay of the thousands of remote shell calls we do each full test suite run adds up, so I expect that run time will decrease measurably.
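For readers not familiar with virtio channels: roughly speaking, the idea is to replace the emulated serial port with a virtio serial port backed by a UNIX socket on the host. Below is a minimal sketch of the corresponding libvirt domain XML, written as a Ruby heredoc; the channel name and socket path are illustrative assumptions, not the actual Tails configuration.

# Sketch only: the channel name and socket path below are made up.
REMOTE_SHELL_CHANNEL_XML = <<~XML
  <channel type='unix'>
    <source mode='bind' path='/tmp/remote-shell.socket'/>
    <target type='virtio' name='org.tails.remote_shell.0'/>
  </channel>
XML
# On the host, the test suite would talk to /tmp/remote-shell.socket as a
# plain UNIX socket; inside the guest, the channel appears as
# /dev/virtio-ports/org.tails.remote_shell.0.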
Subtasks
Related issues
Related to Tails - Feature #11294: Replace remote shell with python-negotiator (Rejected, 2016-03-31)
Blocked by Tails - Feature #11887: Make the remote shell's file operations robust (Resolved, 2016-10-25)
History
#1 Updated by intrigeri 2016-10-26 09:27:43
- related to Feature #11294: Replace remote shell with python-negotiator added
#2 Updated by intrigeri 2016-10-26 09:28:09
Cool! It would be nice to clarify the relationship of this ticket with Feature #11294.
#3 Updated by intrigeri 2016-10-26 09:28:38
- related to Feature #11887: Make the remote shell's file operations robust added
#4 Updated by anonym 2016-10-26 20:33:50
I’ve seen the test suite get stuck like this:
Checkpoint: I have started Tails from DVD and logged in with an administration password and the network is connected
Given I have started Tails from DVD without network and logged in with an administration password
And the network is plugged
calling as root: /usr/local/sbin/tor-has-bootstrapped
call returned: [1, "", ""]
calling as root: /usr/local/sbin/tor-has-bootstrapped
call returned: [1, "", ""]
calling as root: /usr/local/sbin/tor-has-bootstrapped
call returned: [1, "", ""]
calling as root: /usr/local/sbin/tor-has-bootstrapped
call returned: [1, "", ""]
calling as root: /usr/local/sbin/tor-has-bootstrapped
calling as root: journalctl SYSLOG_IDENTIFIER=restart-tor
Presumably both sides ended up waiting for an answer => deadlock.
It happens in wait_until_tor_is_working():

def wait_until_tor_is_working
  try_for(270) { $vm.execute('/usr/local/sbin/tor-has-bootstrapped').success? }
rescue Timeout::Error => e
  c = $vm.execute("journalctl SYSLOG_IDENTIFIER=restart-tor")
  [...]
So I can see how a try_for that times out could create the situation where the last tor-has-bootstrapped call is aborted while we are listening for the response but before it arrives, and we then end up calling journalctl, where we should receive both responses but discard the first one (which is logged). However, I only see five runs of the try_for block, so it’s only been ~five seconds. So this is still a mystery to me.
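For concreteness, the request/response matching described above could look roughly like the sketch below; the method, wire format and field names are made up for illustration and are not the actual remote shell client.

require 'json'

# Hypothetical client-side call: each request carries an id, and responses
# whose id does not match the current request are assumed to belong to an
# earlier, aborted call, so they are logged and discarded.
def call_remote(socket, command)
  @next_id = (@next_id || 0) + 1
  id = @next_id
  socket.puts(JSON.generate([id, command]))
  socket.flush
  loop do
    reply_id, status, stdout, stderr = JSON.parse(socket.gets)
    return [status, stdout, stderr] if reply_id == id
    puts "Discarding stale response for request ##{reply_id}"  # logged, as above
  end
end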
Just a thought: this might be some sort of race that affects the slow non-virtio remote shell as well, it’s just that it happens more often with this awesome speed.
#5 Updated by anonym 2016-10-27 08:32:05
And now I’ve seen these two deadlocks:
# just restored from a snapshot
calling as root: echo 'hello?'
call returned: [0, "hello?\n", ""]
calling as root: /sbin/ifconfig eth0 | grep -q 'inet addr'
call returned: [1, "", ""]
calling as root: date -s '@1477525308'
# deadlock
# first boot
calling as root: echo 'hello?'
call returned: [0, "hello?\n", ""]
calling as root: service tor status
call returned: [3, "● tor.service - Anonymizing overlay network for TCP (multi-instance-master)\n Loaded: loaded (/lib/systemd/system/tor.service; disabled)\n Active: inactive (dead)\n", ""]
opening file /etc/tor/torrc in 'append' mode
append complete
[log] CLICK on (642,449)
calling as root: loginctl
call returned: [0, " SESSION UID USER SEAT \n c1 114 Debian-gdm seat0 \n\n1 sessions listed.\n", ""]
calling as root: loginctl
call returned: [0, " SESSION UID USER SEAT \n c1 114 Debian-gdm seat0 \n 1 1000 amnesia seat0 \n c5 0 root \n\n3 sessions listed.\n", ""]
calling as amnesia: gsettings set org.gnome.desktop.session idle-delay 0
call returned: [0, "", ""]
calling as amnesia: gsettings set org.gnome.desktop.interface toolkit-accessibility true
# deadlock
Add to this that, when I was developing this branch, I very frequently hit the issue that the remote shell got deadlocked within the first ten calls after it started for the first time, but I thought I had fixed it by making sure both sides flush the socket they use for the virtio channel. Anyway, there seems to be an increased risk of these deadlocks “early”, for some definition of “early”.
I guess I need to add some sort of deep debug logging to understand what happens. The only problem is that normally we’d use the remote shell to obtain those logs from the guest. Perhaps we can revive the serial link just for that purpose. :)
#6 Updated by bertagaz 2016-10-27 11:49:25
anonym wrote:
> Just a thought: this might be some sort of race that affects the slow non-virtio remote shell as well, it’s just that it happens more often with this awesome speed.
It is. I thought I had seen it while working on Bug #10777 and mentioned it there, but it doesn’t seem so. I’m pretty sure I noticed it anyway, and that it was one of the reasons the remote shell sometimes failed.
#7 Updated by bertagaz 2016-10-27 11:53:22
bertagaz wrote:
> It is. I thought I had seen it while working on Bug #10777 and mentioned it there, but it doesn’t seem so. I’m pretty sure I noticed it anyway, and that it was one of the reasons the remote shell sometimes failed.
Found it, that was in Bug #11846#note-8
#8 Updated by anonym 2016-11-02 15:03:21
bertagaz wrote:
> bertagaz wrote:
> > It is. I thought I had seen it while working on Bug #10777 and mentioned it there, but it doesn’t seem so. I’m pretty sure I noticed it anyway, and that it was one of the reasons the remote shell sometimes failed.
>
> Found it, that was in Bug #11846#note-8
This is not the same issue, AFAICT. Dropping out-of-order responses is expected when VM.execute() is called inside a try_for() block (since we might abort after sending the request to the remote shell, but before we get the response). In fact, I’m shocked we haven’t experienced more issues before, when we did not handle this. I guess it’s because we mostly fail the run when a try_for() fails, so the remote shell is not used any more and we are safe; if we did try to use it, we’d get the response from the aborted command, which could cause all sorts of trouble.
The deadlocks I’m talking about happen even if there’s no try_for(), and in fact no exceptions at all are thrown. I suspect this is a low-level issue, perhaps related to IO buffering on the socket (which should be both disabled and flush()ed, on both client and server). And it’s pretty rare: our test suite makes around 5000 calls to the remote shell each run, and if you look at the Jenkins results for the branch, there’s one deadlock per run, at random places. I have never seen more than one happen, and only a few runs where it doesn’t happen at all.
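For what it’s worth, the buffering precautions mentioned above amount to something like this sketch, assuming the host side of the virtio channel is exposed as a UNIX socket (the path and payload are made up):

require 'socket'

# Sketch only: hypothetical socket path and request payload.
sock = UNIXSocket.new('/tmp/remote-shell.socket')
sock.sync = true                 # disable Ruby's internal write buffering
sock.puts('[1, "echo hello"]')   # send a request
sock.flush                       # and flush explicitly, for good measure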
This is a tricky one, and I’m currently trying to figure out how to debug it. The problem is that we can only easily debug one side (the client, in the test suite) since we cannot easily access the log of the remote shell server — the easy way would be to use the remote shell itself! :)
#9 Updated by anonym 2016-11-25 13:22:23
- Target version changed from Tails_2.9.1 to Tails 2.10
Given the problems I’m seeing, I think I’ll split out the virtio socket blah into its own branch, so the other improvements can be included faster. Soon!
#10 Updated by anonym 2016-11-29 12:40:19
- Status changed from Confirmed to In Progress
- % Done changed from 0 to 10
- Feature Branch deleted (test/remote-shell-file-operations)
To continue work on this, revert commit:9fb70d9c6e90ff18e4e2be97c6360fd8212aa0dd (in the test/11887-remote-shell-file-operations branch, if it hasn’t been merged yet).
#11 Updated by anonym 2016-11-29 12:40:34
- related to deleted (Feature #11887: Make the remote shell's file operations robust)
#12 Updated by anonym 2016-11-29 12:40:48
- blocked by Feature #11887: Make the remote shell's file operations robust added
#13 Updated by anonym 2017-01-06 14:14:40
- Target version deleted (Tails 2.10)
#14 Updated by anonym 2017-01-26 10:12:05
- Feature Branch set to feature/11888-remote-shell-virtio-transport
I’ve revived the code now (based on test/12059-dogtail-optimizations, which in turn is based on test/11887-remote-shell-file-operations). Let’s see how it behaves on Jenkins.
#15 Updated by anonym 2019-10-11 09:28:34
anonym wrote:
> I’ve revived the code now (based on test/12059-dogtail-optimizations, which in turn is based on test/11887-remote-shell-file-operations). Let’s see how it behaves on Jenkins.
That was two years ago, and I don’t know what the outcome was. :/ Anyway, I’m curious whether anything has changed since then, so I have refreshed this branch yet again.
#16 Updated by anonym 2019-10-12 12:06:27
So after 5 runs of +force-all-tests and 3 runs of the “normal” branch, one of the +force-all-tests runs had a failure like this:
05:00:49.055818907: Restoring snapshot 'with-network-logged-in'...
05:00:51.131680384: Remote shell: calling as root: echo 'hello?'
05:00:52.300526024: Remote shell: calling as root: echo 'hello?'
[...]
05:02:20.000511903: Remote shell: calling as root: echo 'hello?'
05:02:21.017870387: Remote shell: calling as root: echo 'hello?'
Generated snapshot step failed with exception:
Timeout::Error: Remote shell seems to be down
So, again, there is a problem “early” after restoring, although now it fails immediately after restoring, whereas 3 years ago it could happen a few calls later.
Anyway, I added some debug logging to help investigate (and possibly work around) this.
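The probing visible in the log above amounts to something like the following sketch; the helper name and the vm object are assumptions, not the actual test suite code.

require 'timeout'

# Hypothetical helper: keep sending a trivial command until the remote shell
# answers, or give up after the timeout.
def wait_for_remote_shell(vm, timeout: 90)
  deadline = Time.now + timeout
  while Time.now < deadline
    begin
      return if vm.execute("echo 'hello?'").success?
    rescue StandardError
      # The channel may not be usable yet; fall through and retry.
    end
    sleep 1
  end
  raise Timeout::Error, 'Remote shell seems to be down'
end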
#18 Updated by anonym 2019-10-14 08:56:50
I tried bumping the timeout used for waiting for the remote shell after restoring a snapshot from 3 to 30 seconds, but it didn’t help.
What does help is to just retry restoring the snapshot. It seems like an effective workaround, as it has rescued all instances so far, but our serial-based transport works fine without needing workarounds; it’s just slow. It seems this branch could lower the total run time from around 5h40m to 5h30m (and more if there are tons of failures, since then we save the journal over a slow serial link), which does not seem worth it.
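The retry workaround could be sketched like this; it reuses the hypothetical wait_for_remote_shell from the earlier sketch, and vm.restore_snapshot is likewise an assumed API, not the actual test suite code.

require 'timeout'

# Sketch only: if the remote shell does not come back after restoring a
# snapshot, restore the snapshot again, up to a few attempts.
def restore_snapshot_with_retries(vm, name, attempts: 3)
  attempts.times do |i|
    vm.restore_snapshot(name)
    begin
      return wait_for_remote_shell(vm, timeout: 30)
    rescue Timeout::Error
      raise if i == attempts - 1
      puts "Remote shell down after restoring '#{name}'; retrying"
    end
  end
end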
I guess I’ll return in two years and see if things have improved. Unless we get a real need for this branch and want to invest time reporting this upstream and debugging it much further.
#19 Updated by intrigeri 2019-10-14 09:35:44
> I guess I’ll return in two years and see if things have improved. Unless we get a real need for this branch and want to invest time reporting this upstream and debugging it much further.
This plan works for me, until we see a real, immediate need for it.
#20 Updated by intrigeri 2019-10-14 14:50:36
Hi @anonym,
> I guess I’ll return in two years and see if things have improved.
Then, would you mind renaming the branches to wip/ to free some Jenkins cycles?
#21 Updated by anonym 2019-10-14 15:00:31
- Feature Branch changed from feature/11888-remote-shell-virtio-transport to wip/11888-remote-shell-virtio-transport
intrigeri wrote:
> Then, would you mind renaming the branches to wip/ to free some Jenkins cycles?
Didn’t think about that! Thanks, done!