Bug #12602

Test suite relies on ifconfig, that's not included anymore in Tails 3.0

Added by intrigeri 2017-05-26 05:11:15 . Updated 2017-06-12 16:10:09 .

Status:
Resolved
Priority:
Elevated
Assignee:
Category:
Test suite
Target version:
Start date:
2017-05-26
Due date:
% Done:

100%

Feature Branch:
test/12602-fix-network-check
Type of work:
Code
Blueprint:

Starter:
Affected tool:
Deliverable for:

Description

Regression against 2.x + I think this may break some magic we do in post_snapshot_restore_hook, which can make all kinds of tests relying on a working Tor fragile, hence target version = 3.0.


Subtasks


History

#1 Updated by intrigeri 2017-05-27 08:27:03

  • Priority changed from Normal to Elevated

(Regression affecting our QA process.)

#2 Updated by anonym 2017-06-01 02:41:37

  • Status changed from Confirmed to In Progress
  • Assignee changed from anonym to intrigeri
  • % Done changed from 0 to 50
  • QA Check set to Ready for QA
  • Feature Branch set to test/12602-fix-network-check

Indeed! Wow. Nice catch! The post_snapshot_restore_hook fixes are indeed never applied.

Looking closer at this I can see that this have been causing delays of over a minute for the first connection made via Tor after each snapshot restore, effectively making each Tor-using scenarios take that much longer than they should (which confirms a hunch I’ve had for a while). I also suspect it will fix a robustness issue I’ve seen with the a web server is running on the LAN on Stretch-based branches.

Please review’n’merge the much improved network check in the feature branch!

#3 Updated by intrigeri 2017-06-01 07:21:34

  • Assignee changed from intrigeri to anonym
  • QA Check changed from Ready for QA to Info Needed

I’m not sure I understand the underlying idea behind this commit: previously we would restart Tor only if eth0 had an IP address, and with this change we do it only if eth0’s link is up. These are different things, right? I’m concerned that restarting Tor earlier this way may be prone to race conditions, e.g. this could sometimes happen before NM/dhclient have refreshed IP connectivity and then Tor will be stuck. OTOH, the previous state of things was perhaps pretty bad as well: due to the same race condition, we might have skipped restarting Tor in some cases and it could also get stuck. I guess that we have greater chances to win the new race than the previous one, so perhaps I could merge this branch as-is anyway, but wait: taking a step back, I realize I don’t know what if $vm.has_network? is for exactly. I wonder if what we really want would rather be something like: if eth0’s link is up, wait for IP connectivity to be set up (better ask NM rather than ifconfig or ip addr, since an IP might be set before other pieces of the puzzle such as the default route) and then restart Tor etc.

#4 Updated by anonym 2017-06-01 16:35:56

  • Assignee changed from anonym to intrigeri
  • QA Check changed from Info Needed to Ready for QA

intrigeri wrote:
> I’m not sure I understand the underlying idea behind this commit: previously we would restart Tor only if eth0 had an IP address, and with this change we do it only if eth0’s link is up. These are different things, right?

Theoretically, yes, they are different things, but in practice there’s no situation where the link state is up (i.e. the network is plugged) and we haven’t let NetworkManager do DHCP etc. That said, this might change in the future, so perhaps we should also check that we got an IP address…

> I’m concerned that restarting Tor earlier this way may be prone to race conditions, e.g. this could sometimes happen before NM/dhclient have refreshed IP connectivity and then Tor will be stuck.

I think you have an invalid assumption here: it seems you think NM will reconnect after we restore a snapshot. Well, it won’t. We restore the network state (so link state, kernel state, userspace state) of the snapshot, which won’t match the rest of the world, but it turns out this is only a problem for Tor, not the IP stack in general.

Ok?

If you think we should reconnect the network after restoring a snapshot, in an attempt to have sane network state between the VM and host, then we need to temporarily disable all NM hooks, since restarting all of them probably will cause all sorts of trouble. OTOH, what we currently do seems to have worked for years, although it’d be interesting to test whether some network robustness issues perhaps are cause by e.g. non-matching IP state between the VM and the host. What do you think? If so I’d rather open a new ticket about that for later.

> OTOH, the previous state of things was perhaps pretty bad as well: due to the same race condition, we might have skipped restarting Tor in some cases and it could also get stuck. I guess that we have greater chances to win the new race than the previous one, so perhaps I could merge this branch as-is anyway, but wait: taking a step back, I realize I don’t know what if $vm.has_network? is for exactly.

Misunderstandings aside, my take is that this branch currently will only improve things. To stay on the safe side I’ll add to the has_network? check that we also make sure we have an IP address.

> I wonder if what we really want would rather be something like: if eth0’s link is up, wait for IP connectivity to be set up

Here’s the invalid assumption again…

> (better ask NM rather than ifconfig or ip addr, since an IP might be set before other pieces of the puzzle such as the default route)

Makes sense!

Please have a look at commit:b385e3ed5522643a2a20e55474ce8c2555b6b4ff. Are we on the same page now?

#5 Updated by intrigeri 2017-06-01 19:25:22

  • Assignee changed from intrigeri to anonym
  • QA Check changed from Ready for QA to Dev Needed

I’ll have a look once it builds on Jenkins: I’m sorry to be painful but I’m tired of dealing with Bug #12611 myself on everyone’s topic branches.

#6 Updated by intrigeri 2017-06-02 06:18:36

  • Assignee changed from anonym to intrigeri
  • QA Check changed from Dev Needed to Ready for QA

#7 Updated by intrigeri 2017-06-02 06:54:25

> intrigeri wrote:
>> I’m not sure I understand the underlying idea behind this commit: previously we would restart Tor only if eth0 had an IP address, and with this change we do it only if eth0’s link is up. These are different things, right?

> Theoretically, yes, they are different things, but in practice there’s no situation where the link state is up (i.e. the network is plugged) and we haven’t let NetworkManager do DHCP etc. That said, this might change in the future, so perhaps we should also check that we got an IP address…

Yes. That’s the semantics of has_network? we had previously, and one existing step (“the network connection is ready”, currently unused apparently) seems to depend on it. I wonder if we should rename has_network? to make its name express its semantics better. Whatever.

>> I’m concerned that restarting Tor earlier this way may be prone to race conditions, e.g. this could sometimes happen before NM/dhclient have refreshed IP connectivity and then Tor will be stuck.

> I think you have an invalid assumption here: it seems you think NM will reconnect after we restore a snapshot. Well, it won’t. We restore the network state (so link state, kernel state, userspace state) of the snapshot, which won’t match the rest of the world, but it turns out this is only a problem for Tor, not the IP stack in general.

So what I understand now is:

  • in the general case NM/dhclient don’t have to do anything special when restoring a snapshot, and they don’t; it’s very plausible that we currently only exercise this general case;
  • in some special cases (e.g. expired DHCP lease), dhclient will renew its lease (which might run NM hooks in some cases); I wouldn’t be surprised if there were more special cases we can’t think of right now; possibly we don’t exercise any such special case currently.

Correct?

I would find it perhaps a tiny bit too bold/self-confident to assume we won’t ever hit any such special case, especially given time is involved, and we do crazy stuff with time :) Anyway, I agree your branch should only improve things so let’s move on.

> If you think we should reconnect the network after restoring a snapshot, […].

Nope.

> OTOH, what we currently do seems to have worked for years, although it’d be interesting to test whether some network robustness issues perhaps are cause by e.g. non-matching IP state between the VM and the host. What do you think?

I don’t feel the need to argue/explore this further. Hopefully having this potential problem in mind will be enough for us to think about it if/when we see a weird failure that might be related :)

> Please have a look at commit:b385e3ed5522643a2a20e55474ce8c2555b6b4ff. Are we on the same page now?

Looks good to me! Code review passes, will now test.

#8 Updated by intrigeri 2017-06-02 12:19:46

https://jenkins.tails.boum.org/job/test_Tails_ISO_test-12602-fix-network-check/1/ has some failures, so I’ll wait for another test suite run.

#9 Updated by intrigeri 2017-06-02 15:20:16

  • Status changed from In Progress to Fix committed
  • % Done changed from 50 to 100

Applied in changeset commit:69052a9d10a2fda99f3c157be0cd16fb92d3b412.

#10 Updated by intrigeri 2017-06-02 15:20:35

  • Assignee deleted (intrigeri)
  • QA Check changed from Ready for QA to Pass

#11 Updated by intrigeri 2017-06-12 16:10:09

  • Status changed from Fix committed to Resolved