Bug #9478

How to deal with transient network errors in the test suite?

Added by anonym 2015-05-27 07:05:06. Updated 2015-06-02 14:37:46.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Test suite
Target version:
Start date:
2015-05-27
Due date:
% Done:

100%

Feature Branch:
Type of work:
Discuss
Blueprint:

Starter:
Affected tool:
Deliverable for:

Description

In the test suite we’re using the live Tor network to access real Internet services. Such tests can of course “randomly” fail due to transient network issues, mainly because we pick a bad Tor circuit. Another common source is Tor failing to bootstrap.

Some classes of this issue do not reflect the things we want to test, and are false positives. We need to:

  • decide what types of issues we want to treat as false positives,
  • investigate our options for how to deal with them to improve our test suite’s robustness,
  • and decide which approach to take.

Subtasks


Related issues

Related to Tails - Feature #5770: Retry when tor fails Rejected
Related to Tails - Feature #9519: Make the test suite more deterministic through network simulation In Progress 2015-06-02
Related to Tails - Feature #9521: Use the chutney Tor network simulator in our test suite Resolved 2016-04-15
Related to Tails - Feature #9520: Investigate the Shadow network simulator for use in our test suite Rejected 2015-06-02
Related to Tails - Feature #9515: Improve test suite robustness vs transient network errors Resolved 2015-06-02
Related to Tails - Feature #9516: Restart Tor if bootstrapping stalls for too long Resolved 2015-06-02
Related to Tails - Feature #9518: Retry with new OpenPGP key server pool member when they misbehave Resolved 2015-06-02
Blocks Tails - Bug #9072: Pidgin IRC tests often fail due to OFTC Tor blocking Resolved 2015-03-18

History

#1 Updated by anonym 2015-05-27 07:05:18

#2 Updated by anonym 2015-05-27 07:05:36

  • blocks Bug #9072: Pidgin IRC tests often fail due to OFTC Tor blocking added

#3 Updated by anonym 2015-05-27 08:55:33

  • Type of work changed from Research to Discuss

> * decide what types of issues we want to treat as false positives,

I think that we initially at least should take care of:

  • Tor bootstrap issues
  • the Pidgin IRC test failing because Tor is being blocked

More?

> * investigate our options for how to deal with them to improve our test suite’s robustness,

I can think of three approaches here:

1. Retry scenarios that fail during the same run

My thought was to add a tag, e.g. @retry, so that scenarios with that tag would be rerun during the same run upon failure, up to three times or so, via a cucumber Around hook. However, that doesn’t work due to Cucumber’s design, so let’s scrap this idea.

2. Adapt tests to retry upon failure

The idea is that we essentially simulate what a user would do when facing these issues. Let’s look at a few examples:

The Pidgin IRC test: If Tor is blocked, we’d get some kind of “Failed to connect” error shown in the buddy list, which has a “Reconnect” button. A normal user would click that button, and eventually it would work. Since we’d use the same circuit for up to ten minutes, we’d use the same exit node for that duration, so if it’s that specific exit node that’s banned we’d need to retry for at least 10 minutes. I believe it could also be that it’s the particular OFTC server we got that bans all Tor exits, and then I’m less sure what to do. Should we clear the DNS cache to force a new OFTC server?

Tor bootstrap failures: in my experience, restarting Tor is the only thing that works here. Users cannot do that directly, but unplugging and then plugging the network does it.
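
For illustration, here is a rough sketch of what “restart Tor if bootstrapping stalls” could look like; the log path, timeout and restart command are assumptions, not actual test suite code:

tor_log=/var/log/tor/log
timeout=300

wait_for_bootstrap() {
  # Only consider log lines written after line $1, so that a later restart
  # starts from a clean slate.
  offset=$1
  start=$(date +%s)
  until tail -n "+$offset" "$tor_log" | grep -q 'Bootstrapped 100%'; do
    [ $(( $(date +%s) - start )) -ge "$timeout" ] && return 1
    sleep 10
  done
}

offset=$(( $(wc -l < "$tor_log") + 1 ))
if ! wait_for_bootstrap "$offset"; then
  echo "Tor bootstrap stalled for ${timeout}s; restarting Tor" >&2
  offset=$(( $(wc -l < "$tor_log") + 1 ))
  service tor restart   # stand-in for what unplugging/replugging the network achieves
  wait_for_bootstrap "$offset"
fi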

Some drawbacks with this approach are:

  • Increased code complexity, since each place in the code where these errors can happen requires logic for retrying. The solutions will be very ad-hoc.
  • All logic (initial try + retries) must fit in one step, since we cannot jump back in the scenario and rerun previous steps.

3. Rerun test suite on failing scenarios only

Cucumber has the rerun formatter, which outputs references to the failed scenarios in a format suitable for feeding back to cucumber, so that only the failed scenarios are rerun. Hence we can do something like:

retries=0
max_retries=3
./run_test_suite $args --format rerun --out reruns.txt
status=$?
while [ "$status" -ne 0 ] && [ "$retries" -lt "$max_retries" ]; do
  retries=$((retries + 1))
  # here we make sure that $args doesn't include any .feature files; we only want to
  # re-run the scenarios in reruns.txt
  ./run_test_suite $args --format rerun --out reruns.txt @reruns.txt
  status=$?
done

This seems like something that would fit into our CI infrastructure (i.e. Jenkins) but probably not when running the test suite yourself. For instance, it would not be very helpful when developing new tests. Possibly it could be made into an option for run_test_suite instead.

Since we are in general only interested in failures (at least in the CI context), it actually fits pretty well that the reruns.txt from the last iteration only includes failures, if any.

A drawback is that this will indiscriminately rerun all failed tests, not just those that failed due to transient network errors (and no, we cannot tag scenarios so only they are affected). Hence if a change is made that introduces a genuine (and probably deterministic) reason for tests to fail, it will be needlessly re-verified two additional times => increased run time.

> * and decide which approach to take.

Thoughts so far?

#4 Updated by anonym 2015-05-27 11:38:09

I guess there’s also an approach 4:

Simulate the Internet and the Tor network and run all services on the testing host

We could run Shadow with its Tor plugin (or a similar network topology simulator, if there is one that is better supported). We’d also revive Feature #6300 or something similar.

This should eliminate all non-determinism, increase performance a lot, and allow running the test suite offline, all of which sounds pretty awesome. I guess we could still keep a few tests which use the real Internet, and only run them for release testing.

The drawbacks are:

  • We’d have to reconfigure the Tor client to use this fake network. All such reconfigurations are of course bad, since they make the system under test deviate from the real situation. However, since we are already using a different Tor network, we’ve deviated quite a lot already. This would be somewhat alleviated with the release tests using the real Internet + Tor network.
  • Similar to the above point, we may have to reconfigure other parts of Tails (security/upgrade check, incremental upgrades, Tor Browser home page, APT repos, whisperback, probably more stuff)
  • We probably cannot do anything for I2P.
  • This is a huge project.

The vision of a deterministic test suite is very attractive though. This is something I possibly could consider working on in late 2016, or more likely 2017, or maybe even later. Of course, we need to make our test suite a lot more robust than it currently is way before that, so if we want this it’d be a long-term goal, with some other approach as an interim solution in the places where it hurts the most.

#5 Updated by intrigeri 2015-05-28 12:58:30

> I think that we initially at least should take care of:
> […]
> More?

Yes:

  • Buggy servers in the hkps:// pool (I still believe that “we” should get in touch with the people who manage this pool, and encourage them to monitor it more closely and move failing servers out of the pool faster, but until this is done we’ll have to cope with the fact that we sometimes hit the wrong server).

Nothing else I can think of right now. Time will tell: once we start addressing this class of problems, we’ll stop implicitly ignoring them, and then I’m confident we’ll find more stuff to add to the list. (Same process / mindset change that happened for increasing the general robustness of the test suite.)

> The idea is that we essentially simulate what a user would do when facing these issues. Let’s look at a few examples:

> The Pidgin IRC test: If Tor is blocked, we’d get some kind of “Failed to connect” error shown in the buddy list, which has a “Reconnect” button. A normal user would click that button, and eventually it would work. Since we’d use the same circuit for up to ten minutes, we’d use the same exit node for that duration, so if it’s that specific exit node that’s banned we’d need to retry for at least 10 minutes. I believe it could also be that it’s the particular OFTC server we got that bans all Tor exits, and then I’m less sure what to do.

I believe a power-user would use Vidalia’s “New identity” feature, wait for the corresponding notification, and try to reconnect a few times. It’s less realistic than repeatedly clicking “reconnect” for 10 minutes, in terms of emulating average user behaviour, but I think it’s realistic enough, and saves up to 10 minutes of test suite runtime. (I hope that this would work, but I’m not sure. It might be that Pidgin is keeping the TCP connection open, and then we indeed won’t get a new circuit. But my personal experience seems to indicate that in practice it’ll work.)

> Should we clear the DNS cache to force a new OFTC server?

AFAIK we have no DNS cache anymore. Did I miss something?

> Tor bootstrap failures: in my experience, restarting Tor is the only thing that works here. Users cannot do that directly, but unplugging and then plugging the network does it.

I’ve seen actual users disconnect/reconnect so it’s not too far from reality. Not sure if we document that somewhere — if not, perhaps we should.

> Some drawbacks with this approach are:

> * Increased code complexity, since each place in the code where these errors can happen requires logic for retrying. The solutions will be very ad-hoc.

I understand this concern in theory, but so far we’ve only identified three places that need ad-hoc solutions. And presumably, at least one part of the 2 or 3 corresponding solutions (e.g. the “New identity” business) would work for most other similar problems, so the logic can perhaps be factored out and reused. Now, of course some retrying code will be very ad-hoc (e.g. Pidgin’s “reconnect” button), and I have no idea what the duplicated code that calls to the factorized common bits would look like, so it may be that I’m reasoning under totally wrong assumptions => I’m tempted to just trust anonym’s intuition on this topic.

> 3. Rerun test suite on failing scenarios only
> […]
> This seems like something that would fit into our CI infrastructure (i.e. jenkins)

I’m not so sure about it: e.g. I’ve no idea how Jenkins (and its Cucumber plugins) would digest merged/concatenated .json output that runs the same scenario multiple times, which I think is what we need (see below).

> > Since we are in general only interested in failures (at least in the CI context), it actually fits pretty well that the reruns.txt from the last iteration only includes failures, if any.

I’m not convinced: we’re also interested in success (for stats and trend graphing) and runtime information.

> A drawback is that this will indiscriminately rerun all failed tests, not just those that failed due to transient network errors (and no, we cannot tag scenarios so only they are affected). Hence if a change is made that introduces a genuine (and probably deterministic) reason for tests to fail, it will be needlessly re-verified two additional times => increased run time.

Indeed, it will hide tests that fail sometimes for reasons that have nothing to do with transient network problems, e.g. fragile tests (that should be made more robust) and bugs caused by race conditions (that should be fixed in Tails itself).

#6 Updated by intrigeri 2015-05-28 13:03:37

> Simulate the Internet and the Tor network and run all services on the testing host

Holy crap, you did it! :)

> The drawbacks are:

> * We’d have to reconfigure the Tor client to use this fake network. All such reconfigurations are of course bad, since they make the system under test deviate from the real situation. However, since we are already using a different Tor network, we’ve deviated quite a lot already. This would be somewhat alleviated with the release tests using the real Internet + Tor network.

Seems totally acceptable to me.

> * Similar to the above point, we may have to reconfigure other parts of Tails (security/upgrade check, incremental upgrades, Tor Browser home page, APT repos, whisperback, probably more stuff)

+ OpenPGP keyserver, check.torproject.org, etc.

Indeed, that’s a crazy lot of stuff. And not only do we need to reconfigure it in the system under test, but we also need to create good enough mockups of the functionality currently provided by the remote services we want to avoid hitting.

I think it would be much more realistic to emulate the Tor network only, while still relying on real online services when we need them. At least it would be a pretty good first step.

> This is something I possibly could consider working on in late 2016 or more likely 2017 or maybe even later.

OK, cool. Please add it to the roadmap brainstorming part of the summit agenda, then :)

#7 Updated by anonym 2015-05-28 15:27:29

intrigeri wrote:
> > I think that we initially at least should take care of:
> > […]
> > More?
>
> Yes:
>
> * Buggy servers in the hkps:// pool (I still believe that “we” should get in touch with the people who manage this pool, and encourage them to monitor it more closely and move failing servers out of the pool faster, but until this is done we’ll have to cope with the fact that we sometimes hit the wrong server).

ACK. If it’s easy to make our test retry (and force a new server — I think this is related to the “DNS cache” issue below) we may consider that in the meantime too.

> Nothing else I can think of right now. Time will tell: once we start addressing this class of problems, we’ll stop implicitly ignoring them, and then I’m confident we’ll find more stuff to add to the list. (Same process / mindset change that happened for increasing the general robustness of the test suite.)

Agreed.

> > The idea is that we essentially simulate what a user would do when facing these issues. Let’s look at a few examples:
>
> > The Pidgin IRC test: […]
>
> I believe a power-user would use Vidalia’s “New identity” feature, wait for the corresponding notification, and try to reconnect a few times. It’s less realistic than repeatedly clicking “reconnect” for 10 minutes, in terms of emulating average user behaviour, but I think it’s realistic enough, and saves up to 10 minutes of test suite runtime. (I hope that this would work, but I’m not sure. It might be that Pidgin is keeping the TCP connection open, and then we indeed won’t get a new circuit. But my personal experience seems to indicate that in practice it’ll work.)
>
> > Should we clear the DNS cache to force a new OFTC server?
>
> AFAIK we have no DNS cache anymore. Did I miss something?

Well, I think there’s some DNS caching at play somewhere here, either internally in Pidgin (or it’s that TCP connections are kept alive, like you suggest), or in Tor’s DNS resolver. Isn’t that evident from Pidgin’s behaviour when you’re banned and simply click “Reconnect” => you get the same pool member from irc.oftc.net that has banned you? Perhaps I’m confused.

I guess we can try the “New identity”, wait for Tor to have established the new identity, click “Reconnect” approach you suggest and see how it works, and if it doesn’t, get to the bottom of how all this works in detail.
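
For the record, a minimal sketch of what such a retry loop could look like, talking to Tor’s control port directly instead of going through Vidalia’s UI; the port, the lack of control port authentication, the retry count and the try_pidgin_reconnect helper are all assumptions:

for attempt in 1 2 3; do
  if try_pidgin_reconnect; then   # assumed helper that clicks "Reconnect" and
    exit 0                        # waits for the buddy list to show us as online
  fi
  # Ask Tor for new circuits, roughly what Vidalia's "New identity" does.
  printf 'AUTHENTICATE ""\r\nSIGNAL NEWNYM\r\nQUIT\r\n' | nc 127.0.0.1 9051
  sleep 10                        # give Tor some time before the next attempt
done
exit 1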

> > Tor bootstrap failures: in my experience, restarting Tor is the only thing that works here. Users cannot do that directly, but unplugging and then plugging the network does it.
>
> I’ve seen actual users disconnect/reconnect so it’s not too far from reality. Not sure if we document that somewhere — if not, perhaps we should.

Ok. I guess we could simply restart Tor and then not risk issues with the stalled NetworkManager hooks from the first connection that will not disappear just because we (dis|re)connect. I think the result is that they get run twice in a row (in order) which shouldn’t be a problem, since it apparently works for our users, but let’s not risk introducing more instability, IMHO.

> > Some drawbacks with this approach are:
>
> > * Increased code complexity, since each place in the code where these errors can happen requires logic for retrying. The solutions will be very ad-hoc.
>
> I understand this concern in theory, but so far we’ve only identified three places that need ad-hoc solutions. And presumably, at least one part of the 2 or 3 corresponding solutions (e.g. the “New identity” business) would work for most other similar problems, so the logic can perhaps be factored out and reused. Now, of course some retrying code will be very ad-hoc (e.g. Pidgin’s “reconnect” button), and I have no idea what the duplicated code that calls to the factorized common bits would look like, so it may be that I’m reasoning under totally wrong assumptions => I’m tempted to just trust anonym’s intuition on this topic.

Let’s say I’m just trying to be overly cautious so we don’t end up in a bad place complexity-wise in a year or so, as we discover more of these issues. As we know, the test suite is already quite complex, and there’s already plenty of material for a few WTF moments when reading the step definitions.

> > 3. Rerun test suite on failing scenarios only
> > […]
> > This seems like something that would fit into our CI infrastructure (i.e. jenkins)
>
> I’m not so sure about it: e.g. I’ve no idea how Jenkins (and its Cucumber plugins) would digest merged/concatenated .json output that runs the same scenario multiple times, which I think is what we need (see below).

Right. While looking into this, I saw that people used it in combination with Jenkins but…

> > Since we are in general only interested in failures (at least in the CI context), it actually fits pretty well that the reruns.txt from the last iteration only includes failures, if any.
>
> I’m not convinced: we’re also interested in success (for stats and trend graphing) and runtime information.

… indeed, I now found that the solution is really messy (unless there are others).

> > A drawback is that this will indiscriminately rerun all failed tests, not just those that failed due to transient network errors (and no, we cannot tag scenarios so only they are affected). Hence if a change is made that introduces a genuine (and probably deterministic) reason for tests to fail, it will be needlessly re-verified two additional times => increased run time.
>
> Indeed, it will hide tests that fail sometimes for reasons that have nothing to do with transient network problems, e.g. fragile tests (that should be made more robust) and bugs caused by race conditions (that should be fixed in Tails itself).

Yes, that’s very true. Let’s not go do this.

intrigeri wrote:
> > Simulate the Internet and the Tor network and run all services on the testing host
>
> Holy crap, you did it! :)

I’ve seen this one coming for a while… :) Now, after looking at the actual drawbacks and thinking a bit more about them, I’m not quite as skeptical. Especially if we have @release tests that run using the real Internet + Tor network for some carefully selected tests, so we cover most (or at least the most critical) of the parts that we deviate on when using the configuration needed for the simulator.

> > The drawbacks are:
[…]
> > * Similar to the above point, we may have to reconfigure other parts of Tails (security/upgrade check, incremental upgrades, Tor Browser home page, APT repos, whisperback, probably more stuff)
>
> + OpenPGP keyserver, check.torproject.org, etc.
>
> Indeed, that’s a crazy lot of stuff. And not only do we need to reconfigure it in the system under test, but we also need to create good enough mockups of the functionality currently provided by the remote services we want to avoid hitting.
>
> I think it would be much more realistic to emulate the Tor network only, while still relying on real online services when we need them. At least it would be a pretty good first step.

That’s an excellent idea (assuming Shadow supports its exit nodes communicating with the real network and not just Shadow’s fake one)! That would of course only take care of transient errors due to Tor, but it covers both instability and exit node blocking. E.g. OFTC blocking would be solved, but crappy OpenPGP key servers would still be problematic.

Unfortunately, after looking into how to set up Shadow, I noticed a few more drawbacks:

  • It’s not packaged in Debian, and building it seems a bit painful (it depends on both the gcc and clang/llvm toolchains, for instance).
  • When using the Tor network simulator, Tor needs to be built against patched openssl and libevent libraries. Hopefully that only applies to the network nodes and not to the client; otherwise we’d have to hot-patch the Tor inside Tails => more deviation from “real” Tails, and more pain since we’d need to keep this patched Tor client strictly up-to-date with the Tor client version expected in the branch we are testing (urgh!).

> > This is something I possibly could consider working on in late 2016 or more likely 2017 or maybe even later.
>
> OK, cool. Please add it to the roadmap brainstorming part of the summit agenda, then :)

Done.

So, to summarize, a possible plan could be:

  1. Short-term: Add ad-hoc workarounds for retrying tests (in a manner more or less like a user would) that we know are prone to transient network issues.
  2. Mid-term: Use Shadow to simulate the Tor Network to eliminate transient errors due to Tor instability and Tor exit node blocking.
  3. Long-term: Use Shadow to simulate the rest of the Internet as well, and run all expected services on the host to eliminate transient errors due to issues with the end-points.

#8 Updated by intrigeri 2015-05-28 16:38:19

> I guess we can try the “New identity”, wait for Tor to have established the new identity, click “Reconnect” approach you suggest and see how it works, and if it doesn’t, get to the bottom of how all this works in detail.

Sounds like a good strategy. I feel no urge to go check in Pidgin’s source code whether it lets the SOCKS proxy handle name resolution or not.

>> > Tor bootstrap failures: in my experience, restarting Tor is the only thing that
>> > works here. Users cannot do that directly, but unplugging and then plugging the
>> > network does it.
>>
>> I’ve seen actual users disconnect/reconnect so it’s not too far from reality.
>> Not sure if we document that somewhere — if not, perhaps we should.

> Ok. I guess we could simply restart Tor and then not risk issues with the stalled NetworkManager hooks from the first connection that will not disappear just because we (dis|re)connect. I think the result is that they get run twice in a row (in order) which shouldn’t be a problem, since it apparently works for our users, but let’s not risk introducing more instability, IMHO.

I’m tempted to say that it depends on what we would recommend users to do in such situations:

  • If we want to recommend them to disconnect+reconnect, then we should make it work; at least in the past, when people were asking for a button to restart Tor, our answer has sometimes been “you can disconnect/reconnect, so wontfix”; not sure what’s the current state of the art on this topic. I’ve no idea how NM handles its hooks — would it really let the previous connection’s ones run while starting the same hooks again for the new connection?
  • If we want them to restart Tor, then why not, but then we need to clearly document that this is supported, as opposed to disconnect+reconnect, so they don’t expect the latter to work, if we think it may introduce instability. Also, if we have means to detect specific problems that can be fixed by restarting Tor, then IMO it should somehow be implemented in Tails, not as a test-suite-specific improvement.

> Let’s say I’m just trying to be overly cautious so we don’t end up at a bad place complexity-wise in a year or so, […]

Got it, and appreciated :)

>> > 3. Rerun test suite on failing scenarios only

> … indeed, I now found that the solution is really messy (unless there are others).

Wow, indeed I don’t want to be involved in any such “solution”.

>> Indeed, it will hide tests that fail sometimes for reasons that have nothing to do with transient network problems, e.g. fragile tests (that should be made more robust) and bugs caused by race conditions (that should be fixed in Tails itself).

> Yes, that’s very true. Let’s not go do this.

OK, that’s one less option then. Makes things simpler.

>> I think it would be much more realistic to emulate the Tor network only, while still relying on real online services when we need them. At least it would be a pretty good first step.

> That’s an excellent idea (assuming Shadow supports its exit nodes communicating with the real network and not just Shadow’s fake one)!

:)

> That would of course only take care of transient errors due to Tor, but it covers both instability and exit node blocking. E.g. OFTC blocking would be solved, but crappy OpenPGP key servers would still be problematic.

True.

> Unfortunately, after looking into how to set up Shadow, I noticed a few more drawbacks:

Wow :/

I think there’s another Tor simulator that I hear about more often these days. Might it be called chutney, or similar?

> So, to summarize, a possible plan could be:

I like it.

#9 Updated by anonym 2015-05-29 11:01:46

intrigeri wrote:
> >> > Tor bootstrap failures: in my experience, restarting Tor is the only thing that
> >> > works here. Users cannot do that directly, but unplugging and then plugging the
> >> > network does it.
> >>
> >> I’ve seen actual users disconnect/reconnect so it’s not too far from reality.
> >> Not sure if we document that somewhere — if not, perhaps we should.
>
> > Ok. I guess we could simply restart Tor and then not risk issues with the stalled NetworkManager hooks from the first connection that will not disappear just because we (dis|re)connect. I think the result is that they get run twice in a row (in order) which shouldn’t be a problem, since it apparently works for our users, but let’s not risk introducing more instability, IMHO.
>
> I’m tempted to say that it depends on what we would recommend users to do in such situations:
>
> * If we want to recommend them to disconnect+reconnect, then we should make it work; at least in the past, when people were asking for a button to restart Tor, our answer has sometimes been “you can disconnect/reconnect, so wontfix”; not sure what’s the current state of the art on this topic. I’ve no idea how NM handles its hooks — would it really let the previous connection’s ones run while starting the same hooks again for the new connection?
> * If we want them to restart Tor, then why not, but then we need to clearly document that this is supported, as opposed to disconnect+reconnect, so they don’t expect the latter to work, if we think it may introduce instability. Also, if we have means to detect specific problems that can be fixed by restarting Tor, then IMO it should somehow be implemented in Tails, not as a test-suite-specific improvement.

Indeed, if there is a problem with reconnecting before the NM hooks finish (I’m not sure) it’s of course a real issue in Tails and we should reconsider our recommendations. I don’t think it will be easy to fix, except by migrating all that to systemd.

For the test suite my take is that, for the sake of robustness, I prefer to have a dedicated test for things that have the potential to be problematic like this, and to use safer methods in the general case. Compare to the issue we had with the “Tor is ready” notification. So in this instance I think we should just restart Tor, plus add a test for reconnecting the network before Tor has finished, and then wait for all hooks to finish, and make sure things are sane (w.r.t. time syncing, Vidalia starting, the upgrade check, etc.).

> I think there’s another Tor simulator that I hear about more often these days. Might it be called chutney, or similar?

Yes, it’s probably chutney. However, it doesn’t look as polished as Shadow (the first line of the README isn’t very encouraging: “This is chutney. It doesn’t do much so far. It isn’t ready for prime-time.”). It is, however, dirt simple to set up, which is nice:

git clone https://git.torproject.org/chutney.git
cd chutney
./chutney configure networks/basic
./chutney start networks/basic


And that’s literally it for this simple setup (I guess we want a slightly bigger network, plus bridges, which seems doable even if we have to make some templates for the latter ourselves). :) I could use the two clients that networks/basic defines, and the traffic would exit from my computer as expected, which is what we want for the mid-term goal. I suspect chutney will be trivial to package for Debian, as it only depends on python (2.7+). Yay!
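
As a very rough sketch of the next step (making the Tails Tor client join this network), the DirAuthority lines that chutney generates for its own nodes would presumably have to be reused; note that the net/nodes/ layout below is an assumption from a quick look, not something I have verified:

# Assumption: chutney stores each node's data (including its generated torrc)
# under net/nodes/ in the chutney checkout.
grep -h '^DirAuthority' net/nodes/*/torrc | sort -u
# These lines (plus something like "TestingTorNetwork 1") would then have to be
# injected into the torrc of the system under test, which is exactly the kind
# of deviation discussed earlier in this ticket.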

Barring any issues due to its supposed immaturity, it seems chutney would indeed work well for the mid-term goal. As for the long-term goal, chutney clearly isn’t designed to “simulate the Internet” like Shadow is. I don’t know how much that matters, though. It seems fairly easy to set up a virtual network where the services we need would run directly on the testing host, and the exits would reach them. However, I wonder how that will work if that network is using private IP space. Relays/exits will work fine once we set ExitPolicyRejectPrivate 0, so that’s fine, but the Tails client will go ballistic if it resolves a domain to a private address, right (that’s a feature of Tor, IIRC)? And perhaps the differences in how resolving works in SOCKS4 vs 5 will cause problems (just brainstorming). Perhaps we can pick some random non-private IP range and use it in the local network, and play with the testing host’s routing table, to work around this? Network namespaces can probably be useful.
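
To make that last idea a bit more concrete, here is a hedged sketch of how network namespaces and a reserved, non-private range could be used (203.0.113.0/24 is the TEST-NET-3 documentation block; all names and addresses are made-up examples, and this would need to run as root):

# Isolated namespace where the mock services would live.
ip netns add fakenet
ip link add veth-host type veth peer name veth-fake
ip link set veth-fake netns fakenet
# Use a reserved, non-private range so the Tor client doesn't balk at it.
ip addr add 203.0.113.1/24 dev veth-host
ip netns exec fakenet ip addr add 203.0.113.2/24 dev veth-fake
ip link set veth-host up
ip netns exec fakenet ip link set veth-fake up
# Stand-in for e.g. a keyserver, listening inside the namespace; chutney exits
# running on the host could then reach it at 203.0.113.2.
ip netns exec fakenet python -m SimpleHTTPServer 8080 &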

It should be noted that I have no idea if Shadow actually solves this better given our requirements. Also, this presumed advantage of Shadow should be weighed against the advantages of chutney. Hm. Not having to patch the Tor client + super easy setup sounds really compelling, even if we have to do some custom tricks for the long-term “simulate the Internet” goal. IMHO, chutney actually looks like the better option.

#10 Updated by intrigeri 2015-05-29 11:13:04

> Indeed, if there is a problem with reconnecting before the NM hooks finish (I’m not sure) it’s of course a real issue in Tails and we should reconsider our recommendations. I don’t think it will be easy to fix, except by migrating all that to systemd.

Agreed.

> For the test suite my take is that, for the sake of robustness, I prefer to have a dedicated test for things that have the potential to be problematic like this, and to use safer methods in the general case.

This pragmatic approach makes a lot of sense.

(Not replying to the bits about chutney/shadow here, since IMO that should be moved to a dedicated ticket.)

#11 Updated by anonym 2015-05-29 15:23:36

I started thinking about how to split this, but don’t have time now I realize. Currently I’m thinking like this:

Short-term:

  • Improve test suite robustness vs transient network errors
    - Restart Tor if bootstrapping stalls for too long
    - Retry with new OpenPGP key server pool member when they misbehave
    - Retry with new Tor circuit when OFTC blocks Tor

Mid- and long-term:

  • Make the test suite more deterministic through network simulation
    - Investigate the chutney Tor network simulator
    - Investigate the Shadow network simulator

Please let me know if you feel something is missing or poorly worded.

#12 Updated by kytv 2015-06-02 12:34:51

anonym wrote:
> I started thinking about how to split this, but don’t have time now I realize. Currently I’m thinking like this:
>
> Short-term:
> * Improve test suite robustness vs transient network errors
> - Restart Tor if bootstrapping stalls for too long
> - Retry with new OpenPGP key server pool member when they misbehave
> - Retry with new Tor circuit when OFTC blocks Tor
>
> Mid- and long-term:
> * Make the test suite more deterministic through network simulation
> - Investigate the chutney Tor network simulator
> - Investigate the Shadow network simulator
>
> Please let me know if you feel something is missing or poorly worded.

I like it. ACK

#13 Updated by anonym 2015-06-02 14:08:24

  • related to Feature #9515: Improve test suite robustness vs transient network errors added

#14 Updated by anonym 2015-06-02 14:17:24

  • related to Feature #9519: Make the test suite more deterministic through network simulation added

#15 Updated by anonym 2015-06-02 14:23:29

  • related to Feature #9521: Use the chutney Tor network simulator in our test suite added

#16 Updated by anonym 2015-06-02 14:23:31

  • related to Feature #9520: Investigate the Shadow network simulator for use in our test suite added

#17 Updated by anonym 2015-06-02 14:36:28

  • related to Feature #9516: Restart Tor if bootstrapping stalls for too long added

#18 Updated by anonym 2015-06-02 14:36:32

  • related to Feature #9518: Retry with new OpenPGP key server pool member when they misbehave added

#19 Updated by anonym 2015-06-02 14:36:35

  • related to Feature #9517: Retry connecting to OFTC when it fails added

#20 Updated by anonym 2015-06-02 14:37:46

  • Status changed from Confirmed to Resolved
  • Assignee deleted (anonym)
  • Target version changed from Tails_1.5 to Tails_1.4.1
  • % Done changed from 0 to 100

I’ve opened tickets Feature #9515 through Feature #9521.

#21 Updated by intrigeri 2015-06-04 21:24:21

  • related to deleted (Feature #9515: Improve test suite robustness vs transient network errors)

#22 Updated by intrigeri 2015-06-04 21:25:29

  • related to deleted (Feature #9516: Restart Tor if bootstrapping stalls for too long)

#23 Updated by intrigeri 2015-06-04 21:25:39

  • related to deleted (Feature #9517: Retry connecting to OFTC when it fails)

#24 Updated by intrigeri 2015-06-04 21:25:55

  • related to deleted (Feature #9518: Retry with new OpenPGP key server pool member when they misbehave)

#25 Updated by intrigeri 2015-06-04 21:27:57

  • related to Feature #9515: Improve test suite robustness vs transient network errors added

#26 Updated by intrigeri 2015-06-04 21:30:00

  • related to Feature #9516: Restart Tor if bootstrapping stalls for too long added

#27 Updated by intrigeri 2015-06-04 21:30:43

  • related to Feature #9518: Retry with new OpenPGP key server pool member when they misbehave added