Bug #10494

Retry htpdate when it fails

Added by anonym 2015-11-06 10:20:17. Updated 2019-04-13 07:14:07.

Status:
Rejected
Priority:
Normal
Assignee:
Category:
Time synchronization
Target version:
Start date:
2016-07-17
Due date:
% Done:

100%

Feature Branch:
bugfix/10494-retry-curl-in-htpdate
Type of work:
Code
Blueprint:

Starter:
Affected tool:
Deliverable for:

Description

After Feature #9516 our automated test suite has started reporting a lot of failures of this kind:

    ./features/step_definitions/common_steps.rb:344:in `/^the time has synced$/'

because the number of Tor bootstrapping issues (which generally happen right before the point where the above error may appear) has decreased.

Just like with Feature #9516, this is an actual problem in Tails, and not just a test suite issue. We should add some retrying logic around htpdate, possibly with a retry_for()-based shell wrapper, or implement that type of retry logic (retries + timeouts) in htpdate’s Perl code.
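
For illustration, here is a minimal sketch of what such a shell wrapper could look like; the paths, the success-file convention and the retry count are assumptions made for this sketch, not the actual Tails implementation:

    #!/bin/sh
    # Hypothetical retrying wrapper around htpdate (illustration only).
    HTPDATE=/usr/local/sbin/htpdate
    SUCCESS_FILE=/var/run/htpdate/success   # assumed "time synced" marker
    MAX_ATTEMPTS=5
    DELAY=10                                # seconds between attempts

    attempt=1
    while [ "$attempt" -le "$MAX_ATTEMPTS" ]; do
        if "$HTPDATE" "$@" && [ -e "$SUCCESS_FILE" ]; then
            exit 0
        fi
        echo "htpdate attempt $attempt/$MAX_ATTEMPTS failed, retrying in ${DELAY}s" >&2
        attempt=$((attempt + 1))
        sleep "$DELAY"
    done

    echo "htpdate failed after $MAX_ATTEMPTS attempts" >&2
    exit 1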


Subtasks

Bug #11575: Reintroduce and fix CapabilityBoundingSet in htpdate.service Rejected



Related issues

Related to Tails - Bug #11577: Fix htpdate pool and timeout Resolved 2016-07-18
Related to Tails - Bug #11562: Monitor servers from the htpdate pools Confirmed 2016-07-14
Blocks Tails - Bug #10495: The 'the time has synced' step is fragile In Progress 2015-11-06
Blocked by Tails - Bug #11574: Investigate security consequences of restarting htpdate until it succeeds Rejected 2016-07-17

History

#1 Updated by anonym 2015-11-06 10:20:37

  • related to #8668 added

#2 Updated by anonym 2015-11-06 10:20:41

  • related to deleted (#8668)

#3 Updated by anonym 2015-11-06 10:22:10

  • blocks #8668 added

#4 Updated by anonym 2015-11-06 10:22:24

  • blocks Bug #10495: The 'the time has synced' step is fragile added

#5 Updated by anonym 2015-11-06 10:23:51

  • related to Feature #9516: Restart Tor if bootstrapping stalls for too long added

#6 Updated by anonym 2015-11-06 10:24:28

  • Assignee set to kytv
  • Target version set to Tails_1.8
  • Deliverable for set to 270

#7 Updated by intrigeri 2015-12-05 05:46:50

On feature/jessie it might be simple to do that with tweaks to the systemd unit file (config/chroot_local-includes/lib/systemd/system/htpdate.service). I think this should be investigated before producing more Wheezy-era code.

#8 Updated by intrigeri 2015-12-05 15:28:54

  • Target version changed from Tails_1.8 to Tails_2.0

Better postpone this and focus on your more urgent tasks :)

#9 Updated by intrigeri 2015-12-19 10:29:52

It would be fine to postpone this. In any case, please prioritize your SponsorsM4 stuff (Icedove) higher.

#10 Updated by intrigeri 2016-01-04 13:17:58

  • Target version changed from Tails_2.0 to Tails_2.2

I think it’s too late to change such risky bits of our code base now for 2.0 => postponing.

#11 Updated by intrigeri 2016-02-05 13:48:26

  • Target version changed from Tails_2.2 to Tails_2.4

intrigeri wrote:
> Better postpone this and focus on your more urgent tasks :)

I think this is still valid.

#12 Updated by bertagaz 2016-03-03 15:42:35

  • Assignee changed from kytv to bertagaz

#13 Updated by anonym 2016-03-03 16:02:40

  • Assignee changed from bertagaz to anonym

Hopefully Chutney (Feature #9521) will fix the tordate parts.

#14 Updated by anonym 2016-03-03 16:04:03

  • related to Bug #10238: Investigate why Tor sometimes is slow at or even fails bootstrapping added

#15 Updated by anonym 2016-03-03 16:07:48

  • Assignee changed from anonym to bertagaz

Oops! Wrong ticket! :S

#16 Updated by anonym 2016-03-03 16:09:50

  • related to deleted (Bug #10238: Investigate why Tor sometimes is slow at or even fails bootstrapping)

#17 Updated by bertagaz 2016-06-07 12:31:36

  • Target version changed from Tails_2.4 to Tails_2.5

#18 Updated by bertagaz 2016-06-25 04:42:31

  • Status changed from Confirmed to In Progress

intrigeri wrote:
> On feature/jessie it might be simple to do that with tweaks to the systemd unit file (config/chroot_local-includes/lib/systemd/system/htpdate.service). I think this should be investigated before producing more Wheezy-era code.

I’ve seen this recently while running the Feature #6309 test. It seems to be one of the most important transient network problems left.

I’ve tried to figure out how to implement a restart with the systemd unit file if the service doesn’t start correctly, but it seems we can’t use Restart=on-failure with a service of Type=oneshot. So I’m unsure how to deal with that using systemd. Maybe another service that waits for some time to see if the success file appears, and restarts htpdate if it hasn’t after a certain time? Any other idea?
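
For what it’s worth, a rough sketch of that watcher idea; the success file path, unit name, polling interval and timeout are assumptions made for this sketch, not actual Tails code:

    #!/bin/sh
    # Hypothetical watcher: restart htpdate if its success file has not
    # appeared after a timeout (illustration only).
    SUCCESS_FILE=/var/run/htpdate/success   # assumed success marker
    TIMEOUT=300                             # seconds before forcing a restart
    waited=0

    while [ ! -e "$SUCCESS_FILE" ]; do
        if [ "$waited" -ge "$TIMEOUT" ]; then
            systemctl restart htpdate.service
            waited=0
        fi
        sleep 10
        waited=$((waited + 10))
    done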

#19 Updated by bertagaz 2016-06-27 05:47:03

bertagaz wrote:
> intrigeri wrote:
> > On feature/jessie it might be simple to do that with tweaks to the systemd unit file (config/chroot_local-includes/lib/systemd/system/htpdate.service). I think this should be investigated before producing more Wheezy-era code.
>
> I’ve seen this recently while running the Feature #6309 test. It seems to be one of the most important transient network problems left.
>
> I’ve tried to figure out how to implement a restart with the systemd unit file if the service doesn’t start correctly, but it seems we can’t use Restart=on-failure with a service of Type=oneshot. So I’m unsure how to deal with that using systemd. Maybe another service that waits for some time to see if the success file appears, and restarts htpdate if it hasn’t after a certain time? Any other idea?

OTOH I wonder whether using this systemd restarting workaround is not just an ugly hack around a problem that lies in the htpdate code itself.

I’ve made some manual tests of htpdate in a Tails 2.4 without network.

Here’s how I saw it running:

  • Send a first request to one http server of each pool.
  • Wait two minutes.
  • Send a new request to one new http server of each pool.
  • Wait two minutes.

It does this loop 5 times (until allowed_per_pool_failure_ratio is reached), effectively running for 10 minutes each time I ran it.

Now, in the test suite we wait five minutes for it to complete, so effectively htpdate only has time to send 2 batches of requests to gather the time, which is not much on a not-so-reliable network.

We could use the --connect-timeout option of curl to make sure htpdate sends requests a bit more often, and does not wait two minutes for an answer to each request.

If we follow the test suite limit and want htpdate to run 5 minutes max (or do we want htpdate to run indefinitely until it succeeds?), then with a connection timeout of 30 seconds we’d have to make it retry new servers 10 times. This could be achieved by raising allowed_per_pool_failure_ratio and/or by adding new servers to the pools.

I’m testing a branch locally with --connect-timeout 30 and allowed_per_pool_failure_ratio set to 0.5 just to see if it fails less often on my shitty network. Will push it if it seems conclusive so that it gets some testing in Jenkins too.
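
For reference, this is roughly what a single probe with a bounded connection timeout looks like from the command line; the URL and the SOCKS proxy address are assumptions made for this example, the point being only --connect-timeout:

    # Illustration only: fetch the Date header of one server through Tor,
    # giving up on the connection attempt after 30 seconds.
    curl --silent --head \
         --connect-timeout 30 \
         --socks5-hostname 127.0.0.1:9050 \
         https://example.com/ \
      | grep -i '^date:'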

#20 Updated by bertagaz 2016-06-29 06:40:38

  • % Done changed from 0 to 20
  • Feature Branch set to bugfix/10494-retry-htpdate

bertagaz wrote:
>
> I’m testing a branch locally with --connect-timeout 30 and allowed_per_pool_failure_ratio set to 0.5 just to see if it fails less often on my shitty network. Will push it if it seems conclusive so that it gets some testing in Jenkins too.

Did not see htpdate errors at home anymore while running this branch, so pushing to see how it behaves in Jenkins, even if I couldn’t find an occurrence of this error there.

#21 Updated by bertagaz 2016-07-05 09:55:23

bertagaz wrote:
> bertagaz wrote:
> >
> > I’m testing a branch locally with --connect-timeout 30 and allowed_per_pool_failure_ratio set to 0.5 just to see if it fails less often on my shitty network. Will push it if it seems conclusive so that it gets some testing in Jenkins too.
>
> Did not see htpdate errors at home anymore while running this branch, so pushing to see how it behaves in Jenkins, even if I couldn’t find an occurrence of this error there.

10 runs in Jenkins without any time syncing failure. I’ve pushed a bugfix/10494-retry-htpdate-with-more-fragile-test branch which includes more online scenarios (Bug #10444, Bug #10441 and Bug #10496). The current branch does not have that many of them, so let’s use that one as bait for the bug.

I would be interested to know what others think about these new settings.

#22 Updated by bertagaz 2016-07-09 03:06:37

  • Assignee changed from bertagaz to intrigeri
  • QA Check set to Info Needed

bertagaz wrote:
> 10 runs in Jenkins without any time syncing failure. I’ve pushed a bugfix/10494-retry-htpdate-with-more-fragile-test branch which includes more online scenarios (Bug #10444, Bug #10441 and Bug #10496). The current branch does not have that many of them, so let’s use that one as bait for the bug.

And then it failed (here, and at least twice here)… So the timeout trick does not seem to resolve the bug entirely. Still, the reasoning behind it makes sense to me, and at home I’ve seen it help when the network was in bad shape. So the following is still true:

> I would be interested to know what others think about these new settings.

But in the end, it seems we’ll still have to fall back on the implementation proposed by this ticket: restart htpdate if it fails. Three options are possible to implement it:

  1. in htpdate’s Perl code
  2. using systemd features
  3. using a shell wrapper to the service

The first one is out of my league (but maybe not for others?).
I’ve done some research on the second, and it seems systemd’s automatic restart on failure is only possible for services of Type=simple. I’m unsure about this and may need guidance.
The third option sounds like the easiest to me.

Assigning to intrigeri, as he has more insight into systemd and may have input to help here. I’ll keep digging into this option for now.

#23 Updated by bertagaz 2016-07-11 05:22:08

  • Assignee changed from intrigeri to bertagaz
  • % Done changed from 20 to 30
  • QA Check deleted (Info Needed)

bertagaz wrote:
> I’ve done some research on the second, and it seems systemd’s automatic restart on failure is only possible for services of Type=simple. I’m unsure about this and may need guidance.
> The third option sounds like the easiest to me.
>
> Assigning to intrigeri, as he has more insight into systemd and may have input to help here. I’ll keep digging into this option for now.

I did some more research and tryouts on the second option (systemd). Switching to a service of Type=simple does not really work for handling the restarts through systemd (the Restart=on-failure service option). For it to work, it seems we’d have to remove the RemainAfterExit option from the htpdate service, but that breaks it. So I don’t think that’s the way to go.

So I’ve pushed a shell wrapper in the branch (and merged it into the bugfix/10494-retry-htpdate-with-more-fragile-test branch). I’ve also reverted the commit that changed allowed_per_pool_failure_ratio, as I don’t think that’s the source of the problem. I’ve kept the connect-timeout change though; I think it still makes sense.

One thing I didn’t quite get is why the CapabilityBoundingSet option of the service was preventing the wrapper from reading the Tor control auth cookie. I had to remove this option from the service file so that the wrapper was able to send the NEWNYM signal. Maybe someone has a clue?

Works pretty well at home; let’s see how it behaves in Jenkins anyway.
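
For context, requesting a new circuit boils down to authenticating to Tor’s control port with the cookie and sending SIGNAL NEWNYM, which is why the wrapper needs to read that cookie. The cookie path and control port below are assumptions, and Tails actually goes through its own shell library for this:

    # Illustration only: ask Tor for a new circuit via the control port.
    COOKIE=/var/run/tor/control.authcookie   # assumed cookie location
    HEXCOOKIE=$(xxd -p -c 64 "$COOKIE")      # the cookie must be sent hex-encoded
    printf 'AUTHENTICATE %s\r\nSIGNAL NEWNYM\r\nQUIT\r\n' "$HEXCOOKIE" \
        | nc 127.0.0.1 9051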

#24 Updated by bertagaz 2016-07-11 09:40:09

Pushed a scenario testing the restarts of htpdate.

#25 Updated by bertagaz 2016-07-12 09:46:04

I’ve run this scenario 120 times:

    Given I have started Tails from DVD without network and logged in
    When the network is plugged
    And Tor has built a circuit
    Then the time has synced

With an ISO that has the htpdate restarting feature, but not the --connect-timeout option to curl, it failed only twice. Not bad. I added a patch that counts and reports the number of times htpdate is restarted, and saw it happen (up to 4 restarts) and save some runs.

I’m now running the same with an ISO that also has the --connect-timeout trick. Will report on completion.

#26 Updated by intrigeri 2016-07-13 14:00:04

  • blocked by deleted (#8668)

#27 Updated by intrigeri 2016-07-13 14:00:57

  • related to deleted (Feature #9516: Restart Tor if bootstrapping stalls for too long)

#28 Updated by intrigeri 2016-07-13 14:00:59

  • blocked by Bug #10441: Synaptic test is fragile added

#29 Updated by intrigeri 2016-07-13 14:01:41

  • blocks deleted (Bug #10441: Synaptic test is fragile)

#30 Updated by bertagaz 2016-07-14 03:38:08

  • Assignee changed from bertagaz to anonym
  • % Done changed from 30 to 50
  • QA Check set to Ready for QA

I’ve fixed 3 of the URLs in 2 pools that were buggy. I guess it’ll help.

Setting this ticket as RfQA, as its purpose is implemented already, and from my testing it works quite well and certainly ameliorates the situation.

bertagaz wrote:
>
> With an ISO that has the htpdate restarting feature, but not the --connect-timeout option to curl, it failed only twice. Not bad. I added a patch that counts and reports the number of times htpdate is restarted, and saw it happen (up to 4 restarts) and save some runs.
>
> I’m now running the same with an ISO that also has the --connect-timeout trick. Will report on completion.

I’ll report my researches regarding the test suite on Bug #10495, which is the canonical place for that.

#31 Updated by bertagaz 2016-07-14 03:42:58

bertagaz wrote:
> I’ve fixed 3 of the URLs in 2 pools that were buggy. I guess it’ll help.

Note that I’ve created Bug #11562, as I believe we should prevent this from happening.

#32 Updated by bertagaz 2016-07-14 04:57:28

  • related to Feature #9477: Write regression tests and tests for new features added

#33 Updated by bertagaz 2016-07-14 04:57:51

  • related to deleted (Feature #9477: Write regression tests and tests for new features)

#34 Updated by bertagaz 2016-07-14 04:58:11

  • related to Feature #9477: Write regression tests and tests for new features added

#35 Updated by intrigeri 2016-07-15 09:24:18

  • Assignee changed from anonym to bertagaz
  • QA Check changed from Ready for QA to Info Needed

> I’ve fixed 3 of the URLs in 2 pools that were buggy. I guess it’ll help.

Great!

> Setting this ticket as RfQA, as its purpose is implemented already, and from my testing it works quite well and certainly ameliorates the situation.

Now I’m slightly confused. It looks like the test we have in place has identified a bug in Tails (buggy htpdate pools configuration), that was causing some part of the htpdate failures we’ve seen. So I can’t help but ask:

  • is the proposed workaround (retrying htpdate when it fails) needed, given a correct htpdate pool config?
  • isn’t the proposed workaround likely to hide such configuration problems in the future? (I know that we have Bug #11562 now, but still.)

Also, sorry to raise these concerns this late, but I’m worried about the consequences on the security design of the whole thing. I don’t remember the details, but IIRC we had --allowed_per_pool_failure_ratio for a reason; what are the consequences of the proposed change, in terms of increased power for an attacker who blocks htpdate connections to selected servers in these pools, until we use the ones he controls? I didn’t do the maths, but at first glance it seems to me that we might be essentially killing the “if there are too many that fail for any given pool, e.g. because of failed certificate checking or being unreachable, the pool is considered to be potentially compromised” check. I’m not saying all this is bad, but at the very least I’d like to see the design doc updated accordingly, and some indication that the security impact of this change has been thought through.

In passing: MAX_RESTARTS seems to actually specify the number of attempts, so its name is misleading.

About commit:d4bb25b: I’d be sad to drop the protection entirely, after putting some effort myself into setting it up. I’d rather see the list of needed (=> allowed) capabilities updated. Why not just add the other capabilities you need?

retry_htpdate should not be in the $PATH, as it’s clearly not meant to be used by hand.

With all that in mind, how about we take a shortcut and cherry-pick the htpdate configuration bugfix into a dedicated branch based on stable, and merge it ASAP (I can review+merge)? This should 1. improve UX in the real world as soon as 2.5 is out; 2. help answer my question above, about whether we still need a workaround after we’ve fixed the config, thanks to more Jenkins data; 3. give us time to think about the security consequences, without being in a hurry. What do you think?

#36 Updated by intrigeri 2016-07-15 09:31:15

intrigeri wrote:
> With all that in mind, how about we take a shortcut and cherry-pick the htpdate configuration bugfix into a dedicated branch based on stable, and merge it ASAP (I can review+merge)?

IMO that bugfix branch should also have the lowered --connect-timeout, because it seems to make a lot of sense, regardless of restarting htpdate N times or not (in particular given the 5 vs. 10 minutes thing you discovered). Keep in mind that tails.git is not the canonical location for the htpdate script, though: it lives in our htp.git repo, and from there we copy new versions to tails.git when needed.

#37 Updated by bertagaz 2016-07-16 03:10:47

intrigeri wrote:
> Now I’m slightly confused. It looks like the test we have in place has identified a bug in Tails (buggy htpdate pools configuration), that was causing some part of the htpdate failures we’ve seen. So I can’t help but ask:
>
> * is the proposed workaround (retrying htpdate when it fails) needed, given a correct htpdate pool config?

Yes, I think it still makes sense. I don’t think the failures are only due to the 3 faulty HTTP servers in the pool (even if that’s certainly the reason for some of them), but also to Tor bootstrapping problems or a crappy network. These htpdate requests happen very early in the Tails boot process, at a moment when it’s possible that Tor isn’t really ready yet.

> * isn’t the proposed workaround likely to hide such configuration problems in the future? (I know that we have Bug #11562 now, but still.)

Yes, but we can’t say that without it we would notice the problem either. In fact, we already didn’t. That’s why I created Bug #11562.

> Also, sorry to raise these concerns this late, but I’m worried about the consequences on the security design of the whole thing. I don’t remember the details, but IIRC we had --allowed_per_pool_failure_ratio for a reason; what are the consequences of the proposed change, in terms of increased power for an attacker who blocks htpdate connections to selected servers in these pools, until we use the ones he controls? I didn’t do the maths, but at first glance it seems to me that we might be essentially killing the “if there are too many that fail for any given pool, e.g. because of failed certificate checking or being unreachable, the pool is considered to be potentially compromised” check. I’m not saying all this is bad, but at the very least I’d like to see the design doc updated accordingly, and some indication that the security impact of this change has been thought through.

Yes, I had this question in mind; that’s why I asked for input about it (at the worst moment, sadly). On this point I admit I’m not sure. Maybe the pools are a bit short (5 tries per pool before dying is not much) and we should add more URLs to each. But then we introduce more maintenance trouble. I’ll give this more thought, but my brain alone might not be enough to get a real answer. ;)

> In passing: MAX_RESTARTS seems to actually specify the number of attempts, so its name is misleading.

Right, I noticed that but didn’t correct the variable name.

> About commit:d4bb25b: I’d be sad to drop the protection entirely, after putting some effort myself into setting it up. I’d rather see the list of needed (=> allowed) capabilities updated. Why not just add the other capabilities you need?

Well, I’ve tried to add some to the setting, but it didn’t change anything. I think I even tried to add CAP_SYS_ADMIN IIRC. I didn’t find which capability was needed here. Help on this would be welcome.

> retry_htpdate should not be in the $PATH, as it’s clearly not meant to be used by hand.

Good point!

> With all that in mind, how about we take a shortcut and cherry-pick the htpdate configuration bugfix into a dedicated branch based on stable, and merge it ASAP (I can review+merge)? This should 1. improve UX in the real world as soon as 2.5 is out; 2. help answer my question above, about whether we still need a workaround after we’ve fixed the config, thanks to more Jenkins data; 3. give us time to think about the security consequences, without being in a hurry. What do you think?

I can certainly create another branch for that (though I don’t know how to cherry-pick these commits without changing their commit IDs, any idea?). But as you’ve seen in my last note on Bug #10495, I’ve already seen failures despite the pool fix.

#38 Updated by intrigeri 2016-07-16 04:26:56

>> * is the proposed workaround (retrying htpdate when it fails) needed, given a correct htpdate pool config?

> Yes, I think it still makes sense. I don’t think the failures are only due to the 3 faulty HTTP servers in the pool (even if that’s certainly the reason for some of them), but also to Tor bootstrapping problems or a crappy network.

I have the same gut feeling (except for the “Tor bootstrapping problems”, which I think should not be an issue — see below — and even more so on Jenkins with Chutney), but now I’d like to wait a bit and see more data about it. We’ll have it once we have merged the subset of this branch that we’re discussing below :)

> These htpdate requests happen very early in the Tails boot process, at a moment when it’s possible that Tor isn’t really ready yet.

I’m surprised, as I’m pretty sure we’ve set it up so that htpdate is started after wait_for_working_tor. If you really see that happen, then that’s a serious bug. Can you please clarify?

>> […] at the very least I’d like to see the design doc updated accordingly, and some indication that the security impact of this change has been thought through.

> Yes, I had this question in mind; that’s why I asked for input about it (at the worst moment, sadly). On this point I admit I’m not sure. Maybe the pools are a bit short (5 tries per pool before dying is not much) and we should add more URLs to each. But then we introduce more maintenance trouble. I’ll give this more thought, but my brain alone might not be enough to get a real answer. ;)

OK, so for now we’re blocking on the security aspect. I would advise creating a dedicated Research subtask, and postponing the whole thing (except bits discussed below) to 2.6, so that it’s clear what’s the big blocker.

>> In passing: MAX_RESTARTS seems to actually specify the number of attempts, so its name is misleading.

> Right, I noticed that but didn’t correct the variable name.

I suspect that fixing it now would eat less time than the book-keeping needed to remember that we’ll have to fix it later if we take this branch eventually. YMMV :)

>> About commit:d4bb25b: I’d be sad to drop the protection entirely, after putting some effort myself into setting it up. I’d rather see the list of needed (=> allowed) capabilities updated. Why not just add the other capabilities you need?

> Well, I’ve tried to add some to the setting, but it didn’t change anything. I think I even tried to add CAP_SYS_ADMIN IIRC. I didn’t find which capability was needed here. Help on this would be welcome.

I don’t remember how I came up with the initial list exactly, but I certainly did not guess, so there’s surely a way to find out what exact operation is blocked, and from this to infer what capability is needed to unblock it. I’m sorry I don’t remember how :/ … but that’s the way I would do it. I’d say subtask, and don’t bother until we know we really want to do that restart dance (i.e. blocked by the security aspects).

> I can certainly create another branch for that

Yes, please. I’d like to see the HTP pool fixed in 2.5, and it’s quite clear to me that this htpdate restart work won’t make it into 2.5: we need time for the security discussion, and it’s disputable if the change qualifies for a point-release.

> (though I don’t know how to cherry-pick these commits without changing their commit IDs, any idea?).

You can’t, given how these IDs are computed (from data that includes the parent commit ID).

> But as you’ve seen in my last note on Bug #10495, I’ve already seen failures despite the pool fix.

We don’t have to fix all problems in July :)

#39 Updated by bertagaz 2016-07-17 04:17:17

  • blocked by Bug #11574: Investigate security consequences of restarting htpdate until it succeeds added

#40 Updated by bertagaz 2016-07-17 04:44:36

  • Assignee changed from bertagaz to intrigeri
  • QA Check changed from Info Needed to Ready for QA

intrigeri wrote:
> > These htpdate requests happen very early in the Tails boot process, at a moment when it’s possible that Tor isn’t really ready yet.
>
> I’m surprised, as I’m pretty sure we’ve set it up so that htpdate is started after wait_for_working_tor. If you really see that happen, then that’s a serious bug. Can you please clarify?

Well, it seems to me that even if Tor reports having bootstrapped 100%, it still has difficulties providing a functional circuit right after that. But I don’t have data or proof; it’s more of a feeling, and it would need more debugging to see if there’s really a problem here.

> OK, so for now we’re blocking on the security aspect. I would advise creating a dedicated Research subtask, and postponing the whole thing (except bits discussed below) to 2.6, so that it’s clear what’s the big blocker.

So that’s Bug #11574.

> >> In passing: MAX_RESTARTS seems to actually specify the number of attempts, so its name is misleading.
>
> > Right, I noticed that but didn’t correct the variable name.
>
> I suspect that fixing it now would eat less time than the book-keeping needed to remember that we’ll have to fix it later if we take this branch eventually. YMMV :)

commit:2e10750

> >> About commit:d4bb25b: I’d be sad to drop the protection entirely, after putting some effort myself into setting it up. I’d rather see the list of needed (=> allowed) capabilities updated. Why not just add the other capabilities you need?
>
> > Well, I’ve tried to add some to the setting, but it didn’t change anything. I think I even tried to add CAP_SYS_ADMIN IIRC. I didn’t find which capability was needed here. Help on this would be welcome.
>
> I don’t remember how I came up with the initial list exactly, but I certainly did not guess, so there’s surely a way to find out what exact operation is blocked, and from this to infer what capability is needed to unblock it. I’m sorry I don’t remember how :/ … but that’s the way I would do it. I’d say subtask, and don’t bother until we know we really want to do that restart dance (i.e. blocked by the security aspects).

Created Bug #11575 then.

> > I can certainly create another branch for that
>
> Yes, please. I’d like to see the HTP pool fixed in 2.5, and it’s quite clear to me that this htpdate restart work won’t make it into 2.5: we need time for the security discussion, and it’s disputable if the change qualifies for a point-release.
>
> > (though I don’t know how to cherry-pick these commits without changing their commit IDs, any idea?).
>
> You can’t, given how these IDs are computed (from data that includes the parent commit ID).

Right. Pushed bugfix/10494-fix-htpdate-pools

So setting this ticket RfQA, but that’s only for this branch. Once merged, this ticket shouldn’t be closed. Assigning to our 2.5 RM.

> > But as you’ve seen in my last note on Bug #10495, I’ve already seen failures despite the pool fix.
>
> We don’t have to fix all problems in July :)

I’ve seen one of these failures in a new batch of 100 runs, btw.

> retry_htpdate should not be in the $PATH, as it’s clearly not meant to be used by hand.

commit:c15b2c5

> IMO that bugfix branch should also have the lowered --connect-timeout, because it seems to make a lot of sense, regardless of restarting htpdate N times or not (in particular given the 5 vs. 10 minutes thing you discovered).

Agreed, imported into the previously mentioned new branch.

> Keep in mind that tails.git is not the canonical location for the htpdate script, though: it lives in our htp.git repo, and from there we copy new versions to tails.git when needed.

Didn’t remember that. Pushed the timeout change (and fixed an old forgotten one) in this repo.

#41 Updated by bertagaz 2016-07-17 04:44:59

  • Target version changed from Tails_2.5 to Tails_2.6

#42 Updated by bertagaz 2016-07-17 05:13:36

intrigeri wrote:
> intrigeri wrote:
> > With all that in mind, how about we take a shortcut and cherry-pick the htpdate configuration bugfix into a dedicated branch based on stable, and merge it ASAP (I can review+merge)?
>
> IMO that bugfix branch should also have the lowered --connect-timeout, because it seems to make a lot of sense, regardless of restarting htpdate N times or not (in particular given the 5 vs. 10 minutes thing you discovered).

Note that with this timeout patch, a run of htpdate now takes only 2 min 30 s (30 s * 5 tries per pool), so we’re below the 5-minute delay of this try_for.

#43 Updated by intrigeri 2016-07-18 07:23:47

  • related to Bug #11577: Fix htpdate pool and timeout added

#44 Updated by intrigeri 2016-07-18 07:26:19

> Right. Pushed bugfix/10494-fix-htpdate-pools. So setting this ticket RfQA, but that’s only for this branch. Once merged, this ticket shouldn’t be closed. Assigning to our 2.5 RM.

Thanks! I’ve created Bug #11577 to track that other branch, otherwise I’ll become insane trying to follow the Redmine semantics at play.

#45 Updated by intrigeri 2016-07-18 07:26:32

  • Deliverable for changed from 270 to SponsorS_Internal

#46 Updated by intrigeri 2016-07-18 07:26:59

  • Assignee changed from intrigeri to bertagaz
  • QA Check changed from Ready for QA to Dev Needed

#47 Updated by bertagaz 2016-07-19 05:35:18

Now I’m starting to think that rather than using a shell wrapper we should implement the retrying at the curl level. That would cut short the ‘killing the $allowed_per_pool_failure_ratio feature’ debate: if we retry each pool’s HTTP servers several times, then we wouldn’t pick new ones. I don’t know why I didn’t think about that before; maybe a mix of an "oh-their-god-it’s-perl" feeling and a lack of consideration of the security implications of a shell wrapper. But I’m not sure it would be the most robust solution.

I’ll push a new branch implementing the retry this way, so that we start to get some Jenkins stats about such a change and can decide after 2.5 is out.

#48 Updated by bertagaz 2016-07-23 10:25:49

  • related to deleted (Feature #9477: Write regression tests and tests for new features)

#49 Updated by anonym 2016-09-20 16:53:58

  • Target version changed from Tails_2.6 to Tails_2.7

#50 Updated by bertagaz 2016-10-03 09:09:29

bertagaz wrote:
> I’ll push a new branch implementing the retry this way, so that we start to get some Jenkins stats about such a change and can decide after 2.5 is out.

In the end I didn’t do it: after some live discussions, the best way we found to implement this was to do it inside htpdate itself. It should call the curl command several times with the same webserver URIs, and request NEWNYMs in between.

This way it won’t try more webservers than it does now, only the same list multiple times, hence avoiding the question raised in Bug #11574.

This also means the systemd unit changes in htpdate.service that were to be reverted by Bug #11575 are unnecessary, as they won’t be merged anyway.

Next is to implement a retry loop in /usr/local/sbin/htpdate around the curl invocation.
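
A shell pseudo-sketch of that intended control flow; the real implementation lives in htpdate’s Perl, and the function bodies below are placeholders, not Tails code:

    #!/bin/sh
    # Hypothetical outline: retry the same per-pool server lists a few times,
    # requesting a new Tor circuit between passes (illustration only).

    fetch_time_from_pools() {
        # Placeholder: one pass of curl requests over the same server lists,
        # returning 0 once enough Date headers have been collected.
        false
    }

    request_newnym() {
        # Placeholder: send SIGNAL NEWNYM to the Tor control port.
        true
    }

    MAX_TRIES=3
    try=1
    while [ "$try" -le "$MAX_TRIES" ]; do
        if fetch_time_from_pools; then
            exit 0
        fi
        request_newnym
        try=$((try + 1))
    done
    exit 1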

#51 Updated by intrigeri 2016-10-03 09:31:28

> This way it won’t try more webservers than it does now, only the same list multiple times, hence avoiding the question raised in Bug #11574.

Sounds better indeed!

#52 Updated by intrigeri 2016-10-03 09:33:15

> Next is to implement a retry loop in /usr/local/sbin/htpdate around the curl invocation.

These changes might be too invasive to go into a minor release. I suggest you avoid this question entirely by postponing to 2.8 and focusing on non-internal SponsorS tasks during the 2.7 cycle.

#53 Updated by bertagaz 2016-10-03 11:22:13

  • Target version changed from Tails_2.7 to Tails_2.9.1

intrigeri wrote:
> These changes might be too invasive to go into a minor release. I suggest you avoid this question entirely by postponing to 2.8 and focusing on non-internal SponsorS tasks during the 2.7 cycle.

Agree!

#54 Updated by intrigeri 2016-12-06 14:05:27

  • Target version changed from Tails_2.9.1 to Tails 2.10

As said above: “These changes might be too invasive to go into a minor release”.

#55 Updated by anonym 2017-01-24 20:48:49

  • Target version changed from Tails 2.10 to Tails_2.11

#56 Updated by bertagaz 2017-02-27 13:15:54

  • Target version changed from Tails_2.11 to Tails_2.12

As said above: “These changes might be too invasive to go into a minor release”.

#57 Updated by intrigeri 2017-03-03 13:37:32

  • Deliverable for deleted (SponsorS_Internal)

#58 Updated by bertagaz 2017-03-08 11:09:20

  • Target version changed from Tails_2.12 to Tails_3.0

#59 Updated by bertagaz 2017-05-15 13:33:18

  • Target version changed from Tails_3.0 to Tails_3.1

#60 Updated by bertagaz 2017-05-21 16:37:36

  • Target version changed from Tails_3.1 to Tails_3.2

#61 Updated by bertagaz 2017-08-14 17:10:57

  • Feature Branch changed from bugfix/10494-retry-htpdate to bugfix/10494-htpdate-fix-date-header-regexp

bertagaz wrote:
> Next is to implement a retry loop in /usr/local/sbin/htpdate around the curl invocation.

I’ve been spending some hours implementing that (given the failure ratio in the test suite), and have something working that is close to being pushed.

But while doing so, I discovered another problem, seeing how often we get the ‘Could not find any date header’ errors in the htpdate log.

Looking at the headers the web servers send, I noticed that some of them do not use the first-letter-uppercased convention for HTTP headers, so the regexp in newestDateHeader eventually fails. I’ve pushed a tiny commit:aeb903a6fe8422c3beb677110c56f04b28c6108b that just fixes that (hoping it does not hurt regexp laws). Here htpdate hasn’t failed anymore since then, so it should greatly help. But let’s see how it behaves in Jenkins first.
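
HTTP header field names are case-insensitive, so some servers legitimately send “date:” in all lowercase; a quick manual check has to match both spellings (the URL is an example only):

    # Illustration only: the Date header must be matched case-insensitively.
    curl --silent --head https://example.com/ | grep -i '^date:'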

I’ll push the retry implementation in another branch later, but hopefully it will become much less important with the above change.

#62 Updated by bertagaz 2017-08-15 10:33:53

bertagaz wrote:
> I’ll push the retry implementation in another branch later, but hopefully it will become much less important with the above change.

Pushed in the bugfix/10494-retry-curl-in-htpdate branch. The way to request a new Tor circuit is very Tails-specific though, as it uses the Tails shell library to send the NEWNYM command. But I don’t know if other projects are using this software, and we seem to be the upstream of the Perl version. This functionality is deactivated by default.

#63 Updated by bertagaz 2017-08-15 13:16:01

  • Assignee changed from bertagaz to anonym
  • QA Check changed from Dev Needed to Ready for QA

bertagaz wrote:
> Looking at the headers the web servers send, I noticed that some of them do not use the first-letter-uppercased convention for HTTP headers, so the regexp in newestDateHeader eventually fails. I’ve pushed a tiny commit:aeb903a6fe8422c3beb677110c56f04b28c6108b that just fixes that (hoping it does not hurt regexp laws). Here htpdate hasn’t failed anymore since then, so it should greatly help. But let’s see how it behaves in Jenkins first.

Ok, so Jenkins ran that 15 times (https://jenkins.tails.boum.org/job/test_Tails_ISO_bugfix-10494-htpdate-fix-date-header-regexp/ and https://jenkins.tails.boum.org/job/test_Tails_ISO_bugfix-10494-13541-htpdate-fix-date-header-regexp/) without any htpdate error, which statistically should have happened if you look at how often it happens on other branches. So it confirms my finding.

So I’m setting the bugfix/10494-htpdate-fix-date-header-regexp branch RfQA for the 3.2 RM. Also note that if you merge it, you have to merge the same branch in the htp.git repo we have on immerda.

Once merged, don’t close this ticket yet, but reassign it to me so that I’ll keep tracking this problem and see if we need the other retry branch. I hope so, having spent some time hacking Perl… :)

#64 Updated by bertagaz 2017-08-16 11:02:00

  • Assignee changed from anonym to bertagaz
  • QA Check changed from Ready for QA to Dev Needed
  • Feature Branch changed from bugfix/10494-htpdate-fix-date-header-regexp to bugfix/10494-retry-curl-in-htpdate

bertagaz wrote:
> Once merged, don’t close this ticket yet, but reassign it to me so that I’ll keep tracking this problem and see if we need the other retry branch. I hope so, having spent some time hacking Perl… :)

So it has been (fixed) and merged. Now let’s see if we ever see time syncing errors again and need to retry curl or not.

#65 Updated by intrigeri 2017-09-02 15:21:07

bertagaz wrote:
> bertagaz wrote:
> > Once merged, don’t close this ticket yet, but reassign it to me so that I’ll keep tracking this problem and see if we need the other retry branch. I hope so, having spent some time hacking Perl… :)
>
> So it has been (fixed) and merged. Now let’s see if we ever see time syncing errors again and need to retry curl or not.

Please ensure the corresponding change is merged in htp.git too.

#66 Updated by bertagaz 2017-09-07 13:02:35

  • Target version changed from Tails_3.2 to Tails_3.3

#67 Updated by bertagaz 2017-10-03 10:53:06

  • Target version changed from Tails_3.3 to Tails_3.5

#68 Updated by bertagaz 2017-10-30 12:54:43

  • QA Check changed from Dev Needed to Ready for QA

Flagging this ticket RfQA to test Feature #12633.

#69 Updated by bertagaz 2017-10-30 16:08:05

  • QA Check changed from Ready for QA to Dev Needed

Unflagging.

#70 Updated by bertagaz 2017-11-15 14:39:50

  • Target version changed from Tails_3.5 to Tails_3.6

#71 Updated by Anonymous 2018-01-19 16:27:49

  • related to Bug #11562: Monitor servers from the htpdate pools added

#72 Updated by bertagaz 2018-03-14 11:32:09

  • Target version changed from Tails_3.6 to Tails_3.7

#73 Updated by bertagaz 2018-05-10 11:09:15

  • Target version changed from Tails_3.7 to Tails_3.8

#74 Updated by intrigeri 2018-06-26 16:27:52

  • Target version changed from Tails_3.8 to Tails_3.9

#75 Updated by Anonymous 2018-08-18 10:34:53

@bertagaz:

“Now let’s see if we ever see time syncing errors again and need to retry curl or not.” → did this happen?

“Please ensure the corresponding change is merged in htp.git too.” → please make sure it was.

#76 Updated by intrigeri 2018-09-05 16:26:52

  • Target version changed from Tails_3.9 to Tails_3.10.1

#77 Updated by intrigeri 2018-10-24 17:03:37

  • Target version changed from Tails_3.10.1 to Tails_3.11

#78 Updated by CyrilBrulebois 2018-12-16 13:53:33

  • Target version changed from Tails_3.11 to Tails_3.12

#79 Updated by anonym 2019-01-30 11:59:14

  • Target version changed from Tails_3.12 to Tails_3.13

#80 Updated by CyrilBrulebois 2019-03-20 14:35:09

  • Target version changed from Tails_3.13 to Tails_3.14

#81 Updated by intrigeri 2019-04-13 07:14:07

  • Status changed from In Progress to Rejected
  • Assignee deleted (bertagaz)
  • Target version deleted (Tails_3.14)

u wrote:
> @bertagaz:
> “Now let’s see if we ever see time syncing errors again and need to retry curl or not.” → did this happen?

I see little value in keeping this ticket open for more than 1.5 years, merely in the hope we’ll track whether we still have a problem or not, especially given we have Bug #10495 that’s about the problem, while this very ticket is about one possible solution ⇒ rejecting (this ticket was used to fix a related problem, which is awesome; but the solution this ticket is about was not merged and it’s unclear whether we’ll ever need it, so closing as resolved would be misleading).

Let’s reopen once Bug #11562 is done, if we notice that 1. we still have problems in this area, that are not caused by an obsolete HTP server pool; and 2. the solution proposed here would address such problems.

> “Please ensure the corresponding change is merged in htp.git too.” → please make sure it was.

It was!