Bug #10501
Step 'the "10CC5BC7" key is in the live user's public keyring' is fragile
100%
Description
We’ve seen it failing a bit in Jenkins (2 times in September, 3 times in October).
We’re not sure why it fails, so bertagaz will have to dig into the history, although it may have been erased with the 1.7 release.
Meanwhile, the retry_tor count has been bumped, which may have helped to work around it.
One question that was raised and may help: currently we start fetching the key and then check whether it has actually been fetched within 2 minutes.
Should we perhaps enforce a limit in the fetch step and cancel the fetch if it’s taking too long, then retry?
IIRC that was suggested for the Git tests.
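A minimal sketch of the "limit the fetch, cancel it if it takes too long, then retry" idea, in Ruby. The helper name `retry_with_timeout` and its parameters are hypothetical and are not existing test suite code. Note that Ruby's `Timeout` only interrupts the calling Ruby code, so a real step would also have to stop any external process it started.

```ruby
require 'timeout'

# Hypothetical helper (not part of the test suite): run the block with a
# per-attempt time limit and retry it until max_tries attempts were made.
def retry_with_timeout(max_tries:, per_try_timeout:)
  attempts = 0
  begin
    attempts += 1
    # Raises Timeout::Error if the block runs longer than per_try_timeout.
    Timeout.timeout(per_try_timeout) { yield(attempts) }
  rescue Timeout::Error, StandardError
    retry if attempts < max_tries
    raise # out of attempts: re-raise the last error
  end
end

# Usage with a stand-in for the real fetch: fails twice, then succeeds.
result = retry_with_timeout(max_tries: 3, per_try_timeout: 5) do |attempt|
  raise 'simulated network error' if attempt < 3
  :key_fetched
end
puts result
```

The per-attempt limit would be shorter than the current 2-minute overall check, so that a stalled fetch gets several chances instead of one.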
History
#1 Updated by bertagaz 2015-11-06 10:58:39
- Deliverable for changed from 267 to 270
#2 Updated by bertagaz 2015-11-12 05:18:43
- Assignee changed from bertagaz to anonym
- Type of work changed from Research to Discuss
Here is some information about why it fails:
- in job test_Tails_ISO_testing #35, it fails because the keyserver replied with a “Bad gateway” error code:
https://jenkins.tails.boum.org/view/Tails_ISO/job/test_Tails_ISO_testing/35/artifact/build-artifacts/03%3A16%3A02_Fetching_OpenPGP_keys_using_Seahorse_via_the_Tails_OpenPGP_Applet_should_work_and_be_done_over_Tor..png
- in job test_Tails_ISO_testing #30, it fails because the keyserver replied with a “Gateway time-out” error code:
https://jenkins.tails.boum.org/view/Tails_ISO/job/test_Tails_ISO_testing/30/artifact/build-artifacts/03%3A22%3A58_Fetching_OpenPGP_keys_using_Seahorse_should_work_and_be_done_over_Tor..png
- in several jobs, it fails because of a hostname resolution error for pool.sks-keyservers.net:
https://jenkins.tails.boum.org/view/Tails_ISO/job/test_Tails_ISO_stable/21/artifact/build-artifacts/03%3A14%3A36_Fetching_OpenPGP_keys_using_Seahorse_via_the_Tails_OpenPGP_Applet_should_work_and_be_done_over_Tor..png
https://jenkins.tails.boum.org/view/Tails_ISO/job/test_Tails_ISO_isotester1_metrics/5/artifact/build-artifacts/torified_gnupg-2015-09-28T18%3A37%3A25-07%3A00.png
https://jenkins.tails.boum.org/view/Tails_ISO/job/test_Tails_ISO_isotester1_metrics/6/artifact/build-artifacts/torified_gnupg-2015-09-29T02%3A00%3A11-07%3A00.png
- in job test_Tails_ISO_isotester1_metrics #42, it fails with the error “Cannot connect to destination”:
https://jenkins.tails.boum.org/view/Tails_ISO/job/test_Tails_ISO_isotester1_metrics/42/artifact/build-artifacts/torified_gnupg-2015-10-10T12%3A36%3A19-07%3A00.png
- in job test_Tails_ISO_feature-5926-freezable-apt-repository #7, it took longer than the 2-minute timeout to fetch the key:
https://jenkins.tails.boum.org/view/Tails_ISO/job/test_Tails_ISO_feature-5926-freezable-apt-repository/7/artifact/build-artifacts/03%3A06%3A17_Fetching_OpenPGP_keys_using_Seahorse_via_the_Tails_OpenPGP_Applet_should_work_and_be_done_over_Tor..png
So some retry logic could help, given that these are mostly network errors, either on our side or on the SKS keyserver side.
Note that the Seahorse error window always shows this stop icon; maybe it could help us detect when to retry?
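To illustrate how such failures could be told apart when deciding whether to retry, here is a small Ruby sketch. It only uses the error strings quoted in the reports above; the constant and method names are hypothetical, and it assumes the error text is available as a string, whereas the reports above are screenshots of the error dialog.

```ruby
# Error strings quoted in the failure reports above. The list is
# illustrative, not exhaustive; matching is case-insensitive.
TRANSIENT_ERRORS = [
  /bad gateway/i,
  /gateway time-out/i,
  /cannot connect to destination/i
].freeze

# Hypothetical helper: true if the message looks like one of the
# network errors that are worth retrying.
def transient_network_error?(message)
  TRANSIENT_ERRORS.any? { |pattern| pattern.match?(message) }
end
```

A retry loop could then retry only when this predicate is true and fail immediately otherwise.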
Assigning to anonym (and adding kytv as watcher), so that you can organize the next steps (define the fix, and update the ticket metadata).
#3 Updated by kytv 2015-11-12 10:25:54
I’m confident that my work on Bug #9095 (& incidentally Bug #9791) will increase the general robustness of the entire features/torified_gnupg.feature feature. :)
#4 Updated by kytv 2015-11-12 10:31:49
One of the problems that I discovered is that Seahorse WILL always segfault if there’s a network error. It may not segfault until after the close button is clicked, but it will segfault. The way I decided to handle it is to always kill seahorse during the keysyncing step and restart it since we know that it will segfault 100% of the time.
For seahorse, I’m going to assume that if there’s a close button on the screen, that’s an error. Much of the code I added earlier has been refactored and made less convoluted, partly because I’m using what I learned about ruby since I wrote it, partly because I understand seahorse’s failure modes better. :)
I’m also going to kill the gpg binary if key fetching isn’t successful within 60 seconds, force a new Tor circuit, then retry.
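The kill-then-retry approach described above could look roughly like the following Ruby sketch. This is not the code from the feature branch: the method name and parameters are hypothetical, `cmd` stands for whatever gpg invocation the test uses, and `recover` is a placeholder for whatever forces a new Tor circuit.

```ruby
# Hypothetical sketch: run an external command, kill it if it has not
# finished within `deadline` seconds, call a recovery hook, then retry.
# Returns true as soon as one attempt exits successfully, else false.
def run_with_kill_and_retry(cmd, deadline:, max_tries:, recover: -> {})
  max_tries.times do |i|
    pid = Process.spawn(*cmd)
    status = nil
    limit = Process.clock_gettime(Process::CLOCK_MONOTONIC) + deadline
    while Process.clock_gettime(Process::CLOCK_MONOTONIC) < limit
      # With WNOHANG, wait2 returns nil while the child is still running.
      _pid, status = Process.wait2(pid, Process::WNOHANG)
      break if status
      sleep 0.1
    end
    if status
      return true if status.success?
    else
      # Deadline exceeded: kill the child and reap it.
      Process.kill('KILL', pid)
      Process.wait(pid)
    end
    # Only recover if another attempt will follow.
    recover.call if i < max_tries - 1
  end
  false
end
```

With a 60-second deadline, `recover` would be the step that requests a new Tor circuit before the next attempt.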
#5 Updated by kytv 2015-11-12 10:33:25
- Status changed from Confirmed to In Progress
- Assignee changed from anonym to kytv
- Type of work changed from Discuss to Code
I’ll take this one since I’m pretty much doing it already in Bug #9095 & Bug #9791.
#6 Updated by kytv 2015-11-13 13:36:58
- Assignee changed from kytv to anonym
- % Done changed from 0 to 50
- QA Check set to Ready for QA
- Feature Branch set to kytv/test/9095-robust-seahorse
I think this was solved over in Bug #9095.
#7 Updated by bertagaz 2015-12-05 13:27:35
- Assignee changed from anonym to bertagaz
Want to try this!
#8 Updated by intrigeri 2015-12-05 13:44:01
- Assignee changed from bertagaz to intrigeri
#9 Updated by intrigeri 2015-12-07 12:42:24
- Status changed from In Progress to Fix committed
- % Done changed from 50 to 100
Applied in changeset commit:9f0692b76f00c6a46b110b8aa6400be02e283067.
#10 Updated by intrigeri 2015-12-07 12:43:41
- Assignee deleted (intrigeri)
- QA Check changed from Ready for QA to Pass
#11 Updated by anonym 2015-12-16 11:33:51
- Status changed from Fix committed to Resolved