Bug #14948

zombie processes when running the test suite

Added by groente 2017-11-10 20:27:16 . Updated 2019-12-28 08:58:03 .

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Test suite
Target version:
Start date:
2017-11-10
Due date:
% Done:

100%

Feature Branch:
bugfix/14948-avoid-zombies-in-testsuite-runs+force-all-tests
Type of work:
Code
Blueprint:

Starter:
Affected tool:
Deliverable for:

Description

the build process leaves quite a few zombie processes (mostly avconv and grep) on the isotesters.


Subtasks


Related issues

Related to Tails - Bug #17216: Make the test suite clean up after itself even in most tricky failure modes Confirmed
Related to Tails - Bug #17551: Test suite has defunct processes after sikuli removal Resolved
Blocks Tails - Feature #16209: Core work: Foundations Team Confirmed

History

#1 Updated by intrigeri 2017-11-10 22:39:58

  • Subject changed from zombie processes on isotesters to zombie processes when running the test suite
  • Category changed from Continuous Integration to Test suite
  • Target version deleted (Tails_3.5)
  • Type of work changed from Sysadmin to Code

(I think we have plenty of other stuff that deserves being worked on in the next 2 months more than this one => dropping the milestone. And our CI is not affected to this bug in practice, it can only cause problems for developers running the test suite manually => setting metadata accordingly.)

#2 Updated by CyrilBrulebois 2019-04-02 09:55:47

  • Status changed from Confirmed to In Progress

Applied in changeset commit:tails|b1d5e33f2410c33dbb4a44e9325303d8832b8280.

#3 Updated by CyrilBrulebois 2019-04-02 09:56:58

  • Status changed from In Progress to Confirmed
  • QA Check set to Ready for QA
  • Feature Branch set to bugfix/14948-avoid-zombies-in-testsuite-runs

I’ve worked on that and here’s a patch that helped back in January. Should probably be considered for other branches?

#4 Updated by CyrilBrulebois 2019-04-02 11:01:00

  • Status changed from Confirmed to In Progress

Applied in changeset commit:tails|f7976eb09103eace09e42bc05875cde0daf23416.

#5 Updated by CyrilBrulebois 2019-04-02 11:03:37

Oops, initially posted a branch with WIP commits. Force-pushed seconds ago. And in any case: only the last commit is what matters…

#6 Updated by intrigeri 2019-04-10 07:33:27

#7 Updated by intrigeri 2019-04-10 07:37:37

  • Feature Branch changed from bugfix/14948-avoid-zombies-in-testsuite-runs to bugfix/14948-avoid-zombies-in-testsuite-runs+force-all-tests

I’ve merged feature/buster into the branch so it hopefully builds again on Jenkins; and I’ve added the +force-all-tests suffix so we get more data from Jenkins about it, which should inform the “shall we merge this?” decision :)

#8 Updated by intrigeri 2019-04-14 07:20:48

Our test suite is not good enough on Buster yet for Jenkins to give us sufficient insight & confidence wrt. whether this branch breaks anything. So I’ve transplanted the topic branch onto stable and forced pushed.

#9 Updated by intrigeri 2019-04-29 09:04:23

  • Assignee changed from anonym to intrigeri

(As agreed elsewhere.)

#10 Updated by intrigeri 2019-04-29 11:08:15

The Process.wait() bits LGTM and should indeed fix part of the problem this ticket is about. Jenkins is happy about this branch so I’ll cherry-pick them right away.

Wrt. the .close bits, I have a doubt. I think they do no harm so I’d be fine with merging them, but it seems there’s a misunderstanding, quite possibly on my side, so first I’d like to ensure I understand the intent better. What https://ruby-doc.org/core-2.2.0/IO.html#method-i-close does is “Closes ios and flushes any pending writes to the operating system”. In itself, I understand it does not reap the child process started with IO.popen. But the commit message suggests that we run .close in order to reap child processes, so I’m confused. I guess that the idea in this branch is to increase the chances that the child process we kill manages to exit (instead of being left as a zombie), by closing any communication channels we may have left open with it — right? But then, shouldn’t we run .close before we kill that child process, to increase the success chances of whatever signal handler that process has setup?

Finally, when we kill Chutney, shouldn’t we run Process.wait, to ensure the Chutney processes have actually exited before we move on? I see you’re adding $chutney_onionservice_job.close but as said above, I don’t think that’s sufficient to avoid leaving zombie processes around.

#11 Updated by intrigeri 2019-04-29 11:14:26

  • Assignee changed from intrigeri to CyrilBrulebois
  • QA Check changed from Ready for QA to Info Needed

intrigeri wrote:
> The Process.wait() bits LGTM and should indeed fix part of the problem this ticket is about. Jenkins is happy about this branch so I’ll cherry-pick them right away.

Done!

#12 Updated by intrigeri 2019-06-02 15:21:52

  • QA Check deleted (Info Needed)

(Preparing to drop the “QA Check” field as per “[Tails-dev] Proposal: Redmine workflow change”.)

#13 Updated by intrigeri 2019-11-09 11:00:25

  • related to Bug #17216: Make the test suite clean up after itself even in most tricky failure modes added

#14 Updated by CyrilBrulebois 2019-11-26 16:19:18

TBH I don’t recall the exact chain of thoughts/internet searches that led me to believe those .close additions were the right thing to do. I wish I had explained things better in the commit message… I think I tried to gather all code paths that could lead to external processes (either by observation or by grepping through the code), and I suppose I stackoverflow’d some answer mentioning .close for IO.popen cases, but I’m not finding that again right now, and the API indeed doesn’t seem to suggest this should be what we need to do to chase zombies…

#15 Updated by intrigeri 2019-12-15 09:20:27

  • Status changed from In Progress to Needs Validation
  • Assignee changed from CyrilBrulebois to intrigeri

> TBH I don’t recall the exact chain of thoughts/internet searches that led me to believe those .close additions were the right thing to do. I wish I had explained things better in the commit message… I think I tried to gather all code paths that could lead to external processes (either by observation or by grepping through the code), and I suppose I stackoverflow’d some answer mentioning .close for IO.popen cases, but I’m not finding that again right now, and the API indeed doesn’t seem to suggest this should be what we need to do to chase zombies…

Thanks for looking into this again!

I’ve merged stable into the topic branch, resolved merge conflicts (since then I have fixed one of the problems this branch was about via commit:0d3dc2a0), and what’s left in the remaining diff is:

  • Adding the call to .close after killing the child process → as per above analysis, I’ve pushed commit:5e7ff8e9e644374bea8120096706855bc3c1b5b2 that moves this call to .close to before we send the TERM signal.
  • Store the return value of IO.popen(['grep', …]) in a variable and run .close on it. I’m not sure I understand how this one can help but I’m comfortable with taking this change anyway.

Let’s see what Jenkins thinks. If it does not spot regressions, I’ll merge and we’ll be done for this iteration.

IMO, longer-term we need a more robust and generic solution. I’ve described one on Bug #17216 :)

#16 Updated by intrigeri 2019-12-15 09:27:34

  • Status changed from Needs Validation to In Progress

Applied in changeset commit:tails|5e7ff8e9e644374bea8120096706855bc3c1b5b2.

#17 Updated by intrigeri 2019-12-27 10:18:10

  • Target version set to Tails_4.2

(I’m close to wrapping this up and I’d like to reduce our amount of WIP.)

#18 Updated by intrigeri 2019-12-28 08:55:26

  • Status changed from In Progress to Resolved
  • % Done changed from 0 to 100

Applied in changeset commit:tails|225224104768dee7a5cf18d89920f50f28bc0657.

#19 Updated by intrigeri 2019-12-28 08:58:03

  • Assignee deleted (intrigeri)

OK, we’re done here for at least redir and grep, in the best case scenario when the test suite manages to run its cleanup hooks. For the worst case scenarios, see you on Bug #17216!

#20 Updated by intrigeri 2020-04-15 10:30:40

  • related to Bug #17551: Test suite has defunct processes after sikuli removal added