Bug #16915

Weblate blocks completely every 3 minutes, for a whole minute, making translation frustrating and slow

Added by emmapeel 2019-07-29 07:19:44. Updated 2019-09-12 14:27:48.

Status:
Resolved
Priority:
Normal
Assignee:
emmapeel
Category:
Target version:
Start date:
Due date:
% Done:

0%

Feature Branch:
Type of work:
Sysadmin
Blueprint:

Starter:
Affected tool:
Translation Platform
Deliverable for:

Description

For the past few days, Weblate has been blocking very often: all components, every three minutes.

This makes it hard to work: you are confronted by errors all the time and need to go back a page, sometimes you lose your suggestion, etc.

You can see the blocks at https://translate.tails.boum.org/projects/tails/#history


Subtasks


History

#1 Updated by zen 2019-07-29 19:23:05

  • Assignee set to intrigeri

#2 Updated by intrigeri 2019-08-01 14:56:44

  • Target version set to Tails_3.16

#3 Updated by intrigeri 2019-08-01 15:13:01

  • Status changed from Confirmed to Needs Validation
  • Assignee changed from intrigeri to emmapeel

> For the past few days, Weblate has been blocking very often: all components, every three minutes.

I believe this is a direct consequence of https://git-tails.immerda.ch/puppet-tails/commit/?id=254ff4641cee2f716d5e645366a9c6453a2c1aff (that we had to do to ensure that Weblate does not fiddle with the Git repo while we’re updating it; previously, we would occasionally lose work done on Weblate or erroneously include changes in the wrong commit).

> This makes it hard to work: you are confronted by errors all the time and need to go back a page, sometimes you lose your suggestion, etc.

ACK. I’ve just experienced this myself (after clicking “Suggest”, my suggestion was lost). I agree it’s a huge pain.

To clarify, the trade-off is between this sort of problem and those caused by running the cronjob less often (e.g. one can waste time translating a string that disappeared/changed in the canonical Git repo, accepted translations take more time to go live, etc.).

We’ve been running that cronjob every 5 minutes so far. I’ve just changed this so it now runs only every 30 minutes.

Please let me know how annoying this now is in practice, with this change applied. If needed, we can probably go a bit further and run it every hour only.

#4 Updated by emmapeel 2019-08-06 07:04:48

  • Assignee changed from emmapeel to intrigeri

It is still very annoying.

#5 Updated by intrigeri 2019-08-06 10:41:02

  • Assignee changed from intrigeri to emmapeel

> It is still very annoying.

Thanks for this quick feedback! I’ve made the cronjob run only every 4 hours. Good enough?

Note to myself, in case that’s still not good enough:

  • It might be because the less often we run the cronjob, the longer it blocks Weblate.
  • Ultimately, Weblate losing work while it’s locked is a Weblate bug. We could check whether newer versions improve this (after we’ve upgraded) and if not, report a bug upstream.
  • If this can’t be resolved on the Weblate side, we could try mitigating the problem by making our code that is guarded by this lock run faster: profile it to find where the most time is spent, look for low-hanging optimization fruit.
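
For the last point, a minimal profiling sketch in Python, using only the standard cProfile and pstats modules. The function guarded_update is hypothetical and merely stands in for whatever code runs while Weblate is locked; it is not the actual cronjob code.

```python
import cProfile
import io
import pstats


def guarded_update():
    # Hypothetical stand-in for the work done while Weblate is locked.
    return sum(i * i for i in range(100_000))


def profile_call(func, top=10):
    """Run func under cProfile; return (result, text report of the top entries)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by cumulative time so the most expensive call chains come first.
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()


if __name__ == "__main__":
    _, report = profile_call(guarded_update)
    print(report)
```

Sorting by cumulative time is one reasonable default here, since it shows which top-level steps dominate the time spent under the lock.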

#6 Updated by emmapeel 2019-08-07 09:24:58

  • Assignee changed from emmapeel to intrigeri

Now it seems it has been blocked since last night.

#7 Updated by intrigeri 2019-08-07 10:05:16

> Now it seems it has been blocked since last night.

Yeah, I see that the commit_pending background task has been running for hours. It’s been running the same Git command in a loop since then. I don’t know why but it might have been stuck due to a disk space issue back then.

I’ve killed that process, re-run the cronjob by hand, and it does the same silly thing again. I’m investigating further.

#8 Updated by emmapeel 2019-08-07 10:26:56

hmm… unlocked now and working ok after groente solved another rule… maybe that was the culprit?

#9 Updated by intrigeri 2019-08-07 10:44:30

  • Assignee changed from intrigeri to emmapeel

>> Now it seems it has been blocked since last night.

> I’m investigating further.

Fixed and documented (in translate-server.git) how I did it.

It turned out to be unrelated to this ticket ⇒ back to your plate :)

#10 Updated by intrigeri 2019-08-07 10:53:46

> hmm… unlocked now and working ok after groente solved another rule… maybe that was the culprit?

As clarified on XMPP: no, the timing was a coincidence.

#11 Updated by hefee 2019-08-16 23:01:39

@intrigeri: The reason we needed the locking in the first place was that we did the integration part in the same repo that Weblate is committing to. After we introduced the integration repo, we never tested whether we still need the locking. IMO it should be possible to run the cronjob without any locking, because the only writer to the Weblate repo is Weblate itself. The only other change that is happening is a git pull with ff_only=True. So if Weblate committed anything, then git pull would fail and we would try the fast-forward again in the next run.
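
The fast-forward argument can be illustrated with a toy model, sketched here in Python. It does not call Git: histories are plain lists of commit IDs, only linear histories are modelled, and is_prefix and pull_ff_only are hypothetical names that are not part of update_weblate_components.py.

```python
def is_prefix(shorter, longer):
    """True if `shorter` is an initial segment of `longer`."""
    return len(shorter) <= len(longer) and longer[:len(shorter)] == shorter


def pull_ff_only(local, remote):
    """Toy model of a fast-forward-only pull on linear histories.

    Returns (new_local_history, status). The local history is only ever
    replaced by a history that extends it; it is never rewritten.
    """
    if is_prefix(remote, local):
        # Remote has nothing new for us.
        return list(local), "up-to-date"
    if is_prefix(local, remote):
        # Remote strictly extends local: safe to fast-forward.
        return list(remote), "fast-forwarded"
    # Both sides have commits the other lacks: refuse and leave local alone.
    return list(local), "diverged"


if __name__ == "__main__":
    print(pull_ff_only(["a", "b"], ["a", "b", "c"]))
    print(pull_ff_only(["a", "b", "w"], ["a", "b", "c"]))
```

In the second call, the local side has a commit w that the remote lacks, so the model reports "diverged" and returns the local history unchanged. That corresponds to the failed pull being retried on the next run.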

#12 Updated by intrigeri 2019-08-17 15:13:45

Hi @hefee,

> The reason we needed the locking in the first place was that we did the integration part in the same repo that Weblate is committing to. After we introduced the integration repo, we never tested whether we still need the locking.

Well, https://git-tails.immerda.ch/puppet-tails/commit/?id=254ff4641cee2f716d5e645366a9c6453a2c1aff happened the day after I merged 16844-different-working-directory. Unfortunately, I don’t remember if that was because we had seen actual problems or because previous issues led us to err on the safe side.

> IMO it should be possible to run the cronjob without any locking, because the only writer to the Weblate repo is Weblate itself. The only other change that is happening is a git pull with ff_only=True.

Except “Update Weblate components in Weblate repo”, no? This operation should not happen while Weblate is itself modifying its Git repo, and it needs locking, right?
But perhaps that’s the only thing that needs locking, indeed!

#13 Updated by hefee 2019-08-18 23:23:54

Hi @intrigeri,

> Well, https://git-tails.immerda.ch/puppet-tails/commit/?id=254ff4641cee2f716d5e645366a9c6453a2c1aff happened the day after I merged 16844-different-working-directory. Unfortunately, I don’t remember if that was because we had seen actual problems or because previous issues led us to err on the safe side.

As far as I remember, we wanted to be on the super safe side and hadn’t spotted any issue after we merged 16844-different-working-directory. We simply found the lock_translation feature the day after we merged 16844 and thought that it might be useful, but this issue shows that it had bad side effects.

> > IMO it should be possible to run the cronjob without any locking, because the only writer to the Weblate repo is Weblate itself. The only other change that is happening is a git pull with ff_only=True.
>
> Except “Update Weblate components in Weblate repo”, no? This operation should not happen while Weblate is itself modifying its Git repo, and it needs locking, right?

Well, updating the Weblate components in the Weblate repo doesn’t need locking, because we simply bring the Weblate components in line with the files. The only issue I can think of is that a translation unit disappears at the moment someone tries to store a suggestion for it.

Weblate itself uses its own lock file to signal that something is changing the Git repository. That lock mechanism is already used by update_weblate_components.py for the complete operation (https://git-tails.immerda.ch/puppet-tails/commit/?id=adb28d5c960206d534a1b1836f4e6d657b96cbf0). That means that as long as the script is running, nothing else can change the Git repository.

> But perhaps that’s the only thing that needs locking, indeed!

I see two options to test:
1. remove the lock completely, run the cron every 5 minutes (*/5) and see if we spot issues
2. only lock update_weblate_components.py, measure how long the lock is held, and then run the cron more often if possible

Personally, I would give option 1 a try.
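
For option 2, the measurement could look roughly like the following Python sketch. It uses threading.Lock as a stand-in for whatever lock guards update_weblate_components.py; timed_lock and the durations list are hypothetical names introduced for illustration only.

```python
import threading
import time
from contextlib import contextmanager


@contextmanager
def timed_lock(lock, durations):
    """Hold `lock` for the with-block and record how long it was held."""
    lock.acquire()
    start = time.monotonic()
    try:
        yield
    finally:
        held = time.monotonic() - start
        lock.release()
        # Record the hold time in seconds, even if the block raised.
        durations.append(held)


if __name__ == "__main__":
    lock = threading.Lock()
    durations = []
    with timed_lock(lock, durations):
        time.sleep(0.1)  # stand-in for the guarded update
    print(f"lock held for {durations[0]:.3f}s")
```

Collecting these durations over a few days would indicate whether the lock is short enough to allow running the cron more often.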

#14 Updated by intrigeri 2019-08-19 05:49:43

  • Status changed from Needs Validation to In Progress
  • Assignee changed from emmapeel to hefee

Hi @hefee!

>> Except “Update Weblate components in Weblate repo”, no? This operation should not happen while Weblate is itself modifying its Git repo, and it needs locking, right?

> Well, updating the Weblate components in the Weblate repo doesn’t need locking, because we simply bring the Weblate components in line with the files.

FTR it seems that this is incorrect: this script does modify the repo with (the Python equivalent of) git pull. But it does not matter because as you nicely explained below, this script locks the repo itself:

> Weblate itself uses its own lock file to signal that something is changing the Git repository. That lock mechanism is already used by update_weblate_components.py for the complete operation (https://git-tails.immerda.ch/puppet-tails/commit/?id=adb28d5c960206d534a1b1836f4e6d657b96cbf0). That means that as long as the script is running, nothing else can change the Git repository.

Great! In passing, do you have any idea what’s the UX for translators while the repo is locked (as opposed to locking translations with lock_translation)?

> I see two options to test:
> 1. remove the lock completely, run the cron every 5 minutes (*/5) and see if we spot issues
> 2. only lock update_weblate_components.py, measure how long the lock is held, and then run the cron more often if possible

> Personally, I would give option 1 a try.

Yeah, let’s do this!
Let’s coordinate on XMPP and do this at a well-chosen time, when we can keep an eye on things continuously for a few hours and revert if anything goes wrong.
I should be around every day this week except Tuesday, Wednesday, and Sunday.

#15 Updated by hefee 2019-08-20 19:27:32

  • Assignee changed from hefee to intrigeri

intrigeri wrote:
> Hi hefee!
>
> >> Except “Update Weblate components in Weblate repo”, no? This operation should not happen while Weblate is itself modifying its Git repo, and it needs locking, right?
>
> > Well, updating the Weblate components in the Weblate repo doesn’t need locking, because we simply bring the Weblate components in line with the files.
>
> FTR it seems that this is incorrect: this script does modify the repo with (the Python equivalent of) git pull. But it does not matter because as you nicely explained below, this script locks the repo itself:
>
> > Weblate itself uses its own lock file to signal that something is changing the Git repository. That lock mechanism is already used by update_weblate_components.py for the complete operation (https://git-tails.immerda.ch/puppet-tails/commit/?id=adb28d5c960206d534a1b1836f4e6d657b96cbf0). That means that as long as the script is running, nothing else can change the Git repository.
>
> Great! In passing, do you have any idea what’s the UX for translators while the repo is locked (as opposed to locking translations with lock_translation)?

Nope, I have no idea, but Weblate does not interact with the repo all that often, and the filelock has a timeout of 120 seconds. So I expect that those operations may take longer. Additionally, it is normal for background cronjobs such as commit_pending to lock the repo, and if several users trigger a write event, this lock mechanism makes sure that the repo doesn’t blow up.
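
To make the timeout behaviour concrete, here is a minimal Python sketch of a lock file with a timeout, using only the standard library. It is not Weblate’s implementation: LockTimeout, acquire_lock_file and release_lock_file are hypothetical names, and the 120-second default simply mirrors the figure mentioned above.

```python
import os
import time


class LockTimeout(Exception):
    """Raised when the lock file could not be created within the timeout."""


def acquire_lock_file(path, timeout=120.0, poll_interval=0.1):
    """Create `path` exclusively, waiting up to `timeout` seconds for it to be free."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # O_EXCL makes creation fail if the file already exists.
            fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            os.close(fd)
            return
        except FileExistsError:
            if time.monotonic() >= deadline:
                raise LockTimeout(path)
            time.sleep(poll_interval)


def release_lock_file(path):
    """Remove the lock file so that other processes can acquire it."""
    os.remove(path)
```

In this model, a caller that needs the repo while the lock is held waits and retries rather than failing immediately, and only gets an error once the timeout has elapsed.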

> > I see two options to test:
> > 1. remove the lock completely, run the cron every 5 minutes (*/5) and see if we spot issues
> > 2. only lock update_weblate_components.py, measure how long the lock is held, and then run the cron more often if possible
>
> > Personally, I would give option 1 a try.
>
> Yeah, let’s do this!

Great.

> Let’s coordinate on XMPP and do this at a well-chosen time, when we can keep an eye on things continuously for a few hours and revert if anything goes wrong.
> I should be around every day this week except Tuesday, Wednesday, and Sunday.

Let’s do it on Thursday 10am CEST?

#16 Updated by intrigeri 2019-08-21 08:43:21

Hi @hefee,

>> Let’s coordinate on XMPP and do this at a well-chosen time, when we can keep an eye on things continuously for a few hours and revert if anything goes wrong.

> Let’s do it on Thursday 10am CEST?

Deal! I have another meeting at 11am CEST (that should last 1-2 hours maximum), during which the best I’ll be able to do is to quickly revert stuff if it is really hurtful. But apart from that I should be online most of the day so we can adjust stuff as needed :)

#17 Updated by hefee 2019-08-22 10:22:52

We have now removed the locks and run the cronjob every 5 minutes; so far we don’t see any issues.

#18 Updated by intrigeri 2019-08-22 10:26:32

  • Status changed from In Progress to Needs Validation
  • Assignee changed from intrigeri to emmapeel

We’ve removed the lock around 08:17 UTC today. Please let us know if this problem is solved :)

#19 Updated by CyrilBrulebois 2019-09-05 00:05:42

  • Target version changed from Tails_3.16 to Tails_3.17

#20 Updated by intrigeri 2019-09-12 09:08:25

  • Status changed from Needs Validation to Resolved

intrigeri wrote:
> We’ve removed the lock around 08:17 UTC today. Please let us know if this problem is solved :)

Three weeks later without any new breakage report, I’ll optimistically assume that this problem is gone. Please reopen if this problem happens again :)

#21 Updated by intrigeri 2019-09-12 14:27:48

  • Target version changed from Tails_3.17 to Tails_4.0