Bug #11874

Test suite uses invalid libvirt/QEMU settings when no CDROM is inserted which breaks tests

Added by bertagaz 2016-10-12 12:22:31 . Updated 2016-11-15 18:23:36 .

Status:
Resolved
Priority:
High
Assignee:
Category:
Test suite
Target version:
Start date:
2016-10-12
Due date:
% Done:

100%

Feature Branch:
test/11874-plug-cdrom-only-when-used
Type of work:
Code
Blueprint:

Starter:
Affected tool:
Deliverable for:

Description

Since what seems to be this test run in Jenkins, a lot of scenarios fail with this error:

Call to virDomainRevertToSnapshot failed: internal error: process exited while connecting to monitor: \
2016-10-10T21:06:02.792162Z qemu-system-x86_64: -drive if=none,media=cdrom,id=drive-sata0-0-2,readonly=on,format=raw: \
Can't use 'raw' as a block driver for the protocol level
(Libvirt::Error)
./features/support/helpers/vm_helper.rb:616:in `revert_to_snapshot'
./features/support/helpers/vm_helper.rb:616:in `restore_snapshot'
./features/step_definitions/snapshots.rb:156:in `reach_checkpoint'
./features/step_definitions/snapshots.rb:200:in `/^I\ have\ started\ Tails\ without\ network\ from\ a\ USB\ drive\ with\ a\ persistent\ partition\ enabled\ and\ logged\ in$/'
features/persistence.feature:24:in `Given I have started Tails without network from a USB drive with a persistent partition enabled and logged in'

There has been an upgrade approx. at that time to qemu 2.6 in Jessie backports, that might explain why.


Subtasks


History

#1 Updated by bertagaz 2016-10-12 13:08:51

  • Priority changed from Normal to High

#2 Updated by intrigeri 2016-10-15 13:46:21

  • Subject changed from Isotesters can't boot TailsToaster VM from snapshots to Test suite uses invalid libvirt/QEMU settings when no CDROM is inserted which breaks tests

All recent test suite runs from stable are affected since 2016-10-10, but this seems to be the case for many tests even if they don’t use snapshots:

  Scenario: Tails can boot from live systems stored on hard drives # features/untrusted_partitions.feature:45
    Given a computer                                               # features/step_definitions/common_steps.rb:122
    And I temporarily create a 2 GiB disk named "live_hd"          # features/step_definitions/common_steps.rb:139
libguestfs: trace: set_autosync true
libguestfs: trace: set_autosync = 0
libguestfs: trace: add_drive "/var/lib/jenkins/workspace/test_Tails_ISO_stable/tmp/tails-i386-stable-2.7-20161015T0715Z-e1ed160.iso" "readonly:true" "format:raw"
libguestfs: trace: get_tmpdir
libguestfs: trace: get_tmpdir = "/tmp/TailsToaster"
libguestfs: trace: disk_create "/tmp/TailsToaster/libguestfsIJ9z6q/overlay1" "qcow2" -1 "backingfile:/var/lib/jenkins/workspace/test_Tails_ISO_stable/tmp/tails-i386-stable-2.7-20161015T0715Z-e1ed160.iso" "backingformat:raw"
libguestfs: trace: disk_create = 0
libguestfs: trace: add_drive = 0
libguestfs: trace: add_drive "/tmp/TailsToaster/TailsToasterStorage/live_hd" "format:qcow2"
libguestfs: trace: add_drive = 0
libguestfs: trace: launch
libguestfs: trace: get_backend_setting "force_tcg"
libguestfs: trace: get_backend_setting = NULL (error)
libguestfs: trace: get_cachedir
libguestfs: trace: get_cachedir = "/tmp/TailsToaster"
libguestfs: trace: get_backend_setting "gdb"
libguestfs: trace: get_backend_setting = NULL (error)
libguestfs: trace: launch = 0
libguestfs: trace: list_devices
libguestfs: trace: list_devices = ["/dev/sda", "/dev/sdb"]
libguestfs: trace: copy_device_to_device "/dev/sda" "/dev/sdb"
libguestfs: trace: copy_device_to_device = 0
libguestfs: trace: close
libguestfs: trace: internal_autosync
libguestfs: trace: internal_autosync = 0
    And I cat an ISO of the Tails image to disk "live_hd"          # features/step_definitions/untrusted_partitions.rb:30
    And the computer is set to boot from sata drive "live_hd"      # features/step_definitions/common_steps.rb:135
    And I set Tails to boot with options "live-media="             # features/step_definitions/common_steps.rb:185
    When I start Tails with network unplugged and I login          # features/step_definitions/common_steps.rb:196
      Call to virDomainCreateWithFlags failed: internal error: process exited while connecting to monitor: 2016-10-15T08:38:18.432658Z qemu-system-x86_64: -drive if=none,media=cdrom,id=drive-sata0-0-2,readonly=on,format=raw: Can't use 'raw' as a block driver for the protocol level
       (Libvirt::Error)
      ./features/support/helpers/vm_helper.rb:656:in `create'
      ./features/support/helpers/vm_helper.rb:656:in `start'
      ./features/step_definitions/common_steps.rb:192:in `/^I start the computer$/'
      ./features/step_definitions/common_steps.rb:203:in `/^I start Tails( from DVD)?( with network unplugged)?( and I login)?$/'
      features/untrusted_partitions.feature:51:in `When I start Tails with network unplugged and I login'
    Then Tails is running from sata drive "live_hd"                # features/step_definitions/usb.rb:408
      Scenario failed at time 00:20:40

It looks like we configure a cdrom and pass format=raw, but don’t pass the filename, which makes QEMU very unhappy.

On 2016-10-10, isotesters got QEMU upgraded from 2.5+dfsg-4~bpo8+1 to 2.6+dfsg-3.1~bpo8+1. I bet that it has become a bit more picky, and perhaps Jessie’s libvirt does not know about it. I think we have to live with that combination of QEMU+libvirt versions, as the other options (reverting to Jessie’s QEMU, maintaining backports of libvirt) don’t seem very appealing. Note that on my sid system (QEMU 1:2.6+dfsg-3.1), for the same test libvirt generates a -drive argument that has not format=raw, which does not trigger the bug.

I guess the easiest fix would be to dynamically drop the cdrom reader when we don’t insert any cdrom, right? Then this code can be dropped whenever we use Stretch in our isotesters.

anonym, what do you think?

#3 Updated by anonym 2016-10-18 02:40:24

  • Category changed from Infrastructure to Test suite
  • Status changed from Confirmed to In Progress
  • % Done changed from 0 to 10
  • QA Check set to Ready for QA
  • Feature Branch set to test/11874-remote-cdrom-source-tag
  • Type of work changed from Sysadmin to Code

Yeah, I recall (I think from the virsh sources) that <source file=''/> is invalid in the sense that the resulting qemu commandline would fail if libvirt did not specifically work around it.

The solution is probably to kill the <source file=''/> line and make sure we never generate it — I can see that our set_cdrom_image actually seems to be written with that in mind, so killing that line in the xml template seems enough. I pushed an untested branch so you can see if that works on Jenkins.

#4 Updated by anonym 2016-10-18 10:05:15

  • QA Check changed from Ready for QA to Dev Needed
  • Feature Branch changed from test/11874-remote-cdrom-source-tag to test/11874-empty-cdrom

anonym wrote:
> Yeah, I recall (I think from the virsh sources) that <source file=''/> is invalid in the sense that the resulting qemu commandline would fail if libvirt did not specifically work around it.
>
> The solution is probably to kill the <source file=''/> line and make sure we never generate it — I can see that our set_cdrom_image actually seems to be written with that in mind, so killing that line in the xml template seems enough. I pushed an untested branch so you can see if that works on Jenkins.

This didn’t work, so I guess I have some wrong idea here. I renamed the (mistyped!) branch to something more general so that we don’t lose this commit, cause it still makes sense to try to be more correct.

So…

bertagaz wrote:
> I guess the easiest fix would be to dynamically drop the cdrom reader when we don’t insert any cdrom, right? Then this code can be dropped whenever we use Stretch in our isotesters.

… this indeed seems to be the way to go. Will you give it a shot?

#5 Updated by bertagaz 2016-10-18 11:05:38

anonym wrote:
> bertagaz wrote:
> > I guess the easiest fix would be to dynamically drop the cdrom reader when we don’t insert any cdrom, right? Then this code can be dropped whenever we use Stretch in our isotesters.
>
> … this indeed seems to be the way to go. Will you give it a shot?

I didn’t write that, but I can totally can try to fix that.

#6 Updated by bertagaz 2016-10-19 11:41:12

  • % Done changed from 10 to 30
  • QA Check deleted (Dev Needed)
  • Feature Branch changed from test/11874-empty-cdrom to test/11874-plug-cdrom-only-when-used

intrigeri wrote:
> I guess the easiest fix would be to dynamically drop the cdrom reader when we don’t insert any cdrom, right? Then this code can be dropped whenever we use Stretch in our isotesters.

The more I think about it, the more I believe that’s a bug in QEMU. After all, a VM should be able to have an empty CDROM device at boot time (so with srouce file empty), to emulate how it works in reality.

But even if we agree and report this upstream, we still need a urgent fix for our test suite to run while it’s not solved.

I’ve pushed a branch (did not see anonym’s one when I started to work on this), that requires a bit more polishing before being set RfQA, and may hit a few corner cases in the way we set devices in our test suite, but it should fix it. Let see how it behaves in Jenkins (starting from https://jenkins.tails.boum.org/job/test_Tails_ISO_test-11874-plug-cdrom-only-when-used/2/)

#7 Updated by bertagaz 2016-10-19 11:41:49

  • QA Check set to Dev Needed

#8 Updated by anonym 2016-10-19 12:43:17

anonym wrote:
> > … this indeed seems to be the way to go. Will you give it a shot?
> I didn’t write that,

Woah… I’m not sure how I mixed that up! :S

> but I can totally can try to fix that.

Thanks!

> intrigeri wrote:
> > I guess the easiest fix would be to dynamically drop the cdrom reader when we don’t insert any cdrom, right? Then this code can be dropped whenever we use Stretch in our isotesters.
>
> The more I think about it, the more I believe that’s a bug in QEMU. After all, a VM should be able to have an empty CDROM device at boot time (so with srouce file empty), to emulate how it works in reality.

qemu can emulate an empty cdrom device but noy by setting an empty path. libvirt just allows the empty path by special-casing that particular value, and then converting it to the correct qemu solution. I think we now just have some unlucky combination of libvirt and qemu where whatever libvirt thinks that is is not matching what qemu expects. So, sure, there seems to be a compatibility bug…

> But even if we agree and report this upstream, we still need a urgent fix for our test suite to run while it’s not solved.

… which I’m not sure we should spend time on reporting. All will be well once we only support running the test suite on Debian Stretch.

> I’ve pushed a branch (did not see anonym’s one when I started to work on this),

I pushed a commit to your branch that includes the good from my branch.

> that requires a bit more polishing before being set RfQA, and may hit a few corner cases in the way we set devices in our test suite, but it should fix it.

A tip is that this:

domain_xml.elements.each('domain/devices/disk') do |e|
    if e.attribute('device').to_s == "cdrom"
        domain_xml.elements['domain/devices'].delete_element(e)
        update(domain_xml.to_s)
    end
end

can be simplified to:

domain_rexml.delete_element("domain/devices/disk[@device='cdrom']")
update(domain_xml.to_s)

thanks to xpath. Our XML manipulation can be simplified all over the place thanks xpaths, and if you like you can start. :)

> Let see how it behaves in Jenkins (starting from https://jenkins.tails.boum.org/job/test_Tails_ISO_test-11874-plug-cdrom-only-when-used/2/)

Yay! I’m slightly worried that your changes to remove_cdrom() might break the media unplugging scenarios in emergency_shutdown.feature, but we’ll see. In any case, I don’t think I like that you switched from removing the media on the VM layer (what remove_cdrom() did before your commits) to the software layer inside the guest (eject). I think the old way was truer to what we are trying to test, so personally I’d like to see a separate, new method, so we have:

  • remove_cdrom() - only remove the media, like it was before your commits. Remember, “CDROM” = “compact disk read only media”, so it doesn’t refer to the reader device, just the virtual plastic disk. :)
  • remove_cdrom_device() - remove the whole reader, i.e. what remove_cdrom() does with your commits.

Also, please add a comment (#XXX: Stretch: this workaround can be removed when we end support for running the test suite on Jessie) at an appropriate place, e.g. (after my suggested change above) where we call remove_cdrom_device() instead of remove_cdrom().

And while we’re at it, maybe we should stop pretending we deal with CDs, and move to DVDs. :P

#9 Updated by intrigeri 2016-10-24 16:39:42

bertagaz, what’s your ETA here? Need any help to get this 2-weeks-old problem solved for good?

#10 Updated by bertagaz 2016-10-25 12:02:16

intrigeri wrote:
> bertagaz, what’s your ETA here? Need any help to get this 2-weeks-old problem solved for good?

ETA is to make it so someone can merge it before the freeze announced for 2.7 (even if that’s not tight to a release). :D

I plan to work on this one in the coming days, as well as on Bug #11786, Feature #6304 and Bug #10502.

Right now I just need to find a way to emulate hardware DVD ejection, so that I can answer anonym’s comment. If you have any clue how I can do that with ruby-libvirt, that might help.

#11 Updated by anonym 2016-10-25 13:54:29

bertagaz wrote:
> intrigeri wrote:
> > bertagaz, what’s your ETA here? Need any help to get this 2-weeks-old problem solved for good?
>
> ETA is to make it so someone can merge it before the freeze announced for 2.7 (even if that’s not tight to a release). :D
>
> I plan to work on this one in the coming days, as well as on Bug #11786, Feature #6304 and Bug #10502.

My only problem with this ETA is that these failures currently makes it harder to see how some of our other “test suite robustness improvement” branches are doing. But I get the situation (ENOTIME etc.).

> Right now I just need to find a way to emulate hardware DVD ejection, so that I can answer anonym’s comment. If you have any clue how I can do that with ruby-libvirt, that might help.

Sorry if I was not clear: without your branch, that is just what remove_cdrom() does — it calls set_cdrom(nil), which in turn removes the <source> tag in the CDROM device’s section in the XML, which is the correct way of doing it (it makes QEMU generate an “eject tray” event on the hardware level). Note that at least some version of libvirt threw an exception when that was done, but (again) note that we catch it and deal with it in remove_cdrom(), and it still works.

So, in your branch, I think you should do this:

  1. rename remove_cdrom() -> remove_cdrom_device() like I said earlier.
  2. completely rewrite eject_cdrom(): make it look for the CDROM device (if any — I guess an exception is appropriate if there’s none), and remove its <source> tag (if any — imho we do nothing if it’s already removed).
  3. move the exception handling in remove_cdrom_device() (where it currently does nothing) into eject_cdrom(), where it might be needed still for some (all?) versions of libvirt.

Please let me know if there’s anything else you need!

#12 Updated by anonym 2016-10-26 14:29:40

  • Assignee changed from bertagaz to anonym

I take this one over, since it’s blocking me and in general is making our test suite results on Jenkins a lot less useful (well, a lot more investigation is needed when interpreting results).

#13 Updated by anonym 2016-10-26 14:39:19

  • % Done changed from 30 to 40

I’ve pushed a very non-atomic fix (sorry!). Now I only try to eject, but if we actually have to remove the CDROM device, I’ll push a commit with this diff:

--- a/features/support/helpers/vm_helper.rb
+++ b/features/support/helpers/vm_helper.rb
@@ -335,12 +335,12 @@ class VM
     plug_drive(name, type) if not(disk_plugged?(name))
     set_boot_device('hd')
     # XXX:Stretch: since our isotesters upgraded QEMU from
-    # 2.5+dfsg-4~bpo8+1 to 2.6+dfsg-3.1~bpo8+1 it seems we must eject
-    # the CDROM to allow disk boot. This is not the case with the same
+    # 2.5+dfsg-4~bpo8+1 to 2.6+dfsg-3.1~bpo8+1 it seems we must remove
+    # the CDROM device to allow disk boot. This is not the case with the same
     # version on Debian Sid. Let's hope we can remove this ugly
     # workaround when we only support running the automated test suite
     # on Stretch.
-    eject_cdrom
+    remove_cdrom_device
   end

   # XXX-9p: Shares don't work together with snapshot save+restore. See

#14 Updated by anonym 2016-10-26 16:36:38

Just ejecting the CDROM didn’t work, so I pushed a commit making it remove the CDROM device.

#15 Updated by anonym 2016-10-29 02:54:55

  • Assignee changed from anonym to intrigeri
  • % Done changed from 40 to 50
  • QA Check changed from Dev Needed to Ready for QA

See builds >= 16 here: https://jenkins.tails.boum.org/job/test_Tails_ISO_test-11874-plug-cdrom-only-when-used/

There are some completely successful runs, and among the errors none are the one this ticket is about. So, it seems it’s fixed! \o/

#16 Updated by intrigeri 2016-10-29 12:20:08

  • Status changed from In Progress to Fix committed
  • % Done changed from 50 to 100

Applied in changeset commit:410fa0dfc46a0aadc696835cc2b8372a49a0852d.

#17 Updated by intrigeri 2016-10-29 12:20:58

  • Assignee deleted (intrigeri)
  • QA Check changed from Ready for QA to Pass

Thanks!

#18 Updated by bertagaz 2016-11-15 18:23:36

  • Status changed from Fix committed to Resolved