Virtualbox Issue Ubuntu 22.04.1: "Communication with VM Hypervisor failed"

Graham Jenkins ID: 1626 Posts: 163
16 Aug 2022 04:06 AM

Anybody else seen this? It's happening on an i7-4790 machine with Rosetta Python tasks (where the task just gets postponed indefinitely), and in a like vein LHC Virtualbox jobs are failing with a "Computational Error" about 12 seconds after they start.

I've tried re-installing boinc-virtualbox (7.18.1+dfsg-4) and that doesn't solve the problem. The machine happily runs Boinc Numberfields jobs, and also runs other (FreeBSD, NetBSD, Alma Linux) VirtualBox guests.

Suggestions?

 

Graham Jenkins ID: 1626 Posts: 163
16 Aug 2022 06:47 AM

And I'm seeing exactly the same story on an i5-3470S machine. Again, it happily runs Boinc Numberfields jobs.

I've also tried a Project Reset on both Rosetta and LHC; doesn't resolve the issue.

Tristan Olive ID: 22 Posts: 384
16 Aug 2022 04:35 PM

Do you know what version of VirtualBox you have installed? User "entity" just mentioned in another thread that Rosetta is running an old version of vboxwrapper, which is the application through which BOINC communicates with the hypervisor. 

That being said, I'm not sure what you could do about it other than try to install an older version of VirtualBox, if that is the problem. (Or maybe Rosetta forums have some discussions along these lines).

Graham Jenkins ID: 1626 Posts: 163
16 Aug 2022 09:09 PM
  • virtualbox         6.1.34-dfsg-3~ubuntu1.22.04.1
  • boinc-virtualbox   7.18.1+dfsg-4
  • boinc              7.18.1+dfsg-4

Even if Rosetta is running an old version of vboxwrapper, we should still expect LHC to work. I've submitted an Ubuntu bug report; if that doesn't get a result, I'll try to install an older VirtualBox wrapper.

Graham Jenkins ID: 1626 Posts: 163
21 Aug 2022 01:34 AM

See: https://bugs.launchpad.net/bugs/1986647

Its status has now been flagged as "Confirmed" because it affects multiple users.

entity ID: 7097150 Posts: 29
27 Aug 2022 01:49 PM

I have no absolute proof, but in my opinion the problem with Rosetta is the version of vboxwrapper it is using. That version copies the .vdi file from the projects directory to each slot directory as the WU starts executing (you may notice that a task takes about 45 to 60 seconds after starting before it shows any percentage complete). This copy is about 8 to 10 GB in size and is I/O intensive. I believe that if you are running more than one Rosetta task at a time, an executing Rosetta task will not be able to write to the disk while a new Rosetta task is doing its copy at start-up. You then get the message. For me, the task that got the message you described was close to being complete most of the time.

The way to get it to start back up and complete is to stop the boinc client and restart it. The Rosetta task will then restart and complete. If you don't restart the boinc-client or the computer, those tasks will occupy the slot and prevent other work from using it. Over time all your slots will get exhausted and no other work will start until a restart happens. This doesn't fit the CE "install and forget" methodology.

The absolute fix for the problem is to get Rosetta to use the updated vboxwrapper with the multi-attach function, which removes the need to copy the .vdi file. This will also make the Rosetta tasks start a lot faster.

LHC: I had a similar problem with the LHC tasks on Linux, and it was caused by an upgrade to the boinc client that introduced stricter systemd protections for the /tmp filesystem. I was getting permission problems when the task was starting up and trying to create files in /tmp, and I had to edit the systemd configuration to correct the problem. If you are running Windows, this description doesn't apply. Unfortunately, this kind of failure is very difficult to diagnose under Charity Engine (which is why I'm not currently using it): you don't have access to the stderr.txt files for the failing tasks, since you can't log on to the userid the project tasks run under. That file would probably show the reason the task is failing. Tristan et al. would have to look for you, as they are the only ones who have access to the WU output once it is uploaded to the project site.
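For anyone wanting to check which /tmp-related protections their own unit file declares, here is a minimal sketch. The helper name tmp_directives is made up for illustration, and the list of directives it looks for is an assumption about which settings are relevant, not a confirmed diagnosis.

```shell
# Filter a unit file (read from stdin) down to the directives that
# commonly affect access to /tmp. The directive list is an assumption.
tmp_directives() {
    grep -E '^(PrivateTmp|ProtectSystem|ReadWritePaths)=' || true
}

# Typical use on the affected machine (not run here):
#   systemctl cat boinc-client.service | tmp_directives
```

If nothing is printed, the unit declares none of these directives and the systemd defaults apply.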

In my opinion, LHC and Rosetta (actually any project requiring VirtualBox) are not a good fit for Charity Engine users. They have a tendency to not behave correctly at times, causing the issues you are experiencing now. In Charity Engine I would focus more on the non-VB tasks like NumberFields, TN-Grid, DENIS, and WCG (which is starting back up). They tend to have tasks that are relatively short running, don't require lots of resources, and behave correctly, which fits CE's "install and forget" approach to grid computing.

Good luck

Graham Jenkins ID: 1626 Posts: 163
05 Sep 2022 11:33 AM

I've acquired and installed boinc-7.20.2+dfsg+202208252251~ubuntu22.04.1, and that hasn't resolved the issue, so I've passed some comments back to its developer and am awaiting his response.

I note that World Community Grid System Migration has now been completed. Is there any chance that we can now recover the WCG project points that were never credited to our accounts? And perhaps then start processing some WCG tasks again?

Graham Jenkins ID: 1626 Posts: 163
11 Sep 2022 10:13 AM

I've managed to resolve the Virtualbox issue on Ubuntu 22.04.1 by replacing its 'boinc-client.service' file with the one from Ubuntu 20.04.3. The only difference is that the latter declares "PrivateTmp=true".
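A comparison like this can be reproduced with diff if both copies of the unit file are at hand. The sketch below assumes the two files have been saved side by side; the helper name and the file names in the example call are hypothetical.

```shell
# Show a unified diff between two copies of a unit file.
# diff exits with status 1 when the files differ, so that status is
# deliberately ignored here.
diff_units() {
    diff -u "$1" "$2" || true
}

# Example call; both paths are hypothetical placeholders for wherever
# the two copies were saved:
#   diff_units ./boinc-client.service.jammy ./boinc-client.service.focal
```

Lines present only in the second file appear prefixed with "+".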

entity ID: 7097150 Posts: 29
11 Sep 2022 12:54 PM

That should fix the LHC issue but I doubt it will fix the Rosetta problem you described. Be aware the next BOINC client upgrade will replace that file and you will have to copy it again if you still have the older version to copy. You could create an override file in /etc/systemd/system/ by doing:

sudo systemctl edit boinc-client.service

then adding PrivateTmp=true under a [Service] heading in the editor that opens, and saving. This will allow the change to survive service file updates.
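For a non-interactive version of the same override, the drop-in can be written directly. This is a sketch only: the function name is made up, and it assumes the usual drop-in location and the override.conf file name that `systemctl edit` uses by default.

```shell
# Write a drop-in that adds PrivateTmp=true to a service.
# $1 is the drop-in directory for the unit.
write_privatetmp_override() {
    dropin_dir="$1"
    mkdir -p "$dropin_dir"
    printf '[Service]\nPrivateTmp=true\n' > "$dropin_dir/override.conf"
}

# On the real system, as root (not run here):
#   write_privatetmp_override /etc/systemd/system/boinc-client.service.d
#   systemctl daemon-reload
#   systemctl restart boinc-client.service
```

Unlike `systemctl edit`, writing the file by hand does not reload systemd, hence the explicit daemon-reload before restarting the service.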

Graham Jenkins ID: 1626 Posts: 163
15 Sep 2022 06:48 AM

Remaining issue: on Ubuntu 22.04.1, LHC Atlas jobs requiring multiple CPUs just seem to execute on one CPU, leaving the other CPUs unavailable for other jobs.

And I've solved this one by upgrading (from the Ubuntu 22.04.1 proposed packages) to Virtualbox 6.1.38-dfsg-3~ubuntu1.22.04.1; no idea why this worked.