During the past couple of weeks, I've noticed a number of Rosetta jobs stuck in a "Waiting to Run" status at somewhere like 60% completion.
This is occurring because they seem to get to that stage, then demand more memory than the machine has available. The machine is actually an OpenVZ instance with a total of 768MB of memory, and in the most recent case, a job was requesting 800MB .. so it was never going to proceed.
So .. is there a way of:
1. preventing jobs which might grow like this from being allocated? or:
2. allowing new jobs to start while they are stalled? or:
3. transparently migrating stalled jobs to another of my machines that has more memory?
Sorry, Graham -- we may have to look into this further. It could be something of an incompatibility between the Charity Engine client and OpenVZ. What you note in point 1 is the way Charity Engine should already be working, so the fact that a job is even starting when it needs more than all memory available is strange.
One thing you can try is modifying the computing preferences in your Charity Engine software to adjust the memory usage limits. These are based on percentages of the total system memory, so I wonder if it sees the actual physical memory size rather than the OpenVZ memory size when performing this calculation. So if you have 2000MB total on the system and Charity Engine is set to use up to 40%, then maybe Rosetta jobs think they have 800MB available, instead of staying within the 768MB that OpenVZ limits them to?
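To put numbers on that guess, here is a rough sketch of the percentage calculation in Python. The function name `memory_budget_mb` is made up for illustration, and this is not the actual Charity Engine client code; it only shows how the same percentage gives a very different budget depending on which "total" the client sees. The 2000MB host size is the hypothetical figure from above, not a measured value.

```python
def memory_budget_mb(total_mb, percent):
    """Return the memory budget in MB for a given total and percentage limit.

    Hypothetical helper for illustration only, not the real client logic.
    """
    return total_mb * percent / 100.0


# If the client sees the (hypothetical) 2000MB physical host:
host_budget = memory_budget_mb(2000, 40)

# If the client sees the 768MB OpenVZ allocation instead:
container_budget = memory_budget_mb(768, 40)

print(f"Budget based on host memory:      {host_budget:.1f}MB")
print(f"Budget based on container memory: {container_budget:.1f}MB")
```

If the first figure were the one in use, a job could be allowed to grow past what the container can actually supply.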
Seems to have worked OK for the last couple of days, perhaps the Rosetta people have changed something? Memory Usage is set to 90% when the computer is in use, 91% when it's not. The job now executing is at 62% done, showing VM size 417MB, working-set size 263MB.
New job download shows
Anyway .. I'll keep an eye on it for the next few days ..
Thanks for the update on this. What you shared from the log confirms that it is targeting 90% of 768MB, so that is as it should be. It very well could have been an issue with the Rosetta job, if they thought it would need less memory than it actually did.
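For anyone following along, the arithmetic can be checked with a small Python sketch. The helper name `fits_within_limit` is hypothetical, and which figure the client actually compares against the limit (working-set size, VM size, or the job's own estimate) is an assumption here, not something confirmed in this thread.

```python
def fits_within_limit(job_mb, total_mb, percent):
    """Return True if a job's memory need fits under percent% of total_mb.

    Hypothetical helper for illustration; the real client may compare
    a different figure against the limit.
    """
    limit_mb = total_mb * percent / 100.0
    return job_mb <= limit_mb


TOTAL_MB = 768   # OpenVZ memory allocation reported in this thread
PERCENT = 90     # "in use" memory setting reported in this thread

limit_mb = TOTAL_MB * PERCENT / 100.0
print(f"Limit: {limit_mb:.1f}MB")

# Figures quoted in this thread
for label, job_mb in [("stalled job request", 800),
                      ("current job VM size", 417),
                      ("current job working set", 263)]:
    verdict = "fits" if fits_within_limit(job_mb, TOTAL_MB, PERCENT) else "does not fit"
    print(f"{label}: {job_mb}MB {verdict}")
```

On these numbers the earlier 800MB request is over the limit either way, while the job currently running is comfortably under it.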