PC locks up running large jobs

Status
Not open for further replies.

BeardedBob

Mechanical
Sep 14, 2007
6
Does anyone have any thoughts on why we are experiencing machine lockups with large Abaqus jobs?

When running large jobs the PC will freeze and has to be switched off and re-booted. Apart from the loss of the job, it then takes 6 hours or so to check the RAID array.

Simulia can see no problems with our models; they run OK on Simulia's systems, so Simulia blame our hardware. Our IT support group have run various hardware diagnostics, say there is nothing wrong with the hardware, and are blaming Abaqus. Not that they have any experience of Abaqus, they usually support Ansys users :-{

Hardware:
Dell Precision T3500
12 Gbyte RAM
4 x 1 TByte hard discs configured as a 2 x 2 TByte RAID array
1GB QUADRO NVIDIA FX3800-2 graphics card
Windows XP 64bit

Abaqus 6.9-1

Abaqus memory settings originally at 90% (default), reduced to 75% and the problem still occurs. We are about (when the latest RAID check is finished) to try again at 60%. The problems have occurred when running 1 or 3 cores of the processors.
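To put those percentages in context, here is a rough sketch of what each setting leaves for everything else on the machine. It assumes the percentage is taken from the 12 GB of installed RAM listed above; the function name is made up for illustration and nothing here is read from Abaqus itself.

```python
# Assumption: the limit is a percentage of installed physical RAM (12 GB here).
INSTALLED_GB = 12.0


def abaqus_limit_gb(percent, installed_gb=INSTALLED_GB):
    """Return (GB available to Abaqus, GB left for everything else)."""
    limit = installed_gb * percent / 100.0
    return limit, installed_gb - limit


if __name__ == "__main__":
    # The three settings mentioned in this post.
    for pct in (90, 75, 60):
        limit, rest = abaqus_limit_gb(pct)
        print(f"{pct}%: Abaqus up to {limit:.1f} GB, {rest:.1f} GB left over")
```

At the default, only a little over 1 GB remains for Windows, drivers and the file cache, which may be why lowering the limit matters.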

Generally the crashes occur at night after several hours' running, which has led to speculation that there is a thermal issue. The machine is not enclosed and sits on a desk in a normal UK office, so I'm not convinced, although the graphics card is a big beast.

One suggestion was that it was due to auto updates by Kaspersky Anti-virus while the machine is heavily loaded, but the crashes still occur when the network cable is unplugged and Kaspersky is turned off. We've turned off as many other programs as we can to reduce load on the system.

Any thoughts, similar experiences etc. would be greatly appreciated.

Bob Andrews
 
I've had Abaqus hang up, but it is almost always an error with my model and not my system. Unless the fan vents are being blocked, I think the thermal idea is weak. Besides, if it was just the system heating up too much, then it should run for longer when you are only using 1 core as opposed to 3.

I am going to apologize up front if these are too obvious/dumb but here are a couple of ideas.

1. Are you sure Kaspersky isn't trying to run a scan at the same time?

2. Did you check the power settings to make sure XP isn't putting the system in standby, or turning off the hard drive? I think older versions of Windows used to just use the mouse/keyboard to determine if the system is being used.

Again, sorry if these are obvious but I've been in your spot and know that sometimes even bad ideas can get me on the right track.

Dan
 
Does this happen with a particular model or with any model ?

Do you get any error message in .log / .msg about abort?

Running jobs with 3 cores is kind of unusual. Does it run with 2 cores?

Does the HD undergo automatic back-up?
 
Thanks for the responses, comments and a progress update:

Danstro - We did suspect Kaspersky scans or automatic updates, but we've unplugged the network cable and turned Kaspersky off and the problem still occurs. Power management etc. is off as well.

Johnhors - Interesting thought, might be a last resort. Would take me back about 15 years to when I ran a group of SGI workstations with 32 bit IRIX. I think our analysts would need some re-education though, they're Windows people.

Xerf - it's happened with various models so we don't think it's a model problem. No error messages at all in .log / .msg files, all the files just stop dead, presumably when everything locks so the I/O buffers aren't flushed to disc. Not sure about 2 cores, will try that - but the same problem occurs with just one core. HD is managed by a RAID controller on the motherboard, with mirroring. My (limited) understanding is that this is happening at a low level and doesn't involve any dedicated backup process. We are not running any sort of automated backup software like Acronis.
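Since the job files stop without any message, one way to pin down when the machine freezes is a small heartbeat logger running alongside the job. The sketch below is only an illustration: the function names and the "alive" message are made up, and a RAID controller's own cache may still hold the last write.

```python
import os
import time


def write_heartbeat(path, message="alive"):
    """Append one timestamped line and ask the OS to push it to disc."""
    line = time.strftime("%Y-%m-%d %H:%M:%S") + " " + message + "\n"
    with open(path, "a") as f:
        f.write(line)
        f.flush()              # empty Python's own buffer
        os.fsync(f.fileno())   # ask the OS to write its buffer to disc
    return line


def run_heartbeat(path, interval_s=60, count=None):
    """Write a heartbeat every interval_s seconds; count=None runs forever."""
    n = 0
    while count is None or n < count:
        write_heartbeat(path)
        n += 1
        if count is None or n < count:
            time.sleep(interval_s)
    return n
```

Start it with something like run_heartbeat("heartbeat.log") before submitting the job. After a lockup, the last timestamp in the file shows roughly when the machine stopped, which can be compared with the last increment in the Abaqus files.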

Progress - we turned the Abaqus memory limit down to 60% and managed to run one of the problem jobs OK. This may prove to be a workaround, but I'm left wondering why we need to reduce it so much from the default. I'm also worried that we haven't really solved the problem and it will just come back to bite us in the middle of an urgent job :-{

Bob Andrews
 
Don't use all the processor cores for the job; leave one core free for the system.
 