Threads Crash and don't restart

Started by commorancy, August 01, 2008, 02:40:25 PM


commorancy

Hello,

I'm seeing a problem where, during a render, one or more of the threads crash with 'An unknown error occurred in a render thread'. The first time this happened, one of four render threads died (4 CPU system), leaving the other 3 threads running. The problem is that when a thread dies, it leaves the tile it was working on unfinished. No other thread notices the failure and goes back to finish that tile, so the tile remains unrendered (black or mostly black). Since I've found no way to tell Terragen 2 which tile to work on, the only way to get that area rendered is to guess at its bounds using the Crop function.

The second time I saw this issue, all four render threads failed with the same error. So it appears that when a thread dies, it does not restart, and the work it was doing is left unfinished. This leaves the render window open, waiting for a render that will never complete because the threads have died. When all 4 threads died, I had to kill the render manually because the render window apparently doesn't keep track of thread failures.

I would suggest that if a thread dies, Terragen should either 1) stop the entire render process or 2) restart the thread, picking up where the dead thread left off. It would be best to attempt to restart the thread and let it continue to render. However, if the thread continues to fail, then abort the entire render. The way it works now, it's impossible to complete the render of a scene when threads randomly die. I would also suggest hooking up the render window to the status of the threads. If all threads fail, the render window should mark the status of the render as finished with errors.

This issue doesn't happen until the second stage of the render begins.  So, it gets all the way through the lengthy dotty tile sequence and then these errors occur during the second stage of rendering (when it begins to fill in the tiles with the final image).

Note that this is in TP5 (1.9.99.1) running on Windows Vista with a quad core and 3 GB of RAM.

Thanks.

--
Brian

commorancy

Additionally, I should point out that this scene is also generating population errors described in another thread (http://forums.planetside.co.uk/index.php?topic=4320.0).  I do not know if the population errors are related to the thread crashing.

Thanks.

--
Brian

Cyber-Angel

Wouldn't it be better if, instead of shutting down the whole render when a thread fails as you suggest (thank you... but no), Terragen were smart enough to use a technique similar to some high-end render farm software? That software constantly monitors the state of each machine on the farm, and should one stop responding or drop off the farm for whatever reason, that machine's current work is moved to a machine on the farm not currently in use.

One would think that, in theory, Terragen could look at the states of the threads it is using and, should a thread fail for whatever reason, gracefully shunt that thread's workload to another thread as an automatic process that is part of its core. This would be the preferred solution over terminating the render without explanation, at least from an "End User Frustration" point of view: if you've had a render going for some hours and it then terminates without explanation, or even forces the application to crash to desktop, there would be a great many enquiries here on these forums as to why.

Regards to you.

Cyber-Angel         

commorancy

Quote from: Cyber-Angel on August 01, 2008, 07:56:40 PM
Wouldn't it be better if, instead of shutting down the whole render when a thread fails as you suggest (thank you... but no), Terragen were smart enough to use a technique similar to some high-end render farm software? That software constantly monitors the state of each machine on the farm, and should one stop responding or drop off the farm for whatever reason, that machine's current work is moved to a machine on the farm not currently in use.

Yes, I would also prefer the renderer to start up a new thread. The thread that died would leave a processor free to start new work. So, in fact, I agree that it should start a new thread to continue the unfinished work and move on. That said... if there is an unresolvable issue in the data set supplied to the renderer, then there is no other choice than to abort the entire render once a set of failure conditions has been met. That means implementing a counter that tracks thread failures over a given bit of data. If the thread continues to fail and restart over and over, that's just as bad as dying and not restarting. Thus, if the data cannot be rendered after a certain number of tries, the renderer must give up and notify the user.

We all know that there are scenarios that can be created through the workflow functions that can lead to loops and other unrenderable data sets.  So, this is why it's important for them to create a retry system and eventually give up the render if the threads die unexpectedly due to the data set.

It's actually more important to know why a thread failed than to restart the render after a failure. If the thread can report some user-friendly information in the errors area about why it failed, the user can then adjust the render data so that it renders.
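A minimal sketch of the retry idea described above, in Python rather than anything Terragen actually uses. It assumes a hypothetical `render_tile` callable that raises on failure and an arbitrary limit of 3 attempts per tile; none of these names come from TG2. A failed tile goes back on the queue for any free thread to pick up, each failure message is kept so the user can see why it failed, and the render ends as "finished with errors" if a tile is abandoned.

```python
import queue
import threading

MAX_ATTEMPTS = 3  # assumed retry limit; not a real Terragen setting


def render_all(tiles, render_tile, num_threads=4):
    """Render every tile, retrying a failed tile up to MAX_ATTEMPTS times.

    render_tile is a hypothetical callable that raises on failure.
    Returns (status, failed, errors): failed is the set of abandoned tiles,
    errors maps each tile to the messages from its failed attempts.
    """
    work = queue.Queue()
    for tile in tiles:
        work.put((tile, 1))  # (tile, attempt number)

    errors = {}
    failed = set()
    lock = threading.Lock()

    def worker():
        while True:
            item = work.get()
            if item is None:  # sentinel: no more work for this thread
                work.task_done()
                return
            tile, attempt = item
            try:
                render_tile(tile)
            except Exception as exc:
                with lock:
                    errors.setdefault(tile, []).append(f"attempt {attempt}: {exc}")
                    if attempt < MAX_ATTEMPTS:
                        # Requeue so any free thread can finish the tile.
                        work.put((tile, attempt + 1))
                    else:
                        failed.add(tile)  # give up on this tile
            finally:
                work.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    work.join()  # returns once every tile is rendered or abandoned
    for _ in threads:
        work.put(None)
    for t in threads:
        t.join()

    status = "finished with errors" if failed else "finished"
    return status, failed, errors
```

Here the worker catches the failure instead of dying, which has the same effect as restarting the thread: the render window always gets a final status and never hangs waiting on dead threads.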

Quote
One would think that, in theory, Terragen could look at the states of the threads it is using and, should a thread fail for whatever reason, gracefully shunt that thread's workload to another thread as an automatic process that is part of its core. This would be the preferred solution over terminating the render without explanation, at least from an "End User Frustration" point of view: if you've had a render going for some hours and it then terminates without explanation, or even forces the application to crash to desktop, there would be a great many enquiries here on these forums as to why.

Cyber-Angel

To be perfectly honest, what you are asking for here is a preparsing system. A preparsing system could go through the render input data and do integrity checking before beginning the actual render, to warn of issues. Such a system could help the user clean up the scene and prevent failures hours into a render.

Thanks.

--
Brian


Cyber-Angel

You could get round the loop problem by having "Loop Checking" as part of the core. But since I'm not a programmer and have only some extremely elementary knowledge of the subject, which I recall only vaguely, I could not say where "Loop Checking" would be used, other than maybe having it look for loops at the parsing stage, if such technologies were to be implemented in the future.
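For what it's worth, loop checking at the parsing stage is usually a standard graph cycle check. The sketch below is a generic depth-first search in Python, assuming the scene's node network can be described as a dictionary mapping each node name to the nodes it feeds into. That representation and the node names are made up for illustration and are not how TG2 stores its network.

```python
def find_loop(graph):
    """Return one loop as a list of node names, or None if there is no loop.

    graph maps each node name to the names of the nodes it feeds into.
    """
    UNSEEN, IN_PROGRESS, DONE = 0, 1, 2
    state = {node: UNSEEN for node in graph}
    path = []  # nodes on the current search path, in order

    def visit(node):
        state[node] = IN_PROGRESS
        path.append(node)
        for nxt in graph.get(node, ()):
            nxt_state = state.get(nxt, UNSEEN)
            if nxt_state == IN_PROGRESS:
                # nxt is already on the current path, so we have a loop.
                return path[path.index(nxt):] + [nxt]
            if nxt_state == UNSEEN:
                found = visit(nxt)
                if found:
                    return found
        path.pop()
        state[node] = DONE
        return None

    for node in list(graph):
        if state[node] == UNSEEN:
            found = visit(node)
            if found:
                return found
    return None


# Hypothetical node names, for illustration only.
network = {
    "fractal": ["displacement"],
    "displacement": ["surface"],
    "surface": ["fractal"],  # feeds back into the start: a loop
}
print(find_loop(network))
```

A check like this could run before the render starts and report the offending nodes, rather than leaving the user to find out hours into a render.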

Regards to you.

Cyber-Angel         

commorancy

I've found a reasonably effective workaround for both the population errors and the thread crashing issue. I set the crop to a reasonably small horizontal stripe and render each stripe in turn, going from top to bottom of the image. So far it looks like about 7-10 stripes for the entire image at 1920x1080. This process takes longer only because it's manual, but each stripe renders reasonably quickly, produces an incremental result (so you don't lose everything in case of error) and seems to get around whatever causes the thread crashes and population errors when rendering the whole image at once.

So, I would like to see something like this added to Terragen. Instead of trying to render the entire image all at once, break the image down into a bunch of crop sections, render each crop section separately (stage 1, then stage 2 for only that section) and incrementally save that work. Then move on to the next section. I'm concerned that, for a full-sized render, this issue is related to some Windows resource limit (RAM, thread constraints, etc.) combined with the memory needed for a very large scene. This stripe rendering methodology seems to get around whatever resource limitation is causing these thread failures and population errors. Incremental saves could also allow a render to be restarted from a previously saved point if it aborts with an error.

Of course, at the end it all has to be stitched together into a final image, but that's reasonably easy with something like GIMP or Photoshop. This stitching process could also be automated.
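As a rough illustration of the stripe approach, here is a small Python sketch that works out the crop boxes for a given image size and stripe count. It only computes pixel bounds; converting them into whatever units the renderer's crop settings expect, and the actual rendering and stitching, are left out. The function name is made up for this example.

```python
def stripe_crops(width, height, stripes):
    """Split an image into horizontal stripes that cover every row exactly once.

    Returns a list of (left, top, right, bottom) pixel boxes, top to bottom.
    Any leftover rows are spread over the first stripes, one extra row each.
    """
    if not 1 <= stripes <= height:
        raise ValueError("stripes must be between 1 and the image height")
    base, extra = divmod(height, stripes)
    boxes = []
    top = 0
    for i in range(stripes):
        rows = base + (1 if i < extra else 0)
        boxes.append((0, top, width, top + rows))
        top += rows
    return boxes


# 8 stripes for a 1920x1080 image, as in the workaround above.
for box in stripe_crops(1920, 1080, 8):
    print(box)
```

Because the boxes share edges and never overlap, the rendered stripes can be pasted back at their `top` offsets without any gaps.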

Thanks.

--
Brian

commorancy

A note about rendering separate crops. While I was rendering my image in small crops to avoid the thread crashing and population errors, I found another issue. The rendered crops do not retain exactly the same color balance as previously rendered crops. So, when you stitch the image back together, some of the crops end up with a slightly different color balance than others. This leaves noticeable stitch 'lines' between the various crops when joined into the final image. I found that I had to color correct some of the crops to get them to match other crops. The odd thing is, it didn't seem to affect all parts of the image. It affected, primarily, the haze color value. The foreground colors did not seem to be altered, as far as I could tell.

Just an FYI in case you want to try the crop + stitching process to avoid population errors and thread crashing.

--
Brian

Tangled-Universe

Quote from: commorancy on August 03, 2008, 10:46:05 AM
A note about rendering separate crops. While I was rendering my image in small crops to avoid the thread crashing and population errors, I found another issue. The rendered crops do not retain exactly the same color balance as previously rendered crops. So, when you stitch the image back together, some of the crops end up with a slightly different color balance than others. This leaves noticeable stitch 'lines' between the various crops when joined into the final image. I found that I had to color correct some of the crops to get them to match other crops. The odd thing is, it didn't seem to affect all parts of the image. It affected, primarily, the haze color value. The foreground colors did not seem to be altered, as far as I could tell.

Just an FYI in case you want to try the crop + stitching process to avoid population errors and thread crashing.

--
Brian

This is an old and well-known problem that has been around since the very beginning of TG2, and it is due to the GI. It has improved quite a lot, though. I only notice seams when rendering at high quality settings with a minimum GI of 2/2, and even then it is mostly an exception.

Oshyan

I'm going to have Matt come and take a look at these issues, but he's currently attending Siggraph so it may be a few days before he will be able to get to it.

- Oshyan

Tangled-Universe

Quote from: Oshyan on August 12, 2008, 03:33:53 AM
I'm going to have Matt come and take a look at these issues, but he's currently attending Siggraph so it may be a few days before he will be able to get to it.

- Oshyan

Wow, Matt's a lucky bastard ;D
I suppose he isn't giving a talk there about TG? At least, I couldn't find one quickly at CGSociety.

Martin

Oshyan

No presentations this year I'm afraid. ;D

- Oshyan

freelancah

Any news on this subject yet? I'm having the same problem rendering at 1920x1200. It always crashes at the end, but if I render in 4 different cropped pieces it works fine. The only problem is that the colours are not exactly the same on the cropped pieces, so I have to do some editing to knit the pieces together.

commorancy

Quote from: freelancah on September 28, 2008, 07:58:10 AM
Any news on this subject yet? I'm having the same problem rendering at 1920x1200. It always crashes at the end, but if I render in 4 different cropped pieces it works fine. The only problem is that the colours are not exactly the same on the cropped pieces, so I have to do some editing to knit the pieces together.

I'm still having this error myself. In fact, with a new scene I've found these additional errors (usually repeated 20+ times):

An unknown error occurred in TraceRay
An unknown error occurred in trBucketRender: RenderMore() ray-traced pass

It also appears that the thread with this error dies with:

An unknown error occurred in render thread (one for each thread that dies).

Then, the only way to resolve this issue is to close TG2 and reopen it.  If you don't restart TG2, the errors persist on subsequent renders.  Restarting TG2 appears to be the only way to clear out the state of the renderer (or whatever variable states are still out there).  This is a hassle because I have to regenerate heightfields manually and regenerate populations manually before I can render again.

Thanks.

freelancah

Here's a screenshot of what happens when the rendering fails: http://www.joce-servers.com/~freelancah/Error.jpg
I'm using a C2D Quad 9400 with 4 GB of memory. The same error occurs on my old C2D E6750, so I'm pretty sure it's not the system setup.

commorancy

Quote from: freelancah on October 02, 2008, 06:00:38 AM
Here's a screenshot of what happens when the rendering fails: http://www.joce-servers.com/~freelancah/Error.jpg
I'm using a C2D Quad 9400 with 4 GB of memory. The same error occurs on my old C2D E6750, so I'm pretty sure it's not the system setup.

It happens on my Quad 6600 too, also with 4 GB of RAM. However, in a previous forum thread it was stated that because TG2 is a 32-bit app, it can only address a maximum of 2 GB of memory anyway. My thought is that this may be a memory-related issue. It could simply be running out of memory while rendering. If so, then I'd like to see a 64-bit version of TG2. I'd be willing to install 64-bit Windows to get better performance and more application-addressable RAM. But only someone from Planetside can say whether this is the issue and whether a 64-bit version of TG2 would resolve it.
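As rough arithmetic behind the 2 GB figure: a 32-bit pointer can address 2^32 bytes (4 GiB), and 32-bit Windows by default keeps half of that for the system, leaving 2 GiB for the application no matter how much RAM is installed. The Python sketch below just works those numbers out. Whether TG2 is actually hitting this limit is not confirmed anywhere in this thread, and the 1.5 GiB scene size and 1 GiB overhead are made-up values for illustration.

```python
GIB = 1024 ** 3

ADDRESS_SPACE = 2 ** 32          # bytes reachable with a 32-bit pointer (4 GiB)
USER_SPACE = ADDRESS_SPACE // 2  # default user-mode half on 32-bit Windows (2 GiB)


def fits_in_user_space(scene_bytes, overhead_bytes=0):
    """True if the scene plus other allocations fits in the 2 GiB user space."""
    return scene_bytes + overhead_bytes <= USER_SPACE


print(ADDRESS_SPACE // GIB, USER_SPACE // GIB)
# Hypothetical sizes: a 1.5 GiB scene plus 1 GiB of other allocations.
print(fits_in_user_space(int(1.5 * GIB), overhead_bytes=GIB))
```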

BTW, those billowy clouds look amazing in the pre-render.  I'd like to see your render fully completed.

Thanks.