Rendering on 4 cores is faster than 8 when animating!

Started by cyphyr, April 19, 2011, 09:02:29 PM

Previous topic - Next topic

cyphyr

Just discovered something. Sorry if this is old news.

I knew the efficiency of Terragen's renderer falls off as it uses more cores (8 cores is not twice as fast as 4), but I had never applied this principle to animations.
So I made a very simple test. Using the Terragen benchmark scene I set up a frame sequence of two frames, using all 8 cores, and saved it as 8 cores.tgd. I made two more versions but set the cores to 4 (both max and min threads), set them to render frame 1 and frame 2 respectively, and saved them as 4_cores_1.tgd and 4_cores_2.tgd. I shut down Terragen and relaunched, loaded the first scene, 8 cores.tgd, hit Render Sequence and waited. 19.35 min later I had two frames rendered and saved. I then shut down and this time launched Terragen twice, arranged the windows side by side on my screen, loaded 4_cores_1.tgd and 4_cores_2.tgd into each one, once again hit Render, and waited. 15.16 min later I had two more frames rendered and saved.

That's nearly a 25% speed increase!

Now if there were a way to get the command line to launch two instances of Terragen (there may be, I simply don't know), this could be a significant boost to render farms. I've got a feeling it might be a bit more complex to implement, but for small farms where you could just save several incremental versions of a scene, it's more worthwhile.
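For what it's worth, a small launcher script could start the two instances side by side. This is only a sketch: the executable path below is hypothetical (substitute your own Terragen install), and it just builds the command lines for the two .tgd files from the test above.

```python
import subprocess

# Hypothetical install path -- substitute your own Terragen executable.
TG_EXE = r"C:\Program Files\Terragen 2\tgd.exe"

# One scene file per instance, as in the 4_cores_1/4_cores_2 test.
scenes = ["4_cores_1.tgd", "4_cores_2.tgd"]

# Build a command line per scene; each instance renders independently.
commands = [[TG_EXE, scene] for scene in scenes]

# Launching both at once would look like this (commented out since the
# path above is illustrative only):
# procs = [subprocess.Popen(cmd) for cmd in commands]
# for p in procs:
#     p.wait()
```

The same idea works on any OS that lets you pass a scene file as an argument; render farm managers do essentially this when they run multiple instances per machine.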

Cheers

Richard

ps just noticed this is my 2300th post  ;D
www.richardfraservfx.com
https://www.facebook.com/RichardFraserVFX/
/|\

Ryzen 9 5950X OC@4Ghz, 64Gb (TG4 benchmark 4:13)

Matt

Most render farm software has the ability to run multiple instances concurrently on the same machine.
Just because milk is white doesn't mean that clouds are made of milk.

jo

Hi Richard,

I've just checked this out. I can confirm your observation, however your interpretation is incorrect :-). That is to say, while you're right that in this case rendering 2 individual frames side by side is quicker than rendering two frames in sequence, you are wrong about why that's happening.

I tested this on OS X using my dual quad core machine (8 real cores + 8 hyperthreads). I was also using a newer build than the public release, in 64 bit mode. The reason I mention that is that multithreaded rendering is more efficient in the newer version on the Mac (bringing it into line with the Windows version), so if any Mac users try to reproduce this with the current public version (v2.2.something) they won't be able to.

In any case, here are my results:

Sequence with 16 threads: 18:20
Two frames at once with 16 threads: 12:46
Two frames at once with 8 threads: 14:33

Right off the bat we can see that 8 cores is not actually faster than 16 cores (in your case 4 vs 8). While 16 threads is not twice as fast as 8 it's still considerably faster. I'll talk a bit more about that later. First off I'll address why I think that it's quicker to render two frames at once instead.

TG2 divides renders up into buckets. Each bucket is assigned to a thread. By default TG2 creates one thread for each core on your CPU. When a thread finishes rendering a bucket it's assigned a new bucket if there are any remaining to be rendered. As the render gets closer to being finished there are fewer buckets available to be assigned to threads. When a thread has no buckets it stops. This means that as the render gets closer to being finished fewer and fewer threads are actually running. You can see this in the Task Manager/Activity Monitor by watching the CPU usage and thread count decrease as the render nears the end.

If you watch the benchmark scene you will see that the first 3/4 or so of it renders at a pretty steady pace. However it slows down dramatically when it gets to the lower right corner. I did quite a lot of work with this scene earlier in the year when I was investigating scaling performance. I had assumed that it was the black sphere which slows things down but the grass population also makes a big difference. If you turn on Ray trace objects and also Ray trace atmosphere the scene actually renders quite a lot quicker, with improved scaling.

In any case, that lower right corner really slows things down. As the rest of the image renders a lot quicker a big part of the render time is actually taken up by just a few threads, down to 2 from 16, working away on that part of the image. This is actually why it becomes faster to render two frames at the same time. When rendering in sequence only 2 threads are working on that part of the image and CPU usage is, let's say, 200%. However when you render two frames at the same time then 4 threads are working on that part of the image (2 per frame) and CPU usage is 400%. This means that on average more CPU is used while rendering the frames and therefore it finishes more quickly.
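Jo's tail-of-render argument can be illustrated with a toy scheduler: threads pull buckets from a shared queue, and one slow bucket leaves most threads idle at the end. The bucket times below are made up purely for illustration; this is a sketch of the idea, not how TG2's scheduler is actually implemented.

```python
import heapq

def makespan(bucket_times, n_threads):
    # Greedy bucket scheduler: each bucket goes to whichever thread
    # frees up first; returns the wall-clock time of the last thread.
    finish = [0.0] * n_threads
    heapq.heapify(finish)
    for t in bucket_times:
        free_at = heapq.heappop(finish)
        heapq.heappush(finish, free_at + t)
    return max(finish)

# One frame: 30 quick buckets plus one slow "lower right corner" bucket.
frame = [1] * 30 + [10]

one_frame = makespan(frame, 8)        # the slow bucket dominates the tail
two_in_sequence = 2 * one_frame       # second frame starts after the first
two_at_once = makespan(frame * 2, 8)  # idle threads pick up the other frame
```

With these made-up numbers the toy model gives 13 time units per frame, so 26 in sequence but only 18 when the two frames share the machine: the threads that would have sat idle behind the slow bucket are busy on the other frame instead, which mirrors the effect jo describes.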

So you're absolutely right that rendering 2 frames at the same time is quicker. However I think this will depend in large part on your scene. The reason this "works" is down to that slow part of the scene where few threads are rendering. If you had a scene which was more "balanced" you might find that rendering two frames was slower. Another aspect could be population time - if you weren't repopulating every frame then only populating once at the start of a sequence could be a win in terms of render time overall. That's all kind of educated guesswork though. I do have a stripped back version of the benchmark scene which is more suited to testing scaling and I will try this again with that. IIRC scaling with that scene is something like 25% better than with the original benchmark scene.

Now we get to the part where you're incorrect :-). I don't mean to seem rude here but people do have the impression that TG2 doesn't scale well and that isn't really the case, so I think it's important to dispel that myth. Earlier in the year I looked into scaling, prompted by the fact that the Mac version was actually getting a lot slower once you moved from real cores to hyperthreads. For example it would scale pretty well up to 8 cores on my machine but once it got past that it really slowed down with hyperthreads. I'm happy to say that's fixed for the next release and Mac and Windows versions have the same sort of performance.

I posted a graph in this message (where I also talk about scaling):

http://forums.planetside.co.uk/index.php?topic=11545.msg121102#msg121102

You can see from that graph that on my machine scaling at 8 cores still has a little way to go to match the ideal, but it's not as bad as I think people commonly believe. One of the alpha testers has a 12 core machine and he sees the same sort of scaling out to 12 real cores. Once you get on to hyperthreads performance still improves, but not as much. That's because hyperthreads are not real cores and they're not nearly as fast.

This is also demonstrated by my results. Two frames rendering at the same time with 16 threads is faster than using 8 threads. I think you made an incorrect assumption when you thought 4 threads would be faster than 8. I would be interested to see what happens if you tried this again but using 8 threads.

Touching on the render farm aspect, I think you reached the wrong conclusion that it would be faster to render 2 frames simultaneously on each render farm machine using half the number of threads. My results show it would be faster to render using the maximum number of threads available, even if you were running two instances (depending on the scene ;-).

Like I say, I don't mean to get down on you but I thought it was important to correct the idea that this has something to do with TG2 scaling poorly. There are still improvements which can be made, but it's pretty good. One thing I would like to see happen is the subdivision of buckets so that as the render nears completion the remaining buckets get divided up and more threads stay working for longer.

Congratulations on 2300 posts BTW :-). I notice that Kadri has also reached that milestone today, spooky!

Regards,

Jo

jo

Hi,

I've just repeated my tests using a stripped back version of benchmark. Specifically it's lacking the water and sphere and it has Raytrace objects and Raytrace atmosphere turned on. Here's the results:

Sequence with 16 threads: 8:37
Two frames at once with 16 threads: 7:38 (11.4% difference)
Two frames at once with 8 threads: 7:24 (14.1% difference)

This is a case where rendering was quicker with fewer threads, although not by much, 2.7%. Looking at the previous scene it was 10.3% in favour of using more threads, so you can see that the best route to go depends on the scene.
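As a sanity check, the percentage differences above follow directly from the mm:ss times. A quick Python check using jo's figures:

```python
def seconds(mmss):
    # Parse an "m:ss" render time into seconds.
    m, s = mmss.split(":")
    return int(m) * 60 + int(s)

def pct_faster(baseline, other):
    # How much faster `other` is than `baseline`, as a percentage.
    return round((seconds(baseline) - seconds(other)) / seconds(baseline) * 100, 1)

sixteen_threads = pct_faster("8:37", "7:38")  # two frames at once, 16 threads
eight_threads = pct_faster("8:37", "7:24")    # two frames at once, 8 threads
```

That reproduces the 11.4% and 14.1% figures quoted above, and the 2.7-point gap between them.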

Even with this stripped back scene it was still a decent win to render two frames individually on the same machine, so that certainly seems to be something worth investigating.

Regards,

Jo

Oshyan

Just to note, that benchmark was created quite a few years ago now, when dual core CPUs were still fairly new, and of course Raytrace objects and atmosphere were unavailable. I have intended to update it for some time and hope to do so in the next couple of weeks. It would be great if everyone ran the new benchmark when available to get some updated times and system specs.

- Oshyan

cyphyr

OK, thanks for the feedback guys. It was one of those ideas I'd had in the back of my mind for aeons but never got around to testing. I definitely believe it's always going to be highly scene (and system) dependent. I guess it would be difficult to create a render calculator that had any real meaning without doing physical tests on each machine and system separately (which could well take longer than the time ultimately saved).
Oh well it was a 2am posting  ;D ::)
Cheers
Richard
www.richardfraservfx.com
https://www.facebook.com/RichardFraserVFX/
/|\

Ryzen 9 5950X OC@4Ghz, 64Gb (TG4 benchmark 4:13)

ajcgi

We have a box in our farm with 4 machines of 24 cores each. Sending TG2 scenes to it is seemingly fastest in 3 sets of 8 cores, i.e. each machine picks up 3 frames off the list, then renders them in parallel on 8 cores each: cores 1-8 on one frame, 9-16 on the next, 17-24 on the last.
That's just an example, backed up with generic testing rather than sensible benchmarking. ;)

rcallicotte

So this is Disney World.  Can we live here?

Matt

Quote from: ajcgi on April 20, 2011, 07:31:21 AM
We have a box in our farm with 4 machines of 24 cores each. Sending TG2 scenes to it is seemingly fastest in 3 sets of 8 cores, i.e. each machine picks up 3 frames off the list, then renders them in parallel on 8 cores each: cores 1-8 on one frame, 9-16 on the next, 17-24 on the last.
That's just an example, backed up with generic testing rather than sensible benchmarking. ;)

That is what I would expect. As long as you have enough physical RAM to handle multiple instances without paging, multiple instances will probably give better efficiency than parallelism within one instance.

Not counting load times etc.

Matt
Just because milk is white doesn't mean that clouds are made of milk.

neuspadrin

Quote from: jo on April 19, 2011, 11:25:45 PM
TG2 divides renders up into buckets. Each bucket is assigned to a thread. By default TG2 creates one thread for each core on your CPU. When a thread finishes rendering a bucket it's assigned a new bucket if there are any remaining to be rendered. As the render gets closer to being finished there are fewer buckets available to be assigned to threads. When a thread has no buckets it stops. This means that as the render gets closer to being finished fewer and fewer threads are actually running. You can see this in the Task Manager/Activity Monitor by watching the CPU usage and thread count decrease as the render nears the end.

If you watch the benchmark scene you will see that the first 3/4 or so of it renders at a pretty steady pace. However it slows down dramatically when it gets to the lower right corner.

Here's a question/idea. Is it possible (perhaps sometime in the future) to split up the last buckets into multiple buckets? Just wondering if maybe the last bucket could be sub-divided into (number of cores) sections or something. This would allow all cores to remain working for longer and shorten the render time (it would probably only make a noticeable difference at high resolutions though).

Would that help any? Would it be possible?  Just an idea of how to help multi-core get an extra boost there at the end to remain consistent.  Those bottom corners always seem like they last forever ;)

Oshyan

Yes, subdividing remaining buckets is a possibility and would certainly help. Hopefully it's something we can implement in the future.
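A sketch of what subdividing a remaining bucket might look like. The 2x2 split and the minimum-size cutoff here are arbitrary illustrative choices, not how TG2 actually does (or would do) it:

```python
def subdivide(bucket, min_size=32):
    # Split an (x, y, width, height) bucket into four quadrants so idle
    # threads can share the remaining work; stop below a minimum size
    # where per-bucket overhead would outweigh the gain.
    x, y, w, h = bucket
    if w <= min_size or h <= min_size:
        return [bucket]
    hw, hh = w // 2, h // 2
    return [
        (x, y, hw, hh),
        (x + hw, y, w - hw, hh),
        (x, y + hh, hw, h - hh),
        (x + hw, y + hh, w - hw, h - hh),
    ]
```

Applied recursively as threads go idle, this would keep all cores busy on those slow final corners instead of leaving most of the machine waiting on one or two buckets.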

- Oshyan

Jonathan

Hi all,

This makes very interesting reading.

I am new to TG2, and would welcome some ideas on the ideal setup for the renderer (threads / cache size etc). I have a quad core 3.2GHz Intel-based PC with 8GB RAM and plenty of HD space.

I have already read some good topics regarding GI Detail / anti-aliasing settings etc, but would welcome views on what would constitute good settings to produce a high quality render.

Many thanks and keep the discussions coming!

Jonathan
Every problem is an opportunity, but there are so many opportunities it is a problem!

Tangled-Universe

Hi Jonathan,

Like you said, there's plenty of info on the forums about render settings, like the thread below, which should be obligatory reading for new TG2 users:
http://forums.planetside.co.uk/index.php?topic=6442.0

Each project/scene requires different render settings.
This is because the scale of the scene, the amount of displacement, and the type of clouds and atmosphere can all be different.
All kinds of things.

Therefore there's no "golden standard" of settings.

Cheers,
Martin

DutchDimension

Quote from: jo on April 19, 2011, 11:25:45 PM
TG2 divides renders up into buckets. Each bucket is assigned to a thread. By default TG2 creates one thread for each core on your CPU. When a thread finishes rendering a bucket it's assigned a new bucket if there are any remaining to be rendered. As the render gets closer to being finished there are fewer buckets available to be assigned to threads. When a thread has no buckets it stops. This means that as the render gets closer to being finished fewer and fewer threads are actually running. You can see this in the Task Manager/Activity Monitor by watching the CPU usage and thread count decrease as the render nears the end.

Interesting reading indeed.

On the case of the final buckets, wouldn't this problem be minimized to some degree by manually setting smaller render buckets? I know this can be a bit risky, as certain render engines perform better when they have the freedom to dynamically set their bucket sizes. However, in my experience with other engines, I've noticed that on certain scenes (though not all), smaller buckets can speed up render times, simply because as the frame-buffer is nearly filled up, more of the smaller buckets can fit into the remaining un-rendered areas, so more threads/cores contribute for longer. I suspect the efficiency of such a strategy depends (amongst other factors) on how the Terragen renderer deals with loading and off-loading of geometry. Is data dynamically loaded only when it's needed and flushed when the renderer is done with it?

I can't wait to test this out on my 12/24 core Mac Pro whenever I'm reunited with it again (long story). My MacBook Pro is feeling the strain. :-\

Tangled-Universe

In my experience, and I believe for some other testers as well, rendering often goes a bit faster with a 128x128 bucket size vs the default 256x256.
However, decreasing it further makes it slower than 128x128 (and maybe even slower than 256x256, but I don't know).

When it comes to rendering I believe TG2 does some dynamic loading, for procedural geometry for instance, but not for the instances of populations, as they're loaded into memory when populating at the start of the render.
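One reason smaller buckets can help at the tail of a render is simply that there are more of them to share out among the threads. A back-of-envelope count, using 1920x1080 purely as an example resolution:

```python
import math

def bucket_count(width, height, bucket_size):
    # Number of buckets needed to tile a frame of the given size
    # (edge buckets are partial, hence the ceilings).
    return math.ceil(width / bucket_size) * math.ceil(height / bucket_size)

at_256 = bucket_count(1920, 1080, 256)  # coarse buckets
at_128 = bucket_count(1920, 1080, 128)  # finer buckets, more work to share
```

At 256x256 that frame tiles into 40 buckets; at 128x128 it tiles into 135, so threads get more chances to stay busy near the end, at the cost of more per-bucket overhead, which fits the observation that 128x128 is often a bit faster but going smaller still starts to hurt.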