Help requested for Terragen render thread scalability test

Started by Tangled-Universe, August 29, 2019, 06:38:35 AM


Tangled-Universe

In the process of preparing a new system build for TG/3D, I'm on the fence about whether to wait for the Ryzen 3950X or for the 32-core Threadripper that will be released later.
I'm well aware that a lot of software is not capable of fully utilizing that many cores, but that differs for each piece of software.

I'd like to ask a favor of all owners of systems with 16 physical cores or more (preferably!): please run a scalability test for TG using the Terragen 4 benchmark scene.
I still have to think about how valuable and insightful it would be to also test SMT/hyperthreading at these core counts, but for now I think that is more relevant to CPU testing than to software testing, and I'd like to keep things as simple as possible to begin with.

I'd like to plot render times for rendering with 4, 8, 12, 16, 24, 28 and 32 physical cores. If such systems are available and people feel up to it of course!
You can do this by changing the max threads number to the same numbers as above, but please do not render with more threads than cores available in your system (see above).
As usual with benchmark testing, don't change anything else.

It would be great if around 8 people managed to do this, so I can plot averages and standard deviations.
Please report your CPU model and clock speeds, although they are not critical since I know how to normalize this kind of data, thanks to my day job in a lab.
However, you never know with Intel/AMD's differences, so it might be useful in the end.

Constructing such a curve would be very informative for people who are considering building a hefty render machine, but who still want to hit the sweet spot for price/performance.
At some point more cores are not helpful anymore and a potential waste of money.
I'm shooting for a system with 4 or 8 more cores than TG can handle so I can still do other relatively demanding tasks while rendering with optimal thread number.

Thanks in advance to everyone willing to invest time in helping me with this (I know I'm asking for quite a bit of it!!), and I'm happy to hear any discussion on this endeavour!

Cheers,
Martin

digitalguru

I'd be happy to help - I guess you'll provide a test .tgd for the exercise.

I have a 16 core Threadripper and like you, have heard of scalability issues on the newer AMDs with more cores.

As a side note, I recently tried a Realflow fluid sim on a 16 core / 32 thread 1950X and it ran surprisingly slowly. It was only when I restricted the threads to 16 that it ran at the speed I would have expected.

Tangled-Universe

Thanks!

Probably due to TL;DR effects you missed that I intend to use the Terragen 4 benchmark scene for it:

https://planetside.co.uk/terragen-4-benchmark/

Tangled-Universe

The reason I'd like to use the official benchmark is that whatever results come out of it, it will also somehow relate/translate to what's already in the database.

The issue with the database itself for investigating scalability is that each user renders the benchmark full throttle, of course!
But all these results differ because of different CPU builds/speeds, RAM builds/speeds and whatnot. These matter, but they tell you about the machine rather than about TG.
These factors all cancel out when you render on one machine at a 4-core baseline and then increase the core count.
In an ideal situation with each core-doubling render time should be half. Thus a straight line in the graph.
The 4 core base render time incorporates all these differences in CPU/RAM model/speed etc., so normalization of these variables happens all at once.

The only potential wrong assumption I make is that TG is scaling very well from 1 thread to 4 threads, but I don't want people to go through the long process of rendering the benchmark at 1 and 2 cores.
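
To make the normalization described above concrete, here is a minimal Python sketch, not part of the original post. The function names and the 20-minute baseline are assumptions chosen purely for illustration. It expresses the ideal case (render time halves with each doubling of threads, which plots as a straight line only on log-log axes) and an efficiency figure that compares a measured time against that ideal.

```python
def ideal_time(base_seconds, base_threads, threads):
    """Render time expected at `threads` if scaling from the baseline were perfect."""
    return base_seconds * base_threads / threads


def efficiency(base_seconds, base_threads, seconds, threads):
    """Observed speedup over the baseline divided by the ideal speedup (1.0 = perfect)."""
    observed_speedup = base_seconds / seconds
    ideal_speedup = threads / base_threads
    return observed_speedup / ideal_speedup


# Hypothetical machine, for illustration only: 20 minutes at 4 threads.
BASE_SECONDS = 20 * 60
for n in (4, 8, 16, 32):
    print(f"{n:>2} threads: ideal {ideal_time(BASE_SECONDS, 4, n):.0f}s")
```

Because every time is divided by the same machine's own 4-thread time, differences in clock speed and RAM drop out of the efficiency figure, which is the point of using a per-machine baseline.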

digitalguru

No probs - should take just over a couple of hours to run on my machine I think - I'll set up the files to run with a render manager (to get some accurate stats).

cyphyr

Here you go:

4 threads 17m 38s
8 threads 9m 43s
12 threads 7m 29s
16 threads 6m 46s
24 threads 6m 11s

64 threads 6m 06s (just let run un-changed)

Ryzen 9 3900X oc'd @ 4.15 GHz
64 GB RAM

There is a run-to-run variance of 40s or more on the benchmark (going by previous tests running at the default 64 threads).
Fastest I have had it render is 4m55s. Slowest at about 6m 40s with the same system settings.
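
As a rough sketch (not something posted in the thread), the times above can be turned into speedup and efficiency relative to the 4-thread run. The helper names are made up for this example. Note that the 3900X has 12 physical cores, so the 16 and 24 thread rows measure SMT rather than extra cores, and the run-to-run variance mentioned above applies to every row.

```python
import re


def to_seconds(text):
    """Convert a time such as '17m 38s' to seconds."""
    match = re.fullmatch(r"\s*(\d+)m\s*(\d+)s\s*", text)
    if match is None:
        raise ValueError(f"unrecognised time: {text!r}")
    minutes, seconds = map(int, match.groups())
    return minutes * 60 + seconds


# Times as posted above for the 12-core Ryzen 9 3900X.
posted = {4: "17m 38s", 8: "9m 43s", 12: "7m 29s", 16: "6m 46s", 24: "6m 11s"}
seconds = {n: to_seconds(t) for n, t in posted.items()}


def speedup_table(times, base=4):
    """Return (threads, seconds, speedup, efficiency) rows relative to `base` threads."""
    rows = []
    for n in sorted(times):
        speedup = times[base] / times[n]
        rows.append((n, times[n], speedup, speedup / (n / base)))
    return rows


for n, secs, speedup, eff in speedup_table(seconds):
    print(f"{n:>2} threads  {secs:>5d}s  speedup x{speedup:.2f}  efficiency {eff:.0%}")
```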
www.richardfraservfx.com
https://www.facebook.com/RichardFraserVFX/
/|\

Ryzen 9 5950X OC@4Ghz, 64Gb (TG4 benchmark 4:13)

Tangled-Universe

Thanks guys!

@cyphyr Richard, thanks for kicking off this test! :)
Your CPU is a 12-core part capable of SMT/hyperthreading. I will use your results up to 12 threads for the main purpose, since SMT/hyperthreading complicates matters quickly, but I definitely appreciate that you tried it!
I think SMT/hyperthreading is probably a large cause of the variance you describe.

Let's wait for more results, but yours are already interesting!

D.A. Bentley (SuddenPlanet)

Here is some additional data. If I understood correctly, you wanted SMT/Hyper-Threading turned off to see just the physical core performance differences, so I turned off SMT on my Threadripper 1950X and got these numbers rendering the TG4 Benchmark Scene:

16 Cores / 16 Threads:  7:01
12 Cores / 12 Threads: 9:28
 8 Cores /  8 Threads: 12:23
 4 Cores /  4 Threads: 22:56
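
One way to read numbers like these is through Amdahl's law, which nobody in the thread used, so treat the following as an illustrative assumption rather than the poster's method. It models render time as T(n) = T1 * ((1 - p) + p / n), where p is the fraction of work that parallelizes. The sketch below solves for p from the 4-thread and 16-thread times above and extrapolates to 32 threads. The function names are invented for this example, and the extrapolation ignores memory bandwidth, clock behaviour and anything else the simple model leaves out.

```python
def amdahl_parallel_fraction(t_base, n_base, t_n, n):
    """Solve Amdahl's law for the parallel fraction p from two timings."""
    r = t_base / t_n  # observed speedup between the two runs
    return (r - 1) / (r * (1 - 1 / n) - (1 - 1 / n_base))


def predict_time(t_base, n_base, p, n):
    """Predicted render time at n threads under the same Amdahl model."""
    t_one = t_base / ((1 - p) + p / n_base)  # implied single-thread time
    return t_one * ((1 - p) + p / n)


# Times from the post above, SMT off on a Threadripper 1950X.
t4 = 22 * 60 + 56   # 4 threads
t16 = 7 * 60 + 1    # 16 threads

p = amdahl_parallel_fraction(t4, 4, t16, 16)
print(f"estimated parallel fraction: {p:.3f}")
print(f"extrapolated 32-thread time: {predict_time(t4, 4, p, 32):.0f}s (speculative)")
```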

digitalguru

Here you go. Very interesting results, not what I'd expected, but they seem to follow a similar pattern to cyphyr's: (and just seeing D A Bentley's results similar with Hyperthreading turned on)

Machine details:

AMD Threadripper 1950X 16 cores / 32 threads - 3.7ghz over clock
96 GB Corsair DDR4 Vengeance LPX 3000MHz
Windows 10 Professional
Bios and Chipset drivers up to date
SMT (Hyperthreading) Enabled

Terragen 4.4.18 Frontier

Rendered via command line render queued in Deadline

Threads  Time    Peak RAM  CPU load
32       06m56s  7.884 GB  100% (control render)
16       08m51s  5.687 GB  51%
12       08m36s  5.249 GB  39%
8        11m51s  4.344 GB  26%
4        21m57s  3.238 GB  14%

Quote from: Tangled-Universe on August 29, 2019, 09:09:32 AM
In an ideal situation with each core-doubling render time should be half. Thus a straight line in the graph.

Obviously not from these results, it follows the curve from 4 to 8 threads, but surprised at the results from 12 to 16!

I seem to remember render times being a bit more predictable on the last project I rendered, but I wasn't using Defer all or Path tracing.

The test scene uses Defer All - I wonder if that's a factor. I might try the same test with the Terragen 3 benchmark to see if the results are similar.
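
To put a number on the 12 to 16 surprise, here is a small sketch (an illustration added for this write-up, with made-up helper names) that computes the gain for each step in thread count from the table above and flags any step where more threads came out slower. SMT was enabled for these runs, so threads were not necessarily placed on separate physical cores.

```python
# Times from the table above, in seconds (SMT enabled, 16 physical cores).
times = {
    4: 21 * 60 + 57,
    8: 11 * 60 + 51,
    12: 8 * 60 + 36,
    16: 8 * 60 + 51,
    32: 6 * 60 + 56,
}


def step_gains(times):
    """Return (from_threads, to_threads, gain) per consecutive pair; gain < 1 is a slowdown."""
    counts = sorted(times)
    return [(a, b, times[a] / times[b]) for a, b in zip(counts, counts[1:])]


for a, b, gain in step_gains(times):
    flag = "  <- slower with more threads" if gain < 1 else ""
    print(f"{a:>2} -> {b:<2}  x{gain:.2f} (ideal x{b / a:.2f}){flag}")
```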

D.A. Bentley (SuddenPlanet)

My results were with hyper-threading turned OFF (AMD calls it SMT).

Also used Terragen 4.4.18 Frontier and have 128GB RAM, but only 9GB was needed.

-Derek

digitalguru

Quote from: digitalguru on August 29, 2019, 02:22:42 PM
(and just seeing D A Bentley's results similar with Hyperthreading turned on)
I could have phrased that better :)

KlausK

Here are my results:

- 32 threads - 00h08m29s (16 cores, HyperThreading ON / all other sessions with HT OFF)
- 16 cores - 00h09m08s
- 12 cores - 00h11m22s
- 08 cores - 00h16m32s
- 06 cores - 00h20m30s
- 04 cores - 00h27m59s
- 02 cores - 00h51m54s

Intel (R) Xeon(R) CPU E5-2640v3 @2.60GHz / 2 Processors
Installed Memory RAM - 64GB
Windows 7 Ultimate SP1 64bit

I disabled the cores in the BIOS and so restarted every time before I rendered.

Hope that is useful for you.
Cheers, Klaus

ps: this test reminded me why I gave up on TG for a while when I only had a Quad-Core Pentium CPU (laptop) to work with.
TG3.x was sloooow but when I installed the Beta version of TG4.x it felt like it did not move at all.
I "saw the light again" after building this Dual Xeon machine a few years back...still happy with it for my needs.
/ ASUS WS Mainboard / Dual XEON E5-2640v3 / 64GB RAM / NVIDIA GeForce GTX 1070 TI / Win7 Ultimate . . . still (||-:-||)

Tangled-Universe

Thanks guys!!

Keep them coming please :) :) :) 

Also, please, keep hyperthreading/SMT out of the tests, unless curiosity is irresistible :P It will save you time as well.

@KlausK , why did you disable cores in the BIOS for these tests? If you have 16 physical cores and set max threads in the renderer to 12 threads then you will utilize 12 of your 16 cores. No BIOS tweaking needed I'd say, but thanks a lot for your effort and willingness to do this so exact!

@D.A. Bentley , thanks for running those tests again without SMT! Intel's hyperthreading and AMD's SMT aim to achieve similar things, but their implementations could perform differently and therefore potentially skew the results of these tests, because it would make TG seem more capable of running more threads on Intel or on AMD (I don't know which of the two has the better implementation). That would tell us more about the CPUs than about TG's capability of managing threads.

@digitalguru , your remarks about defer all/PT are interesting. I hope these aspects are not at play too much. As long as people render the same benchmark scene this is all fine. If it turns out that certain aspects of the renderer are multithreaded more/less efficiently then all we can say is that the render thread scalability data is valid for this benchmark.
Above all, it would mean that the renderer needs work, and the results so far suggest that's definitely needed. Let's wait and see.
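
For anyone who wants to follow along, this is a minimal sketch of the averaging described in the opening post, using only runs posted so far that stay at or below the machine's physical core count. The grouping and names are assumptions made for this example. Three machines is far too few for the standard deviation to mean much, so read the output as a placeholder until more results arrive.

```python
from statistics import mean, stdev

# Seconds per thread count, converted from the times posted in this thread.
runs = {
    "cyphyr (3900X, 12 cores)": {4: 1058, 8: 583, 12: 449},
    "D.A. Bentley (1950X, SMT off)": {4: 1376, 8: 743, 12: 568, 16: 421},
    "KlausK (2x E5-2640v3, HT off)": {4: 1679, 8: 992, 12: 682, 16: 548},
}


def normalised_speedups(runs, base=4):
    """Group each machine's speedup over its own `base`-thread time by thread count."""
    by_threads = {}
    for times in runs.values():
        for n, secs in times.items():
            by_threads.setdefault(n, []).append(times[base] / secs)
    return by_threads


for n, values in sorted(normalised_speedups(runs).items()):
    spread = stdev(values) if len(values) > 1 else float("nan")
    print(f"{n:>2} threads: mean x{mean(values):.2f}, sd {spread:.2f}, n={len(values)}")
```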

Thanks all again!

KlausK

"...but thanks a lot for your effort and willingness to do this so exact!" Exactly ;)
I restarted in between renders anyway so...
Cheers, Klaus

Oshyan

It would indeed be interesting to see if different render methods scale better than others. We can probably do this with the existing TG4 benchmark scene. It doesn't matter so much what the absolute time differences are (i.e. PT takes a lot longer than the Standard renderer), what matters is the relative difference between # of cores used. So we should be able to test Standard, Defer All, and PT all with the same scene.

I would say this is relevant to but separate from Martin's request here. If people are similarly willing to run some tests (I would limit the core tests even further than Martin probably, since 3 rendering methods need to be tested), then I will start a new thread to plan and discuss results. "Like" this post if you are interested in participating!

- Oshyan