Address Space Limitations of the AMD64/Intel64

Started by PabloMack, July 31, 2019, 03:45:01 pm


PabloMack

July 31, 2019, 03:45:01 pm Last Edit: July 31, 2019, 04:54:28 pm by PabloMack
While writing an assembler and linker for the AMD64, I learned that the instruction set can only directly reach across images of up to 2 GBytes. The PE32+ file format, which defines Microsoft's EXE format, has the same limitation. I just ran TG4 to see how big the image is, and it shows about 0.756 GBytes. I am running another application that shows 1.721 GBytes. That is almost knocking on the door of being too large for this architecture. What is going to happen if the TG application grows by a factor of 3? How is this image size limit going to be handled? Maybe it is planned obsolescence?
[attach=1]

Matt

Are you sure that is reporting anything to do with the instruction set? I think that's just the overall RAM use including data, right? The code portion of Terragen cannot be more than a few dozen MB, but I don't know how much scattering occurs at run time.
Just because milk is white doesn't mean that clouds are made of milk.

WAS

August 02, 2019, 03:51:18 am #2 Last Edit: August 02, 2019, 04:10:00 am by WASasquatch
A strange topic. Address space is what an application or process can access in memory. But this memory isn't just physical (and shouldn't be on modern OSes). The 2GB limitation is a remnant of older 32-bit OSes, and that's where Large Address Awareness comes in with 64-bit applications to allow up to 4GB of address space.

This is also just address space (addressable at a time) and not cached assets, where you can see memory allocated above 4GB.

A good example of this is my inability to implement proper caching (or to optimize my dirty code) for a shader exporter I am writing; it eats up all my memory (literally gigabytes of memory for a single shader at 12k) lol

Large-address-aware applications like games can load large chunks of open-world assets and stream them from cache too, dramatically speeding up loading times compared with pulling from the HDD.
Check out the Terragen Community on Facebook: https://www.facebook.com/groups/Terragen.Galleries/

WAS

August 02, 2019, 11:48:25 am #3 Last Edit: August 02, 2019, 11:50:38 am by WASasquatch
Also, I am not sure what you mean by images. Executables are not images. Images use sectors and headers because originally they were snapshots of actual HDDs. IMG, IMA, etc. have this info so they can be mounted as a medium.

PabloMack

August 02, 2019, 01:20:25 pm #4 Last Edit: August 02, 2019, 01:26:05 pm by PabloMack
Quote from: WASasquatch on August 02, 2019, 11:48:25 am
Also I am not sure what you mean by images? Executables are not images. Images use sectors and headers cause originally they were snapshots of actual HDDs. IMG, IMA, etc have this info so they can be mountable as a medium.


I am using Microsoft terminology. The PE32+ format is what is used for AMD64. I think it is called
PE32+ instead of PE64 because it only supports 32-bit images but was upgraded from PE32 so
that it can load them into a 64-bit address space. However, the specification is complicated and
the documentation is not well written, in my opinion. So it may be that loading an executable as
a list of sections instead of as a single image makes it possible to load programs much larger than
what will fit into a 32-bit section of address space. I can't say.

What I understand an image to be is a section of address space that can be reached using 32-bit
offsets. Since these offsets are signed (plus or minus), each direction only reaches 2 GBytes (not 4).
In other words, a reference can reach up to 2GB forward or 2GB backward. If the reference is at the
beginning of the image, it can only reach 2GB forward; reaching backward is never used because
there is nothing back there to reach. The same goes for a reference at the end of an image: it will
only reach backward because there is nothing forward of it to reach.
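
The reach described above is plain signed-integer arithmetic, so it can be checked with a few lines of Python. This is only a sketch: `fits_disp32` and `image_base` are names made up for illustration, and it measures the offset from the reference itself, whereas a real relative reference is measured from the end of the instruction.

```python
# Range of a signed 32-bit offset, in bytes.
DISP32_MIN = -(2**31)      # furthest backward reach
DISP32_MAX = 2**31 - 1     # furthest forward reach

GIB = 2**30


def fits_disp32(from_addr: int, to_addr: int) -> bool:
    """Return True if to_addr is reachable from from_addr with a signed 32-bit offset."""
    return DISP32_MIN <= to_addr - from_addr <= DISP32_MAX


image_base = 0x140000000   # hypothetical load address, for illustration only

# A reference at the start of the image reaches just under 2 GiB forward...
print(fits_disp32(image_base, image_base + 2 * GIB - 1))
# ...but not a full 2 GiB.
print(fits_disp32(image_base, image_base + 2 * GIB))
# A reference 2 GiB into the span can still reach back to the start.
print(fits_disp32(image_base + 2 * GIB, image_base))
```

The asymmetry (one byte less forward than backward) is just two's complement; it does not change the overall 2 GByte picture.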

The image includes the program plus the statically defined data, which includes globals and constants.
Dynamic memory (called the heap in C/C++) is another matter; it is implemented using
RAM that is not part of the image. So it may be that the TG memory usage I see includes not only
the image but also dynamic memory, which grows and shrinks depending on what TG is doing. So I don't
think TG has to worry about its executable becoming too large, because the image is certainly a lot smaller
than 2GBytes. The TGD.EXE file shows to be 488KB. That, plus any DLLs that are used, is probably a
good approximation of what TG uses statically. Sorry if I alarmed anyone.

Here is a picture I took out of a document that describes the PE32 file format. It uses the term "image"
in two places. PE32+ is similar, but it has some fields extended to 64 bits to support a large address
space. Images are still limited to 2GBytes, though.
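
To make the difference between the two header layouts concrete, here is a rough Python sketch that reads the optional-header magic, ImageBase and SizeOfImage. The field offsets follow the published PE layout as I understand it (ImageBase widens to 64 bits in PE32+, SizeOfImage stays a 32-bit field in both). The function names are invented for this example, and the "file" is a synthetic header-only buffer, not a real executable.

```python
import struct

PE32_MAGIC = 0x10B       # optional-header magic for PE32
PE32PLUS_MAGIC = 0x20B   # optional-header magic for PE32+


def read_pe_summary(data: bytes) -> dict:
    """Return the optional-header magic, ImageBase and SizeOfImage of a PE file."""
    if data[:2] != b"MZ":
        raise ValueError("not an MZ executable")
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)       # e_lfanew
    if data[pe_offset:pe_offset + 4] != b"PE\0\0":
        raise ValueError("PE signature not found")
    opt = pe_offset + 4 + 20          # skip the signature and the 20-byte COFF header
    (magic,) = struct.unpack_from("<H", data, opt)
    if magic == PE32PLUS_MAGIC:
        (image_base,) = struct.unpack_from("<Q", data, opt + 24)   # 64-bit field
    elif magic == PE32_MAGIC:
        (image_base,) = struct.unpack_from("<I", data, opt + 28)   # 32-bit field
    else:
        raise ValueError(f"unexpected optional-header magic {magic:#x}")
    (size_of_image,) = struct.unpack_from("<I", data, opt + 56)    # 32-bit in both formats
    return {"magic": magic, "image_base": image_base, "size_of_image": size_of_image}


def make_fake_pe32plus(image_base: int, size_of_image: int) -> bytes:
    """Build a header-only buffer for demonstration; it is not a loadable executable."""
    buf = bytearray(0x40 + 4 + 20 + 64)
    buf[0:2] = b"MZ"
    struct.pack_into("<I", buf, 0x3C, 0x40)
    buf[0x40:0x44] = b"PE\0\0"
    opt = 0x40 + 4 + 20
    struct.pack_into("<H", buf, opt, PE32PLUS_MAGIC)
    struct.pack_into("<Q", buf, opt + 24, image_base)
    struct.pack_into("<I", buf, opt + 56, size_of_image)
    return bytes(buf)


info = read_pe_summary(make_fake_pe32plus(0x140000000, 0x80000))
print(hex(info["magic"]), hex(info["image_base"]), hex(info["size_of_image"]))
```

To try it on a real file, pass the bytes of an EXE or DLL to `read_pe_summary` instead of the synthetic buffer.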

The reason why Windows32 needs 4GBytes is so that it can look at system space at the same time
as it looks at user space, each being 2GBytes in size.

[attach=1]

WAS

August 02, 2019, 03:18:05 pm #5 Last Edit: August 02, 2019, 03:29:02 pm by WASasquatch
I see what you mean about images now. I spend too much time in Linux. Lol

Seems PE+/PE32+ is a .NET PE 64-bit extension specification introduced for Windows CE. Didn't AMD introduce the 64-bit architecture first, and Intel follow suit with the modified NetBurst? Not sure about the hearts of the CPUs these days, but during the Athlon battle days, AMD64 was considered "true 64-bit": AMD64 came out of R&D, while Intel's was a modification, and likely RE.

What you are saying doesn't seem to be true; otherwise there are many programs I shouldn't be able to run on AMD64, let alone my KVMs... :O

From researching the "2 Gigabyte Problem" which gave us PAE, there is no mention of AMD64 as any problem. When the 4GT flag is set in the BIOS, a 64-bit executable is Large Address Aware and can use "3GB" (seems odd) -- while a 32-bit process that has the Large Address Aware flag can use 4GB, up from 2GB.

I can't actually find anything about AMD64 being limited to 2GB, and PE32+ relates to .NET PE and a 64-bit implementation.

A good game example for this problem is the Morrowind Graphics Overhaul. It requires 4GB of address space to run, and uses the Large Address Aware patch to give the Morrowind executable the LAA flag, so that it can then use 4GB of address space. Also probably my favorite game of all time. Just throwing that out there...

WAS

Also, what you are saying just confuses me, as it would mean AMD64, a breakthrough, is really no different from x86? Utilizing some sort of hack that even LAA 32-bit processes don't need to do. I'm just confused.  :o

PabloMack

August 02, 2019, 05:12:11 pm #7 Last Edit: August 02, 2019, 05:58:19 pm by PabloMack
Quote from: Matt on August 02, 2019, 01:39:10 am
I think that's just the overall RAM use including data, right? The code portion of Terragen cannot be more than a few dozen Mb, but I don't know how much scattering occurs at run time.


I think that is all correct.

Quote from: WASasquatch on August 02, 2019, 03:38:06 pm
Also what you are saying just confuses me as it would mean AMD64, a breakthrough is really no different than x86? Utilizing some sort of hack that even LAA 32bit processes don't even need to do. I'm just confused.  :o


Well...I'm confused too so we're in the same boat. The x86 is a big mess and I don't think anyone who knows this architecture doubts that.

When the x86-32 became x86-64 (AMD64), the main thing that happened was that it now had a 64-bit address space. That is not a trivial improvement. That is a big improvement. But its addressing modes changed very little. Programs that are linked as one image still only have a choice between 8-bit and 32-bit (signed) offsets. That's all. You can't build an image so large that references within the image can't be resolved. This is what caps the maximum size of your program at 2 GBytes. Also, keep in mind that the program space pointed to by the code segment can be in a different 32-bit space from the data space pointed to by the data segment. Initialized data is part of the image. So Code Space plus Data Space together can be up to 4GBytes.

Building an image is a pre-load (link time) process. This means that all references within the image must be resolved, and there are no lingering references that know anything about anything outside of its 32-bit address space. I don't think this includes DLLs because they don't link until needed (that's what's meant by Dynamic, the 'D' in DLL). But this does not say that you can't do post-load (i.e. run time) address calculations to reach more than that. This is RAM, it is uninitialized, and it is outside the image. It will still show up as memory usage, and this part of memory can become very large. When a program like TG loads such things as texture maps and geometries, these are not part of its image. But they are part of your process's memory. So when you see program usage in the Task Manager, it doesn't break usage down into image/segment and dynamically allocated memory. It just shows you the total.

WAS

August 02, 2019, 06:44:25 pm #8 Last Edit: August 02, 2019, 06:55:44 pm by WASasquatch
That sorta defeats the purpose of the logic and, again, of what LAA does, which is making THAT process 4GB aware. I guess the 3GB (mentioned above) actually comes from a 32-bit process (and OS) with the /3GB switch enabled (https://docs.microsoft.com/en-us/previous-versions/tn-archive/bb124810(v=exchg.65)).

I did find this interesting Microsoft blog post by ASP.NET Debugging regarding processes. And I do see a limitation imposed on .NET applications.


"2800 MB if using a 4 GB process or more if more RAM (around 70% of RAM + Pagefile)"

Further going on to say

"Keep in mind that although a .NET process can grow this large, if the process is multiple GB in size, it can become very difficult for the Garbage Collector to keep up with the memory as Generation 2 will become very large.  I'll talk about the generations more in an upcoming post."

https://blogs.msdn.microsoft.com/tom/2008/04/10/chat-question-memory-limits-for-32-bit-and-64-bit-processes/

I assume that "4 GB process" refers to an LAA process. It does mention, though, that it can go above this like a normal 64-bit process if you have the RAM available. It does look like the offset for that is pretty crazy at 70% plus a Pagefile to match (I already do this from the get-go for best Windows performance. Had many convos here about that).

In the end though, this seems to be a limitation of .NET, not AMD64/Intel64.

This is actually enlightening information about the stability of some .NET programs that grow in memory.


PabloMack

August 02, 2019, 07:49:01 pm #9 Last Edit: August 02, 2019, 08:44:16 pm by PabloMack
I'm pretty much talking about hardware. There are software systems layered over the hardware for gaming and such that can manage additional memory beyond the 4GByte image. These software systems have their own switches, parameters and metrics to do their own work. A lot of this kind of thing is hidden in their software layers. But the bottom line is, ultimately, software can only do what the hardware can do. What is hidden in the software layers is out of sight. The IBM AS/400 got to be very good at hiding how it did things by stacking layer upon layer. There is nothing to prevent a program from loading additional executable code into many places in that huge 64-bit address space to make a program that goes far beyond a single image. But this comes at additional cost: more calls to the OS, link-loading other files, and using "manual" address calculations to get to those other places in memory. This kind of work-around is common in the Microsoft world. Remember memory extenders in DOS? It's very nasty stuff. It seems it's déjà vu all over again.

WAS

Hardware is what I am talking about. The hardware switches for 3gb and 4gb allow SINGLE 3gb/4gb address calls.

Nowhere can I find that 64-bit processes are limited to 2GB draws, for AMD64 or otherwise. The only limitation is in .NET. You mentioned PE32+, but it's an extension for Windows CE, a different breed from WinXP or otherwise.

The limitations you describe seem to fall in line with .NET memory limitations exclusively.

Also, you can explore the sub-processes of a process to see everything that is linked. Expand the process in Task Manager or view it in Process Hacker.

PabloMack

August 03, 2019, 11:37:57 am #11 Last Edit: August 03, 2019, 12:05:37 pm by PabloMack
Quote from: WASasquatch on August 02, 2019, 10:35:08 pm
Hardware is what I am talking about. The hardware switches for 3gb and 4gb allow SINGLE 3gb/4gb address calls.


The document you showed me was old and pre-AMD64. What you are talking about deals with how a 32-bit operating system can use a 32-bit address space (in a pre-AMD64 system) to map in system and user space. I believe the way both Windows and Linux work is that when the OS is in Supervisor Mode, it can see both user space and system space. This makes it easier to process requests because it can directly see what the user mode sees while simultaneously seeing its own resources.

In a system call, the user passes an address to the supervisor in a request for some service to be provided. The user will always pass an address that has meaning in its own user space because it doesn't have direct access to supervisor space. If a user tries to access the space that is mapped for use by the supervisor, it will cause a fault and the process will be aborted. This is because the MMU will not allow the processor to make an access that is protected for system use only. But when in supervisor mode, the supervisor can access both its own space and the user's space. In the first implementation of NT, 2GBytes of address space were used for mapping in system resources while the other 2GBytes were used to map in user memory. Later on, when many user mode programs needed access to more memory, the partitioning was changed to give more to the user and less to the operating system. In order to provide a user mode process with more than 2GB of address space, a special setup was used where the system makes do with only 1GByte for itself, leaving 3GBytes for user programs. This is not so much a "hardware switch". It is the way the operating system configures the MMU (which is hardware) for simultaneous use by the system and the user.
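
The two partitionings described above work out to the following address ranges. This Python sketch is idealized: it treats the boundary as an exact multiple of 1 GByte and ignores any guard regions a real OS reserves. `split_address_space` is a name invented for the example.

```python
GIB = 2**30
ADDRESS_SPACE = 4 * GIB   # everything a 32-bit pointer can name


def split_address_space(user_gib: int) -> dict:
    """Return inclusive (low, high) address ranges for the user and system parts."""
    if not 1 <= user_gib <= 3:
        raise ValueError("user share must be 1, 2 or 3 GBytes in this sketch")
    user_top = user_gib * GIB
    return {
        "user": (0, user_top - 1),
        "system": (user_top, ADDRESS_SPACE - 1),
    }


for label, user_gib in (("default 2 GB / 2 GB", 2), ("3 GB / 1 GB setup", 3)):
    for owner, (low, high) in split_address_space(user_gib).items():
        print(f"{label:>20} {owner:>6}: {low:#010x} - {high:#010x}")
```

In both cases the total is still 4 GBytes; only the boundary between the two halves moves.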

Quote from: WASasquatch on August 02, 2019, 10:35:08 pm
No where can I find that 64bit processes are limited to 2gb draws, for AMD64 or otherwise. The only limitation is in .NET. like you mentioned PE32+  but it's a extension for Windows CE, a different breed to WinXP or otherwise.


Let me also say that "processes" are not hardware but are software entities created and managed by the operating system.

What do you mean by a "draw"? Also, I've never used .NET but it has to run on the same hardware as everything else. Same applies to Windows-CE. That is just software. They use the same processors as regular Windows. But it uses the hardware differently by configuring memory management differently.

Quote from: WASasquatch on August 02, 2019, 10:35:08 pm
Also, you can explore sub processes of processes to see what is all linked. Expand the process in task manager or view it in process hacker.


This may be the case. But still, if different sub-processes are running within the same "process" and together they can grow to be very much larger than 4GB, then in effect each sub-process has its own "image". So we are talking about multiple images now, and multiple 4GB images can reside within the same 64-bit address space at the same time. This is something that Windows-32 or Linux-32 could not do because all they had was a 32-bit address space. But still, an instruction that is executing within one of the images can't directly reach a resource that is located within another image using one of the standard addressing modes defined in the ModRM byte of the instruction. The offsets are only 8 or 32 bits, and there are no 64-bit offsets in any of the addressing modes. You can load a 64-bit value into a register and then use it as an address, but this is not a 64-bit addressing mode.
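
As a rough illustration of that last point, the Python sketch below assembles the two-step sequence by hand: `mov rax, imm64` carries the full 64-bit address as an immediate, and the following `mov rbx, [rax + disp32]` only has room for a signed 32-bit displacement. The helper names and the target address are made up for the example; the byte values are the standard encodings as far as I know them.

```python
import struct


def mov_rax_imm64(value: int) -> bytes:
    """Encode `mov rax, imm64`: REX.W prefix, opcode B8, 8-byte immediate."""
    return b"\x48\xB8" + struct.pack("<Q", value)


def mov_rbx_from_rax_disp32(disp: int) -> bytes:
    """Encode `mov rbx, [rax + disp32]`: REX.W, opcode 8B, ModRM 98, 4-byte signed displacement."""
    # struct.pack raises struct.error if disp does not fit in 32 signed bits,
    # which mirrors the limit of the addressing mode itself.
    return b"\x48\x8B\x98" + struct.pack("<i", disp)


far_target = 0x00007FF612345678   # hypothetical address far outside the current image

code = mov_rax_imm64(far_target) + mov_rbx_from_rax_disp32(0x10)
print(code.hex(" "))
```

The first instruction is 10 bytes long and the second is 7, so reaching a far address this way costs extra code and a register compared with a single relative reference.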

It would be like one guy says "You can't kill anyone with an unloaded gun". Then to prove that he can, another guy takes the empty gun and hits someone over the head with it and kills him. So you can kill someone with an empty gun. You just can't shoot someone with it to kill him. These conversations are all about how you say things and what you mean by saying them. I think in some ways we are defining our words differently. For example, the kind of "linking" you are talking about may be different from what I meant by the word. Sometimes in computer science, the same words are used for very different things. When people talk about "linked lists" they are not talking about what a "linker" does. These are completely different things.

So it is no wonder that we sometimes disagree because we don't realize we are talking about different things in the first place or define our words to mean different things. To me a "hardware switch" is a flip-flop where you write a "1" into it and the way it controls other hardware is changed. To you, it might be like a command-line "switch" using terminology used in command line interpreters that somehow configure things behind the scenes so that the software works differently somehow.

WAS

August 03, 2019, 12:36:24 pm #12 Last Edit: August 03, 2019, 12:59:52 pm by WASasquatch
Uhmm. AMD64 was released 4 years earlier, in 2004 (which is why the article is about 64-bit and 32-bit OSes; and R&D finished in 2001). I've had AMD64 since 2004 (with the new 2005 model HP Media Center, as the AMD64 allowed unrivaled performance at the time). I'm not an Intel guy. Haven't had one I wanted since my Toshiba in 97 with the 233MHz MMX. The other article is maintained, as of 2014.

And yes, processes are not hardware, but their hardware-specific flags are for specific use of hardware functionality...

Can you provide any concrete evidence of this limitation outside of .NET software?

PabloMack

August 03, 2019, 01:54:37 pm #13 Last Edit: August 03, 2019, 02:24:38 pm by PabloMack
Quote from: WASasquatch on August 03, 2019, 12:36:24 pm
Can you provide any concrete evidence of this limitation outside of .NET software?


This is taken from the AMD64 specification, Volume 3, in the discussion of the MOV instruction,
which is the only instruction that can even do a 64-bit access. And these forms can only
do so using register RAX. So any memory access using any of the other instructions or any
other register is limited to a 32-bit reach.

This is the quote:
"Opcodes A0-A3, in 64-bit mode, are the only cases that support a 64-bit offset value.
(In all other cases, offsets and displacements are a maximum of 32 bits.) The B8 through
BF (B8 +rq) opcodes, in 64-bit mode, are the only cases that support a 64-bit immediate value
(in all other cases, immediate values are a maximum of 32 bits)."

The reason why code with only 32-bit offsets can operate anywhere within 64-bit address
space is because these addresses are added to the bases of the segments that are associated
with the registers doing the access. These 64-bit bases are managed by the operating
system and are not directly handled by the application.

The following are the instruction encodings as taken from the same section about MOV in volume 3.
Keep in mind that they don't all apply to 64-bit mode:

Mnemonic                Opcode  Description
MOV AL, moffset8        A0      Move 8-bit data at a specified memory offset to the AL register.
MOV AX, moffset16       A1      Move 16-bit data at a specified memory offset to the AX register.
MOV EAX, moffset32      A1      Move 32-bit data at a specified memory offset to the EAX register.
MOV RAX, moffset64      A1      Move 64-bit data at a specified memory offset to the RAX register.
MOV moffset8, AL        A2      Move the contents of the AL register to an 8-bit memory offset.
MOV moffset16, AX       A3      Move the contents of the AX register to a 16-bit memory offset.
MOV moffset32, EAX      A3      Move the contents of the EAX register to a 32-bit memory offset.
MOV moffset64, RAX      A3      Move the contents of the RAX register to a 64-bit memory offset.
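
For a feel of what the RAX forms look like as bytes in 64-bit mode, here is a small Python sketch. It assumes the default 64-bit address size, so the memory offset is a full 8 bytes, and uses the REX.W prefix (48) to select the 64-bit operand. The helper names and the address are invented for the example.

```python
import struct

REX_W = 0x48   # prefix that selects a 64-bit operand size


def load_rax_moffs64(address: int) -> bytes:
    """Encode `mov rax, [moffset64]`: REX.W, opcode A1, 8-byte absolute offset."""
    return bytes([REX_W, 0xA1]) + struct.pack("<Q", address)


def store_rax_moffs64(address: int) -> bytes:
    """Encode `mov [moffset64], rax`: REX.W, opcode A3, 8-byte absolute offset."""
    return bytes([REX_W, 0xA3]) + struct.pack("<Q", address)


addr = 0x0000000180000000   # hypothetical absolute address above 4 GBytes

for name, encoding in (("load ", load_rax_moffs64(addr)),
                       ("store", store_rax_moffs64(addr))):
    print(name, encoding.hex(" "), f"({len(encoding)} bytes)")
```

Each instruction comes out at 10 bytes, most of which is the address itself, which gives some idea of why these forms are the exception rather than the rule.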

WAS

August 03, 2019, 03:45:31 pm #14 Last Edit: August 03, 2019, 03:49:22 pm by WASasquatch
And how does this relate to an AMD64 2GB limit on address space? I'm still confused about what you're explicitly talking about. What you started this topic about doesn't seem to be an issue anywhere in development space that I can find besides .NET, which is not an AMD64-related issue.

The problem you introduce seems to nullify the point of LAA.

For reference, here is Volume 3 (pp. 226-227): https://www.amd.com/system/files/TechDocs/24594.pdf