Dec 2, 2023

My first stack overflow.

Stack Overflow.

StackOverflow is a website you almost certainly know if you are a programmer. You might be an actual user of it, posting questions and answers. You might interact with it simply through Google, finding it is often the first suggestion for your progamming questions. You might also be the kind of gatekeeper that people frequently complain about when it comes to the site.

What is a stack.

The wikipedia definition of a stack is a reasonable one, but I’m going to look at a more concrete definition, as that is when I first discovered how the stack works. It is also important to note that THE stack in computing is quite distinct from A stack. You can create and use as many stacks as you like in a computer program, but every process has a single THE stack.

Stacks in Python.

A queue you can only access at one end, where the last element in must be the first element out (LIFO) is a stack.

Python has a standard library function for this kind of queue.

import queue

stack = queue.LifoQueue(maxsize=4)

for i in range(8):
    print(f"Adding {i} to the stack")
    stack.put(i)
    print(f"Stack is now {stack.queue}")

When run this produces.

Adding 0 to the stack
Stack is now [0]
Adding 1 to the stack
Stack is now [0, 1]
Adding 2 to the stack
Stack is now [0, 1, 2]
Adding 3 to the stack
Stack is now [0, 1, 2, 3]
Adding 4 to the stack

Note the program now hangs here. These kind of queues in Python are designed to be used to in multithreaded programs, and it is designed to just block and wait when the stack is full.

Our own overflowable stack.

If we want to, we could build our own stack function that errors if it is full or empty. Something like.

import typing

class StackUnderflow(Exception):
    pass

class StackOverflow(Exception):
    pass

class Stack:
    def __init__(self, maxsize: int = None, contents: typing.Sequence | None = None) -> None:
        self.maxsize = maxsize
        if not contents:
            self.contents = []
        else:
            self.contents = list(contents)

    def push(self, element: any):
        print(f"Trying to add {element} to stack with contents {self.contents}")
        if len(self.contents) == self.maxsize:
            raise StackOverflow
        self.contents.append(element)
        
    def pop(self) -> any:
        if len(self.contents) == 0:
            raise StackUnderflow
        ret = self.contents.pop()
        print(f"Removed {ret} from stack with contents {stack.contents}")
        return ret

stack = Stack(maxsize=4)
stack.push("A")
stack.push("B")
stack.pop()
stack.push("C")
stack.push("D")
stack.push("E")
stack.push("F")

Which produces output

Trying to add A to stack with contents []
Trying to add B to stack with contents ['A']
Removed B from stack with contents ['A']
Trying to add C to stack with contents ['A']
Trying to add D to stack with contents ['A', 'C']
Trying to add E to stack with contents ['A', 'C', 'D']
Trying to add F to stack with contents ['A', 'C', 'D', 'E']
Traceback (most recent call last):
  File "stack.py", line 37, in <module>
    stack.push("F")
  File "stack.py", line 20, in push
    raise StackOverflow
__main__.StackOverflow

This is a pretty close analog to how THE stack works. It has a fixed size, and an error is produced when the stack is too full.

To overflow a real process stack though, we will need a lower level programming language.

Call paths in C.

Imagine I write code like this, in C.

#include <stdio.h>

void foo() {
    printf("Running Foo\n");
}

void bar() {
    printf("Running Bar\n");
}

int main() {
    printf("Starting Main\n");
    foo();
    bar();
    printf("Ending Main\n");
    return 0;
}

It is hopefully pretty obvious this produces.

Starting Main
Running Foo
Running Bar
Ending Main

The code execution path is simple enough to follow. The lines of main() get executed one at a time, with detours into foo() and bar(). If you indent calls, it would look like.

print "Starting Main"
CALL foo()
    print "Running Foo"
CALL bar()
    print "Running Bar"
print "Ending Main"

We could rearrange this code to produce the same output, but with a different path, one that has two indents in it. This would be

#include <stdio.h>

void foo() {
    printf("Running Foo\n");
}

void bar() {
    foo();
    printf("Running Bar\n");
}

int main() {
    printf("Starting main\n");
    bar();
    printf("Ending main\n");
    return 0;
}

This subtle change moving the call for foo() inside bar() means we produce the same output as before, but the call path now looks like.

print "Starting Main"
CALL bar()
    CALL foo()
        print "Running Foo"
    print "Running Bar"
print "Ending Main"

What is going on.

If you throw this program into Compiler Explorer, you get to see how the code is actually assembled. When compiled to AMD64 / x86-64 Assembler, you get this code.

.LC0:
        .string "Running Foo"
foo():
        push    rbp
        mov     rbp, rsp
        mov     edi, OFFSET FLAT:.LC0
        call    puts
        nop
        pop     rbp
        ret
.LC1:
        .string "Running Bar"
bar():
        push    rbp
        mov     rbp, rsp
        call    foo()
        mov     edi, OFFSET FLAT:.LC1
        call    puts
        nop
        pop     rbp
        ret
.LC2:
        .string "Starting main"
.LC3:
        .string "Ending main"
main:
        push    rbp
        mov     rbp, rsp
        mov     edi, OFFSET FLAT:.LC2
        call    puts
        call    bar()
        mov     edi, OFFSET FLAT:.LC3
        call    puts
        mov     eax, 0
        pop     rbp
        ret

Here we can see the code uses an assembly memonic CALL.

The documentation here goes fairly deep, but the description is perfect for what we need to know here.

Saves procedure linking information on the stack and branches to the called procedure specified using the target operand.
The target operand specifies the address of the first instruction in the called procedure. The operand can be an immediate value,
a general-purpose register, or a memory location.

While a little more complex to follow than the simplified description I wrote before, we can see what is going on here by reading the assembly code. We have main, which does “some stuff”, and then runs call puts. This is what prints things to the screen, so every instance of call puts is a printf in the original C. Then, the call foo() and call bar() are equivilent to the bare foo() and bar() in the original C.

Reading that documentation on CALL again more carefully, you can see it talks about put things on the stack, but what uses those things on the stack. Well, that is the ret memonic. Reading the important part of that description, we find.

Transfers program control to a return address located on the top of the stack.

A slightly simplified explanation.

I’m not going to get into the details of addressing modes, near vs far return, or swapping ring levels as there is a lot of depth and detail here and this is just a summary explanation of what is happening.

But, every use of CALL puts an address on the process stack and moves execution into the function. At the end of every function, RET is called, the address CALL put onto the stack is taken and put as the next instruction to run.

So here we can see the stack grows as we call nested functions, and empties once we are at the outermost indent level in code, or in the main() function in the case of a C program.

So what exactly is the stack.

This varies from OS to OS, but on Linux, every process has a fixed size stack. This is where this data gets put when CALL gets executed inside it. You can find what this size is with the ulimit command. On my system running Debian 12, this was 8192 kilobytes, and you can verify that with ulimit -s.

Overflowing intentionally.

So, lets make something that on purpose fills the stack. How ?. Simple, we will just write a C program that calls the same function over and over again, and see how it fails.

#include <stdio.h>

int call_depth = 0;

void loop() {
    printf("call depth is %i\n", call_depth);
    call_depth++;
    loop();
}

int main() {
    loop();
    return 0;
}

And this program does.

call depth is 0
call depth is 1
call depth is 2
...
call depth is 523781
call depth is 523782
call depth is 523783
Segmentation fault

As expected, we can only do this a finite number of times before we fail. The fact we can do it 523,783 times gives us an idea of how much CALL is putting onto the stack. We have 8192kilobytes of stack space, so 8192 * 1024 = 8,388,608 bytes of space. If we assume the stack is empty, and all it is used for is storing these return values, we can estimate we need 8388608 / 523783 = 16.01 bytes per stack entry. So it seems safe to estimate we are putting 16 bytes on the stack for each CALL.

Can we find somewhere to prove this ?. Yes, this is the C x64 architecture calling convention. The minimum stuff that gets pushed on to the stack is 8 bytes of return address, and 8 bytes for data to pass to the callee. If 8 bytes of data isn’t enough, it gets added to the stack as well, but is the responsbility of the callee to clean up before calling RET. If you aren’t reading or writing assembly you don’t need to care about any of this, your compiler will deal with it all for you.

In the example above, we made a segmentation fault because we kept trying to add to the stack over and over until our program tried to write past the end of the stack, and the Linux operating system stepped in and killed the process to stop other memory being overwritten and unknown consequeunces occuring.

My first stack overflow.

So, how did I first discover stack overflows ?. It was in quite a different computing enviornment. It was on the first computer I ever owned, the Commodore 16. It had similar assembly instructions to CALL and RET, but slightly different. The 6502 chip in it had JSR for Jump to Subroutine and RTS for Return from subroutine.

In the commodore 16 case though there was no operating system, and when writing in assembly on it, you had none of these safeguards.

The 6502 CPU inside it had a single 8 bit register, called SP for stack pointer. It pointed counted how far into the stack you were. There was 256 bytes of the computers RAM, always at 0x0100 to 0x01ff which contained the stack. When you used an instruction that either pushed something onto the stack implicitely, like PHA, or did so as part of how it worked like JSR, the stack pointer moved. If you ever overflowed the stack, the stack pointer just wrapped around to the start and started overwritting entries on the stack that something put there, and presumably something would later want to retrive, but it never would now.

I discovered this while writing recursive assembly code trying to display very crude version of the julia fractal set. My naive code would frequently end up putting more things on the stack than the stack had space for, and I’d end up with the program breaking in ways that were almost impossible to debug, particularly with my lack of experience at the time and the kind of tools you had to work with on a 1980’s 8 bit microcomputer.

A modern recreation.

Something similar to what this first overflow I experienced can be expressed in a modern language like Python with.

import string

class Stack:
    def __init__(self, maxsize: int) -> None:
        self.contents = [None] * maxsize
        self.maxsize = maxsize
        self._index = 0

    def push(self, element: any):
        self.contents[self._index] = element
        if self._index == self.maxsize - 1:
            self._index = 0
        else:
            self._index += 1
        
    def pop(self) -> any:
        ret = self.contents[self._index]
        if self._index == 0:
            self._index = self.maxsize - 1
        else:
            self._index -= 1
        return ret
    
    def __repr__(self) -> str:
        return f"Stack has contents {self.contents} and index pointer {self._index}"
        
        
stack = Stack(maxsize=4)
for c in string.ascii_uppercase[0:7]:
    stack.push(c)
    print(stack)
    
print("================")

for i in range(8):
    print(stack)
    p = stack.pop()
    print(f"Popped {p} from stack")

Which results in this behaviour. We can overflow the stack without error, and just replace the start of the stack with what we push onto it. We can also underflow the stack without error, and we just get whatever was left at the other end of it. In reality things could have been even worse, as nothing short of pulling the power was likely to actually zero the memory, it was quite likely if you hadn’t done that, whatever was left in the stack from the last thing you did with the computer would still be there, and the results would be even more random than those None’s suggest.

Stack has contents ['A', None, None, None] and index pointer 1
Stack has contents ['A', 'B', None, None] and index pointer 2
Stack has contents ['A', 'B', 'C', None] and index pointer 3
Stack has contents ['A', 'B', 'C', 'D'] and index pointer 0
Stack has contents ['E', 'B', 'C', 'D'] and index pointer 1
Stack has contents ['E', 'F', 'C', 'D'] and index pointer 2
Stack has contents ['E', 'F', 'G', 'D'] and index pointer 3
================
Stack has contents ['E', 'F', 'G', 'D'] and index pointer 3
Popped D from stack
Stack has contents ['E', 'F', 'G', 'D'] and index pointer 2
Popped G from stack
Stack has contents ['E', 'F', 'G', 'D'] and index pointer 1
Popped F from stack
Stack has contents ['E', 'F', 'G', 'D'] and index pointer 0
Popped E from stack
Stack has contents ['E', 'F', 'G', 'D'] and index pointer 3
Popped D from stack
Stack has contents ['E', 'F', 'G', 'D'] and index pointer 2
Popped G from stack
Stack has contents ['E', 'F', 'G', 'D'] and index pointer 1
Popped F from stack
Stack has contents ['E', 'F', 'G', 'D'] and index pointer 0
Popped E from stack