Writeup by celi0n for Mégalosaure

Table of contents

This challenge looks like a classical linux crack-me. A note in the description specify that the binary might need the cap_sys_resource capability and that some issues with fork may happen. Interesting…

The given program is a 64 bit ELF, it asks for the flag and answer with “Wrong flag format” if you give it random input. One can easily guess that the flag format is FCSC{<the flag>}. When our input match this format, the program runs for a long time (I actually never waited for the program to finish).

Main function

The next step is to open it in a decompiler, the main function looks like that once simplified:

  create_pipes();
  block_ptr = (long)mmap((void *)0x0,0x100000,3,0x21,-1,0);
  puts("Enter the flag:");
  
  read(0,inp_flag,0x46);
  if (check_input(inp_flag) != 0) {
    puts("Wrong flag format!");
    exit(1);
  }
  
  for (idx = 0; idx < 0x12; idx = idx + 2) {
    *block_ptr = *block_ptr ^ inp_flag[idx];
    *(block_ptr + 4) = inp_flag[idx + 1] ^ *(block_ptr + 4);
    for (idx_step = 0; idx_step < 0x2c; idx_step = idx_step + 1) {
      step(steps[idx_step],opcodes,9);
    }
    local_58 = memory_base; //why ghidra's decompiler keep these 2 lines ?
    local_60 = block_ptr;
    *(memory_base + 0x800 + idx * 4) = *(block_ptr + 0xb0);
    block_ptr = block_ptr + 0xb0;
  }
  
  start_save = (long)memory_base + 0x800;
  is_ok = 0;
  for (idx2 = 0; idx2 < 9; idx2 = idx2 + 1) {
    is_ok = is_ok | LONG_ARRAY_RESULT[idx2] ^ *(ulong *)(start_save + idx2 * 8);
  }
  if (is_ok == 0) {
    puts("Win!!");
  }else {
    puts("Nope.");
  }

Basically in the first loop every 8 bytes of our input are xored with the values at block_ptr end then some computation is done. After this, the block results are stored at memory_base + 0x800 to be checked at the end. This pattern is very close to the CBC mode of a block cipher. At this point I looked briefly at the function step and since it was a bit complexe I first tried to check for different inputs and try to find if it was actually a block cipher (look at the diffusion by permutating one bit for example) and if the computations done on every block were the same / symmetric. In the end the step function provide a good diffusion and is not symmetric. The block size is 8, so it is too much for plain bruteforce and each block value depends of every previous inputs so we have to solve it one block at a time.

The step function

void step(long func_to_execute,ushort *opcodes,long cst){
 
  idx = 0;
  while( true ) {
    total_num_fork = (int)func_to_execute;
    if (total_num_fork <= idx) {
      do {
        res = wait(0x0);
      } while (0 < res);
      return;
    }
    res = fork();
    if (res == -1) break;
    if (res == 0) {
      start = (int)(func_to_execute >> 32);
      exec_child(opcodes + (idx + start) * cst);
    }
    idx = idx + 1;
  }
  perror("fork");
  exit(1);
}

This function fork new processes up to the value contained in low bytes of func_to_execute, and every process is given a different offset from opcodes array. Once every process are spawned, it waits for them to finish and returns.

Exec child

This function is more complex, here is the raw code from ghidra after some renaming :

  memset(stack,0,0x1000);
  idx = 0;
  stack_ptr = 0;
  do {
    while( true ) {
      next_idx = idx + 1;
      c1 = (uint)code[idx];
      if (c1 == 0) break;
      if (c1 == 3) {
        idx = idx + 2;
        c2 = (uint)code[next_idx];
        stack_ptr = stack_ptr + -1;
        val_from_stack = stack[stack_ptr];
        stack[stack_ptr] = 0;
        for (idx2 = 0; idx2 < (int)c2; idx2 = idx2 + 1) {
          ci = (uint)code[idx];
          idx = idx + 1;
          sVar1 = write(pipe_structs[(long)(int)ci * 2 + 1],&val_from_stack,4);
          if (sVar1 == -1) {
            perror("write");
                    /* WARNING: Subroutine does not return */
            exit(1);
          }
        }
      }
      else if (c1 == 1) {
        c2_(1) = (uint)code[next_idx];
        local_c0 = block_ptr;
        stack[stack_ptr] = *(uint *)((long)(int)c2_(1) * 4 + block_ptr);
        stack_ptr = stack_ptr + 1;
        idx = idx + 2;
      }
      else if (c1 == 2) {
        c2_(2) = CONCAT22(code[idx + 2],code[next_idx]);
        stack[stack_ptr] = c2_(2);
        stack_ptr = stack_ptr + 1;
        idx = idx + 3;
      }
      else if (c1 == 4) {
        local_a0 = (uint)code[next_idx];
        stack_ptr = stack_ptr + -1;
        local_a4 = stack[stack_ptr];
        stack[stack_ptr] = 0;
        local_b0 = block_ptr;
        *(uint *)((long)(int)local_a0 * 4 + block_ptr) = local_a4;
        idx = idx + 2;
      }
      else {
        idx = next_idx;
        if (c1 == 5) {
          next_idx = stack_ptr + -1;
          local_9c = stack[next_idx];
          stack[next_idx] = 0;
          stack[next_idx] = local_9c;
          stack[stack_ptr] = local_9c;
          stack_ptr = stack_ptr + 1;
        }
        else if (c1 == 6) {
[...]
        else if (c1 == 0xd) {
          local_34 = stack[stack_ptr + -1];
          stack[stack_ptr + -1] = 0;
          next_idx = stack_ptr + -2;
          local_38 = stack[next_idx];
          stack[next_idx] = 0;
          local_3c = local_34 >> ((byte)local_38 & 0x1f);
          stack_ptr = stack_ptr + -1;
          stack[next_idx] = local_3c;
        }
        else if (c1 == 0xe) {
          local_28 = stack[stack_ptr + -1];
          stack[stack_ptr + -1] = 0;
          next_idx = stack_ptr + -2;
          local_2c = stack[next_idx];
          stack[next_idx] = 0;
          local_30 = local_28 << ((byte)local_2c & 0x1f);
          stack_ptr = stack_ptr + -1;
          stack[next_idx] = local_30;
        }
        else if (c1 != 0x11) {
          if (c1 != 0x12) {
            exit(1);
          }
          exit(0);
        }
      }
    }
    idx = idx + 2;
    c2_ = (uint)code[next_idx];
    epoll_fd = epoll_create1(0);
    if (epoll_fd == -1) {
      perror("epoll_create1");
      exit(1);
    }
    for (idx3 = 0; idx3 < (int)c2_; idx3 = idx3 + 1) {
      ci_ = (uint)code[idx];
      fd = pipe_structs[(long)(int)ci_ * 2];
      event._0_4_ = 1;
      event._4_8_ = (long)fd | (long)idx3 << 0x20;
      idx = idx + 1;
      next_idx = epoll_ctl(epoll_fd,1,fd,(epoll_event *)event);
      if (next_idx != 0) {
        perror("epoll_ctl");
        exit(1);
      }
    }
    local_18 = c2_;
    do {
      num_fd_ready = epoll_wait(epoll_fd,(epoll_event *)epoll_event,1,-1);
      for (idx4 = 0; idx4 < num_fd_ready; idx4 = idx4 + 1) {
        local_e0 = *(epoll_data_t *)(epoll_event + (long)idx4 * 0xc + 4);
        pipe_fd = local_e0.fd;
        idx3_recup = (int)((ulong)local_e0 >> 0x20);
        sVar1 = read(pipe_fd,stack + (long)idx3_recup + (long)stack_ptr,4);
        local_ec = (int)sVar1;
        if (local_ec < 1) {
          perror("read");
          exit(1);
        }
      }
      local_18 = local_18 - 1;
    } while (local_18 != 0);
    stack_ptr = stack_ptr + c2_;
    res = close(epoll_fd);
  } while (res == 0);
  perror("close");
  exit(1);
}

The first thing to notice is that there is a big switch like structure (in fact a bunch of if/else if). The input of this switch is c1, a value stored at code[idx]. This index increases every time a new value is read from code. For those familiar with this kind of challenge, it really looks like the dispatcher of a vm. c1 is the opcode and in the handlers the operands / data are fetched.

After a bit of reversing, one can notice that hopefully this VM takes code in input and executes it sequentially (no jumps or call). A stack is used to store some data temporarily and it can read / write to memory (where the blocks are stored). The different operations are:

Basic operations on top of stack. For example pop a; pop b; push a ^ b;
Push a constant value into stack
Write the value on top of stack to memory
Push value from memory on stack
Exit with code 1 or 0
Do pipe related stuff x2

I skipped the pipe setup in the main function but there are a lot of them (9928), with their input and output handles stored in a table. The virtual machine can write the top value of stack in one of them or read. But for read it waits until some data is in the pipe.

Nice. But what all this pipe / fork / vm stuff is really doing ?

High level view

It is quite difficult to debug this machinery because every bit of code is executed in a different process with the fork loop. If we decide to debug in gdb, we can follow only process (set follow-fork-mode child) at a time and the others will go on with their lifes. How can we monitor every process at the same time and still gather some informations about their execution ?

It is time to invoke a special card: uprobes. It is basically a breakpoint that the kernel put in every instance of a given program, with a way to print registers, memory, stack values… But the best thing is that it doesn’t care if the program uses fork or other anti-debug stuff (if not specifically targeted against uprobes).

To use this, you need to be root and write a few things in files of /sys/kernel/tracing/ pseudo file system. For example to put a uprobe that print the value in eax every time a process write to a pipe (code at offset 0x1543), I used the following setup:

>echo 'p:write_pipe_val /home/celian/Downloads/megalosaure_dir/megalosaure:0x1543 val=%ax' >> /sys/kernel/tracing/uprobe_events
>echo 1 > /sys/kernel/tracing/events/uprobes/write_pipe_val/enable
>cat /sys/kernel/tracing/trace_pipe
megalosaure-593911  [011] DNZff 71981.279438: write_pipe_val: (0x5c4d20c73543) val=0x43534346
<...>-593912  [015] DNZff 71981.279963: write_pipe_val: (0x5c4d20c73543) val=0x1337
<...>-593913  [000] DNZff 71981.280512: write_pipe_val: (0x5c4d20c73543) val=0xa4e1a60a
<...>-593914  [009] DNZff 71981.280837: write_pipe_val: (0x5c4d20c73543) val=0x0

With this setup, every time we run the program again, these values are printed in trace_pipe. For more context, one can setup a bunch of uprobes at key point of the code:

p:uprobes/xor1 /home/celian/Downloads/megalosaure_dir/megalosaure:0x0000000000002059 block_part_1=%dx input_part_1=%ax
p:uprobes/xor2 /home/celian/Downloads/megalosaure_dir/megalosaure:0x000000000000208d block_part_2=%cx input_part_2=%dx
p:uprobes/fork_stuff /home/celian/Downloads/megalosaure_dir/megalosaure:0x0000000000001c77 func_to_execute=%di func_table=%si cst=%dx
p:uprobes/child_func /home/celian/Downloads/megalosaure_dir/megalosaure:0x0000000000001249 code=%di
p:uprobes/write_pipe_val /home/celian/Downloads/megalosaure_dir/megalosaure:0x0000000000001543 val=%ax
p:uprobes/write_pipe_num /home/celian/Downloads/megalosaure_dir/megalosaure:0x0000000000001587 pip_num=%ax
p:uprobes/read_wait_pipe_num /home/celian/Downloads/megalosaure_dir/megalosaure:0x000000000000133c pipe_num=%ax
p:uprobes/instr_c1 /home/celian/Downloads/megalosaure_dir/megalosaure:0x00000000000012a1 c1=%ax
p:uprobes/get_state /home/celian/Downloads/megalosaure_dir/megalosaure:0x0000000000001613 state_num=%ax
p:uprobes/set_state /home/celian/Downloads/megalosaure_dir/megalosaure:0x00000000000016f4 set_state_num=%ax

Then you can redirect the output of trace_pipe during one execution to get a complete log of what the VMs are doing. A nice feature is that every line starts with the process pid, so we can search for this pid and find every operations done by a process.

Thanks to these logs, we can deduce a few things:

Every process executes only a few instructions.
At a given step, a pipe is only read / written by one process. Added to the fact that everything is executed sequencially, it greatly simplify the analysis.
Every step start with a process that wait to read pipe 0x0, which is the last to terminate. This process then write something to memory (set_state probe).
Each step read the two last state and write to a new one. The things I called state is a group of 4 bytes in memory. A block is composed of 2 states. We can understand that every step process one state with a quick calculus with the values in main: size_state * num_state = 4 * 0x2c = 0xBO = gap_between_blocks.

I tried to look at inter-process dependencies by creating a tree graph with ete3 python library and a bit of log parsing. The result is a giant binary tree with the process that read pipe 0 as a root.

tree

Solution

Now that the reverse part is done, lets think about how to solve this. As the correct value for every block is hardcoded in the program, we are done if we can get the input of a block with his value. We know that every step takes 2 states as input and output to one. Here is a diagram of the inputs / outputs of a state:

step_megalosaure

With a SAT solver we can solve a system with the state n-1 as unknown and some values in the states n and n+1. It gives us a way to find state[n-1] from state[n] and state[n+1]. Then, by replacing n = n-1, every time and we can deduce the first two states values which correspond to the hardcoded block value xored with the input. At this point, we are only a xor away from the flag !

Solving

I solved this part with python and z3. My script actually solve every state at once between two blocks but doing it one step at a time was way faster. Unfortunatly, I deleted this version while trying to solve a bug due to z3 using arithmetical shift instead of the VM logical shift. Otherwise, the code is quite similar to the vm code. I used a stack were I put the z3 variables. The value of a given pipe at a given step is represented by a variable. This is quite nice, as we can debug by comparing the values found by z3 with the values printed by the uprobes. Here is the code I used:

from pwn import *
import z3

steps_bytes = b'\x0e\x23\x00\x00\x00\x00\x00\x00\xca\x26\x00\x00\x0e\x23\x00\x00\x78\x0d\x00\x00\xd8\x49\x00\x00\xfa\x1b\x00\x00\x50\x57\x00\x00\xb0\x0e\x00\x00\x4a\x73\x00\x00\xe1\x19\x00\x00\xfa\x81\x00\x00\xa7\x14\x00\x00\xdb\x9b\x00\x00\xf1\x0e\x00\x00\x82\xb0\x00\x00\x72\x10\x00\x00\x73\xbf\x00\x00\x30\x1c\x00\x00\xe5\xcf\x00\x00\xaf\x26\x00\x00\x15\xec\x00\x00\x47\x17\x00\x00\xc4\x12\x01\x00\xc5\x19\x00\x00\x0b\x2a\x01\x00\x32\x17\x00\x00\xd0\x43\x01\x00\xcc\x13\x00\x00\x02\x5b\x01\x00\x7d\x11\x00\x00\xce\x6e\x01\x00\x0b\x0c\x00\x00\x4b\x80\x01\x00\xaf\x12\x00\x00\x56\x8c\x01\x00\xca\x26\x00\x00\x05\x9f\x01\x00\x72\x0a\x00\x00\xcf\xc5\x01\x00\x92\x19\x00\x00\x41\xd0\x01\x00\xf2\x0e\x00\x00\xd3\xe9\x01\x00\xd1\x19\x00\x00\xc5\xf8\x01\x00\xe8\x1d\x00\x00\x96\x12\x02\x00\x85\x0c\x00\x00\x7e\x30\x02\x00\x60\x1c\x00\x00\x03\x3d\x02\x00\x72\x10\x00\x00\x63\x59\x02\x00\xda\x0b\x00\x00\xd5\x69\x02\x00\x32\x17\x00\x00\xaf\x75\x02\x00\x30\x16\x00\x00\xe1\x8c\x02\x00\x45\x10\x00\x00\x11\xa3\x02\x00\x5a\x16\x00\x00\x56\xb3\x02\x00\x7a\x23\x00\x00\xb0\xc9\x02\x00\x53\x17\x00\x00\x2a\xed\x02\x00\x9b\x0e\x00\x00\x7d\x04\x03\x00\xeb\x15\x00\x00\x18\x13\x03\x00\xcc\x10\x00\x00\x03\x29\x03\x00\xb2\x17\x00\x00\xcf\x39\x03\x00\xb2\x14\x00\x00\x81\x51\x03\x00\xfa\x09\x00\x00\x33\x66\x03\x00\xa3\x20\x00\x00\x2d\x70\x03\x00\x92\x11\x00\x00\xd0\x90\x03\x00\x2d\x0d\x00\x00\x62\xa2\x03\x00\xBC\x16\x00\x00\x8f\xaf\x03\x00'
steps = []
for i in range(len(steps_bytes) // 8):
    step = u64(steps_bytes[i*8:i*8+8])
    print(hex(step))
    steps.append(step)

result_bytes = b'\xb5\xa7\xa8\x91\xce\xe7\x07\x9b\x7c\xe9\xe7\x35\xac\x9e\x81\x9e\x5f\x6b\xaa\x17\x33\x1d\x40\xfd\x87\x55\x9d\xbd\x2f\xa3\x16\xdf\xae\x4f\xab\x0d\xac\x61\xc5\x80\x09\xe2\x68\xd3\xdd\xd1\x37\x92\x2c\x88\x26\xee\xf6\xe4\xeb\x07\x3b\x30\x78\xe8\x11\xfd\x2f\xb7\x3f\xbf\x67\x82\xdc\xa7\xd2\x99'

result_states = []
for i in range(len(result_bytes) // 4):
    res = u32(result_bytes[i*4:i*4+4])
    print(hex(res))
    result_states.append(res)

start_code = 0x40c0 #or 0x50c0 check later
size_code =  4452704

code = open("../megalosaure", "rb").read()[start_code:start_code+size_code]


def short_op(idx, code):
    return (idx+2, u16(code[idx:idx+2]))
def int_op(idx, code):
    return (idx+4, u32(code[idx:idx+4]))
def get_instr(idx, code, stack, pipes, states, solver, curr_step):
    idx, opcode = short_op(idx, code)
    if opcode == 0:
        idx, c2 = short_op(idx, code)
        #print("push pipe(", end="")
        for i in range(c2):
            idx, ci = short_op(idx, code)
            #print(" " +hex(ci), end = "")
            stack.append(pipes[ci])
        #print(" )")
    elif opcode == 1:
        idx, c2 = short_op(idx, code)
        #print(f"push state[{hex(c2)}]")
        stack.append(states[c2])
    elif opcode == 2:
        idx, c2 = int_op(idx, code)
        #print(f"push imm({hex(c2)})")
        stack.append(z3.BitVecVal(c2, 32))

    elif opcode == 3:
        idx, c2 = short_op(idx, code)
        val = stack.pop()
        #print("pop pipe(", end="")
        for i in range(c2):
            idx, ci = short_op(idx, code)
            #print(" " +hex(ci), end = "")
            solver.add(pipes[ci] == val)
        #print(" )")

    elif opcode == 4:
        idx, c2 = short_op(idx, code)
        #print(f"pop state[{hex(c2)}]")
        solver.add(states[c2] == stack.pop())
    elif opcode == 5:
        #print("pop tmp; push tmp; push tmp")
        tmp = stack.pop()
        stack.append(tmp)
        stack.append(tmp)
    elif opcode == 6:
        #print("pop a; pop b; push a+b")
        a = stack.pop()
        b = stack.pop()
        stack.append(a + b)
    elif opcode == 7:
        #print("pop a; pop b; push a-b")
        a = stack.pop()
        b = stack.pop()
        stack.append(a - b)
    elif opcode == 8:
        #print("pop a; pop b; push a*b")
        a = stack.pop()
        b = stack.pop()
        stack.append(a * b)
    elif opcode == 9:
        print("pop a; pop b; push a%b")
        assert False #If used check if it works
        a = stack.pop()
        b = stack.pop()
        stack.append(a % b)
    elif opcode == 10:
        #print("pop a; pop b; push a^b")
        a = stack.pop()
        b = stack.pop()
        stack.append(a ^ b)
    elif opcode == 11:
        #print("pop a; pop b; push a&b")
        a = stack.pop()
        b = stack.pop()
        stack.append(a & b)
    elif opcode == 12:
        #print("pop a; pop b; push a|b")
        a = stack.pop()
        b = stack.pop()
        stack.append(a | b)
    elif opcode == 13:
        #print("pop a; pop b; push a>>(b&0x1f)")
        a = stack.pop()
        b = stack.pop()
        stack.append(z3.LShR(a, (b & z3.BitVecVal(0x1f, 32))))
    elif opcode == 14:
        #print("pop a; pop b; push a<<(b&0x1f)")
        a = stack.pop()
        b = stack.pop()
        stack.append(a << (b & z3.BitVecVal(0x1f, 32)))
    elif opcode == 15:
        #print("pop a; push ~a")
        a = stack.pop()
        stack.append(~a)
    elif opcode == 16:#not used
        a = stack.pop()
        stack.append(-a)
        print("pop a; push -a")
        assert False  # If used check if it works
    elif opcode == 17:
        print("NOP?")
        assert False  # If used check if it works
    elif opcode == 18:
        pass#print("exit(0)")
    else:
        assert False

    return (idx, opcode)


def parse_step(step, code, states, solver):
    curr_step =  steps.index(step)
    step_start = (step >> 32) *9
    num_process = step % (2**32)
    print("Step:", hex(curr_step), " num_process:", hex(num_process))


    pipes = z3.BitVecs(" ".join([f"p{curr_step}_{i}" for i in range(num_process)]), 32)
    for i in range(num_process):
        process_start = (step_start + i * 9)*2 # for * ushort
        idx = process_start
        stack = []
        while True:
            idx, opcode = get_instr(idx, code, stack, pipes, states, solver, curr_step)
            if opcode == 0x12:
                assert len(stack) == 0
                break

flag = b"FCSC{"
for round in range(1,9):
    solver = z3.Solver()

    states = z3.BitVecs(" ".join([f"state_{i}" for i in range(len(steps)+2)]), 32)


    solver.add(states[-2] == z3.BitVecVal(result_states[round*2], 32))
    solver.add(states[-1] == z3.BitVecVal(result_states[round*2 + 1], 32))
    #solver.add(states[0] == z3.BitVecVal(u32(last_first_part), 32))#use first time to guess with part of input (maybe faster)
    #solver.add(states[1] == z3.BitVecVal(u32(b"ABCD"), 32))

    for i in range(len(steps)):#len(steps) - 1, -1, -1
        parse_step(steps[i], code, states, solver)



    print(solver.check())
    model = solver.model()
    print(model)
    res_part0 = model.evaluate(states[0])
    res_int0 = res_part0.as_long()
    res_part1 = model.evaluate(states[1])
    res_int1 = res_part1.as_long()

    res_int0 ^= result_states[(round -1) * 2]
    res_int1 ^= result_states[(round -1) * 2 + 1]

    part0 = res_int0.to_bytes(4, 'little')
    part1 = res_int1.to_bytes(4, 'little')
    flag += part0
    flag += part1
    print(flag)

After a bit of time, this program prints the correct flag. When submitting the flag to CTFd, I had the nice surprise to be the first to flag this challenge :)

I learnt from a writeup that it was a whitebox of the cipher SIMON-64-128 (in CBC mode so it was a correct guess).