Introduction

In March 2017 the Danish Defence Intelligence Service made a campain to hire people to their hacking academy.

To promote this, they published an assignment in the newspapers. You can see a copy of the assignment here.

I decided to try to solve this assignment.

First Impression

The text to the left looks like 80386 assembly code. I've written a lot of assembly code for PC's over the years, but it was 16 bit code (DOS). But I can manage to read this code:

The text to the right is written using the characters A-Z, a-z, 0-9, +, and /. This is typical for base64 encoded data. Some of the characters are in bold and underlined. Most of the lines (except line 6 and the last line) are terminated with a minus sign.

The Code

I've previously worked with a simulator for the Danish GIER machine. This was coded in C. I quickly decided to write this emulator in C as well. It makes it easier to add debug printouts to the code.

The central part of the code looks like this:

  for(;;)
  {
    edx=REGS[63];
    edx=*(unsigned int *) (MEM+edx);
    REGS[63]+=4;
    REGS[0]=0;

    ebp=edx;
    ebp>>=21;
    ebp&=077;
    esi=edx;
    esi>>=15;
    esi&=077;
    edi=edx;
    edi>>=9;
    edi&=077;
    eax=edx;
    eax>>=27;
    switch(eax)
    {
      case  0: /*OP_LOAD_B*/
	eax=REGS[esi];
	eax+=REGS[edi];
	eax=MEM[eax];
	REGS[ebp]=eax;
	break;
      case  1: /*OP_LOAD_H*/
	fprintf(stderr, "Not implemented opcode: %d\n",eax);
	exit(0);
	break;
  ...
    }
  }

I've kept the register names as working variables (eax, ebx, etc.). They have all been declared as unsigned int. REGS is a table of 64 unsigned ints. MEM is a table with unsigned chars. One megabyte seems to be enough. I should add a test for access violations.

The Disk Image

Now the fun part starts: The code for the disk image to the right is only available as a bitmap. I tried various OCR software, without success.

I started typing in the text manually.

I wrote a program (ocr.c) to assist in the verifying process. It reads the file with the typed in text line by line, and extracts the characters one by one from the bitmap file. 64 bitmap files are produced, the first containing all the A's, the next all the B's, etc. It is then easy to take one file at a time and check if all characters are the same.

A subset of the file with A's is shown here:

The characters are shown to the left, scaled by a factor 10. The line and character number is shown to the right, they refer to the file with the text typed in.

This made me correct some mistakes in my typing of the text.

The program also generated a text file with the pixel values of all characters. I'll come back to this later.

The Trick

As said, some of the characters to the right are in bold and underlined. If we collect these, we get:

MzJoYWNrZXI1NTd6amt6aS5vbmlvbgo

If we pass this to the base64 program we get:

$ base64 -d underline.txt
32hacker557zjkzi.onion

I did not know what .onion was, but Google did. https://32hacker557zjkzi.onion is an address that can be opened by a TOR enabled browser.

The browser openes a window showing two images: A picture of an emu to the left, and an 8" floppy disk to the right.

The title reads: Don't like OCR?

The picture of the emu links to a file containing the (still unfinished) assembly code.

Clicking on the floppy starts a download of a text file named disk.img.b64.

This file contains base64 encoded text and some escape sequences that changes the background color when the file is cat'ed to the terminal window.

The program t2.c removed the escape sequences from the file.

The base64 code from the TOR site is not identical to the file from OCR. The first file is 29956 bytes, the second only 13974 bytes. I'll come back to this difference.

Neural Network

I have previously used a neural network to identify cross peaks in 2D NMR spectra.

I've tried to train a neural network using the pixel values extracted by the ocr.c program.

Each character is 20x10 pixels, so the network is set up with 200 neurons in the input layer. There are 64 output neurons, one for each character value. I've chosen 128 neurons in the hidden layer. As input I use the pixel values scaled to the interval [0.1 - 0.9]. The output values are all set to 0.1 except for the the output neuron matching the character value. This is set to 0.9.

I've trained the network using the first half of the list of characters. The check is then to see if it recognizes the other half of the characters.

The list of bitmap values is permuted before each training pass. The list is used 5000 times.

I've created 100 different networks, each network initially loaded with the weights set to random numbers in the interval -0.5 to 0.5.

In 31% of the generated networks, all characters are recognized after the training. In case of the inperfect networks, the characters that were not recognized correctly were also difficult for me to recognize:

Because of the underlining, it is difficult to see if it is a "q" or a "g".

If I introduce errors in the first part of my typing (opgave.txt) it is difficult for the learning process to converge, and errors in the second half are easily found.

It is my believe that I've typed in the file correctly - or the neural network makes the same mistakes as I do.

Getting the Code to Run

Now for the exciting moment: Starting the code:

$ ./kodes disk1.out
Wrong endianness!

It failed. I had assumed that the virtual machine has the same endianness (little) as an Intel x86 CPU, this was wrong.

A add a function to swap bytes:

#include <byteswap.h>
...
static unsigned int loadw(unsigned int addr)
{
  unsigned int w;
  w = *(unsigned int *)(MEM+addr);
  w = bswap_32(w);
  return w;
}

The code in the program that tests for endianness is quite smart, take a look at it. The program kode.c contains debug printout that makes it easier to see what goes on.

With this addition (and a similar for OP_STORE_W) we get:

$ ./kodes disk1.out
Password:

Now what?

The Password

None of the passwords I could think of worked: root, hacker, r00t, etc., so I obviously had to do it more systematically.

On https://weakpass.com/wordlist I found file, passwords.txt.gz, containing more than 15 million passwords.

The emulator was changed so that instead of exiting on the OP_HALT instruction it restarts. In this way I can pass the list of passwords to stdin.

This gives a problem: It would take many days to go through the list.

I have access to a machine with 72 cores, including hyperthreading this makes it possible to run 144 threads at a time. The script do2b.com in the tar file splits the password file into 144 files of nearly identical sizes and starts the emulator for each file.

I've added to the opcode OP_IN that it echoes the character read to stdout.

In this way it is possible to process the whole list in approximately 50 minutes.

The script stores the 144 output files in the folder split2. A hit is easily found:

$ cd split2
$ fgrep "Checking decryption key" *|fgrep -v Bad
xcd:Checking decryption key... OK
xcd:Checking decryption key... OK

Let's take a look at the xcd file:

Password: agent
Initializing decryption... OK
Checking decryption key... OK
Decrypting disk image.............................................. Done
$ # Look, I'm an HTTP server now!
$ cat xinetd.conf
service httpd
{
    disable     = no   
    socket_type = stream 
    protocol    = tcp  
    wait        = no
    bind        = 0.0.0.0
    server      = u5emu
    server_args = disk.img
    type        = UNLISTED
    port        = 80
    user        = root
}

The password is: agent. The password: agentagent is also found.

The Assembly Code

I've also tried to complete the assembly code. I've chosen to use calls to the Linux kernel (int 0x80) for input and output and for the disk operations.

Some help has been build into the code: It tests if the instructions OP_READ and OP_WRITE have been made correctly, and it tests if you have forgotten to clear the edx register in the OP_DIV code.

I've tried to benchmark the assembly code and the C code, and it turned out quite interestingly:

Time, in seconds, to test 1001 passwords, 1000 bad passwords and one correct password.

Machine kodes u5emu
Desktop, CentOS 6 25.40 34.86
72 cores, CentOS 7 17.78 46.93
HP laptop, CentOS 6 VM 33.49 42.21
Lenovo laptop, CentOS 7 16.11 36.57
RPi 2 204.84  

The C code is faster than the assembly code. There could be several reasons for this: gcc could be better generating assembly code than I am, and the C code gets translated into 64 bit code which could be faster than 32 bit code.

gcc version 4.8.5 (CentOS 7) generates faster code than gcc 4.4.7 (CentOS 6).

The Web Server

Using the correct password the code tells us that it is now a web server.

We change the opcode OP_HALT so that it no longer loops, but instead writes out the modified DISK area into a file.

This file is stored as /usr/local/lib/disk.img. The xinet daemon configuration files are located in the /etc/xinetd.d folder on a CentOS 6 machine. We add the follow file to this folder:

service httpd8080
{
  disable = no
  socket_type = stream
  protocol = tcp 
  wait = no
  bind = 0.0.0.0
  server = /usr/local/bin/u5emu
  server_args = /usr/local/lib/disk.img
  type = UNLISTED
  port = 8080
  user = root
}

I chosen to bind it to port 8080 as port 80 is already in use.

The u5emu file is stored in /usr/local/bin and xinetd is restarted.

If you take a look at http://lemo.dk:8080/index.html nothing much is shown.

We try the strings command on the disk.img file, this gives some hints:

$ strings /usr/local/lib/disk.img
...
/favicon.ico
/index.html
/robots.txt
/imgs/morpheus.gif
/imgs/hacker-emblem.png
/imgs
/secret/tshirt.html
/secret
/docs/P51-01.lzma
/docs/P07-03.lzma
/docs
...

There's a robots.txt file, it contains:

User-agent: *
Disallow: /secret/

It tells Google & Co. to stay out of the folder /secret. This folder contains a single document: http://lemo.dk:8080/secret/tshirt.html. I've sent an email to the gmail address, and received a nice T-shift a few days later.

The folder /docs contains two files. The content of the file P07-03.lzma can be typed out using the lzcat command. The contents match http://phrack.org/issues/7/3.html.

The code from the TOR site contains two additional files compared to the code from OCR: /docs/P51-01.lzma and /imgs/morpheus.gif:

This is a refence to The Matrix.

Analysis of the Decryption Code

I haven't look so deeply into the decryption code yet. I can see that the password is used to set up some tables. The password is checked by decrypting a 56 bytes ciphertext (at address 2404), and if the password is correct, the text should match the plaintext "Another one got caught today, it's all over the papers.\n" (address 2348). This text is taken from the file P07-03.lzma.

During the table setup the password is used cyclic. That's why both "agent", "agentagent", and "agentagentagent" are all accepted.

Link to the Code

fe20170403.tar.gz

Thanks to DDIS for a fun puzzle!