If you have a laptop or desktop computer, it probably runs on an x86 CPU, a descendant of the Intel 8086. For example, I have a Lenovo laptop with a Core i5 CPU, which implements the x86 architecture. In this article, I want to talk about the x86 architecture, and to explain how it works, I will start with the simplest member of the family: the 8086.
The 8086, released in 1978, was Intel's first 16-bit microprocessor and the first member of the x86 family (Intel's earlier 4004, 8008 and 8080 were 4-bit and 8-bit designs). This is why it is famous, and why many people prefer to use it or study it. It is well documented, and there are countless tutorials and examples on how to use it. For example, if you search for “Interfacing Circuits”, you will find a lot of 8086-based computers made by people, connected to interface devices such as monitors, keyboards or mice.
Before we start reverse engineering and build our simple x86-compatible computer, let’s take a look at the machine code structure of the 8086. We will only review the register addressing mode, because this mode is the easiest to understand and re-implement.
In this mode, we only need to look at two-byte (16-bit) instruction codes. Our instruction code looks like this:
Byte 1 | Byte 0 |
---|---|
Opcode (6 bits), D (1 bit), W (1 bit) | MOD (2 bits), REG (3 bits), R/M (3 bits) |
What are these fields, and why should we learn them? Since we want to reverse engineer the 8086 architecture and learn how it works, we need to know how this processor understands programs. So, let’s check what the fields in these bytes mean:
- Opcode: a 6-bit number that selects the operation (for example ADD, SUB, MOV, etc.).
- D: the direction bit. It determines whether the REG field is the source or the destination operand. To keep the reverse engineering process simple, we treat it as a constant 1, so the REG field in byte 0 is always the destination.
- W: determines the data size (0 for 8-bit, 1 for 16-bit). Like D, we treat it as a constant 1, so we can only do operations on 16-bit numbers.
- MOD: determines the addressing mode. As we decided before, we only model the register addressing mode, so MOD is the constant 11.
- REG and R/M: because D is 1, REG holds the destination and R/M holds the source. Please pay attention: R/M names a register only because we are modeling the register addressing mode (MOD = 11). In other modes, R/M selects a memory operand instead.
Now we have learned how the 8086 understands programs. With the constants above, every instruction code looks like this:
Opcode | D | W | MOD | REG | R/M |
---|---|---|---|---|---|
xxxxxx | 1 | 1 | 11 | xxx | xxx |
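To make the bit layout concrete, here is a minimal Python sketch that splits a 16-bit instruction word into the six fields from the table. The function name `split_fields` and the sample word `0x8BC1` are my own choices for illustration, and the sketch assumes the opcode byte sits in the upper 8 bits of the word.

```python
def split_fields(word):
    """Split a 16-bit instruction word into its six fields.

    Assumes the opcode byte is the high byte of the word and the
    MOD/REG/R/M byte is the low byte.
    """
    return {
        "opcode": (word >> 10) & 0b111111,  # 6 bits
        "d": (word >> 9) & 0b1,             # 1 bit
        "w": (word >> 8) & 0b1,             # 1 bit
        "mod": (word >> 6) & 0b11,          # 2 bits
        "reg": (word >> 3) & 0b111,         # 3 bits
        "rm": word & 0b111,                 # 3 bits
    }


fields = split_fields(0x8BC1)
for name, value in fields.items():
    print(f"{name:6} = {value:b}")
```

For this sample word, D and W are both 1 and MOD is 11, which matches the simplified format we decided on.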
Let’s assign codes to our registers. Since we decided to simplify our reverse engineering process and to use only 16-bit registers, I prefer to model the four main registers: AX, BX, CX and DX. This table shows their codes:
Code | Register |
---|---|
000 | AX |
001 | CX |
010 | DX |
011 | BX |
As you can see, we are almost able to convert instructions to machine code. To make the reverse engineering process even simpler we could ignore MOV, but I prefer to include it in my list. So, let’s find the opcodes for MOV, ADD and SUB.
Opcode | Operation |
---|---|
100010 | MOV |
000000 | ADD |
001010 | SUB |
Now we can convert any register-to-register MOV, ADD or SUB instruction on our four registers using the format we have. For example, this piece of code:
    MOV AX, CX
    MOV BX, DX
    ADD AX, BX
To convert this piece of code to machine code, we use the tables we made. The resulting codes are:
    1000101111000001 => 0x8bc1
    1000101111011010 => 0x8bda
    0000001111000011 => 0x03c3
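The same conversion can be sketched as a tiny assembler in Python. The names `REGISTERS`, `OPCODES` and `encode` are hypothetical helpers of mine, not part of any real tool. For the opcodes I use the register/memory forms from Intel's documented encoding (MOV 100010, ADD 000000, SUB 001010), so ADD AX, BX comes out as 0x03C3.

```python
# Register codes from the table above (16-bit registers only).
REGISTERS = {"AX": 0b000, "CX": 0b001, "DX": 0b010, "BX": 0b011}

# 6-bit opcodes for the register/memory forms of each instruction.
OPCODES = {"MOV": 0b100010, "ADD": 0b000000, "SUB": 0b001010}


def encode(mnemonic, dest, src):
    """Encode a register-to-register instruction as a 16-bit word.

    D, W and MOD are fixed to 1, 1 and 11, as in our simplified model,
    so REG holds the destination and R/M holds the source.
    """
    d, w, mod = 1, 1, 0b11
    first_byte = (OPCODES[mnemonic] << 2) | (d << 1) | w
    second_byte = (mod << 6) | (REGISTERS[dest] << 3) | REGISTERS[src]
    return (first_byte << 8) | second_byte


program = [("MOV", "AX", "CX"), ("MOV", "BX", "DX"), ("ADD", "AX", "BX")]
for mnemonic, dest, src in program:
    word = encode(mnemonic, dest, src)
    print(f"{mnemonic} {dest}, {src} => {word:016b} => 0x{word:04X}")
```

Passing an unknown mnemonic or register simply raises a `KeyError`, which is enough for a sketch of this size.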
Now we can model our very simple x86-based computer. One note, though: this model has a lot of limitations. For example, we can’t initialize registers, so we need to study and implement the immediate addressing mode. We also can’t read anything from memory (x86 is not a load/store architecture, so its instructions can work directly on data stored in memory). Still, I think this can be helpful if you want to study this popular processor or do any projects with it.
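As a starting point, here is a minimal Python sketch of such a model. It is only my own illustration under the assumptions of this article: D = 1, W = 1, MOD = 11, four registers, and the documented opcodes MOV 100010, ADD 000000 and SUB 001010. The names `step` and `run` are hypothetical, flags are not modeled at all, and because our subset cannot load immediates, the initial register values are simply set by the host program.

```python
REG_NAMES = ["AX", "CX", "DX", "BX"]  # index = 3-bit register code
MOV, ADD, SUB = 0b100010, 0b000000, 0b001010


def step(regs, word):
    """Decode and execute one 16-bit instruction word on `regs`."""
    opcode = (word >> 10) & 0b111111
    d = (word >> 9) & 0b1
    w = (word >> 8) & 0b1
    mod = (word >> 6) & 0b11
    reg = (word >> 3) & 0b111
    rm = word & 0b111

    # Only the simplified register-to-register form is supported.
    if (d, w, mod) != (1, 1, 0b11) or reg > 3 or rm > 3:
        raise ValueError(f"unsupported encoding: 0x{word:04X}")

    dest, src = REG_NAMES[reg], REG_NAMES[rm]
    if opcode == MOV:
        result = regs[src]
    elif opcode == ADD:
        result = regs[dest] + regs[src]
    elif opcode == SUB:
        result = regs[dest] - regs[src]
    else:
        raise ValueError(f"unsupported opcode: {opcode:06b}")

    regs[dest] = result & 0xFFFF  # registers wrap around at 16 bits


def run(program, regs):
    """Execute a list of instruction words in order."""
    for word in program:
        step(regs, word)
    return regs


# CX and DX are preloaded by the host, since we have no immediate mode.
regs = {"AX": 0, "BX": 0, "CX": 5, "DX": 7}
run([0x8BC1, 0x8BDA, 0x03C3], regs)  # MOV AX,CX / MOV BX,DX / ADD AX,BX
print(regs)  # AX ends up as 5 + 7 = 12
```

Keeping the registers in a plain dictionary makes it easy to inspect the state after each step. A hardware model would use a register file instead, but the decoding logic stays the same.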