Program Compiling and Linking
We all know that a source file must be compiled into an executable before the machine can run it, but what exactly happens along the way? A modern compiler does most of the work for programmers and makes life much easier, but it is important to understand what goes on out of sight. Let's start with two functions in two separate files.
sum.cpp
main.cpp
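The listings for the two files are not reproduced above. As a minimal sketch, assuming only what the rest of this article mentions (a function sum(), a global gdata, a global data, and main()), they might look like the following. The initial values, the signature of sum(), and the stand-in name run_main_body() are assumptions, and both files are shown in one block so the sketch compiles as a single unit.

```cpp
// ---- sum.cpp (assumed contents) ----
int gdata = 10;                 // definition of a global variable

int sum(int a, int b) {         // definition of the function
    return a + b;
}

// ---- main.cpp (assumed contents) ----
extern int gdata;               // declaration only: defined in sum.cpp
int sum(int a, int b);          // declaration only: defined in sum.cpp

int data = 20;                  // global variable defined in main.cpp

// Stand-in for the body of main(). The name is hypothetical; it is used so
// the sketch can sit next to an existing main() or be called from a test.
int run_main_body() {
    int a = gdata;
    int b = data;
    return sum(a, b);
}
```

In the real two-file layout, the second half would live in main.cpp with the body inside main(), and each file would be compiled to its own object file.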
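The first step handles the preprocessor directives that begin with #, such as #include and #define. This covers removing comments, expanding macros, and inserting the contents of included files. Note that most #pragma directives are not consumed here. They are passed on to later stages, because some of them (for example, MSVC's #pragma comment(lib, ...), which requests linking against a library) only make sense to the compiler or the linker.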
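To make the effect of preprocessing concrete, here is a small sketch. The macro and function names are made up for illustration; the comments describe what the text looks like once the preprocessor has finished with it.

```cpp
#include <cstddef>   // #include: the header's text is pasted in at this point

#define BUFFER_LEN 16            // object-like macro
#define SQUARE(x) ((x) * (x))    // function-like macro

/* This comment is removed by the preprocessor. */
std::size_t buffer_bytes() {
    // After preprocessing this line reads: return 16 * sizeof(int);
    return BUFFER_LEN * sizeof(int);
}

int squared_sum(int a, int b) {
    // After preprocessing this line reads: return ((a + b) * (a + b));
    return SQUARE(a + b);
}
```

With GCC or Clang, the -E option stops after this stage and prints the preprocessed source, which is a convenient way to check how a macro actually expands.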
The compilation step runs on the output of the preprocessor. The compiler parses the C++ source code and translates it into an assembly file. It also applies various optimizations, and the quality of these optimizations is an important measure of a compiler.
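As a rough illustration of the kind of optimization meant here, the sketch below shows two functions that an optimizing compiler can usually simplify. The function names are hypothetical, and what the compiler actually emits depends on the compiler and the optimization level.

```cpp
// A constant expression: an optimizing compiler can evaluate it at compile
// time (constant folding), so the generated code typically just loads 86400.
int seconds_per_day() {
    return 60 * 60 * 24;
}

// A small helper that is a likely candidate for inlining when optimizing.
inline int twice(int x) {
    return x + x;
}

// With inlining, the two calls can collapse into plain arithmetic on x.
int four_times(int x) {
    return twice(twice(x));
}
```

With GCC or Clang, the -S option stops after this stage and writes the assembly file, so the result can be compared with and without optimization enabled.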
In this step, the assembler (invoked by the compiler driver) translates the assembly code into machine code, stored as a relocatable object file. This object file is in ELF (Executable and Linkable Format): it contains an ELF header and several sections, including .text, .rodata, .data, .bss, and .symtab (the symbol table).
The ELF header stores information about the file and helps the linker parse and interpret the object file.
.text, .rodata, .data, and .bss are similar to the regions we have seen in a process's virtual address space; they hold instructions and data.
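The following sketch shows where a typical GCC or Clang toolchain on an ELF platform tends to place different kinds of definitions. The names are hypothetical and the placement is the usual convention, not a guarantee.

```cpp
int initialized_global = 42;        // typically .data (non-zero initial value)
int zero_global;                    // typically .bss (zero-initialized, occupies no space in the file)
const int lookup_table[3] = {1, 2, 3};  // typically .rodata (read-only data)
static int file_local_counter = 7;  // also .data, but recorded as a local symbol

// The machine code of a function typically lands in .text.
int bump_counter() {
    ++file_local_counter;
    return file_local_counter + initialized_global + zero_global;
}

int table_sum() {
    return lookup_table[0] + lookup_table[1] + lookup_table[2];
}
```

Compiling a file like this to an object file and listing its symbols is a quick way to check which section each name was actually assigned to.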
The symbol table stores symbols of the declarations and definitions of global variables and functions in our file. We can use objdump -t main.o
to see the symbol table of main.o.
As the symbol table shows, main() is stored as instructions in the .text section, and the global variable data is stored in .data. But what about sum() and gdata: why is their location UND (undefined)? Because main.cpp only declares them; their definitions live in sum.cpp. At this point the toolchain does not know where to find these symbols. Resolving them to their actual locations is the job of the linking step.
Similarly, we can take a look at the symbol table of sum.o.
Sure enough, gdata and sum() are defined here, and the symbol table records their locations.
Now we can see why a .o file cannot be executed directly: each object file is missing some pieces, and the references between files still need to be connected. To illustrate this, we can use objdump -S main.o
to see the machine instructions stored in .text.
Notice that on line 8 and line 11, the address in the mov instruction is 0x0! This shows that during compilation the symbols are not yet assigned virtual memory addresses; that happens in the following step.
The linker takes the object files, ties them together, and produces the final executable (or a shared library). First, it merges the sections of all the object files. For .text, .data, or .bss, merging is straightforward. For the symbol tables, the linker must find where each symbol marked UND is defined and replace UND with the section that defines it (.text, .data, and so on).
At this stage, the most common errors are missing definitions and duplicate definitions. The former means the linker cannot find a definition for a symbol; the latter means a symbol is defined more than once across the object files.
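The symbol resolution described above can be modeled with a toy sketch. This is not how a real linker is implemented; ObjectFile, ResolveResult, and resolve_symbols are hypothetical names, and the model ignores details such as weak symbols and libraries.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Highly simplified model of one object file's symbol table.
struct ObjectFile {
    std::string name;
    std::set<std::string> defined;    // symbols that have a section (.text, .data, ...)
    std::set<std::string> undefined;  // symbols marked UND
};

struct ResolveResult {
    std::set<std::string> missing;    // referenced but defined nowhere
    std::set<std::string> duplicate;  // defined in more than one object file
};

ResolveResult resolve_symbols(const std::vector<ObjectFile>& objects) {
    // Count how many object files define each symbol.
    std::map<std::string, int> definition_count;
    for (const ObjectFile& obj : objects)
        for (const std::string& sym : obj.defined)
            ++definition_count[sym];

    ResolveResult result;
    for (const auto& entry : definition_count)
        if (entry.second > 1) result.duplicate.insert(entry.first);

    // Every UND symbol must be defined by some object file.
    for (const ObjectFile& obj : objects)
        for (const std::string& sym : obj.undefined)
            if (definition_count.find(sym) == definition_count.end())
                result.missing.insert(sym);
    return result;
}
```

Feeding it a model of main.o (defines main and data, references sum and gdata) together with sum.o reports no problems, while leaving sum.o out reports sum and gdata as missing, which mirrors the "undefined reference" errors a real linker prints.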
The second step assigns virtual memory addresses to the symbols and patches the code that refers to them, which is called relocation. The linker goes through the .text section and replaces each 0x0 placeholder with the newly assigned address. Now if we dump our executable file with objdump -S a.out
, we can see that all symbols have their virtual addresses.
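Relocation can be sketched as patching placeholder bytes. The model below is a simplification under stated assumptions: addresses are 4 bytes, little-endian, and absolute, whereas real relocations come in several kinds (for example PC-relative ones on x86-64). Relocation and apply_relocations are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

// "At this offset in .text, write the address of this symbol."
struct Relocation {
    std::size_t offset;
    std::string symbol;
};

// Overwrite each 4-byte placeholder with the symbol's assigned address.
void apply_relocations(std::vector<std::uint8_t>& text,
                       const std::vector<Relocation>& relocations,
                       const std::map<std::string, std::uint32_t>& addresses) {
    for (const Relocation& rel : relocations) {
        auto it = addresses.find(rel.symbol);
        if (it == addresses.end())
            throw std::runtime_error("undefined symbol: " + rel.symbol);
        if (rel.offset + 4 > text.size())
            throw std::out_of_range("relocation outside .text");
        // Store the address least significant byte first (little-endian).
        for (int i = 0; i < 4; ++i)
            text[rel.offset + i] =
                static_cast<std::uint8_t>((it->second >> (8 * i)) & 0xFF);
    }
}
```

Before the call, the four bytes at the recorded offset are all zero, just like the 0x0 seen in the object file dump; afterwards they hold the address the linker chose for the symbol.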
The layout of our final executable file is similar to that of a relocatable object file. Its ELF header also records the entry point of the program, which is the address of the first instruction the program runs. With a typical GCC toolchain this is the startup routine _start, which sets up the runtime and then calls our main function.
Besides, the executable file has one more structure: the program header table. It tells the system which parts of the file to load into which regions of the virtual address space when the program is executed.
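The header fields mentioned here can be read directly from the first 64 bytes of a file. The sketch below assumes a 64-bit, little-endian ELF file and uses the field offsets from the ELF64 header layout; ElfSummary and summarize_elf64 are hypothetical names, and the caller is assumed to have already loaded the bytes.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

struct ElfSummary {
    std::string type;               // relocatable, executable, shared object, ...
    std::uint64_t entry;            // entry point address (e_entry)
    std::uint16_t program_headers;  // number of program header entries (e_phnum)
};

// Read an unsigned little-endian integer of `size` bytes starting at `offset`.
static std::uint64_t read_le(const std::vector<std::uint8_t>& bytes,
                             std::size_t offset, std::size_t size) {
    std::uint64_t value = 0;
    for (std::size_t i = 0; i < size; ++i)
        value |= static_cast<std::uint64_t>(bytes[offset + i]) << (8 * i);
    return value;
}

ElfSummary summarize_elf64(const std::vector<std::uint8_t>& header) {
    if (header.size() < 64)
        throw std::runtime_error("need at least 64 bytes");
    if (header[0] != 0x7f || header[1] != 'E' || header[2] != 'L' || header[3] != 'F')
        throw std::runtime_error("not an ELF file");
    if (header[4] != 2 || header[5] != 1)   // EI_CLASS = 64-bit, EI_DATA = little-endian
        throw std::runtime_error("only 64-bit little-endian ELF is handled");

    ElfSummary summary;
    switch (read_le(header, 16, 2)) {       // e_type
        case 1:  summary.type = "relocatable";   break;
        case 2:  summary.type = "executable";    break;
        case 3:  summary.type = "shared object"; break;
        default: summary.type = "other";         break;
    }
    summary.entry = read_le(header, 24, 8);                                       // e_entry
    summary.program_headers = static_cast<std::uint16_t>(read_le(header, 56, 2)); // e_phnum
    return summary;
}
```

For a relocatable object file the entry point and the program header count are normally zero, which matches the point above: only the linked output carries the information needed to load and start the program. Position-independent executables are commonly reported as type 3 rather than type 2.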
Now we understand the process of compiling and linking, but why do we need to separate them? Mainly because we want to compile each source file independently. If we change a single file, we do not need to recompile everything, which can be very time-consuming in large projects. Object files can also be bundled together into static libraries for reuse, so we can build on others' work instead of writing everything ourselves.