What computer program translates program instructions one at a time into machine language?

Computers do not understand human languages. In fact, at the lowest level, computers only understand sequences of numbers that represent operational codes (op codes for short). On the other hand, it would be very difficult for humans to write programs in terms of op codes. Therefore, programming languages were invented to make it easier for humans to write computer programs.

Programming languages are for humans to read and understand. The program (source code) must be translated into machine language so that the computer can execute the program (as the computer only understands machine language). The way that this translation occurs depends on whether the programming language is a compiled language or an interpreted language.

Compiled languages (e.g. C, C++)

The programming process for a compiled programming language works as follows.


A compiler takes the program code (source code) and converts it into a machine language module (called an object file). Another specialized program, called a linker, then combines this object file with other previously compiled object files (in particular, run-time modules) to create an executable file.
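As a concrete sketch, with a typical C compiler such as gcc, the two steps can be invoked like this (hello.c is a hypothetical source file; exact flags vary by toolchain):

gcc -c hello.c -o hello.o    # compile: source code -> object file
gcc hello.o -o hello         # link: object file + runtime modules -> executable
./hello                      # run the finished executable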


So, for a compiled language the conversion from source code to machine executable code takes place before the program is run. This is a very different process from what takes place for an interpreted programming language.

This is somewhat simplified, as many modern programs created with compiled languages make use of dynamically linked libraries or shared libraries. The executable file may therefore require these dynamically linked libraries (Windows) or shared libraries (Linux, Unix) in order to run.

Interpreted programming languages (e.g. Python, Perl)

The process is different for an interpreted language. Instead of the source code being translated into machine language before the program is run, an interpreter converts the source code into machine language while the program is running.


Interpreted languages use a special program called an interpreter that converts the source code, combines it with runtime libraries, and executes the resulting machine instructions, all at runtime. Unlike with a compiled language, there is no precompiled program to run: the conversion and the combination with runtime libraries take place every time an interpreted program is run. This is why programs written in compiled languages tend to run faster than comparable programs written in interpreted languages.
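For example, with an interpreted language such as Python, a single command both translates and runs the program (assuming a hypothetical script named hello.py):

python hello.py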


Because the interpreter performs the conversion from source to machine language while the program is running, interpreted languages usually produce programs that execute more slowly than compiled programs. What is often gained in return is platform independence: the same source code can run on any operating system for which an interpreter exists.

And now for something different ... Java

The Java programming language does not fit neatly into either the compiled language or the interpreted language model.


The Java compiler (javac) converts the source code into bytecode. Bytecode is a kind of average machine language. This bytecode file (.class file) can be run on any operating system by using the Java interpreter (java) for that platform. The interpreter is referred to as a Virtual Machine. Thus, Java is an example of a Virtual Machine programming language.

Virtual machine languages were created as a compromise between compiled and interpreted languages. Under ideal conditions, virtual machine language programs run closer in speed to compiled language programs while retaining the platform independence of interpreted language programs.

Virtual machine languages make use of both a compiler and an interpreter. The compiler converts the source code into a kind of average machine language. In Java, this average machine language is called bytecode. In Visual Studio .NET languages, it is called MSIL (Microsoft Intermediate Language). (To keep the discussion on this page simpler, this compiled code will be referred to generically as bytecode from this point on.) The interpreter for a virtual machine language is a special program that provides the runtime libraries for a given operating system; there is a different virtual machine interpreter for each supported operating system.

Virtual machine languages get some of the speed of compiled languages because the source code is run through the compiler to create the bytecode, and that conversion takes place before the program is ever run. They gain their portability (platform independence) by having a different interpreter for each supported operating system; each interpreter ties in the correct runtime libraries for its operating system. The compiled bytecode is an average machine language that works, without changes, with any of the virtual machine interpreters for that language. Because each operating system has different runtime libraries, each virtual machine interpreter bundles different runtime library code; this is how virtual machine languages get around platform dependency problems.

Once again, note that the bytecode does not need to be recompiled to run on any of the different operating systems. The only reason to recompile a program is if you changed the source code.

Hopefully, you can see how virtual machine language programs will have better performance than interpreted language programs. The virtual machine languages convert the source code to an average machine code before the program is ever run. Virtual machine languages don't quite match the performance of compiled languages because the bytecode still has to be loaded by the virtual machine before running.

Details of the Java programming process

The source code for a Java program is a text file that ends in ".java". Suppose you typed out the following file, "Hello.java".

class Hello {
    public static void main(String[] args) {
        System.out.println("Hello");
    }
}

To compile this program, you would type the following at the command line:

javac Hello.java

The Java compiler is named javac. The javac program is unique in that it does not produce actual machine code. Instead it produces something called bytecode. Unlike machine code, bytecode is not platform specific. The bytecode produced on a Windows machine is the same bytecode that is produced on a Linux machine. This means that the bytecode can be run (without recompiling) on any platform that has a Java interpreter.
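If you are curious what bytecode looks like, the JDK ships with a disassembler, javap; once Hello has been compiled (see below), the following prints a human-readable listing of the bytecode in Hello.class:

javap -c Hello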

If the compilation into bytecode is successful, a file called "Hello.class" containing the bytecode is created. To run this bytecode, the Java interpreter is invoked in the following way.

java Hello

Note that the name of the Java interpreter is java. Also note that you do not include the .class extension when invoking the interpreter. By default, the .class file is created in the directory from which you run the compiler.
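If you later run the program from a different directory, you can point the interpreter at the directory containing the .class file with the -cp (classpath) option; the path below is hypothetical:

java -cp /path/to/classes Hello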

Programming tip

At this point, one of the best ways to make progress in Java programming is to take a program that works and purposely introduce errors in the source code. This will help you to start recognizing how the compiler reports the various kinds of errors. For example, try the following:

  • Remove the semicolon at the end of a statement.
  • Remove the right curly brace at the end of a block.
  • Add an extra left curly brace just before the beginning of a block.
  • Misspell the word main. The main method marks the starting point of the program.

When the error is reported, take note of the location of the error that the compiler reports. As you will see, the line that the compiler points to as having the error may not be the actual line the error occurs on.
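For instance, deleting the semicolon after the println call in Hello.java and recompiling produces output along these lines (the exact wording depends on the compiler version):

Hello.java:3: error: ';' expected
        System.out.println("Hello")
                                   ^
1 error

Here the compiler points at line 3, which happens to be where the error really is; for other mistakes, such as a missing brace, the reported line can be far from the actual problem.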

In computer programming, machine code is any low-level programming language, consisting of machine language instructions, which are used to control a computer's central processing unit (CPU). Each instruction causes the CPU to perform a very specific task, such as a load, a store, a jump, or an arithmetic logic unit (ALU) operation on one or more units of data in the CPU's registers or memory.

[Figure: Machine language monitor in a W65C816S single-board computer, displaying code disassembly, as well as processor register and memory dumps.]

Machine code is a strictly numerical language which is designed to run as fast as possible, and may be considered as the lowest-level representation of a compiled or assembled computer program or as a primitive and hardware-dependent programming language. While it is possible to write programs directly in machine code, managing individual bits and calculating numerical addresses and constants manually is tedious and error-prone. For this reason, programs are very rarely written directly in machine code in modern contexts, but may be done for low level debugging, program patching (especially when assembler source is not available) and assembly language disassembly.

The majority of practical programs today are written in higher-level languages or assembly language. The source code is then translated to executable machine code by utilities such as compilers, assemblers, and linkers, with the important exception of interpreted programs,[nb 1] which are not translated into machine code. However, the interpreter itself, which may be seen as an executor or processor performing the instructions of the source code, typically consists of directly executable machine code (generated from assembly or high-level language source code).

Machine code is by definition the lowest level of programming detail visible to the programmer, but internally many processors use microcode or optimise and transform machine code instructions into sequences of micro-ops. This is not generally considered to be machine code.

Every processor or processor family has its own instruction set. Instructions are patterns of bits, digits or characters that correspond to machine commands. Thus, the instruction set is specific to a class of processors using (mostly) the same architecture. Successor or derivative processor designs often include instructions of a predecessor and may add new additional instructions. Occasionally, a successor design will discontinue or alter the meaning of some instruction code (typically because it is needed for new purposes), affecting code compatibility to some extent; even compatible processors may show slightly different behavior for some instructions, but this is rarely a problem. Systems may also differ in other details, such as memory arrangement, operating systems, or peripheral devices. Because a program normally relies on such factors, different systems will typically not run the same machine code, even when the same type of processor is used.

A processor's instruction set may have all instructions of the same length, or it may have variable-length instructions. How the patterns are organized varies with the particular architecture and type of instruction. Most instructions have one or more opcode fields that specify the basic instruction type (such as arithmetic, logical, jump, etc.) and the operation (such as add or compare), and other fields that may give the type of the operand(s), the addressing mode(s), the addressing offset(s) or index, or the operand value itself (such constant operands contained in an instruction are called immediates).[1]

Not all machines or individual instructions have explicit operands. An accumulator machine has a combined left operand and result in an implicit accumulator for most arithmetic instructions. Other architectures (such as 8086 and the x86-family) have accumulator versions of common instructions, with the accumulator regarded as one of the general registers by longer instructions. A stack machine has most or all of its operands on an implicit stack. Special purpose instructions also often lack explicit operands (CPUID in the x86 architecture writes values into four implicit destination registers, for instance). This distinction between explicit and implicit operands is important in code generators, especially in the register allocation and live range tracking parts. A good code optimizer can track implicit as well as explicit operands which may allow more frequent constant propagation, constant folding of registers (a register assigned the result of a constant expression freed up by replacing it by that constant) and other code enhancements.

A computer program is a list of instructions that can be executed by a central processing unit (CPU). A program is executed so that the CPU running it can solve a problem and thus accomplish a result. While simple processors execute instructions one after another, superscalar processors are capable of executing many instructions simultaneously.

Program flow may be influenced by special 'jump' instructions that transfer execution to an address (and hence instruction) other than the next numerically sequential address. Whether these conditional jumps occur is dependent upon a condition such as a value being greater than, less than, or equal to another value.

A much more human-friendly rendition of machine language, called assembly language, uses mnemonic codes to refer to machine code instructions, rather than using the instructions' numeric values directly, and uses symbolic names to refer to storage locations and sometimes registers. For example, on the Zilog Z80 processor, the machine code 00000101, which causes the CPU to decrement the B processor register, would be represented in assembly language as DEC B.

The MIPS architecture provides a specific example of a machine code whose instructions are always 32 bits long. The general type of instruction is given by the op (operation) field, the highest 6 bits. J-type (jump) and I-type (immediate) instructions are fully specified by op. R-type (register) instructions include an additional field, funct, to determine the exact operation. The fields used in these types are:

   6      5     5     5     5      6    bits
[  op  |  rs |  rt |  rd |shamt| funct]   R-type
[  op  |  rs |  rt | address/immediate]   I-type
[  op  |        target address        ]   J-type

rs, rt, and rd indicate register operands; shamt gives a shift amount; and the address or immediate fields contain an operand directly.

For example, adding the registers 1 and 2 and placing the result in register 6 is encoded:

[  op  |  rs |  rt |  rd |shamt| funct]
    0      1     2     6     0     32     decimal
 000000 00001 00010 00110 00000 100000   binary

Load a value into register 8, taken from the memory cell 68 cells after the location listed in register 3:

[  op  |  rs |  rt | address/immediate]
   35      3     8          68            decimal
 100011 00011 01000  0000000001000100    binary

Jumping to the address 1024:

[  op  |        target address        ]
    2               1024                  decimal
 000010  00000000000000010000000000      binary
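To make the bit layout concrete, here is a small Java sketch (our illustration, not part of the original text) that packs the three examples above using shifts and masks and prints their 32-bit encodings:

class MipsEncode {
    // R-type: op(6) rs(5) rt(5) rd(5) shamt(5) funct(6)
    static int rType(int op, int rs, int rt, int rd, int shamt, int funct) {
        return (op << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct;
    }

    // I-type: op(6) rs(5) rt(5) immediate(16)
    static int iType(int op, int rs, int rt, int imm) {
        return (op << 26) | (rs << 21) | (rt << 16) | (imm & 0xFFFF);
    }

    // J-type: op(6) target(26); the target field is taken literally from the
    // example above (real MIPS stores the word address, i.e. byte address >> 2)
    static int jType(int op, int target) {
        return (op << 26) | (target & 0x03FFFFFF);
    }

    // Render a word as a full 32-bit binary string, left-padded with zeros
    static String bits(int word) {
        String s = Integer.toBinaryString(word);
        return "0".repeat(32 - s.length()) + s;
    }

    public static void main(String[] args) {
        System.out.println(bits(rType(0, 1, 2, 6, 0, 32)));  // add: registers 1 + 2 -> register 6
        System.out.println(bits(iType(35, 3, 8, 68)));       // load word: 68 cells past register 3 -> register 8
        System.out.println(bits(jType(2, 1024)));            // jump with target field 1024
    }
}

Running it prints the same three bit patterns shown above, minus the field separators.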

On processor architectures with variable-length instruction sets[2] (such as Intel's x86 processor family) it is, within the limits of the control-flow resynchronizing phenomenon known as the Kruskal Count,[3][2] sometimes possible through opcode-level programming to deliberately arrange the resulting code so that two code paths share a common fragment of opcode sequences. These are called overlapping instructions, overlapping opcodes, overlapping code, overlapped code, instruction scission, or jump into the middle of an instruction, and represent a form of superposition.[4][5][6]

In the 1970s and 1980s, overlapping instructions were sometimes used to preserve memory space. One example was the implementation of error tables in Microsoft's Altair BASIC, where interleaved instructions mutually shared their instruction bytes.[7][2][4] The technique is rarely used today, but it might still be necessary where extreme optimization for size at the byte level is required, such as in the implementation of boot loaders that must fit into boot sectors.[nb 2]

It is also sometimes used as a code obfuscation technique as a measure against disassembly and tampering.[2]

The principle is also utilized in shared code sequences of fat binaries which must run on multiple instruction-set-incompatible processor platforms.

This property is also used to find unintended instructions, called gadgets, in existing code repositories, and is utilized in return-oriented programming as an alternative to code injection for exploits such as return-to-libc attacks.[8][2]

In some computers, the machine code of the architecture is implemented by an even more fundamental underlying layer called microcode, providing a common machine language interface across a line or family of different models of computer with widely different underlying dataflows. This is done to facilitate porting of machine language programs between different models. An example of this use is the IBM System/360 family of computers and their successors. With dataflow path widths of 8 bits to 64 bits and beyond, they nevertheless present a common architecture at the machine language level across the entire line.

Using microcode to implement an emulator enables the computer to present the architecture of an entirely different computer. The System/360 line used this to allow porting programs from earlier IBM machines to the new family of computers, e.g. an IBM 1401/1440/1460 emulator on the IBM S/360 model 40.

Machine code is generally different from bytecode (also known as p-code), which is either executed by an interpreter or itself compiled into machine code for faster (direct) execution. An exception is when a processor is designed to use a particular bytecode directly as its machine code, such as is the case with Java processors.

Machine code and assembly code are sometimes called native code when referring to platform-dependent parts of language features or libraries.[9]

The Harvard architecture is a computer architecture with physically separate storage and signal pathways for the code (instructions) and data. Today, most processors implement such separate signal pathways for performance reasons but actually implement a Modified Harvard architecture, so they can support tasks like loading an executable program from disk storage as data and then executing it. The Harvard architecture is contrasted with the Von Neumann architecture, where data and code are stored in the same memory that the processor reads when executing commands.

From the point of view of a process, the code space is the part of its address space where the code in execution is stored. In multitasking systems this comprises the program's code segment and usually shared libraries. In a multi-threading environment, different threads of one process share code space along with data space, which considerably reduces the overhead of context switching as compared to process switching.

Pamela Samuelson wrote that machine code is so unreadable that the United States Copyright Office cannot identify whether a particular encoded program is an original work of authorship;[10] however, the US Copyright Office does allow for copyright registration of computer programs[11] and a program's machine code can sometimes be decompiled in order to make its functioning more easily understandable to humans.[12] However, the output of a decompiler or disassembler will be missing the comments and symbolic references, so while the output may be easier to read than the object code, it will still be more difficult than the original source code. This problem does not exist for object-code formats like SQUOZE, where the source code is included in the file.

Cognitive science professor Douglas Hofstadter has compared machine code to genetic code, saying that "Looking at a program written in machine language is vaguely comparable to looking at a DNA molecule atom by atom."[13]

  • Assembly language
  • Endianness
  • List of machine languages
  • Machine code monitor
  • Overhead code
  • P-code machine
  • Reduced instruction set computing (RISC)
  • Very long instruction word
  • Teaching Machine Code: Micro-Professor MPF-I

  1. ^ Such as many versions of BASIC, especially early ones, as well as Smalltalk, MATLAB, Perl, Python, Ruby and other special purpose or scripting languages.
  2. ^ As an example, the DR-DOS MBRs and boot sectors (which also hold the partition table and BIOS Parameter Block, leaving less than 446 and 423 bytes, respectively, for the code) were traditionally able to locate the boot file in the FAT12 or FAT16 file system by themselves and load it into memory as a whole, in contrast to their counterparts in MS-DOS/PC DOS, which instead relied on the system files to occupy the first two directory entries in the file system and the first three sectors of IBMBIO.COM to be stored at the start of the data area in contiguous sectors containing a secondary loader to load the remainder of the file into memory (requiring SYS to take care of all these conditions). When FAT32 and LBA support was added, Microsoft even switched to require 386 instructions and split the boot code over two sectors for code size reasons, which was not an option for DR-DOS as it would have broken backward- and cross-compatibility with other operating systems in multi-boot and chain load scenarios, as well as with older PCs. Instead, the DR-DOS 7.07 boot sectors resorted to self-modifying code, opcode-level programming in machine language, controlled utilization of (documented) side effects, multi-level data/code overlapping and algorithmic folding techniques to still fit everything into a physical sector of only 512 bytes without giving up any of their extended functionality.

  1. ^ Kjell, Bradley. "Immediate Operand".
  2. ^ a b c d e Jacob, Matthias; Jakubowski, Mariusz H.; Venkatesan, Ramarathnam (20–21 September 2007). Towards Integral Binary Execution: Implementing Oblivious Hashing Using Overlapped Instruction Encodings (PDF). Proceedings of the 9th workshop on Multimedia & Security (MM&Sec '07). Dallas, Texas, USA: Association for Computing Machinery. pp. 129–140. CiteSeerX 10.1.1.69.5258. doi:10.1145/1288869.1288887. ISBN 978-1-59593-857-2. S2CID 14174680. Archived (PDF) from the original on 2018-09-04. Retrieved 2021-12-25. (12 pages)
  3. ^ Lagarias, Jeffrey C.; Rains, Eric; Vanderbei, Robert J. (2009) [2001-10-13]. Brams, Stephen; Gehrlein, William V.; Roberts, Fred S. (eds.). The Kruskal Count (PDF). The Mathematics of Preference, Choice and Order. Essays in Honor of Peter J. Fishburn. Berlin / Heidelberg, Germany: Springer-Verlag. pp. 371–391. arXiv:math/0110143. ISBN 978-3-540-79127-0. Archived (PDF) from the original on 2021-12-25. Retrieved 2021-12-25. (22 pages)
  4. ^ a b "Unintended Instructions on x86". Hacker News. 2021. Archived from the original on 2021-12-25. Retrieved 2021-12-24.
  5. ^ Kinder, Johannes (2010-09-24). Static Analysis of x86 Executables [Statische Analyse von Programmen in x86 Maschinensprache] (PDF) (Dissertation). Munich, Germany: Technische Universität Darmstadt. D17. Archived from the original on 2020-11-12. Retrieved 2021-12-25. (199 pages)
  6. ^ "What is "overlapping instructions" obfuscation?". Reverse Engineering Stack Exchange. 2013-04-07. Archived from the original on 2021-12-25. Retrieved 2021-12-25.
  7. ^ Gates, William "Bill" Henry, Personal communication (NB. According to Jacob et al.)
  8. ^ Shacham, Hovav (2007). The Geometry of Innocent Flesh on the Bone: Return-into-libc without Function Calls (on the x86) (PDF). Proceedings of the ACM, CCS 2007. ACM Press. Archived (PDF) from the original on 2021-12-15. Retrieved 2021-12-24.
  9. ^ "Managed, Unmanaged, Native: What Kind of Code Is This?". developer.com. 2003-04-28. Retrieved 2008-09-02.
  10. ^ Samuelson, Pamela (September 1984). "CONTU Revisited: The Case against Copyright Protection for Computer Programs in Machine-Readable Form". Duke Law Journal. 1984 (4): 663–769. doi:10.2307/1372418. JSTOR 1372418. PMID 10268940.
  11. ^ "Copyright Registration for Computer Programs" (PDF). US Copyright Office. August 2008. Retrieved 2014-02-23.
  12. ^ "What is decompile? - Definition from WhatIs.com". WhatIs.com. Retrieved 2016-12-26.
  13. ^ Hofstadter, Douglas R. (1980). Gödel, Escher, Bach: An Eternal Golden Braid. p. 290.

