The Two’s Complement Story

Ever wondered how positive and negative integers are stored in a computer? I did once, read a few answers on the Internet, and learned that negative integers are stored in two’s complement notation while positive integers are stored as is (in binary, of course). What I failed to do at the time was ask myself why. Why exactly are they stored in that manner, and how does it help? It seems quite far from intuition. I try to answer that why in this post.

If you are not familiar with the binary representation of numbers or the two’s complement notation, don’t worry, I have tried to explain that as well.

Binary Representation

In mathematics and digital electronics, a binary number is a number expressed in the binary numeral system or base-2 numeral system, which represents numeric values using two different symbols: 0 (zero) and 1 (one). Because of its straightforward implementation in digital electronic circuitry using logic gates, the binary system is used internally by almost all modern computers and computer-based devices. Each digit is referred to as a bit.

Integer values are common data items. They are used in computer programs and computation all the time. We learn about them in math class and represent them using the decimal number system, or base 10. The decimal number 233₁₀ and its corresponding binary equivalent 11101001₂ are interpreted respectively as

(2 x 10²) + (3 x 10¹) + (3 x 10⁰)

and

(1 x 2⁷) + (1 x 2⁶) + (1 x 2⁵) + (0 x 2⁴) + (1 x 2³) + (0 x 2²) + (0 x 2¹) + (1 x 2⁰)

But how do we get one from the other? The answer is an algorithm called “Divide by 2”, which uses a stack to keep track of the digits of the binary result.

The Divide by 2 algorithm assumes that we start with an integer greater than 0. A simple iteration then continually divides the decimal number by 2 and keeps track of the remainder. The first division by 2 tells us whether the value is even or odd. An even value will have a remainder of 0 and will have the digit 0 in the ones place. An odd value will have a remainder of 1 and will have the digit 1 in the ones place. We think about building our binary number as a sequence of digits; the first remainder we compute will actually be the last digit in the sequence. As shown in the following figure, we see a reversal property that signals that a stack is likely to be the appropriate data structure for solving the problem.

[Figure: converting a decimal number to binary with the Divide by 2 algorithm]
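
To make the algorithm concrete, here is a minimal C sketch of Divide by 2 (my own illustration, not taken from any particular source). It assumes, like the text above, that we start with an integer greater than 0, and it uses a plain character array as the stack that collects the remainders:

#include <stdio.h>

/* Convert a positive decimal integer to binary using the Divide by 2
   algorithm; remainders are pushed onto a small stack and then popped,
   which reverses them into the correct order. */
void dec_to_bin(unsigned int n)
{
    char stack[32];
    int top = 0;

    while (n > 0) {
        stack[top++] = '0' + (n % 2);   /* push the remainder of n / 2 */
        n = n / 2;
    }
    while (top > 0)
        putchar(stack[--top]);          /* pop: last remainder comes out first */
    putchar('\n');
}

int main(void)
{
    dec_to_bin(233);   /* prints 11101001 */
    return 0;
}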

Computer Memory & Integer Representation

An n-bit storage location can represent up to 2ⁿ distinct entities. For example, a 3-bit memory location can hold one of these eight binary patterns: 000, 001, 010, 011, 100, 101, 110, or 111. Hence, it can represent at most 8 distinct entities. You could use them to represent the numbers 0 to 7, the numbers 8881 to 8888, the characters ‘A’ to ‘H’, up to 8 kinds of fruit (apple, orange, banana, and so on), or up to 8 kinds of animals (lion, tiger, etc.).

Computers use a fixed number of bits to represent integers. The commonly used bit-lengths for integers are 8-bit, 16-bit, 32-bit, and 64-bit. We shall use the 32-bit representation for the rest of this post. For example, the number 233 can be represented using 32 bits as follows:

0000-0000-0000-0000-0000-0000-1110-1001

Signed Integers

Now that we know how to represent positive integers, it’s a good time to wonder how negative integers are represented in a computer. In ordinary usage, one uses a minus sign to designate a negative integer. However, a computer can only store information in bits.

An unsigned integer is either positive or zero. Consider a single digit decimal number: in a single decimal digit, you can write a number between 0 and 9. In two decimal digits, you can write a number between 0 and 99, and so on. Since 9 is equivalent to 10¹ - 1, 99 is equivalent to 10² - 1, etc., in n decimal digits you can write a number between 0 and 10ⁿ - 1. Analogously, in the binary number system, an unsigned integer containing n bits can have a value between 0 and 2ⁿ - 1 (which is 2ⁿ different values). Thus, with 32 bits, we can store numbers between 0 and 2³² - 1 (4,294,967,295).

Signed number representations are required to encode negative numbers in binary number systems. The four best-known methods for representing signed numbers are: signed-magnitude, ones’ complement, two’s complement, and offset binary. I will not discuss offset binary in this post.

Signed Magnitude Representation

In this approach, the problem of representing a number’s sign is solved by allocating one sign bit: setting that bit (often the most significant bit, or MSB) to 0 denotes a positive number, and setting it to 1 denotes a negative number. The remaining bits indicate the magnitude (or absolute value). Hence, in a 32-bit memory location, with only 31 bits left for the magnitude (apart from the sign bit), the magnitude can range from 0 to 2147483647 and the values can range from -2147483647₁₀ to +2147483647₁₀. The extreme positive and negative values, with the MSB acting as the sign bit, are shown in binary below:

1111-1111-1111-1111-1111-1111-1111-1111₂ (-2147483647₁₀)
1000-0000-0000-0000-0000-0000-0000-0000₂ (-0₁₀)
0000-0000-0000-0000-0000-0000-0000-0000₂ (+0₁₀)
0111-1111-1111-1111-1111-1111-1111-1111₂ (+2147483647₁₀)
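
As a quick illustration (a sketch of my own, not a standard routine), packing a value into this 32-bit signed-magnitude form in C could look like the following; it assumes the magnitude fits in the 31 available bits:

#include <stdint.h>

/* Sketch: encode v in 32-bit signed-magnitude form.
   Assumes |v| <= 2147483647, i.e. the magnitude fits in 31 bits. */
uint32_t to_sign_magnitude(int32_t v)
{
    uint32_t sign = (v < 0) ? 0x80000000u : 0u;                 /* MSB is the sign bit */
    uint32_t magnitude = (uint32_t)(v < 0 ? -(int64_t)v : (int64_t)v);
    return sign | (magnitude & 0x7FFFFFFFu);
}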

Though this approach is directly comparable to the common way of showing a sign (placing a “+” or “-” next to the number’s magnitude), it has certain drawbacks.

First, it has two patterns that correspond to 0 (“plus zero” and “minus zero”, both shown above). Second, it does not work well in computation. A good representation method for integers must not only be able to represent the objects of interest, but must also support operations on those objects. This is illustrated with the following example. Consider addition of simple 32-bit signed-magnitude integers, starting with 1 + 1:

  0000-0000-0000-0000-0000-0000-0000-0001₂ (+1₁₀)
+ 0000-0000-0000-0000-0000-0000-0000-0001₂ (+1₁₀)
-------------------------------------------------
  0000-0000-0000-0000-0000-0000-0000-0010₂ (+2₁₀)

We see that we have arrived at the right answer when adding two positive integers. What happens when we try adding two negative integers? Consider (-1) + (-1)

  1000-0000-0000-0000-0000-0000-0000-0001₂ (-1₁₀)
+ 1000-0000-0000-0000-0000-0000-0000-0001₂ (-1₁₀)
-------------------------------------------------
1 0000-0000-0000-0000-0000-0000-0000-0010₂ (+2₁₀, with a carry out of 1)

Since we are adding 32-bit integers, we discard the carry-out bit, and we are left with +2₁₀. We got the magnitude of the sum correct, but not the sign. A naive solution would be to ignore the sign bit during the calculation, in which case (-1) + (-1) is:

  1000-0000-0000-0000-0000-0000-0000-0001₂ (-1₁₀)
+ 1000-0000-0000-0000-0000-0000-0000-0001₂ (-1₁₀)
-------------------------------------------------
  1000-0000-0000-0000-0000-0000-0000-0010₂ (-2₁₀)

We got -2, which is the correct answer. While this also works when adding two positive integers, we see that it’s not a viable solution when adding a positive and a negative integer. Consider (+1) + (-1):

  0000-0000-0000-0000-0000-0000-0000-0001₂ (+1₁₀)
+ 1000-0000-0000-0000-0000-0000-0000-0001₂ (-1₁₀)
-------------------------------------------------
  ?000-0000-0000-0000-0000-0000-0000-0010₂ (±2₁₀)

Which sign bit do we preserve? Actually, it doesn’t matter, because the magnitude of the sum is 2, which is incorrect whether the sign is positive or negative. The answer should have been zero.

Ones’ Complement

Alternatively, a system known as ones’ complement can be used to represent negative numbers. The ones’ complement of a binary number is defined as the value obtained by inverting all the bits in the binary representation of the number. For example, +1 and -1 are represented as :

0000-0000-0000-0000-0000-0000-0000-0001₂ (+1₁₀)

and

1111-1111-1111-1111-1111-1111-1111-1110₂ (-1₁₀)

We notice that this scheme, like signed-magnitude representation, effectively uses one bit to represent the sign of the integer.

The primary advantage that ones’ complement notation has over signed-magnitude is the ease of carrying out operations. Adding two values is straightforward: simply align the values on the least significant bit and add, propagating any carry to the bit one position to the left. If the carry extends past the end of the word, it is said to have “wrapped around”, a condition called “end-around carry”. When this occurs, the carry must be added back into the rightmost bit.

Consider 22 + 3

  0000-0000-0000-0000-0000-0000-0001-0110₂ (+22₁₀)
+ 0000-0000-0000-0000-0000-0000-0000-0011₂ (+3₁₀)
-------------------------------------------------
  0000-0000-0000-0000-0000-0000-0001-1001₂ (+25₁₀)

The addition produces +25, which is the right answer. Now let us consider addition of two negative numbers : (-2) + (-2)

  1111-1111-1111-1111-1111-1111-1111-1101₂ (-2₁₀)
+ 1111-1111-1111-1111-1111-1111-1111-1101₂ (-2₁₀)
-------------------------------------------------
1 1111-1111-1111-1111-1111-1111-1111-1010₂ (-5₁₀, with a carry out of 1)

We see that the carry extends past the end of the word. We therefore have to add the carry back to the LSB, which gives the result as :

 1111-1111-1111-1111-1111-1111-1111-1011₂ (-4₁₀)

Which is the right answer.
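
A small C sketch of this end-around carry rule (my own illustration, assuming 32-bit words) could look like this; the carry that falls out of the top bit is simply added back into the least significant bit:

#include <stdint.h>

/* Sketch: add two 32-bit ones'-complement numbers.
   The addition is done in 64 bits so the carry out of bit 31 is visible,
   and that carry is then added back in (the end-around carry). */
uint32_t ones_complement_add(uint32_t a, uint32_t b)
{
    uint64_t sum = (uint64_t)a + (uint64_t)b;
    sum = (sum & 0xFFFFFFFFu) + (sum >> 32);   /* add the carry back into the LSB */
    return (uint32_t)sum;
}

With a = b = 0xFFFFFFFD (the ones’ complement pattern for -2), this returns 0xFFFFFFFB, the pattern for -4 shown above.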

Let us now consider adding a positive and a negative integer :
(+1) + (-1)

  0000-0000-0000-0000-0000-0000-0000-0001₂ (+1₁₀)
+ 1111-1111-1111-1111-1111-1111-1111-1110₂ (-1₁₀)
-------------------------------------------------
  1111-1111-1111-1111-1111-1111-1111-1111₂ (-0₁₀)

Ones’ complement notation seems to work fine for arithmetic, where signed-magnitude notation clearly failed. But there is one major drawback to ones’ complement notation: zeroes. The extreme values that can fit in a 32-bit memory location when using ones’ complement notation are:

1000-0000-0000-0000-0000-0000-0000-0000₂ (-2147483647₁₀)
1111-1111-1111-1111-1111-1111-1111-1111₂ (-0₁₀)
0000-0000-0000-0000-0000-0000-0000-0000₂ (+0₁₀)
0111-1111-1111-1111-1111-1111-1111-1111₂ (+2147483647₁₀)

We see that we are still stuck with multiple patterns for representing 0: +0₁₀ and -0₁₀, which is undesirable.

Two’s Complement

The problems of multiple representations of 0 and the need for the end-around carry are circumvented by a system called two’s complement. In two’s complement, negative numbers are represented by the bit pattern which is one greater (in an unsigned sense) than the ones’ complement of the positive value.

In two’s complement, there is only one zero, represented by a sequence of 0s. Negating a number (whether positive or negative) is done by inverting all the bits and then adding one to that result. Addition of a pair of two’s-complement integers is the same as addition of a pair of unsigned numbers. The two’s-complement notation, like the ones we’ve discussed earlier for representing signed integers, uses one bit (the MSB) to denote the sign: 0 for positive and 1 for negative integers.

Consider the number -30. To represent it in 2’s complement, you take the binary representation of 30 :

0000-0000-0000-0000-0000-0000-0001-1110

Invert the digits

1111-1111-1111-1111-1111-1111-1110-0001 (One’s complement)

and add one.

1111-1111-1111-1111-1111-1111-1110-0010
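
The same steps can be written in a couple of lines of C (a sketch assuming 32-bit integers); on a two’s complement machine the result matches the compiler’s own representation of -30:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint32_t x = 30;
    uint32_t neg = ~x + 1;                       /* invert all the bits, then add one */

    printf("%08X\n", neg);                       /* prints FFFFFFE2 */
    printf("%08X\n", (uint32_t)(int32_t)-30);    /* also FFFFFFE2 */
    return 0;
}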

Let us look at some examples to show two’s complement in action.

Consider addition of two negative integers : (-2) + (-2)

  1111-1111-1111-1111-1111-1111-1111-1110 (-2₁₀, in 2’s complement)
+ 1111-1111-1111-1111-1111-1111-1111-1110 (-2₁₀, in 2’s complement)
-----------------------------------------
  1111-1111-1111-1111-1111-1111-1111-1100 (-4₁₀, in 2’s complement)

Since the MSB of the result is 1, we know that it’s a negative integer represented in 2’s complement notation. The magnitude is obtained by taking the 2’s complement of the result, as done below:

1111-1111-1111-1111-1111-1111-1111-1100 (-4₁₀, in 2’s complement)

invert all the bits

0000-0000-0000-0000-0000-0000-0000-0011 (1's complement of result)

and add one

0000-0000-0000-0000-0000-0000-0000-0100 (+4₁₀)

Let us now consider the addition of a positive and a negative integer : (+2) + (-2)

  0000-0000-0000-0000-0000-0000-0000-0010 (+2₁₀)
+ 1111-1111-1111-1111-1111-1111-1111-1110 (-2₁₀, in 2’s complement)
-----------------------------------------
  0000-0000-0000-0000-0000-0000-0000-0000 (0₁₀)

We see that two’s complement works much like ones’ complement, but without the drawback of multiple representations of zero and without having to add back an end-around carry.

But it makes no sense!

For those who wonder why inverting all the bits and adding one works for representing negative integers: if you think about it deeply enough, you realize that it actually is quite intuitive. Let me illustrate this with an example. Consider the integer -30. 30 is represented in binary as:

0000-0000-0000-0000-0000-0000-0001-1110 (+30₁₀)

By inverting all the bits in the representation of 30, we obtain the 1’s complement form, which is

1111-1111-1111-1111-1111-1111-1110-0001 (1’s complement of +30₁₀)

When we add the two, we obtain a sequence of only 1s :

   0000-0000-0000-0000-0000-0000-0001-1110
+  1111-1111-1111-1111-1111-1111-1110-0001
----------------------------------------------------------
   1111-1111-1111-1111-1111-1111-1111-1111

If we add 1 to this, we obtain 0

  1111-1111-1111-1111-1111-1111-1111-1111
+ 0000-0000-0000-0000-0000-0000-0000-0001
------------------------------------------
  0000-0000-0000-0000-0000-0000-0000-0000

Therefore, for any number x, we have the following equation:

x + (~x + 1) = 0

where ~x (NOT x) is the representation of x with all the bits inverted. Bringing x to the other side of the equation, we obtain

-x = (~x + 1)

Which is exactly how we obtain the 2’s complement of a number x.
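
As a quick sanity check (a tiny sketch of my own, using unsigned 32-bit arithmetic so that overflow wraps around exactly as in the examples above), the identity holds for any bit pattern:

#include <assert.h>
#include <stdint.h>

int main(void)
{
    uint32_t samples[] = { 0u, 1u, 2u, 30u, 233u, 2147483647u, 4294967295u };
    for (int i = 0; i < 7; i++) {
        uint32_t x = samples[i];
        assert(x + (~x + 1u) == 0u);   /* x plus its two's complement is 0 (mod 2^32) */
    }
    return 0;
}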


The Python GIL

A few days ago I wrote a simple program to learn multi-threading in Python and came across something very interesting. Code for serial and multi-threaded execution is shown below.

#Serial; run one thread with a lot of work.
def countDown(n):
    while n > 0:
        n = n - 1

COUNT = 100000000   #100 Million
countDown(COUNT)

#Multi-threaded; delegate work across two threads.
from threading import Thread

def countDown(n):
    while n > 0:
        n = n - 1

COUNT = 100000000   #100 Million

t1 = Thread(target=countDown, args=(COUNT//2,))
t2 = Thread(target=countDown, args=(COUNT//2,))

t1.start(); t2.start()
t1.join(); t2.join()

Times of execution for the two programs were as follows:
Serial: 5.88249 seconds
Multi-threaded: 9.02672 seconds

Clearly, we don’t see the speed-up that we expected when the work was distributed across two threads. In fact, the multi-threaded program took more time to execute. In order to make sure that what I was seeing was indeed correct, I tried distributing the load equally across four threads and measuring the time taken. This time it took 11.04341 seconds. There is something really fishy going on within…

To understand what exactly is going on we need to know about a few things like ticks, checks, and the infamous Global Interpreter Lock. So let’s dive right into it!

An important thing to make note of would be that Python threads are real system threads and are fully managed by the host operating system.

The GIL

In order to simplify the implementation of the Python interpreter, there is a Global Interpreter Lock which prevents multiple threads from executing in parallel. As a result, what you get instead of parallel processing is co-operative multitasking. Each thread requires the Global Interpreter Lock in order to execute, and the GIL is implemented in such a way that only one thread can hold it at any instant of time. When a thread wants to execute, it acquires the GIL, carries out its execution, and gives up the GIL when it is either about to perform an I/O operation (e.g., waiting on a socket for input) or is done with its execution.

[Figure: Python threads multitasking co-operatively under the GIL]

Ticks and Checks

Alright, so now we know why we didn’t see a significant performance improvement when we distributed the load across two or more threads. But we still haven’t figured out why it took so much more time (almost 2x) when we ran the program in a multi-threaded fashion.

Suppose that we have a Python thread running on the CPU and that this thread does not give up the CPU easily. In a sense, it is a CPU-intensive thread and has either minimal or no I/O-bound instructions. How do other threads acquire the GIL in this scenario?

At this point it is helpful to understand that the GIL is an object that contains a mutex and a condition variable. A thread that holds the GIL mutex is allowed to proceed interpreting bytecode. Other threads that try to acquire the GIL and fail issue a wait() on the condition variable and go to sleep. When the thread that holds the GIL releases the mutex it signals the condition variable instructing the operating system to wake up one or more threads that are sleeping waiting for the GIL.
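
To make that description concrete, here is a rough C sketch of such a lock built from a pthread mutex and condition variable. It is purely illustrative (it is not CPython’s actual implementation), but it shows the acquire/wait/signal behaviour described above:

#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t gil_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  gil_cond  = PTHREAD_COND_INITIALIZER;
static bool gil_held = false;

/* A thread that finds the "GIL" held waits on the condition variable. */
void gil_acquire(void)
{
    pthread_mutex_lock(&gil_mutex);
    while (gil_held)
        pthread_cond_wait(&gil_cond, &gil_mutex);
    gil_held = true;
    pthread_mutex_unlock(&gil_mutex);
}

/* Releasing the "GIL" signals the condition variable, asking the OS to
   wake up one of the threads sleeping in gil_acquire(). */
void gil_release(void)
{
    pthread_mutex_lock(&gil_mutex);
    gil_held = false;
    pthread_cond_signal(&gil_cond);
    pthread_mutex_unlock(&gil_mutex);
}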

In order to address the above-mentioned problem of threads that never give up the CPU, the interpreter is forced to give up the GIL periodically. The interpreter releases the GIL every 100 ticks, where a tick is roughly one Python bytecode operation. Note that ticks are not tied to time; some Python bytecode instructions do not qualify as whole ticks, and one tick may span multiple bytecode instructions in such scenarios. It is also important to note that each tick is an atomic operation, in the sense that the interpreter will not switch threads in the middle of a tick.

[Figure: disassembled Python bytecode of the countDown function]

In normal Python code, the interpreter counts down from 100 as it goes through the evaluation loop and calls a check() function when the tick counter gets to 0.

[Figure: the check performed every 100 ticks]

Since Python threads are actual operating system threads, the thread scheduling process is delegated to the operating system. After the tick counter gets to 0, the executing thread stops executing user code, updates its state, and then releases the GIL. The next bit of code that any thread will execute, if scheduled, is the code that grabs the GIL. So if, in this short interval when the GIL is released, the operating system decides to initiate a thread switch, the new thread will start executing. If not, the old thread will reacquire the GIL and continue its execution.

So What Exactly Happened?

The GIL ensures that interpreter threads are serialized, hence we did not get any performance increase from dividing the CPU-bound work between Python threads. Furthermore, the system calls made to coordinate the threads (the lock, unlock, wait, and signal calls on the GIL mutex and condition variable during the check(), which happens every 100 ticks) incur significant performance overhead.

The Workaround – Multiprocessing

So now we have talked a bit about the Python GIL and how it does not allow multiple threads of the same Python interpreter instance to execute in parallel. One might wonder how a Python programmer can attain true parallelism on multiple cores with the GIL in existence. It is to address this matter that this section was written.

In order to attain true parallelism with multiple CPU cores, one need look no further than the Python standard library. Python’s multiprocessing package trades threads for processes. The main philosophy behind using multiprocessing is that if a single instance of the Python interpreter is restricted by the GIL, one can achieve gains in concurrent executions through multiple interpreter processes in place of multiple threads.

In order to make the results apparent, we shall use the same toy problem that we considered for discussing Python’s behavior on a single Python Interpreter instance – the countDown function.


#Multiprocessing; p1 and p2 run on separate CPU cores,
#each with its own Python Interpreter instance and GIL.

from multiprocessing import Process

def countDown(n):
    while n > 0:
        n = n - 1

COUNT = 100000000   #100 Million

p1 = Process(target=countDown, args=(COUNT//2,))
p2 = Process(target=countDown, args=(COUNT//2,))

p1.start(); p2.start()
p1.join(); p2.join()

Times of execution were as follows:
Serial: 10.3551 seconds
Multiprocessing (with 2 CPU cores): 5.4037 seconds

Another thing worth noting is the similarity between interfaces for multi-threading and multi-processing in Python. Indeed, Python’s multiprocessing module was written with the same interface as the threading package so that code already using threads would not require a massive rewrite to make use of multiple processes.


SIMD Optimizations

The following problem serves as an example where the power of a concept is brought out by a really simple task.

The problem under consideration is as follows: you are given two character arrays, A and B, and the task at hand is to check the arrays for equality. In other words, if there are n elements in both arrays A[0…n-1] and B[0…n-1], then we perform the following sequence of equality checks: A[0] = B[0], A[1] = B[1], …, A[n-1] = B[n-1].

We shall try to develop an optimal solution for this problem in a step-by-step manner. Also, we shall assume that we only have a single thread of execution at our disposal. The function check_equal(char *a, char *b) is the problem solver; it takes two character arrays, a and b, as parameters, checks them for equality, and returns 1 or 0 depending on whether the two arrays are equal or unequal, respectively. We shall also assume that we have at our disposal a global variable n which tells us the length of the arrays (both arrays being of equal length).

1. The Obvious Way

The most obvious solution that comes to our mind would perhaps be something as follows – iterate over each element of the two character arrays simultaneously, check them for equality, and return 0 at the first instance we find any two elements which are unequal. The following C code demonstrates this approach :

int check_equal(char *a, char *b){
    int i;
    for(i = 0; i < n; i++)
    {
        if( a[i] != b[i] )   /* compare one byte at a time */
            return 0;
    }
    return 1;
}

Code1: The Obvious Solution

As a student who has done a course on algorithms, the question that always haunts your mind is: can you do better?

Luckily, there is something we can do to our code to make it faster. Currently we are iterating over each element of the character arrays one by one to check for equality; is it really necessary to do that?

2. The Not-So-Obvious Way

Notice that in Code1, we were iterating over each element of the character array one by one in order to determine equality of the two arrays. Why not check multiple elements of the two arrays at once and check for equality, say by using int pointers? The following C code illustrates this approach :


int check_equal(char *a, char *b){
    int i;
    int j;
    int flag = 0;
    for(i = 0; i < n; i = i + sizeof(int)){
        if(i + sizeof(int) < n){
            /* compare sizeof(int) bytes (typically 4) at once */
            if( (*(int*)(a+i)) != (*(int*)(b+i)) )
                return 0;
        } else {
            flag = 1;
            break;
        }
    }
    if(flag == 1){
        /* compare the remaining tail bytes one by one */
        for(j = i; j < n; j++)
            if(a[j] != b[j])
                return 0;
    }
    return 1;
}

Code2: The Not-So-Obvious Solution

What we’re doing is basically this: a character is typically represented by 1 byte (8 bits) on a computer. In Code1, we were iterating over each of the n bytes one by one to determine equality, whereas in Code2 we are checking 4 bytes at once (assuming that an integer is represented by 4 bytes in memory) from both arrays. The reason this still works is that each integer has a unique sequence of binary digits representing it in the base-2 number system. So unless all the bits in their respective positions are identical in both arrays, the equality will not hold true, which is what we desire from our code.

In case you are still haunted by the “Can you do better?” question, we can use long instead of int to obtain an even greater speedup, as a long is typically represented using 8 bytes in memory on 64-bit platforms.
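
A sketch of that idea is shown below. Instead of a bare long I use uint64_t from <stdint.h> so the chunk size is a fixed 8 bytes, memcpy so the loads do not depend on the arrays being aligned, and an explicit length parameter to keep the sketch self-contained; all of these are my own choices for the illustration, not part of the original code:

#include <stdint.h>
#include <string.h>

/* Sketch: compare the arrays 8 bytes at a time, then finish byte by byte. */
int check_equal_u64(const char *a, const char *b, int n)
{
    int i = 0;
    for (; i + (int)sizeof(uint64_t) <= n; i += (int)sizeof(uint64_t)) {
        uint64_t wa, wb;
        memcpy(&wa, a + i, sizeof wa);   /* load 8 bytes from each array */
        memcpy(&wb, b + i, sizeof wb);
        if (wa != wb)
            return 0;
    }
    for (; i < n; i++)                   /* compare any remaining tail bytes */
        if (a[i] != b[i])
            return 0;
    return 1;
}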

3. A Graph

The performance of the approaches was tested on inputs of varying sizes and recorded. The following is a graph of time taken vs. input size.

[Fig1: Time Taken vs Input Size; x-axis: number of elements in each array; y-axis: time taken in seconds]

The sizes of the inputs, in terms of the number of elements in each array, are given on the x-axis, whereas the time taken, in seconds, by the program to check the two arrays for equality is given on the y-axis. It is evident that using long pointers is faster than using int pointers, which in turn is faster than using char pointers.

4. Moral of the Story

Having read the problem and the two disparate ways of solving it, it’s a good time to ask ourselves about the purpose of this text. The reason we considered this toy problem, and the Not-So-Obvious Solution in particular, is that it captures the essence of SIMD (Single Instruction, Multiple Data) system architectures, which are among the most widely used architectures in the parallel computing domain.

A single instruction, multiple data, or SIMD, system executes a single instruction at a time, but the instruction can operate on multiple data items. These systems often execute their instructions in lockstep: the first instruction is applied to all of the data elements simultaneously, then the second is applied, and so on. This type of parallel system is usually employed in data parallel programs, programs in which the data are divided among the processors and each data item is subjected to more or less the same sequence of instructions. Vector processors and graphics processing units are often classified as SIMD systems.

With a SIMD processor there are two improvements to a process. For one, the data is understood to be in blocks, and a number of values can be loaded all at once. Instead of a series of instructions saying “retrieve this pixel, now retrieve the next pixel”, a SIMD processor will have a single instruction that effectively says “retrieve n pixels” (where n is a number that varies from design to design). For a variety of reasons, this can take much less time than retrieving each pixel individually, as with a traditional CPU design.

Another advantage is that the instruction operates on all loaded data in a single operation. In other words, if the SIMD system works by loading up eight data points at once, the compare operation being applied to the data will happen to all eight values at the same time.
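
On an x86 machine with SSE2, this idea can be expressed directly with compiler intrinsics. The sketch below is my own illustration (with an explicit length parameter and a scalar loop for the tail); it compares 16 bytes of the two arrays with a single vector compare instruction per iteration:

#include <emmintrin.h>   /* SSE2 intrinsics */

int check_equal_simd(const char *a, const char *b, int n)
{
    int i = 0;
    for (; i + 16 <= n; i += 16) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i)); /* load 16 bytes */
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        __m128i eq = _mm_cmpeq_epi8(va, vb);      /* byte-wise compare, all 16 at once */
        if (_mm_movemask_epi8(eq) != 0xFFFF)      /* any byte that did not match? */
            return 0;
    }
    for (; i < n; i++)                            /* compare the remaining tail bytes */
        if (a[i] != b[i])
            return 0;
    return 1;
}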

Hello, World!

This blog post owes its genesis to a course on Parallel Computing which I had taken up (at the time of this writing) at my university. It is my hope that I am able to eloquently portray what I have experienced, present the lucid yet elegant concepts that were either taught to me or picked up from other sources in the manner they were presented to me, and share my journey of discovering new things through this blog.

For those of you who are wondering what the title of the blog means, it’s “Hello, World” in Morse code. Read more about Morse code here.