A few days ago I wrote a simple program to learn multi-threading in Python and came across something very interesting. The code for serial and multi-threaded execution is shown below.
# Serial; run one thread with a lot of work.
def countDown(n):
    while n > 0:
        n = n - 1

COUNT = 100000000  # 100 million
countDown(COUNT)
# Multi-threaded; delegate the work across two threads.
from threading import Thread

def countDown(n):
    while n > 0:
        n = n - 1

COUNT = 100000000  # 100 million
t1 = Thread(target=countDown, args=(COUNT//2,))
t2 = Thread(target=countDown, args=(COUNT//2,))
t1.start(); t2.start()
t1.join(); t2.join()
Execution times for the two programs were as follows:
Serial: 5.88249 seconds
Multi-threaded: 9.02672 seconds
Clearly, we don’t see the speed up that we expected when work was distributed across two threads. In fact, the multi-threaded program took more time to execute. In order to make sure that what I was seeing was indeed correct, I tried to distribute the load equally across four threads and measure the time taken. This time it took 11.04341 seconds. There is something really fishy going on within…
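The post does not show its measurement harness; as a rough sketch, the timings can be reproduced along these lines, using time.perf_counter and a smaller count (both of which are assumptions on my part, not details from the original):

```python
# One possible way (the post does not show its timing code) to reproduce
# the serial vs. multi-threaded measurements with time.perf_counter.
import time
from threading import Thread

def countDown(n):
    while n > 0:
        n = n - 1

COUNT = 1_000_000  # smaller than the post's 100 million, for a quick run

start = time.perf_counter()
countDown(COUNT)
serial = time.perf_counter() - start

start = time.perf_counter()
threads = [Thread(target=countDown, args=(COUNT // 2,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
threaded = time.perf_counter() - start

print(f"serial: {serial:.3f}s, threaded: {threaded:.3f}s")
```

The absolute numbers will differ by machine, but the threaded run should still fail to beat the serial one for this CPU-bound workload.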
To understand what exactly is going on we need to know about a few things like ticks, checks, and the infamous Global Interpreter Lock. So let’s dive right into it!
An important thing to note is that Python threads are real operating system threads, fully managed by the host operating system.
In order to simplify the implementation of the Python interpreter, there is a Global Interpreter Lock which prevents multiple threads from executing in parallel. As a result, what you get instead of parallel processing is co-operative multitasking. Each thread requires the Global Interpreter Lock in order to execute, and the GIL is implemented in such a way that only one thread can hold it at any instant. When a thread wants to execute it acquires the GIL, carries out its execution, and gives up the GIL either when it is about to perform an I/O operation (e.g., waiting on a socket for input) or when it is done with its execution.
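Because the GIL is released around blocking I/O, I/O-bound threads do benefit from threading even under the GIL. A minimal sketch, using time.sleep as a stand-in for real I/O (this demo is my addition, not from the original post):

```python
# Hedged demo: the GIL is released around blocking calls, so I/O-bound
# threads run concurrently. time.sleep stands in for waiting on a socket.
import time
from threading import Thread

def fake_io():
    time.sleep(0.2)  # releases the GIL while "waiting"

start = time.perf_counter()
threads = [Thread(target=fake_io) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.perf_counter() - start

print(f"{elapsed:.2f}s")  # close to 0.2s, not 4 x 0.2s = 0.8s
```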
Ticks and Checks
Alright, so now we know why we didn’t see a significant performance improvement when we distributed the load across two or more threads. But we still haven’t figured out why it took so much more time (almost 2x) when we ran the program in a multi-threaded fashion.
Suppose that we have a Python thread running on the CPU and that this thread does not give up the CPU easily. In a sense, it is a CPU-intensive thread with minimal or no I/O-bound instructions. How do other threads acquire the GIL in this scenario?
At this point it is helpful to understand that the GIL is an object that contains a mutex and a condition variable. A thread that holds the GIL mutex is allowed to proceed interpreting bytecode. Other threads that try to acquire the GIL and fail issue a wait() on the condition variable and go to sleep. When the thread that holds the GIL releases the mutex it signals the condition variable instructing the operating system to wake up one or more threads that are sleeping waiting for the GIL.
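As a simplified sketch (not CPython's actual implementation, which is written in C), the mutex-plus-condition-variable handoff described above can be modeled with Python's own threading primitives:

```python
# Simplified model of the GIL as a mutex plus condition variable.
# This is an illustrative toy, not CPython's real GIL code.
import threading

class ToyGIL:
    def __init__(self):
        self._cond = threading.Condition()  # wraps a mutex internally
        self._held = False

    def acquire(self):
        with self._cond:                 # lock the underlying mutex
            while self._held:            # another thread holds the GIL
                self._cond.wait()        # sleep until signalled
            self._held = True            # we now "hold the GIL"

    def release(self):
        with self._cond:
            self._held = False
            self._cond.notify()          # wake one sleeping waiter

gil = ToyGIL()
gil.acquire()   # interpret bytecode while held...
gil.release()   # ...then hand off to a waiting thread
```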
In order to address the above-mentioned problem of threads that never give up the CPU, the interpreter is forced to release the GIL periodically. The interpreter releases the GIL every 100 ticks, where a tick roughly corresponds to one Python bytecode operation. Note that ticks are not associated with time: some Python bytecode instructions do not qualify as whole ticks, and in such cases one tick may span multiple bytecode instructions. It is also important to note that each tick is atomic, in the sense that the interpreter will not switch threads in the middle of a tick.
In normal Python code the interpreter counts down from 100 as it goes through the evaluation loop and calls a check() function when the tick counter reaches 0. Since Python threads are actual operating system threads, thread scheduling is delegated to the operating system. Once the tick counter reaches 0, the executing thread stops executing user code, updates its state, and then releases the GIL. The next bit of code that any thread will execute, if scheduled, is to grab the GIL. So if, in the short interval while the GIL is released, the operating system decides to initiate a thread switch, the new thread will start executing; if not, the old thread will reacquire the GIL and continue its execution.
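One caveat worth noting: the 100-tick check interval described here is CPython 2's mechanism, as covered in David Beazley's talk. Since Python 3.2 the tick counter is gone, and the interpreter instead releases the GIL on a time-based interval, which can be inspected and tuned from the sys module:

```python
# Python 2 exposed the tick count via sys.setcheckinterval(). In
# Python 3.2+ the GIL is released on a time-based switch interval.
import sys

print(sys.getswitchinterval())  # default is 0.005 seconds (5 ms)
sys.setswitchinterval(0.01)     # request a longer slice per thread
print(sys.getswitchinterval())
sys.setswitchinterval(0.005)    # restore the default
```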
So What Exactly Happened?
The GIL ensures that interpreter threads are serialized, hence we did not get any performance increase from dividing the CPU-bound work between Python threads. Furthermore, the system calls needed to coordinate the threads — lock, unlock, wait, and signal on the GIL mutex and condition variable during each check(), performed every 100 ticks — incur significant performance overhead.
The Workaround – Multiprocessing
So far we have discussed the Python GIL and how it does not allow multiple threads of the same Python interpreter instance to execute concurrently. One might wonder how a Python programmer can attain true parallelism on multiple cores with the GIL in existence. Addressing that question is the purpose of this section.
In order to attain true parallelism with multiple CPU cores, one need look no further than the Python standard library. Python’s multiprocessing package trades threads for processes. The main philosophy behind using multiprocessing is that if a single instance of the Python interpreter is restricted by the GIL, one can achieve gains in concurrent executions through multiple interpreter processes in place of multiple threads.
In order to make the results apparent, we shall use the same toy problem that we considered for discussing Python’s behavior on a single Python Interpreter instance – the countDown function.
# Multiprocessing; p1 and p2 run on separate CPU cores,
# each with its own Python interpreter instance and GIL.
from multiprocessing import Process

def countDown(n):
    while n > 0:
        n = n - 1

COUNT = 100000000  # 100 million

if __name__ == "__main__":  # required on platforms that spawn processes
    p1 = Process(target=countDown, args=(COUNT//2,))
    p2 = Process(target=countDown, args=(COUNT//2,))
    p1.start(); p2.start()
    p1.join(); p2.join()
Execution times with multiprocessing were as follows:
Serial: 10.3551 seconds
Multiprocessing (2 CPU cores): 5.4037 seconds
Another thing worth noting is the similarity between interfaces for multi-threading and multi-processing in Python. Indeed, Python’s multiprocessing module was written with the same interface as the threading package so that code already using threads would not require a massive rewrite to make use of multiple processes.
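As one illustration of this shared-interface philosophy, the standard library also offers the higher-level multiprocessing.Pool, which is not used in the original post; the square function below is a hypothetical stand-in for any CPU-bound task:

```python
# Sketch using multiprocessing.Pool (a higher-level API than Process)
# to spread a CPU-bound function across worker processes.
from multiprocessing import Pool

def square(n):
    # Hypothetical CPU-bound task standing in for real work.
    return n * n

if __name__ == "__main__":
    with Pool(processes=2) as pool:
        results = pool.map(square, [1, 2, 3, 4])  # split across workers
    print(results)  # [1, 4, 9, 16]
```

Unlike bare Process objects, Pool.map also collects return values, preserving the input order.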
The contents of this blog post are mainly derived from the following sources:
- A wonderful talk by David Beazley titled 'Understanding the Python GIL', available on YouTube – https://www.youtube.com/watch?v=Obt-vMVdM8s
- An Article by Erik Froese and Antony Kaplan – http://www.cs.nyu.edu/~lerner/spring10/projects/Python_GIL.pdf
- To learn about multiprocessing in more depth, read Python's official documentation for the multiprocessing module here
- A blog post by Jeff Knupp that discusses the multiprocessing module; read it here