My goal is to add new functionality to Python to enable multi-threading as a first-class component of the language. My motivation is being able to consume N CPU cores on a modern machine in a single Python process, taking advantage of all of the caching locality, concurrency, and throughput multi-threaded processes have to offer on modern operating systems.
The gist of the idea is to allow new Python code to opt in to thread-aware behavior. I believe the core requirements are:
- Tag each object with its thread state ("owned" or "global")
- Thread-local ownership of objects (memory allocation, reference counting, GC)
- Decorators for flagging functions/classes/modules as thread-safe
- Producer/consumer queues for transferring object ownership between threads
- Compatibility mode for acquiring the GIL when necessary
- Ability to see warnings and/or exceptions on accidental GIL acquisition
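As a hypothetical sketch of how the opt-in surface might look (none of this exists in CPython today; the `threadsafe` decorator is stubbed out here as a plain marker):

```python
# Hypothetical sketch of the proposed opt-in API. None of these names
# exist in CPython today; @threadsafe is stubbed as a no-op marker.

def threadsafe(func):
    """Stub for the proposed @threadsafe decorator."""
    func.__threadsafe__ = True
    return func

@threadsafe
def word_count(text):
    # Everything in this body touches only thread-local objects, so
    # under the proposal it could run without acquiring the GIL.
    counts = {}
    for word in text.split():
        counts[word] = counts.get(word, 0) + 1
    return counts
```

The decorated function behaves exactly like normal Python; the marker is what a thread-aware runtime would key off.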
Isolated threads and owned objects
Every Python function has its own scope, accessible via the locals() built-in function. The core of my proposal is to allow code to indicate that the function's locals dictionary is strictly thread-local. That means every object in the locals dict must be owned by the current thread. The isolated thread will manage reference counts and do its own garbage collection. As soon as the thread tries to access a symbol outside a thread-safe locals dict (normally symbol resolution traverses up function scopes, eventually reaching the global scope), the thread will have to acquire the GIL.
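To make the scope rule concrete, here is plain Python today, with comments marking which lookups would stay thread-local and which would escape to the global scope under the proposal:

```python
SHARED_LIMIT = 10   # module-level name: lives in the global scope

def process(items):
    total = 0                  # 'total', 'items', and 'item' are all
    for item in items:         # locals, owned by the current thread and
        total += item          # free of GIL interaction under the proposal
    # Reading SHARED_LIMIT walks past the function scope to the module
    # globals; this is the point where an isolated thread would have to
    # acquire the GIL.
    return min(total, SHARED_LIMIT)
```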
Code will run in two different scopes: the global context, while holding the GIL; and the isolated context, where all interactions are with thread-local data. Immutable built-in objects (strings, tuples, numbers) will be thread-safe by default. When created or accessed in the global context, their reference counting will happen behind the GIL. When created or accessed in an isolated thread context, their reference counting will be local to the owning thread.

Functions declare their thread-safe status with the `@threadsafe` decorator. User-defined classes and modules declare thread-safety with the `__threadsafe__` special attribute. This changes the behavior of a class or module by freezing its attribute `__dict__`, making it immutable after definition (similar to `__slots__`). This enables multiple threads to make concurrent look-ups in the class's immutable attribute dictionary and ensures they do not conflict through mutation. It also means that isolated threads can skip reference counting on a thread-safe class's attributes altogether: incrementing the reference count on the class itself is enough to ensure the attribute dictionary and everything it points to are not deallocated.

Isolated threads can allocate objects of thread-safe classes without holding the GIL. This gives them a reference to the object in their locals dict; they know they are the single owner of that object and the only one mutating its attribute dictionary. Isolated threads may pass that object to any other thread-safe code without acquiring the GIL, and may transfer ownership of the object via a special type of producer/consumer queue (described later).
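The frozen-class behavior can be approximated in today's Python with a metaclass that rejects mutation of the class after its body has executed. This emulation is my assumption about what `__threadsafe__` would do, not part of the proposal's text:

```python
class ThreadSafeMeta(type):
    """Approximates the proposed __threadsafe__ freeze: once the class
    body has executed, the class's attribute dict cannot be mutated."""
    def __setattr__(cls, name, value):
        raise AttributeError(
            f"{cls.__name__} is frozen (__threadsafe__); cannot set {name!r}")
    def __delattr__(cls, name):
        raise AttributeError(
            f"{cls.__name__} is frozen (__threadsafe__); cannot delete {name!r}")

class Point(metaclass=ThreadSafeMeta):
    __threadsafe__ = True
    def __init__(self, x, y):
        self.x = x   # instance attributes stay mutable; only the
        self.y = y   # class-level attribute dict is frozen

p = Point(1, 2)
p.x = 5                  # fine: the instance is owned by one thread
try:
    Point.origin = None  # mutating the class dict is rejected
except AttributeError:
    pass
```

Because the class dict never changes after definition, concurrent attribute lookups on it would need no synchronization.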
Interactions between GILed and thread-safe code
Isolated threads need to be able to call non-thread-safe methods using owned objects, even if it means acquiring the GIL. The risk here is the non-thread-safe code may keep a reference to an owned object after the GILed method completes (e.g., a memoization cache). This opens us up to terrible race conditions like accessing deleted memory.
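The memoization hazard is easy to demonstrate in today's Python: a caching function keeps a reference to its arguments after it returns, so an isolated thread's owned object would outlive the call (`Payload` and `gilled_function` are illustrative names):

```python
import functools
import weakref

class Payload:
    def __init__(self, value):
        self.value = value

@functools.lru_cache(maxsize=None)
def gilled_function(obj):
    # Stands in for non-thread-safe code that memoizes its arguments.
    return obj.value * 2

p = Payload(21)
ref = weakref.ref(p)
assert gilled_function(p) == 42

del p                        # the caller is done with the object...
assert ref() is not None     # ...but the cache still keeps it alive

gilled_function.cache_clear()
assert ref() is None         # only now is the object collected
```

If `p` were owned by an isolated thread doing its own memory management, that lingering cache reference is exactly the use-after-free setup the bookkeeping below is designed to prevent.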
My proposed solution borrows heavily from the style of TCMalloc. Three parameters would be associated with each PyObject: the `owner_thread_id` of the isolated thread that owns the object (-1 for global objects); the `local_ref_count` for this object in the isolated thread context; and the `global_ref_count` for this object from GILed code.

Example synchronization behaviors
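A pure-Python model of this bookkeeping may make the examples below easier to follow. The field names and the -1 sentinel are from the proposal; `TrackedObject`, the helper functions, and `RuntimeError` standing in for the proposed `ConcurrencyError` are my illustration:

```python
import threading

GLOBAL_OWNER = -1   # the proposal's sentinel for globally owned objects

class TrackedObject:
    """Models the three fields the proposal adds to every PyObject."""
    def __init__(self, owner_thread_id):
        self.owner_thread_id = owner_thread_id
        self.local_ref_count = 1
        self.global_ref_count = 0

garbage_list = []   # stand-in for an isolated thread's garbage list

def local_decref(obj):
    """Dropping the last local reference (behavior 1 below)."""
    obj.local_ref_count -= 1
    if obj.local_ref_count == 0 and obj.global_ref_count > 0:
        # GILed code still holds references, so cleanup is deferred to
        # the global garbage collector at the next GIL acquisition.
        garbage_list.append(obj)

def check_access(obj):
    """The pre-attribute-fetch check (behaviors 2 and 3 below)."""
    if obj.owner_thread_id == GLOBAL_OWNER:
        return "acquire-gil"   # global object: take the GIL first
    if obj.owner_thread_id != threading.get_ident():
        raise RuntimeError("ConcurrencyError: owned by another thread")
    return "direct"            # owned by this thread: no locking needed

owned = TrackedObject(threading.get_ident())
owned.global_ref_count = 1     # e.g., a GILed cache kept a reference
local_decref(owned)
assert owned in garbage_list   # cleanup deferred, not leaked
assert check_access(TrackedObject(GLOBAL_OWNER)) == "acquire-gil"
```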
1. Isolated thread orphans owned objects: Say an isolated thread called a GILed function with an owned object, which in turn incremented the `global_ref_count` for that object. Now the isolated thread is done with the object and decrements its `local_ref_count` to zero. The isolated thread knows it cannot garbage collect the object because the `global_ref_count` is non-zero (synchronized when the GIL was acquired for the original function call). Instead, the isolated thread puts the object on a thread-local list of garbage. The next time the isolated thread acquires the GIL, it will add these garbage objects to the global garbage collector, transferring cleanup responsibility.

2. GILed code accesses owned objects: Say an isolated thread calls a GILed function with an owned object and the GILed function keeps a reference to that object. Sometime later, other code calls the GILed function, causing it to try to access the owned object. Before fetching an attribute, the runtime will look at the object's `owner_thread_id`; if it does not match the current execution context, the runtime will raise a `ConcurrencyError` at the point of access. The reason for this behavior is that the GILed code has no way to preempt the isolated code to force synchronization (but the other way around works).

3. Isolated thread accesses global objects: Say an isolated thread calls a GILed function that returns a global object as a result. This object's `owner_thread_id` would be -1 (global), its `global_ref_count` would be positive, and its `local_ref_count` would be incremented (while the GIL was held) by the isolated thread that holds the reference. Sometime later, the isolated thread releases the GIL and does more processing; it then tries to access the global object. Before fetching an attribute, the runtime will look at the object's `owner_thread_id`; it will see that the value is -1 (global), meaning the GIL must be acquired before proceeding. The isolated thread will acquire the GIL and then proceed. Similarly, when an isolated thread needs to decrement a reference to a global object, it will acquire the GIL first.

4. Global thread orphans owned objects: Say an isolated thread called a GILed function with an owned object, which in turn incremented the `global_ref_count` for that object. Although the GILed function can never access this object (unless the calling thread context matches), it may create new local references or delete its local references to the object. This is safe because the `global_ref_count` can only be modified while the GIL is held. When the `global_ref_count` is decremented to zero, the global context can assume that the object will either be cleaned up by the isolated thread in the future or is already on the global garbage list.

5. One isolated thread gets a reference to another isolated thread's owned object: Say isolated thread A calls a GILed function, which holds a reference to an object owned by thread A. Then isolated thread B calls the GILed function and gets back the object owned by thread A. This would immediately raise an `OwnershipError` upon the return of the GILed function. When a symbol is updated in thread-safe code's locals dict (upon function return), the runtime in the thread-safe context will check that the `owner_thread_id` of the object either matches the current context or marks a global object. Isolated threads may never access each other's objects. The only way to have cross-isolated-thread communication is through special producer/consumer queues (described below).

6. GILed code calls thread-safe code: Say a GILed function calls thread-safe code. This will cause all reference increments and decrements to affect the `global_ref_count` field, not the `local_ref_count`, since the execution context is global. If the thread-safe code needs to access global symbols, the GIL will already be held and the code can move forward without any special handling. Put simply: GILed code can always call thread-safe code.

Transferring object ownership
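The ownership-transferring queue described in this section can be modeled on top of `queue.Queue`. This is a pure-Python illustration: `Channel` and `Message` are stand-in names of mine, while `OwnershipError` and the ownership fields come from the proposal:

```python
import queue
import threading

class OwnershipError(Exception):
    """Exception name from the proposal: object is not solely owned."""

class Message:
    """A toy owned object carrying the proposal's ownership fields."""
    def __init__(self):
        self.owner_thread_id = threading.get_ident()
        self.local_ref_count = 1
        self.payload = "hello"

class Channel:
    """Models the ownership-transferring producer/consumer queue."""
    def __init__(self):
        self._queue = queue.Queue()

    def put(self, obj):
        # The runtime would verify sole ownership before any transfer.
        if obj.local_ref_count != 1:
            raise OwnershipError("object has other local references")
        obj.owner_thread_id = None   # in transit: owned by no thread
        self._queue.put(obj)

    def get(self):
        obj = self._queue.get()
        # The consumer takes over memory management and GC duties.
        obj.owner_thread_id = threading.get_ident()
        return obj

ch = Channel()
msg = Message()
ch.put(msg)

received = []
worker = threading.Thread(target=lambda: received.append(ch.get()))
worker.start()
worker.join()

assert received[0] is msg
assert msg.owner_thread_id == worker.ident   # ownership moved to the consumer
```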
There will be a new bidirectional producer/consumer queue type; Go's channels might be good inspiration for the API. Its role is to transfer object ownership between threads. When a producer puts an object into a queue, the Python runtime will verify that the `local_ref_count` for that object is 1; otherwise the channel will raise an `OwnershipError` exception. Putting the object on the queue will remove its symbol from the locals dict. When a consumer thread gets an object from a queue, the `owner_thread_id` will be updated to the caller's context and that thread will take over memory management and GC responsibilities.

A lot more to figure out
This is just a sketch. There's a lot more to figure out about specific integration with the CPython API, specific architecture issues like memory fences, etc. I've been sitting on this post since PyCon this year but haven't found the time to figure those things out. I could use some help.
Meanwhile, I saw this recent post about PyPy trying to use STM, and it seems totally abstract (no offense intended). I highly doubt we can have our cake and eat it too: the language semantics will have to change. My hope is to identify the minimal changes necessary to make migrating from the GIL to multi-threading easy.
Let me know what you think!