How Celery fixed Python's GIL problem


Recently I've been rereading the great Unyielding post by Glyph. If you haven't read it - go read it. I'll be summing it up below but it's worth your time to read the original.
I've been looking at the problem of Python's Global Interpreter Lock, the GIL, for the last 10 years.
The real pain point, with regard to the GIL, seems to be asynchronous I/O - as we know, threads really are sold as a tidy way to handle it. You get a request, you start a thread and magic happens. Concerns are separated yet resources are shared.
But you can't do that efficiently in Python, since threads have to contend for the GIL and there's only one GIL per interpreter, no matter how many cores your machine has. So if you have a top-of-the-line Intel Core i7, you really won't feel much improvement with threading.
That's the theory; reality is even worse - the old GIL (up to Python 3.1) actually degraded when threading across multiple cores and could make your code slower.
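To make that concrete, here's a rough, illustrative sketch (not from the original post, names are mine): a purely CPU-bound function run in two threads takes about as long as running it twice in a row, because only one thread can hold the GIL at a time.
import threading
import time

def count_down(n):
    # pure CPU work: never voluntarily gives up the GIL
    while n > 0:
        n -= 1

N = 50000000

start = time.time()
count_down(N)
count_down(N)
print("sequential: %.2fs" % (time.time() - start))

start = time.time()
threads = [threading.Thread(target=count_down, args=(N,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("two threads: %.2fs" % (time.time() - start))  # about the same, or worse on the old GIL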

Is asynchronous I/O a problem?

Most tasks we do in modern-day programming can be boiled down to I/O, or such is the usual answer.
For example, fetching data from the database is I/O - you wait for the data to arrive and the system can do something else in the meantime, for example serve more requests. Years ago, when I encountered the Twisted project, that was the reasoning for all the hoops it put you through.
The reasoning for asyncio's recent addition to Python is:
"writing single-threaded concurrent code using coroutines, multiplexing I/O access over sockets and other resources, running network clients and servers"
There are a few assumptions in that sentence that I'd like to deconstruct:
  1. People need single-threaded concurrent code
  2. People often need to multiplex I/O
  3. People use coroutines
First of all, do we really need single-threaded concurrent code?
In the last 10 years I've never met a case where someone pointed out that "this code needs to be concurrent, but single-threaded".
I think the idea here is really that we want concurrency, and that's the most frequent argument we see with regard to the GIL - that we can't have true concurrency.
My recent, and maybe obvious, realization was that we really don't need concurrency - I'll get back to this in a minute, but first, let's address the latter two assumptions, just for the kicks of kicking coroutines.
Do people use coroutines? Yes, but not in production code. I may be opinionated, but I've done concurrency in many languages and never ever have I seen anything less readable than coroutines.
If code is meant to be read, coroutines are meant to make your eyes bleed.
Also, cooperative concurrency - a.k.a. coroutines - was abandoned a fair amount of time ago. I wonder why?
Because the primary assumption of coroutines is that they cooperate. Preemptive concurrency forces, or at least tries to force, fair usage of resources.
No such luck with coroutines - if your coroutine blocks, it's blocked and you have to wait. And the whole thread waits, along with all the other coroutines on it.
That's the biggest problem of coroutines in the real world. If your problem is a shell script crunching some Fibonacci numbers then all is fine. But out here in the cold, servers break, connections time out and we can't read all the code of open source libraries we install.
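A small, illustrative asyncio sketch of that failure mode (modern async/await syntax, names are mine): one coroutine that blocks the thread stops every other coroutine on the loop from running.
import asyncio
import time

async def misbehaving():
    # time.sleep() blocks the whole thread - the event loop can't run
    # anything else until it returns; a cooperative coroutine would
    # `await asyncio.sleep(2)` instead
    time.sleep(2)

async def heartbeat():
    for _ in range(4):
        print("tick", time.strftime("%X"))
        await asyncio.sleep(0.5)

async def main():
    # the heartbeat only starts ticking after misbehaving() has finished
    await asyncio.gather(misbehaving(), heartbeat())

asyncio.run(main())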
Going back to concurrency - have I ever been constrained by the number of threads I can use? No, not even once.
I think we need neither coroutines, nor concurrency.
What I think we need, and this is based on looking at where the efforts go, is non-blocking code.

The issue with blocking code

Most code we perceive as blocking is either:
  • really blocked
  • just taking years to complete
As humans we're impatient, but we seem to be particularly poor at waiting for machines to finish. But whether we have to wait a long time or want to avoid blocking, it really boils down to the same problem.
We don't want to block something that can be done quickly, e.g. responding to the user, while waiting for something slow, e.g. a database request, to complete.
Preemptive concurrency (threads) is a nice way to handle this, with some major advantages:
  • the code reads well
  • it can use resources well
On thread readability - not much to say; threads aren't the easiest to understand, but they're definitely better than coroutines.
In regard to resources - that really is an implementation detail that can span from "a bit better" to "incredible". But most implementations can at least use both cores on a two-core machine.
So in classic threading we could:
def handle_user_request(request):
    # kick off the slow work in a background thread and respond right away
    thread_processing_data(request.data).start()
    return Response("OK", request)
And if thread_processing_data times out - well we get an error and that's that. And it uses the second core, if it's available. Nice.
And we can do this today in Python - not exactly, but close. Instead of thread_processing_data we wrap the code processing data in a process. I'm talking, of course, about the great multiprocessing library.
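A minimal sketch of what that could look like - process_data and Response are placeholders here, not a real API:
from multiprocessing import Process

def process_data(data):
    # the slow work; runs in a separate interpreter with its own GIL
    ...

def handle_user_request(request):
    # request.data is copied (pickled) into the child process -
    # this is where "how resources are shared" starts to matter
    Process(target=process_data, args=(request.data,)).start()
    return Response("OK", request)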
But is that really better?
I still have to understand quite a few concepts, especially around how resources are shared between processes, and that is less than obvious.
Is there a better way?

Lockless for the win and Celery

The only thing I want as a programmer is to have code that doesn't block.
I don't care about whether that's through processes, threads, transactional memory or magic.
Take a unit of work, describe its parameters, such as priority, and your job is done. And there's a package in the Python world that allows you to do that - Celery.
Celery is a whole project and there's tons of coordination before you fire your first task - it's true. But when it's all set up, it's incredible.
For example, at work I have a system that pulls social shares for an entry from various networks. As we all know, API calls cost time on their own, and you also have to add the time for the network connection sending data there and back again.
For example:
In [8]: %time update_metrics(e)  
CPU times: user 1.2 s, sys: 137 ms, total: 1.34 s  
Wall time: 1.35 s  
Well, over one second isn't acceptable for user-facing code. Yet I need to trigger this code only when a user visits the entry view. What can I do?
With Celery I wrap update_metrics in a task and I can do this:
update_metrics.apply_async((entry.id,), queue=settings.HIGH_PRIORITY_FRONTEND_QUEUE_KEY)
So here we have:
  • update_metrics, a costly operation, is now executed in a non-blocking fashion
  • the queue parameter denotes which of my queues will handle the task
update_metrics takes a long time to complete - but thanks to Celery I don't have to care about that:
  • it's correctly triggered because of user action
  • the code is readable and very explicit
  • the resources are used where available
And most important of all: I never have to concern myself with whether the code is doing I/O and whether I should yield, or whether it's constrained by CPU or I/O.
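For completeness, a rough sketch of what wrapping update_metrics as a task might look like - Entry and pull_social_shares are stand-ins for my actual model and API code:
from celery import task

@task(ignore_result=True)
def update_metrics(entry_id):
    # runs in a worker process, so the web request never waits for the API calls
    entry = Entry.objects.get(pk=entry_id)       # Entry: placeholder model
    entry.metrics = pull_social_shares(entry)    # placeholder for the slow API code
    entry.save()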

Things Celery can do

Let's say your problem is this: scrape 1000 URLs, then count the frequency of three words the user specified in a form.
Normally, this is problematic - you need to coordinate scraping the URLs. Connections can time out, you need to wait until all that work is done, plus you need to somehow store the user's input.
Without Celery this is a nightmare of figuring out where and how to store data. With Celery all I need are tasks:
import requests  
from collections import Counter  
from celery import task, chord, chain

@task(ignore_result=False)
def find_frequency(link_text, user_words):  
    # we leave the implementation as an exercise to the reader :)
    return {'foo': 1, 'bar': 1, 'baz': 1}

@task(ignore_result=False)
def scrape_url(url):  
    response = requests.get(url)
    if response.status_code == 200:
        return response.text

@task(ignore_result=False)
def aggregate_results(results):  
    counter = Counter()
    for result in results:
        counter.update(result)
    # do something with data
    return counter


user_submitted_words = ['foo', 'bar', 'john']  
urls_to_scrape = []

async_result = chord(  
    [chain(
        scrape_url.subtask(args=(url,)), 
        find_frequency.subtask(args=(user_submitted_words,)))
     for url in urls_to_scrape]
)(aggregate_results.s())
The meat of this example is in the last few lines:
  • chain allows you to make a pipeline out of tasks, one feeding into another
  • chord allows you to group some tasks and have one task that's called when all tasks are complete
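And if the calling code actually needs the aggregated counter back, the chord returns an AsyncResult that can be waited on (blocking, so you'd normally only do this from another task or a background job):
word_counts = async_result.get(timeout=60)  # blocks until aggregate_results has run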
The pros here are quite sizeable:
  • you don't have to take any interest in how this is executed - it could be threads, processes or coroutines (Celery supports, to some extent, all these pool types)
  • since you don't ???
Alas, there's a sizeable pinch of salt in this case too:
  • since Celery tasks are functions, we have to use scrape_url.subtask(args=(url,)) syntax which is not very readable
  • Celery needs clear import paths to tasks, so tasks usually end up as functions inside a tasks.py module - you can't define and submit work inside other tasks
  • since we can't define and chain tasks simply by calling one task from the body of another, objects such as chord and chain are needed, which complicates the code
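To be fair to the first point, Celery does provide the .s() shorthand for subtask (already used for aggregate_results above), so the chord could be written a bit more compactly:
async_result = chord(
    [chain(scrape_url.s(url), find_frequency.s(user_submitted_words))
     for url in urls_to_scrape]
)(aggregate_results.s())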

Lockless?

Apart from the problems listed above, the biggest issue for me is the lack of fine-grained control. Queues are a great primitive for achieving a lockless programming model.
The example above assumes you need to execute a batch of work and then aggregate the results - that's map/reduce in ~30 lines of code.
But let's think of a more problematic design - let's assume we have an object that supports no concurrent access and uses no locks, yet needs to be both read from and written to without blocking.
What can we do?
First (here I assume you use the Django integration):
./manage.py celeryd -Q serial_queue -c 1
This is all you need to run a worker that will process at most one task at a time.
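(On newer Celery versions, where the celeryd management command is gone, the equivalent invocation would be something like the line below - yourproject is a placeholder for your actual project module.)
celery -A yourproject worker -Q serial_queue -c 1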
SERIAL_QUEUE_NAME = 'serial_queue'

@task()
def read_value():  
    return get_lockless_api_value()

@task()
def increment_value():
    # no locks needed: the worker above handles one task at a time,
    # so this read-modify-write can't interleave with another task
    value = get_lockless_api_value()
    new_value = value + 1
    set_lockless_api_value(new_value)

result = read_value.apply_async(queue=SERIAL_QUEUE_NAME)  
result = increment_value.apply_async(queue=SERIAL_QUEUE_NAME)  
result = read_value.apply_async(queue=SERIAL_QUEUE_NAME)
And so, without doing anything special, no locks, no GIL problems, we have the ability to read or write to a value.
Of course, this has one major problem - we can't have concurrent reads, even though they should be possible.

Final remarks

For me, this sums up the whole GIL and coroutines/asyncio debate. I think the main problem we have at the core of Python is that it's heavily inspired by C - and in this case, I think that's Python's weakness.
And I don't see a justification for this effort.
There are hundreds of companies running Python code as glue logic - and that's single-threaded, synchronous code (check how popular Django is) - yet somehow those companies handle tens of thousands of users.
I think that if we want to support full concurrency in Python, this is the way to go: introduce primitives for a fully lockless paradigm built on queues and enable programmers to define their own queues.
Most of it is already implemented in Celery. To have that in Python we'd need to extend the interpreter to manage workers and queues as they're needed, add some syntactic sugar for nested task definition and dispatch, and you're mostly good to go.
I think we never really wanted concurrent code, as programmers. I definitely never wanted coroutines and never needed to multiplex I/O.
What I want are tools to express ideas efficiently, and the way I see it, abstracting threads and concurrency away and fostering a lockless paradigm does exactly that.