Chapter 6: Practical Parallelism

In the previous chapter, we walked through the basic capabilities of web workers. We used web workers for true parallelism in the browser because they map to real threads, which, in turn, map to separate CPUs. This chapter builds on the last, providing some motivation for designing parallel code in the first place.

We'll start with a brief look at some ideas borrowed from functional programming, and how they're a nice fit for concurrency problems. Then, we'll tackle the problem of deciding whether going parallel is even worthwhile: whether to compute in parallel or to simply run on one CPU. After that, we'll go in depth on some concurrency problems that would benefit from tasks running in parallel. We'll also address the problem of keeping the DOM responsive using workers.

Functional programming

Functions are obviously central to functional programming. But then, so is the data that flows through our application. In fact, the data and its movement in a program is probably just as important as the implementation of the functions themselves, at least as far as application design is concerned.

There's a strong affinity between functional programming and concurrent programming. In this section, we'll look at why this is, and how we can apply functional programming techniques that can result in stronger concurrent code.

Data in, data out

Functional programming is just as powerful as other programming paradigms. It's a different way of tackling the same problems, using a different set of tools. For example, functions are the building blocks, and we'll use them to build our abstractions around data transformations. Imperative programming, on the other hand, uses constructs such as classes to build abstractions. The fundamental difference is that classes and objects like to encapsulate the state of something, while functions are data in, data out.

For example, let's say we had a user object with an enabled property. The idea is that the enabled property has a value, which can change at any given time. In other words, the user changes state. If we were to pass this object around to different areas of our application, then the state is also passed along with it. It's encapsulated as a property. Any one of these components that ends up with a reference to the user object can change it, and then pass it somewhere else. And so on, and so forth.

It's not like this in functional programming. State isn't encapsulated inside objects and passed around from component to component; not because doing so is inherently bad, but because it's just a different way of addressing the problem. Where state encapsulation is a goal of object-oriented programming, getting from point A to point B and transforming data along the way is what functional programming is all about. There's no point C—once the function has done its job, it doesn't care about the state of the data.

Instead, the functional approach creates a new object with the updated property value. The function takes data as input and returns new data as output. In other words, it doesn't modify the input. It's a simple idea, but one with important consequences, such as immutability.
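
For instance, enabling a user might look something like the following minimal sketch; the property names and the use of Object.assign() are illustrative.

// Rather than flipping the flag on the object we were handed, build
// and return a new object that carries the updated value. The input
// object is left untouched.
function enableUser(user) {
    return Object.assign({}, user, { enabled: true });
}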

Immutability

Immutable data is a key functional programming concept and one that fits nicely into concurrent programming. JavaScript is a multi-paradigm language. That is, it's functional, but it can also be imperative. Some functional programming languages strictly enforce immutability—you simply cannot change the state of an object. It's actually nice to have the flexibility of choosing when to keep data immutable and when it makes sense not to.

In the last example of the previous section, it was noted that a function should actually return a brand new object with a property value that's different from the input value. This is done to avoid mutating the input value. Although it may seem wasteful to constantly create objects on the fly, it really isn't. Consider all the bookkeeping code that we don't have to write when an object never changes.

For example, if the enabled property of a user is mutable, then any component that uses the object needs to constantly check that property; the check has to happen whenever a component wants to show the user. We still need to perform the same check with the functional approach. The difference is what we have to worry about: if some part of our system can change the enabled property, then every other part has to account for that possibility. Eliminating those external changes, which are called side-effects, also eliminates a host of other complexities.

Side-effects and concurrency don't get along well. In fact, it's the very idea that an object can change at all that makes concurrency hard. For example, let's say we have two threads that want to access our user object. They first need to acquire access to it, but it might already be locked.

Let us suppose that the first thread locks the user object, preventing other threads from accessing it. The second thread needs to wait until it's unlocked before it can continue. This is called resource contention, and it defeats the whole purpose of utilizing multiple CPUs. The threads aren't truly running in parallel if they're waiting for access to some kind of resource. Immutability side-steps the resource contention issue because there's no need to lock resources that don't change.

When objects don't change state, any number of threads can access them concurrently without any risk of corrupting the state of an object due to out-of-order operations and without wasting valuable CPU time waiting on resources.

Referential transparency and time

Functions that take immutable data as input have something called referential transparency. This means that given the same object as input, no matter how many times it's called, the function will always return the same result. This is a useful property because it means that temporal factors are removed from the picture. That is, the only factor that can change the function's output is its input—not when it's called relative to other functions.

Put another way, referentially-transparent functions don't produce side-effects because they work with immutable data. Because of this, and because time isn't a factor in their output, they're well-suited to a concurrent context. Let's take a look at a function that isn't referentially-transparent:
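
Here's a minimal sketch of what such a function might look like; the user object and its properties are illustrative.

// The result of this function depends on the mutable enabled state of
// the user object it's given, not just on which object is passed in.
function getName(user) {
    if (user.enabled) {
        return user.name;
    }
}

const user = { name: 'ES6', enabled: false };

console.log(getName(user));  // → undefined

// Something else in the application flips the flag...
user.enabled = true;

// ...and the same call now produces a different result.
console.log(getName(user));  // → 'ES6'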

The way the getName() function works depends on the state of the user object that's passed to it. If the user object is enabled, we return the name. Otherwise, we don't return anything. This means that the function isn't referentially transparent if it's passed a mutable data structure, which is the case in the preceding example. The enabled property changes, and so does the result of the function. Let's fix this situation and make it referentially-transparent with the following code:
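
A sketch of one possible fix follows; the updates argument and the use of Object.assign() to build the copy are assumptions.

// Returns a new object with the updated values merged in; the input
// object is never modified.
function updateUserRT(user, updates) {
    return Object.assign({}, user, updates);
}

const user = { name: 'ES6', enabled: false };
const enabledUser = updateUserRT(user, { enabled: true });

console.log(user.enabled);         // → false, the original is untouched
console.log(enabledUser.enabled);  // → true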

As we can see, the updateUserRT() function doesn't actually change the data. It creates a copy that includes the updated property value. This means that we're safe to call updateUserRT() with the original user object as input at any time.

This functional programming technique helps us write concurrent code because the order in which we execute operations isn't a factor. Ordering asynchronous operations is hard. Immutable data leads to referential transparency, which leads to stronger concurrency semantics.

Do we need to go parallel?

Parallelism can be hugely beneficial for the right sort of problems. Creating workers and synchronizing the communication between them to carry out tasks isn't free. For example, we could have some nice, well-thought-out parallel code that utilizes four CPU cores, only to find that the time spent executing the boilerplate code that facilitates this parallelism exceeds the time it would take to simply process the data in a single thread.

In this section, we'll address the issues associated with validating the data that we're processing and determining the hardware capabilities of the system. We'll always want to have a synchronous fallback option for the scenarios where parallel execution simply doesn't make sense. When we decide to go parallel, our next job is to figure out exactly how the work gets distributed to workers. All of these checks are performed at runtime.

How big is the data?

Sometimes, going parallel just isn't worthwhile. The idea with parallelism is to compute more in less time. This gets our results faster, ultimately leading to a more responsive user experience. Having said that, there are scenarios where the data that we process simply does not justify the use of threads. Even some large collections of data may not stand to benefit from parallelization.

The two factors that determine how suitable a given operation is for parallel execution are the size of the data and the time complexity of the operation that we perform on each item in the collection. For example, if we have an array with thousands of objects in it, but the computation performed on each object is cheap, then there's no real motivation to go parallel. Likewise, we could have an array with very few objects but an expensive operation; again, we may not benefit from subdividing the work into smaller tasks and distributing them to worker threads.

The static factor is the computation that we perform on individual items. At design time, we have to have a general idea of whether the code is expensive or cheap in terms of CPU cycles. This might require some static analysis, some quick benchmarks, or just a glance mixed with know-how and intuition. When we devise our criteria for determining whether a given operation is well-suited for parallel execution or not, we need to combine the computation itself with the size of the data.

Let's take a look at an example that uses these performance characteristics to determine whether or not a given operation should be executed in parallel:
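
The following is one possible shape for such a check; the threshold values are invented for illustration (deliberately low, so that the later examples take the parallel path) and would normally come from profiling.

// Rough thresholds for when parallel execution starts to pay off.
const EXPENSIVE_TASK_THRESHOLD = 100;
const CHEAP_TASK_THRESHOLD = 5000;

// Decides whether the given data is worth processing in parallel. The
// expensiveTask flag captures what we know, at design time, about the
// cost of the computation applied to each item.
function isConcurrent(data, expensiveTask = false) {
    const threshold = expensiveTask ?
        EXPENSIVE_TASK_THRESHOLD :
        CHEAP_TASK_THRESHOLD;

    return data.length >= threshold;
}

console.log(isConcurrent(new Array(50), true));     // → false
console.log(isConcurrent(new Array(10000), false)); // → true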

This function is handy because it's an easy preflight check for us to perform—either it's parallel or it isn't. If it's not, then we can take the short path of simply computing the result and returning it to the caller. If it's parallel, then we'll move onto the next stage of figuring out how to subdivide the operation into smaller tasks.

The isConcurrent() function takes into consideration not only the size of the data, but also the cost of performing a computation on any one of the data items. This lets us fine-tune the concurrency of our application. If there's too much overhead, we can increase the parallel processing threshold. If we make changes to our code that turn a previously inexpensive function into an expensive one, we just need to change the expensiveTask flag.

What happens when our code runs in the main thread just as often as it runs in a worker thread? Does this mean that we have to write our task code twice: once for sequential code and again for our workers? We obviously want to avoid this, so we need to keep our task code modular. It needs to be usable both in the main thread and worker threads.

Hardware concurrency capabilities

Another high-level check that we'll perform in our concurrent applications is the concurrency capabilities of the hardware that we're running on. This informs us how many web workers to create. For example, there's really nothing for us to gain by creating 32 web workers on a system where there are only four CPU cores. On this system, four web workers would be more appropriate. So, how do we get this number?

Let's create a generic function that figures this out for us:
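
A minimal sketch of such a function, assuming a default of four cores when the browser can't tell us.

// Returns the number of workers we should create. When the browser
// doesn't implement navigator.hardwareConcurrency, fall back to the
// default level (four being a common CPU core count).
function getConcurrency(defaultLevel = 4) {
    return navigator.hardwareConcurrency || defaultLevel;
}

// Uses the actual core count when available, the default otherwise.
console.log(getConcurrency());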

Since not all browsers implement the navigator.hardwareConcurrency property, we have to take this into consideration. If we don't know the exact hardware concurrency level, we have to make a guess. Here, we say that four is the most common CPU core count that we're likely to encounter. And since this is a default argument value, it allows for both special-case handling by the caller and easy global changes.

There are other techniques that attempt to measure the concurrency level by spawning worker threads and sampling the rate at which data is returned. This is an interesting technique, but not suitable for production applications because of the overhead and general uncertainty that's involved. In other words, using a static value that covers the majority of our users' systems is good enough.

Creating tasks and assigning work

Once we've decided that a given operation should be performed in parallel, and we know how many workers to create based on the concurrency level, it's time to create some tasks, and assign them to workers. Essentially, this means slicing up the input data into smaller chunks and passing these to the workers that apply our task to a subset of the data.

In the preceding chapter, we saw our first example of taking input data and dividing it into tasks. Once the work was divided, we spawned a new worker and terminated it when the task was complete. Creating and terminating threads like this may not be the ideal approach depending on the type of application we're building. For example, if we occasionally run an expensive operation that would benefit from parallel processing, then it might make sense to spawn workers on demand. However, if we frequently process things in parallel, then it might make more sense to spawn threads when the application starts, and reuse them for processing many types of tasks.

This configuration allows operations to send messages to worker threads that are already running and get results back. There's no overhead associated with spawning new workers and cleaning them up when we're done with them. There is still the problem of reconciliation. We've split the operation into smaller tasks, each returning their own result. However, the operation is expected to return a single result. So when we split work into smaller tasks, we also need a way to join the task results back into a cohesive whole.

Let's write a generic function that handles the boilerplate aspects of splitting work into tasks and bringing the results together for reconciliation. While we're at it, let's also have this function determine whether the operation should be parallelized or run synchronously in the main thread. First, let's look at the task itself that we'll want to run in parallel against each chunk of our data as it's sliced up:
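
A sketch of the task follows: a simple summing function, kept in its own file (the file name is an assumption) so that it can be loaded by both the main thread and the workers.

// sum-task.js
// Sums the values in an array. This is the computation we'll run
// against each chunk of the sliced-up input data.
function sum(array) {
    return array.reduce((result, value) => result + value, 0);
}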

This task is kept separate from our worker code and other parts of our application that run in the main thread. The reason is that we'll want to use this function in both the main thread and the worker threads. Now, we'll make a worker that can import this function and use it with any data that gets passed to the worker in a message:
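
Here's one way the worker might look; the message shape (id, index, task, chunk, args) is an assumption that the parallel() utility further down follows as well.

// worker.js
// Load the shared task code so the worker can run it.
importScripts('sum-task.js');

addEventListener('message', (e) => {
    // The message names the task to run, carries the chunk of data to
    // run it against along with any extra task arguments, and includes
    // the identifiers needed to reconcile the result later.
    const { id, index, task, chunk, args = [] } = e.data;

    // Resolve the task name to one of the imported functions (sum() in
    // this case) and post the result straight back.
    postMessage({ id, index, result: self[task](chunk, ...args) });
});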

Earlier in the chapter, we implemented two utility functions. The isConcurrent() function determines the utility of running an operation as a set of smaller tasks in parallel. The other function, getConcurrency(), determines the level of concurrency that we should be running at. We'll use these two functions here, and we'll introduce two new utility functions. In fact, these are generators that will help us later on. Let's take a look at this:
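
Here's a sketch of these two generators; the pool is built up front using getConcurrency(), and the generator function names are illustrative.

// Spawns the worker pool up front, then yields workers from it in a
// round-robin fashion, forever.
function* genWorkers() {
    const concurrency = getConcurrency();
    const pool = [];

    for (let i = 0; i < concurrency; i++) {
        pool.push(new Worker('worker.js'));
    }

    let current = 0;

    while (true) {
        yield pool[current];
        current = (current + 1) % concurrency;
    }
}

// Yields an ever-increasing id, used to tie task results back to the
// operation that produced them.
function* genId() {
    let counter = 0;

    while (true) {
        yield counter++;
    }
}

const workers = genWorkers();
const id = genId();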

With these two generators in place—workers and id—we're now ready to implement our parallel() higher-order function. The idea is to take a function as input, along with some other parameters that allow us to tune the behavior of the parallelization, and return a new function that we can simply invoke as normal throughout our app. Let's take a look at this function now:
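
The following sketch is one way to put this together; the exact parameters (the expensive flag, the task name that the worker resolves, and the chunk size) are assumptions built on the pieces above.

// Builds a parallelized version of a task. The returned function is
// called like any other: pass it the input data (plus any extra task
// arguments) followed by a callback that receives the array of task
// results once every worker has reported back.
function parallel(expensive, taskName, chunkSize, task) {
    return function(data, ...rest) {
        const callback = rest.pop();
        const args = rest;

        // Not worth going parallel? Run the task synchronously in the
        // main thread and hand back a single-element result array.
        if (!isConcurrent(data, expensive)) {
            callback([task(data, ...args)]);
            return;
        }

        const operationId = id.next().value;
        const results = [];
        let pending = 0;

        // Slice the input into chunks and hand each one to the next
        // worker in the pool.
        for (let i = 0, index = 0; i < data.length; i += chunkSize, index++) {
            const worker = workers.next().value;
            const chunkIndex = index;
            pending++;

            const onMessage = (e) => {
                // Only handle the reply for this operation and chunk.
                if (e.data.id !== operationId ||
                    e.data.index !== chunkIndex) {
                    return;
                }

                worker.removeEventListener('message', onMessage);

                // Store results in chunk order so that mapped output
                // lines up with the original input.
                results[chunkIndex] = e.data.result;

                if (--pending === 0) {
                    callback(results);
                }
            };

            worker.addEventListener('message', onMessage);
            worker.postMessage({
                id: operationId,
                index: chunkIndex,
                task: taskName,
                args: args,
                chunk: data.slice(i, i + chunkSize)
            });
        }
    };
}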

Now we can use the parallel() function to build concurrent functions that are called all throughout our application. For example, the sumConcurrent() function can be used whenever we have to compute the sum of large inputs. The only thing that's different is the input data.
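
For instance, a concurrent sum might look something like this; the chunk size and input values are illustrative.

// Build a concurrent version of the sum task. The arguments flag the
// per-item cost, name the task for the worker, set the chunk size, and
// supply the function used for the synchronous fallback.
const sumConcurrent = parallel(false, 'sum', 1000, sum);

const values = new Array(100000).fill(1);

sumConcurrent(values, (results) => {
    // Each worker returns a partial sum; one cheap pass in the main
    // thread combines them.
    console.log('sum', results.reduce((total, n) => total + n, 0));
});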

An obvious limitation here is that we only have a single callback function that we can specify when the parallelized function completes. That, and there's a lot of bookkeeping to be done here—having IDs to reconcile tasks with their operations is kind of painful, and it feels as if we're implementing promises. That's because it's essentially what we're doing here. The next chapter dives into more detail on combining promises with workers to avoid messy abstractions, such as the one that we just implemented.

Candidate problems

In the previous section, you learned to create a generic function that decides, on the fly, whether to divide and conquer using workers or whether it's more beneficial to simply call the function in the main thread. Now that we have a generic parallelization mechanism in place, what kind of problems can we solve? In this section, we'll address the most typical concurrency scenarios that benefit from a solid concurrency architecture.

Embarrassingly parallel

A problem is embarrassingly parallel when it's obvious how the larger task can be broken down into smaller tasks. These smaller tasks don't depend on one another, which makes it even easier to start off a task that takes input and produces output without relying on the state of other workers. This again comes back to functional programming, and the idea of referential transparency and no side-effects.

These are the types of problems we want to solve with concurrency—at least at first, during the early implementations of our application. These are the low-hanging fruit as far as concurrency problems go, and they should be easy for us to tackle without risking our ability to deliver functionality.

The last example that we implemented in the preceding section was an embarrassingly parallel problem, where we simply needed each subtask to add up the input values and return them. Global search, when the collection is large and unstructured, is another example of something that takes little effort on our part to divide into smaller tasks and reconcile them into a result. Searching large text inputs is a similar example. Mapping and reducing are yet another example of something that takes relatively little effort to parallelize.

Searching collections

Some collections are sorted. These collections can be searched efficiently because binary search algorithms are able to avoid searching large sections of data, simply based on the premise that the data is sorted. However, there are other times when we work with collections that are largely unstructured or unsorted. In these cases, the time complexity is likely to be O(n) because every item in the collection needs to be checked; no assumptions can be made.

Large strings of text are a good example of a collection that's unstructured. If we were to search this text for a substring, there'd be no way to avoid searching a section of the text based on what we've found so far—the whole search space needs to be covered. Suppose we need to count the number of substring occurrences in a large body of text; this is an embarrassingly parallel problem. Let's write some code that counts the number of substring occurrences in string input. We'll reuse the parallel utilities that we created in the previous section, in particular, the parallel() function. Here's the task that we'll use:
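
A sketch of the counting task; like the sum task, it lives in its own file (the name is an assumption) so that both the main thread and the workers can load it. Because it only relies on indexOf(), it works on strings and arrays alike.

// count-task.js
// Counts the occurrences of "item" within "chunk". Works for substring
// searches on strings and item counts on arrays, since both support
// indexOf() with a starting position.
function count(chunk, item) {
    let total = 0;
    let index = chunk.indexOf(item);

    while (index !== -1) {
        total++;
        index = chunk.indexOf(item, index + 1);
    }

    return total;
}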

Now let's create a block of text for us to search and a parallel function to search it with:
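
Here's one way this might look, assuming the worker also imports count-task.js; the sample text is contrived so that "en" appears exactly three times.

// A contrived body of text to search. With naive chunking, an
// occurrence that straddles a chunk boundary would be missed; this
// sample is chosen so that none do.
const text = 'We often tend to see patterns when none are there at all.';

// Split the input into 20-character chunks and count the occurrences
// of the search term in each one.
const searchConcurrent = parallel(false, 'count', 20, count);

searchConcurrent(text, 'en', (results) => {
    console.log('occurrences',
        results.reduce((total, n) => total + n, 0));
    // → occurrences 3
});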

Here, we're splitting the input string into 20-character chunks and searching for the input value en. Three results are found. Let's see if we can use this same task, along with our parallel worker utilities, to count the number of times an item appears in an array:
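
A sketch of what this might look like; since the array contents are random, the count differs between runs.

// The same task, reused against an array, with a much larger chunk
// size than the string example above.
const arrayCount = parallel(false, 'count', 1000, count);

// 10,000 random integers between 0 and 9.
const array = new Array(10000)
    .fill(null)
    .map(() => Math.floor(Math.random() * 10));

arrayCount(array, 5, (results) => {
    console.log('occurrences of 5',
        results.reduce((total, n) => total + n, 0));
});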

Since we're generating this 10,000 element array using random integers, the output will differ with each run. However, what's nice about our parallel worker utilities is that we were able to call arrayCount() with a substantially larger chunk size.

You may have noticed that we're counting occurrences across the entire input, not locating a specific item within it. This is an example of an embarrassingly parallel problem versus something that's a lot more difficult to solve using concurrency. Our worker nodes in the preceding counting code don't need to communicate with one another. If we had several worker nodes looking for a single item, we would inevitably face an early termination scenario.

But to handle early termination, we need workers that can somehow communicate with one another. This isn't necessarily a bad thing, just more shared state and more concurrency complexity. It's decisions like these that become relevant in concurrent programming—do we optimize elsewhere to avoid certain concurrency challenges?

Mapping and reducing

The Array primitive in JavaScript already has a map() method. As we now know, there are two key factors that impact the scalability and performance of running a given operation against a given set of input data. It's the size of the data multiplied by the complexity of any task that's applied to each item within this data. These constraints can cause problems for our application if we're shoving tons of data into one array, then processing each array item with expensive code.

Let's see whether the approach that we've used for the past couple of code examples can help us map one array to another without having to worry about the native Array.map() method running on a single CPU—a potential bottleneck. We'll also address the issue of reducing large collections. It's a similar issue to mapping, only we use the Array.reduce() method. Here are the task functions:
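
A sketch of the two tasks: plucking a property from each item for the map job, and summing values for the reduce job. The pluck name is an assumption, and the sum task is the same shape as the one used earlier.

// map-reduce-tasks.js
// The map task: pluck a single property from every item in the chunk.
function pluck(chunk, property) {
    return chunk.map((item) => item[property]);
}

// The reduce task: sum the values in the chunk.
function sum(chunk) {
    return chunk.reduce((result, value) => result + value, 0);
}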

Now we have generic functions that can be invoked from anywhere—the main thread or from within a worker. We won't look at the worker code again because it uses the same pattern as the examples before this one. It figures out which task to invoke, and it handles formatting the response that's sent back to the main thread. Let's go ahead and use the parallel() utility to create a concurrent map and a concurrent reduce function:
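
Here's a sketch of how these could be wired together, assuming the worker imports the task file above; the property name and collection size match the discussion that follows.

// 75,000 items, each with a "value" property to pluck.
const collection = new Array(75000)
    .fill(null)
    .map(() => ({ value: Math.floor(Math.random() * 100) }));

// With a chunk size of 1000, this creates 75000 / 1000 = 75 tasks.
const pluckConcurrent = parallel(false, 'pluck', 1000, pluck);
const sumConcurrent = parallel(false, 'sum', 1000, sum);

pluckConcurrent(collection, 'value', (mapped) => {
    // The mapped chunks come back in their original order; flatten
    // them into a single array that mirrors the input collection.
    const values = mapped.reduce((all, chunk) => all.concat(chunk), []);

    sumConcurrent(values, (results) => {
        // Only a handful of partial sums are left for the main thread.
        console.log('sum', results.reduce((total, n) => total + n, 0));
    });
});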

Here, we create 75 tasks that are handed out to workers (75000/1000). Depending on our concurrency level, this means we'll have several property values being plucked from array items simultaneously. The reduce job works the same way; we sum the mapped collections concurrently. We still need to perform a final summation in the sumConcurrent() callback, but there are only a handful of values left to add.

We need to exercise caution when performing concurrent reduce jobs. Mapping is straightforward because we're creating what amounts to a clone of the original array in terms of size and ordering; it's just the values that differ. Reducing can depend on the result as it stands so far. Put differently, as each array item makes its way through the reduce function, the accumulated result can influence the final outcome. Concurrency makes this difficult, but in the previous example, the problem was embarrassingly parallel—not all reduce jobs are.

Keeping the DOM responsive

So far in this chapter, the focus has been data-centric—taking input and transforming it by using web workers to divide and conquer. This isn't the only use of worker threads; we can also use them to keep the DOM responsive for our users.

In this section, we'll introduce a concept that's used in Linux kernel development to split events into phases for optimal performance. Then, we'll address the challenge of communicating between the DOM and our workers and vice-versa.

Bottom halves

The Linux kernel has the concept of top-halves and bottom-halves. This idea is used by the hardware interrupt request machinery. The problem is that hardware interrupts happen all the time, and it's the kernel's job to make sure they're all captured and processed in a timely-manner. To do this effectively, the kernel splits the task of processing a hardware interrupt into two halves—the top and bottom half.

It's the job of the top-half to respond to external stimuli, such as a mouse click or a keystroke. However, there are severe limitations imposed on the top-half, and this is on purpose. The top-half portion of processing a hardware interrupt request can only schedule the real work—the invocation of all the other system components—for a later time. This later work is done in the bottom-half. The side-effect of this approach is that interrupts are handled swiftly at a low level, allowing more flexibility in terms of prioritizing events.

What does kernel development have to do with JavaScript and concurrency? Well, it turns out that we can borrow these ideas, and have our "bottom-half" work be delegated to a worker. Our event-handling code that responds to the DOM events wouldn't actually do anything except for passing the message to the worker. This ensures that the main thread is only doing what it absolutely needs to do without any extra processing. This means that if the web worker comes back with something to display, it can do so immediately. Remember, the main thread includes the rendering engine, which blocks our code from running and vice-versa.

JavaScript is run-to-completion, which we're well aware of at this point. This means that the less time we spend in the top-half, the more time we can spend responding to users by updating the screen. At the same time, JavaScript is also run-to-completion within the web worker where our bottom-halves run. This means that the same limitation applies there; if our worker gets 100 messages sent to it in a short period of time, they're processed in first in, first out (FIFO) order.

The difference is that since this code isn't running in the main thread, the UI components still respond when the user interacts with them. This is such a crucial factor in the perception of a quality product that it's worth the time investigating top-halves and bottom-halves. We now just need to figure out an implementation.

Translating DOM manipulation

If we treat web workers as the bottom-halves of our application, then we need a way to manipulate the DOM while spending as little time as possible in the top-half. That is, it's up to the worker to figure out what needs to change in the DOM tree and then to notify the main thread. Then, all that the main thread has to do is translate between the posted message and the required DOM API call. There's no fiddling around with data between receiving these messages and handing control off to the DOM; milliseconds are precious in the main thread.

Let's see how easy this is to implement. We'll start with the worker implementation that sends the DOM manipulation messages to the main thread when it wants to update something in the UI:
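
A sketch of such a worker; the element id, message format, and list-item content are assumptions that the main thread code below mirrors.

// ui-worker.js
// The bottom-half: decide what should change in the DOM and describe
// that change in a message for the main thread to carry out.
function appendChild(target, content) {
    postMessage({
        action: 'appendChild',
        target: target,
        tag: 'li',
        content: content
    });
}

// Post three DOM manipulation messages, one per second.
setTimeout(() => appendChild('list', 'First item'), 1000);
setTimeout(() => appendChild('list', 'Second item'), 2000);
setTimeout(() => appendChild('list', 'Third item'), 3000);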

This worker posts three messages back to the main thread. They're timed using setTimeout(), so we can expect to see a new list item rendered every second until all three are displayed. Now, let's take a look at how the main thread code makes sense of these messages:
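
A minimal sketch of the top-half, assuming the page contains an empty <ul id="list"> element.

// main.js (the top-half)
const worker = new Worker('ui-worker.js');

// Translate each message straight into the corresponding DOM API call;
// nothing else happens in the main thread.
worker.addEventListener('message', (e) => {
    const { action, target, tag, content } = e.data;

    if (action === 'appendChild') {
        const child = document.createElement(tag);
        child.textContent = content;
        document.getElementById(target).appendChild(child);
    }
});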

As we can see, we're giving the top-half (the main thread) very little opportunity to become a bottleneck and cause user interactions to freeze. It's quite simple—the only code that's executed here is the DOM manipulation code. This drastically increases the likelihood of it completing quickly, allowing the screen to visibly update for the user.

What about the other direction, getting external events into the system without interfering with the main thread? We'll look at this next.

Translating DOM events

As soon as a DOM event is triggered, we want to hand off control to our web worker. This way, the main thread can continue as if nothing else is happening, and everyone is happy. Unfortunately, there's a little more to it than this. For instance, we can't simply listen to every single event on every single element and forward each one to the worker; that would defeat the purpose of not running code in the main thread if it's constantly responding to events.

Instead, we only want to listen to the DOM events that the worker cares about. This really is no different from how we would implement any other web application; our components listen to the events they're interested in. To implement this with workers, we need a mechanism that tells the main thread to set up a DOM event listener on a particular element. Then, the worker can simply listen to incoming DOM events and react accordingly. Let's take a look at the worker implementation first:
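
Here's a sketch of the worker side; the element ids and the message format are assumptions.

// events-worker.js
// Ask the main thread, which owns the DOM, to attach listeners on our
// behalf.
postMessage({ action: 'addEventListener', target: 'input', event: 'input' });
postMessage({ action: 'addEventListener', target: 'button', event: 'click' });

// React to the DOM events that the main thread forwards to us.
addEventListener('message', (e) => {
    if (e.data.event === 'input') {
        console.log('worker received input:', e.data.value);
    } else if (e.data.event === 'click') {
        console.log('worker received click on:', e.data.target);
    }
});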

This worker asks the main thread, which has access to the DOM, to set up two event listeners. It then sets up its own event listener for the DOM events that eventually make their way to the worker. Let's take a look at the DOM code responsible for setting up handlers and forwarding events to the worker:
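
A matching sketch of the main thread side, assuming the page contains elements with the ids input and button.

// main.js
const worker = new Worker('events-worker.js');

worker.addEventListener('message', (e) => {
    const { action, target, event } = e.data;

    if (action === 'addEventListener') {
        // Attach the requested listener, forwarding a trimmed-down,
        // serializable description of each DOM event to the worker.
        document.getElementById(target).addEventListener(event, (domEvent) => {
            worker.postMessage({
                event: event,
                target: target,
                value: domEvent.target.value
            });
        });
    }
});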

For the sake of brevity, there are only a couple of event properties sent back to the worker. We can't send the event object as it is due to serialization limitations in web worker messages. In practice, the same pattern can be used, but we'd probably forward more event properties, such as clientX and clientY.