Primer On Parallelism

Parallelism lets you handle resource-intensive and data-heavy computations. This section covers the basics: what parallelism is, where it happens, and how to implement it (in Python).

Implementing Parallelism: Multi-Threading vs. Multi-Processing#

Different programming approaches enable parallelism at different architectural levels. Understanding the distinction between multi-threading and multi-processing is crucial for implementing efficient parallel solutions.

Multi-Threading: Shared Memory Parallelism#

Multi-threading involves running multiple threads within a single process. All threads share the same memory space, enabling efficient communication but requiring careful synchronization.

Characteristics:

Threads within a process share memory and resources
Efficient information exchange through shared variables
Risk of race conditions and deadlocks if not properly managed
Lower overhead for thread creation and context switching than for processes

Languages with native parallel multi-threading support:

Rust
Go
C++
Java

These languages can achieve true parallelism via multi-threading, with multiple threads executing simultaneously on different cores.

When to use multi-threading:

Fine-grained parallelism requiring frequent data exchange
Tasks that benefit from shared state
When using languages that support thread-level parallelism without restrictions

Multi-Processing: Isolated Execution#

Multi-processing involves running multiple independent processes, each with its own memory space. In Python, each process also runs its own interpreter instance.

Characteristics:

Processes are isolated from each other by default
Communication requires explicit Inter-Process Communication (IPC) mechanisms
Avoids data races on in-process memory but adds communication overhead; races on shared external resources (files, databases) are still possible
Higher memory footprint (each process has its own memory)

When to use multi-processing:

CPU-intensive tasks in Python or other GIL-restricted languages
Embarrassingly parallel problems with minimal communication needs
When memory isolation is desirable for stability or security
Tasks that can tolerate IPC overhead

The Communication Trade-off#

The key distinction between these approaches lies in communication efficiency:

Multi-threading advantages:

Near-instantaneous data sharing through memory
Minimal overhead for communication
Ideal for tightly coupled problems

Multi-threading challenges:

Synchronization complexity
Potential for difficult-to-debug race conditions
Limited or restricted in some languages (for example, by the GIL in CPython)

Multi-processing advantages:

True parallelism regardless of language restrictions
Process isolation prevents cross-contamination
Well-suited for coarse-grained parallelism

Multi-processing challenges:

IPC overhead can be substantial
Higher memory consumption
More complex data sharing mechanisms
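
To get a feel for the IPC cost, the sketch below times one serialization round trip with `pickle`, which Python's `multiprocessing` uses by default to move objects between processes. The payload size and the helper name `roundtrip_cost` are illustrative assumptions, and absolute timings depend on the machine.

```python
import pickle
import time

def roundtrip_cost(payload):
    """Return (seconds, restored_copy) for one pickle round trip of payload."""
    start = time.perf_counter()
    blob = pickle.dumps(payload)      # roughly what the sending process does
    restored = pickle.loads(blob)     # roughly what the receiving process does
    return time.perf_counter() - start, restored

# Hypothetical payload: one million floats in a plain Python list
data = [float(i) for i in range(1_000_000)]

ipc_seconds, copy = roundtrip_cost(data)

start = time.perf_counter()
total = sum(data)                     # the "useful work" on the same data
work_seconds = time.perf_counter() - start

print(f"pickle round trip: {ipc_seconds:.4f}s, sum(): {work_seconds:.4f}s")
```

If the round trip costs about as much as the work itself, shipping this payload to a worker for so little computation is unlikely to pay off.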

Language-Specific Considerations#

Many languages offer abstractions that simplify multi-core parallelism, handling task lifecycle and communication automatically. However, for cluster or cloud environments, additional software infrastructure is required:

Workload managers: Systems like Slurm handle job scheduling and resource allocation on HPC clusters
Distributed computing frameworks: Tools like Dask, Ray, or Spark provide abstractions for managing parallelism across multiple machines
Orchestration platforms: Kubernetes and similar systems manage containerized workloads in cloud environments

Practical Guidance#

Identify coupling level: Determine if your problem is embarrassingly parallel, loosely coupled, or tightly coupled
Choose the appropriate approach:
- Embarrassingly parallel → Multi-processing or distributed execution
- Loosely coupled → Multi-processing with periodic synchronization
- Tightly coupled → Multi-threading (if language supports it) or single high-performance node
Start simple: Begin with the simplest parallel approach that addresses your needs. Avoid premature optimization of communication patterns.
Profile before scaling: Test your parallel implementation on a small scale before deploying to large clusters or cloud resources. Identify bottlenecks early.
Consider existing frameworks: Before implementing custom parallelization, investigate whether established libraries or frameworks already solve your problem (see language-specific resources below).

Communication Overhead in Practice#

The impact of IPC overhead varies dramatically with problem characteristics:

Negligible overhead scenarios (ideal for multi-processing):

Processing thousands of independent files
Parameter sweeps with no inter-task dependencies
Monte Carlo simulations
Batch processing of images or data samples

Significant overhead scenarios (consider multi-threading or reduce parallelization):

Iterative algorithms requiring frequent synchronization
Shared mutable state updated by all tasks
Fine-grained operations with minimal computation between communications
Real-time systems with strict latency requirements

Tip

The embarrassingly parallel litmus test: If you can describe your problem as “do the same thing to 1000 different inputs independently,” you likely have an embarrassingly parallel problem, which is an ideal candidate for multi-processing or distributed computing.
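
As a minimal sketch of that pattern, the example below maps one function over independent inputs with a process pool. The Monte Carlo workload and the names `estimate_pi` and `run_parallel` are hypothetical, chosen only for illustration; the sample count and pool size are arbitrary.

```python
import multiprocessing
import random

def estimate_pi(seed, samples=100_000):
    """Monte Carlo estimate of pi from one independent, seeded run."""
    rng = random.Random(seed)          # private generator: no shared state
    inside = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / samples

def run_parallel(seeds, processes=4):
    """Map estimate_pi over the seeds using a pool of worker processes."""
    with multiprocessing.Pool(processes) as pool:
        return pool.map(estimate_pi, seeds)

if __name__ == "__main__":
    # The guard keeps worker processes from re-running this block on import
    estimates = run_parallel(range(8))
    print(sum(estimates) / len(estimates))
```

Each task needs only its seed as input and returns a single float, so the communication cost is negligible compared with the computation.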

Parallelism in Python: Capabilities and Constraints#

Python presents a unique case in parallel computing. While it is the dominant language in data science and scientific computing, it has specific constraints that affect how parallelism can be achieved.

The Global Interpreter Lock (GIL)#

Python’s most common implementation, CPython, uses a mechanism called the Global Interpreter Lock (GIL) to keep the interpreter’s internal state (such as object reference counts) consistent when multiple threads are running.

Key implications of the GIL:

Multiple threads can exist within a Python process and run concurrently
However, only one thread can execute Python bytecode at any given instant
This protects the interpreter’s internals but limits parallel execution within a single process; it does not make application code thread-safe, so shared data still needs locks
CPU-bound pure-Python tasks see little to no speedup from multi-threading
I/O-bound tasks can benefit significantly because the GIL is released during I/O waits
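
Because threads share memory, updates to shared state are normally guarded explicitly, with or without the GIL. The sketch below increments a shared counter from several threads under a `threading.Lock`; the function name `locked_increment` and its parameters are illustrative, not part of any library.

```python
import threading

def locked_increment(n_threads=4, n_increments=50_000):
    """Increment a shared counter from several threads under a lock."""
    counter = 0
    lock = threading.Lock()

    def worker():
        nonlocal counter
        for _ in range(n_increments):
            # counter += 1 is a read-modify-write sequence, so guard it
            with lock:
                counter += 1

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter

print(locked_increment())  # 4 threads x 50,000 increments
```

With the lock in place the result is always `n_threads * n_increments`. Without it, the interpreter may switch threads between the read and the write, and updates can be lost.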

Threading vs. Multi-Processing in Python#

Python provides different libraries for different parallelization needs:

For concurrency (task switching):

threading: Standard-library interface for running multiple threads in one process
asyncio: Standard-library cooperative multitasking for asynchronous I/O operations
aiohttp: Third-party asynchronous HTTP client/server framework built on asyncio

For true parallelism (simultaneous execution):

multiprocessing: Bypasses the GIL by spawning separate processes, each with its own Python interpreter
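
To illustrate the concurrency side, here is a small `asyncio` sketch in which four simulated I/O waits overlap instead of running back to back. `asyncio.sleep` stands in for a real network or disk call, and the names `fake_request` and `fetch_all` are hypothetical.

```python
import asyncio
import time

async def fake_request(name, delay):
    """Stand-in for an I/O call; asyncio.sleep yields control while waiting."""
    await asyncio.sleep(delay)
    return name

async def fetch_all(names, delay=0.2):
    """Start all fake requests at once and wait for every result."""
    return await asyncio.gather(*(fake_request(n, delay) for n in names))

start = time.perf_counter()
results = asyncio.run(fetch_all(["a", "b", "c", "d"]))
elapsed = time.perf_counter() - start

print(results, f"{elapsed:.2f}s")
```

Run one after another, the four waits would take about 0.8 s. Because they overlap, the total stays close to a single 0.2 s wait, even though only one thread is involved.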

Experimental Comparison: Threading vs. Multi-Processing#

The following code demonstrates the performance difference between threading and multi-processing for a CPU-bound task:

import time
import threading
import multiprocessing 
import numpy as np

def report_work(t_start, array, target_index:int=1):
    """
    Function repeatedly adding the moment of its execution to a binned array.
    """
    if t_start is None:
        t_start = time.perf_counter_ns()
    dt = array[1,0] - array[0, 0]
    while True:
        since_start = time.perf_counter_ns() - t_start
        i = int(since_start/dt)
        if i >= len(array):
            break
        array[i, target_index] += 1
    return array

def main(nbr_intervals:int=1000):
    """
    Run workload in a single process with a single thread
    """
    t_start = time.perf_counter_ns()
    array = np.zeros((nbr_intervals, 2), dtype=int)
    array[:, 0] = 50000 * np.arange(1, nbr_intervals + 1)
    report_work(t_start, array)
    return array

def multi_threading_main(nbr_intervals:int=1000, nbr_threads:int=4):
    """
    Run the work using multiple threads in a single process
    """
    t_start = time.perf_counter_ns()
    array = np.zeros((nbr_intervals, nbr_threads + 1), dtype=int)
    array[:, 0] = 50000 * np.arange(1, nbr_intervals + 1)
    threads = []
    for i in range(nbr_threads):
        threads.append(
            threading.Thread(target=report_work,
                             args=(t_start, array, i + 1))
        )
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return array

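def multi_processed_main(nbr_intervals:int=1000, nbr_processes:int=4):
    """
    Run the work using multiple independent processes
    """
    array = np.zeros((nbr_intervals, 2), dtype=int)
    array[:, 0] = 50000 * np.arange(1, nbr_intervals + 1)
    # Use a local 'spawn' context: set_start_method() raises RuntimeError
    # if it is called more than once in the same program.
    ctx = multiprocessing.get_context('spawn')
    with ctx.Pool(nbr_processes) as pool:
        # t_start=None: each worker starts its own clock, so process
        # start-up time is not counted against it.
        arrays = pool.starmap(report_work,
                              [(None, np.copy(array), 1)
                               for _ in range(nbr_processes)])
    return arrays

if __name__ == "__main__":
    # The guard is required with 'spawn': workers re-import this module.
    print("single:   ", main()[:, 1:].sum())
    print("threads:  ", multi_threading_main()[:, 1:].sum())
    print("processes:", sum(a[:, 1].sum() for a in multi_processed_main()))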

Key observations:

Multi-threaded performance: The threads together perform roughly the same total number of operations as the single-threaded run. This is the GIL in action: only one thread executes Python bytecode at a time, so CPU-bound work is not sped up.
Multi-process performance: With 4 processes on a machine with at least 4 free cores, the total is approximately 4x the single-process count, showing true parallel execution on different cores.
Memory considerations: The multi-threaded version shares one array object, with each thread writing to its own column. The multi-process version gives each process its own copy, because arguments are pickled and sent to the workers and processes do not share memory by default.

Python’s Dual Nature: Orchestration vs. Execution#

This raises an important question: Why is Python the dominant language for computationally intensive data science if it’s “slow”?

The answer: Python is not doing the heavy computation.

Python excels as an orchestration layer for high-performance compiled code. Key libraries are essentially wrappers:

numpy and pandas: Interfaces to optimized C/C++ code
scipy: Fortran and C implementations
scikit-learn, pytorch, tensorflow: Optimized C++/CUDA kernels

When you call np.dot(), you’re not running Python code for the computation. Instead:

Python passes control to compiled C code (NumPy, which delegates to a BLAS library)
NumPy releases the GIL for the duration of the call
The BLAS library, if it is a multi-threaded build such as OpenBLAS or MKL, spawns threads and can use all available cores
Computation completes, control returns to Python
The GIL is re-acquired

Demonstration:

import numpy as np

size = 4000
arr1 = np.random.rand(size, size)
arr2 = np.random.rand(size, size)

for _ in range(100):
    np.dot(arr1, arr2)  # Single Python call, multi-core execution

On a machine where NumPy is linked against a multi-threaded BLAS, monitoring with htop or btop shows most or all CPU cores busy, despite running a single Python process.
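
The same GIL-release behavior can be observed with the standard library alone. The sketch below swaps NumPy for `hashlib`, which releases the GIL while hashing large buffers, so plain Python threads can overlap the work. The buffer sizes and the function names `digest`, `hash_sequential`, and `hash_threaded` are assumptions for illustration, and any speedup depends on the number of free cores.

```python
import hashlib
import threading
import time

# Hypothetical workload: four 16 MiB buffers to hash
buffers = [bytes([i]) * (16 * 1024 * 1024) for i in range(4)]

def digest(buf):
    """SHA-256 of one buffer; the hashing runs in C with the GIL released."""
    return hashlib.sha256(buf).hexdigest()

def hash_sequential(bufs):
    """Hash the buffers one after another in the calling thread."""
    return [digest(b) for b in bufs]

def hash_threaded(bufs):
    """Hash each buffer in its own thread."""
    results = [None] * len(bufs)

    def worker(i):
        results[i] = digest(bufs[i])   # each thread writes its own slot

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(len(bufs))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

for label, fn in [("sequential", hash_sequential), ("threaded", hash_threaded)]:
    start = time.perf_counter()
    fn(buffers)
    print(f"{label}: {time.perf_counter() - start:.3f}s")
```

On a multi-core machine the threaded run is typically faster, which would not happen if the threads were executing pure Python bytecode.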

The Cost of Crossing the Python/C Boundary: Data Marshaling#

Using Python as a facade for compiled code introduces a cost: data marshaling, that is, converting between Python objects and C-compatible data structures.

Benchmark demonstration:

import numpy as np
import time

def timeit(func, iterations=100, *args, **kwargs):
    start_time = time.perf_counter()
    for _ in range(iterations):
        func(*args, **kwargs)
    end_time = time.perf_counter()
    print(f"{iterations} calls: {round(end_time - start_time, 3)}s")

def numpy_sum(np_array):
    return np.sum(np_array)

def python_sum(collection):
    return sum(collection)

def python_loop(collection):
    result = 0
    for num in collection:
        result += num
    return result

# Create test data
np_array = np.random.rand(1000000)
py_list = np_array.tolist()

print("NumPy array:")
timeit(numpy_sum, 100, np_array)        # ~0.04s - pure C loop
timeit(python_sum, 100, np_array)       # ~6.1s - built-in sum, boxes every element
timeit(python_loop, 100, np_array)      # ~6.9s - Python loop, boxes every element

print("Python list:")
timeit(python_sum, 100, py_list)        # ~0.75s - Python built-in
timeit(python_loop, 100, py_list)       # ~2.1s - pure Python loop

Key findings:

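NumPy sum is far faster: In the timings above, np.sum is roughly 20x faster than the built-in sum() over a Python list and about 150x faster than sum() over the NumPy array, because the loop runs entirely in C with minimal Python interaction
Worst case: Python loop over NumPy array: roughly 170x slower than np.sum and about 3x slower than the same loop over a Python list
Marshaling penalty: Each element access on the array converts a C value into a new Python object, creating massive overhead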

Warning

Anti-pattern: Iterating over NumPy arrays with Python loops defeats the purpose of using NumPy. Always use vectorized operations when possible.
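
A minimal sketch of the difference, using a made-up scaling operation: `scale_loop` and `scale_vectorized` are hypothetical helpers that compute the same result, the first element by element in Python and the second as a single array expression.

```python
import numpy as np

def scale_loop(values, factor):
    """Anti-pattern: element-by-element work in a Python loop."""
    out = np.empty_like(values)
    for i in range(len(values)):
        out[i] = values[i] * factor + 1.0
    return out

def scale_vectorized(values, factor):
    """Same arithmetic as one array expression evaluated in compiled code."""
    return values * factor + 1.0

values = np.linspace(0.0, 1.0, 10_000)
assert np.allclose(scale_loop(values, 2.0), scale_vectorized(values, 2.0))
```

Both functions return the same array; only the vectorized version keeps the loop inside NumPy's compiled code.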

Practical Guidelines for Python Parallelism#

For CPU-bound tasks: Use multiprocessing to bypass the GIL
For I/O-bound tasks: Use threading or asyncio to overlap wait times
Leverage vectorization: Use NumPy/Pandas operations instead of Python loops
Don’t parallelize already-parallel code: BLAS-backed NumPy operations such as np.dot may already use all cores; adding multiprocessing on top oversubscribes the CPU and creates contention
Profile first: Use profiling tools to identify actual bottlenecks before parallelizing
Consider established frameworks: For complex parallel workflows, use specialized libraries:
- dask: Parallel computing with task scheduling
- joblib: Simple parallelization and caching
- ray: Distributed computing framework
- pyspark: Big data processing
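
For the "profile first" guideline, the standard library's `cProfile` is often enough to find the hot spot. The sketch below profiles a toy pipeline; `slow_part`, `fast_part`, and `pipeline` are hypothetical stand-ins for real application code.

```python
import cProfile
import io
import pstats

def slow_part(n):
    """Deliberately heavy pure-Python loop."""
    total = 0
    for i in range(n):
        total += i * i
    return total

def fast_part(n):
    """Cheap by comparison."""
    return n * 2

def pipeline():
    return slow_part(200_000) + fast_part(10)

profiler = cProfile.Profile()
profiler.enable()
result = pipeline()
profiler.disable()

# Sort by cumulative time and keep the top entries
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
print(report)
```

The report lists `slow_part` with nearly all of the cumulative time, which identifies it as the function worth vectorizing or parallelizing.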

When to Avoid Parallelization in Python#

When the task is already vectorized (NumPy, Pandas operations)
When the cost of moving data between processes (serialization and IPC) exceeds the computation time
For tightly coupled problems requiring frequent synchronization
When the task completes quickly enough sequentially

Tip

Golden Rule: Make your sequential code as efficient as possible (using vectorization, compiled libraries) before attempting parallelization. Often, proper use of NumPy eliminates the need for manual parallel coding.