Primer on Parallelism#

Parallelism lets you speed up resource-intensive and data-heavy computations by performing multiple pieces of work at the same time. This section covers the basics: what parallelism is, where it happens, and how to implement it (in Python).

Implementing Parallelism: Multi-Threading vs. Multi-Processing#

Different programming approaches enable parallelism at different architectural levels. Understanding the distinction between multi-threading and multi-processing is crucial for implementing efficient parallel solutions.

Multi-Threading: Shared Memory Parallelism#

Multi-threading involves running multiple threads within a single process. All threads share the same memory space, enabling efficient communication but requiring careful synchronization.
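
As a concrete illustration, here is a minimal Python sketch (a hypothetical example, not tied to any later listing) of two threads mutating shared state, with a lock guarding the read-modify-write that would otherwise race:

```python
import threading

# Two threads incrementing a shared counter; the Lock serializes the
# read-modify-write so no increment is lost.
counter = {"n": 0}
lock = threading.Lock()

def worker(iterations: int) -> None:
    for _ in range(iterations):
        with lock:  # remove this and increments can be silently lost
            counter["n"] += 1

threads = [threading.Thread(target=worker, args=(100_000,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter["n"])  # always 200000 with the lock held
```

Because both threads see the same `counter` object, no data needs to be copied between them; the lock is the price paid for that shared access.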

Characteristics:

  • Threads within a process share memory and resources

  • Efficient information exchange through shared variables

  • Risk of race conditions and deadlocks if not properly managed

  • Lower overhead for thread creation and context switching

Languages with native parallel multi-threading support:

  • Rust

  • Go

  • C++

  • Java

These languages can achieve true parallelism via multi-threading, with multiple threads executing simultaneously on different cores.

When to use multi-threading:

  • Fine-grained parallelism requiring frequent data exchange

  • Tasks that benefit from shared state

  • When using languages that support thread-level parallelism without restrictions

Multi-Processing: Isolated Execution#

Multi-processing involves running multiple independent processes, each with its own memory space (and, in Python's case, its own interpreter instance).
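
The isolation can be demonstrated with a short sketch (hypothetical, standard library only): a mutation made inside a child process never reaches the parent's copy of the data.

```python
import multiprocessing

# The child receives its own copy of `data`; appending inside the child
# does not change the parent's list.
def child(data: list) -> None:
    data.append("from child")  # visible only inside the child process

if __name__ == "__main__":
    data = ["from parent"]
    p = multiprocessing.Process(target=child, args=(data,))
    p.start()
    p.join()
    print(data)  # ['from parent'] -- the child's append is invisible here
```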

Characteristics:

  • Processes are completely isolated from each other

  • Communication requires explicit Inter-Process Communication (IPC) mechanisms

  • Eliminates race conditions but adds communication overhead

  • Higher memory footprint (each process has its own memory)

When to use multi-processing:

  • CPU-intensive tasks in Python or other GIL-restricted languages

  • Embarrassingly parallel problems with minimal communication needs

  • When memory isolation is desirable for stability or security

  • Tasks that can tolerate IPC overhead

The Communication Trade-off#

The key distinction between these approaches lies in communication efficiency:

Multi-threading advantages:

  • Near-instantaneous data sharing through memory

  • Minimal overhead for communication

  • Ideal for tightly coupled problems

Multi-threading challenges:

  • Synchronization complexity

  • Potential for difficult-to-debug race conditions

  • Limited or restricted in some languages (Python’s GIL)

Multi-processing advantages:

  • True parallelism regardless of language restrictions

  • Process isolation prevents cross-contamination

  • Well-suited for coarse-grained parallelism

Multi-processing challenges:

  • IPC overhead can be substantial

  • Higher memory consumption

  • More complex data sharing mechanisms
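
To make the data-sharing cost concrete, here is a minimal sketch (a hypothetical example in Python, the language used later in this section) of explicit IPC through a `multiprocessing.Queue`: every item is pickled and copied between processes rather than shared in memory.

```python
import multiprocessing

# Explicit IPC: tasks and results travel through queues; each item is
# serialized and copied between the parent and the worker process.
def square_worker(task_queue, result_queue):
    while True:
        item = task_queue.get()
        if item is None:  # sentinel marking the end of the work
            break
        result_queue.put(item * item)

if __name__ == "__main__":
    tasks = multiprocessing.Queue()
    results = multiprocessing.Queue()
    worker = multiprocessing.Process(target=square_worker,
                                     args=(tasks, results))
    worker.start()
    for n in (1, 2, 3):
        tasks.put(n)
    tasks.put(None)
    worker.join()
    print(sorted(results.get() for _ in range(3)))  # [1, 4, 9]
```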

Language-Specific Considerations#

Many languages offer abstractions that simplify multi-core parallelism, handling task lifecycle and communication automatically. However, for cluster or cloud environments, additional software infrastructure is required:

  • Workload managers: Systems like Slurm handle job scheduling and resource allocation on HPC clusters

  • Distributed computing frameworks: Tools like Dask, Ray, or Spark provide abstractions for managing parallelism across multiple machines

  • Orchestration platforms: Kubernetes and similar systems manage containerized workloads in cloud environments

Practical Guidance#

  1. Identify coupling level: Determine if your problem is embarrassingly parallel, loosely coupled, or tightly coupled

  2. Choose the appropriate approach:

    • Embarrassingly parallel → Multi-processing or distributed execution

    • Loosely coupled → Multi-processing with periodic synchronization

    • Tightly coupled → Multi-threading (if language supports it) or single high-performance node

  3. Start simple: Begin with the simplest parallel approach that addresses your needs. Avoid premature optimization of communication patterns.

  4. Profile before scaling: Test your parallel implementation on a small scale before deploying to large clusters or cloud resources. Identify bottlenecks early.

  5. Consider existing frameworks: Before implementing custom parallelization, investigate whether established libraries or frameworks already solve your problem (see language-specific resources below).

Communication Overhead in Practice#

The impact of IPC overhead varies dramatically with problem characteristics:

Negligible overhead scenarios (ideal for multi-processing):

  • Processing thousands of independent files

  • Parameter sweeps with no inter-task dependencies

  • Monte Carlo simulations

  • Batch processing of images or data samples

Significant overhead scenarios (consider multi-threading or reduce parallelization):

  • Iterative algorithms requiring frequent synchronization

  • Shared mutable state updated by all tasks

  • Fine-grained operations with minimal computation between communications

  • Real-time systems with strict latency requirements

Tip

The embarrassingly parallel litmus test: If you can describe your problem as “do the same thing to 1000 different inputs independently,” you likely have an embarrassingly parallel problem—an ideal candidate for multi-processing or distributed computing.
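
In code, the litmus test maps directly onto a worker pool; the sketch below is a hypothetical example in which the worker function stands in for an expensive, independent computation:

```python
from multiprocessing import Pool

def process_input(x: int) -> int:
    return x * x  # placeholder for an expensive, independent computation

if __name__ == "__main__":
    # "Do the same thing to many inputs independently": map the worker
    # over the inputs with a pool of four processes
    with Pool(4) as pool:
        results = pool.map(process_input, range(10))
    print(results)  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```

`Pool.map` preserves input order, so the results line up with the inputs regardless of which process finished first.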

Parallelism in Python: Capabilities and Constraints#

Python presents a unique case in parallel computing. While it is the dominant language in data science and scientific computing, it has specific constraints that affect how parallelism can be achieved.

The Global Interpreter Lock (GIL)#

Python’s most common implementation, CPython, uses a mechanism called the Global Interpreter Lock (GIL) to ensure thread safety.

Key implications of the GIL:

  • Multiple threads can exist within a Python process and run concurrently

  • However, only one thread can execute Python bytecode at any given instant

  • This prevents race conditions but limits parallel execution within a single process

  • CPU-bound tasks see little to no speedup from multi-threading

  • I/O-bound tasks can benefit significantly because the GIL is released during I/O waits
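
The last point can be checked with a small sketch (timings are illustrative and machine-dependent): four threads blocked in `time.sleep`, which releases the GIL just as real I/O does, finish in roughly the time of one wait rather than four.

```python
import threading
import time

def fake_io_task() -> None:
    time.sleep(0.2)  # stands in for a network or disk wait

start = time.perf_counter()
threads = [threading.Thread(target=fake_io_task) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.perf_counter() - start
print(f"{elapsed:.2f}s")  # roughly 0.2s, not 0.8s: the waits overlap
```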

Threading vs. Multi-Processing in Python#

Python provides different libraries for different parallelization needs:

For concurrency (task switching):

  • threading: Standard-library interface for running multiple threads within a single process

  • asyncio: Cooperative multitasking for asynchronous I/O operations

  • aiohttp: Third-party asynchronous HTTP client/server framework built on asyncio
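
As a minimal sketch of the asyncio style (the task names are hypothetical placeholders), many coroutines can wait concurrently within a single thread:

```python
import asyncio

async def fetch(name: str) -> str:
    await asyncio.sleep(0.1)  # stands in for an awaited network call
    return f"{name}: done"

async def main() -> list:
    # gather schedules all coroutines and awaits them concurrently,
    # so three 0.1s waits take roughly 0.1s in total
    return await asyncio.gather(*(fetch(f"task-{i}") for i in range(3)))

results = asyncio.run(main())
print(results)  # ['task-0: done', 'task-1: done', 'task-2: done']
```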

For true parallelism (simultaneous execution):

  • multiprocessing: Bypass the GIL by spawning separate processes, each with its own Python interpreter

Experimental Comparison: Threading vs. Multi-Processing#

The following code demonstrates the performance difference between threading and multi-processing for a CPU-bound task:

import time
import threading
import multiprocessing
import numpy as np

def report_work(t_start, array, target_index: int = 1):
    """
    Repeatedly add a count to the time bin in `array` that matches the
    moment of execution, stopping once the last bin has passed.
    """
    if t_start is None:
        t_start = time.perf_counter_ns()
    dt = array[1, 0] - array[0, 0]
    while True:
        since_start = time.perf_counter_ns() - t_start
        i = int(since_start/dt)
        if i >= len(array):
            break
        array[i, target_index] += 1
    return array

def main(nbr_intervals: int = 1000):
    """
    Run workload in a single process with a single thread
    """
    t_start = time.perf_counter_ns()
    array = np.zeros((nbr_intervals, 2), dtype=int)
    array[:, 0] = 50000 * np.arange(1, nbr_intervals + 1)
    report_work(t_start, array)
    return array

def multi_threading_main(nbr_intervals: int = 1000, nbr_threads: int = 4):
    """
    Run the work using multiple threads in a single process
    """
    t_start = time.perf_counter_ns()
    array = np.zeros((nbr_intervals, nbr_threads + 1), dtype=int)
    array[:, 0] = 50000 * np.arange(1, nbr_intervals + 1)
    threads = []
    for i in range(nbr_threads):
        threads.append(
            threading.Thread(target=report_work,
                             args=(t_start, array, i + 1))
        )
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return array

def multi_processed_main(nbr_intervals: int = 1000, nbr_processes: int = 4):
    """
    Run the work using multiple independent processes
    """
    array = np.zeros((nbr_intervals, 2), dtype=int)
    array[:, 0] = 50000 * np.arange(1, nbr_intervals + 1)
    # A spawn context avoids the RuntimeError that set_start_method()
    # raises when the start method has already been set elsewhere
    ctx = multiprocessing.get_context('spawn')
    with ctx.Pool(nbr_processes) as pool:
        arrays = pool.starmap(report_work,
                              [(None, np.copy(array), 1)
                               for _ in range(nbr_processes)])
    return arrays
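
A self-contained, simplified variant of the same experiment (a hypothetical sketch using `concurrent.futures` rather than the listing above) makes the contrast easy to reproduce:

```python
import time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def count(n: int) -> int:
    # Pure-Python CPU-bound work: the GIL is held the whole time
    total = 0
    for i in range(n):
        total += i
    return total

def timed(executor_cls, workers: int = 4, n: int = 2_000_000) -> float:
    start = time.perf_counter()
    with executor_cls(workers) as ex:
        list(ex.map(count, [n] * workers))
    return time.perf_counter() - start

if __name__ == "__main__":
    # Threads: typically no faster than sequential for this workload.
    # Processes: roughly `workers`-fold faster on a multi-core machine.
    print(f"threads:   {timed(ThreadPoolExecutor):.2f}s")
    print(f"processes: {timed(ProcessPoolExecutor):.2f}s")
```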

Key observations:

  1. Multi-threaded performance: Performs roughly the same total number of operations as the single-threaded version. This demonstrates the GIL in action: only one thread executes at a time for CPU-bound work.

  2. Multi-process performance: Performs approximately 4x the operations (with 4 processes), showing true parallel execution on different cores.

  3. Memory considerations: Multi-threaded version shares the same array object. Multi-process version requires each process to receive its own copy since processes cannot directly share memory.

Python’s Dual Nature: Orchestration vs. Execution#

This raises an important question: Why is Python the dominant language for computationally intensive data science if it’s “slow”?

The answer: Python is not doing the heavy computation.

Python excels as an orchestration layer for high-performance compiled code. Key libraries are essentially wrappers:

  • numpy and pandas: Interfaces to optimized C/C++ code

  • scipy: Fortran and C implementations

  • scikit-learn, pytorch, tensorflow: Optimized C++/CUDA kernels

When you call np.dot(), you’re not running Python code for the computation. Instead:

  1. Python passes control to compiled C code (NumPy/BLAS)

  2. Python releases the GIL

  3. The C library can spawn multiple threads (e.g. through a multi-threaded BLAS) and use all available cores

  4. Computation completes, control returns to Python

  5. Python re-acquires the GIL

Demonstration:

import numpy as np

size = 4000
arr1 = np.random.rand(size, size)
arr2 = np.random.rand(size, size)

for _ in range(100):
    np.dot(arr1, arr2)  # Single Python call, multi-core execution

Monitoring with htop or btop reveals all CPU cores at 100% utilization, despite running a single Python process.

The Cost of Context Switching: Data Marshaling#

Using Python as a facade for compiled code introduces a cost: data marshaling—converting between Python objects and C-compatible data structures.

Benchmark demonstration:

import numpy as np
import time

def timeit(func, iterations=100, *args, **kwargs):
    start_time = time.perf_counter()
    for _ in range(iterations):
        func(*args, **kwargs)
    end_time = time.perf_counter()
    print(f"{iterations} calls: {round(end_time - start_time, 3)}s")

def numpy_sum(np_array):
    return np.sum(np_array)

def python_sum(collection):
    return sum(collection)

def python_loop(collection):
    result = 0
    for num in collection:
        result += num
    return result

# Create test data
np_array = np.random.rand(1000000)
py_list = np_array.tolist()

print("NumPy array:")
timeit(numpy_sum, 100, np_array)        # ~0.04s - pure C loop
timeit(python_sum, 100, np_array)       # ~6.1s - Python iteration
timeit(python_loop, 100, np_array)      # ~6.9s - marshaling overhead!

print("Python list:")
timeit(python_sum, 100, py_list)        # ~0.75s - Python built-in
timeit(python_loop, 100, py_list)       # ~2.1s - pure Python loop

Key findings:

  1. NumPy sum is roughly 100x faster than Python-level iteration: The loop runs entirely in C with minimal Python interaction

  2. Worst case: Python loop over NumPy array: 100x slower than pure NumPy, 3x slower than Python list loop

  3. Marshaling penalty: Each array element access converts from C type to Python object, creating massive overhead

Warning

Anti-pattern: Iterating over NumPy arrays with Python loops defeats the purpose of using NumPy. Always use vectorized operations when possible.
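
Side by side, the anti-pattern and its vectorized replacement (a minimal sketch):

```python
import numpy as np

data = np.random.rand(1_000_000)

# Anti-pattern: a Python loop forces per-element marshaling from
# C doubles to Python float objects
total_slow = 0.0
for x in data:
    total_slow += x

# Fix: the same reduction as a single vectorized call running in C
total_fast = np.sum(data)

assert np.isclose(total_slow, total_fast)
```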

Practical Guidelines for Python Parallelism#

  1. For CPU-bound tasks: Use multiprocessing to bypass the GIL

  2. For I/O-bound tasks: Use threading or asyncio to overlap wait times

  3. Leverage vectorization: Use NumPy/Pandas operations instead of Python loops

  4. Don’t parallelize already-parallel code: Libraries like NumPy already use all cores; adding multiprocessing creates contention

  5. Profile first: Use profiling tools to identify actual bottlenecks before parallelizing

  6. Consider established frameworks: For complex parallel workflows, use specialized libraries:

    • dask: Parallel computing with task scheduling

    • joblib: Simple parallelization and caching

    • ray: Distributed computing framework

    • pyspark: Big data processing
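
As one example of reaching for an established framework instead of a hand-rolled pool, joblib reduces the embarrassingly parallel pattern to a single expression (a minimal sketch, assuming joblib is installed; the worker function is a hypothetical placeholder):

```python
from joblib import Parallel, delayed

def expensive(x: int) -> int:
    return x * x  # placeholder for real per-item work

# n_jobs=2 runs the calls in two worker processes; joblib handles
# process startup, task dispatch, and ordered result collection
results = Parallel(n_jobs=2)(delayed(expensive)(i) for i in range(5))
print(results)  # [0, 1, 4, 9, 16]
```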

When to Avoid Parallelization in Python#

  • When the task is already vectorized (NumPy, Pandas operations)

  • When marshaling overhead exceeds computation time

  • For tightly coupled problems requiring frequent synchronization

  • When the task completes quickly enough sequentially

Tip

Golden Rule: Make your sequential code as efficient as possible (using vectorization, compiled libraries) before attempting parallelization. Often, proper use of NumPy eliminates the need for manual parallel coding.