Primer On Parallelism#
Parallelism lets you handle resource-intensive and data-heavy computations. This section covers the basics: what parallelism is, where it happens, and how to implement it (in Python).
Implementing Parallelism: Multi-Threading vs. Multi-Processing#
Different programming approaches enable parallelism at different architectural levels. Understanding the distinction between multi-threading and multi-processing is crucial for implementing efficient parallel solutions.
Multi-Processing: Isolated Execution#
Multi-processing involves running multiple independent processes, each with its own memory space and Python interpreter instance.
Characteristics:
Processes are completely isolated from each other
Communication requires explicit Inter-Process Communication (IPC) mechanisms
Avoids race conditions on in-process memory, since nothing is shared implicitly, but adds communication overhead
Higher memory footprint (each process has its own memory)
When to use multi-processing:
CPU-intensive tasks in Python or other GIL-restricted languages
Embarrassingly parallel problems with minimal communication needs
When memory isolation is desirable for stability or security
Tasks that can tolerate IPC overhead
The Communication Trade-off#
The key distinction between these approaches lies in communication efficiency:
Multi-threading advantages:
Near-instantaneous data sharing through memory
Minimal overhead for communication
Ideal for tightly coupled problems
Multi-threading challenges:
Synchronization complexity
Potential for difficult-to-debug race conditions
Limited or restricted in some languages (Python’s GIL)
Multi-processing advantages:
True parallelism regardless of language restrictions
Process isolation prevents cross-contamination
Well-suited for coarse-grained parallelism
Multi-processing challenges:
IPC overhead can be substantial
Higher memory consumption
More complex data sharing mechanisms
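The synchronization burden listed under multi-threading challenges can be made concrete with a shared counter. The sketch below assumes a hypothetical helper, `count_with_lock`; it shows that threads share memory directly, and that a read-modify-write on shared state should be guarded by a lock.

```python
import threading


def count_with_lock(n_threads=4, increments=10_000):
    """Increment one shared counter from several threads, guarded by a lock."""
    total = 0
    lock = threading.Lock()

    def work():
        nonlocal total
        for _ in range(increments):
            # `total += 1` is a read-modify-write, not an atomic step;
            # holding the lock keeps two threads from interleaving inside it.
            with lock:
                total += 1

    threads = [threading.Thread(target=work) for _ in range(n_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return total


if __name__ == "__main__":
    print(count_with_lock())
```

Sharing `total` costs nothing in communication, but every access now pays for the lock. That is the trade-off against multi-processing, where no lock is needed but data must be sent explicitly.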
Language-Specific Considerations#
Many languages offer abstractions that simplify multi-core parallelism, handling task lifecycle and communication automatically. However, for cluster or cloud environments, additional software infrastructure is required:
Workload managers: Systems like Slurm handle job scheduling and resource allocation on HPC clusters
Distributed computing frameworks: Tools like Dask, Ray, or Spark provide abstractions for managing parallelism across multiple machines
Orchestration platforms: Kubernetes and similar systems manage containerized workloads in cloud environments
Practical Guidance#
Identify coupling level: Determine if your problem is embarrassingly parallel, loosely coupled, or tightly coupled
Choose the appropriate approach:
Embarrassingly parallel → Multi-processing or distributed execution
Loosely coupled → Multi-processing with periodic synchronization
Tightly coupled → Multi-threading (if language supports it) or single high-performance node
Start simple: Begin with the simplest parallel approach that addresses your needs. Avoid premature optimization of communication patterns.
Profile before scaling: Test your parallel implementation on a small scale before deploying to large clusters or cloud resources. Identify bottlenecks early.
Consider existing frameworks: Before implementing custom parallelization, investigate whether established libraries or frameworks already solve your problem (see language-specific resources below).
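For the "profile before scaling" step, Python's standard `cProfile` and `pstats` modules are often enough to find the hot spot on a small input. The sketch below uses a hypothetical workload (`pipeline`, `slow_square_sum`) purely for illustration; `profile_report` is an assumed helper name, not part of any library.

```python
import cProfile
import io
import pstats


def slow_square_sum(n):
    """Hypothetical hot spot: a pure-Python loop."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def pipeline(n=200_000):
    """Hypothetical workload with one cheap step and one expensive step."""
    labels = [str(i) for i in range(1000)]   # cheap
    return len(labels), slow_square_sum(n)   # expensive


def profile_report(func, *args, top=10):
    """Run func under cProfile; return a text report sorted by cumulative time."""
    profiler = cProfile.Profile()
    profiler.enable()
    func(*args)
    profiler.disable()
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return buffer.getvalue()


if __name__ == "__main__":
    print(profile_report(pipeline))
```

If one function dominates the cumulative time, that function is the candidate for vectorization or parallelization; the rest can stay sequential.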
Communication Overhead in Practice#
The impact of IPC overhead varies dramatically with problem characteristics:
Negligible overhead scenarios (ideal for multi-processing):
Processing thousands of independent files
Parameter sweeps with no inter-task dependencies
Monte Carlo simulations
Batch processing of images or data samples
Significant overhead scenarios (consider multi-threading or reduce parallelization):
Iterative algorithms requiring frequent synchronization
Shared mutable state updated by all tasks
Fine-grained operations with minimal computation between communications
Real-time systems with strict latency requirements
Tip
The embarrassingly parallel litmus test: If you can describe your problem as “do the same thing to 1000 different inputs independently,” you likely have an embarrassingly parallel problem—an ideal candidate for multi-processing or distributed computing.
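A Monte Carlo estimate of pi is one way to sketch the litmus test: every task does the same thing to a different seed and needs nothing from the others. The function names and the sample counts below are illustrative assumptions; `ProcessPoolExecutor` is the standard-library process pool.

```python
import random
from concurrent.futures import ProcessPoolExecutor


def count_hits(seed, samples=100_000):
    """Count random points in the unit square that land inside the quarter circle."""
    rng = random.Random(seed)  # private generator: no state shared between tasks
    hits = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits


def estimate_pi(n_tasks=8, samples=100_000, max_workers=4):
    """Run independent tasks in worker processes and combine the results once."""
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        hits = pool.map(count_hits, range(n_tasks), [samples] * n_tasks)
        total_hits = sum(hits)
    return 4.0 * total_hits / (n_tasks * samples)


if __name__ == "__main__":
    print(f"pi is roughly {estimate_pi():.3f}")
```

Each task sends one integer seed in and one integer count out, so IPC overhead is negligible compared with the computation.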
Parallelism in Python: Capabilities and Constraints#
Python presents a unique case in parallel computing. While it is the dominant language in data science and scientific computing, it has specific constraints that affect how parallelism can be achieved.
The Global Interpreter Lock (GIL)#
Python’s most common implementation, CPython, uses a mechanism called the Global Interpreter Lock (GIL) to ensure thread safety.
Key implications of the GIL:
Multiple threads can exist within a Python process and run concurrently
However, only one thread can execute Python bytecode at any given instant
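This protects the interpreter's internal state but limits parallel execution within a single process; application code that shares data between threads can still contain race conditions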
CPU-bound tasks see little to no speedup from multi-threading
I/O-bound tasks can benefit significantly because the GIL is released during I/O waits
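The I/O-bound case can be sketched with `time.sleep`, which releases the GIL in the same way a blocking network or disk call does. `fake_download` is a hypothetical stand-in for real I/O, and the delays are arbitrary.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_download(delay):
    """Stand-in for a network or disk wait; time.sleep releases the GIL."""
    time.sleep(delay)
    return delay


def sequential(delays):
    """Wait for each task in turn."""
    return [fake_download(d) for d in delays]


def threaded(delays):
    """Give each task its own thread so the waits overlap."""
    with ThreadPoolExecutor(max_workers=len(delays)) as pool:
        return list(pool.map(fake_download, delays))


def timed(func, *args):
    """Return (result, elapsed seconds) for a single call."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    delays = [0.2] * 4
    print("sequential:", round(timed(sequential, delays)[1], 2), "s")
    print("threaded:  ", round(timed(threaded, delays)[1], 2), "s")
```

The threaded run should take roughly as long as the single longest wait, while the sequential run takes the sum of all waits. Replacing the sleep with a CPU-bound loop would remove most of that gain.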
Threading vs. Multi-Processing in Python#
Python provides different libraries for different parallelization needs:
For concurrency (task switching):
threading: Standard interface for multi-threaded processes
asyncio: Cooperative multitasking for asynchronous I/O operations
aiohttp: Asynchronous HTTP client/server framework
For true parallelism (simultaneous execution):
multiprocessing: Bypass the GIL by spawning separate processes, each with its own Python interpreter
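As a minimal sketch of the cooperative style that `asyncio` offers, the example below runs several waits on a single thread. `fetch` is a hypothetical coroutine in which `asyncio.sleep` stands in for a real network call.

```python
import asyncio
import time


async def fetch(name, delay):
    """Hypothetical I/O task; asyncio.sleep stands in for a network wait."""
    await asyncio.sleep(delay)  # hands control back to the event loop
    return name


async def fetch_all(jobs):
    """Start all jobs and wait for them together; results keep the input order."""
    return await asyncio.gather(*(fetch(name, delay) for name, delay in jobs))


def run(jobs):
    """Run the event loop once; return (results, elapsed seconds)."""
    start = time.perf_counter()
    results = asyncio.run(fetch_all(jobs))
    return results, time.perf_counter() - start


if __name__ == "__main__":
    print(run([("a", 0.2), ("b", 0.1), ("c", 0.2)]))
```

Only one coroutine runs at any instant, so this is concurrency rather than parallelism. It helps when tasks spend most of their time waiting, and it does nothing for CPU-bound work.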
Experimental Comparison: Threading vs. Multi-Processing#
The following code demonstrates the performance difference between threading and multi-processing for a CPU-bound task:
import time
import threading
import multiprocessing
import numpy as np
def report_work(t_start, array, target_index:int=1):
    """
    Function repeatedly adding the moment of its execution to a binned array.
    """
    if t_start is None:
        t_start = time.perf_counter_ns()
    # Width of one time bin in nanoseconds
    dt = array[1, 0] - array[0, 0]
    while True:
        since_start = time.perf_counter_ns() - t_start
        i = int(since_start/dt)
        if i >= len(array):
            break
        array[i, target_index] += 1
    return array
def main(nbr_intervals:int=1000):
    """
    Run workload in a single process with a single thread
    """
    t_start = time.perf_counter_ns()
    array = np.zeros((nbr_intervals, 2), dtype=int)
    array[:, 0] = 50000 * np.arange(1, nbr_intervals + 1)
    report_work(t_start, array)
    return array
def multi_threading_main(nbr_intervals:int=1000, nbr_threads:int=4):
    """
    Run the work using multiple threads in a single process
    """
    t_start = time.perf_counter_ns()
    array = np.zeros((nbr_intervals, nbr_threads + 1), dtype=int)
    array[:, 0] = 50000 * np.arange(1, nbr_intervals + 1)
    threads = []
    for i in range(nbr_threads):
        # Every thread writes to its own column of the shared array
        threads.append(
            threading.Thread(target=report_work,
                             args=(t_start, array, i + 1))
        )
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return array
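def multi_processed_main(nbr_intervals:int=1000, nbr_processes:int=4):
    """
    Run the work using multiple independent processes.
    With the 'spawn' start method, call this from under an
    `if __name__ == "__main__":` guard.
    """
    array = np.zeros((nbr_intervals, 2), dtype=int)
    array[:, 0] = 50000 * np.arange(1, nbr_intervals + 1)
    # get_context() avoids the RuntimeError that set_start_method() raises
    # when it is called a second time in the same interpreter
    ctx = multiprocessing.get_context('spawn')
    with ctx.Pool(nbr_processes) as pool:
        # Each worker receives its own copy of the array and its own start time
        arrays = pool.starmap(report_work,
                              [(None, np.copy(array), 1)
                               for _ in range(nbr_processes)])
    return arrays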
Key observations:
Multi-threaded performance: Performs roughly the same total operations as single-process execution. This demonstrates the GIL in action—only one thread executes at a time for CPU-bound work.
Multi-process performance: Performs approximately 4x the operations (with 4 processes), showing true parallel execution on different cores.
Memory considerations: The multi-threaded version shares the same array object across all threads. The multi-process version requires each process to receive its own copy, since processes do not share memory by default.
Python’s Dual Nature: Orchestration vs. Execution#
This raises an important question: Why is Python the dominant language for computationally intensive data science if it’s “slow”?
The answer: Python is not doing the heavy computation.
Python excels as an orchestration layer for high-performance compiled code. Key libraries are essentially wrappers:
numpy and pandas: Interfaces to optimized C/C++ code
scipy: Fortran and C implementations
scikit-learn, pytorch, tensorflow: Optimized C++/CUDA kernels
When you call np.dot(), you’re not running Python code for the computation. Instead:
Python passes control to compiled C code (NumPy/BLAS)
Python releases the GIL
The BLAS library, if built with threading support, runs the computation on multiple threads and can saturate all available cores
Computation completes, control returns to Python
Python re-acquires the GIL
Demonstration:
import numpy as np
size = 4000
arr1 = np.random.rand(size, size)
arr2 = np.random.rand(size, size)
for _ in range(100):
    np.dot(arr1, arr2)  # Single Python call, multi-core execution
Monitoring with htop or btop typically shows most or all CPU cores near 100% utilization (assuming NumPy is linked against a multi-threaded BLAS), despite running a single Python process.
The Cost of Context Switching: Data Marshaling#
Using Python as a facade for compiled code introduces a cost: data marshaling—converting between Python objects and C-compatible data structures.
Benchmark demonstration:
import numpy as np
import time
def timeit(func, iterations=100, *args, **kwargs):
    # Time `iterations` calls of func(*args, **kwargs) and print the total
    start_time = time.perf_counter()
    for _ in range(iterations):
        func(*args, **kwargs)
    end_time = time.perf_counter()
    print(f"{iterations} calls: {round(end_time - start_time, 3)}s")

def numpy_sum(np_array):
    return np.sum(np_array)

def python_sum(collection):
    return sum(collection)

def python_loop(collection):
    result = 0
    for num in collection:
        result += num
    return result
# Create test data
np_array = np.random.rand(1000000)
py_list = np_array.tolist()
print("NumPy array:")
timeit(numpy_sum, 100, np_array) # ~0.04s - pure C loop
timeit(python_sum, 100, np_array) # ~6.1s - Python iteration
timeit(python_loop, 100, np_array) # ~6.9s - marshaling overhead!
print("Python list:")
timeit(python_sum, 100, py_list) # ~0.75s - Python built-in
timeit(python_loop, 100, py_list) # ~2.1s - pure Python loop
Key findings:
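NumPy sum is fastest: np.sum (~0.04s) is roughly 20x faster than the built-in sum over a Python list (~0.75s) and over 100x faster than iterating over the array from Python (~6.1s), because the loop runs entirely in C with minimal Python interaction
Worst case: Python loop over NumPy array: over 100x slower than np.sum (~6.9s vs. ~0.04s) and about 3x slower than the same loop over a Python list (~2.1s)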
Marshaling penalty: Each array element access converts from C type to Python object, creating massive overhead
Warning
Anti-pattern: Iterating over NumPy arrays with Python loops defeats the purpose of using NumPy. Always use vectorized operations when possible.
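The per-element conversion behind this penalty can be observed without NumPy. The sketch below uses the standard-library `array` module as a stand-in for a NumPy buffer, since both store raw C values rather than Python objects. It assumes CPython; the helper name `boxed_on_each_access` is hypothetical.

```python
from array import array


def boxed_on_each_access(buffer, index=0):
    """Return True if two reads of the same element give distinct Python objects."""
    first = buffer[index]
    second = buffer[index]
    return first is not second


values = [1.5, 2.5, 3.5]
as_list = list(values)         # stores references to existing float objects
as_array = array("d", values)  # stores raw 8-byte C doubles

if __name__ == "__main__":
    print("list creates an object per read: ", boxed_on_each_access(as_list))
    print("array creates an object per read:", boxed_on_each_access(as_array))
```

A list hands back the object it already holds, whereas the typed buffer has to build a new Python float on every read. A Python loop over a million-element array therefore pays that construction cost a million times, which a vectorized call avoids.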
Practical Guidelines for Python Parallelism#
For CPU-bound tasks: Use multiprocessing to bypass the GIL
For I/O-bound tasks: Use threading or asyncio to overlap wait times
Leverage vectorization: Use NumPy/Pandas operations instead of Python loops
Don’t parallelize already-parallel code: Libraries like NumPy may already use all cores for operations such as matrix products; adding multiprocessing on top oversubscribes the CPU and creates contention
Profile first: Use profiling tools to identify actual bottlenecks before parallelizing
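Consider established frameworks: For complex parallel workflows, use specialized libraries such as the distributed computing frameworks named earlier (Dask, Ray, or Spark) rather than hand-written process management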
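When worker processes call into a multi-threaded numerical library, one common way to avoid contention is to cap the library's thread count per worker. The sketch below assumes the BLAS/OpenMP backend in use honors the environment variables `OMP_NUM_THREADS`, `OPENBLAS_NUM_THREADS`, and `MKL_NUM_THREADS`, which depends on how NumPy was built. The helper names are hypothetical.

```python
import os

# Environment variables read by common BLAS/OpenMP backends when they load.
# Which of them takes effect depends on the backend NumPy is linked against.
BLAS_THREAD_VARS = ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS")


def limit_blas_threads(n_threads=1):
    """Cap library-level threading; call before NumPy is first imported."""
    for name in BLAS_THREAD_VARS:
        os.environ[name] = str(n_threads)


def threads_per_worker(n_workers, n_cores=None):
    """Split the available cores evenly between workers (at least one each)."""
    n_cores = n_cores or os.cpu_count() or 1
    return max(1, n_cores // n_workers)


if __name__ == "__main__":
    # Example: 4 worker processes sharing this machine's cores
    limit_blas_threads(threads_per_worker(4))
    print({name: os.environ[name] for name in BLAS_THREAD_VARS})
```

The variables are read when the library loads, so they should be set before the first `import numpy` in each process. Setting them afterwards generally has no effect.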
When to Avoid Parallelization in Python#
When the task is already vectorized (NumPy, Pandas operations)
When marshaling overhead exceeds computation time
For tightly coupled problems requiring frequent synchronization
When the task completes quickly enough sequentially
Tip
Golden Rule: Make your sequential code as efficient as possible (using vectorization, compiled libraries) before attempting parallelization. Often, proper use of NumPy eliminates the need for manual parallel coding.