Python Learning
Python learning resources

Beginner to advanced Python guides, cheatsheets, books and projects.

For data science, backend and automation.
Join 👉 https://rebrand.ly/bigdatachannels

DMCA: @disclosure_bds
Contact: @mldatascientist
🏎 Python Performance: Vectorization vs. Loops in Pandas 🐼

In standard Python, we are taught to use for loops to process data. In Data Science, however, loops are the enemy of speed. If you use a loop to process a million rows in a Pandas DataFrame, your code can easily be around 100x slower than it needs to be, and sometimes far more, depending on the operation.

👉 To be a pro Data Analyst, you must stop "looping" and start "vectorizing."


🐢 The Slow Way: Iterating with Loops
Python is an interpreted, dynamically typed language, so every pass through a loop carries overhead. For each row, the interpreter has to check the value's type, look up the right operation, and create a new Python object for the result. Repeated a million times, that overhead costs far more than the math itself.

🚀 The Fast Way: Vectorization
Pandas (and NumPy) use vectorization, which performs operations on entire arrays (columns) at once. This pushes the loop down into optimized, precompiled code (mostly C), where it runs over a typed array without per-element interpreter overhead.
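
A minimal sketch of what "at once" means in practice, assuming a tiny made-up price column (the values are hypothetical, for illustration only). The column is backed by a single typed NumPy array, so the data type is known once for the whole column instead of being checked per element:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, for illustration only
prices = pd.Series([10, 25, 40], name="price")

# The column is stored as one NumPy array with a single dtype
values = prices.to_numpy()
print(type(values).__name__, values.dtype)

# One expression operates on the whole column; no Python-level loop is written
tax = prices * 0.1

# The element-by-element version produces the same numbers
tax_loop = [p * 0.1 for p in prices]
print(np.allclose(tax, tax_loop))
```

Both versions compute the same result; the difference is only where the loop runs.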


💻 The "Speed Race" Code

Let's say we have 1 million rows of prices and we want to apply a 10% tax.

import pandas as pd
import numpy as np
import time

# Create a DataFrame with 1 million rows
df = pd.DataFrame({'price': np.random.randint(1, 100, size=1_000_000)})

# THE SLOW WAY: Manual Loop (Don't do this!)
start = time.perf_counter()
taxes = []
for p in df['price']:
    taxes.append(p * 0.1)  # one Python-level multiply and append per row
df['tax_loop'] = taxes
print(f"Loop time: {time.perf_counter() - start:.4f} seconds")

# THE FAST WAY: Vectorization (The Pandas Way)
start = time.perf_counter()
df['tax_vec'] = df['price'] * 0.1  # one operation on the whole column
print(f"Vectorized time: {time.perf_counter() - start:.4f} seconds")


The result? Exact numbers depend on your machine and library versions, but the loop might take ~0.1 seconds, while the vectorized version takes ~0.001 seconds. On massive datasets, this is the difference between a task taking 10 minutes and one taking a few seconds.
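
A single stopwatch measurement can be noisy. As a hedged alternative, here is a sketch that uses the standard library's timeit module to repeat each version and keep the best run. The helper names tax_loop and tax_vec are made up for this example, and the row count is reduced to 100,000 so it finishes quickly:

```python
import timeit

import numpy as np
import pandas as pd

# Smaller than the 1 million rows above so the sketch runs fast
df = pd.DataFrame({"price": np.random.randint(1, 100, size=100_000)})

def tax_loop():
    # Element-by-element Python loop
    return [p * 0.1 for p in df["price"]]

def tax_vec():
    # Single vectorized column operation
    return df["price"] * 0.1

# Run each version several times and keep the best (least noisy) timing
loop_best = min(timeit.repeat(tax_loop, number=1, repeat=5))
vec_best = min(timeit.repeat(tax_vec, number=1, repeat=5))

print(f"Loop:       {loop_best:.4f} s")
print(f"Vectorized: {vec_best:.4f} s")
print(f"Speedup:    {loop_best / vec_best:.0f}x")
```

The speedup you see will vary with hardware, data size and library versions, so treat the printed ratio as a rough indication rather than a fixed number.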


🛠 When can't you vectorize?
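If your logic truly cannot be expressed as column operations (for example, each row needs a call to an external API), you might fall back to .apply(). Keep in mind that .apply() still calls a Python function once per element or row, so it is usually only a little faster than a manual for loop, and row-wise .apply(axis=1) can even be slower. Plain if/else logic can often still be vectorized with np.where or boolean masks. Always try math-based column operations first.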
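
To illustrate trying column operations first, here is a small sketch of an if/else rule written both ways. The rule itself is an assumption made up for this example (prices of 50 or more are taxed at 20%, everything else at 10%), and tax_for is a hypothetical helper:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, for illustration only
df = pd.DataFrame({"price": [10, 50, 80]})

def tax_for(price):
    # Made-up rule: 20% tax from 50 upward, otherwise 10%
    return price * 0.2 if price >= 50 else price * 0.1

# Row-by-row: calls the Python function once per value
df["tax_apply"] = df["price"].apply(tax_for)

# Vectorized: the condition and both branches are computed on whole columns
df["tax_where"] = np.where(df["price"] >= 50, df["price"] * 0.2, df["price"] * 0.1)

print(df)
```

Note that np.where evaluates both branches for every row and then picks one per row, which is fine for cheap arithmetic like this but does not help when a branch involves something slow such as a network call.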

👉 Write your code for the column, not for the row!