Data science/ML/AI
Data science and machine learning hub

Python, SQL, stats, ML, deep learning, projects, PDFs, roadmaps and AI resources.

For beginners, data scientists and ML engineers
👉 https://rebrand.ly/bigdatachannels

DMCA: @disclosure_bds
Contact: @mldatascientist
3 Common Questions About Data and Analytics
❀7
📚 Data Science Riddle

You have messy CSVs arriving daily. What's your first production step?
Anonymous Quiz
7% Train model right away
18% Manually clean each file
58% Automate data validation pipeline
17% Combine all into one CSV
Feature Engineering: The Hidden Skill That Makes or Breaks ML Models

Most people chase better algorithms. Professionals chase better features.

Because no matter how fancy your model is, if the data doesn’t speak the right language, it won’t learn anything meaningful.

🔍 So What Exactly Is Feature Engineering?

It’s not just cleaning data. It’s translating raw, messy reality into something your model can understand.

You’re basically asking:

“How can I represent the real world in numbers, without losing its meaning?”

Example:

➖ “Date of birth” → Age (time-based insight)
➖ “Text review” → Sentiment score (emotional signal)
➖ “Price” → log(price) (stabilized distribution)

Every transformation teaches your model how to see the world more clearly.
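
The three transformations above can be sketched in plain Python. This is a minimal illustration, not a production pipeline: the sample record, the reference date, and the tiny word lists used for the sentiment score are all assumptions made up for the example.

```python
import math
from datetime import date

# Hypothetical word lists; a real project would use a trained sentiment model.
POSITIVE = {"great", "love", "excellent"}
NEGATIVE = {"bad", "broken", "terrible"}


def age_in_years(date_of_birth: date, today: date) -> int:
    """Turn a date of birth into a whole-number age as of `today`."""
    had_birthday = (today.month, today.day) >= (date_of_birth.month, date_of_birth.day)
    return today.year - date_of_birth.year - (0 if had_birthday else 1)


def sentiment_score(review: str) -> int:
    """Count positive words minus negative words (a deliberately crude signal)."""
    words = [w.strip(".,!?").lower() for w in review.split()]
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)


def engineer(record: dict, today: date) -> dict:
    """Map one raw record to model-ready numeric features."""
    return {
        "age": age_in_years(record["date_of_birth"], today),
        "sentiment": sentiment_score(record["review"]),
        "log_price": math.log(record["price"]),  # assumes price > 0
    }


# Made-up sample record, used only to exercise the functions.
raw = {
    "date_of_birth": date(1990, 6, 15),
    "review": "Great product, love it!",
    "price": 100.0,
}
features = engineer(raw, today=date(2024, 6, 14))
print(features)
```

Passing `today` explicitly, rather than calling `date.today()` inside the function, keeps the age feature reproducible when the same data is re-processed later.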

⚙️ Why It Matters More Than the Model

You can’t outsmart bad features.
A simple linear model trained on well-engineered features will often outperform a deep neural net trained on noise.

Kaggle winners know this. They spend around 80% of their time creating and refining features, not tuning hyperparameters.

Why? Because models don’t create intelligence. They extract it from what you feed them.

🧩 The Core Idea: Add Signal, Remove Noise

Feature engineering is about sculpting your data so patterns stand out.

You do that by:

βœ”οΈ Transforming data (scale, encode, log).
βœ”οΈ Creating new signals (ratios, lags, interactions).
βœ”οΈ Reducing redundancy (drop correlated or useless columns).

Every step should make learning easier not prettier.
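
As a rough sketch of the "new signals" and "reducing redundancy" steps, the snippet below builds a ratio, an interaction, and a one-step lag, then drops a column that is almost perfectly correlated with another. The rows, column names, and the 0.95 threshold are all hypothetical choices for illustration.

```python
from math import sqrt

# Hypothetical daily records for one store; the column names are made up.
rows = [
    {"revenue": 200.0, "visits": 100, "price_usd": 10.0, "price_cents": 1000.0},
    {"revenue": 260.0, "visits": 130, "price_usd": 12.0, "price_cents": 1200.0},
    {"revenue": 150.0, "visits": 60, "price_usd": 9.0, "price_cents": 900.0},
    {"revenue": 330.0, "visits": 110, "price_usd": 15.0, "price_cents": 1500.0},
]


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Create new signals: a ratio, an interaction, and a one-step lag.
for i, row in enumerate(rows):
    row["revenue_per_visit"] = row["revenue"] / row["visits"]
    row["price_x_visits"] = row["price_usd"] * row["visits"]
    # The lag only looks backwards; the first row has no previous value.
    row["revenue_lag_1"] = rows[i - 1]["revenue"] if i > 0 else None

# Reduce redundancy: price_cents carries the same information as price_usd.
corr = pearson([r["price_usd"] for r in rows], [r["price_cents"] for r in rows])
if abs(corr) > 0.95:  # threshold is an arbitrary choice for this sketch
    for row in rows:
        del row["price_cents"]

print(rows[1])
```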

⚠️ Beware of Data Leakage

Here’s the silent trap: using future information when building features.

For example, when predicting loan default, if you include β€œpayment status after 90 days,” your model will look brilliant in training and fail in production.

Golden rule:
πŸ‘‰ A feature is valid only if it’s available at prediction time.
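
One way to make the golden rule checkable is to record, for every feature, when its value first becomes known, and to reject anything that arrives after the prediction is made. The sketch below assumes such a catalogue exists; the feature names and day offsets are invented for the loan-default example.

```python
# Hypothetical feature catalogue for a loan-default model. Each entry records
# how many days after the loan decision the value first becomes known.
FEATURE_AVAILABLE_AFTER_DAYS = {
    "income": 0,
    "credit_history_length": 0,
    "loan_amount": 0,
    "payment_status_after_90_days": 90,  # only known long after the decision
}


def split_by_availability(catalogue: dict, prediction_day: int = 0):
    """Separate features known by `prediction_day` from ones that would leak."""
    usable = sorted(n for n, day in catalogue.items() if day <= prediction_day)
    leaky = sorted(n for n, day in catalogue.items() if day > prediction_day)
    return usable, leaky


usable, leaky = split_by_availability(FEATURE_AVAILABLE_AFTER_DAYS)
print("usable:", usable)
print("leaky: ", leaky)
```

A check like this only catches leakage that the catalogue describes, so it complements a time-based train/test split and does not replace it.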

🧠 Think Like a Domain Expert

Anyone can code transformations.
But great data scientists understand context.

They ask:

❔What actually influences this outcome in real life?
❔How can I capture that influence as a feature?

When you merge domain intuition with technical precision, feature engineering becomes your superpower.

⚑️ Final Takeaway

The model is the student.
The features are the teacher.

And no matter how capable the student if the teacher explains things poorly, learning fails.
Feature engineering isn’t preprocessing. It’s the art of teaching your model how to understand the world.
❀11
📚 Data Science Riddle

You train a CNN for image classification but loss stops decreasing early. What's your next step?
Anonymous Quiz
19% Reduce batch size
42% Increase learning rate a bit
22% Add Dropout
16% Reduce layers
❀1🀝1
⚑ Parallelism In Databricks ⚑

1️⃣ DEFINITION

Parallelism = running many tasks πŸƒβ€β™‚οΈπŸƒβ€β™€οΈ at the same time
(instead of one by one 🐒).
In Databricks (via Apache Spark), data is split into
πŸ“¦ partitions, and each partition is processed
simultaneously across worker nodes πŸ’»πŸ’»πŸ’».

2️⃣ KEY CONCEPTS

🔹 Partition = one chunk of data 📦
🔹 Task = work done on a partition 🛠️
🔹 Stage = group of tasks that run in parallel ⚙️
🔹 Job = complete action (made of stages + tasks) 📊

3️⃣ HOW IT WORKS

✅ Step 1: Dataset ➡️ divided into partitions 📦📦📦
✅ Step 2: Each partition ➡️ assigned to a worker 💻
✅ Step 3: Workers run tasks in parallel ⏩
✅ Step 4: Results ➡️ combined into final output 🎯
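
The four steps can be imitated in plain Python to make the flow concrete. This is only an analogy under stated assumptions: worker threads stand in for Spark executors, and the helper names `split_into_partitions` and `task` are invented for the sketch. It is not how Spark itself is implemented.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(data, num_partitions):
    """Step 1: divide the dataset into roughly equal chunks."""
    return [data[i::num_partitions] for i in range(num_partitions)]


def task(partition):
    """Steps 2-3: the work one worker does on one partition."""
    return sum(x * 2 for x in partition)


data = list(range(1000))
partitions = split_into_partitions(data, num_partitions=4)

# Each partition is handed to a worker thread. Threads stand in for Spark
# executors here; this is an analogy, not Spark's actual scheduler.
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_results = list(pool.map(task, partitions))

# Step 4: combine the per-partition results into the final output.
total = sum(partial_results)
print(total)
```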

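4️⃣ EXAMPLES

# `spark` is the SparkSession that Databricks notebooks provide by default.

# Increase parallelism by repartitioning
# header=True assumes the file has a header row that includes "category"
df = spark.read.csv("/data/huge_file.csv", header=True)
df = df.repartition(200)  # ⚡ 200 partitions → up to 200 parallel tasks per stage

# Spark DataFrame ops run in parallel by default 🚀
result = df.groupBy("category").count()

# Parallelize small Python objects 📂
rdd = spark.sparkContext.parallelize(range(1000), numSlices=50)
doubled = rdd.map(lambda x: x * 2).collect()  # collect() brings the results back to the driver

# Parallel workflows in Jobs UI ⚡
# Independent tasks = run at the same time.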

5️⃣ BEST PRACTICES

⚖️ Balance partitions → not too few, not too many
📉 Avoid data skew → partitions should be even
🗃️ Cache data if reused often
💪 Scale cluster → more workers = more parallelism (as long as there are enough partitions to keep them busy)
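
To see why skew matters, the sketch below places rows into partitions by a hash of their key and compares the largest partition with the average. The keys are hypothetical, and `crc32` is used only because it is deterministic in plain Python; Spark uses its own hash function, so real partition assignments will differ.

```python
from collections import Counter
from zlib import crc32

# Hypothetical keys: one "hot" customer dominates the dataset.
keys = ["customer_1"] * 900 + [f"customer_{i}" for i in range(2, 102)]


def partition_sizes(keys, num_partitions):
    """Count rows per partition when rows are placed by a hash of their key."""
    counts = Counter(crc32(k.encode()) % num_partitions for k in keys)
    return [counts.get(p, 0) for p in range(num_partitions)]


def skew_ratio(sizes):
    """Largest partition divided by the mean partition size (1.0 = even)."""
    return max(sizes) / (sum(sizes) / len(sizes))


sizes = partition_sizes(keys, num_partitions=8)
print(sizes, round(skew_ratio(sizes), 2))
```

Because all 900 rows for the hot key land in the same partition, one task does most of the work while the others finish early, which is the situation the skew advice above is meant to avoid.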

====================================================
📌 SUMMARY
Parallelism in Databricks = split data 📦 →
assign tasks 🛠️ → run them at the same time ⏩ →
faster results 🚀
❀5
📚 Data Science Riddle

In A/B testing, why is random assignment of users essential?
Anonymous Quiz
9% To reduce experiment time
78% To ensure groups are unbiased
7% To increase conversion rate
6% To simplify analysis
❀3
Instead of starting every project from scratch, use this template to build AI apps with structure and speed
❀9
6 Steps of Data Cleaning Every Data Analyst Should Know
❀5πŸ‘1
📚 Data Science Riddle

Why is data versioning (e.g., DVC, lakeFS) essential in ML workflows?
Anonymous Quiz
11% It speeds up training
41% It helps reproduce experiments
19% It stores backups
29% It tracks model metrics
60 Generative AI Project Ideas
❀5
Classification Vs Regression By Bigdata Specialist.pdf
3.6 MB
Latest post from our Instagram page, saved as PDF ☝️

You can also find it here: https://www.instagram.com/p/DQJrbCaDBpy/
❀3πŸ‘2
AI Engineer Roadmap
❀5
📚 Data Science Riddle

Why is data validation before model training critical in production ML systems?
Anonymous Quiz
25% It prevents model drift
24% It ensures pipeline reproducibility
39% It catches bad data early
12% It improves training speed
❀3
Conceptual Modeling For ETL Processes.pdf
460.5 KB
Discusses modeling ETL workflows for data warehousing, including data sources and transformations. From Drexel University.
❀5
📚 Data Science Riddle

During EDA (Exploratory Data Analysis), what's the main reason we use box plots?
Anonymous Quiz
22% To visualize distributions
64% To detect outliers
9% To see correlations
5% To test normality
❀5
Hey everyone 👋

Some time ago, I asked if I should start a Data Science educational series and since 96% of you said yes, I began creating it.

But many of you also asked for real, hands-on experience with projects, not just lessons. So I decided to shift gears. It’s now becoming a full practical coding course! 💻

My goal is to help you build skills that get you job-ready, not just teach theory. It’s taking a bit longer, but I promise it’ll be worth it.

Thank you all for your support and patience ❤️
I’ll let you know as soon as we’re ready to start!
❀21πŸ‘3πŸ₯°1
Pandas Cheatsheet For Data Analysis
❀4