Data science/ML/AI
Data science and machine learning hub

Python, SQL, stats, ML, deep learning, projects, PDFs, roadmaps and AI resources.

For beginners, data scientists and ML engineers
👉 https://rebrand.ly/bigdatachannels

DMCA: @disclosure_bds
Contact: @mldatascientist
📚 Data Science Riddle

You train a CNN for image classification but loss stops decreasing early. What's your next step?
Anonymous Quiz
- Reduce batch size: 19%
- Increase learning rate a bit: 42%
- Add Dropout: 22%
- Reduce layers: 16%
❀1🀝1
⚑ Parallelism In Databricks ⚑

1️⃣ DEFINITION

Parallelism = running many tasks πŸƒβ€β™‚οΈπŸƒβ€β™€οΈ at the same time
(instead of one by one 🐒).
In Databricks (via Apache Spark), data is split into
πŸ“¦ partitions, and each partition is processed
simultaneously across worker nodes πŸ’»πŸ’»πŸ’».

2️⃣ KEY CONCEPTS

🔹 Partition = one chunk of data 📦
🔹 Task = work done on one partition 🛠️
🔹 Stage = group of tasks that run in parallel with no shuffle between them ⚙️
🔹 Job = all the work triggered by one action (made of stages, each made of tasks) 📊

3️⃣ HOW IT WORKS

✅ Step 1: Dataset ➡️ divided into partitions 📦📦📦
✅ Step 2: Each partition ➡️ assigned as a task to a core on a worker 💻
✅ Step 3: Workers run tasks in parallel ⏩
✅ Step 4: Results ➡️ combined into final output 🎯
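
The four steps above can be mimicked in plain Python. This is only a rough analogy, not how Spark is implemented: a thread pool stands in for the workers, and `split_into_partitions`, `process_partition` and `run_in_parallel` are made-up names for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(data, num_partitions):
    """Step 1: cut the dataset into roughly equal chunks."""
    size = max(1, -(-len(data) // num_partitions))  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def process_partition(partition):
    """Steps 2-3: the task that runs on one partition (here: a partial sum)."""
    return sum(x * 2 for x in partition)


def run_in_parallel(data, num_partitions=4, num_workers=4):
    partitions = split_into_partitions(data, num_partitions)
    # The pool hands each partition to a free worker thread.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partial_results = list(pool.map(process_partition, partitions))
    # Step 4: combine the per-partition results into the final output.
    return sum(partial_results)


total = run_in_parallel(list(range(1000)))
print(total)
```

Threads are used here only to show the flow. In Spark the tasks run on separate executor cores, usually on different machines.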

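4️⃣ EXAMPLES

# `spark` is the SparkSession that Databricks notebooks provide.
# Increase parallelism by repartitioning
# header=True so the "category" column used below exists by name
df = spark.read.csv("/data/huge_file.csv", header=True)
df = df.repartition(200) # ⚡ 200 partitions = up to 200 parallel tasks (limited by available cores)
print(df.rdd.getNumPartitions()) # check the partition count

# Spark DataFrame ops run in parallel by default 🚀
result = df.groupBy("category").count()

# Parallelize small Python objects 📂
rdd = spark.sparkContext.parallelize(range(1000), numSlices=50)
rdd.map(lambda x: x * 2).collect()

# Parallel workflows in Jobs UI ⚡
# Tasks that don't depend on each other run at the same time.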

5️⃣ BEST PRACTICES

⚖️ Balance partitions → too few leaves cores idle, too many adds scheduling overhead
📉 Avoid data skew → partitions should be roughly even in size
🗃️ Cache data if reused often
💪 Scale cluster → more workers = more parallelism (as long as there are enough partitions)
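
To make the data skew point concrete, here is a small sketch in plain Python. It assumes simple hash partitioning on a key, uses `zlib.crc32` as a stand-in hash (Spark's own hash function differs), and the keys are made up. `partition_sizes` and `skew_ratio` are hypothetical helper names.

```python
import zlib
from collections import Counter


def partition_sizes(keys, num_partitions):
    """Count how many records land in each partition under hash partitioning."""
    sizes = Counter({p: 0 for p in range(num_partitions)})
    for key in keys:
        # crc32 is stable across runs, unlike the built-in hash() for strings.
        sizes[zlib.crc32(key.encode("utf-8")) % num_partitions] += 1
    return [sizes[p] for p in range(num_partitions)]


def skew_ratio(sizes):
    """Largest partition divided by the average size (1.0 = perfectly even)."""
    return max(sizes) / (sum(sizes) / len(sizes))


# Made-up keys: one "hot" key dominates, the rest are spread out.
keys = ["hot_customer"] * 900 + [f"customer_{i}" for i in range(100)]
sizes = partition_sizes(keys, num_partitions=8)
print(sizes, round(skew_ratio(sizes), 2))
```

All 900 records for the hot key land in the same partition, so one task does most of the work while the other seven finish early.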

====================================================
📌 SUMMARY
Parallelism in Databricks = split data 📦 →
assign tasks 🛠️ → run them at the same time ⏩ →
faster results 🚀
❀5
📚 Data Science Riddle

In A/B testing, why is random assignment of users essential?
Anonymous Quiz
- To reduce experiment time: 9%
- To ensure groups are unbiased: 78%
- To increase conversion rate: 7%
- To simplify analysis: 6%
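
A quick sketch of what the most-voted option ("unbiased groups") looks like in code. It assumes a simple 50/50 coin flip per user with a fixed seed. The user ids and the idea that the first 200 are "heavy" users are invented for the demo.

```python
import random


def assign_groups(user_ids, seed=42):
    """Randomly assign each user to 'A' or 'B' with equal probability."""
    rng = random.Random(seed)  # seeded so the assignment can be reproduced
    return {user_id: rng.choice(["A", "B"]) for user_id in user_ids}


# Hypothetical users; pretend the first 200 ids are "heavy" users.
user_ids = list(range(1000))
groups = assign_groups(user_ids)

size_a = sum(1 for g in groups.values() if g == "A")
heavy_in_a = sum(1 for u in user_ids[:200] if groups[u] == "A")
print(f"group A: {size_a} users, {heavy_in_a} of the 200 heavy users")
```

Because the coin flip ignores everything about the user, heavy users end up in both groups in roughly equal numbers, so a difference in conversion is easier to attribute to the change being tested.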
❀3
Instead of starting every project from scratch, use this template to build AI apps with structure and speed
❀9
6 Steps of Data Cleaning Every Data Analyst Should Know
❀5πŸ‘1
📚 Data Science Riddle

Why is data versioning (e.g., DVC, lakeFS) essential in ML workflows?
Anonymous Quiz
- It speeds up training: 11%
- It helps reproduce experiments: 41%
- It stores backups: 19%
- It tracks model metrics: 29%
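
The reproducibility idea behind the most-voted option can be sketched with a content fingerprint. This is not how DVC or lakeFS work internally (they do far more). It only shows the idea of pinning an experiment to one exact version of the data. `dataset_fingerprint` and the sample rows are made up.

```python
import hashlib


def dataset_fingerprint(rows):
    """Return a SHA-256 hex digest of the dataset's content."""
    digest = hashlib.sha256()
    for row in rows:
        digest.update(repr(row).encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


# Made-up training data and a later, silently edited copy.
v1 = [("alice", 34, 1), ("bob", 29, 0)]
v2 = [("alice", 34, 1), ("bob", 92, 0)]

# Store the fingerprint next to the experiment's other settings.
experiment_log = {"model": "baseline", "data_version": dataset_fingerprint(v1)}

# Before re-running the experiment, check that the data is the same version.
print(dataset_fingerprint(v1) == experiment_log["data_version"])  # True
print(dataset_fingerprint(v2) == experiment_log["data_version"])  # False
```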
60 Generative AI Project Ideas
❀5
Classification Vs Regression By Bigdata Specialist.pdf
3.6 MB
Latest post from our Instagram page, saved as PDF ☝️

You can also find it here: https://www.instagram.com/p/DQJrbCaDBpy/
❀3πŸ‘2
AI Engineer Roadmap
❀5
📚 Data Science Riddle

Why is data validation before model training critical in production ML systems?
Anonymous Quiz
- It prevents model drift: 25%
- It ensures pipeline reproducibility: 24%
- It catches bad data early: 39%
- It improves training speed: 12%
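
"Catching bad data early" can be as simple as a gate in front of the training step. A minimal sketch, assuming rows arrive as dicts with numeric values. The column names and the rules (age range, binary label) are invented for the example.

```python
def validate_rows(rows, required=("age", "income", "label")):
    """Split rows into (good, bad) using a few hypothetical rules."""
    good, bad = [], []
    for row in rows:
        # Rule 1: every required column must be present and not None.
        problems = [f"missing {col}" for col in required if row.get(col) is None]
        # Rule 2: age must be in a plausible range.
        age = row.get("age")
        if age is not None and not 0 <= age <= 120:
            problems.append("age out of range")
        # Rule 3: label must be binary (a missing label is reported by rule 1).
        if row.get("label") not in (0, 1, None):
            problems.append("unknown label")
        if problems:
            bad.append((row, problems))
        else:
            good.append(row)
    return good, bad


rows = [
    {"age": 34, "income": 52000, "label": 1},
    {"age": -5, "income": 41000, "label": 0},
    {"age": 29, "income": None, "label": 1},
    {"age": 47, "income": 78000, "label": 3},
]
good, bad = validate_rows(rows)
print(len(good), "rows OK")
for row, problems in bad:
    print("rejected:", row, problems)
```

Only the first row reaches training. The other three are rejected with a reason, which is much cheaper than debugging a model trained on them.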
❀3
Conceptual Modeling For ETL Processes.pdf
460.5 KB
Discusses modeling ETL workflows for data warehousing, including data sources and transformations. From Drexel University.
❀5
📚 Data Science Riddle

During EDA (Exploratory Data Analysis), what's the main reason we use box plots?
Anonymous Quiz
- To visualize distributions: 22%
- To detect outliers: 64%
- To see correlations: 9%
- To test normality: 5%
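
The outlier points on a box plot usually come from the 1.5 × IQR rule. A small sketch of that rule with the standard library. Plotting tools may compute quartiles slightly differently, and the sample values are made up.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values that fall outside the box plot whiskers."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


response_times_ms = [10, 12, 12, 13, 12, 11, 14, 13, 15, 102]  # made-up sample
print(iqr_outliers(response_times_ms))
```

Here Q1 = 11.75 and Q3 = 14.25, so the whiskers end at 8.0 and 18.0 and only the value 102 is flagged.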
❀5
Hey everyone 👋

Some time ago, I asked if I should start a Data Science educational series and since 96% of you said yes, I began creating it.

But many of you also asked for real, hands-on experience with projects, not just lessons. So I decided to shift gears. It’s now becoming a full practical coding course! 💻

My goal is to help you build skills that get you job-ready, not just teach theory. It’s taking a bit longer, but I promise it’ll be worth it.

Thank you all for your support and patience ❤️
I’ll let you know as soon as we’re ready to start!
❀21πŸ‘3πŸ₯°1
Pandas Cheatsheet For Data Analysis
❀4
📚 Data Science Riddle

Your batch ETL job runs slower each week despite no code change. What's your first suspect?
Anonymous Quiz
- Code inefficiency: 12%
- Schema mismatch: 20%
- Data volume growth: 61%
- Resource throttling: 7%
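
One way to test the "data volume growth" suspicion: log rows and runtime per run and compare throughput. A rough sketch. The run log is invented and the 0.9 / 1.1 thresholds are arbitrary assumptions, not a standard.

```python
def diagnose_slowdown(runs):
    """Compare the first and last run: did rows grow, or did rows/second drop?"""
    first, last = runs[0], runs[-1]
    row_growth = last["rows"] / first["rows"]
    time_growth = last["seconds"] / first["seconds"]
    throughput_change = (last["rows"] / last["seconds"]) / (first["rows"] / first["seconds"])
    # Throughput roughly stable while rows grew: the job is just doing more work.
    if throughput_change > 0.9 and row_growth > 1.1:
        verdict = "data volume growth"
    else:
        verdict = "look at code, schema or resources"
    return row_growth, time_growth, verdict


# Hypothetical weekly run log (rows processed, wall-clock seconds).
runs = [
    {"week": 1, "rows": 1_000_000, "seconds": 600},
    {"week": 2, "rows": 1_300_000, "seconds": 790},
    {"week": 3, "rows": 1_700_000, "seconds": 1040},
]
row_growth, time_growth, verdict = diagnose_slowdown(runs)
print(f"rows x{row_growth:.2f}, time x{time_growth:.2f} -> {verdict}")
```

If rows per second had dropped sharply while the row count stayed flat, the other poll options would be the better place to look.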
🚨 When & How Jupyter Notebooks Fail (And What To Use Instead)

Hey Data Folks! πŸ‘©β€πŸ’»πŸ‘¨β€πŸ’»
Let’s talk about Jupyter Notebooks β€” powerful for exploration, but risky in production. Here’s why:

❌ Problems with Notebooks:
1. Out-of-order execution β†’ hidden bugs.
2. Code changes after execution β†’ inconsistent results.
3. Data leakage β†’ sensitive info in outputs.
4. Security risks β†’ tokens/keys exposed.
5. Hard to apply engineering practices β†’ no modular code, testing, CI/CD.
6. Collaboration pain β†’ merge conflicts, JSON issues.
7. Reproducibility issues β†’ missing dependencies, versions.

✅ When They’re Useful:
- Quick data exploration & prototyping.
- Knowledge sharing (clean, runnable from top to bottom).
- Teaching / hands-on tutorials (with solution notebooks).

🔧 What to Use Instead:
- For production code → .py files + IDEs.
- For workflows → template repos & reproducible setups.
- For deployment → MLOps tools, pipelines, automation.

💡 Key Takeaways:
- Use notebooks for exploration & teaching.
- Use structured code + pipelines for production & deployment.
- Always document dependencies, keep notebooks clean, never commit secrets!
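
For the "never commit secrets" point, a minimal sketch of keeping a token out of notebook cells and outputs. The variable name `MY_SERVICE_TOKEN` and both helper functions are hypothetical. Real projects often use a secrets manager instead.

```python
import os


def get_api_token(name="MY_SERVICE_TOKEN"):
    """Read a secret from the environment instead of hard-coding it in a cell."""
    token = os.environ.get(name)
    if not token:
        raise RuntimeError(f"Set the {name} environment variable before running.")
    return token


def mask(secret, visible=4):
    """Safe representation for logs and notebook output."""
    shown = secret[-visible:] if 0 < visible < len(secret) else ""
    return "*" * (len(secret) - len(shown)) + shown


# Usage (after setting MY_SERVICE_TOKEN in the shell, outside the notebook):
# print(mask(get_api_token()))
```

Since the code lives in plain functions, it can move into a .py module and get a unit test, which is the other half of the advice above.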
❀6πŸ‘2
List of AI Project Ideas 👨🏻‍💻

Beginner Projects

🔹 Sentiment Analyzer
🔹 Image Classifier
🔹 Spam Detection System
🔹 Face Detection
🔹 Chatbot (Rule-based)
🔹 Movie Recommendation System
🔹 Handwritten Digit Recognition
🔹 Speech-to-Text Converter
🔹 AI-Powered Calculator
🔹 AI Hangman Game

Intermediate Projects

🔸 AI Virtual Assistant
🔸 Fake News Detector
🔸 Music Genre Classification
🔸 AI Resume Screener
🔸 Style Transfer App
🔸 Real-Time Object Detection
🔸 Chatbot with Memory
🔸 Autocorrect Tool
🔸 Face Recognition Attendance System
🔸 AI Sudoku Solver

Advanced Projects

🔺 AI Stock Predictor
🔺 AI Writer (GPT-based)
🔺 AI-powered Resume Builder
🔺 Deepfake Generator
🔺 AI Lawyer Assistant
🔺 AI-Powered Medical Diagnosis
🔺 AI-based Game Bot
🔺 Custom Voice Cloning
🔺 Multi-modal AI App
🔺 AI Research Paper Summarizer
❀9πŸ‘1