Data science/ML/AI
Data science and machine learning hub

Python, SQL, stats, ML, deep learning, projects, PDFs, roadmaps and AI resources.

For beginners, data scientists and ML engineers
👉 https://rebrand.ly/bigdatachannels

DMCA: @disclosure_bds
Contact: @mldatascientist
Our platform is finally ready. 🚀

Do you remember the platform I told you we are building for you? 👀
Free learning materials, job offers, tech updates, Udemy coupons… all in one place.

After almost 3 years of building, testing, talking to many of you and improving it step by step… it’s finally in beta. ✔️

A lot of you actually participated in developing this, as backend devs, frontend devs or designers. 🧑‍💻

That makes me insanely proud.
This is truly built by us, for us. ❤️

I’m opening early access to a small group.
If you want to be one of the first inside, test it, find bugs, suggest ideas, or just see what’s under the hood… join the Beta Testers Group 👉 https://t.me/+9vt9IKi6iGAxZDhk

Let’s make this thing amazing. Together. 🚀
How To Tell a Data Story
✅ Natural Language Processing (NLP) Basics You Should Know 🧠💬

Understanding NLP is key to working with language-based AI systems like chatbots, translators, and voice assistants.

1️⃣ What is NLP?
NLP stands for Natural Language Processing. It enables machines to understand, interpret, and respond to human language.

2️⃣ Key NLP Tasks:
- Text classification (spam detection, sentiment analysis)
- Named Entity Recognition (NER) (identifying names, places)
- Tokenization (splitting text into words/sentences)
- Part-of-speech tagging (noun, verb, etc.)
- Machine translation (English → French)
- Text summarization
- Question answering

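3️⃣ Tokenization Example:
import nltk
from nltk.tokenize import word_tokenize

nltk.download("punkt")  # one-time tokenizer data download; newer NLTK releases may ask for "punkt_tab" instead
text = "ChatGPT is awesome!"
tokens = word_tokenize(text)  # split the sentence into word and punctuation tokens
print(tokens)  # ['ChatGPT', 'is', 'awesome', '!']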


4️⃣ Sentiment Analysis:
Detects whether text is positive, negative, or neutral.
from textblob import TextBlob

# Returns Sentiment(polarity, subjectivity): polarity in [-1, 1], subjectivity in [0, 1]
TextBlob("I love AI!").sentiment


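5️⃣ Stopwords Removal:
Removes common words like “is”, “the”, “a”.
import nltk
from nltk.corpus import stopwords
nltk.download("stopwords")  # one-time download of the stopword lists
stop_words = set(stopwords.words("english"))  # build the set once instead of once per word
words = ["this", "is", "a", "test"]
filtered = [w for w in words if w not in stop_words]  # ['test']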


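6️⃣ Lemmatization vs Stemming:
- Stemming: Cuts off word endings with fixed rules (running → run); fast, but the result may not be a real word
- Lemmatization: Uses vocabulary and grammar to return the dictionary form (better → good); slower, but usually more accurate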
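
To see the difference in behavior, here is a toy sketch in plain Python. It is an illustration only: the suffix list and the mini-dictionary are made up for this example, and real stemmers (such as NLTK's Porter stemmer, which does turn "running" into "run") and lemmatizers use far richer rules and lexicons.

```python
# Toy illustration only: a real project would use a library stemmer/lemmatizer.
SUFFIXES = ("ing", "ed", "es", "s")  # hypothetical rule list, checked in order


def crude_stem(word):
    """Chop the first matching suffix, with no knowledge of the vocabulary."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Hypothetical mini-dictionary standing in for a real lexicon.
LEMMAS = {"running": "run", "studies": "study", "better": "good", "mice": "mouse"}


def toy_lemmatize(word):
    """Look the word up and return its dictionary form if it is known."""
    return LEMMAS.get(word, word)


for w in ["running", "studies", "better", "mice"]:
    print(w, "->", crude_stem(w), "|", toy_lemmatize(w))
```

The rule-based version produces non-words like "runn" and "studi" and cannot relate "better" to "good"; the lookup version can, but only for words it knows.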

7️⃣ Vectorization:
Converts text into numbers for ML models.
- Bag of Words
- TF-IDF
- Word Embeddings (Word2Vec, GloVe)
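
A minimal sketch of the first two ideas in plain Python, assuming whitespace tokenization and the textbook formula tf × log(N / df). The three sentences are made up, and libraries such as scikit-learn use smoothed variants, so their numbers will differ.

```python
import math
from collections import Counter

docs = ["the cat sat", "the dog barked", "the cat chased the dog"]  # made-up corpus
tokenized = [d.split() for d in docs]
vocab = sorted({term for doc in tokenized for term in doc})


def bag_of_words(tokens):
    """Count how often each vocabulary term appears in one document."""
    counts = Counter(tokens)
    return [counts[term] for term in vocab]


def tf_idf(tokens):
    """Weight term frequency by how rare the term is across the corpus."""
    counts = Counter(tokens)
    vector = []
    for term in vocab:
        tf = counts[term] / len(tokens)                # share of this document
        df = sum(term in doc for doc in tokenized)     # documents containing the term
        idf = math.log(len(tokenized) / df)            # 0 when the term is everywhere
        vector.append(tf * idf)
    return vector


print(vocab)
print(bag_of_words(tokenized[2]))  # [0, 1, 1, 1, 0, 2]
print(tf_idf(tokenized[2]))
```

Bag of Words gives "the" the highest count, while TF-IDF pushes it to zero because it appears in every document.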

8️⃣ Transformers in NLP:
Modern NLP models such as BERT and GPT use the transformer architecture, whose self-attention lets every token draw on context from the whole sequence.

9️⃣ Applications of NLP:
- Chatbots
- Virtual assistants (Alexa, Siri)
- Sentiment analysis
- Email classification
- Auto-correction and translation

🔟 Tools/Libraries:
- NLTK
- spaCy
- TextBlob
- Hugging Face Transformers

💬 Tap ❤️ for more!
Layers of AI
Pre-Chunking vs. Post-Chunking (On-Demand Chunking)

This visual breaks down two common ways to chunk documents in Retrieval-Augmented Generation (RAG) systems, and when each makes sense.

Pre-Chunking
Documents are cleaned, split into chunks, embedded, and stored ahead of time.
• Pros: Fast retrieval at query time, simpler runtime pipeline.
• Cons: Rigid. Changing chunk size or strategy means reprocessing the entire dataset.
• Best for: Stable datasets, high-throughput apps, predictable queries.

Post-Chunking / On-Demand Chunking
Documents are stored whole; chunking happens after retrieval based on the user’s query.
• Pros: More flexible and query-aware, often more relevant context.
• Cons: Higher latency and infrastructure complexity.
• Best for: Evolving content, exploratory queries, precision-focused use cases.
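
The two pipelines can be sketched side by side in plain Python. This is a rough illustration under loud assumptions: word overlap stands in for embedding similarity, the two documents are made up, and `chunk`, `score`, `retrieve_pre` and `retrieve_post` are hypothetical helper names, not any library's API.

```python
import re


def chunk(text, size):
    """Split text into consecutive chunks of `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def score(query, text):
    """Toy relevance: distinct query words found in the text (stand-in for embeddings)."""
    q = set(re.findall(r"\w+", query.lower()))
    t = set(re.findall(r"\w+", text.lower()))
    return len(q & t)


DOCS = {  # made-up documents
    "billing": "Invoices are issued monthly. Refunds are processed within "
               "five business days after approval.",
    "shipping": "Orders ship within two days. International delivery can "
                "take up to three weeks.",
}

# Pre-chunking: split everything once, ahead of time, with a fixed chunk size.
PRE_INDEX = [(doc_id, c) for doc_id, text in DOCS.items() for c in chunk(text, size=8)]


def retrieve_pre(query):
    """Query time only searches the chunks that were prepared in advance."""
    return max(PRE_INDEX, key=lambda item: score(query, item[1]))


def retrieve_post(query, size):
    """Find the best whole document first, then chunk it on demand."""
    doc_id = max(DOCS, key=lambda d: score(query, DOCS[d]))
    chunks = chunk(DOCS[doc_id], size)
    return doc_id, max(chunks, key=lambda c: score(query, c))


query = "when are refunds processed"
print(retrieve_pre(query))
print(retrieve_post(query, size=4))
```

Note how `retrieve_post` can pick the chunk size per query, while changing the size for `retrieve_pre` means rebuilding `PRE_INDEX`.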

🔑 Takeaway:
There’s no one-size-fits-all. If speed and scale matter most, pre-chunk. If adaptability and relevance are key, post-chunk. Many production systems even combine both.
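🤯📈 Detect Outliers in 5 Lines

Simple z-score-based outlier detection.

import numpy as np

# df is assumed to be an existing pandas DataFrame with a numeric "salary" column
z = (df["salary"] - df["salary"].mean()) / df["salary"].std()  # z-score per row
outliers = df[np.abs(z) > 3]  # rows more than 3 standard deviations from the mean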
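
To try the snippet end to end, here is a self-contained sketch with synthetic salaries (made-up numbers: 20 typical values plus one extreme one). One caveat worth knowing: with the sample standard deviation that pandas uses by default, the largest possible |z| in a column of n rows is (n - 1) / sqrt(n), so a column with 10 or fewer rows can never cross the threshold of 3.

```python
import numpy as np
import pandas as pd

# Synthetic data for illustration: 20 typical salaries and one extreme value.
salaries = [50_000 + 500 * i for i in range(20)] + [500_000]
df = pd.DataFrame({"salary": salaries})

# Same z-score rule as in the post.
z = (df["salary"] - df["salary"].mean()) / df["salary"].std()
outliers = df[np.abs(z) > 3]

print(outliers)  # only the 500,000 row is flagged
```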


Why this matters:
• Clean data
• Better models
• Fewer surprises in production

Small code. Big impact.
Type of Data Professionals
Forwarded from Programming Quiz Channel
Unsupervised learning often uses:
Anonymous Quiz
• Labels: 9%
• Regression: 17%
• Classification: 17%
• Clustering: 56%
Python for Data Analytics: The Ultimate Library Ecosystem (2026 Edition)

This wheel shows a recommended Python data stack, from raw scraping to production insights:

➡️ Data Manipulation → Pandas, Polars (a faster alternative for large data), NumPy

➡️ Visualization → Matplotlib, Seaborn, Plotly (interactive dashboards)

➡️ Analysis → SciPy, Statsmodels, Pingouin

➡️ Time Series → Darts, Kats, Tsfresh, sktime

➡️ NLP → NLTK, spaCy, TextBlob, transformers (BERT & friends)

➡️ Web Scraping → BeautifulSoup, Scrapy, Selenium

🔥 Pro tip from real projects:
👉 Switch to Polars when Pandas starts choking on datasets larger than 1 GB
👉 Use Plotly + Dash when stakeholders want interactive reports
👉 Combine Darts + Tsfresh for serious time-series feature engineering
βš‘οΈπŸ“Š One Line Feature Scaling

Scaling features without touching sklearn πŸ‘€

df["age_scaled"] = (df["age"] - df["age"].mean()) / df["age"].std()


Why it is useful:
• Quick experiments
• Better intuition
• No pipeline overhead
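
One detail to keep in mind if you later compare against sklearn: pandas `.std()` defaults to the sample standard deviation (ddof=1), while scikit-learn's StandardScaler divides by the population standard deviation (ddof=0). A small sketch with made-up ages shows both variants; pass `ddof=0` if you want numbers that line up with StandardScaler.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [22, 35, 47, 51, 63]})  # made-up values

centered = df["age"] - df["age"].mean()

# The one-liner from the post: pandas uses the sample std (ddof=1) by default.
df["age_scaled"] = centered / df["age"].std()

# Population std (ddof=0), the convention StandardScaler follows.
df["age_scaled_pop"] = centered / df["age"].std(ddof=0)

print(df)
print(df["age_scaled"].std(), df["age_scaled_pop"].std(ddof=0))  # both close to 1.0
```

The two columns differ only by a constant factor, so for quick experiments either works; it matters when you need to reproduce another tool's output exactly.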
Generative AI 101 in 10 Terms
🧠 LayerNorm vs BatchNorm: Same Goal, Different Behavior

Both techniques normalize activations, but they operate differently.

Batch Normalization

📦 Normalizes each feature across the batch
⚡️ Depends on batch statistics
🖼 Works very well in CNNs
⚠️ Sensitive to small batch sizes

Layer Normalization

🔬 Normalizes across features per sample
📏 Independent of batch size
🤖 Preferred in transformers and NLP
✅ Stable for sequence models

Why transformers use LayerNorm ❔

Sequence models often run with variable-length inputs and small or varying batch sizes.
LayerNorm avoids reliance on batch statistics, so it behaves the same in training and inference.

✅ Rule of thumb

🖼 CNNs → BatchNorm
🤖 Transformers → LayerNorm

📌 They look similar mathematically but normalize along different axes.
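
The "different axes" point fits in a few lines of NumPy. This is a simplified sketch: it assumes a 2-D input of shape (batch, features) and leaves out the learnable scale/shift and the running statistics that real BatchNorm layers keep for inference.

```python
import numpy as np

EPS = 1e-5  # small constant for numerical stability


def batch_norm(x):
    """Normalize each feature (column) with statistics taken over the batch."""
    return (x - x.mean(axis=0, keepdims=True)) / np.sqrt(x.var(axis=0, keepdims=True) + EPS)


def layer_norm(x):
    """Normalize each sample (row) with statistics taken over its own features."""
    return (x - x.mean(axis=1, keepdims=True)) / np.sqrt(x.var(axis=1, keepdims=True) + EPS)


rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=(4, 3))  # (batch, features)

first_sample_alone = x[:1]  # the same sample, but in a batch of size 1

# LayerNorm gives the same result no matter what else is in the batch.
print(np.allclose(layer_norm(x)[:1], layer_norm(first_sample_alone)))  # True

# BatchNorm does not: with a batch of 1 the sample is its own mean, so it collapses to zeros.
print(np.allclose(batch_norm(x)[:1], batch_norm(first_sample_alone)))  # False
```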
LLMs are getting insanely popular lately and suddenly everyone is talking about AI, chatbots, copilots, agents… so let’s clear it up 👇

So what are LLMs really? 🤔

LLMs = Large Language Models

Think of them as insanely smart text prediction machines that learned from tons of books, code, docs, and conversations 📚💻
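
To get a feel for what "text prediction" means, here is a deliberately tiny sketch: a bigram counter that predicts the next word from the previous one. It is a toy under obvious assumptions (made-up corpus, hypothetical `predict_next` helper); real LLMs use neural networks over subword tokens and much longer context, but the training signal is the same idea of guessing what comes next.

```python
from collections import Counter, defaultdict

corpus = "the cat sat on the mat . the cat ate the fish ."  # made-up training text
tokens = corpus.split()

# Count which word follows which.
following = defaultdict(Counter)
for current, nxt in zip(tokens, tokens[1:]):
    following[current][nxt] += 1


def predict_next(word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    if word not in following:
        return None
    return following[word].most_common(1)[0][0]


print(predict_next("the"))      # cat
print(predict_next("unicorn"))  # None
```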

Why everyone is obsessed right now 🔥
• They can write code 🧑‍💻
• Explain complex stuff like a friend 🗣
• Analyze data 📊
• Power chatbots, copilots, agents 🤖
• One model, MANY tasks

Why they exploded now 🚀
• GPUs got better and cheaper
• Open source models became really good
• Companies realized: this saves time and money 💰

The most famous LLMs you hear about 👀
• GPT-4 / GPT-4.1 by OpenAI
• Claude 3 by Anthropic
• Gemini by Google
• Llama 3 by Meta
• Mistral by Mistral AI

Where LLMs are actually used today 🛠
• Chatbots and AI assistants
• Writing SQL and Python
• Data analysis and reporting
• Customer support automation
• Internal company tools

Important truth 💡
LLMs are not magic 🪄
They are a very powerful autocomplete with some reasoning ability.

Learn how to use them properly and you are already ahead of most people 😉
Forwarded from Programming Quiz Channel
Which ML concept refers to splitting data into training and testing subsets?
Anonymous Quiz
• Normalization: 21%
• Cross-Validation: 40%
• Sampling: 32%
• Augmentation: 6%
VC Dimension

In theory courses, VC dimension appears abstract.
But it answers a deep question:
How complex is your model’s decision boundary?


VC dimension is the size of the largest set of points a model class can shatter, meaning it can classify those points perfectly under every possible labeling.

Why this is important ❔

Two models with similar parameter counts can have very different capacities.

For example:

📦 k-NN → very high effective capacity (1-NN can shatter any set of distinct points)
📏 Linear classifier → limited capacity (VC dimension d + 1 in d dimensions)
🌳 Deep trees → extremely high capacity

What you need to understand

Generalization depends on capacity relative to data size.
Too much capacity with little data leads to overfitting.

✅ VC dimension is about expressive power, not just number of parameters.
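
Shattering becomes concrete with a small brute-force check. The sketch below assumes a 2-D linear classifier and uses a perceptron with a capped number of epochs as the separability test; `linearly_separable`, `shatters` and the cap of 1000 are illustrative choices, and hitting the cap is treated as "not separable", which is a heuristic rather than a proof.

```python
from itertools import product


def linearly_separable(points, labels, max_epochs=1000):
    """Run a perceptron; it converges only if a separating line exists."""
    w = [0, 0]
    b = 0
    for _ in range(max_epochs):
        mistakes = 0
        for (x, y), t in zip(points, labels):
            if t * (w[0] * x + w[1] * y + b) <= 0:  # misclassified or on the boundary
                w[0] += t * x
                w[1] += t * y
                b += t
                mistakes += 1
        if mistakes == 0:
            return True
    return False  # gave up: treated as not separable


def shatters(points):
    """True if every +1/-1 labeling of the points can be separated by a line."""
    return all(
        linearly_separable(points, labels)
        for labels in product([-1, 1], repeat=len(points))
    )


three_points = [(0, 0), (1, 0), (0, 1)]
four_points = [(0, 0), (1, 0), (0, 1), (1, 1)]

print(shatters(three_points))  # True: all 8 labelings are separable
print(shatters(four_points))   # False: the XOR labeling is not
```

Three points in general position can be shattered by a line, and no set of four points in the plane can, which matches the VC dimension of 3 for linear classifiers in 2-D. The check above only demonstrates the failure for the square; the general statement is the standard textbook result.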