Data Science Riddle
Model accuracy improves after dropping half the features. Why?
Anonymous Quiz
- 11% Model became smaller
- 71% Overfitting reduced
- 11% Data size shrank
- 7% Training faster
Understanding the Forecast Statistics and Four Moments (4P).pdf
181.8 KB
Statistical Moments (M1, M2) for Data Analysis
Here are 5 curated PDFs diving into the mean (M1), variance (M2), and their applications in crafting research questions and sourcing data.
A channel member requested resources on this topic and we delivered.
If you have a topic you want resources on, let us know and we'll make it happen!
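For readers new to the notation, M1 and M2 can be computed in a few lines. This is a minimal sketch using Python's standard `statistics` module; the sample values are made up for illustration, and M2 is taken here to be the population variance (the second central moment), which is an assumption about how the PDFs define it.

```python
import statistics

# Hypothetical sample, made up for illustration only
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

# M1: the mean (first moment)
m1 = statistics.fmean(values)

# M2: the variance, i.e. the mean squared deviation from M1
# (population variance; statistics.variance gives the sample version)
m2 = statistics.pvariance(values)

print(m1, m2)
```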
@datascience_bds
Data Science Riddle
Why do we use Batch Normalization?
Anonymous Quiz
- 28% Speeds up training
- 46% Prevents overfitting
- 8% Adds non-linearity
- 18% Reduces dataset size
Data Science Riddle
Your object detection model misses small objects. Easiest fix?
Anonymous Quiz
- 20% Use larger input images
- 33% Add more classes
- 29% Reduce learning rate
- 18% Train longer
AI that creates AI: ASI-ARCH finds 106 new SOTA architectures
ASI-ARCH is an experimental ASI that autonomously researches and designs neural nets. It hypothesizes, codes, trains, and tests models.
Scale:
1,773 experiments over 20,000+ GPU-hours.
Stage 1 (20M params, 1B tokens): 1,350 candidates beat DeltaNet.
Stage 2 (340M params): 400 models → 106 SOTA winners.
Top 5 trained on 15B tokens vs Mamba2 and Gated DeltaNet.
Results:
PathGateFusionNet: 48.51 avg (Mamba2: 47.84, Gated DeltaNet: 47.32).
BoolQ: 60.58 vs 60.12 (Gated DeltaNet).
Consistent gains across tasks.
Insights:
Prefers proven tools (gating, convs) and refines them iteratively.
Ideas come from: 51.7% literature, 38.2% self-analysis, 10.1% originality.
Among SOTA models, the self-analysis share rises to 44.8% and the literature share falls to 48.6%.
@datascience_bds
Databricks Tip: REPLACE vs MERGE
When updating Delta tables, you've got two powerful options:
REPLACE TABLE … ON
Like throwing away the entire library and rebuilding it.
- Replaces the old table with a newly created one.
- Schema + data = fully replaced.
- Super fast but destructive (the old rows are no longer in the current table).
- Best for full refreshes or schema changes.
MERGE
Like updating only the books that changed.
- Matches rows against a condition and works on them one by one.
- Updates, inserts, or deletes specific records.
- Preserves unchanged data.
- Best for incremental updates or CDC (Change Data Capture).
Key Difference
- REPLACE = Start fresh with a new table.
- MERGE = Surgically update rows without losing the rest.
Rule of thumb:
Use REPLACE for full rebuilds,
Use MERGE for incremental upserts.
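The difference can be illustrated without a cluster. The sketch below is a plain-Python analogy, not Delta Lake code: a table is modelled as a dict keyed by a row id, and `replace_table` and `merge_upsert` are hypothetical helper names chosen for this example.

```python
def replace_table(old_rows, new_rows):
    """Full rebuild: the result contains only the new rows."""
    return dict(new_rows)


def merge_upsert(old_rows, changes, deleted_ids=()):
    """Incremental update: touch only the keys named in the change set."""
    merged = dict(old_rows)          # unchanged rows are carried over
    merged.update(changes)           # update matching keys, insert new ones
    for row_id in deleted_ids:
        merged.pop(row_id, None)     # delete specific records
    return merged


# Hypothetical table and change set, made up for illustration
table = {1: "alice", 2: "bob", 3: "carol"}
incoming = {2: "bobby", 4: "dave"}

replaced = replace_table(table, incoming)
merged = merge_upsert(table, incoming, deleted_ids=[3])

print(replaced)  # only the incoming rows survive
print(merged)    # row 1 kept, row 2 updated, row 4 inserted, row 3 deleted
```

Note how the same `incoming` rows give very different tables: the replace loses rows 1 and 3, while the merge keeps everything that was not explicitly changed or deleted.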
#Databricks #DeltaLake
Data Science Riddle
You have messy CSVs arriving daily. What's your first production step?
Anonymous Quiz
- 7% Train model right away
- 18% Manually clean each file
- 58% Automate data validation pipeline
- 17% Combine all into one CSV
Feature Engineering: The Hidden Skill That Makes or Breaks ML Models
Most people chase better algorithms. Professionals chase better features.
Because no matter how fancy your model is, if the data doesn't speak the right language, it won't learn anything meaningful.
So What Exactly Is Feature Engineering?
It's not just cleaning data. It's translating raw, messy reality into something your model can understand.
You're basically asking:
"How can I represent the real world in numbers, without losing its meaning?"
Example:
- "Date of birth" → Age (time-based insight)
- "Text review" → Sentiment score (emotional signal)
- "Price" → log(price) (stabilized distribution)
Every transformation teaches your model how to see the world more clearly.
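Two of the transformations above can be sketched with the standard library alone. The record, its field names, and the fixed reference date are assumptions made for this example; the sentiment score is left out because it needs a trained model or lexicon. `math.log` requires a price greater than zero.

```python
import math
from datetime import date


def age_in_years(date_of_birth, today):
    """Whole years between date_of_birth and today."""
    had_birthday = (today.month, today.day) >= (date_of_birth.month, date_of_birth.day)
    return today.year - date_of_birth.year - (0 if had_birthday else 1)


def log_price(price):
    """Natural log of the price; only defined for price > 0."""
    return math.log(price)


# Hypothetical record; field names are illustrative
record = {"date_of_birth": date(1990, 6, 15), "price": 1200.0}
reference_day = date(2024, 6, 14)   # fixed so the example is reproducible

features = {
    "age": age_in_years(record["date_of_birth"], reference_day),
    "log_price": log_price(record["price"]),
}
print(features)
```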
Why It Matters More Than the Model
You can't outsmart bad features.
A simple linear model trained on smartly engineered data will outperform a deep neural net trained on noise.
Kaggle winners know this. They spend 80% of their time creating and refining features, not tuning hyperparameters.
Why? Because models don't create intelligence. They extract it from what you feed them.
The Core Idea: Add Signal, Remove Noise
Feature engineering is about sculpting your data so patterns stand out.
You do that by:
- Transforming data (scale, encode, log).
- Creating new signals (ratios, lags, interactions).
- Reducing redundancy (drop correlated or useless columns).
Every step should make learning easier, not prettier.
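As a rough illustration of the "creating new signals" step, the sketch below derives a ratio, a lag, and an interaction from a few rows. The rows and column names are hypothetical, and redundancy reduction is not shown.

```python
# Hypothetical daily records; column names are illustrative
rows = [
    {"day": 1, "revenue": 100.0, "visits": 50, "discount": 0, "is_weekend": 0},
    {"day": 2, "revenue": 180.0, "visits": 60, "discount": 1, "is_weekend": 0},
    {"day": 3, "revenue": 240.0, "visits": 80, "discount": 1, "is_weekend": 1},
]

previous_revenue = None
for row in rows:
    # Ratio: revenue earned per visit
    row["revenue_per_visit"] = row["revenue"] / row["visits"]
    # Lag: the previous day's revenue (None on the first day, where no history exists)
    row["revenue_lag_1"] = previous_revenue
    # Interaction: 1 only when a discount runs on a weekend
    row["discount_x_weekend"] = row["discount"] * row["is_weekend"]
    previous_revenue = row["revenue"]

for row in rows:
    print(row)
```

The lag looks only backwards in time, so it stays available when a prediction for the current day is made.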
Beware of Data Leakage
Here's the silent trap: using future information when building features.
For example, when predicting loan default, if you include "payment status after 90 days," your model will look brilliant in training and fail in production.
Golden rule:
A feature is valid only if it's available at prediction time.
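One way to enforce the golden rule is to record when each field becomes known and filter on that date. This is a minimal sketch under that assumption; the field names, dates, and the `usable_features` helper are hypothetical.

```python
from datetime import date

# Hypothetical loan application with the date each field becomes known
application_date = date(2024, 1, 10)
field_known_on = {
    "income": date(2024, 1, 10),
    "credit_score": date(2024, 1, 10),
    "payment_status_after_90_days": date(2024, 4, 9),  # known only later
}


def usable_features(known_on, prediction_date):
    """Keep only fields whose value already exists at prediction time."""
    return sorted(name for name, day in known_on.items() if day <= prediction_date)


print(usable_features(field_known_on, application_date))
```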
Think Like a Domain Expert
Anyone can code transformations.
But great data scientists understand context.
They ask:
- What actually influences this outcome in real life?
- How can I capture that influence as a feature?
When you merge domain intuition with technical precision, feature engineering becomes your superpower.
Final Takeaway
The model is the student.
The features are the teacher.
And no matter how capable the student, if the teacher explains things poorly, learning fails.
Feature engineering isn't preprocessing. It's the art of teaching your model how to understand the world.
Data Science Riddle
You train a CNN for image classification but loss stops decreasing early. What's your next step?
Anonymous Quiz
- 19% Reduce batch size
- 42% Increase learning rate a bit
- 22% Add Dropout
- 16% Reduce layers
Parallelism in Databricks
1. DEFINITION
Parallelism = running many tasks at the same time
(instead of one by one).
In Databricks (via Apache Spark), data is split into partitions, and each partition is processed simultaneously across worker nodes.
2. KEY CONCEPTS
- Partition = one chunk of data
- Task = work done on a partition
- Stage = group of tasks that run in parallel
- Job = complete action (made of stages + tasks)
3. HOW IT WORKS
Step 1: Dataset → divided into partitions
Step 2: Each partition → assigned to a worker
Step 3: Workers run tasks in parallel
Step 4: Results → combined into final output
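The four steps above can be imitated on a single machine. This is a plain-Python sketch, not Spark code: a thread pool stands in for the cluster's workers, and `split_into_partitions` and `run_task` are hypothetical names. Threads show the flow of the work rather than real multi-node speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(data, num_partitions):
    """Step 1: divide the dataset into roughly equal chunks."""
    return [data[i::num_partitions] for i in range(num_partitions)]


def run_task(partition):
    """Steps 2-3: the work one worker does on its partition."""
    return sum(x * 2 for x in partition)


data = list(range(1000))
partitions = split_into_partitions(data, num_partitions=4)

# A thread pool stands in for the cluster's workers
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_results = list(pool.map(run_task, partitions))

# Step 4: combine the per-partition results into the final output
total = sum(partial_results)
print(total)
```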
4. EXAMPLES
# `spark` is the SparkSession that Databricks notebooks provide
# Increase parallelism by repartitioning
# header=True assumes the file has a header row with a "category" column
df = spark.read.csv("/data/huge_file.csv", header=True)
df = df.repartition(200)  # 200 partitions, so up to 200 tasks can run at once
# Spark DataFrame ops run in parallel by default
result = df.groupBy("category").count()
# Parallelize small Python objects
rdd = spark.sparkContext.parallelize(range(1000), numSlices=50)
doubled = rdd.map(lambda x: x * 2).collect()  # collect() returns the results to the driver
# Parallel workflows in Jobs UI
# Independent tasks = run at the same time.
5. BEST PRACTICES
- Balance partitions: not too few, not too many
- Avoid data skew: partitions should be even
- Cache data if reused often
- Scale cluster: more workers = more parallelism
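To see why skew matters, the sketch below counts rows per partition for an even and a skewed key distribution. It is plain Python, not Spark: a simple modulo stands in for a hash partitioner, and the keys, `partition_sizes`, and `skew_ratio` are assumptions made for this example.

```python
from collections import Counter


def partition_sizes(keys, num_partitions):
    """Rows per partition when rows are routed by key modulo num_partitions."""
    counts = Counter(key % num_partitions for key in keys)
    return [counts.get(p, 0) for p in range(num_partitions)]


def skew_ratio(sizes):
    """Largest partition divided by the average partition size (1.0 = even)."""
    return max(sizes) / (sum(sizes) / len(sizes))


even_keys = list(range(1000))                 # every key appears once
skewed_keys = [7] * 900 + list(range(100))    # one hot key dominates

even_sizes = partition_sizes(even_keys, 4)
skewed_sizes = partition_sizes(skewed_keys, 4)

print(even_sizes, skew_ratio(even_sizes))
print(skewed_sizes, skew_ratio(skewed_sizes))
```

With the hot key, one partition holds almost all the rows, so its task finishes long after the other three and the whole stage waits for it.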
====================================================
SUMMARY
Parallelism in Databricks = split data → assign tasks → run them at the same time → faster results