Data Science Riddle
A data engineer complains that your model training job is failing in production due to schema mismatch. What's the root fix?
Anonymous Quiz
12%
Cast data types in code
17%
Skip invalid rows
20%
Retrain with old schema
52%
Use a schema registry
Covariance vs. Correlation: Same Family, Different Story
People use them interchangeably, but they measure different things.
Covariance tells you the direction of the relationship (positive or negative), but its magnitude depends on the units of the variables.
Correlation goes further: it also tells you the strength, normalized between -1 and 1.
So while covariance might be 2345.67, correlation says 0.92: clear, interpretable, scale-free.
Covariance shows movement, correlation shows consistency.
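To make the scale-free point concrete, here is a minimal sketch in plain Python. The height and weight numbers are made up for illustration, and the helper names `covariance` and `correlation` are hypothetical, not from any library. It computes both quantities for the same data expressed in metres and in centimetres.

```python
from math import sqrt

def covariance(xs, ys):
    """Sample covariance (divides by n - 1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def correlation(xs, ys):
    """Pearson correlation: covariance divided by both standard deviations."""
    return covariance(xs, ys) / sqrt(covariance(xs, xs) * covariance(ys, ys))

# Made-up illustrative data: height in metres and weight in kilograms.
height_m = [1.60, 1.70, 1.75, 1.80, 1.90]
weight_kg = [55.0, 65.0, 72.0, 80.0, 90.0]

# The same heights expressed in centimetres.
height_cm = [h * 100 for h in height_m]

cov_m = covariance(height_m, weight_kg)
cov_cm = covariance(height_cm, weight_kg)
corr_m = correlation(height_m, weight_kg)
corr_cm = correlation(height_cm, weight_kg)

print(cov_m, cov_cm)    # covariance scales with the unit change
print(corr_m, corr_cm)  # correlation stays the same
```

Changing the unit multiplies the covariance by 100 while the correlation does not move, which is why a raw covariance value is hard to interpret on its own.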
Data Science Riddle
You're processing a dataset with frequent schema evolution. Which format handles it most gracefully?
Anonymous Quiz
10%
ORC
13%
Avro
58%
CSV
19%
Parquet
Eigenvalues & Eigenvectors: Why PCA Actually Works
You've heard of PCA. But what's really happening underneath?
PCA finds the directions (vectors) where your data varies the most.
Those directions are eigenvectors of the covariance matrix and the eigenvalues tell you how much variance each captures.
You're basically rotating your data to find its "natural axes."
PCA isn't compression; it's discovering how your data wants to be seen.
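The idea can be sketched for the two-dimensional case in plain Python, where the eigenvalues of a symmetric 2x2 matrix have a closed form. The data points are illustrative, the helper names (`cov`, `eigen_2x2_symmetric`) are hypothetical, and the closed form assumes the off-diagonal covariance is non-zero. Real work would normally use a numerical library instead.

```python
from math import sqrt, hypot

def cov(xs, ys):
    """Sample covariance (divides by n - 1)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)

def eigen_2x2_symmetric(a, b, c):
    """Eigenpairs of [[a, b], [b, c]], largest eigenvalue first.

    Assumes b != 0 so that (b, lam - a) is a valid, non-zero eigenvector.
    """
    mid = (a + c) / 2
    radius = sqrt(((a - c) / 2) ** 2 + b ** 2)
    pairs = []
    for lam in (mid + radius, mid - radius):
        vx, vy = b, lam - a
        norm = hypot(vx, vy)
        pairs.append((lam, (vx / norm, vy / norm)))
    return pairs

# Illustrative 2-D data with a strong positive relationship.
xs = [2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2.0, 1.0, 1.5, 1.1]
ys = [2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9]

# Entries of the covariance matrix [[a, b], [b, c]].
a, b, c = cov(xs, xs), cov(xs, ys), cov(ys, ys)
(lam1, v1), (lam2, v2) = eigen_2x2_symmetric(a, b, c)

# Rotate: project the centered data onto each eigenvector.
mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
scores1 = [(x - mx) * v1[0] + (y - my) * v1[1] for x, y in zip(xs, ys)]
scores2 = [(x - mx) * v2[0] + (y - my) * v2[1] for x, y in zip(xs, ys)]

# The variance along each new axis equals the matching eigenvalue.
var1, var2 = cov(scores1, scores1), cov(scores2, scores2)
explained = lam1 / (lam1 + lam2)

print("eigenvalues:", lam1, lam2)
print("variance along each axis:", var1, var2)
print("share of variance on first axis:", explained)
```

The two eigenvectors are perpendicular, so projecting onto them is a rotation of the centered data, and the eigenvalues add up to the total variance of the original two features.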
Data Science Riddle
Your Spark job fails due to executor memory pressure. What's the most effective optimization?
Anonymous Quiz
17%
Broadcast variables
28%
Larger cluster
38%
More shuffle partitions
17%
Persist fewer objects
BigDataAnalytics-Lecture.pdf
10.2 MB
Notes on HDFS, MapReduce, YARN, Hadoop vs. traditional systems and much more... from Columbia University.
Data Science Riddle
You fit a forecasting model and residuals show increasing variance. What is needed?
Anonymous Quiz
19%
Differencing
46%
Smoothing
28%
Decomposition
7%
Box-Cox
Data Science Riddle
A numeric feature has many repeated exact values with occasional jumps. What type of variable is this?
Anonymous Quiz
27%
Discrete
22%
Ordinal
18%
Continuous
33%
Interval
Machine Learning Notes.pdf
226.8 KB
Stanford CS lecture notes covering supervised and unsupervised algorithms, neural networks, and SVMs, with math proofs and Python pseudocode.
Data Science Riddle
Two team members run the same notebook but get different results. What's the culprit?
Anonymous Quiz
5%
Loss curves
13%
Batch shapes
62%
Random seeds
21%
Metric choice
Data Science Riddle
A query runs slowly due to large table scans. What's the most targeted fix?
Anonymous Quiz
53%
Add indexes
18%
Use aliases
19%
Add DISTINCT
10%
Increase RAM
Data Science Riddle
You want to detect extreme values visually in one plot. Which one is best?
Anonymous Quiz
54%
Box plot
30%
Heatmap
9%
Line chart
7%
Area plot
Mining of Massive Datasets (Leskovec, Stanford).pdf
2.9 MB
The Big Data bible from Stanford: MapReduce, Spark, recommendation systems, PageRank, locality-sensitive hashing, large-scale machine learning, and mining social networks and streams, all explained clearly with real algorithms you can code today. 500 pages of pure gold.
Data Science Riddle
You want to prevent inconsistent data across environments. What helps most?
Anonymous Quiz
29%
Checkpoints
17%
Contracts
41%
Indexes
13%
Sharding