If you're looking to work with data, knowing SQL is a must-have π. Most of the time, when you're applying for a data job, showing off your SQL skills is a crucial step. So, it's super important to get the hang of the basic SQL commands and also to understand what else you need to nail those job interviews.
In this video, we've got some cool stuff for you:
* A mock SQL interview for a Data Analyst job at a SaaS company πΌ.
* How the interviewee tackles the problems and areas they can improve π€.
* Feedback that'll clue you in on what you need to ace SQL interviews β .
We created this video to help you understand what SQL interviews are like and what you need to learn to pass them with flying colors π. It's filled with advice and insights to prep you for your next data job interview.
https://youtu.be/qUwbpgbEymE?si=xoMGx2ZdYFfiP68C
In this video, we've got some cool stuff for you:
* A mock SQL interview for a Data Analyst job at a SaaS company πΌ.
* How the interviewee tackles the problems and areas they can improve π€.
* Feedback that'll clue you in on what you need to ace SQL interviews β .
We created this video to help you understand what SQL interviews are like and what you need to learn to pass them with flying colors π. It's filled with advice and insights to prep you for your next data job interview.
https://youtu.be/qUwbpgbEymE?si=xoMGx2ZdYFfiP68C
YouTube
SQL technical round for Data Analyst | Mock Interview
If you're looking to work with data, knowing SQL is a must-have π. Most of the time, when you're applying for a data job, showing off your SQL skills is a crucial step. So, it's super important to get the hang of the basic SQL commands and also to understandβ¦
π6π―2β€1
Forwarded from ΠΠ½ΠΆΠΈΠ½ΠΈΡΠΈΠ½Π³ ΠΠ°Π½Π½ΡΡ
(Dmitry)
Cheat sheet for data engineering ramp up in 2022.
π’ SQL/data modeling
Install Postgres. Load dummy data with \copy
Understand ACID
Different consistency levels: Strict, Linearizable, Serializable, Eventual, Casual
Aggs with GROUP BY.
Joins: INNER, LEFT.
Filter records with WHERE. Filter groups with HAVING.
Aggs persisted to summary tables.
ORDER BY foo LIMIT n --> top n by foo.
Organize queries with CTEs and temp tables. Subselect sparingly
Window functions to look at trends within arbitrary partitions.
Unique constraints. Foreign key constraints.
Dimensions vs Fact tables. Star, snowflake schemas.
EXPLAIN <query>. Read the plan.
Add an index. How does the EXPLAIN change?
π’ CompSci concepts you must know
Uniformly distributed hash functions.
Tuples vs Arrays.
Iterator design pattern.
External Merge sort.
Bit vectors, HashMap. Sets.
Directed graphs.
Serialization
π’ Time representation
ISO 8601 strings
Prefer native date and timestamp objects.
UTC.
π’ Fundamentals of distributed data processing
Cap theorem.
Replication.
Horizontal vs Vertical scaling.
Sharding vs clustering.
Share-nothing principle.
Partitioning data at-rest.
Planning for fault tolerance and failure.
Exponential back off retries.
Latency vs bandwidth.
How to code to iterators without the input fully buffered in memory.
π’ Spark
Dataframe and Dataset batch API.
Shuffle challenges. When to repartition().
Runtime partitioning by key.
Tuning: executors num, memory, cores. Num shuffle partitions.
Dataframe ops split up into composable driver functions.
Construct fixture Dataframes for unit tests.
Create temporary views as staging tables.
Data quality checks over staging views. Lots of logging.
Fail driver if checks fail.
Block any downstream dependencies until the checks pass.
Hive metastore for data lake catalog.
π’ Common structured data formats
TSV, Tar, Zip, XLSX, Avro, Parquet, Protobufs, Delta
Splittable file format?
Compression: Bz2. Gzip, Snappy, LZO.
Hadoop mergeUtils to help small files problem.
π’ Cloud providers
Top players for distributed computing: AWS, GCP, Azure
S3/GCS object stores
Managed Airflow = pipelining
EMR/Dataproc = processing
Container registries. EKS/GKE for container execution
IAM access controls. As-needed access. No shared service creds.
π’ Languages (1 - 10 goal proficiency)
SQL (10)
Python (8)
Java (7)
Scala (4)
Bash (4)
π’ Metrics/monitoring/alerting
Statsd, Prometheus.
HTTP vs. UDP
Gauge vs increment vs histogram.
Measurement vs tag.
Visualize measurement time series: Grafana, Datadog.
Observe what normal is. Alert on abnormal thresholds.
Alert messaging clarity.
Runbooks for on-call.
π’ Linux familiarity
tail -f
tmux
ps -ef, kill -9
vim
ssh, scp
cat, less, more, grep, find, echo, xargs, |, >, >>, <
sudo
.bash_profile, EXPORT, unset, $PATH
π’ Containers
Linux cgroups.
Dockerfile. Docker Desktop.
docker-compose
Basic understanding of Kubernetes. Kubectl
Anything I leave out that's important?
π’ SQL/data modeling
Install Postgres. Load dummy data with \copy
Understand ACID
Different consistency levels: Strict, Linearizable, Serializable, Eventual, Casual
Aggs with GROUP BY.
Joins: INNER, LEFT.
Filter records with WHERE. Filter groups with HAVING.
Aggs persisted to summary tables.
ORDER BY foo LIMIT n --> top n by foo.
Organize queries with CTEs and temp tables. Subselect sparingly
Window functions to look at trends within arbitrary partitions.
Unique constraints. Foreign key constraints.
Dimensions vs Fact tables. Star, snowflake schemas.
EXPLAIN <query>. Read the plan.
Add an index. How does the EXPLAIN change?
π’ CompSci concepts you must know
Uniformly distributed hash functions.
Tuples vs Arrays.
Iterator design pattern.
External Merge sort.
Bit vectors, HashMap. Sets.
Directed graphs.
Serialization
π’ Time representation
ISO 8601 strings
Prefer native date and timestamp objects.
UTC.
π’ Fundamentals of distributed data processing
Cap theorem.
Replication.
Horizontal vs Vertical scaling.
Sharding vs clustering.
Share-nothing principle.
Partitioning data at-rest.
Planning for fault tolerance and failure.
Exponential back off retries.
Latency vs bandwidth.
How to code to iterators without the input fully buffered in memory.
π’ Spark
Dataframe and Dataset batch API.
Shuffle challenges. When to repartition().
Runtime partitioning by key.
Tuning: executors num, memory, cores. Num shuffle partitions.
Dataframe ops split up into composable driver functions.
Construct fixture Dataframes for unit tests.
Create temporary views as staging tables.
Data quality checks over staging views. Lots of logging.
Fail driver if checks fail.
Block any downstream dependencies until the checks pass.
Hive metastore for data lake catalog.
π’ Common structured data formats
TSV, Tar, Zip, XLSX, Avro, Parquet, Protobufs, Delta
Splittable file format?
Compression: Bz2. Gzip, Snappy, LZO.
Hadoop mergeUtils to help small files problem.
π’ Cloud providers
Top players for distributed computing: AWS, GCP, Azure
S3/GCS object stores
Managed Airflow = pipelining
EMR/Dataproc = processing
Container registries. EKS/GKE for container execution
IAM access controls. As-needed access. No shared service creds.
π’ Languages (1 - 10 goal proficiency)
SQL (10)
Python (8)
Java (7)
Scala (4)
Bash (4)
π’ Metrics/monitoring/alerting
Statsd, Prometheus.
HTTP vs. UDP
Gauge vs increment vs histogram.
Measurement vs tag.
Visualize measurement time series: Grafana, Datadog.
Observe what normal is. Alert on abnormal thresholds.
Alert messaging clarity.
Runbooks for on-call.
π’ Linux familiarity
tail -f
tmux
ps -ef, kill -9
vim
ssh, scp
cat, less, more, grep, find, echo, xargs, |, >, >>, <
sudo
.bash_profile, EXPORT, unset, $PATH
π’ Containers
Linux cgroups.
Dockerfile. Docker Desktop.
docker-compose
Basic understanding of Kubernetes. Kubectl
Anything I leave out that's important?
β€5