Spark configuration
There are many ways to set configuration properties in Spark, and I keep getting confused about which is the best place to put them.
Among all the ways you can set Spark properties, the priority order determines which values will be respected.
Based on the loading order:
▪️Any values or flags defined in the spark-defaults.conf file are read first.
▪️Then the values specified on the command line via spark-submit or spark-shell.
▪️Finally, the values set through SparkSession in the Spark application.
All these properties are merged, with duplicate properties discarded, in the Spark application. So, for example, the values provided on the command line override the settings in the configuration file, unless they are in turn overridden in the application itself.
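A minimal sketch of the three levels, assuming a plain PySpark job (the property and values are just for illustration):

# 1. conf/spark-defaults.conf - read first:
#    spark.sql.shuffle.partitions  200
#
# 2. Command line - overrides the defaults file:
#    spark-submit --conf spark.sql.shuffle.partitions=400 my_job.py
#
# 3. SparkSession in the application - highest priority:
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("config-precedence-demo")
    .config("spark.sql.shuffle.partitions", "800")  # wins over 1 and 2
    .getOrCreate()
)

# check which value survived the merge
print(spark.conf.get("spark.sql.shuffle.partitions"))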
#spark #big_data
PySpark documentation will follow the numpydoc style. I don't see why - the current Python docs for Spark have always been fine, and more readable than any of the Java docs.
So this:
"""Specifies some hint on the current :class:DataFrame.
:param name: A name of the hint.
:param parameters: Optional parameters.
:return: :class:DataFrame
will be something like this:
"""Specifies some hint on the current :class:DataFrame.
Parameters
----------
name : str
A name of the hint.
parameters : dict, optional
Optional parameters
Returns
-------
DataFrame
Probably it's gonna mean more readable HTML and better linking between pages. We'll see.
#spark #python
PySpark configuration provides the spark.python.worker.reuse option, which can be used to choose between forking a Python process for each task and reusing an existing process. If it equals true, a process pool is created and reused on the executors. This should help avoid expensive serialization, data transfer between the JVM and Python, and even garbage collection. Although, this is more an impression than a result of systematic tests.
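A tiny sketch of flipping it (in recent Spark versions the option already defaults to true, so this mostly matters if you want to turn reuse off):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.python.worker.reuse", "false")  # fork a fresh Python worker per task
    .getOrCreate()
)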
#spark #big_data
partitionOverwriteMode
Sometimes it is necessary to overwrite partitions that Spark failed to process, and you need to run the job again to have all the data written correctly.
There are two options here:
1. process and overwrite all the data
2. process and overwrite data only for the relevant partitions
The first option sounds very dumb - doing all the work all over again. But to do the second option you need to rewrite the job. Meh - more code means more problems.
Luckily Spark has a parameter, spark.sql.sources.partitionOverwriteMode, with the option dynamic. It only overwrites data for the partitions present in the current batch. This configuration works well in cases where it is possible to overwrite external table metadata with a simple CREATE EXTERNAL TABLE when writing data to an external data store such as HDFS or S3.
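A minimal sketch of a dynamic partition overwrite (the DataFrame, the date column, and the output path are made up for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

df = spark.createDataFrame([("2021-01-28", 1)], ["date", "value"])

# Only the date partitions present in df are overwritten;
# all other partitions under the target path stay untouched.
(
    df.write
      .mode("overwrite")
      .partitionBy("date")
      .parquet("s3://my-bucket/events/")  # hypothetical output path (could equally be HDFS)
)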
#spark #big_data
Performance Monitoring Tools for Spark aside from print statements
https://supergloo.com/spark-monitoring/spark-performance-monitoring-tools/
#spark #big_data
I've combined my experience of publishing a book into a post.
Machine Learning roadmap - an interesting overview of the world of ML. A lot of information and links that will help you not get lost.
Link: 2020 Machine Learning Roadmap (87% valid for 2024)
#ml
Boston Dynamics has shown how the robot dog Spot's arm works.
Now that Spot has an arm in addition to its legs and cameras, it can do mobile manipulation. It finds and picks up objects (trash), tidies up the living room, opens doors, operates switches and valves, tends the garden, and generally has fun.
https://youtu.be/6Zbhvaac68Y
Niiiiiiice. Fresh vulnerability for Linux:
https://www.sudo.ws/alerts/unescape_overflow.html
Sudo before 1.9.5p2 has a heap-based buffer overflow, allowing privilege escalation to root via "sudoedit -s" and a command-line argument that ends with a single backslash character.
Twitter is opening up its full tweet archive to academic researchers for free.
The company is now opening up access to independent researchers or journalists, but you'll have to be a student or part of an academic institution.
Twitter also says it will not be providing access to data from accounts that have been suspended or banned, which could complicate efforts to study hate speech, misinformation, and other types of conversations that violate Twitter rules.
The Verge
Folks at AWS published a really great resource for anyone who is designing cloud architecture. Even if you are already using, or thinking about, Azure or GCP, it is a really good read, and it is not your typical sleep-inducing dry white paper.
AWS did an awesome job packing a lot of practical recommendations, best practices, tips, and suggestions into this document.
The AWS Well-Architected Framework focuses on 5 pillars:
✓Operational Excellence
✓Security
✓Reliability
✓Performance Efficiency
✓Cost Optimization
#aws #big_data
For any error you can say that the cause is somewhere between the monitor and the chair - and it's true, but it doesn't help fix the error in any way. To stand out today you need to bring both hard and soft skills to the table.
https://luminousmen.com/post/soft-skills-guide-for-software-engineer
#soft_skills
Often I rewrite my old articles to align them with my current understanding, and very often I find that I misunderstood concepts or omitted important questions.
But sometimes when I rewrite, I realize that originally everything was correct - I'm the one thinking bullshit. It's a funny feeling.
It's exhausting to be a perfectionist. Don't be one. But read the article:
Data Lake vs Data Warehouse
Big Data and Go. It's started
An HDFS client written in Go. Good for scripting. Since it doesn't have to wait for the JVM to start up, it's also a lot faster than hadoop -fs.
https://github.com/colinmarc/hdfs
#big_data
Forwarded from Инжиниринг Данных (Dmitry Anoshin)
Kaggle State of Machine Learning and Data Science 2020.pdf
14 MB
Kaggle State of Machine Learning and Data Science 2020
What does the not operator do? It simply yields True if its argument is false, and False otherwise. It turns out it's pretty hard to determine what true is.
When you look at the C implementation, the rule seems to be:
1. If True, then True;
2. If False, then False;
3. If None, then False;
4. Whatever __bool__ returns, as long as it's a subclass of bool;
5. Calling len() on the object - True if greater than 0, otherwise False;
6. If none of the above applies, then True.
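A quick sketch of how those rules play out with not (the class names are made up):

class Empty:
    def __len__(self):
        return 0          # rule 5: len() == 0, so falsy

class Always:
    def __bool__(self):
        return True       # rule 4: whatever __bool__ returns

print(not None)      # True  (rule 3)
print(not Empty())   # True  (len() is 0)
print(not Always())  # False (__bool__ says truthy)
print(not object())  # False (rule 6: no __bool__ or __len__, truthy by default)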
An in-depth article on the `not` operator in Python from a core developer: Unravelling `not` in Python
#python
Data engineering in 2020-2021
Another view on the Data Management landscape. There are 9 mentions of SQL and 5 mentions of BI in the article. SQL is required knowledge for a data engineer, but it's by no means the only requirement nowadays.
The author sees the future of Data Management as a move towards SQL engines and outsourcing the complexity to the platforms. Unfortunately, that's probably true.
Although:
▪️In practice, engineers spend most of the time on the letter "T" in ETL (and not only using SQL). For example, the most popular data processing framework, Spark, is much more than just RDDs today.
▪️Those emerging platforms cost a pile of money now. For example, AWS was born because of the huge maintenance cost of the Oracle platform.
▪️I'm very sceptical of tools that claim "everyone can build a data product in several easy steps".
Article
Working with data in big data always involves some kind of complexity related to its size, storage, and processing. What skills are needed to deal with it?
https://luminousmen.com/post/data-challenges-in-big-data
PEP: 585
Started trying out the new Python 3.9 release. I don't follow the features that much, but there are things that piss me off, like the implementation of static typing in Python.
Static typing has been built on top of the existing Python runtime incrementally over time. As a consequence, collection hierarchies got duplicated, as an application could use the types from the typing module at the same time as the built-in ones.
This created a bit of confusion, as we had two parallel type systems - not really competing with each other, but we always had to keep an eye out for that parallelism.
Well, now this is over.
Examples of types that previously had to be imported to be used are List, Dict, Set, Tuple, Optional. Right now you can just use the general built-in list, dict, set, tuple, etc. instead:
>>> import typing as T
>>> issubclass(list, T.List)
True
These types can also be parameterized. A parameterized type is a generic with the expected types of the container elements specified, e.g. list[str].
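A small sketch of what that looks like on 3.9 (the function and names are made up for illustration):

# Python 3.9+: built-in collections can be used as generics directly,
# no "from typing import List, Dict" needed.
def tag_scores(names: list[str]) -> dict[str, float]:
    return {name: 0.0 for name in names}

alias = list[str]
print(alias)             # list[str]
print(alias.__origin__)  # <class 'list'>
print(alias.__args__)    # (<class 'str'>,)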
PEP 585
#python