Am Neumarkt 😱
Machine learning and other gibberish
Archives: https://datumorphism.leima.is/amneumarkt/
#DS

Hullman J, Gelman A. Designing for interactive exploratory data analysis requires theories of graphical inference. Harvard Data Science Review. 2021. doi:10.1162/99608f92.3ab8a587
https://hdsr.mitpress.mit.edu/pub/w075glo6/release/2


Creating visualizations seems like a creative task: at least for entry-level visualization work, we follow our hearts and build whatever is needed. However, visualizations are made for different purposes. Some are pure exploration, helping us get a feel for the data; others are built to validate hypotheses. These are very different things.

Confirming an idea using charts is usually hard. In most cases, we need statistical tests to (dis)prove a hypothesis rather than just looking at the charts. Thus, visualization becomes a tool that helps us formulate good questions.

However, not everyone uses charts as hints only. Many draw conclusions directly from charts, and as a result even experienced analysts arrive at spurious conclusions. Such so-called insights are not very solid.

Visual analysis seems to be an adversarial game between humans and visualizations. There are many different models for this process. A crude, probably stupid, model can be illustrated with an example: analyzing the histogram of a variable.
1. The histogram looks like a bell and is symmetric. It is centered at 10 with an FWHM of 2.6. Since FWHM ≈ 2.355σ for a Gaussian, I guess this is a Gaussian distribution with mean 10 and a sigma of roughly 1. This guess is the posterior p(model | chart).
2. Imagine a curve like the one just guessed drawn on top of the original histogram. Would my guess and the actual curve overlap?
3. If not, what do we have to adjust? Do we need to introduce another parameter?
4. Guess the parameters of the new distribution model and compare it with the actual curve again.
The above process is very similar to iterative Bayesian inference, though the actual analysis may be much more complicated, as analysts carry a lot of prior knowledge about the data-generating process.
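To make this loop concrete, here is a minimal sketch of one guess-and-compare round in Python; the data, the bin count, and the guessed parameters are all made up for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(42)
data = rng.normal(loc=10, scale=1.1, size=2000)  # stand-in for real data

# Step 1: look at the histogram and guess a model from its shape.
fig, ax = plt.subplots()
ax.hist(data, bins=40, density=True, alpha=0.5, label="data")

# Step 2: overlay the guessed Gaussian (mean 10, sigma ~ FWHM / 2.355).
x = np.linspace(data.min(), data.max(), 200)
ax.plot(x, stats.norm.pdf(x, loc=10, scale=2.6 / 2.355), label="guessed model")

# Steps 3-4: if the two curves do not overlap well, adjust the
# parameters (or the model itself) and compare again.
ax.legend()
plt.show()
```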

Through this example, we see that integrating exploration with preliminary model building, in the spirit of confirmatory data analysis, may give us more confidence when drawing insights from charts.

On the other hand, including complicated statistical models can lead to misinterpretation, since not everyone is familiar with statistical hypothesis testing. So the complexity has to be balanced.
#DS #ML

Microsoft created two repositories for machine learning and data science beginners. They made many sketches, and I love this style.

https://github.com/microsoft/Data-Science-For-Beginners

https://github.com/microsoft/ML-For-Beginners
#DS #news

This is a post about Zillow's Zestimate model.

Zillow (https://zillow.com/) is an online real-estate marketplace and a big player. But last week, Zillow withdrew from the house-flipping market and planned to lay off a handful of employees.

There are rumors that this move is related to their machine-learning-based price estimation tool, Zestimate (https://www.zillow.com/z/zestimate/).

At first glance, Zestimate seems fine. Though the metrics shown on the website may not be that convincing, I am sure they have benchmarked more metrics than those shown there.
There are some discussions on Reddit.

Anyway, this is not the best story for data scientists.

1. News: https://www.reddit.com/r/MachineLearning/comments/qlilnf/n_zillows_nnbased_zestimate_leads_to_massive/
2. This is Zestimate: https://www.zillow.com/z/zestimate/
3. https://www.wired.com/story/zillow-ibuyer-real-estate/
#DS #Visualization

Okay, I'll tell you the reason I wrote this post. It is because xkcd made [this](https://xkcd.com/2537/).

---

Choosing proper colormaps for our visualizations is important. It is much like taking a photo with your phone: some phones capture details in every corner, while others give us overexposed photos with no detail left in the bright regions.

A proper colormap should make sure we see the details we need to see. To demonstrate the importance of colormaps, let's look at the two examples shown on the colorcet website[^colorcet]. The two colormaps, hot and fire, can be found in matplotlib[^matplotlib-colormaps] and colorcet, respectively.

I cannot post multiple images in one message, so please see the full post for a comparison of the two colormaps. Really, it is amazing. Find the link below:
https://github.com/kausalflow/community/discussions/20


It is clear that "hot" introduces some overexposure. The other colormap, "fire", is a so-called perceptually uniform colormap. More experiments can be found in the colorcet documentation[^colorcet-github]. Glasbey et al. show some examples of inspecting different properties with different colormaps[^Glasbey2007].


One way to make sure a colormap shows enough detail is to use perceptually uniform colormaps[^Kovesi2015]. Kovesi also provides a method to validate whether a colormap has uniform perceptual contrast[^Kovesi2015].
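If you want to reproduce the comparison yourself, here is a minimal sketch; the test image is made up, and `cet_fire` is, to my knowledge, the name colorcet registers with matplotlib on import:

```python
import numpy as np
import matplotlib.pyplot as plt
import colorcet  # noqa: F401  (importing registers the "cet_*" colormaps)

# A smooth test image with structure in both dim and bright regions.
x, y = np.meshgrid(np.linspace(-3, 3, 400), np.linspace(-3, 3, 400))
z = np.exp(-(x**2 + y**2) / 4) * np.sin(8 * np.hypot(x, y))

# Same data, two colormaps: "hot" washes out the bright center,
# while the perceptually uniform "cet_fire" keeps the rings visible.
fig, axes = plt.subplots(1, 2, figsize=(9, 4))
for ax, cmap in zip(axes, ["hot", "cet_fire"]):
    im = ax.imshow(z, cmap=cmap)
    ax.set_title(cmap)
    fig.colorbar(im, ax=ax)
plt.show()
```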

---
References and links mentioned in this post:

[^colorcet]: Anaconda. colorcet 1.0.0 documentation. [cited 12 Nov 2021]. Available: https://colorcet.holoviz.org/
[^colorcet-github]: holoviz. colorcet/index.ipynb at master · holoviz/colorcet. In: GitHub [Internet]. [cited 12 Nov 2021]. Available: https://github.com/holoviz/colorcet/blob/master/examples/index.ipynb
[^Kovesi2015]: Kovesi P. Good Colour Maps: How to Design Them. arXiv [cs.GR]. 2015. Available: http://arxiv.org/abs/1509.03700
[^Glasbey2007]: Glasbey C, van der Heijden G, Toh VFK, Gray A. Colour displays for categorical images. Color Research & Application. 2007. pp. 304–309. doi:10.1002/col.20327
[^matplotlib-colormaps]: Choosing Colormaps in Matplotlib — Matplotlib 3.4.3 documentation. [cited 12 Nov 2021]. Available: https://matplotlib.org/stable/tutorials/colors/colormaps.html
#DS #visualization

https://percival.ink/

A new lightweight language for data analysis and visualization. It looks promising.

I hate Jupyter notebooks and don't use them in most of my projects. One reason is low reproducibility due to their non-reactive nature: if you change some old cells and forget to re-run a cell below, you may read wrong results.
This new language is reactive: when old cells change, the related results are updated automatically.
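A minimal, hypothetical illustration of the stale-state hazard:

```python
# Cell 1
x = 1

# Cell 2
y = x + 1  # runs once, so y == 2

# Later you edit Cell 1 to `x = 10` and re-run it, but forget Cell 2:
# the notebook still reports y == 2, a stale result. A reactive
# notebook re-evaluates Cell 2 automatically, so y would become 11.
```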
#data #ds

Disclaimer: I'm no expert in state diagrams or statecharts.

It might be something trivial, but I find this useful: combined with some techniques from statecharts (something frontend people like a lot), state diagrams are a great way to document what our data goes through during data (pre)processing.

For complicated data transformations, we can draw the corresponding state diagram and follow the code to make sure it works as expected. The only difference is that we focus on the states of the data, not of any other system.

We can borrow techniques from statecharts, such as hierarchies and parallel states.

A state diagram is better than a flowchart in this scenario because we are more interested in the different states of the data. State diagrams automatically highlight those states, so we can easily spot the relevant part of the diagram without having to trace from the beginning.

I have already documented some data transformations using state diagrams. I haven't tried it yet, but it might also help us document our ML models. A small sketch of the idea follows below.
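As a toy illustration (the states, transitions, and checker here are all hypothetical), the state diagram of a pipeline can even double as a runnable check on the code:

```python
from enum import Enum, auto

# Hypothetical states of a dataset in a (pre)processing pipeline.
class DataState(Enum):
    RAW = auto()
    DEDUPLICATED = auto()
    IMPUTED = auto()
    NORMALIZED = auto()

# The state diagram as a transition table: which states may follow which.
TRANSITIONS = {
    DataState.RAW: {DataState.DEDUPLICATED},
    DataState.DEDUPLICATED: {DataState.IMPUTED},
    DataState.IMPUTED: {DataState.NORMALIZED},
    DataState.NORMALIZED: set(),
}

def check_pipeline(states: list[DataState]) -> None:
    """Raise if the pipeline visits data states in a forbidden order."""
    for current, nxt in zip(states, states[1:]):
        if nxt not in TRANSITIONS[current]:
            raise ValueError(f"illegal transition: {current} -> {nxt}")

# The documented path through the diagram for one dataset:
check_pipeline([DataState.RAW, DataState.DEDUPLICATED,
                DataState.IMPUTED, DataState.NORMALIZED])
```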


References:
1. https://statecharts.dev
2. https://en.wikipedia.org/wiki/State_diagram