Spark in me
Lost like tears in rain. DS, ML, a bit of philosophy and math. No bs or ads.
Using Voila With The Power of Vue.js

Remember the last post about python dashboard / demo solutions?

Since then we tried Voila and Streamlit.
Streamlit is very cool, but you cannot build really custom things with it. It only works if it supports 100% of the widgets you need, and there are always issues with caching.
It is also painful to change the default appearance.

Remember that Voila "In theory has customizable grid and CSS"?
Of course someone took care of that!

Enter ipyvuetify + voila-vuetify.
TLDR - this allows you to have Voila demos with the Vuetify UI library (built on Vue), all in 100% python.

Problems:

- No native table widget / method to show and UPDATE pandas tables. There are solutions that load Vue UI tables, but no updates out-of-the-box
- All plotting libraries will work mostly fine
- All Jupyter widgets will work fine. But when you need a custom widget, you will have to either code it or find some hack with js-links or manual HTML manipulations
- Takes some time to load, will NOT scale to hundreds / thousands of concurrent users
- ipyvuetify is poorly documented, and its examples are not very intuitive

Links:

- https://github.com/voila-dashboards/voila-vuetify
- https://github.com/mariobuikhuizen/ipyvuetify

#data_science
Some Proxy Related Tips and Hacks ... Quick Ez and in Style =)

DO is not cool anymore

First of all - let's address the elephant in the room. Remember I recommended Digital Ocean?
Looks like they behave like a f**g corporation now. They now require a selfie with your passport.
F**k this. Even AWS does not need this.


Why would you need proxies?

Scraping mostly. Circumventing anal restrictions.
Sometimes there are other legit use cases like proxying your tg api requests.


Which framework to use?

None.
One of our team tried scrapy, but there is too much hassle (imho) w/o any benefits.
(apart from their corporate platform crawlera, but I still do not understand why it exists, enterprise maybe)
Just use aiohttp, asyncio, bs4, requests, threading and multiprocessing.
And just write mostly good enough code.
If you do not need to scrape 100m pages per day or use selenium to scrape JS, this is more than enough.
Really. Do not buy into this cargo cult stuff.
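
A minimal sketch of what "mostly good enough code" can look like, assuming you only need page titles from a list of URLs. To stay dependency-free it swaps in the standard library (urllib + html.parser + a thread pool) for requests / bs4; the names TitleParser, fetch, page_title and scrape_titles are made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from html.parser import HTMLParser
from urllib.request import urlopen


class TitleParser(HTMLParser):
    """Collects the text inside the <title> tag of a page."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def fetch(url, timeout=10):
    """Download one page and return its body as text."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def page_title(html):
    """Extract the stripped <title> text from an HTML string."""
    parser = TitleParser()
    parser.feed(html)
    return parser.title.strip()


def scrape_titles(urls, fetch=fetch, workers=8):
    """Fetch pages in parallel threads; map each URL to its title or its error."""

    def work(url):
        try:
            return url, page_title(fetch(url))
        except Exception as exc:  # keep going, report the failure per URL
            return url, exc

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(work, urls))
```

Passing fetch as a parameter makes it easy to plug in a proxied or retried downloader later, or a fake one for testing.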


Video

For video-content there are libraries:

- youtube-dl - lots of features, horrible python API, nice CLI, it really works
- pytube - was really cool and pythonic, but author abandoned it. Most likely he just wrote a ton of regexp that he decided not to support. Some methods still work though

Also remember that many HTTP libraries have HTTP / SOCKS5 proxy support.
If the libraries are old, this may be supported via env variables (e.g. http_proxy / https_proxy).
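
A hedged sketch of both options, assuming the requests library for the actual download (SOCKS support needs pip install requests[socks]). The helper names and the 203.0.113.10 address are placeholders, not real endpoints.

```python
import os
import urllib.request
from urllib.parse import quote


def socks5_proxies(host, port, user=None, password=None):
    """Build a requests-style proxies dict pointing at one SOCKS5 proxy."""
    auth = ""
    if user is not None:
        # Percent-encode credentials so special characters survive in the URL.
        auth = f"{quote(user, safe='')}:{quote(password or '', safe='')}@"
    url = f"socks5://{auth}{host}:{port}"
    return {"http": url, "https": url}


def fetch_via_proxy(url, proxies, timeout=10):
    """Download a page through the proxy (requires requests[socks])."""
    import requests  # imported lazily so the helper above works without it

    return requests.get(url, proxies=proxies, timeout=timeout).text


# Option 1: pass the proxy explicitly.
proxies = socks5_proxies("203.0.113.10", 1080, "user", "password")

# Option 2: older tools usually just read the conventional *_proxy variables.
os.environ["http_proxy"] = "http://203.0.113.10:3128"  # placeholder address
env_proxies = urllib.request.getproxies_environment()
```

Usage would then be something like fetch_via_proxy("https://example.com", proxies).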


Where to get proxies?

The most interesting part.
There are "dedicated" services (e.g. smartproxy / luminati.io / proxymesh / bestproxy.ru / mobile proxy services).
And they probably are the only option when you scrape Amazon.
But if you need high bandwidth and many requests - such proxies usually have garbage speed.
(and morally - probably 50% of them are hacked routers)
Ofc there is scrapoxy.io/ - but this is just too much!


Enter Vultr

They have always been a DO look-alike service.

I found a simple hacky way to get 10-20-40 proxies quickly.
You can just use Vultr + Ubuntu 18.04 docker image + write a plain startup script.
That is it. Literally.

With Docker already installed, your script may look something like this:

docker run -d --name socks5_1 -p 1080:1080 -e PROXY_USER=user -e PROXY_PASSWORD=password serjs/go-socks5-proxy && \
docker run -d --name socks5_2 -p 1081:1080 -e PROXY_USER=user -e PROXY_PASSWORD=password serjs/go-socks5-proxy
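
For 10-20-40 proxies it is easier to generate these lines than to type them. A small sketch, assuming the same image and the same naming / port pattern as in the two commands above; socks5_startup_script is a hypothetical helper name.

```python
import shlex


def socks5_startup_script(n, user="user", password="password",
                          first_port=1080, image="serjs/go-socks5-proxy"):
    """Return a shell startup script that launches n SOCKS5 proxy containers."""
    commands = [
        f"docker run -d --name socks5_{i} -p {first_port + i - 1}:1080 "
        f"-e PROXY_USER={shlex.quote(user)} "
        f"-e PROXY_PASSWORD={shlex.quote(password)} {image}"
        for i in range(1, n + 1)
    ]
    # Chain the commands with "&& \" exactly like the hand-written version.
    return " && \\\n".join(commands) + "\n"


print(socks5_startup_script(2))
```

Each container listens on its own host port (1080, 1081, ...), so one server gives you n separate proxy endpoints.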

There are cheaper hosting alternatives - Vultr is quite expensive.
But the startup script feature + Docker images really save time.
And they now have all the nice features - backups / snapshots / startup scripts / ez scaling / etc / etc w/o the corporate bs.
Also use my Give $100, Get $25 link!

Beware - new accounts may be limited to 5 servers per account.
You may want to create several accounts at first.

#data_science
#scraping
Finally Migrating to JupyterLab?

TLDR - there is no killer feature, most likely this is just future-proofing.

With these plugins (some of which even work with the latest version of JupyterLab) you can finally migrate:

- jupyterlab_filetree
- toc
- collapsible_headings

Extensions can be installed using jupyter labextension install. Depending on your conda installation, sometimes you can even install them in your JupyterLab UI.

Migration from Notebook

Just add this and replace how you run your notebook:

RUN conda install -y -c conda-forge jupyterlab && \
    conda install -y nodejs
...

CMD jupyter lab --port=8888 --ip=0.0.0.0 --no-browser

Obvious Downsides

- Looks like it is slower than notebooks (most annoying factor)
- No clear UX improvement, text editors are worse than IDEs, notebooks are the same
- Terminal is much less useful than a standard Linux terminal or PuTTY
- With larger / more structured notebooks it crashes
- Most likely JupyterHub will continue to work with notebooks


I understand that pushing code to tested modules / having more smaller notebooks is preferable, but now that I have given this a test, most likely I will migrate only when forced to.


#data_science
Pandas Official Guide

Pandas now has a human readable best practices guide!

https://pandas.pydata.org/pandas-docs/stable/user_guide/

#data_science
Building Hyper Professional Looking PDFs in One Shell Command

You know, there are 2 types of people - those who value form over substance and those who value substance over form.

I really like writing my documents in markdown and using a VCS to store them, but many people do not understand this.

Enter Pandoc

You can build very professional-looking, almost whitepaper-quality PDF documents with a single shell command using pandoc.

Its original template kind of sucks (also do not get me started on LaTeX and its witnesses) and shows its age. But I found a perfect solution - the Eisvogel pandoc template.

It takes some fiddling with pandoc params, but in the end it is worth the effort.

- https://github.com/Wandmalfarbe/pandoc-latex-template
- https://pandoc.org/MANUAL.html

With this, your command may look like this:

pandoc \
meeting.md -o \
meeting.pdf \
--from markdown \
--template eisvogel \
--pdf-engine=xelatex \
--highlight-style pygments


And voila, you have a perfect investment-bank-looking document.

Enjoy!

#data_science
Notebooks + Spreadsheets

Notebooks and spreadsheets (Excel or Google Sheets) have always been the two most useful and helpful instruments I have ever used. Whole companies were built on pseudo-relational Excel databases (this ofc does not scale well).

Now there is a new library in python (ipysheet) that integrates some JS tables library seamlessly with ipywidgets and notebooks. It is new and predictably sucks a little bit (as do most interactive tables in JS).

It goes without saying that it opens up a lot of possibilities for ML annotation - you can essentially combine tables and ipywidgets easily.

As far as I can see, it does not have an option to embed HTML code, but an Audio widget recently appeared in ipywidgets (buried in the release notes somewhere).

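So you can just use this to load audio into ipysheet:
from ipywidgets import Audio

# read the raw bytes of the wav file
with open('test.wav', 'rb') as f:
    wavb = f.read()
# build the audio widget from the bytes
audio = Audio(value=wavb, format='wav', autoplay=False)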

#data_science
Excel in Notebooks

This notebook tool looks awesome.
It enables you to replicate the most useful Excel functionality within notebook pandas dataframes:

- https://github.com/quantopian/qgrid

Was looking for something like this for a long time!

Similar tools I saw before required some fiddling (just search spreadsheets / excel on the channel), this one just works with an existing pandas dataframe! You can also hack in HTML elements and there are ample callbacks for your custom functionality.

#data_science
Compressed Feather in Pandas

A nifty feature in pandas I totally missed - you can save compressed data frames not only as .csv, but also as .feather. This reduces file size 4-5x for repetitive data.

- Pandas to feather doc
- Pyarrow to feather doc

#data_science
Serialization of Standard Python Data Type in PyArrow

Some time ago, when I tried to save a pandas dataframe to feather or parquet with something like:

[1, 2, 3]

in some cells, pyarrow broke and refused to work.

Now it works at least with python lists and dicts.
It is very cool. Combined with compression='zstd' it is also fast and provides some really needed compression for text data.

The downside is that, for example, instead of a list of lists you get a nested numpy array after reading the file back, but this is a small price to pay, right?

#data_science