Trying to migrate to JupyterLab from Jupyter Notebook?
Some time ago I noticed that the Jupyter extensions project was more or less frozen => JupyterLab is obviously trying to shift community attention to npm / nodejs plugins.
So I tried to switch again (the last attempt was 6-12 months ago).
This time Lab is more mature:
- Now at version >1;
- Now it has a built-in package manager;
- It has some of the most necessary extensions (e.g. git, toc, google drive, etc);
- The UI got polished a bit, but a window in a window still produces a bit of mental friction. Only the most popular file formats are supported. The text editor inherited the best features, but it is still a bit rudimentary;
- Full screen width by default;
- Some useful things (like code folding) are now turned on in a settings json file;
- Using these extensions is a bit of a chore in edge cases (e.g. some user permission problems / you have to re-build the app each time you add an extension);
But I could not switch, mostly for one reason - collapsible sections are not supported yet:
- https://github.com/jupyterlab/jupyterlab/issues/2275#issuecomment-498323475
If you have a Jupyter environment it is very easy to switch. For me, before it was:
# 5.6 because otherwise I have a bug with installing extensions
RUN conda install notebook=5.6
RUN pip install git+https://github.com/ipython-contrib/jupyter_contrib_nbextensions && \
jupyter contrib nbextension install --user
CMD jupyter notebook --port=8888 --ip=0.0.0.0 --no-browser
And it just became:
RUN conda install -c conda-forge jupyterlab
CMD jupyter lab --port=8888 --ip=0.0.0.0 --no-browser
#data_science
Full IDE in a browser?
Almost)
You all know all the pros and cons of:
- IDEs (PyCharm);
- Advanced text editors (Atom, Sublime Text);
- Interactive environments (notebook / lab, Atom + Hydrogen);
I personally dislike local IDEs, and not because connecting to a remote machine / remote kernel / remote interpreter is a chore. Setting up is easy, but always thinking about what is synced and what is not is just pain. Also when your daily driver machine is on Windows, using the Linux subsystem all the time with Windows paths is just pain. (Also I dislike bulky interfaces, but this is just a habit and it depends).
But what if I told you there is one more option? =)
Especially if you work as a team on a remote machine / set of machines?
TLDR - with Theia you can run a modern web "IDE" (it is something between Atom and a real IDE - less bulky, but with fewer features) in a browser.
Now you can just run it with one command.
Pros:
- It is open source (though shipped as a part of some enterprise packages like Eclipse Che);
- Pre-built images are available;
- It is extensible - new modules get released - you can build it yourself or just find a build;
- It has extensive linting and a python language server (just for the standard library though);
- It has full text search ... kind of;
- Go to definition works in your code;
- Docstrings and auto-complete work for your modules and the standard library (not for your installed packages);
Looks cool af!
If they ship a build with a remote python kernel, then it will be a perfect option for teams!
I hope it will not follow the path taken by another crowd favourite, a similar web editor (it was purchased by Amazon).
Links
- Website;
- Pre-built apps for python;
- Language server they are using;
#data_science
www.theia-ide.org
Theia - Cloud and Desktop IDE
Theia is an open-source cloud desktop IDE framework implemented in TypeScript.
An ideal remote IDE?
Joking?
No, looks like VS Code recently got its remote development extensions (a couple of months ago they were available only in the insiders build) working just right.
I tried the remote-ssh extension and it looks quite polished. No more syncing your large data folders and installing all python dependencies locally for hours.
The problem? It took me an hour just to open an ssh session under Windows properly (permissions and Linux folder path substitution are hell on Windows). Once I opened it - it worked like a charm.
So for now (it is personal) the best tools in my opinion are:
- Notebooks - for exploration and testing;
- VS Code - for the codebase;
- Atom - for local scripts;
#data_science
Managing your DS / ML environment neatly and in style
If you have a sophisticated environment that you need to do DS / ML / DL, then using a set of Docker images may be a good idea.
You can also tap into a vast community of amazing and well-maintained Dockerhub repositories (e.g. nvidia, pytorch).
But what if you have to do this for several people? And use it with a proper IDE via ssh?
Well-known features of Docker include copy on write and user "forwarding". If you approach this naively, each user will store their own images, which take quite some space.
You also have to make an ssh daemon work inside the container as a second service.
So I solved these "challenges" and created 2 public layers so far:
- Basic DS / ML layer - FROM aveysov/ml_images:layer-0 - from dockerfile;
- DS / ML libraries - FROM aveysov/ml_images:layer-0 - from dockerfile;
Your final dockerfile may look something like this just pulling from any of those layers.
Note that when building this, you will need to pass your UID as a variable, e.g.:
docker build --build-arg NB_UID=1000 -t av_final_layer -f Layer_final.dockerfile .
When launched, this launches a notebook with extensions. You can just exec into the machine itself to run scripts or use an ssh daemon inside (do not forget to add your ssh key and run service ssh start).
#deep_learning
#data_science
GitHub
gpu-box-setup/Layer0_gpu_base_apex_python.dockerfile at master · snakers4/gpu-box-setup
Contribute to snakers4/gpu-box-setup development by creating an account on GitHub.
Extreme NLP network miniaturization
Tried some plain RNNs on a custom in-the-wild NER task.
The dataset is huge - literally infinite, but manually generated to mimic in-the-wild data.
I use EmbeddingBag + 1m n-grams (an optimal cut-off). Yeah, on NER / classification it is a handy trick that makes your pipeline totally misprint / error / OOV agnostic. Also FAIR themselves recently arrived at the same idea. Very cool! Just add PyTorch and you are golden.
What is interesting:
- Model works with embedding sizes 300, 100, 50 and even 5! 5 is dangerously close to OHE, but doing OHE on 1m n-grams kind-of does not make sense;
- Model works with various hidden sizes;
- Naturally all of the models run on CPU very fast, and the smallest model is also very light in terms of its weights;
- The only difference is convergence time. It kind of scales as a log of model size, i.e. the model with embedding size 5 takes 5-7x more time to converge compared to the model with 50. I wonder what happens if I use an embedding size of 1?;
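To make the "misprint / OOV agnostic" part more concrete, here is a minimal stdlib-only sketch of how a word can be turned into a bag of character n-gram ids, which is the kind of input an EmbeddingBag layer sums over. The n-gram range (2-4), the boundary markers and the crc32 hashing into 1m buckets are my assumptions for illustration; the post only says that a cut-off of 1m n-grams is used and does not say how n-grams are mapped to rows.

```python
import zlib

NUM_BUCKETS = 1_000_000  # assumption: mirrors the "1m n-grams" cut-off from the post


def char_ngrams(word, n_min=2, n_max=4):
    """Return the set of character n-grams of a word padded with boundary markers."""
    padded = f"<{word.lower()}>"
    return {
        padded[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(padded) - n + 1)
    }


def ngram_ids(word, num_buckets=NUM_BUCKETS):
    """Hash each n-gram into a fixed id range (crc32 is stable across runs, unlike hash())."""
    return sorted({zlib.crc32(g.encode("utf-8")) % num_buckets for g in char_ngrams(word)})


def overlap(a, b):
    """Jaccard overlap between the n-gram sets of two words."""
    x, y = char_ngrams(a), char_ngrams(b)
    return len(x & y) / len(x | y)


print(overlap("moscow", "moskow"))  # a misprint still shares many n-grams
print(overlap("moscow", "berlin"))  # unrelated words share none here
```

A misspelled or unseen word still maps to mostly the same ids as its correct form, so the summed embedding stays close and no separate OOV token is needed.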
As an added bonus - you can just store such a miniature model in git w/o lfs.
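A rough back-of-the-envelope check of why the embedding size matters for storage. This sketch assumes fp32 weights and counts only the embedding table (1m rows), ignoring the RNN itself; both are my assumptions, not figures from the experiment.

```python
def embedding_size_mb(num_rows, dim, bytes_per_weight=4):
    """Size of a dense embedding table in megabytes (10**6 bytes), fp32 by default."""
    return num_rows * dim * bytes_per_weight / 1e6


for dim in (300, 100, 50, 5):
    print(dim, embedding_size_mb(1_000_000, dim), "MB")
```

Under these assumptions the table shrinks from about 1.2 GB at size 300 to about 20 MB at size 5, which is small enough to commit as a regular file.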
What is with training transformers on US$250k worth of compute credits you say?)
#nlp
#data_science
#deep_learning
Facebook
A new model for word embeddings that are resilient to misspellings
Misspelling Oblivious Embeddings (MOE) is a new model for word embeddings that are resilient to misspellings, improving the ability to apply word embeddings to real-world situations, where misspellings are common.
My foray into the STT Dark Forest
My tongue-in-cheek article on ML in general, and on how to make your STT model train 3-4x faster with 4-5x fewer weights at the same quality
https://spark-in.me/post/stt-dark-forest
#data_science
#deep_learning
#stt
Poor man's ensembling techniques
So you want to improve your model's performance a bit.
Ensembling helps. But as is ... it is useful only in Kaggle competitions, where people stack over 9000 networks trained on 100MB of data.
But for real life usage / production there exist ensembling techniques that do not require a significant increase in computation cost (!).
None of this is mainstream yet, but it may work on your dataset!
Especially if your task is easy and the dataset is small.
- SWA (proven to work, usually used as a last stage when training a model);
- Lookahead optimizer (kind of new, not thoroughly tested);
- Multi-Sample Dropout (seems like a cheap ensemble, should work for classification);
Applicability will vary with your task.
Plain vanilla classification can use all of these, s2s networks probably only partially.
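The core of SWA is just an equal-weight running average of weight snapshots taken late in training. Below is a minimal pure-python sketch of that averaging step only; the class name and the toy snapshots are hypothetical, and a real setup would use the implementation from the PyTorch ecosystem and also recompute batch-norm statistics for the averaged weights.

```python
class RunningWeightAverage:
    """Keeps an equal-weight running mean of weight snapshots (the averaging part of SWA)."""

    def __init__(self):
        self.n = 0       # number of snapshots seen so far
        self.avg = None  # current averaged weights

    def update(self, weights):
        self.n += 1
        if self.avg is None:
            self.avg = list(weights)
        else:
            # incremental mean: avg += (w - avg) / n
            self.avg = [a + (w - a) / self.n for a, w in zip(self.avg, weights)]
        return self.avg


swa = RunningWeightAverage()
# Toy snapshots, e.g. weights saved at the end of the last few epochs
for snapshot in ([1.0, 4.0], [3.0, 0.0], [2.0, 2.0]):
    swa.update(snapshot)
print(swa.avg)  # [2.0, 2.0]
```

The point is that inference still runs a single model, so the "ensemble" costs nothing extra at prediction time.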
#data_science
#deep_learning
PyTorch
Stochastic Weight Averaging in PyTorch
In this blogpost we describe the recently proposed Stochastic Weight Averaging (SWA) technique [1, 2], and its new implementation in torchcontrib. SWA is a simple procedure that improves generalization in deep learning over Stochastic Gradient Descent (SGD)…
Playing with name NER
Premise
So, I needed to pick out the street names that are actually a name + surname. Do not ask me why.
Yeah I know that maybe 70% of streets are named after people, more or less.
So you need 99% precision and at least 30-40% recall.
Or you can imagine a creepy soviet name like Трактор (Tractor).
So, today making a NER parser is easy: take your favourite framework of choice (plain PyTorch ofc).
Even use FastText or something even less true. Add data and boom, you have it.
The pain
But not so fast. Turns out there is a reason why cutting out proper names is a pain.
For Russian there is the natasha library, but since it works on YARGY (a rule-based fact extraction library), it has some assumptions about data structure.
I.e. names should be capitalized, come in pairs (name - surname), etc etc - I did not look at their rules under the hood, but I would write them like this.
So probably this would be a name - Иван Иванов
But this probably would not - ванечка иванофф
Is it bad? Ofc no, it just assumes some stuff that may not hold for your dataset.
And yeah it works for streets just fine.
Also recognizing a proper name without context does not really work. And good luck finding (or generating) corpora for that.
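To illustrate the kind of assumption described above, here is a toy rule written the way I guessed it: two capitalized Cyrillic words in a row. This is a hypothetical sketch, not the actual natasha / YARGY rules (which I did not inspect).

```python
import re

# One capitalized Cyrillic word: an upper-case letter followed by lower-case letters.
CAP_WORD = r"[А-ЯЁ][а-яё]+"
# Hypothetical rule: exactly two capitalized words separated by whitespace.
NAME_PAIR = re.compile(rf"{CAP_WORD}\s+{CAP_WORD}")


def looks_like_full_name(text):
    """True if the whole string is a capitalized (name, surname) pair."""
    return NAME_PAIR.fullmatch(text.strip()) is not None


print(looks_like_full_name("Иван Иванов"))      # True
print(looks_like_full_name("ванечка иванофф"))  # False
print(looks_like_full_name("Трактор"))          # False, a single word is never a pair
```

Such a rule is very precise on clean data and silently misses everything that is lower-cased, misspelled or a single word, which is exactly the fragility discussed here.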
Why deep learning may not work
So I downloaded some free databases with names (VK.com respects your security lol - the 100M leaked database is available, but useless, too much noise) and surnames.
Got 700k surnames of different origin, around 100-200k male and female names. Used just random words from CC + wiki + taiga for hard negative mining.
Got 92% accuracy on 4 classes (just word, female name, male name, surname) with some naive models.
... and it works .... kind of. If you give it 10M unique word forms, it can distinguish name-like stuff in 90% of cases.
But for addresses it is useless more or less and heuristics from natasha work much better.
The moral
- A tool that works on one case may be 90% useless on another;
- Heuristics have very high precision, low recall and are fragile;
- Neural networks are superior, but you should match your artificially created dataset to the real data (it may take a month to pull off properly);
- In any case, properly cracking either approach may take time, even though both heuristics and NNs are very fast to prototype. Sometimes 3 plain rules give you 100% precision with 10% recall, and sometimes generating a fake dataset that matches your domain is a no-brainer. It depends.
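For reference, this is what "100% precision with 10% recall" means in counts. The numbers below are made up purely for illustration: 1,000 streets really named after a person, of which strict rules flag 100, all of them correctly.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from true positive, false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical strict heuristic: never wrong, but finds only 100 of 1,000 names
print(precision_recall(tp=100, fp=0, fn=900))  # (1.0, 0.1)
```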
#data_science
#nlp
#deep_learning
Easiest solutions to manage configs for ML models
When you have a lot of experiments, you need to minimize your code bulk and manage model configs concisely.
(This also kind of can be done via CLI parameters, but usually these things complement each other)
I know 3 ways:
(0) dicts + kwargs + dotdicts
(1) [attrs](https://github.com/python-attrs/attrs)
(2) the new python 3.7 [dataclasses](https://docs.python.org/3/library/dataclasses.html) (which are very similar to attrs)
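A minimal sketch of option (2). The config fields are hypothetical and chosen only for illustration; the idea is that one experiment is the base config plus a few overridden fields.

```python
from dataclasses import dataclass, asdict, replace


@dataclass(frozen=True)
class ModelConfig:
    # Hypothetical fields, not taken from any real project
    embedding_size: int = 50
    hidden_size: int = 128
    lr: float = 1e-3
    dropout: float = 0.1


base = ModelConfig()
small = replace(base, embedding_size=5)  # a new experiment = base config + one override
print(asdict(small))                     # a plain dict, easy to log or dump to json
```

Making the dataclass frozen is a design choice here: a config that cannot be mutated mid-run is easier to trust when comparing experiments.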
Which one do you use?
#data_science
Streamlit vs. Voila vs. Panel vs. Dash vs. Bokeh server
TLDR - make scientific web-apps via python only w/o any web-programming (e.g. django, tornado).
Dash
- Mostly for BI
- Also a paid product
- Looks like the new Tableau
- Serving and out-of-the-box scaling options
Bokeh server
- Mostly plotting (very flexible, unlimited capabilities)
- High entry cost, though bokeh is kind of easy to use
- Also should scale well
Panel
- A bokeh server wrapper with a lot of capabilities for geo + templates
Streamlit
- The nicest looking app for interactive ML apps (maybe even annotation)
- Has pre-built styles and grid
- Limited only to its pre-built widgets
- Built on tornado with a very specific data model incompatible with the majority of available widgets
- Supposed to scale well - built on top of tornado
Voila
- If it runs in a notebook - it will run in Voila
- Just turns a notebook into a server
- The app with the most promise for DS / ML
- Scales kind of meh - you need to run a jupyter kernel for each user - also takes some time to spin up a kernel
- Fully benefits from a rich ecosystem of jupyter / python / widgets
- In theory has customizable grid and CSS, but does not come pre-built with this => higher barrier to entry
Also most of these apps have no authentication built in.
More details:
- A nice summary here;
- A very detailed pros and cons summary of Streamlit + Voila. Also a very in-depth discussion;
- Also awesome streamlit boilerplate is awesome;
#data_science
Medium
Jupyter Dashboarding — some thoughts on Voila, Panel and Dash
There are three main players in the Python dashboarding space, let’s discuss.
Collapsible Headings now in JupyterLab
- https://github.com/aquirdTurtle/Collapsible_Headings
The only thing that kept me from switching!
Please share your favourite plugins for JupyterLab!
#data_science
Using Voila With The Power of Vue.js
Remember the last post about python dashboard / demo solutions?
Since then we tried Voila and Streamlit.
Streamlit is very cool, but you cannot build really custom things with it. It has to support 100% of the widgets that you need, and there are always issues with caching.
Also it is painful to change the default appearance.
Remember that Voila "In theory has customizable grid and CSS"?
Of course someone took care of that!
Enter ipyvuetify + voila-vuetify.
TLDR - this allows you to have Voila demos with a Vue UI library, all in 100% python.
Problems:
- No native table widget / method to show and UPDATE pandas tables. There are solutions that load Vue UI tables, but no updates out-of-the-box
- All plotting libraries will work mostly fine
- All Jupyter widgets will work fine. But when you need a custom widget - you will have to either code it, or find some hack with js-links or manual HTML manipulations
- Takes some time to load, will NOT scale to hundreds / thousands of concurrent users
- ipyvuetify is poorly documented, and its examples are not very intuitive
Links:
- https://github.com/voila-dashboards/voila-vuetify
- https://github.com/mariobuikhuizen/ipyvuetify
#data_science
Some Proxy Related Tips and Hacks ... Quick Ez and in Style =)
DO is not cool anymore
First of all - let's address the elephant in the room. Remember I recommended Digital Ocean?
Looks like they behave like a f**g corporation now. They now require a selfie with your passport.
F**k this. Even AWS does not need this.
Why would you need proxies?
Scraping mostly. Circumventing anal restrictions.
Sometimes there are other legit use cases like proxying your tg api requests.
Which framework to use?
None.
One of our team tried scrapy, but there is too much hassle (imho) w/o any benefits
(apart from their corporate platform crawlera, but I still do not understand why it exists, enterprise maybe).
Just use aiohttp, asyncio, bs4, requests, threading and multiprocessing.
And just write mostly good enough code.
If you do not need to scrape 100m pages per day or use selenium to scrape JS, this is more than enough.
Really. Do not buy into this cargo cult stuff.
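A sketch of what "mostly good enough code" can look like: a thread pool that downloads pages concurrently and survives single-page failures. To keep it dependency-free I swapped in urllib from the standard library instead of requests; the function names and the worker count are my own assumptions, not a recommendation from the post.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def fetch(url, timeout=10):
    """Download one page and return its body as bytes."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def fetch_all(urls, fetch_fn=fetch, max_workers=8):
    """Fetch many URLs concurrently; a failed page yields its exception instead of raising."""
    urls = list(urls)

    def safe(url):
        try:
            return fetch_fn(url)
        except Exception as exc:  # one broken page should not kill the whole run
            return exc

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map keeps results in the same order as the input urls
        return dict(zip(urls, pool.map(safe, urls)))
```

Calling fetch_all with a list of page URLs returns a dict mapping each URL to either its body or the exception it raised, so retries can be handled in a second pass. Passing your own fetch_fn is also a convenient place to plug in a proxy or a parser.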
Video
For video content there are libraries:
- youtube-dl - lots of features, horrible python API, nice CLI, it really works
- pytube - was really cool and pythonic, but the author abandoned it. Most likely he just wrote a ton of regexp that he decided not to support. Some methods still work though
Also remember that many HTTP libraries have HTTP / SOCKS5 support.
If the libraries are old, this may be supported via env variables.
Where to get proxies?
The most interesting part.
There are "dedicated" services (e.g. smartproxy / luminati.io / proxymesh / bestproxy.ru / mobile proxy services).
And they probably are the only option when you scrape Amazon.
But if you need high bandwidth and many requests - such proxies usually have garbage speed.
(and morally - probably 50% of them are hacked routers)
Ofc there is scrapoxy.io/ - but this is just too much!
Enter Vultr
They have always been a DO look-alike service.
I found a simple hacky way to get 10-20-40 proxies quickly.
You can just use Vultr + Ubuntu 18.04 docker image + write a plain startup script.
That is it. Literally.
With Docker already installed your script may look something like this:
docker run -d --name socks5_1 -p 1080:1080 -e PROXY_USER=user -e PROXY_PASSWORD=password serjs/go-socks5-proxy && \
docker run -d --name socks5_2 -p 1081:1080 -e PROXY_USER=user -e PROXY_PASSWORD=password serjs/go-socks5-proxy
There are cheaper hosting alternatives - Vultr is quite expensive.
But this script feature + Docker images really save time.
And now they have nice features - backups / snapshots / startup scripts / ez scaling / etc / etc w/o the corporate bs.
Also use my "Give $100, Get $25" link!
Beware - new accounts may be limited to 5 servers per account.
You may want to create several accounts at first.
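Once a few such servers are up, rotating over them from a scraper takes a few lines. This is only a sketch: the server IPs are placeholders from the documentation range, the ports and credentials mirror the docker commands above, and build_proxy_pool is a made-up helper name. The proxies dict format is the one requests accepts (SOCKS support needs the requests[socks] extra).

```python
from itertools import cycle

# Hypothetical values: placeholder IPs, ports and credentials as in the docker commands.
SERVERS = ["203.0.113.10", "203.0.113.11"]
PORTS = [1080, 1081]
USER, PASSWORD = "user", "password"


def build_proxy_pool(servers, ports, user, password):
    """Return one requests-style proxies dict per (server, port) pair."""
    pool = []
    for host in servers:
        for port in ports:
            url = f"socks5://{user}:{password}@{host}:{port}"
            pool.append({"http": url, "https": url})
    return pool


pool = build_proxy_pool(SERVERS, PORTS, USER, PASSWORD)
rotation = cycle(pool)  # round-robin over all proxies

# With requests and its SOCKS extra installed (pip install "requests[socks]"):
#   import requests
#   html = requests.get("https://example.com", proxies=next(rotation), timeout=10).text
```

Plain round-robin is usually enough here; if some proxies get banned, dropping them from the pool is the obvious next step.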
#data_science
#scraping
Finally Migrating to JupyterLab?
TLDR - there is no killer feature, most likely this is just future-proofing.
With these plugins (some of which even work with the latest version of JupyterLab) you can finally migrate:
- jupyterlab_filetree
- toc
- collapsible_headings
Extensions can be installed using jupyter labextension install. Depending on your conda installation, sometimes you can even install them from the JupyterLab UI.
Migration from Notebook
Just add this and replace how you run your notebook:
RUN conda install -c conda-forge jupyterlab && \
    conda install nodejs
...
CMD jupyter lab --port=8888 --ip=0.0.0.0 --no-browser
Obvious Downsides
- Looks like it is slower than notebooks (most annoying factor)
- No clear UX improvement, text editors are worse than IDEs, notebooks are the same
- Terminal is much less useful than standard Linux terminal or Putty
- With larger / more structured notebooks it crashes
- Most likely JupyterHub will continue to work with notebooks
I understand that pushing code to tested modules / having more, smaller notebooks is preferable, but now that I have given this a test, most likely I will migrate only when forced to.
#data_science
Pandas Official Guide
Pandas now has a human readable best practices guide!
https://pandas.pydata.org/pandas-docs/stable/user_guide/
#data_science
Building Hyper Professional Looking PDFs in One Shell Command
You know, there are 2 types of people - those who value form over substance and those who value substance over form.
I really like writing my documents in markdown and keeping them in version control, but many people do not understand this.
Enter Pandoc
You can build very professional-looking, almost whitepaper-quality PDF documents with a single shell command using pandoc.
Its original template kind of sucks (do not even get me started on Latex and its witnesses) and shows its age. But I found a perfect solution - the Eisvogel pandoc template.
It takes some fiddling with pandoc params, but in the end it is worth the effort.
- https://github.com/Wandmalfarbe/pandoc-latex-template
- https://pandoc.org/MANUAL.html
With this, your command may look like this (on pandoc older than 2.0 the engine flag is called --latex-engine):
pandoc \
meeting.md -o \
meeting.pdf \
--from markdown \
--template eisvogel \
--pdf-engine=xelatex \
--highlight-style pygments
And voilà, you have a perfect investment-bank-looking document.
Enjoy!
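If you have a whole folder of markdown notes, the same command can be wrapped in a few lines of python. A sketch only: pandoc_command and build_all are made-up helper names, and it assumes pandoc (2.0 or newer), xelatex and the eisvogel template are already installed.

```python
import shutil
import subprocess
from pathlib import Path


def pandoc_command(src, template="eisvogel", engine="xelatex", style="pygments"):
    """Build the pandoc argument list for one markdown file."""
    src = Path(src)
    return [
        "pandoc", str(src),
        "-o", str(src.with_suffix(".pdf")),
        "--from", "markdown",
        "--template", template,
        f"--pdf-engine={engine}",  # pandoc >= 2.0; older releases call it --latex-engine
        "--highlight-style", style,
    ]


def build_all(folder="."):
    """Convert every .md file in `folder`; returns the list of PDFs written."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc is not on PATH")
    written = []
    for md in sorted(Path(folder).glob("*.md")):
        subprocess.run(pandoc_command(md), check=True)  # raises if pandoc fails
        written.append(md.with_suffix(".pdf"))
    return written
```

Calling build_all("docs") would then produce one PDF next to each markdown file in that (hypothetical) folder.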
#data_science
Notebooks + Spreadsheets
Notebooks and spreadsheets (Excel or Google Sheets) have always been the two most useful and helpful instruments I have ever used. Whole companies were built on pseudo-relational Excel databases (this ofc does not scale well).
Now there is a new python library, ipysheet, that integrates a JS tables library (handsontable) seamlessly with ipywidgets and notebooks. It is new and predictably sucks a little bit (as do most interactive tables in JS).
It goes without saying that it opens up a lot of possibilities for ML annotation - you can essentially combine tables and ipywidgets easily.
As far as I can see, it does not have an option to embed HTML code, but an Audio widget recently appeared in ipywidgets (buried in the release notes somewhere).
So you can just use this to load audio into ipysheet:
from ipywidgets import Audio

# read the raw bytes of the wav file and wrap them in the widget
with open('test.wav', 'rb') as f:
    wavb = f.read()
audio = Audio(value=wavb, format='wav', autoplay=False)
#data_science
Excel in Notebooks
This notebook tool looks awesome.
It enables you to replicate most of the useful Excel functionality on pandas dataframes right inside a notebook:
- https://github.com/quantopian/qgrid
Was looking for something like this for a long time!
Similar tools I saw before required some fiddling (just search spreadsheets / excel on the channel), this one just works with an existing pandas dataframe! You can also hack in HTML elements and there are ample callbacks for your custom functionality.
#data_science
Compressed Feather in Pandas
A nifty feature in pandas I totally missed - saving not only .csv data frames compressed, but also .feather ones. Reduces file size 4-5x for repetitive data.
- Pandas to feather doc
- Pyarrow to feather doc
#data_science
Serialization of Standard Python Data Types in PyArrow
Some time ago, when I tried to save a pandas dataframe to feather or parquet and I had something like [1, 2, 3] in some cells, pyarrow broke and refused to work.
Now it works at least with python lists and dicts.
It is very cool. Combined with compression='zstd' it is also fast and provides some really needed compression for text data.
The downside is that, for example, instead of a list of lists you get a nested numpy array after reading the file back, but this is a small price to pay, right?
#data_science