Spark in me
Lost like tears in rain. DS, ML, a bit of philosophy and math. No bs or ads.
Some Proxy Related Tips and Hacks ... Quick Ez and in Style =)

DO is not cool anymore

First of all - let's address the elephant in the room. Remember I recommended Digital Ocean?
Looks like they behave like a f**g corporation now. They now require a selfie with your passport.
F**k this. Even AWS does not need this.


Why would you need proxies?

Scraping mostly. Circumventing anal restrictions.
Sometimes there are other legit use cases like proxying your tg api requests.


Which framework to use?

None.
One of our team tried scrapy, but it is too much hassle (imho) w/o any benefits.
(apart from their corporate platform crawlera, but I still do not understand why it exists, enterprise maybe)
Just use aiohttp, asyncio, bs4, requests, threading and multiprocessing.
And just write mostly good enough code.
If you do not need to scrape 100m pages per day or use selenium to scrape JS, this is more than enough.
Really. Do not buy into this cargo cult stuff.
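
To show what "good enough code" can look like, here is a minimal sketch of a fetcher with bounded concurrency on plain asyncio. The names fetch_all and scrape_with_aiohttp and the default limit of 10 are made up for illustration, not taken from any real project. The second function assumes aiohttp is installed.

```python
import asyncio


async def fetch_all(urls, fetch, limit=10):
    """Run `fetch(url)` for every url, at most `limit` at a time.

    Returns {url: result}; a failed fetch stores the exception instead
    of killing the whole batch.
    """
    sem = asyncio.Semaphore(limit)

    async def worker(url):
        async with sem:
            try:
                return url, await fetch(url)
            except Exception as exc:  # crude, but fine for a scraper
                return url, exc

    pairs = await asyncio.gather(*(worker(u) for u in urls))
    return dict(pairs)


async def scrape_with_aiohttp(urls, limit=10):
    """Same thing wired to aiohttp (third-party, imported lazily)."""
    import aiohttp

    async with aiohttp.ClientSession() as session:

        async def fetch(url):
            async with session.get(url) as resp:
                resp.raise_for_status()
                return await resp.text()

        return await fetch_all(urls, fetch, limit)
```

Usage would be something like pages = asyncio.run(scrape_with_aiohttp(list_of_urls)). Retries, delays and parsing with bs4 are left out on purpose.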


Video

For video-content there are libraries:

- youtube-dl - lots of features, horrible python API, nice CLI, it really works
- pytube - was really cool and pythonic, but the author abandoned it. Most likely he just wrote a ton of regexps that he decided not to support. Some methods still work though

Also remember that many HTTP libraries have HTTP / SOCKS5 proxy support.
If the libraries are old, this may be supported via env variables.
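
A rough sketch of both options: passing a proxy explicitly and passing it via env variables. The helper names (proxy_url, proxy_env, requests_proxies) are hypothetical. Whether a given library honours these variables or the socks5 scheme depends on the library, so treat this as an assumption to check. With requests, SOCKS support needs the extra requests[socks] install.

```python
from urllib.parse import quote


def proxy_url(host, port, user=None, password=None, scheme="socks5"):
    """Build a proxy URL, percent-encoding the credentials."""
    auth = ""
    if user is not None:
        auth = quote(user, safe="")
        if password is not None:
            auth += ":" + quote(password, safe="")
        auth += "@"
    return f"{scheme}://{auth}{host}:{port}"


def proxy_env(url):
    """Env variables that many HTTP tools look at for proxy settings."""
    return {
        "HTTP_PROXY": url,
        "HTTPS_PROXY": url,
        "http_proxy": url,
        "https_proxy": url,
    }


def requests_proxies(url):
    """Dict in the shape requests expects for its `proxies=` argument."""
    return {"http": url, "https": url}
```

For the env route you would call os.environ.update(proxy_env(url)) before the old library makes its first request. For the explicit route, requests.get(page, proxies=requests_proxies(url)).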


Where to get proxies?

The most interesting part.
There are "dedicated" services (e.g. smartproxy / luminati.io / proxymesh / bestproxy.ru / mobile proxy services).
And they probably are the only option when you scrape Amazon.
But if you need high bandwidth and many requests - such proxies usually have garbage speed.
(and morally - probably 50% of them are hacked routers)
Ofc there is scrapoxy.io/ - but this is just too much!


Enter Vultr

They have always been a DO look-alike service.

I found a simple hacky way to get 10-20-40 proxies quickly.
You can just use Vultr + Ubuntu 18.04 docker image + write a plain startup script.
That is it. Literally.

With Docker already installed, your script may look something like this:

docker run -d --name socks5_1 -p 1080:1080 -e PROXY_USER=user -e PROXY_PASSWORD=password serjs/go-socks5-proxy && \
docker run -d --name socks5_2 -p 1081:1080 -e PROXY_USER=user -e PROXY_PASSWORD=password serjs/go-socks5-proxy
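
Once a few such servers are up, the client side can be as dumb as a round-robin over them. A minimal sketch, assuming every server runs the script above (ports 1080 and 1081, same user / password). The IPs are placeholders and build_pool / rotating_proxies are made-up names:

```python
from itertools import cycle, product


def build_pool(hosts, ports=(1080, 1081), user="user", password="password"):
    """One SOCKS5 URL per (host, port) pair started by the startup script."""
    return [
        f"socks5://{user}:{password}@{host}:{port}"
        for host, port in product(hosts, ports)
    ]


def rotating_proxies(pool):
    """Endless round-robin of proxy dicts (the shape requests expects)."""
    for url in cycle(pool):
        yield {"http": url, "https": url}


# Placeholder IPs from the documentation range, swap in your servers.
pool = build_pool(["203.0.113.10", "203.0.113.11"])
proxies = rotating_proxies(pool)
first = next(proxies)  # pass as proxies=first to the next request
```

Take next(proxies) before each request and pass it along. A dead proxy can simply be dropped from the pool list before rebuilding the generator.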

There are cheaper hosting alternatives - Vultr is quite expensive.
But their startup script feature + Docker images really save time.
And now they have the nice features - backups / snapshots / startup scripts / ez scaling / etc / etc w/o the corporate bs.
Also use my Give $100, Get $25 link!

Beware - new accounts may be limited to 5 servers per account.
You may want to create several accounts at first.

#data_science
#scraping