Some Proxy Related Tips and Hacks ... Quick Ez and in Style =)
DO is not cool anymore
First of all - let's get the Elephant out of the room. Remember I recommended Digital Ocean?
Looks like they behave like a f**g corporation now. They now require a selfie with your passport.
F**k this. Even AWS does not need this.
Why would you need proxies?
Scraping mostly. Circumventing anal restrictions.
Sometimes there are other legit use cases like proxying your tg api requests.
Which framework to use?
None.
One of our team tried scrapy, but there is too much hassle (imho) w/o any benefits (apart from their corporate platform crawlera, but I still do not understand why it exists, enterprise maybe).
Just use aiohttp, asyncio, bs4, requests, threading and multiprocessing.
And just write mostly good enough code.
If you do not need to scrape 100m pages per day or use selenium to scrape JS, this is more than enough.
Really. Do not buy into this cargo cult stuff.
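To make the "good enough code" point concrete, here is a minimal sketch of a threaded scraper. To stay dependency-free it swaps in the standard library (urllib + html.parser + a thread pool) for requests / bs4; the names scrape, fetch and parse_title are made up for illustration, not taken from any library.

```python
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text found inside <title> tags."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def fetch(url, timeout=10):
    """Download one page and return its HTML as text."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def parse_title(html):
    """Extract the page title from an HTML string."""
    parser = TitleParser()
    parser.feed(html)
    return parser.title.strip()


def scrape(urls, fetch_fn=fetch, workers=8):
    """Fetch pages in a thread pool; return {url: title, or None on failure}."""

    def work(url):
        try:
            return url, parse_title(fetch_fn(url))
        except Exception:
            # Good enough for a sketch: a failed page is recorded as None.
            return url, None

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(work, urls))
```

fetch_fn is pluggable on purpose, so the download step can be replaced with a proxied or mocked one without touching the rest.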
Video
For video-content there are libraries:
- youtube-dl - lots of features, horrible python API, nice CLI, it really works
- pytube - was really cool and pythonic, but the author abandoned it. Most likely he just wrote a ton of regexp that he decided not to support. Some methods still work though.
Also remember that many HTTP libraries have HTTP / SOCKS5 proxy support.
If the libraries are old, this may be supported via env variables.
Where to get proxies?
The most interesting part.
There are "dedicated" services (e.g. smartproxy / luminati.io / proxymesh / bestproxy.ru / mobile proxy services).
And they probably are the only option when you scrape Amazon.
But if you need high bandwidth and many requests - such proxies usually have garbage speed.
(and morally - probably 50% of them are hacked routers)
Ofc there is scrapoxy.io/ - but this is just too much!
Enter Vultr
They have always been a DO look-alike service.
I found a simple hacky way to get 10-20-40 proxies quickly.
You can just use Vultr + Ubuntu 18.04 docker image + write a plain startup script.
That is it. Literally.
With Docker already installed your script may look something like this:
docker run -d --name socks5_1 -p 1080:1080 -e PROXY_USER=user -e PROXY_PASSWORD=password serjs/go-socks5-proxy && \
docker run -d --name socks5_2 -p 1081:1080 -e PROXY_USER=user -e PROXY_PASSWORD=password serjs/go-socks5-proxy
There are cheaper hosting alternatives - Vultr is quite expensive.
But this startup script feature + Docker images really save time.
And now they have the nice features - backups / snapshots / startup scripts / ez scaling / etc w/o the corporate bs.
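If you want more than two containers per server, the startup script can be generated instead of typed by hand. A small sketch: the function name startup_script and the #!/bin/bash first line are my assumptions, while the docker run line itself follows the command from this post.

```python
def startup_script(n, base_port=1080, user="user", password="password",
                   image="serjs/go-socks5-proxy"):
    """Return a shell script that starts n proxy containers on consecutive host ports."""
    if n < 1:
        raise ValueError("n must be at least 1")
    commands = [
        # Each container listens on 1080 inside; only the host port changes.
        f"docker run -d --name socks5_{i + 1} -p {base_port + i}:1080 "
        f"-e PROXY_USER={user} -e PROXY_PASSWORD={password} {image}"
        for i in range(n)
    ]
    # Chain the commands with && so the script stops at the first failure.
    return "#!/bin/bash\n" + " && \\\n".join(commands) + "\n"


if __name__ == "__main__":
    print(startup_script(4))
```

Paste the printed text into the startup script field and change the default user / password before you do.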
Also use my "Give $100, Get $25" link!
Beware - new accounts may be limited to 5 servers per account.
You may want to create several accounts at first.
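Once the servers are up, consuming the proxies from Python can look roughly like this. A sketch under assumptions: the host IPs are placeholders from a documentation range, the helper names build_proxy_urls and proxy_rotation are made up, and the requests call is left in a comment because SOCKS support there needs the extra install (pip install requests[socks]).

```python
from itertools import cycle


def build_proxy_urls(hosts, ports=(1080, 1081), user="user", password="password"):
    """One socks5:// URL per (host, port) pair, matching the containers started above."""
    return [
        f"socks5://{user}:{password}@{host}:{port}"
        for host in hosts
        for port in ports
    ]


def proxy_rotation(urls):
    """Endless round-robin of proxies dicts in the shape requests accepts."""
    return cycle([{"http": u, "https": u} for u in urls])


if __name__ == "__main__":
    # Placeholder IPs - substitute the addresses of your own servers.
    rotation = proxy_rotation(build_proxy_urls(["203.0.113.10", "203.0.113.11"]))
    for _ in range(3):
        print(next(rotation))
    # With requests installed, a proxied call would be along the lines of:
    #   requests.get("https://example.com", proxies=next(rotation), timeout=10)
```

Round-robin is the dumbest rotation possible, but for 10-40 proxies it is usually all you need.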
#data_science
#scraping