Web Scraping & Data Extraction
Ultimate web scraping related hub.
A shortcut for your learning journey.
Image credit: Clubhouse data extraction by x.com/rashiq
🗓 Update on the Scraping Universe

Did you know that TikTok's parent company scrapes the web on a massive scale? It runs one of the most aggressive scrapers on the internet! 🤯

ByteDance, the company behind TikTok, has been scraping data from websites at an insane rate.

According to Sam Crowther, the CEO of Kasada, ByteDance's bot, Bytespider, is blowing away the competition: it hoovers up data 25 times faster than GPTBot, which scrapes data for ChatGPT, and 3,000 times faster than ClaudeBot, the scraper bot used by Anthropic. 🤖

What do you think, guys? 😀
💼 Job market requirement insight for Scrapers

Zyte is hiring a Principal Reverse Engineer, and these are the skills they require from candidates:

• Hacker mindset
• Understanding of techniques and tools for crawling, extracting, and processing data
• Proficiency in programming languages: JavaScript/Node.js, Python, Java
• Reverse engineering skills: static, dynamic, and concolic analysis
• Understanding of operating systems and computer networking concepts
• Ability to use tools like Wireshark, Burp Suite, etc. to intercept and debug network traffic
• Understanding of browser engines, browser fingerprinting, and ad-blocker mechanisms

Nice to have:
• Experience with decompilers, IDA Pro, Ghidra or Frida, Jadx, and Babel
• Experience with C/C++
• Core contributions to Mozilla or Chromium projects
⚙ Tech Stack at Apify

Frontend: React.js, styled-components, Storybook, Cypress

Backend: TypeScript/Node.js, Next.js, Nest.js, Docusaurus, Jest

Infra: AWS, Kubernetes, Helm, MongoDB, Redis, DynamoDB, S3, GitHub Actions

Monitoring: New Relic, LogDNA, Sentry, PagerDuty

Tools: GitHub, ZenHub, Notion, GSuite

AI Tools: LangChain, LlamaIndex, Pinecone, OpenAI API, Web agents

JA3 is a way to fingerprint a client's TLS configuration. It is built from details the client sends during the TLS handshake, including the list of cipher suites supported and the extensions sent.

This fingerprint can be used to recognize the browser or device making the connection even though the traffic is encrypted, because the handshake details it relies on are sent before encryption begins.
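
To make this concrete, here is a minimal sketch of how a JA3 fingerprint is commonly computed. It assumes the ClientHello has already been parsed into lists of integers. The function names and the sample values are only illustrative and are not taken from any particular library. The five fields are joined into one string and hashed with MD5, and GREASE placeholder values are dropped first.

```python
import hashlib

# GREASE values (RFC 8701): 0x0A0A, 0x1A1A, ..., 0xFAFA.
# Clients insert them at random, so JA3 ignores them.
GREASE = {0x0A0A + 0x1010 * i for i in range(16)}


def ja3_string(version, ciphers, extensions, curves, point_formats):
    """Build the JA3 string: five comma-separated fields, values joined by '-'."""
    fields = [ciphers, extensions, curves, point_formats]
    parts = [str(version)]
    for values in fields:
        parts.append("-".join(str(v) for v in values if v not in GREASE))
    return ",".join(parts)


def ja3_hash(ja3):
    """The fingerprint itself is the MD5 hex digest of the JA3 string."""
    return hashlib.md5(ja3.encode("ascii")).hexdigest()


# Hypothetical ClientHello values, for illustration only.
sample = ja3_string(
    version=771,                    # TLS 1.2 as a decimal number
    ciphers=[0x0A0A, 4865, 4866],   # the GREASE entry is dropped
    extensions=[0, 0x1A1A, 10],
    curves=[29, 23],
    point_formats=[0],
)
print(sample)            # 771,4865-4866,0-10,29-23,0
print(ja3_hash(sample))  # 32-character hex fingerprint
```

Because the order of ciphers and extensions is part of the string, two clients that support the same features but list them differently end up with different fingerprints.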

Fingerprinting is a common method for detecting automation bots and crawlers.

How can JA3 fingerprints be impersonated? 🤔