In this channel we will be talking about all things related to IP lookup tools and services, including some specialized ones such as reverse IP APIs.
First, let us start with basic definitions. What is an IP address? Every device on the internet, from computers to IoT devices, is assigned a so-called IP address. It is a kind of street address, except that it is for devices rather than houses and it lives in internet space. There are actually two kinds of IP addresses. IPv4 is the first kind; the name is an abbreviation of Internet Protocol version 4. An example of an IP address according to this protocol is 22.214.171.124. IPv4 addresses are 32 bits long, so there are only 2^32 (about 4.3 billion) of them. Then there is also the IPv6 version of IP addresses. It exists because the world was running out of IPv4 addresses and something had to be done. Enter IPv6. It uses 128-bit addresses and can support 2^128 internet addresses, 340,282,366,920,938,463,463,374,607,431,768,211,456 of them to be exact. That is 2^96 (roughly 7.9 × 10^28) times as many addresses as IPv4 has, so we should be safe for a while.
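To make the arithmetic concrete, here is a small Python sketch using only the standard library. The constant and function names are just illustrative choices, and the sample addresses come from the ranges reserved for documentation.

```python
import ipaddress

# The size of each address space follows from the address width in bits.
IPV4_TOTAL = 2 ** 32    # 32-bit addresses
IPV6_TOTAL = 2 ** 128   # 128-bit addresses


def ip_version(text: str) -> int:
    """Return 4 or 6 for a valid address; raises ValueError otherwise."""
    return ipaddress.ip_address(text).version


print(f"IPv4 addresses: {IPV4_TOTAL:,}")
print(f"IPv6 addresses: {IPV6_TOTAL:,}")
# The ratio is 2^(128 - 32) = 2^96.
print(f"IPv6 / IPv4:    {IPV6_TOTAL // IPV4_TOTAL:,}")

for sample in ("192.0.2.1", "2001:db8::1"):
    print(sample, "is IPv" + str(ip_version(sample)))
```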
Next, let us turn our attention to domains. A domain is a text string that serves as a unique identifier for a website. E.g. I am writing this on a service whose domain is telegram.org. Note the extension at the end. For a long time, the only extensions were .com, .net, .org and a few other TLDs, as well as country-level domains like .de or .co.uk. Recently this has changed with the introduction of many other domain extensions. Here is a full list of domain extensions: https://www.namecheap.com/domains/full-tld-list/. The interesting thing is that several domains can reside on the same IP. An IP can belong e.g. to a server of some provider like DigitalOcean, and many different domains can then be hosted on this server and thus on this IP. To find all domains that reside on the same IP, one needs special reverse IP lookup tools and API services.
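The idea of "many domains on one IP" can be sketched in a few lines of Python. Note the assumption: DNS itself has no query that lists every domain hosted on an IP, so real reverse IP services work from large pre-collected datasets. The sketch below only inverts forward lookups for a list of domains you already know. The function name `group_domains_by_ip`, the example domains and the fake lookup table are all made up for illustration.

```python
import socket
from collections import defaultdict
from typing import Callable, Dict, Iterable, List


def group_domains_by_ip(
    domains: Iterable[str],
    resolve: Callable[[str], str] = socket.gethostbyname,
) -> Dict[str, List[str]]:
    """Resolve each domain and group the domains by the IP they point to."""
    by_ip: Dict[str, List[str]] = defaultdict(list)
    for domain in domains:
        try:
            ip = resolve(domain)
        except OSError:
            continue  # skip domains that do not resolve
        by_ip[ip].append(domain)
    return dict(by_ip)


# Offline demo: a hypothetical lookup table stands in for real DNS.
FAKE_DNS = {
    "shop.example": "192.0.2.10",
    "blog.example": "192.0.2.10",
    "news.example": "192.0.2.20",
}


def fake_resolve(domain: str) -> str:
    if domain not in FAKE_DNS:
        raise OSError("no such host")
    return FAKE_DNS[domain]


print(group_domains_by_ip(FAKE_DNS, resolve=fake_resolve))
```

Leaving out the `resolve` argument makes the function use real DNS lookups through `socket.gethostbyname`.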
What are the potential use cases for reverse IP lookup tools and services? One can integrate them with an API for IAB classification of websites to find all websites that reside on the same IP but belong to a specific category, such as fashion websites. This can then be used as part of web content filtering to prevent employees from spending time on non-work-related websites on their computers. For more information, please check out the introduction to Website Classification API.
Website categorization is an example of a text classification task. The models usually used for it are logistic regression, support vector machines, naive Bayes or neural nets. One can build website categorization models on one's own or use a machine learning consulting company for this purpose.
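As a rough illustration of one of the model types mentioned above, here is a tiny multinomial naive Bayes classifier written in plain Python. It is only a sketch: the class name, the two categories and the four training texts are invented for the example, and a real website categorizer would be trained on far more page text and a full taxonomy.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and keep alphabetic words only."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesTextClassifier:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(doc_count / total_docs)  # log prior
            for word in tokenize(text):
                if word in self.vocab:  # ignore words never seen in training
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data, invented for this example.
texts = [
    "summer dresses shoes and handbags on sale",
    "new collection of jackets jeans and sneakers",
    "football scores league table and match results",
    "tennis tournament results and player rankings",
]
labels = ["fashion", "fashion", "sports", "sports"]

model = NaiveBayesTextClassifier().fit(texts, labels)
print(model.predict("sale on sneakers and jackets"))
print(model.predict("league match results"))
```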
Let us look at the task of URL classification, primarily for the purpose of identifying problematic websites. How can one approach such a problem? The first part is feature engineering, i.e. the extraction of relevant features that the ML models will use to predict whether a URL is problematic or not. Given that we are dealing with URLs, what do we have available for this? Here are some of the possible features that we can use: total number of digits in the URL; total number of characters in the URL; total number of query parameters in the URL; total number of parts in the URL, e.g. separated by '-'; type of domain extension (e.g. .info domains are more often associated with problematic websites); number of times that we find %20 in the URL; number of @ characters in the URL; is an IP address present in the URL (boolean type); is the URL using the http protocol; is the URL using the https protocol; is the website online or offline; what is the number of days since the domain was last registered; ....
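Most of the features listed above can be computed from the URL string alone. The sketch below does that with Python's standard library. The function and key names are my own choices, "IP address present" is interpreted here as "the host part is an IP address", and the features that need a network call (online/offline status, days since registration) are left out.

```python
import ipaddress
from urllib.parse import parse_qsl, urlparse


def host_is_ip(host: str) -> bool:
    """True if the host part is a literal IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def url_features(url: str) -> dict:
    """Extract simple lexical features from a URL string."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    is_ip = host_is_ip(host)
    return {
        "num_digits": sum(ch.isdigit() for ch in url),
        "num_chars": len(url),
        "num_query_params": len(parse_qsl(parsed.query, keep_blank_values=True)),
        "num_hyphen_parts": len(url.split("-")),
        # Domain extension; empty when the host is an IP or has no dot.
        "tld": "" if is_ip or "." not in host else host.rsplit(".", 1)[-1],
        "num_encoded_spaces": url.count("%20"),
        "num_at_signs": url.count("@"),
        "has_ip_host": is_ip,
        "is_http": parsed.scheme == "http",
        "is_https": parsed.scheme == "https",
    }


example = "http://192.0.2.7/login-page/verify%20account?id=123&token=abc"
for name, value in url_features(example).items():
    print(f"{name}: {value}")
```

The resulting dictionary can be turned into a numeric vector (with the extension one-hot encoded) and passed to any of the classifiers mentioned earlier.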
Some other features that are available for URL classification: total number of images, total number of links, total number of characters, total number of special characters, the ratio of the total length of script to special characters.
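These features can be computed once the HTML of the page has been downloaded. Below is a minimal sketch using Python's built-in HTML parser. Several definitions here are assumptions, since the list above does not spell them out: images are `<img>` tags, links are `<a>` tags, special characters are ASCII punctuation, and script length is the number of characters inside `<script>` elements.

```python
import string
from html.parser import HTMLParser


class PageFeatureParser(HTMLParser):
    """Counts images, links and the length of inline script content."""

    def __init__(self):
        super().__init__()
        self.num_images = 0
        self.num_links = 0
        self.script_length = 0
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            self.num_images += 1
        elif tag == "a":
            self.num_links += 1
        elif tag == "script":
            self._in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        if self._in_script:
            self.script_length += len(data)


def page_features(html: str) -> dict:
    """Compute simple content features from raw HTML text."""
    parser = PageFeatureParser()
    parser.feed(html)
    parser.close()
    # Assumption: "special characters" means ASCII punctuation.
    num_special = sum(ch in string.punctuation for ch in html)
    return {
        "num_images": parser.num_images,
        "num_links": parser.num_links,
        "num_chars": len(html),
        "num_special_chars": num_special,
        "script_to_special_ratio": (
            parser.script_length / num_special if num_special else 0.0
        ),
    }


sample_html = (
    '<html><body><a href="/x">x</a><img src="a.png"><img src="b.png"/>'
    "<script>var a = 1;</script></body></html>"
)
print(page_features(sample_html))
```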