Welcome to our channel, where we will try to keep up with interesting research and blog articles on the topics of website categorization, product categorization, trending products, online stores, and similar.
First, let us define what website categorization is: any kind of mapping of a website (or URL) to a specific category. In many cases more than one category can be assigned, across several tiers, where Tier 1 consists of broad categories. E.g. check out the IAB taxonomy at https://iabtechlab.com/standards/content-taxonomy/ and look at the Tier 1 column; you will see broad categories like Automotive, Books and Literature, Business and Finance, Careers, Education, etc. The IAB taxonomy is specialized for advertising use cases, e.g. real-time bidding on ads involving ad exchanges. When an advertiser wants to publish an ad, it is important that the ad is placed on a publisher's website that is in the desired category, which is why the advertiser needs all the websites to be categorized. If you are looking for a free tool to categorize websites, there are plenty available out there. If you are more interested in the e-commerce sector, a better taxonomy or category definition is the one from Google, called the Google Product Taxonomy. You can learn more about it e.g. here: https://support.google.com/merchants/answer/6324436?hl=en.
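To make the multi-tier idea concrete, here is a minimal Python sketch of what a tiered categorization result might look like. The category names are taken from the IAB Tier 1 list mentioned above, but the domains and the lookup table itself are made-up examples; a real categorizer would use a trained model or an API rather than a hand-written dictionary.

```python
# Minimal sketch of multi-tier website categorization.
# The lookup table below is a made-up example for illustration only.

TAXONOMY = {
    "cars-example.com": ["Automotive", "Auto Buying and Selling"],
    "books-example.com": ["Books and Literature"],
    "jobs-example.com": ["Careers", "Job Search"],
}

def categorize(domain: str) -> list[str]:
    """Return the tier categories for a domain, Tier 1 first."""
    return TAXONOMY.get(domain, ["Uncategorized"])

print(categorize("cars-example.com"))      # Tier 1 and Tier 2 categories
print(categorize("no-such-domain.example"))  # fallback for unknown domains
```

The point of the sketch is just the shape of the output: a list ordered from the broadest tier down to the most specific one.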
In our second post we want to talk about news classification. News classification is one of the standard text classification tasks, and there are data sets available for this purpose, most notably the 20 Newsgroups data set. You can find it here: https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html, although that is not the only site where it is made accessible. This data set contains around 18,000 news posts, partitioned into 20 newsgroups. The main idea is to build a machine learning model on this data set which can then predict, for new, previously unseen (out-of-sample) news items, which newsgroup they belong to. The data set is small enough to serve as a baseline data set for different text classification models. Here is a great overview article which compares results from different text classification models on various data sets, including the 20 Newsgroups data set: https://github.com/kk7nc/Text_Classification. If you want to learn more about the website categorization field, you can also check out these two articles: https://medium.com/website-categorization/website-categorization-api-ca6c3e0f6c4d and https://www.alpha-quantum.com/blog/website-categorization/website-categorization-api/
When dealing with news articles from websites, it is important to do article extraction first. The article itself is embedded in the website's structure, and the DOM structure of websites is usually such that many elements, like menus and footers, are generic or common to all webpages; they are better removed to improve the signal-to-noise ratio before the actual classification. This is done by special libraries for article extraction. If you are interested in a full list of these content extraction methods and how they perform, please check this link: https://github.com/scrapinghub/article-extraction-benchmark
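As a rough illustration of the idea, here is a standard-library-only sketch that skips `nav`, `footer`, `script`, and `style` subtrees while extracting text. Real article extractors, like the libraries in the benchmark linked above, are far more sophisticated; this only shows why removing boilerplate helps the signal-to-noise ratio.

```python
# Sketch: drop common boilerplate tags from HTML before extracting text,
# using only the Python standard library's html.parser.
from html.parser import HTMLParser

BOILERPLATE = {"nav", "footer", "script", "style"}

class ArticleText(HTMLParser):
    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # > 0 while inside a boilerplate subtree
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        # Start (or stay in) skip mode inside a boilerplate subtree.
        if tag in BOILERPLATE or self.skip_depth:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.chunks.append(data.strip())

html = """
<html><body>
  <nav><a href="/">Home</a> <a href="/news">News</a></nav>
  <article><h1>Headline</h1><p>The actual article text.</p></article>
  <footer>Copyright 2022</footer>
</body></html>
"""

parser = ArticleText()
parser.feed(html)
print(" ".join(parser.chunks))  # menu and footer text are gone
```

After stripping, only the headline and body text remain, which is exactly what you want to feed into a classifier.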
A website categorization tool is very useful if one wants to determine trending purchases of goods in 2022. E.g. some of the top trending products in the Apparel & Accessories category right now are mirabel dress, infinite hoop, sky glasses, print costume, ruby slide, and sidekick wallet. This can be done by first determining trends for a large number of products and then classifying the products into Tier 1, Tier 2, and Tier 3 product categories.
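To illustrate the tier split, here is a small sketch that takes a full Google-taxonomy-style category path of the form "A > B > C" and breaks it into Tier 1/2/3. The example path is in the style of the Google Product Taxonomy but is chosen here purely for illustration.

```python
# Sketch: split a full "A > B > C" category path into tiers.
# In practice the product-to-path mapping would come from a
# product categorization model or API; the path below is an example.

def split_tiers(category_path: str) -> dict:
    parts = [p.strip() for p in category_path.split(">")]
    return {f"Tier {i + 1}": part for i, part in enumerate(parts)}

path = "Apparel & Accessories > Clothing > Dresses"
print(split_tiers(path))
# {'Tier 1': 'Apparel & Accessories', 'Tier 2': 'Clothing', 'Tier 3': 'Dresses'}
```

Once each product carries such a path, aggregating trend data per tier is a simple group-by over the resulting dictionary keys.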
Just read another article about a Website Categorization API. Interestingly, they offer both the Google Taxonomy and the IAB taxonomy for their categories. Usually most providers opt for the IAB taxonomy, mostly because it is so well known, being used by real-time bidding companies and others.
GPT-3 is quite good for Q&A. Having been trained on, among other sources, Wikipedia, it is quite capable of correctly answering questions like whether Lugano is in Switzerland. It seems to me that many people will be developing apps based on the functionality of the underlying engine of GPT-3 by OpenAI. It will be interesting to see how this goes, as the quality seems to have increased substantially from GPT-1 to GPT-2 and now to GPT-3.