Machine learning and Deep learning
Product categorization can be understood as a text classification problem, and there are many machine learning models that can be used to tackle it. Here are some of them:
Naive Bayes classifier
K-Nearest Neighbors
Support Vector Machines (SVM)
Logistic regression
Decision Trees
Random Forests
Deep Neural Networks
Recurrent Neural Networks (RNN)
Convolutional Neural Networks (CNN)
Ensembles of neural nets
SVM models work well for text classification, except when you are dealing with large training data sets. The problem is that the training cost of an SVM grows quickly with the size of the problem (for kernel SVMs, at least quadratically with the number of training samples), so training becomes slow or impractical in those cases.
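To make the list above a bit more concrete, here is a minimal sketch of the first model on it, a multinomial Naive Bayes classifier with add-one smoothing, written in plain Python. The class name, the tokenizer and the toy product titles are all made up for illustration; for real work you would likely reach for an existing library implementation instead.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class NaiveBayesTextClassifier:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.doc_counts = Counter(labels)         # number of documents per class
        self.word_counts = defaultdict(Counter)   # token counts per class
        self.vocab = set()
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict_proba(self, text):
        # Tokens never seen in training carry no information, so drop them.
        tokens = [t for t in tokenize(text) if t in self.vocab]
        total_docs = sum(self.doc_counts.values())
        log_scores = {}
        for label, n_docs in self.doc_counts.items():
            total_words = sum(self.word_counts[label].values())
            score = math.log(n_docs / total_docs)  # log prior
            for t in tokens:
                count = self.word_counts[label][t]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            log_scores[label] = score
        # Convert log scores to normalized probabilities.
        top = max(log_scores.values())
        exp_scores = {k: math.exp(v - top) for k, v in log_scores.items()}
        norm = sum(exp_scores.values())
        return {k: v / norm for k, v in exp_scores.items()}

    def predict(self, text):
        probs = self.predict_proba(text)
        return max(probs, key=probs.get)


# Made-up toy training data: product titles and their categories.
train_texts = [
    "wireless bluetooth headphones with microphone",
    "usb c charging cable for laptop",
    "cotton t-shirt short sleeve",
    "slim fit denim jeans",
]
train_labels = ["electronics", "electronics", "clothing", "clothing"]

model = NaiveBayesTextClassifier().fit(train_texts, train_labels)
print(model.predict("bluetooth speaker with usb charging"))  # prints: electronics
```

Naive Bayes trains in a single pass over the data, which is one reason it is a common baseline before trying heavier models such as SVMs or neural networks.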
One important application of text classification is website categorization. It is useful for brand safety, online filtering, improving search features on your website, and many other uses. Tools in this area offer a website categorization API that can classify both base domain URLs and full path URLs. Base domain classification means that only the domain is classified, whereas full path categorization means you can send any URL to the tool and it will return the classification as a JSON dictionary of categories and predicted probabilities. There are also services that specialize in categorizing products through a RESTful API, which are very useful if you operate an online store. Better product categorization means your customers can find your products more easily and quickly, and it can also help get more of your webpages indexed by search engines, leading to more visits from search.
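The difference between the two modes, and the shape of the result, can be sketched in a few lines. This is only an illustration: the function names, the example URL and the sample JSON are assumptions, not the response format of any particular tool. Here "base domain" is taken to mean scheme plus host.

```python
import json
from urllib.parse import urlsplit


def base_domain_url(url):
    """Reduce a full-path URL to scheme + host, dropping path, query and fragment."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}"


def top_category(response_text):
    """Return the (category, probability) pair with the highest probability."""
    scores = json.loads(response_text)
    category = max(scores, key=scores.get)
    return category, scores[category]


# A full-path URL and the base domain URL derived from it.
full_url = "https://www.example.com/sports/football/match-report?id=7"
print(base_domain_url(full_url))

# Hypothetical response body: a JSON dictionary of categories and probabilities.
sample_response = '{"Sports": 0.91, "News": 0.07, "Shopping": 0.02}'
print(top_category(sample_response))
```

Full path classification can give a different answer than base domain classification for the same site, e.g. a sports article on a general news domain.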
Website categorization is a great starting point if you are a data scientist and want to try out some text classification models. The first step is to decide on the domain, e.g. you could start with news. The next step is to build a training data set for your machine learning model. The best way is to collect articles from some well-known news sites and then categorize them. You can categorize them manually or use the sites' own menus for this purpose. Note, however, that it is hard to keep categories consistent as you go from website to website, so it is best to stick with your own custom labels and assign the articles to them. After you have collected enough news items, the next step is to select a machine learning model. We suggest an SVM model or logistic regression to get you started.
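The labeling step described above can be sketched as a simple lookup from each site's menu names to your own labels. The site names, menu sections and labels below are made up for illustration.

```python
# Hypothetical mapping from (site, menu section) to our own custom labels.
MENU_TO_LABEL = {
    ("site-a.example", "Football"): "sports",
    ("site-a.example", "Markets"): "business",
    ("site-b.example", "Sport"): "sports",
    ("site-b.example", "Money"): "business",
    ("site-b.example", "Tech"): "technology",
}


def build_training_set(articles):
    """Turn (site, menu_section, text) records into (text, label) pairs.

    Articles whose menu section has no mapping are returned separately
    so they can be labeled manually later.
    """
    examples, unmapped = [], []
    for site, section, text in articles:
        label = MENU_TO_LABEL.get((site, section))
        if label is None:
            unmapped.append((site, section, text))
        else:
            examples.append((text, label))
    return examples, unmapped


# Made-up collected articles.
articles = [
    ("site-a.example", "Football", "Late goal decides the derby"),
    ("site-b.example", "Money", "Central bank holds interest rates"),
    ("site-b.example", "Lifestyle", "Ten ideas for a weekend trip"),
]
examples, unmapped = build_training_set(articles)
print(examples)
print(len(unmapped))
```

Keeping the mapping in one place makes it easy to see which site sections still need a manual decision.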
Product categorization is also a subset of the text classification problem. It can be interesting if you work with an e-commerce company, e.g. an online store, and would like to have all of your products categorized. You can then add more webpages, e.g. subpages for the different categories, which means more pages of your website indexed by search engines and thus more visits from search. Another benefit of implementing product categorization in e-commerce is better filtering on your pages, so customers can find your products more easily and quickly.
Machine learning and data science consulting has become more popular in recent years as companies adopt machine learning models for various purposes. Companies generate a lot of data, and data science helps them uncover insights from their data sets, with business intelligence then helping to put those insights to use and extract new value from business processes. Machine learning and data science consulting usually involves a team consisting of data scientists as well as data engineers.
Just trying out GPT-3 for text generation, and it is really good. You can even repurpose it for fact checking. E.g. let us say that you want to verify the statement "bologna is in italy". You can pass this sentence to the OpenAI API and you will get an answer back: Yes or No. Just make sure that you use the Q&A preset: https://beta.openai.com/playground/p/default-qa?model=text-davinci-002
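A rough sketch of how the fact-checking idea could be wired up. The prompt wording and all function names here are assumptions, and the actual API call is deliberately left out: `complete` stands for any function that sends a prompt to the model and returns its text, so it is replaced by a stand-in below.

```python
def build_fact_check_prompt(statement):
    """Wrap a statement in a question-and-answer style prompt."""
    return (
        "Answer the question with Yes or No.\n\n"
        f'Q: Is the following statement true? "{statement}"\n'
        "A:"
    )


def parse_yes_no(completion):
    """Map a raw completion to True (Yes), False (No), or None if unclear."""
    words = completion.strip().lower().split()
    first = words[0].strip(".,!") if words else ""
    if first == "yes":
        return True
    if first == "no":
        return False
    return None


def check_fact(statement, complete):
    """`complete` is any callable that takes a prompt and returns the model's text."""
    return parse_yes_no(complete(build_fact_check_prompt(statement)))


def fake_complete(prompt):
    """Stand-in for a real API call, used only so the sketch runs offline."""
    return " Yes."


print(check_fact("bologna is in italy", fake_complete))
```

Returning None for anything other than a clear Yes or No keeps unclear completions from being silently counted as an answer.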
OpenAI is actually also quite affordable for this kind of task, even their top model, Davinci, which is the most expensive. When you set up an account they also give you some initial credits, so you can start coding right away. It is best to start with their documentation at https://beta.openai.com/docs/introduction/overview. The quality of the results is so good that a lot of startups will come up with all sorts of ideas for launching new apps based on these models.
Over the last few months I have developed a large collaborative filtering model and have only good things to say about this library: https://github.com/benfred/implicit. As far as I know it was developed for Spotify and is probably still used there in some modified form, though I am not sure. Anyhow, what I like is that it works even on a huge user-item matrix. We are using it on 100 million data items and it works remarkably well. There are some tricks for getting retraining done quickly, but other than that, 5.
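For readers new to collaborative filtering on implicit feedback, here is a tiny plain-Python sketch of the idea. It does not use the implicit library and it is not the matrix factorization approach you would use at the scale mentioned above. It is a simple item-item cosine similarity baseline on made-up (user, item) interactions, and all names are assumptions.

```python
import math
from collections import defaultdict


def item_similarities(interactions):
    """Cosine similarity between items from binary (user, item) interactions."""
    users_by_item = defaultdict(set)
    for user, item in interactions:
        users_by_item[item].add(user)
    sims = defaultdict(dict)
    items = sorted(users_by_item)
    # Pairwise loop over items: fine for a toy example, far too slow for millions.
    for i, a in enumerate(items):
        for b in items[i + 1:]:
            shared = len(users_by_item[a] & users_by_item[b])
            if shared:
                s = shared / math.sqrt(len(users_by_item[a]) * len(users_by_item[b]))
                sims[a][b] = s
                sims[b][a] = s
    return sims


def recommend(user, interactions, sims, n=3):
    """Rank unseen items by summed similarity to the items the user interacted with."""
    seen = {item for u, item in interactions if u == user}
    scores = defaultdict(float)
    for item in seen:
        for other, s in sims.get(item, {}).items():
            if other not in seen:
                scores[other] += s
    return sorted(scores, key=lambda item: (-scores[item], item))[:n]


# Made-up implicit feedback: each pair means "this user interacted with this item".
interactions = [
    ("u1", "a"), ("u1", "b"),
    ("u2", "a"), ("u2", "b"), ("u2", "c"),
    ("u3", "b"), ("u3", "c"),
    ("u4", "a"),
]
sims = item_similarities(interactions)
print(recommend("u4", interactions, sims))
```

The input is the same kind of data a user-item matrix encodes, just stored as pairs instead of a sparse matrix.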
The Mean-CVaR methodology, as implemented in the Alpha Quantum Portfolio Optimiser tool, can be used in three different formulations across a wide variety of products and services. The target return formulation can be used, for example, for pension funds with predefined return objectives or for total return funds. The target risk formulation can be employed for insurance portfolios under the new Solvency 2 regulatory framework. The risk aversion formulation is appropriate for financial advisory services, where the portfolio composition reflects the individual risk profile of the investor.
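For reference, the three formulations are commonly written as below. These are the standard textbook forms, not necessarily the exact ones used in the tool, and the symbols are assumptions: $w$ is the vector of portfolio weights, $\mu$ the vector of expected returns, $\mathrm{CVaR}_\alpha(w)$ the conditional value-at-risk of the portfolio loss at confidence level $\alpha$, $R^{*}$ the target return, $C^{*}$ the risk budget, $\lambda \ge 0$ the risk aversion parameter, and $\mathcal{W} = \{ w : \mathbf{1}^\top w = 1,\ w \ge 0 \}$ a fully invested, long-only feasible set.

```latex
\begin{align*}
\text{Target return:}\quad
  & \min_{w \in \mathcal{W}} \ \mathrm{CVaR}_\alpha(w)
  \quad \text{s.t.} \quad \mu^\top w \ge R^{*} \\
\text{Target risk:}\quad
  & \max_{w \in \mathcal{W}} \ \mu^\top w
  \quad \text{s.t.} \quad \mathrm{CVaR}_\alpha(w) \le C^{*} \\
\text{Risk aversion:}\quad
  & \max_{w \in \mathcal{W}} \ \mu^\top w - \lambda \, \mathrm{CVaR}_\alpha(w)
\end{align*}
```

All three trace the same mean-CVaR trade-off. They differ only in which quantity is fixed in advance: the return target, the risk budget, or the trade-off parameter.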
An interesting thing about domains is that they are often hosted on a server that also hosts other domains. So when we get the IP for a domain, we may think that it is unique to that domain, but it is not. What one needs to do is perform a reverse IP lookup, which returns, for a given IP, the other domains that may be hosted on the same IP. This reverse IP lookup can be done if one has obtained the IPs of a large number of domains beforehand. There are literally hundreds of millions of domains out there.
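The reverse lookup described above is essentially an inverted index over previously collected domain-to-IP records. A minimal sketch, assuming the records are already available as a dictionary. The domains are made up and the IPs come from ranges reserved for documentation.

```python
from collections import defaultdict


def build_reverse_ip_index(domain_to_ips):
    """Invert a domain -> IPs mapping into an IP -> set of domains index."""
    index = defaultdict(set)
    for domain, ips in domain_to_ips.items():
        for ip in ips:
            index[ip].add(domain)
    return index


def reverse_ip_lookup(domain, domain_to_ips, index):
    """Return the other domains that share at least one IP with `domain`."""
    neighbours = set()
    for ip in domain_to_ips.get(domain, ()):
        neighbours |= index.get(ip, set())
    neighbours.discard(domain)
    return sorted(neighbours)


# Made-up resolution results collected beforehand.
resolved = {
    "shop.example": ["192.0.2.10"],
    "blog.example": ["192.0.2.10"],
    "news.example": ["192.0.2.10", "198.51.100.7"],
    "solo.example": ["203.0.113.5"],
}
index = build_reverse_ip_index(resolved)
print(reverse_ip_lookup("shop.example", resolved, index))
print(reverse_ip_lookup("solo.example", resolved, index))
```

The result is only as complete as the set of domains that were resolved beforehand, which is why coverage of a large number of domains matters.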
If you are interested in machine learning based website classification, here is a nice Python package that lets you quickly get started with classifying websites: https://pypi.org/project/websiteclassificationapi/