How to Fix: KeyError in Pandas?
The KeyError in Pandas occurs when you try to access a column that does not exist in a pandas DataFrame, often because the column name is misspelled.
Typically, we import data from an Excel file, which also imports the column names, and there is a good chance of misspelling a column name or of the name carrying an unwanted space before or after it.
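A minimal sketch of the problem and two common fixes. The DataFrame, its column names, and the values are made up for illustration; only the stray-space situation comes from the description above.

```python
import pandas as pd

# Hypothetical data: two of the column names carry stray spaces,
# as can happen after importing a spreadsheet.
df = pd.DataFrame({" name ": ["Ana", "Ben"], "age ": [31, 27], "city": ["Lima", "Oslo"]})

try:
    df["name"]  # the real label is " name ", so this lookup fails
except KeyError as err:
    print(f"KeyError: {err}")

# Fix 1: normalize the labels once, right after loading the data.
df.columns = df.columns.str.strip()

# Fix 2: check membership before indexing when a column may be absent.
wanted = "name"
names = df[wanted].tolist() if wanted in df.columns else []
print(names)
```

Printing df.columns.tolist() before indexing is usually the quickest way to spot a misspelled or padded label.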
https://itsmycode.com/how-to-fix-keyerror-in-pandas/
@epythonlab
👍4
Learn Python from Scratch
PDF: https://cfm.ehu.es/ricardo/docs/python/Learning_Python.pdf
@epythonlab #pythonbooks
👍7🔥1
Project Idea: Building a spam classifier
Introduction
Spam detection is one of the major applications of Machine Learning in the interwebs today. Pretty much all of the major email service providers have spam detection systems built in and automatically classify such mail as 'Junk Mail'.
In this mission we will be using the Naive Bayes algorithm to create a model that can classify SMS messages in a dataset as spam or not spam, based on the training we give to the model. It is important to have some level of intuition as to what a spammy text message might look like.
What are spammy messages?
Usually they have words like 'free', 'win', 'winner', 'cash', 'prize', or similar words in them, as these texts are designed to catch your eye and tempt you to open them. Also, spam messages tend to have words written in all capitals and also tend to use a lot of exclamation marks. To the recipient, it is usually pretty straightforward to identify a spam text and our objective here is to train a model to do that for us!
Being able to identify spam messages is a binary classification problem, as messages are classified as either 'Spam' or 'Not Spam' and nothing else. Also, this is a supervised learning problem, as we know what we are trying to predict. We will be feeding a labelled dataset into the model, which it can learn from to make future predictions.
@epythonlab #projectIdea #ml
👍5
Implementing Bag of Words from scratch
Before we dive into scikit-learn's Bag of Words (BoW) library to do the dirty work for us, let's implement it ourselves first so that we can understand what's happening behind the scenes.
Step 1: Convert all strings to their lower case form.
Let's say we have a document set:
documents = ['Hello, how are you!',
'Win money, win from home.',
'Call me now.',
'Hello, Call hello you tomorrow?']
Instructions:
Convert all the strings in the documents set to their lower case. Save them into a list called 'lower_case_documents'. You can convert strings to their lower case in python by using the lower() method.
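One possible solution to this step, using a list comprehension. The variable names follow the instructions above.

```python
documents = ['Hello, how are you!',
             'Win money, win from home.',
             'Call me now.',
             'Hello, Call hello you tomorrow?']

# str.lower() returns a new lower-cased string; the originals are untouched.
lower_case_documents = [doc.lower() for doc in documents]
print(lower_case_documents)
```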
@epythonlab #codetip #naive_bayes #ml #AI
👍3🔥1
Step 2: Removing all punctuation
Instructions: Remove all punctuation from the strings in the document set. Save the strings into a list called 'sans_punctuation_documents'.
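One way to do this, sketched with str.translate and string.punctuation from the standard library. The post does not say which technique to use, so treat this choice as an assumption; the input list is the lower-cased output of Step 1.

```python
import string

lower_case_documents = ['hello, how are you!',
                        'win money, win from home.',
                        'call me now.',
                        'hello, call hello you tomorrow?']

# Translation table that deletes every ASCII punctuation character.
remove_punctuation = str.maketrans('', '', string.punctuation)

sans_punctuation_documents = [doc.translate(remove_punctuation)
                              for doc in lower_case_documents]
print(sans_punctuation_documents)
```

Note that string.punctuation covers ASCII punctuation only, so characters such as curly quotes would pass through unchanged.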
@epythonlab #codetip #ml #AI #naive_bayes
👍3
Step 3: Tokenization
Tokenizing a sentence in a document set means splitting up the sentence into individual words using a delimiter. The delimiter specifies what character we will use to identify the beginning and end of a word. Most commonly, we use a single space as the delimiter character for identifying words, and this is true in our documents in this case also.
Instructions: Tokenize the strings stored in 'sans_punctuation_documents' using the split() method. Store the final document set in a list called 'preprocessed_documents'.
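A short sketch of this step. It calls split() with no argument, which splits on any run of whitespace; for these documents that gives the same result as splitting on a single space.

```python
sans_punctuation_documents = ['hello how are you',
                              'win money win from home',
                              'call me now',
                              'hello call hello you tomorrow']

# Each document becomes a list of word tokens.
preprocessed_documents = [doc.split() for doc in sans_punctuation_documents]
print(preprocessed_documents)
```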
@epythonlab #codetip #ml #AI #naive_bayes
👍3
Step 4 and the last step: Count frequencies
Now that we have our document set in the required format, we can proceed to counting the occurrence of each word in each document of the document set. We will use the Counter class from Python's collections module for this purpose.
Counter counts the occurrence of each item in the list and returns a dictionary-like object with the key as the item being counted and the corresponding value being the count of that item in the list.
Instructions: Using Counter with preprocessed_documents as the input, create one dictionary per document, with the keys being the words in that document and the corresponding values being the frequency of occurrence of each word. Save each Counter dictionary as an item in a list called 'frequency_list'.
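A possible solution, building one Counter per tokenized document. The input is the output of Step 3.

```python
from collections import Counter

preprocessed_documents = [['hello', 'how', 'are', 'you'],
                          ['win', 'money', 'win', 'from', 'home'],
                          ['call', 'me', 'now'],
                          ['hello', 'call', 'hello', 'you', 'tomorrow']]

# One Counter (a dict subclass) of word frequencies per document.
frequency_list = [Counter(tokens) for tokens in preprocessed_documents]

for frequencies in frequency_list:
    print(dict(frequencies))
```

A Counter returns 0 for a word it has never seen, so frequency_list[0]['win'] does not raise a KeyError.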
@epythonlab #codetip #ml #AI #naive_bayes
👍5
Congratulations! We have implemented BoW from scratch using Python.
Here is a partial implementation of a spam classifier using the Naive Bayes algorithm:
https://t.me/epythonlab/689
In this post we implemented bag of words without using the scikit-learn library.
We followed four steps to implement BoW without a library:
1. Convert all strings to lowercase
https://t.me/epythonlab/690
2. Remove all punctuation
https://t.me/epythonlab/691
3. Tokenize
https://t.me/epythonlab/692
4. Count frequencies
https://t.me/epythonlab/693
N.B.: Follow all the steps above, then implement BoW using scikit-learn by yourself.
@epythonlab #AI #ML
👍7
Previously, we implemented BoW without scikit-learn, the Python-based ML library. You can read the post above. In the next step we will use the sklearn.feature_extraction.text.CountVectorizer class.
👍5
Implementing Bag of Words in scikit-learn
Now that we have implemented the BoW concept from scratch, let's go ahead and use scikit-learn to do this process in a clean and succinct way. We will use the same document set as we used in the previous step.
'''
Here we will look to create a frequency matrix on a smaller document set to make sure we understand how the
document-term matrix generation happens. We have created a sample document set 'documents'.
'''
Step 1:
documents = ['Hello, how are you!',
'Win money, win from home.',
'Call me now.',
'Hello, Call hello you tomorrow?']
Instructions: Import the sklearn.feature_extraction.text.CountVectorizer class and create an instance of it called 'count_vector'.
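A minimal sketch of this step, assuming scikit-learn is installed. The vectorizer is created with its default settings.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Default settings: lower-cases text and uses the built-in token pattern.
count_vector = CountVectorizer()
print(type(count_vector).__name__)
```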
👍3
Data preprocessing with CountVectorizer()
In the from-scratch implementation above, we built a version of CountVectorizer() ourselves, which entailed cleaning our data first. This cleaning involved converting all of our data to lower case and removing all punctuation marks. CountVectorizer() has certain parameters which take care of these steps for us. They are:
lowercase = True
The lowercase parameter has a default value of True, which converts all of our text to its lower case form.
token_pattern = (?u)\b\w\w+\b
The token_pattern parameter has a default regular expression value of (?u)\b\w\w+\b, which ignores all punctuation marks and treats them as delimiters, while accepting alphanumeric strings of length greater than or equal to 2 as individual tokens or words.
stop_words
The stop_words parameter, if set to 'english', will remove all words from our document set that match a list of English stop words which is defined in scikit-learn.
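To see what the default token_pattern does, the sketch below applies the same regular expression with Python's re module. This only approximates CountVectorizer's tokenization step for illustration, and the sample sentence is made up.

```python
import re

# The default token_pattern: words of two or more alphanumeric characters.
token_pattern = r"(?u)\b\w\w+\b"

text = "Hello, how are you! I am at home."  # hypothetical sample sentence

# Lower-case first (lowercase=True), then extract the tokens.
tokens = re.findall(token_pattern, text.lower())
print(tokens)
```

The single-character word "i" is dropped because the pattern requires at least two word characters, and the punctuation never appears in a token.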
👍4
Considering the small size of our dataset and the fact that we are dealing with SMS messages and not larger text sources like e-mail, we will not use stop words, and we won't be setting this parameter value.
You can take a look at all the parameter values of your count_vector object by simply printing out the object as follows:
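A sketch of this step, assuming scikit-learn is installed. Recent scikit-learn releases may print only the parameters that differ from their defaults, so get_params() is used here as well to list every value.

```python
from sklearn.feature_extraction.text import CountVectorizer

count_vector = CountVectorizer()

# Printing the object shows its configuration.
print(count_vector)

# get_params() returns every parameter and its current value as a dict.
params = count_vector.get_params()
for name in ("lowercase", "stop_words", "token_pattern"):
    print(name, "=", params[name])
```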
👍3
The get_feature_names_out() method (named get_feature_names() in scikit-learn releases before 1.0, and removed in 1.2) returns our feature names for this dataset, which is the set of words that make up our vocabulary for 'documents'. The vectorizer must be fitted on the documents before the vocabulary is available.
Instructions: Create a matrix with each row representing one of the 4 documents, and each column representing a word (feature name). Each value in the matrix will represent the frequency of the word in that column occurring in the particular document in that row. You can do this using the transform() method of the fitted CountVectorizer, passing in the document data set as the argument. The transform() method returns a SciPy sparse matrix of integer counts, which you can convert to a NumPy array using toarray(). Call the array 'doc_array'.
👍3
Now we have a clean representation of the documents in terms of the frequency distribution of the words in them. To make it easier to understand, our next step is to convert this array into a dataframe and name the columns appropriately.
Instructions: Convert the 'doc_array' we created into a dataframe, with the column names as the words (feature names). Call the dataframe 'frequency_matrix'.
👍3
Congratulations! We have successfully implemented a Bag of Words problem for a document dataset that we created. https://t.me/epythonlab/689
One potential issue that can arise from using this method is that if our dataset of text is extremely large (say if we have a large collection of news articles or email data), there will be certain values that are more common than others simply due to the structure of the language itself. For example, words like 'is', 'the', 'an', pronouns, grammatical constructs, etc., could skew our matrix and affect our analysis.
There are a couple of ways to mitigate this. One way is to use the stop_words parameter and set its value to 'english'. This will automatically ignore all the words in our input text that are found in a built-in list of English stop words in scikit-learn.
Another way of mitigating this is by using the TF-IDF (term frequency-inverse document frequency) method.
👍4
Question: We have implemented bag of words with and without scikit-learn. Study each solution to the problem posted here https://t.me/epythonlab/689 and write a short summary: which approach (plain Python without scikit-learn, or the scikit-learn library) do you think is the better option for implementing bag of words?
Send your summary to @asibehtenager. Your summary will be posted on the channel to help others learn from you.
👍3
How to get data from wikipedia
https://www.youtube.com/watch?v=xF-clSS2zM0&list=UUsFz0IGS9qFcwrh7a91juPg&index=35
YouTube
Access and Parse Data from Wikipedia with Wikipedia API in Python
In this video, you will learn about how to access data from Wikipedia and parse into your own local language using Wikipedia API in Python.
👍2
Data Cleansing in Pandas
https://www.youtube.com/watch?v=veF0_Bgk5Aw&list=UUsFz0IGS9qFcwrh7a91juPg&index=14
YouTube
How to Parse and Clean Data using Pandas Library
Hello everyone and welcome. In this video, you will learn about how to parse and clean data using Pandas Library.
#pandas #datacleansing #python
Ask your question in the telegram group https://t.me/epythonlab/
Thanks for watching!
👍4🔥1
Best-selling Python book on Amazon, for free. Who wants this book? You can find the PDF in the next post. Like or dislike?
#pythonbooks @epythonlab
👍4❤1