Every week we connect for DataScience Dialogue. This week we discussed AutoML and DataRobot Demo #2: I demonstrated DataRobot, a market-leading AutoML tool.
Unicorn - A Magical Full-stack Framework for Django - My Talk #2 https://youtu.be/UlFcUXxYSto
Unicorn is a reactive component framework for the Django web framework. It makes AJAX calls in the background, and dynamically updates the DOM.
Unicorn Benefits
1. Add in simple interactions to regular Django templates without learning a new templating…
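As a quick taste of the API, here is a minimal counter component, a sketch based on django-unicorn's documented UnicornView pattern; the component and method names are illustrative assumptions:

```python
# A minimal django-unicorn component sketch (names are illustrative).
# Assumes django-unicorn is installed and "django_unicorn" is listed
# in INSTALLED_APPS.
from django_unicorn.components import UnicornView

class CounterView(UnicornView):
    count: int = 0

    def increment(self):
        # Triggered from the template via unicorn:click="increment";
        # Unicorn makes the AJAX call and patches the DOM with the new count.
        self.count += 1
```

The matching template (templates/unicorn/counter.html) would render {{ count }} in an element carrying unicorn:click="increment", and the page embeds it with {% unicorn 'counter' %}.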
Machine learning steps (a minimal scikit-learn sketch follows this list):
Analyze the problem
Gather the data
Prepare the data
Choose the right model
Train the model
Evaluate the results
Look for biases
Tune it
Deploy the model
Monitor it
Retrain it
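Here is a sketch of the prepare/train/evaluate/tune steps with scikit-learn; the dataset and hyperparameter grid are illustrative assumptions, not a production pipeline:

```python
# Sketch of the core ML steps with scikit-learn (illustrative only).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Gather and prepare the data (held-out test set for honest evaluation)
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Choose the model: scaling + logistic regression in one pipeline
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Train and tune it with cross-validated grid search
grid = GridSearchCV(pipe, {"logisticregression__C": [0.01, 0.1, 1, 10]}, cv=5)
grid.fit(X_train, y_train)

# Evaluate the results on unseen data before deploying/monitoring
print(classification_report(y_test, grid.predict(X_test)))
```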
Fixing overfitting (sketch after the list):
Simplify the model (fewer parameters)
Simplify training data (fewer attributes)
Constrain the model (regularization)
Use cross-validation
Use early stopping
Build an ensemble
Gather more data
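A short sketch of two of these fixes, regularization and early stopping, scored with cross-validation; the models and the alpha value are illustrative assumptions:

```python
# Overfitting countermeasures in scikit-learn (illustrative values).
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, SGDRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=500, n_features=20, noise=10, random_state=0)

# Constrain the model: L2 regularization (larger alpha = stronger constraint),
# scored with cross-validation instead of a single train split.
ridge = Ridge(alpha=10.0)
print("CV R^2:", cross_val_score(ridge, X, y, cv=5).mean())

# Early stopping: hold out 10% of the data and stop once the
# validation score stops improving for 5 consecutive epochs.
sgd = SGDRegressor(early_stopping=True, validation_fraction=0.1,
                   n_iter_no_change=5, random_state=0)
sgd.fit(X, y)
print("stopped after", sgd.n_iter_, "epochs")
```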
Fixing underfitting (sketch after the list):
More complex model (more parameters)
Increase the number of features
Feature engineering should help
Unconstrain the model (no regularization)
Reduce noise in the data
Train for longer
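And the mirror image: a sketch of fixing underfitting by adding capacity through feature engineering; the synthetic quadratic data is an illustrative assumption:

```python
# Fixing underfitting: add engineered features (illustrative data).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = X[:, 0] ** 2 + rng.normal(0, 0.3, size=200)  # quadratic target

# A plain linear model underfits the quadratic relationship...
print("linear R^2:", LinearRegression().fit(X, y).score(X, y))

# ...so increase model capacity with polynomial feature engineering.
poly = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X, y)
print("poly R^2:  ", poly.score(X, y))
```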
I just want to know what you want to learn: hit the poll. Thank you.
Anonymous Poll
40% - AI/ML/DL
7% - API in Python - REST
14% - Python Web - Django & Flask
7% - Python Datascience
4% - Stats and Maths for Datascience
4% - Webscraping using Python
4% - Blockchain using Python
9% - Hacking using Python
4% - Core and Advanced Python
7% - DevOps using Python
YAML - Introduction, Read and Write YAML files using Python Script - Talk #3 https://youtu.be/XrPwTkyoBok
YAML in 29 mins video (PYYAML)
In this video, we discuss another configuration and data-serialisation format called YAML.
Introduction to YAML
How to create a YAML file?
How to read YAML using a Python script?
How to write a YAML file using a Python script?
Official Website…
YAML is an important configuration format for deploying AI models and for DevOps configuration.
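A minimal PyYAML sketch of the read/write steps covered in the talk; the file name and keys are illustrative assumptions:

```python
# Write and read a YAML config with PyYAML (pip install pyyaml).
# File name and keys are illustrative.
import yaml

config = {"model": "demo", "epochs": 10, "features": ["age", "balance"]}

# Write a YAML file
with open("config.yaml", "w") as f:
    yaml.safe_dump(config, f, default_flow_style=False)

# Read the YAML file back into a Python dict
with open("config.yaml") as f:
    loaded = yaml.safe_load(f)

print(loaded["epochs"])  # 10
```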
Free Azure training and a free voucher for AZ-900 (Azure Fundamentals)
https://www.microsoft.com/en-ie/training-days?mkt_tok=eyJpIjoiTWpZeFlqVXlabU13T1RaayIsInQiOiJpelVQcVFlSHZQUWR3YXBBT0dpVGhoVzdyNGJNc3R4TENJdzlaSjZUdlhha2JMdzR5b2JubmRZbnJObVJ1NjB4MHF6WFVEdE5HR2p4OXA5SkFBbmk5dlFlWTl5eDdzUG9LK2lNZE9jSVdsdDI1MnRcL29OZXVhRHduWitIb1RhNjREWmVIUTBnTEx4MVAxWXU5OSs3UGJnPT0ifQ==
Microsoft Virtual Training Days
Develop your skills in the free Microsoft Virtual Training Days
Simple multi-dataset detection
Github: https://github.com/xingyizhou/UniDet
Paper: https://arxiv.org/abs/2102.13086v1
UniDet: Object detection on multiple datasets with an automatically learned unified label space.
Amazon Web Services
◦ Amazon DynamoDB - 25 GB NoSQL database
◦ AWS Lambda - 1 million requests per month
◦ Amazon SNS - 1 million publishes per month
◦ Amazon CloudWatch - 10 custom metrics and 10 alarms
◦ Amazon Glacier - 10 GB long-term object storage
◦ Amazon SQS - 1 million message-queue requests
◦ AWS CodeBuild - 100 minutes of build time per month
◦ AWS CodeCommit - 5 active users per month
◦ AWS CodePipeline - 1 active pipeline per month
IBM Cloud
◦ Cloud Functions - 5 million executions per month
◦ Object Storage - 25 GB per month
◦ Cloudant database - 1 GB of data storage
◦ Db2 database - 100 MB of data storage
◦ API Connect - 50,000 API calls per month
◦ Availability Monitoring - 3 million data points per month
◦ Log Analysis - 500 MB of daily logs
Oracle Cloud
◦ Compute - 2 VM.Standard.E2.1.Micro instances, 1 GB RAM each
◦ Block Volume - 2 volumes, 100 GB total (used for compute)
◦ Object Storage - 10 GB
◦ Load Balancer - 1 instance with 10 Mbps
◦ Databases - 2 DBs, 20 GB each
◦ Monitoring - 500 million ingestion data points, 1 billion retrieval data points
◦ Bandwidth - 10 TB egress per month
◦ Notifications - 1 million delivery options per month, 1,000 emails sent per month
- 611 datasets you can download in one line of Python (see the one-liner below)
- 467 languages covered, 99 with at least 10 datasets
- efficient preprocessing to free you from memory constraints
https://github.com/huggingface/datasets
huggingface/datasets: 🤗 The largest hub of ready-to-use datasets for AI models, with fast, easy-to-use and efficient data manipulation tools
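The advertised one-liner in practice; "squad" here is just an example dataset name:

```python
# Download a ready-to-use dataset from the Hub in one line
# (pip install datasets; "squad" is an example name).
from datasets import load_dataset

dataset = load_dataset("squad")          # downloads and caches the data
print(dataset["train"][0]["question"])   # inspect the first record
```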
Q: Shall we remove the duplicate records (i.e. records with exactly the same features) from the dataset before training an ML model?
A: It depends. If the duplicated records belong to a single instance/event (e.g. when one instance is captured twice), they should be removed. For example, by looking at the customer_IDs, we may notice some of the customers are duplicated in our data. In this case, we should deduplicate. Otherwise, the ML model cannot estimate the prior probability distribution correctly.
On the other hand, if the records with the same features belong to different instances/events, we should keep them. For example, if two customers have the same age, sex, balance, etc., their data should still be used to train the model.
To understand why, consider a Naive Bayes model for a classification problem. Removing the samples with the same features makes the model misestimate the prior probabilities, which eventually affects the output.
Intuitively, the model needs to know the frequency/distribution of those duplicated records.
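A small pandas sketch of the distinction above; the column names (customer_ID, age, balance) are illustrative assumptions:

```python
# Deduplicate by instance ID, not by features (illustrative columns).
import pandas as pd

df = pd.DataFrame({
    "customer_ID": [1, 1, 2, 3],     # customer 1 was captured twice
    "age":         [30, 30, 30, 45],
    "balance":     [100, 100, 100, 250],
})

# Same instance recorded twice -> drop the extra copy by ID.
deduped = df.drop_duplicates(subset="customer_ID")

# Customers 1 and 2 share identical features but are different
# instances, so both rows survive and preserve the true class priors.
print(deduped)
```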