20 Popular Machine Learning Metrics. Part 1: Classification & Regression Evaluation Metrics
#DataScience #MachineLearning #ArtificialIntelligence
http://bit.ly/34aNEKN
❇️ @AI_Python_EN
What's the purpose of statistics?
"Do you think the purpose of existence is to pass out of existence is the purpose of existence?" - Ray Manzarek
The former Doors organist poses some fundamental questions to which definitive answers remain elusive. Happily, the purpose of statistics is easier to fathom, since humans are its creators. Put simply, it is to enhance decision making.
These decisions could be those made by scientists, businesspeople, politicians and other government officials, by medical and legal professionals, or even by religious authorities. In informal ways, ordinary folks also use statistics to help make better decisions.
How does it do this?
One way is by providing basic information, such as how many, how much and how often. The "stat" in statistics derives from the word state, as in nation state; as the field emerged as a formal discipline, describing nations quantitatively (e.g., population size, number of citizens working in manufacturing) became a fundamental purpose. Frequencies, means, medians and standard deviations are now familiar to nearly everyone.
Often we must rely on samples to make inferences about our population of interest. From a consumer survey, for example, we might estimate mean annual household expenditures on snack foods. This is known as inferential statistics, and confidence intervals will be familiar to anyone who has taken an introductory course in statistics. So will methods such as t-tests and chi-squared tests, which can be used to make population inferences about groups (e.g., are males more likely than females to eat pretzels?); a minimal sketch of such a test follows below.
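To make the pretzel example concrete, here is a minimal sketch of a chi-squared test of independence on a hypothetical 2x2 table of survey counts (the numbers are invented purely for illustration), using scipy:

```python
from scipy.stats import chi2_contingency

# Hypothetical survey counts: rows = males/females, columns = eats pretzels yes/no
table = [[220, 180],
         [190, 210]]

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi-squared = {chi2:.2f}, p-value = {p:.4f}")  # a small p-value suggests the groups differ
```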
Another way statistics helps us make decisions is by exploring relationships among variables through the use of cross tabulations, correlations and data visualizations. Exploratory data analysis (EDA) can also take on more complex forms and draw upon methods such as principal components analysis, regression and cluster analysis. EDA is often used to develop hypotheses which will be assessed more rigorously in subsequent research.
These hypotheses are often causal in nature, for example, why some people avoid snacks. Randomized experiments are generally considered the best approach in causal analysis but are not always possible or appropriate; see Why experiment? for some more thoughts on this subject. Hypotheses can also be developed and refined with the data, not simply tested through Null Hypothesis Significance Testing, though this practice has traditionally been frowned upon since the same data are being used for multiple purposes.
Many statisticians are actively involved in designing research, not merely using secondary data. This is a large subject but briefly summarized in Preaching About Primary Research.
Making classifications, predictions and forecasts is another traditional role of statistics. In a data science context, the first two are often called predictive analytics and employ methods such as random forests and standard (OLS) regression. Forecasting sales for the next year is a different matter and normally requires the use of time-series analysis. There is also unsupervised learning, which aims to find previously unknown patterns in unlabeled data. Using K-means clustering to partition consumer survey respondents into segments based on their attitudes is an example of this.
Quality control, operations research, what-if simulations and risk assessment are other areas where statistics plays a key role. There are many others, as this page illustrates.
The fuzzy buzzy term analytics is frequently used interchangeably with statistics, an offense to which I also plead guilty.
"The best thing about being a statistician is that you get to play in everyone's backyard." - John Tukey
#ai #artificialintelligence #ml #statistics #bigdata #machinelearning
#datascience
❇️ @AI_Python_EN
What is a Time Series?
Many data sets are cross-sectional and represent a single slice of time. However, we also have data collected over many periods - weekly sales data, for instance. This is an example of time series data. Time series analysis is a specialized branch of statistics used extensively in fields such as Econometrics and Operations Research. Unfortunately, most Marketing Researchers and Data Scientists still have little exposure to it. As we'll see, it has many important applications for marketers.
Just to get our terms straight, below is a simple illustration of what a time series data file looks like. The column labeled DATE is the date variable and plays the role a respondent ID plays in survey research data. WEEK, the sequence number of each week, is included because using this column rather than the actual dates can make graphs less cluttered. The sequence number can also serve as a trend variable in certain kinds of time series models.
I should note that the unit of analysis doesn't have to be brands and can include individual consumers or groups of consumers whose behavior is followed over time.
But first, why do we need to distinguish between cross-sectional and time series analysis? For several reasons, one being that our research objectives will usually be different. Another is that most statistical methods we learn in college and make use of in marketing research are intended for cross-sectional data, and if we apply them to time series data the results we obtain may be misleading. Time is a dimension in the data we need to take into account.
Time series analysis is a complex subject but, in short, when we use our usual cross-sectional techniques such as regression on time series data (a quick diagnostic sketch follows this list):
1- Standard errors can be far off. More often than not, p-values will be too small and variables can appear "more significant" than they really are;
2- In some cases regression coefficients can be seriously biased; and
3- We are not taking advantage of the information the serial correlation in the data provides.
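To illustrate points 1 and 3, here is a minimal sketch on simulated weekly data (all names and numbers are invented for illustration); it fits ordinary OLS, checks the residuals for serial correlation with the Durbin-Watson statistic, and compares the naive standard errors with Newey-West (HAC) standard errors from statsmodels:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

# Simulated weekly data: sales driven by advertising plus AR(1) (serially correlated) noise
rng = np.random.default_rng(1)
n = 156  # three years of weekly observations
adv = rng.normal(100, 10, n)
noise = np.zeros(n)
for t in range(1, n):
    noise[t] = 0.8 * noise[t - 1] + rng.normal(0, 5)
sales = 50 + 0.5 * adv + noise

X = sm.add_constant(adv)
naive = sm.OLS(sales, X).fit()                                          # ignores serial correlation
robust = sm.OLS(sales, X).fit(cov_type="HAC", cov_kwds={"maxlags": 8})  # Newey-West standard errors

print("Durbin-Watson:", durbin_watson(naive.resid))  # well below 2 signals positive autocorrelation
print("naive SE:", naive.bse[1], "  HAC SE:", robust.bse[1])
```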
Univariate Analysis
To return to our example data, one objective might be to forecast sales for our brand. There are many ways to do this and the most straightforward is univariate analysis, in which we essentially extrapolate future data from past data. Two popular univariate time series methods are Exponential Smoothing (e.g., Holt-Winters) and ARIMA (Autoregressive Integrated Moving Average).
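As a minimal sketch (assuming sales is a weekly pandas Series indexed by date; the variable name and settings are illustrative), both approaches are available in statsmodels:

```python
from statsmodels.tsa.holtwinters import ExponentialSmoothing
from statsmodels.tsa.arima.model import ARIMA

# Holt-Winters exponential smoothing with additive trend and yearly (52-week) seasonality
hw_fit = ExponentialSmoothing(sales, trend="add", seasonal="add",
                              seasonal_periods=52).fit()
hw_forecast = hw_fit.forecast(13)  # forecast the next quarter (13 weeks)

# A simple ARIMA(1,1,1) alternative; in practice the order would be chosen via diagnostics
arima_forecast = ARIMA(sales, order=(1, 1, 1)).fit().forecast(13)
```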
Causal Modeling
Obviously, there are risks in assuming the future will be like the past but, fortunately, we can also include "causal" (predictor) variables to help mitigate these risks. Besides improving the accuracy of our forecasts, another objective may be to understand which marketing activities most influence sales.
Causal variables will typically include data such as GRPs and price and also may incorporate data from consumer surveys or exogenous variables such as GDP. These kinds of analyses are called Market Response or Marketing Mix modeling and are a central component of ROMI (Return on Marketing Investment) analysis. They can be thought of as key driver analysis for time series data. The findings are often used in simulations to try to find the "optimal" marketing mix.
Transfer Function Models, ARMAX and Dynamic Regression are terms that refer to specialized regression procedures developed for time series data. There are more sophisticated methods as well, and I'll touch on a few in just a bit.
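One accessible route to an ARMAX-style model is the SARIMAX class in statsmodels, which accepts exogenous regressors. The sketch below assumes sales is a weekly Series and X is a DataFrame of causal variables such as price and GRPs (all names are illustrative):

```python
from statsmodels.tsa.statespace.sarimax import SARIMAX

# ARMA(1,1) errors plus exogenous causal variables (an ARMAX-style specification)
model = SARIMAX(sales, exog=X, order=(1, 0, 1))
fit = model.fit(disp=False)
print(fit.summary())

# Forecasting requires assumed future values (scenarios) for the causal variables
future_sales = fit.forecast(steps=13, exog=X_future)
```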
Part 1
❇️ @AI_Python_EN
Multiple Time Series
You might need to analyze multiple time series simultaneously, e.g., sales of your brands and key competitors. Figure 2 below is an example and shows weekly sales data for three brands over a one-year period. Since sales movements of brands competing with each other will typically be correlated over time, it often will make sense, and be more statistically rigorous, to include data for all key brands in one model instead of running separate models for each brand.
Vector Autoregression (VAR), the Vector Error Correction Model (VECM) and the more general State Space framework are three frequently-used approaches to multiple time series analysis. Causal data can be included and Market Response/Marketing Mix modeling conducted.
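As a minimal sketch (assuming brands is a DataFrame with one column of weekly sales per brand; the names are illustrative), a vector autoregression can be fit with statsmodels:

```python
from statsmodels.tsa.api import VAR

# brands: DataFrame with columns such as brand_a, brand_b, brand_c (weekly sales)
model = VAR(brands)
fit = model.fit(maxlags=8, ic="aic")   # lag order chosen by AIC
print(fit.summary())

# Forecast all series jointly, 13 weeks ahead
forecast = fit.forecast(brands.values[-fit.k_ar:], steps=13)
```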
Other Methods
There are several additional methods relevant to marketing research and data science I'll now briefly describe.
Panel Models include cross sections in a time series analysis. Sales and marketing data for several brands, for instance, can be stacked on top of one another and analyzed simultaneously. Panel modeling permits category-level analysis and also comes in handy when data are infrequent (e.g., monthly or quarterly).
Longitudinal Analysis is a generic and sometimes confusingly-used term that can refer to Panel modeling with a small number of periods ("short panels"), as well as to Repeated Measures, Growth Curve Analysis or Multilevel Analysis. In a literal sense it subsumes time series analysis but many authorities reserve that term for analysis of data with many time periods (e.g., >25). Structural Equation Modeling (SEM) is one method widely-used in Growth Curve modeling and other longitudinal analyses.
Survival Analysis is a branch of #statistics for analyzing the expected length of time until one or more events happen, such as death in biological organisms and failure in mechanical systems. It's also called Duration Analysis in Economics and Event History Analysis in Sociology. It is often used in customer churn analysis.
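As a minimal churn-flavored sketch (using the third-party lifelines package; the variable names are illustrative), a Kaplan-Meier curve can be estimated from customer tenure data:

```python
from lifelines import KaplanMeierFitter

# durations: months each customer has been observed
# churned: 1 if the customer cancelled during the observation window, 0 if still active (censored)
kmf = KaplanMeierFitter()
kmf.fit(durations, event_observed=churned)

print(kmf.median_survival_time_)   # typical time until churn
kmf.plot_survival_function()       # estimated probability of remaining a customer over time
```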
In some instances one model will not fit an entire series well because of structural changes within the series, and model parameters will vary across time. There are numerous breakpoint tests and models (e.g., State Space, Switching Regression) available for these circumstances.
You may also notice that sales, call center activity or other data series you are tracking exhibit clusters of volatility. That is, there may be periods in which the figures move up and down in much more extreme fashion than other periods.
In these cases, you should consider a class of models with the forbidding name of GARCH (Generalized Autoregressive Conditional Heteroskedasticity). ARCH and GARCH models were originally developed for financial markets but can be used for other kinds of time series data when volatility is of interest. Volatility can fall into many patterns and, accordingly, there are many flavors of GARCH models. Causal variables can be included. There are also multivariate extensions (MGARCH) if you have two or more series you wish to analyze jointly.
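As a minimal sketch (using the third-party arch package; the variable name is illustrative), a GARCH(1,1) model can be fit to a series of weekly changes:

```python
from arch import arch_model

# returns: a Series of weekly percentage changes in sales, call volume, or another tracked metric
res = arch_model(returns, vol="GARCH", p=1, q=1).fit(disp="off")
print(res.summary())

vol = res.conditional_volatility  # fitted volatility path, useful for spotting turbulent periods
```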
Non-Parametric Econometrics is a very different approach to studying time series and longitudinal data that is now receiving a lot of attention because of #bigdata and the greater computing power we now enjoy. These methods are increasingly feasible and useful as alternatives to the more familiar methods such as those described in this article.
#MachineLearning (e.g., #ArtificialNeuralNetworks) is also useful in some circumstances but the results can be hard to interpret - they predict well but may not help us understand the mechanism that generated the data (the why). To some extent, this drawback also applies to non-parametric techniques.
Most of the methods I've mentioned are Time Domain techniques. Another group of methods, known as Frequency Domain techniques, plays a more limited role in Marketing Research.
❇️ @AI_Python_EN
New tutorial! Traffic Sign Classification with #Keras and #TensorFlow 2.0
- 95% accurate
- Includes pre-trained model
- Full tutorial w/ #Python code
http://pyimg.co/5wzc5
#DeepLearning #MachineLearning #ArtificialIntelligence #DataScience #AI #computervision
❇️ @AI_Python_EN
"Differentiable Convex Optimization Layers"
CVXPY creates powerful new PyTorch and TensorFlow layers
Agrawal et al.: https://locuslab.github.io/2019-10-28-cvxpylayers/
#PyTorch #TensorFlow #NeurIPS2019
❇️ @AI_Python_EN
Streamlit is a Python framework dedicated to deploying Machine Learning and Data Science models. If you are a data scientist who's struggling to showcase your fantastic work, I would encourage you to check it out.
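As a minimal sketch of what a Streamlit app looks like (the file name and content are illustrative):

```python
# app.py -- run with: streamlit run app.py
import numpy as np
import pandas as pd
import streamlit as st

st.title("Model showcase demo")
n = st.slider("Number of points", min_value=10, max_value=500, value=100)

# Plot a random walk whose length reacts to the slider
df = pd.DataFrame({"week": np.arange(n), "sales": np.random.randn(n).cumsum()})
st.line_chart(df.set_index("week"))
```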
The author has a nice article/tutorial on Medium:
#machinelearning #mlops
#datascience
Github link:
https://github.com/streamlit/streamlit
❇️ @AI_Python_EN
Don't confuse creating a data product with deploying it. Streamlit makes creating a GUI-type data product easy for Python, much like shiny makes it easy for R.
Deployment is the next step: running it on a server, having security and user authentication, scaling it as required, updating the application when necessary, etc.
That's what the ownR platform does for shiny and dash already and will soon offer for Streamlit too. If you need to actually deploy your data products, especially in an enterprise environment, drop me an invite or a DM for a demo or a trial account!
❇️ @AI_Python_EN
Data science is not #MachineLearning .
Data science is not #statistics.
Data science is not analytics.
Data science is not #AI.
#DataScience is a process of:
Obtaining your data
Scrubbing / Cleaning your data
Exploring your data
Modeling your data
iNterpreting your data
Data Science is the science of extracting useful information from data using statistics, skills, experience and domain knowledge.
If you love data, you will like this role....
Solving business problems using data is data science. Machine learning, statistics or analytics may come in as part of the solution to a particular business problem. Sometimes we need all of them to solve a problem, and sometimes even a simple crosstab is handy.
➡️ Get free resources at his site:
www.claoudml.com
❇️ @AI_Python_EN
What Are "Panel Models?" Part 1
In #statistics, the English is sometimes as hard as the math. Vocabulary is frequently used in confusing ways and often differs by discipline. "Panel" and "longitudinal" are two examples - economists tend to favor the first term, while researchers in most other fields use the second to mean essentially the same thing.
But to what "thing" do they refer? Say, for example, households, individual household members, companies or brands are selected and followed over time. Statisticians working in many fields, such as economics and psychology, have developed numerous techniques which allow us to study how these households, household members, companies or brands change over time, and investigate what might have caused these changes.
Marketing mix modeling conducted at the category level is one example that will be close to home for many marketing researchers. In a typical case, we might have four years of weekly sales and marketing data for 6-8 brands in a product or service category. These brands would comprise the panel. This type of modeling is also known as cross-sectional time-series analysis because there is an explicit time component in the modeling. It is just one kind of panel/longitudinal analysis.
Marketing researchers make extensive use of online panels for consumer surveys. Panelists are usually not surveyed on the same topic on different occasions though they can be, in which case we would have a panel selected from an online panel. Some MROCs (aka insights communities) also are large and can be analyzed with these methods.
The reference manual for the Stata statistical software provides an in-depth look at many of these methods, particularly those widely-used in econometrics. I should note that there is a methodological connection with mixed-effects models, which I have briefly summarized here. Mplus is another statistical package which is popular among researchers in psychology, education and healthcare, and its website is another good resource.
Longitudinal/panel modeling has featured in countless papers and conference presentations over the years and is also the subject of many books. Here are some books I have found helpful:
Analysis of Longitudinal Data (Diggle et al.)
Analysis of Panel Data (Hsiao)
Econometric Analysis of Panel Data (Baltagi)
Longitudinal Structural Equation Modeling (Newsom)
Growth Modeling (Grimm et al.)
Longitudinal Analysis (Hoffman)
Applied Longitudinal Data Analysis for Epidemiology (Twisk)
Many of these methods can also be performed within a Bayesian statistical framework.
❇️ @AI_Python_EN
What Are "Panel Models?" Part 2
Rather than having been displaced by big data, AI and machine learning, these techniques are more valuable than ever because we now have more longitudinal data than ever. Many longitudinal methods are computationally intensive and not suitable for massive data on ordinary computers, but parallel processing and cloud computing will often get around this. I also anticipate that more big data versions of their computational algorithms will be developed over the next few years. Here is one example.
For those new to the subject who would like a quick (if somewhat technical) start, I’ve included some edited entries from the Stata reference manual’s glossary below.
Any copy/paste and editing errors are mine.
Arellano–Bond estimator. The Arellano–Bond estimator is a generalized method of moments (GMM) estimator for linear dynamic panel-data models that uses lagged levels of the endogenous variables as well as first differences of the exogenous variables as instruments. The Arellano–Bond estimator removes the panel-specific heterogeneity by first-differencing the regression equation.
autoregressive process. In autoregressive processes, the current value of a variable is a linear function of its own past values and a white-noise error term.
balanced data. A longitudinal or panel dataset is said to be balanced if each panel has the same number of observations.
between estimator. The between estimator is a panel-data estimator that obtains its estimates by running OLS on the panel-level means of the variables. This estimator uses only the between-panel variation in the data to identify the parameters, ignoring any within-panel variation. For it to be consistent, the between estimator requires that the panel-level means of the regressors be uncorrelated with the panel-specific heterogeneity terms.
correlation structure. A correlation structure is a set of assumptions imposed on the within-panel variance–covariance matrix of the errors in a panel-data model.
cross-sectional data. Cross-sectional data refers to data collected over a set of individuals, such as households, firms, or countries sampled from a population at a given point in time.
cross-sectional time-series data. Cross-sectional time-series data is another name for panel data. The term cross-sectional time-series data is sometimes reserved for datasets in which a relatively small number of panels were observed over many periods. See also panel data.
disturbance term. The disturbance term encompasses any shocks that occur to the dependent variable that cannot be explained by the conditional (or deterministic) portion of the model.
dynamic model. A dynamic model is one in which prior values of the dependent variable or disturbance term affect the current value of the dependent variable.
endogenous variable. An endogenous variable is a regressor that is correlated with the unobservable error term. Equivalently, an endogenous variable is one whose values are determined by the equilibrium or outcome of a structural model.
exogenous variable. An exogenous variable is a regressor that is not correlated with any of the unobservable error terms in the model. Equivalently, an exogenous variable is one whose values change independently of the other variables in a structural model.
fixed-effects model. The fixed-effects model is a model for panel data in which the panel-specific errors are treated as fixed parameters. These parameters are panel-specific intercepts and therefore allow the conditional mean of the dependent variable to vary across panels. The linear fixed effects estimator is consistent, even if the regressors are correlated with the fixed effects.
generalized estimating equations (GEE). The method of generalized estimating equations is used to fit population-averaged panel-data models. GEE extends the GLM method by allowing the user to specify a variety of different within-panel correlation structures.
generalized linear model. The generalized linear model is an estimation framework in which the user specifies a distributional family for the dependent variable and a link function that relates the dependent variable to a linear combination of the regressors. The distribution must be a member of the exponential family of distributions. The generalized linear model encompasses many common models, including linear, probit, and Poisson regression.
idiosyncratic error term. In longitudinal or panel-data models, the idiosyncratic error term refers to the observation-specific zero-mean random-error term. It is analogous to the random-error term of cross-sectional regression analysis.
instrumental variables. Instrumental variables are exogenous variables that are correlated with one or more of the endogenous variables in a structural model. The term instrumental variable is often reserved for those exogenous variables that are not included as regressors in the model.
instrumental-variables (IV) estimator. An instrumental variables estimator uses instrumental variables to produce consistent parameter estimates in models that contain endogenous variables. IV estimators can also be used to control for measurement error.
longitudinal data. Longitudinal data is another term for panel data.
overidentifying restrictions. The order condition for model identification requires that the number of exogenous variables excluded from the model be at least as great as the number of endogenous regressors. When the number of excluded exogenous variables exceeds the number of endogenous regressors, the model is overidentified, and the validity of the instruments can then be checked via a test of overidentifying restrictions.
panel data. Panel data are data in which the same units were observed over multiple periods. The units, called panels, are often firms, households, or patients who were observed at several points in time. In a typical panel dataset, the number of panels is large, and the number of observations per panel is relatively small.
panel-corrected standard errors (PCSEs). The term panel-corrected standard errors refers to a class of estimators for the variance–covariance matrix of the OLS estimator when there are relatively few panels with many observations per panel. PCSEs account for heteroskedasticity, autocorrelation, or cross-sectional correlation.
pooled estimator. A pooled estimator ignores the longitudinal or panel aspect of a dataset and treats the observations as if they were cross-sectional.
population-averaged model. A population-averaged model is used for panel data in which the parameters measure the effects of the regressors on the outcome for the average individual in the population. The panel-specific errors are treated as uncorrelated random variables drawn from a population with zero mean and constant variance, and the parameters measure the effects of the regressors on the dependent variable after integrating over the distribution of the random effects.
predetermined variable. A predetermined variable is a regressor in which its contemporaneous and future values are not correlated with the unobservable error term but past values are correlated with the error term.
prewhiten. To prewhiten is to apply a transformation to a time series so that it becomes white noise.
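Several of the entries above (generalized estimating equations, correlation structure, population-averaged model) come together in code. Here is a minimal sketch using statsmodels, assuming a long-format DataFrame df with illustrative column names:

```python
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Population-averaged model for weekly purchase counts, with an exchangeable
# within-panel (household) correlation structure
model = smf.gee("purchases ~ price + promo", groups="household", data=df,
                cov_struct=sm.cov_struct.Exchangeable(),
                family=sm.families.Poisson())
result = model.fit()
print(result.summary())
```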
❇️ @AI_Python_EN
What Are "Panel Models?" Part 3
random-coefficients model. A random-coefficients model is a panel-data model in which group-specific heterogeneity is introduced by assuming that each group has its own parameter vector, which is drawn from a population common to all panels.
random-effects model. A random-effects model for panel data treats the panel-specific errors as uncorrelated random variables drawn from a population with zero mean and constant variance. The regressors must be uncorrelated with the random effects for the estimates to be consistent.
regressand. The regressand is the variable that is being explained or predicted in a regression model. Synonyms include dependent variable, left-hand-side variable, and endogenous variable.
regressor. Regressors are variables in a regression model used to predict the regressand. Synonyms include independent variable, right-hand-side variable, explanatory variable, predictor variable, and exogenous variable.
strongly balanced. A longitudinal or panel dataset is said to be strongly balanced if each panel has the same number of observations and the observations for different panels were all made at the same times.
unbalanced data. A longitudinal or panel dataset is said to be unbalanced if each panel does not have the same number of observations.
weakly balanced. A longitudinal or panel dataset is said to be weakly balanced if each panel has the same number of observations but the observations for different panels were not all made at the same times.
within estimator. The within estimator is a panel-data estimator that removes the panel-specific heterogeneity by subtracting the panel-level means from each variable and then performing ordinary least squares on the demeaned data. The within estimator is used in fitting the linear fixed-effects model.
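To make the within (fixed-effects) estimator concrete, here is a minimal sketch that demeans a panel by hand and runs OLS on the result, assuming a long-format DataFrame df with illustrative column names. Note that plain OLS on demeaned data slightly understates standard errors because it ignores the degrees of freedom used by the panel means; the slope estimates, however, match the fixed-effects estimator.

```python
import statsmodels.api as sm

# df: long-format panel with columns 'brand' (panel id), 'sales', 'price', 'grps'
cols = ["sales", "price", "grps"]
demeaned = df[cols] - df.groupby("brand")[cols].transform("mean")  # remove panel-specific means

X = demeaned[["price", "grps"]]                 # no constant: demeaning removes the intercepts
within_fit = sm.OLS(demeaned["sales"], X).fit()
print(within_fit.params)                        # within (fixed-effects) slope estimates
```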
❇️ @AI_Python_EN
What is Cluster Analysis?
Practically anyone working in marketing research or data science has heard of cluster analysis, but there are many misunderstandings about what it is. This is not surprising since cluster analysis originated outside the business world and is frequently applied in ways we may not be familiar with.
#Clusteranalysis is actually not just one thing and is an umbrella term for a very large family of methods which includes familiar approaches such as K-means and hierarchical agglomerative clustering (HAC). For those of you interested in a detailed look at cluster analysis, below are some excellent if technical books on or related to cluster analysis:
* Cluster Analysis (Everitt et al.)
* Data Clustering (Aggarwal and Reddy)
* Handbook of Cluster Analysis (Hennig et al.)
* Applied Biclustering Methods (Kasim et al.)
* Finite Mixture and Markov Switching Models (Frühwirth-Schnatter)
* Latent Class and Latent Transition Analysis (Collins and Lanza)
* Advances in Latent Class Analysis (Hancock et al.)
* Market Segmentation (Wedel and Kamakura)
"Cluster analysis – also known as unsupervised learning – is used in multivariate statistics to uncover latent groups suspected in the data or to discover groups of homogeneous observations. The aim is thus often defined as partitioning the data such that the groups are as dissimilar as possible and that the observations within the same group are as similar as possible. The groups forming the partition are also referred to as clusters.
Cluster analysis can be used for different purposes. It can be employed
(1) as an exploratory tool to detect structure in multivariate data sets such that the results allow the data to be summarized and represented in a simplified and shortened form,
(2) to perform vector quantization and compress the data using suitable prototypes and prototype assignments and
(3) to reveal a latent group structure which corresponds to unobserved heterogeneity.
A standard statistical textbook on cluster analysis is, for example, Everitt et al. (2011).
Clustering is often referred to as an ill-posed problem which aims to reveal interesting structures in the data or to derive a useful grouping of the observations. However, specifying what is interesting or useful in a formal way is challenging. This complicates the specification of suitable criteria for selecting a clustering method or a final clustering solution. Hennig (2015) also emphasizes this point. He argues that the definition of the true clusters depends on the context and on the aim of clustering. Thus there does not exist a unique clustering solution given the data, but different aims of clustering imply different solutions, and analysts should in general be aware of the ambiguity inherent in cluster analysis and thus be transparent about their clustering aims when presenting the solutions obtained.
At the core of cluster analysis is the definition of what a cluster is. This can be achieved by defining the characteristics of the clusters which should emerge as output from the analysis. Often these characteristics can only be informally defined and are not directly useful for selecting a suitable clustering method. In addition, some notion of the total number of clusters suspected or the expected size of clusters might be needed to characterize the cluster problem. Furthermore, domain knowledge is important for deciding on a suitable solution, in the sense that the derived partition consists of interpretable clusters that have practical relevance. However, domain experts are often only able to assess the suitability of a solution once they are confronted with a grouping but are unable to provide clear characteristics of the desired clustering beforehand."
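To ground the discussion, here is a minimal K-means sketch with scikit-learn, assuming attitudes is a respondents-by-rating-items array from a consumer survey (the name and the choice of four clusters are illustrative). In practice the number of clusters would be compared across several values and judged against domain knowledge, as the passage above stresses.

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(attitudes)  # put rating scales on a common footing

km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
labels = km.labels_                            # cluster assignment for each respondent

print("silhouette:", silhouette_score(X, labels))  # one (of many) internal quality measures
```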
❇️ @AI_Python_EN
OpenAI announced the final staged release of its 1.5 billion parameter language model GPT-2, along with all associated code and model weights
https://medium.com/syncedreview/openai-releases-1-5-billion-parameter-gpt-2-model-c34e97da56c0
❇️ @AI_Python_EN
NeurIPS 2019: Adversarial music that prevents Amazon Alexa from waking up
https://www.profillic.com/paper/arxiv:1911.00126
The attack targets the wake-word detection system, jamming the model with inconspicuous background music to keep the voice assistants (VAs) deactivated while the audio adversary is present.
❇️ @AI_Python_EN
The Reinforcement-Learning Methods that Allow AlphaStar to Outcompete Almost All Human Players at StarCraft II
https://bit.ly/2NjezOU
The new AlphaStar achieved Grandmaster level at StarCraft II overcoming some of the limitations of the previous version. How did it do it?
What are the three types of error in a #ML model?
👉 1. Bias - error caused by choosing an algorithm that cannot accurately model the signal in the data, i.e. the model is too general or was incorrectly selected. For example, selecting a simple linear regression to model highly non-linear data would result in error due to bias.
👉 2. Variance - error from an estimator being too specific and learning relationships that are specific to the training set but do not generalize to new samples well. Variance can come from fitting too closely to noise in the data, and models with high variance are extremely sensitive to changing inputs. Example: Creating a decision tree that splits the training set until every leaf node only contains 1 sample.
👉 3. Irreducible error - error caused by noise in the data that cannot be removed through modeling. Example: inaccuracy in data collection causes irreducible error. (A short sketch illustrating all three follows below.)
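As a minimal sketch with scikit-learn (the data are simulated, so all numbers are illustrative): a linear model underfits a nonlinear signal (bias), an unconstrained decision tree memorizes the training noise (variance), and the injected noise itself is the irreducible error neither model can remove:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(300, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=300)   # sin(x) signal + irreducible noise

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {
    "linear regression (high bias)": LinearRegression(),
    "deep decision tree (high variance)": DecisionTreeRegressor(min_samples_leaf=1, random_state=0),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    print(name,
          "| train MSE:", round(mean_squared_error(y_tr, model.predict(X_tr)), 3),
          "| test MSE:", round(mean_squared_error(y_te, model.predict(X_te)), 3))
```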
❇️ @AI_Python_EN
François Chollet (Google, Creator of Keras) just released a paper on defining and measuring intelligence and a GitHub repo that includes a new #AI evaluation dataset, ARC – "Abstraction and Reasoning Corpus".
Paper: https://arxiv.org/abs/1911.01547
ARC: https://github.com/fchollet/ARC
#AI #machinelearning #deeplearning
❇️ @AI_Python_EN