AI, Python, Cognitive Neuroscience
There are now many methods we can use when our dependent variable is not continuous. SVM, XGBoost and Random Forests are some popular ones.

There are also "traditional" methods, such as Logistic Regression. These usually scale well and, when used properly, are competitive in terms of predictive accuracy.

They are probabilistic models, which gives them additional flexibility. They are also often easier to interpret, which is critical when the goal is explanation, not just prediction.
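
As a quick illustration, here is a minimal logistic regression sketch with statsmodels; the data are simulated and the variable names are made up for the example:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data: whether a customer purchased (1/0), given age and income
rng = np.random.default_rng(0)
df = pd.DataFrame({"age": rng.normal(40, 10, 500),
                   "income": rng.normal(50, 15, 500)})
true_logit = 0.03 * df["age"] + 0.05 * df["income"] - 4
df["purchase"] = rng.binomial(1, 1 / (1 + np.exp(-true_logit)))

model = smf.logit("purchase ~ age + income", data=df).fit()
print(model.summary())       # coefficients, standard errors, p-values
print(np.exp(model.params))  # odds ratios - the interpretable part
```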

They can be more work, however, and are probably easier to misuse than newer methods such as Random Forests. Here are some excellent books on these methods that may be of interest:

- Categorical Data Analysis (Agresti)
- Analyzing Categorical Data (Simonoff)
- Regression Models for Categorical Dependent Variables (Long and Freese)
- Generalized Linear Models and Extensions (Hardin and Hilbe)
- Regression Modeling Strategies (Harrell)
- Applied Logistic Regression (Hosmer and Lemeshow)
- Logistic Regression Models (Hilbe)
- Analysis of Ordinal Categorical Data (Agresti)
- Applied Ordinal Logistic Regression (Liu)
- Modeling Count Data (Hilbe)
- Negative Binomial Regression (Hilbe)
- Handbook of Survival Analysis (Klein et al.)
- Survival Analysis: A Self-Learning Text (Kleinbaum and Klein)


#statistics #book #Machinelearning

✴️ @AI_Python
#Statistics such as correlation, mean and standard deviation (variance) create strong visual images and meaning. Two different #datasets with the same correlation would sort of look the same. Right?

Not so much.

Each of these very different-looking graphs plots a dataset with the same correlation, mean and SD. This is why plotting data is so important, though oddly (in my experience) it is so rarely done.

https://bit.ly/2oZ29MP
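
The same point can be reproduced with Anscombe's quartet, which ships with seaborn; a minimal sketch:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Anscombe's quartet: four datasets with near-identical means, SDs and
# correlations, yet strikingly different shapes when plotted
df = sns.load_dataset("anscombe")
print(df.groupby("dataset")[["x", "y"]].agg(["mean", "std"]))
print(df.groupby("dataset").apply(lambda d: d["x"].corr(d["y"])))

sns.lmplot(data=df, x="x", y="y", col="dataset", col_wrap=2, ci=None)
plt.show()
```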

✴️ @AI_Python_EN
The field of statistics has a very long history, dating back to ancient times.

Much of marketing data science can be traced to the origins of actuarial science, demography, sociology and psychology, with early statisticians playing major roles in all of these fields.

Big is relative, and statisticians have been working with "big data" all along. "Machine learners" such as SVM and random forests originated in statistics, and neural nets were inspired as much by regression as by theories of the human brain.

Statisticians are involved in a diverse range of fields, including marketing, psychology, pharmacology, economics, meteorology, political science and ecology, and have helped develop research methods and analytics for nearly any kind of data.

The history and richness of #statistics is not always appreciated, though. For example, this morning I was asked "How's your #machinelearning?" :-)

✴️ @AI_Python_EN
Sampling is a deceptively complex subject, and some academic statisticians have devoted the bulk of their careers to it.

It's not a subject that thrills everyone, but it is a very important one, and one that seems underappreciated in marketing research and #data science.

Here are some books on or related to sampling I've found helpful:

- Survey Sampling (Kish)
- Sampling Techniques (Cochran)
- Model Assisted Survey Sampling (Särndal et al.)
- Sampling: Design and Analysis (Lohr)
- Practical Tools for Designing and Weighting Survey Samples (Valliant et al.)
- Survey Weights: A Step-by-step Guide to Calculation (Valliant and Dever)
- Complex Surveys (Lumley)
- Hard-to-Survey Populations (Tourangeau et al.)
- Small Area Estimation (Rao and Molina)


The first three are regarded as classics (though still relevant). Sharon Lohr's book is the friendliest introduction I know of on this subject. Standard marketing research textbooks also give simple overviews of sampling but do not go into depth.

There are also academic journals that feature articles on sampling, such as the Public Opinion Quarterly (AAPOR) and the Journal of Survey #Statistics and Methodology (AAPOR and ASA).

✴️ @AI_Python_EN
This is Your Brain on Code 🧠💻🔢
Computer programming is often associated with math, but researchers used functional MRI scans to show the role of the brain's language processing centers: https://lnkd.in/eN_-3RA

#datascience #machinelearning #ai #bigdata #analytics #statistics #artificialintelligence #datamining #computing #programmers #neuroscience

✴️ @AI_Python_EN
Uncertainty in big data analytics: survey, opportunities, and challenges

https://journalofbigdata.springeropen.com/articles/10.1186/s40537-019-0206-3

#BigData #statistics #NLP

✴️ @AI_Python_EN
"Data in the Life: Authorship Attribution in Lennon-McCartney Songs" was just published in the first issue of the Harvard Data Science Review (HDSR), the inaugural publication of Harvard Data Science, published by the MIT Press. Combining features of a premier research journal, a leading educational publication, and a popular magazine, HDSR leverages digital technologies and data visualizations to facilitate author-reader interactions globally. Besides our article, the first issue features articles on topics ranging from machine learning models for predicting drug approvals to artificial intelligence. Read it now:
https://bit.ly/2Kuze2q
#datascience #bigdata #machinelearning #statistics #AI

✴️ @AI_Python_EN
Great statistical software for beginners.

Here is the Gretl tutorial series by Simone Gasperin:

1) Simple Linear Regression
https://lnkd.in/ecfsV9c

2) Coding Dummy Variables
https://lnkd.in/ef7Yd7f

3) Forecasting New Observations
https://lnkd.in/eNKbxbU

4) Forecasting a Large Number of Observations
https://lnkd.in/eHmibGs

5) Logistic Regression
https://lnkd.in/eRfhQ87

6) Forecasting and Confusion Matrix
https://lnkd.in/eaqrFJr

7) Modeling and Forecasting Time Series Data
https://lnkd.in/e6fqKpF

8) Comparing Time Series Trend Models
https://lnkd.in/eKjEUAE

#datascience #machinelearning #statistics #dataanalytics #dataanalysis

✴️ @AI_Python_EN
1-point RANSAC for Circular Motion Estimation in Computed Tomography (CT)
https://deepai.org/publication/1-point-ransac-for-circular-motion-estimation-in-computed-tomography-ct

by Mikhail O. Chekanov et al.
#Statistics #Estimator

❇️ @AI_Python_EN
What's the purpose of statistics?

"Do you think the purpose of existence is to pass out of existence is the purpose of existence?" - Ray Manzarek

The former Doors organist poses some fundamental questions to which definitive answers remain elusive. Happily, the purpose of statistics is easier to fathom since humans are its creator. Put simply, it is to enhance decision making.

These decisions could be those made by scientists, businesspeople, politicians and other government officials, by medical and legal professionals, or even by religious authorities. In informal ways, ordinary folks also use statistics to help make better decisions.

How does it do this?

One way is by providing basic information, such as how many, how much and how often. The "stat" in statistics is derived from the word state, as in nation state; as statistics emerged as a formal discipline, describing nations quantitatively (e.g., population size, number of citizens working in manufacturing) became a fundamental purpose. Frequencies, means, medians and standard deviations are now familiar to nearly everyone.

Often we must rely on samples to make inferences about our population of interest. From a consumer survey, for example, we might estimate mean annual household expenditures on snack foods. This is known as inferential statistics, and confidence intervals will be familiar to anyone who has taken an introductory course in statistics. So will methods such as t-tests and chi-squared tests, which can be used to make population inferences about groups (e.g., are males more likely than females to eat pretzels?).
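
A minimal sketch of both ideas with scipy, using made-up numbers:

```python
import numpy as np
from scipy import stats

# Made-up 2x2 table: rows = male/female, columns = eats pretzels yes/no
table = np.array([[130, 70],
                  [105, 95]])
chi2, p, dof, expected = stats.chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, df = {dof}, p = {p:.4f}")

# 95% confidence interval for mean annual snack-food spend (simulated sample)
spend = np.random.default_rng(1).normal(480, 120, 400)
lo, hi = stats.t.interval(0.95, df=len(spend) - 1,
                          loc=spend.mean(), scale=stats.sem(spend))
print(f"mean = {spend.mean():.0f}, 95% CI = ({lo:.0f}, {hi:.0f})")
```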

Another way statistics helps us make decisions is by exploring relationships among variables through the use of cross tabulations, correlations and data visualizations. Exploratory data analysis (EDA) can also take on more complex forms and draw upon methods such as principal components analysis, regression and cluster analysis. EDA is often used to develop hypotheses which will be assessed more rigorously in subsequent research.
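
A minimal EDA sketch along these lines, on a made-up survey extract:

```python
import pandas as pd

# Made-up survey extract
df = pd.DataFrame({
    "gender":  ["M", "F", "F", "M", "F", "M", "F", "M"],
    "snacker": ["yes", "yes", "no", "no", "yes", "yes", "no", "yes"],
    "age":     [34, 28, 45, 52, 31, 40, 38, 25],
    "spend":   [520, 610, 180, 150, 540, 480, 210, 630],
})

# Cross tabulation of two categorical variables (row percentages)
print(pd.crosstab(df["gender"], df["snacker"], normalize="index"))

# Correlation of the numeric variables
print(df[["age", "spend"]].corr())
```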

These hypotheses are often causal in nature, for example, why some people avoid snacks. Randomized experiments are generally considered the best approach in causal analysis but are not always possible or appropriate; see Why experiment? for some more thoughts on this subject. Hypotheses can be further developed and refined, not simply tested through Null Hypothesis Significance Testing, though this has been traditionally frowned upon since we are using the same data for multiple purposes.

Many statisticians are actively involved in designing research, not merely using secondary data. This is a large subject but briefly summarized in Preaching About Primary Research.

Making classifications, predictions and forecasts is another traditional role of statistics. In a data science context, the first two are often called predictive analytics and employ methods such as random forests and standard (OLS) regression. Forecasting sales for the next year is a different matter and normally requires the use of time-series analysis. There is also unsupervised learning, which aims to find previously unknown patterns in unlabeled data. Using K-means clustering to partition consumer survey respondents into segments based on their attitudes is an example of this.
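
As a sketch of the unsupervised case, here is K-means with scikit-learn on simulated attitude ratings:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Simulated attitude ratings (1-7 scales) from 300 survey respondents
rng = np.random.default_rng(2)
ratings = rng.integers(1, 8, size=(300, 5)).astype(float)

X = StandardScaler().fit_transform(ratings)  # put items on a common scale
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)

print(np.bincount(km.labels_))  # segment sizes
print(km.cluster_centers_)      # segment profiles (standardized units)
```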

Quality control, operations research, what-if simulations and risk assessment are other areas where statistics play a key role. There are many others, as this page illustrates.

The fuzzy buzzy term analytics is frequently used interchangeably with statistics, an offense to which I also plead guilty.

"The best thing about being a statistician is that you get to play in everyone's backyard." - John Tukey

#ai #artificialintelligence #ml #statistics #bigdata #machinelearning
#datascience

❇️ @AI_Python_EN
What is a Time Series?

Many data sets are cross-sectional and represent a single slice of time. However, we also have data collected over many periods - weekly sales data, for instance. This is an example of time series data.

Multiple Time Series

You might need to analyze multiple time series simultaneously, e.g., sales of your brands and key competitors. Consider, for example, weekly sales data for three brands over a one-year period. Since sales movements of brands competing with each other will typically be correlated over time, it often makes sense, and is more statistically rigorous, to include data for all key brands in one model instead of running separate models for each brand.

Vector Autoregression (VAR), the Vector Error Correction Model (VECM) and the more general State Space framework are three frequently-used approaches to multiple time series analysis. Causal data can be included and Market Response/Marketing Mix modeling conducted.
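
For those who want to experiment, here is a minimal VAR sketch with statsmodels; the file name and column layout are assumptions for the example:

```python
import pandas as pd
from statsmodels.tsa.api import VAR

# Assumed file layout: weekly sales, one column per brand
sales = pd.read_csv("weekly_sales.csv", index_col="week", parse_dates=True)

model = VAR(sales)                        # one joint model for all brands
results = model.fit(maxlags=8, ic="aic")  # lag order chosen by AIC
print(results.summary())

# Impulse responses: how a shock to one brand's sales ripples through the others
results.irf(12).plot()
```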

Other Methods

There are several additional methods relevant to marketing research and data science, which I'll now briefly describe.

Panel Models include cross sections in a time series analysis. Sales and marketing data for several brands, for instance, can be stacked on top of one another and analyzed simultaneously. Panel modeling permits category-level analysis and also comes in handy when data are infrequent (e.g., monthly or quarterly).

Longitudinal Analysis is a generic and sometimes confusingly used term that can refer to Panel modeling with a small number of periods ("short panels"), as well as to Repeated Measures, Growth Curve Analysis or Multilevel Analysis. In a literal sense it subsumes time series analysis, but many authorities reserve that term for analysis of data with many time periods (e.g., >25). Structural Equation Modeling (SEM) is one method widely used in Growth Curve modeling and other longitudinal analyses.

Survival Analysis is a branch of #statistics for analyzing the expected length of time until one or more events happen, such as death in biological organisms and failure in mechanical systems. It's also called Duration Analysis in Economics and Event History Analysis in Sociology. It is often used in customer churn analysis.
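
A minimal Kaplan-Meier sketch for churn with the lifelines package, on made-up tenures:

```python
import pandas as pd
from lifelines import KaplanMeierFitter

# Made-up churn data: months as a customer, and whether churn was observed
df = pd.DataFrame({
    "tenure_months": [3, 12, 7, 24, 18, 5, 30, 9],
    "churned":       [1,  0, 1,  0,  1, 1,  0, 1],  # 0 = still active (censored)
})

kmf = KaplanMeierFitter()
kmf.fit(df["tenure_months"], event_observed=df["churned"])
print(kmf.median_survival_time_)
kmf.plot_survival_function()  # estimated probability of surviving past time t
```
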
In some instances one model will not fit an entire series well because of structural changes within the series, and model parameters will vary across time. There are numerous breakpoint tests and models (e.g., State Space, Switching Regression) available for these circumstances.

You may also notice that sales, call center activity or other data series you are tracking exhibit clusters of volatility. That is, there may be periods in which the figures move up and down in much more extreme fashion than other periods.

In these cases, you should consider a class of models with the forbidding name of GARCH (Generalized Autoregressive Conditional Heteroskedasticity). ARCH and GARCH models were originally developed for financial markets but can be used for other kinds of time series data when volatility is of interest. Volatility can fall into many patterns and, accordingly, there are many flavors of GARCH models. Causal variables can be included. There are also multivariate extensions (MGARCH) if you have two or more series you wish to analyze jointly.
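
A minimal GARCH(1,1) sketch with the arch package, on a simulated series standing in for your own data:

```python
import numpy as np
import pandas as pd
from arch import arch_model

# Simulated heavy-tailed series standing in for weekly percent changes
changes = pd.Series(np.random.default_rng(3).standard_t(5, 500))

# GARCH(1,1): today's variance depends on yesterday's shock and variance
am = arch_model(changes, vol="GARCH", p=1, q=1, dist="t")
res = am.fit(disp="off")
print(res.summary())

res.conditional_volatility.plot(title="Fitted conditional volatility")
```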

Non-Parametric Econometrics is a very different approach to studying time series and longitudinal data that is now receiving a lot of attention because of #bigdata and the greater computing power we now enjoy. These methods are increasingly feasible and useful as alternatives to the more familiar methods such as those described in this article.

#MachineLearning (e.g., #ArtificialNeuralNetworks) is also useful in some circumstances, but the results can be hard to interpret - they predict well but may not help us understand the mechanism that generated the data (the Why). To some extent, this drawback also applies to non-parametric techniques.

Most of the methods I've mentioned are Time Domain techniques. Another group of methods, known as Frequency Domain, plays a more limited role in Marketing Research.

❇️ @AI_Python_EN
Data science is not #MachineLearning .
Data science is not #statistics.
Data science is not analytics.
Data science is not #AI.

#DataScience is a process of:
Obtaining your data
Scrubbing / Cleaning your data
Exploring your data
Modeling your data
iNterpreting your data

Data Science is the science of extracting useful information from data using statistics, skills, experience and domain knowledge.

If you love data, you will like this role...

Solving business problems using data is data science. Machine learning, statistics or analytics may come into play as part of the solution to a particular business problem. Sometimes we may need all of them, and sometimes even a simple crosstab may be handy.

➡️ Get free resources at his site:
www.claoudml.com

❇️ @AI_Python_EN
What Are "Panel Models?"​ Part 1

In #statistics, the English is sometimes as hard as the math. Vocabulary is frequently used in confusing ways and often differs by discipline. "Panel" and "longitudinal" are two examples - economists tend to favor the first term, while researchers in most other fields use the second to mean essentially the same thing.

But to what "thing" do they refer? Say, for example, households, individual household members, companies or brands are selected and followed over time. Statisticians working in many fields, such as economics and psychology, have developed numerous techniques which allow us to study how these households, household members, companies or brands change over time, and investigate what might have caused these changes.

Marketing mix modeling conducted at the category level is one example that will be close to home for many marketing researchers. In a typical case, we might have four years of weekly sales and marketing data for 6-8 brands in a product or service category. These brands would comprise the panel. This type of modeling is also known as cross-sectional time-series analysis because there is an explicit time component in the modeling. It is just one kind of panel/longitudinal analysis.
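
A minimal brand fixed-effects sketch of this kind of model with the linearmodels package; the file and variable names are assumptions for the example:

```python
import pandas as pd
from linearmodels.panel import PanelOLS

# Assumed layout: one row per brand per week, stacked
df = pd.read_csv("brand_weeks.csv").set_index(["brand", "week"])

# Sales as a function of price and ad spend, with brand fixed effects
mod = PanelOLS.from_formula("sales ~ 1 + price + adspend + EntityEffects",
                            data=df)
res = mod.fit(cov_type="clustered", cluster_entity=True)
print(res.summary)  # summary is a property in linearmodels
```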

Marketing researchers make extensive use of online panels for consumer surveys. Panelists are usually not surveyed on the same topic on different occasions, though they can be, in which case we would have a panel selected from an online panel. Some MROCs (aka insights communities) are also large and can be analyzed with these methods.

The reference manual for the Stata statistical software provides an in-depth look at many of these methods, particularly those widely-used in econometrics. I should note that there is a methodological connection with mixed-effects models, which I have briefly summarized here. Mplus is another statistical package which is popular among researchers in psychology, education and healthcare, and its website is another good resource.

Longitudinal/panel modeling has featured in countless papers and conference presentations over the years and is also the subject of many books. Here are some books I have found helpful:

- Analysis of Longitudinal Data (Diggle et al.)
- Analysis of Panel Data (Hsiao)
- Econometric Analysis of Panel Data (Baltagi)
- Longitudinal Structural Equation Modeling (Newsom)
- Growth Modeling (Grimm et al.)
- Longitudinal Analysis (Hoffman)
- Applied Longitudinal Data Analysis for Epidemiology (Twisk)

Many of these methods can also be performed within a Bayesian statistical framework.

❇️ @AI_Python_EN