Priv.-Doz. Dr. Stefan Bosse
University of Koblenz-Landau, Fac. Computer Science
University of Bremen, Dept. Mathematics & Informatics 18.5.2018
sbosse@uni-bremen.de
In social science big data volumes must be handled.
But big does not mean helpful or important!
Data is noisy and uncertain!?
F(Input Data): Input Data → Output Data
⇔
F(Sensor Data): Sensor Data → Knowledge
Such a function F performs Feature Extraction
But often there are no or only partial numerical/mathematical models that can implement F!
Artificial Intelligence and its methods can help to derive such fundamental mapping functions - or at least an approximation: a hypothesis
The input data is commonly characterized by a high dimensionality: an input vector of variables
x = [x1, x2, .., xN]
is mapped to a much lower-dimensional output vector
y = [y1, y2, .., yM]
F: R^N → R^M with M ≪ N, f(x): x → y.
Machine Learning (ML) can be used to derive such a relation from experimental/empirical training data!
Besides the derivation of such functional relations, the prediction of what will happen next or in the future is an important task of Machine Learning.
Patient Details [weight,age,sex,pain left, pain right, temperature, ..]
Diagnosis Label {Appendicitis, Dyspepsia, Unknown, .. }
A classifier returns one of the labels matching a new input vector x (the test object).
Decision classifiers return only one (good or bad) matching label.
There is no information about the matching probability.
⇒ Technical Problems!
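The following minimal sketch (scikit-learn and the toy patient vectors are assumptions, not part of the slides) contrasts the single-label output of a decision classifier with the class-membership estimates of a probabilistic one:

```python
# Hypothetical patient data; scikit-learn's DecisionTreeClassifier is used
# only as a stand-in for any decision classifier.
from sklearn.tree import DecisionTreeClassifier

# Toy feature vectors: [weight, age, sex(0/1), pain_left(0/1), pain_right(0/1), temperature]
X = [[70, 25, 0, 0, 1, 38.5],
     [82, 40, 1, 1, 0, 37.0],
     [55, 19, 0, 0, 1, 39.1],
     [90, 55, 1, 0, 0, 36.8]]
y = ["Appendicitis", "Dyspepsia", "Appendicitis", "Unknown"]

clf = DecisionTreeClassifier().fit(X, y)

x_test = [[60, 30, 0, 0, 1, 38.9]]              # a new patient (the test object)
print(clf.predict(x_test))                      # decision classifier: one label, nothing else
print(clf.classes_, clf.predict_proba(x_test))  # class membership estimates per label
```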
Data mining (DM) is the more general name given to a variety of computer-intensive techniques for discovering structure and for analyzing patterns in data.
Using those patterns, DM can make predictions for new, unseen data.
Data mining, machine learning and predictive analytics are already widely used in business and are starting to spread into social science and other areas of research. [Atte15]
A partial list of current data mining methods includes:
Association rules
Recursive partitioning or decision trees, including CART (classification and regression trees) and CHAID (chi-squared automatic interaction detection), boosted trees, forests, and bootstrap forests
Multi-layer neural network models and “deep learning” methods
Naive Bayes classifiers and Bayesian networks
Clustering methods, including hierarchical, k-means, nearest neighbor, linear and nonlinear manifold clustering
Support vector machines
“Soft modeling” or partial least squares latent variable modeling
Coding is an important part of qualitative analysis in many fields in social science.
Most applications of qualitative coding require detailed, line-by-line examination of the data.
Such analysis can quickly become very time-consuming even on a moderately sized dataset!
Machine learning techniques could potentially extend the principles of qualitative analysis to the whole dataset given proper guidance from a qualitative coder.
Consequently these techniques offer a promising approach to scale up the coding process.
Machine learning has emerged in the past decades, but its application in qualitative analysis is still very limited.
One common reason: people who use qualitative methods usually do not have a background in ML.
The complexity of selecting features, building models, and tuning parameters can prevent the construction of acceptable models.
On the other hand, an ML expert might be able to take the codes that a social scientist has applied to part of a dataset in order to train a classifier to label the whole dataset.
In addition, machine learning usually performs better on categories that have more instances, but those codes may not be the most interesting to a social scientist.
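A minimal sketch of the workflow described above (scikit-learn is assumed; the documents and codes below are invented placeholders): a classifier is trained on the documents a qualitative coder has already labeled and then proposes codes for the uncoded remainder.

```python
# Text classification as a scaling aid for qualitative coding (toy example).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

coded_texts   = ["the staff was very helpful", "waited two hours for a reply",
                 "friendly and quick support", "no answer to my complaint"]
codes         = ["positive", "negative", "positive", "negative"]
uncoded_texts = ["quick and helpful response", "still waiting for an answer"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(coded_texts, codes)                 # learn from the hand-coded subset
print(model.predict(uncoded_texts))           # proposed codes for the rest of the dataset
```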
Supervised Machine Learning requires labeled training data.
But: Who or how are labels assigned to data sets?
For many supervised learning tasks it may be infeasible (or very expensive) to obtain objective and reliable labels.
Instead, subjective (possibly noisy) labels from multiple experts or annotators are collected.
In practice, there is a substantial amount of disagreement among the annotators, and hence it is of great practical interest to adapt and improve conventional supervised learning methods for this case.
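One simple (and limited) way to turn such subjective labels into training labels is a majority vote over the annotators; the sketch below (plain Python, invented annotations) also flags items without a clear majority:

```python
from collections import Counter

# Rows = data items, columns = labels assigned by three annotators (toy data)
annotations = [
    ["Appendicitis", "Appendicitis", "Dyspepsia"],
    ["Dyspepsia",    "Dyspepsia",    "Dyspepsia"],
    ["Unknown",      "Appendicitis", "Dyspepsia"],
]

def aggregate(labels):
    (label, count), *rest = Counter(labels).most_common()
    if rest and rest[0][1] == count:     # tie: no clear majority among annotators
        return "disputed"
    return label

print([aggregate(row) for row in annotations])   # ['Appendicitis', 'Dyspepsia', 'disputed']
```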
Having more rows is nice (increases statistical power) → more training data
Having more columns is nice (estimate heterogeneous effects, interactions) → more feature variables
But with limited research resources:
Big Data is usually collected from a variety of sources with unknown models of how the data was generated.
It can be sparse data: weak and possibly unknown correlations between data variables.
It is statistically variant and noisy data!
Data Mining suggests that the combination of brute computing power and very large datasets enables data miners to discover structures in data.
Assumption: Applying conventional statistical approaches to datasets containing smaller numbers of cases does not deliver the structure. → Wrong!
Even the largest social science datasets are not large enough to allow a comprehensive or exhaustive search for structure.
Example: The five-million-person, multi-year census files available from the American Community Survey
Even the biggest, fastest computers find certain empirical tasks intractable.
Data mining frequently has to make compromises:
Data Mining cannot handle all the available measures in one model!
It approximates and makes compromises!
The “wisdom of the crowd” effect refers to the phenomenon that the mean of estimates provided by a group of individuals is more accurate than most of the individual estimates.
It is well-known that in many forecasting scenarios [4], averaging the forecasts of a set of individuals yields a collective prediction that is more accurate than the majority of the individuals’ forecasts—the so-called “wisdom of crowds” effect (Surowiecki 2004).
However, there are smarter ways of aggregating forecasts than simple averaging.
A hybrid approach of (re)calibrating and aggregating probabilistic forecasts about future events, provided by experts who may exhibit systematic biases (such as overestimating the likelihood of rare events), can improve model quality (Turner, 2010).
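The following sketch (NumPy assumed; the forecasts and skill weights are illustrative, not from the cited work) contrasts plain crowd averaging with a skill-weighted aggregate:

```python
import numpy as np

# Probabilistic forecasts of 4 experts for 3 future events (rows = experts)
forecasts = np.array([[0.90, 0.20, 0.70],
                      [0.80, 0.30, 0.60],
                      [0.40, 0.10, 0.90],
                      [0.95, 0.60, 0.80]])

crowd = forecasts.mean(axis=0)                # "wisdom of the crowd": simple average

skill = np.array([0.5, 0.8, 0.9, 0.3])        # hypothetical past accuracy per expert
weights = skill / skill.sum()
weighted = weights @ forecasts                # smarter aggregation: weight by skill

print(crowd, weighted)
```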
For many situations, such as when estimating the opinions of a group of individuals, it is desirable to learn a classification model, but there is no underlying ground truth!
Having introduced the paradox of too little big data and noted the challenges caused by high-dimensional data, we can now discuss how a DM analysis typically proceeds.
Feature-selection methods allow a researcher to identify which among many potential predictors are strongly associated with an outcome variable of interest.
DM offers several alternatives for selecting a subset of independent variables that are the most effective predictors of a dependent variable.
Manual variable selection
Automatic variable selection
One example of a method used for dimensionality reduction of the input data (→ feature selection): PCA can be applied.
Multidimensional data can be handled by reducing the problem to two dimensions. Principal components may be used as inputs to multiple regression and cluster analysis.
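A minimal PCA sketch (scikit-learn assumed, synthetic data): ten correlated variables are reduced to two principal components that can serve as inputs to regression or clustering.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
latent = rng.normal(size=(100, 2))                   # two hidden factors
mixing = rng.normal(size=(2, 10))                    # spread over 10 observed variables
X = latent @ mixing + 0.1 * rng.normal(size=(100, 10))

pca = PCA(n_components=2)
scores = pca.fit_transform(X)                        # 100 x 2 component scores
print(pca.explained_variance_ratio_)                 # most variance captured by 2 components
# 'scores' can now be used as inputs to multiple regression or cluster analysis
```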
Data sets for analysis may contain hundreds of attributes, many of which may be irrelevant to the mining task or redundant.
Attribute subset selection reduces the data set size by removing irrelevant or redundant attributes (or dimensions).
Information entropy of a data set, i.e., a value set c (column of data) of a variable xi, with |o| as the number of values of a set o (x = v ∈ V), pv the occurrence probability of a specific value v, and N the set of occurrence counts (pv = Nv / |c|):

H(c) = -Σ v∈V pv · log2(pv)

Information entropy of associated data sets (x → y, x = v ∈ V, y ∈ T), i.e., the entropy of the label column y that remains after splitting the data on the values of x:

H(y|x) = Σ v∈V (Nv / |c|) · H(y | x=v)

Entropy is a measure of impurity in a collection of training examples.
The effectiveness of a feature variable xi in classifying the training data is given by the Information Gain:

Gain(y, xi) = H(y) - H(y|xi)
x1 Outlook | x2 Temperature | x3 Humidity | x4 Wind | y Playing?
Sunny | Hot | High | Weak | No
Sunny | Hot | High | Strong | No
Overcast | Hot | High | Weak | Yes
Rain | Mild | High | Weak | Yes
Rain | Cool | Normal | Weak | Yes
Rain | Cool | Normal | Strong | No
Overcast | Cool | Normal | Strong | Yes
Sunny | Mild | High | Weak | No
Sunny | Cool | Normal | Weak | Yes
Rain | Mild | Normal | Weak | Yes
Sunny | Mild | Normal | Strong | Yes
Overcast | Mild | High | Strong | Yes
Overcast | Hot | Normal | Weak | Yes
Rain | Mild | High | Strong | No
Entropy H(xi) | 1.58 | 1.56 | 1.0 | 0.99 | 0.94
Cond. entropy H(y∣xi) | 0.69 | 0.91 | 0.79 | 0.89 | -
Gain(y, xi) | 0.25 | 0.03 | 0.15 | 0.05 | -
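The entropy and gain values in the table can be reproduced with a few lines of plain Python (standard library only; the code is a sketch, not taken from the slides):

```python
from collections import Counter
from math import log2

# Columns: Outlook, Temperature, Humidity, Wind, Play (the table above)
data = [
    ("Sunny","Hot","High","Weak","No"),         ("Sunny","Hot","High","Strong","No"),
    ("Overcast","Hot","High","Weak","Yes"),     ("Rain","Mild","High","Weak","Yes"),
    ("Rain","Cool","Normal","Weak","Yes"),      ("Rain","Cool","Normal","Strong","No"),
    ("Overcast","Cool","Normal","Strong","Yes"),("Sunny","Mild","High","Weak","No"),
    ("Sunny","Cool","Normal","Weak","Yes"),     ("Rain","Mild","Normal","Weak","Yes"),
    ("Sunny","Mild","Normal","Strong","Yes"),   ("Overcast","Mild","High","Strong","Yes"),
    ("Overcast","Hot","Normal","Weak","Yes"),   ("Rain","Mild","High","Strong","No"),
]

def entropy(values):
    """H(c) = -sum_v pv * log2(pv) over the distinct values of a column."""
    n = len(values)
    return -sum((k / n) * log2(k / n) for k in Counter(values).values())

def gain(rows, i, target=-1):
    """Gain(y, xi) = H(y) - sum_v (Nv/|c|) * H(y | xi = v)."""
    y = [r[target] for r in rows]
    cond = 0.0
    for v in set(r[i] for r in rows):
        subset = [r[target] for r in rows if r[i] == v]
        cond += len(subset) / len(rows) * entropy(subset)
    return entropy(y) - cond

for i, name in enumerate(["Outlook", "Temperature", "Humidity", "Wind"]):
    print(name, round(gain(data, i), 2))       # 0.25, 0.03, 0.15, 0.05
```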
Sample training data is now used to derive a model as a hypothesis of the unknown input-output function.
There are multiple hypothesis functions H={h1,h2, ..} with h(x): x → y that can approximate the unknown function f(x)!
The challenge: Find the best hypothesis model function hi by testing the derived model against sample and specific test data (e.g., by maximizing the R2 measure and using cross-validation)!
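A minimal comparison sketch (scikit-learn assumed, synthetic data; the model choices are illustrative): two candidate hypothesis functions are scored by cross-validated R2.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = 2 * X[:, 0] - X[:, 1] ** 2 + 0.1 * rng.normal(size=200)   # the "unknown" f(x)

for h in (LinearRegression(), RandomForestRegressor(n_estimators=100)):
    r2 = cross_val_score(h, X, y, cv=5, scoring="r2")          # 5-fold cross-validation
    print(type(h).__name__, round(r2.mean(), 2))
```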
Once a researcher has constructed a dataset rich in relevant features or variables, modeling can begin.
In DM, different kinds of models and learning approaches are used, but a specific choice is only temporary.
Thus, the data will be analyzed using several different kinds of models or approaches, comparing their prediction accuracy before settling on a final approach.
One group or random subset of cases or observations is known as the training sample or estimation sample. This is the group of cases that will be analyzed first, to create a predictive model.
A second random sample can be created, known as the tuning sample (it is sometimes called the validation sample). It is used to estimate certain modeling parameters that will yield an optimal prediction.
A third randomly selected group of observations is central to cross-validation. This is the test sample, sometimes called the holdout sample. The test sample is not used in any way during the creation of the predictive model; it is deliberately kept completely separate (held back).
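The three subsets can be created with two random splits, as in the following sketch (scikit-learn assumed; the 60/20/20 proportions are an illustrative choice):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X, y = rng.normal(size=(1000, 8)), rng.integers(0, 2, size=1000)   # synthetic data

# First hold back the test sample, then split the rest into training and tuning samples
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_tune, y_train, y_tune = train_test_split(X_rest, y_rest, test_size=0.25,
                                                    random_state=0)

# 60% training, 20% tuning (validation), 20% test; the test sample is touched only
# once, after model and parameters have been fixed on the training/tuning samples.
print(len(X_train), len(X_tune), len(X_test))
```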
(Figures: comparison of conventional and DM approaches [2]; impact of calibration on the R2 statistical measure [2])
Now the learned and mined model function can be used for the prediction of unknown input data!
The training data consists of input data (x, sensor variables, structure parameters, ..) with the associated output data (y, so-called labels, e.g., material parameters). The output data is commonly assigned by experts (humans), but can also be fed back through an automatic evaluation (reinforcement learning).
In so-called clustering, patterns in the input data are automatically recognized, i.e., the training data consists only of the input data x.
This learning class is closely related to autonomous agents interacting in an environment with the behaviour: action → perception → decision.
It is a sequential decision-making problem with delayed reward. Reinforcement learning algorithms seek to learn a policy (a mapping from states to actions) that maximizes the reward received over time.
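A minimal tabular Q-learning sketch (plain Python; the toy chain environment and all parameter values are illustrative assumptions) shows how such a policy can be learned from delayed rewards:

```python
import random

n_states, n_actions = 5, 2
alpha, gamma, epsilon = 0.1, 0.9, 0.2            # learning rate, discount, exploration
Q = [[0.0] * n_actions for _ in range(n_states)]

def step(state, action):
    """Toy chain environment: only reaching the last state yields a (delayed) reward."""
    next_state = min(state + 1, n_states - 1) if action == 1 else max(state - 1, 0)
    return next_state, (1.0 if next_state == n_states - 1 else 0.0)

state = 0
for _ in range(1000):
    action = random.randrange(n_actions) if random.random() < epsilon \
             else max(range(n_actions), key=lambda a: Q[state][a])
    next_state, reward = step(state, action)
    # Q-learning update: move Q(s,a) towards reward + gamma * max_a' Q(s',a')
    Q[state][action] += alpha * (reward + gamma * max(Q[next_state]) - Q[state][action])
    state = 0 if next_state == n_states - 1 else next_state

print(Q)   # the learned values favour action 1 ("move right") in every state
```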
The unlabeled training data x(t) is provided sequentially as a stream!
Will I play tennis today?
The impact of noise on prediction accuracy depends on the algorithm and models that are used.
(Figure: Overfitting)
(Figure: Overfitting in Decision Trees)
Will I play tennis today?
Using Data Mining
The challenge: Find the best hypothesis model function hi by testing the derived model against sample and specific test data (e.g., by maximizing the R2 measure and using cross-validation)!
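As a closing sketch (scikit-learn and pandas assumed; the one-hot encoding and today's weather are illustrative choices, not from the slides), a decision tree is learned from the weather table of the previous section and applied to a new day:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

rows = [("Sunny","Hot","High","Weak","No"),         ("Sunny","Hot","High","Strong","No"),
        ("Overcast","Hot","High","Weak","Yes"),     ("Rain","Mild","High","Weak","Yes"),
        ("Rain","Cool","Normal","Weak","Yes"),      ("Rain","Cool","Normal","Strong","No"),
        ("Overcast","Cool","Normal","Strong","Yes"),("Sunny","Mild","High","Weak","No"),
        ("Sunny","Cool","Normal","Weak","Yes"),     ("Rain","Mild","Normal","Weak","Yes"),
        ("Sunny","Mild","Normal","Strong","Yes"),   ("Overcast","Mild","High","Strong","Yes"),
        ("Overcast","Hot","Normal","Weak","Yes"),   ("Rain","Mild","High","Strong","No")]
df = pd.DataFrame(rows, columns=["Outlook","Temperature","Humidity","Wind","Play"])

X = pd.get_dummies(df.drop(columns="Play"), dtype=int)      # one-hot encode the features
clf = DecisionTreeClassifier(criterion="entropy").fit(X, df["Play"])

today = pd.DataFrame([("Sunny","Cool","High","Strong")],
                     columns=["Outlook","Temperature","Humidity","Wind"])
today_enc = pd.get_dummies(today, dtype=int).reindex(columns=X.columns, fill_value=0)
print(clf.predict(today_enc))    # e.g. ['No'] - the mined rules decide for today
```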