Priv.-Doz. Dr. Stefan Bosse
University of Koblenz-Landau, Fac. Computer Science
University of Bremen, Dept. Mathematics & Informatics 18.5.2018
sbosse@uni-bremen.de
In social science big data volumes must be handled.
But big does not mean helpful or important!
Data is noisy and uncertain!?
F(Input Data): Input Data → Output Data
⇔
F(Sensor Data): Sensor Data → Knowledge
Such a function F performs Feature Extraction
But often there are no or only partial numerical/mathematical models that can implement F!
Artificial Intelligence and its methods can help to derive such fundamental mapping functions - or at least an approximation: a hypothesis
The input data is commonly characterized by a high dimensionality: an input vector of variables
x = [x1, x2, .., xN]
is mapped to a much lower-dimensional output vector
y = [y1, y2, .., yM]
F: R^N → R^M with M ≪ N, f(x): x → y.
Machine Learning (ML) can be used to derive such a relation from experimental/empirical training data!
Besides the derivation of such functional relations, the prediction of what will happen next or in the future is an important task of Machine Learning.
Patient Details [weight,age,sex,pain left, pain right, temperature, ..]
Diagnosis Label {Appendicitis, Dyspepsia, Unknown, .. }
A classifier returns one of the labels matching a new input vector x (the test object).
Decision classifiers return only one (good or bad) matching label.
There is no information about the matching probability.
⇒ Technical Problems!
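The following minimal sketch (scikit-learn and the toy patient vectors are assumptions, not part of the slides) contrasts the single-label output of a decision classifier with the class-membership estimates of a probabilistic one:

```python
# Hypothetical patient data; scikit-learn's DecisionTreeClassifier is used
# only as a stand-in for any decision classifier.
from sklearn.tree import DecisionTreeClassifier

# Toy feature vectors: [weight, age, sex(0/1), pain_left(0/1), pain_right(0/1), temperature]
X = [[70, 25, 0, 0, 1, 38.5],
     [82, 40, 1, 1, 0, 37.0],
     [55, 19, 0, 0, 1, 39.1],
     [90, 55, 1, 0, 0, 36.8]]
y = ["Appendicitis", "Dyspepsia", "Appendicitis", "Unknown"]

clf = DecisionTreeClassifier().fit(X, y)

x_test = [[60, 30, 0, 0, 1, 38.9]]              # a new patient (the test object)
print(clf.predict(x_test))                      # decision classifier: one label, nothing else
print(clf.classes_, clf.predict_proba(x_test))  # class membership estimates per label
```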
Data mining (DM) is the more general name given to a variety of computer-intensive techniques for discovering structure and for analyzing patterns in data.
Using those patterns, DM can make predictions for new, unseen data.
Data mining, machine learning and predictive analytics are already widely used in business and are starting to spread into social science and other areas of research. [Atte15]
A partial list of current data mining methods includes:
Association rules
Recursive partitioning or decision trees, including CART (classification and regression trees) and CHAID (chi-squared automatic interaction detection), boosted trees, forests, and bootstrap forests
Multi-layer neural network models and “deep learning” methods
Naive Bayes classifiers and Bayesian networks
Clustering methods, including hierarchical, k-means, nearest neighbor, linear and nonlinear manifold clustering
Support vector machines
“Soft modeling” or partial least squares latent variable modeling
Coding is an important part of qualitative analysis in many fields in social science.
Most applications of qualitative coding require detailed, line-by-line examination of the data.
Such analysis can quickly become very time-consuming even on a moderately sized dataset!
Machine learning techniques could potentially extend the principles of qualitative analysis to the whole dataset given proper guidance from a qualitative coder.
Consequently these techniques offer a promising approach to scale up the coding process.
Machine learning has emerged in the past decades, but its application in qualitative analysis is still very limited.
One common reason: people who use qualitative methods usually do not have a background in ML.
The complexity of selecting features, building models, and tuning parameters can prevent the construction of acceptable models.
On the other hand, an ML expert might be able to take the codes that a social scientist has applied to part of a dataset in order to train a classifier to label the whole dataset.
In addition, machine learning usually performs better on categories that have more instances, but those codes may not be the most interesting to a social scientist.
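A minimal sketch of the workflow described above (scikit-learn is assumed; the documents and codes below are invented placeholders): a classifier is trained on the documents a qualitative coder has already labeled and then proposes codes for the uncoded remainder.

```python
# Text classification as a scaling aid for qualitative coding (toy example).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

coded_texts   = ["the staff was very helpful", "waited two hours for a reply",
                 "friendly and quick support", "no answer to my complaint"]
codes         = ["positive", "negative", "positive", "negative"]
uncoded_texts = ["quick and helpful response", "still waiting for an answer"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(coded_texts, codes)                 # learn from the hand-coded subset
print(model.predict(uncoded_texts))           # proposed codes for the rest of the dataset
```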
Supervised Machine Learning requires labeled training data.
But: Who or how are labels assigned to data sets?
For many supervised learning tasks it may be infeasible (or very expensive) to obtain objective and reliable labels.
Instead, subjective (possibly noisy) labels from multiple experts or annotators are collected.
In practice, there is a substantial amount of disagreement among the annotators, and hence it is of great practical interest to adapt and improve conventional supervised learning methods for this case.
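One simple (and limited) way to turn such subjective labels into training labels is a majority vote over the annotators; the sketch below (plain Python, invented annotations) also flags items without a clear majority:

```python
from collections import Counter

# Rows = data items, columns = labels assigned by three annotators (toy data)
annotations = [
    ["Appendicitis", "Appendicitis", "Dyspepsia"],
    ["Dyspepsia",    "Dyspepsia",    "Dyspepsia"],
    ["Unknown",      "Appendicitis", "Dyspepsia"],
]

def aggregate(labels):
    (label, count), *rest = Counter(labels).most_common()
    if rest and rest[0][1] == count:     # tie: no clear majority among annotators
        return "disputed"
    return label

print([aggregate(row) for row in annotations])   # ['Appendicitis', 'Dyspepsia', 'disputed']
```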
Having more rows is nice (increases statistical power) → more training data
Having more columns is nice (estimate heterogeneous effects, interactions) → more feature variables
But with limited research resources:
Big Data is usually collected from a variety of sources with unknown models of how the data was generated.
It can be sparse data: weak and possibly unknown correlations between data variables.
It is statistically variant and noisy data!
Data Mining suggests that the combination of brute computing power and very large datasets enables data miners to discover structures in data.
Assumption: Applying conventional statistical approaches to datasets containing smaller numbers of cases does not deliver the structure. → Wrong!
Even the largest social science datasets are not large enough to allow a comprehensive or exhaustive search for structure.
Example: The five-million-person, multi-year census files available from the American Community Survey
Even the biggest, fastest computers find certain empirical tasks intractable.
Data mining frequently has to make compromises:
Data Mining cannot handle all the available measures in one model!
It approximates and makes compromises!
The “wisdom of the crowd” effect refers to the phenomenon that the mean of estimates provided by a group of individuals is more accurate than most of the individual estimates.
It is well-known that in many forecasting scenarios [4], averaging the forecasts of a set of individuals yields a collective prediction that is more accurate than the majority of the individuals’ forecasts—the so-called “wisdom of crowds” effect (Surowiecki 2004).
However, there are smarter ways of aggregating forecasts than simple averaging.
A hybrid approach of (re)calibrating and aggregating probabilistic forecasts about future events, provided by experts who may exhibit systematic biases (such as overestimating the likelihood of rare events), can improve model quality (Turner, 2010).
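The following sketch (NumPy assumed; the forecasts and skill weights are illustrative, not from the cited work) contrasts plain crowd averaging with a skill-weighted aggregate:

```python
import numpy as np

# Probabilistic forecasts of 4 experts for 3 future events (rows = experts)
forecasts = np.array([[0.90, 0.20, 0.70],
                      [0.80, 0.30, 0.60],
                      [0.40, 0.10, 0.90],
                      [0.95, 0.60, 0.80]])

crowd = forecasts.mean(axis=0)                # "wisdom of the crowd": simple average

skill = np.array([0.5, 0.8, 0.9, 0.3])        # hypothetical past accuracy per expert
weights = skill / skill.sum()
weighted = weights @ forecasts                # smarter aggregation: weight by skill

print(crowd, weighted)
```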
For many situations, such as when estimating the opinions of a group of individuals, it is desirable to learn a classification model, but there is no underlying ground truth!
Having introduced the paradox of too little big data and noted the challenges caused by high-dimensional data, we can now discuss how a DM analysis typically proceeds.
Feature-selection methods allow a researcher to identify which among many potential predictors are strongly associated with an outcome variable of interest.
DM offers several alternatives for selecting a subset of independent variables that are the most effective predictors of a dependent variable.
Manual variable selection
Automatic variable selection
One example of a method used for dimensionality reduction of the input data (→ feature selection): PCA can be applied.
Multidimensional data can be handled by reducing the problem to two dimensions. Principal components may be used as inputs to multiple regression and cluster analysis.
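A minimal PCA sketch (scikit-learn assumed, synthetic data): ten correlated variables are reduced to two principal components that can serve as inputs to regression or clustering.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
latent = rng.normal(size=(100, 2))                   # two hidden factors
mixing = rng.normal(size=(2, 10))                    # spread over 10 observed variables
X = latent @ mixing + 0.1 * rng.normal(size=(100, 10))

pca = PCA(n_components=2)
scores = pca.fit_transform(X)                        # 100 x 2 component scores
print(pca.explained_variance_ratio_)                 # most variance captured by 2 components
# 'scores' can now be used as inputs to multiple regression or cluster analysis
```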
Data sets for analysis may contain hundreds of attributes, many of which may be irrelevant to the mining task or redundant.
Attribute subset selection reduces the data set size by removing irrelevant or redundant attributes (or dimensions).
Information entropy of a data set, i.e., a value set c (column of data) of a variable xi, with |o| as the number of values of a set o (x = v ∈ V), pv the occurrence probability of a specific value v, and N the set of occurrence counts (pv = Nv / |c|):

H(c) = -Σ v∈V pv · log2(pv)

Information entropy of associated data sets (x → y, x = v ∈ V, y ∈ T), i.e., the entropy of the label column y that remains after splitting the data on the values of x:

H(y|x) = Σ v∈V (Nv / |c|) · H(y | x=v)

Entropy is a measure of impurity in a collection of training examples.
The effectiveness of a feature variable xi in classifying the training data is given by the Information Gain:

Gain(y, xi) = H(y) - H(y|xi)
x1 Outlook | x2 Temperature | x3 Humidity | x4 Wind | y Playing?
Sunny | Hot | High | Weak | No
Sunny | Hot | High | Strong | No
Overcast | Hot | High | Weak | Yes
Rain | Mild | High | Weak | Yes
Rain | Cool | Normal | Weak | Yes
Rain | Cool | Normal | Strong | No
Overcast | Cool | Normal | Strong | Yes
Sunny | Mild | High | Weak | No
Sunny | Cool | Normal | Weak | Yes
Rain | Mild | Normal | Weak | Yes
Sunny | Mild | Normal | Strong | Yes
Overcast | Mild | High | Strong | Yes
Overcast | Hot | Normal | Weak | Yes
Rain | Mild | High | Strong | No
Entropy H(xi) | 1.58 | 1.56 | 1.0 | 0.99 | 0.94
Cond. entropy H(y∣xi) | 0.69 | 0.91 | 0.79 | 0.89 | -
Gain(y, xi) | 0.25 | 0.03 | 0.15 | 0.05 | -
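The entropy and gain values in the table can be reproduced with a few lines of plain Python (standard library only; the code is a sketch, not taken from the slides):

```python
from collections import Counter
from math import log2

# Columns: Outlook, Temperature, Humidity, Wind, Play (the table above)
data = [
    ("Sunny","Hot","High","Weak","No"),         ("Sunny","Hot","High","Strong","No"),
    ("Overcast","Hot","High","Weak","Yes"),     ("Rain","Mild","High","Weak","Yes"),
    ("Rain","Cool","Normal","Weak","Yes"),      ("Rain","Cool","Normal","Strong","No"),
    ("Overcast","Cool","Normal","Strong","Yes"),("Sunny","Mild","High","Weak","No"),
    ("Sunny","Cool","Normal","Weak","Yes"),     ("Rain","Mild","Normal","Weak","Yes"),
    ("Sunny","Mild","Normal","Strong","Yes"),   ("Overcast","Mild","High","Strong","Yes"),
    ("Overcast","Hot","Normal","Weak","Yes"),   ("Rain","Mild","High","Strong","No"),
]

def entropy(values):
    """H(c) = -sum_v pv * log2(pv) over the distinct values of a column."""
    n = len(values)
    return -sum((k / n) * log2(k / n) for k in Counter(values).values())

def gain(rows, i, target=-1):
    """Gain(y, xi) = H(y) - sum_v (Nv/|c|) * H(y | xi = v)."""
    y = [r[target] for r in rows]
    cond = 0.0
    for v in set(r[i] for r in rows):
        subset = [r[target] for r in rows if r[i] == v]
        cond += len(subset) / len(rows) * entropy(subset)
    return entropy(y) - cond

for i, name in enumerate(["Outlook", "Temperature", "Humidity", "Wind"]):
    print(name, round(gain(data, i), 2))       # 0.25, 0.03, 0.15, 0.05
```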
Sample training data is now used to derive a model as a hypothesis of the unknown input-output function.
There are multiple hypothesis functions H={h1,h2, ..} with h(x): x → y that can approximate the unknown function f(x)!
The challenge: Find the best hypothesis model function hi by testing the derived model against sample and specific test data (e.g., by maximizing the R2 measure and using cross-validation)!
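A minimal comparison sketch (scikit-learn assumed, synthetic data; the model choices are illustrative): two candidate hypothesis functions are scored by cross-validated R2.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = 2 * X[:, 0] - X[:, 1] ** 2 + 0.1 * rng.normal(size=200)   # the "unknown" f(x)

for h in (LinearRegression(), RandomForestRegressor(n_estimators=100)):
    r2 = cross_val_score(h, X, y, cv=5, scoring="r2")          # 5-fold cross-validation
    print(type(h).__name__, round(r2.mean(), 2))
```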
Once a researcher has constructed a dataset rich in relevant features or variables, modeling can begin.
In DM, different kinds of models and learning approaches are used, but a specific choice is only temporary.
Thus, the data will be analyzed using several different kinds of models or approaches, comparing their prediction accuracy before settling on a final approach.
One group or random subset of cases or observations is known as the training sample or estimation sample. This is the group of cases that will be analyzed first, to create a predictive model.
A second random sample can be created, known as the tuning sample (it is sometimes called the validation sample). It is used to estimate certain modeling parameters that will yield an optimal prediction.
A third randomly selected group of observations is central to cross-validation. This is the test sample, sometimes called the holdout sample. The test sample is not used in any way during the creation of the predictive model; it is deliberately kept completely separate (held back).
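The three subsets can be created with two random splits, as in the following sketch (scikit-learn assumed; the 60/20/20 proportions are an illustrative choice):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X, y = rng.normal(size=(1000, 8)), rng.integers(0, 2, size=1000)   # synthetic data

# First hold back the test sample, then split the rest into training and tuning samples
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_tune, y_train, y_tune = train_test_split(X_rest, y_rest, test_size=0.25,
                                                    random_state=0)

# 60% training, 20% tuning (validation), 20% test; the test sample is touched only
# once, after model and parameters have been fixed on the training/tuning samples.
print(len(X_train), len(X_tune), len(X_test))
```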
(Figures: comparison of conventional and DM approaches [2]; impact of calibration on the R2 statistical measure [2])
Now the learned and mined model function can be used for the prediction of unknown input data!
The training data consists of input data (x, sensor variables, structure parameters, ..) with the associated output data (y, so-called labels, e.g., material parameters). The output data is commonly assigned by experts (humans), but can also be fed back through an automatic evaluation (reinforcement learning).
In so-called clustering, patterns in the input data are automatically recognized, i.e., the training data consists only of the input data x.
This learning class is closely related to autonomous agents interacting in an environment with the behaviour: action → perception → decision.
It is a sequential decision-making problem with delayed reward. Reinforcement learning algorithms seek to learn a policy (a mapping from states to actions) that maximizes the reward received over time.
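A minimal tabular Q-learning sketch (plain Python; the toy chain environment and all parameter values are illustrative assumptions) shows how such a policy can be learned from delayed rewards:

```python
import random

n_states, n_actions = 5, 2
alpha, gamma, epsilon = 0.1, 0.9, 0.2            # learning rate, discount, exploration
Q = [[0.0] * n_actions for _ in range(n_states)]

def step(state, action):
    """Toy chain environment: only reaching the last state yields a (delayed) reward."""
    next_state = min(state + 1, n_states - 1) if action == 1 else max(state - 1, 0)
    return next_state, (1.0 if next_state == n_states - 1 else 0.0)

state = 0
for _ in range(1000):
    action = random.randrange(n_actions) if random.random() < epsilon \
             else max(range(n_actions), key=lambda a: Q[state][a])
    next_state, reward = step(state, action)
    # Q-learning update: move Q(s,a) towards reward + gamma * max_a' Q(s',a')
    Q[state][action] += alpha * (reward + gamma * max(Q[next_state]) - Q[state][action])
    state = 0 if next_state == n_states - 1 else next_state

print(Q)   # the learned values favour action 1 ("move right") in every state
```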
The unlabeled training data x(t) is provided sequentially as a stream!
Will I play tennis today?
The impact of noise on prediction accuracy depends on the algorithm and models that are used.
(Figure: Overfitting)
(Figure: Overfitting in Decision Trees)
Will I play tennis today?
Using Data Mining
The challenge: Find the best hypothesis model function hi by testing the derived model against sample and specific test data (e.g., by maximizing the R2 measure and using cross-validation)!
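As a closing sketch (scikit-learn and pandas assumed; the one-hot encoding and today's weather are illustrative choices, not from the slides), a decision tree is learned from the weather table of the previous section and applied to a new day:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

rows = [("Sunny","Hot","High","Weak","No"),         ("Sunny","Hot","High","Strong","No"),
        ("Overcast","Hot","High","Weak","Yes"),     ("Rain","Mild","High","Weak","Yes"),
        ("Rain","Cool","Normal","Weak","Yes"),      ("Rain","Cool","Normal","Strong","No"),
        ("Overcast","Cool","Normal","Strong","Yes"),("Sunny","Mild","High","Weak","No"),
        ("Sunny","Cool","Normal","Weak","Yes"),     ("Rain","Mild","Normal","Weak","Yes"),
        ("Sunny","Mild","Normal","Strong","Yes"),   ("Overcast","Mild","High","Strong","Yes"),
        ("Overcast","Hot","Normal","Weak","Yes"),   ("Rain","Mild","High","Strong","No")]
df = pd.DataFrame(rows, columns=["Outlook","Temperature","Humidity","Wind","Play"])

X = pd.get_dummies(df.drop(columns="Play"), dtype=int)      # one-hot encode the features
clf = DecisionTreeClassifier(criterion="entropy").fit(X, df["Play"])

today = pd.DataFrame([("Sunny","Cool","High","Strong")],
                     columns=["Outlook","Temperature","Humidity","Wind"])
today_enc = pd.get_dummies(today, dtype=int).reindex(columns=X.columns, fill_value=0)
print(clf.predict(today_enc))    # e.g. ['No'] - the mined rules decide for today
```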