Data mining with Machine Learning for the social sciences

Introduction, Challenges, the right & the wrong, Misunderstanding

Priv.-Doz. Dr. Stefan Bosse
University of Koblenz-Landau, Fac. Computer Science
University of Bremen, Dept. Mathematics & Informatics


Introduction to Artificial Intelligence

Artificial Intelligence

In the social sciences, large volumes of data must be handled.
But big does not mean helpful or important!
Data is noisy and uncertain!

  • One major task in data science is the derivation of fundamental mapping functions:

\[F(\text{Input Data}): \text{Input Data} \rightarrow \text{Output Data}\]

\[F(\text{Sensor Data}): \text{Sensor Data} \rightarrow \text{Knowledge}\]

  • Such a function F performs Feature Extraction

  • But often there are no or only partial numerical/mathematical models that can implement F!

  • Artificial Intelligence methods can help to derive such fundamental mapping functions, or at least an approximation of them: a Hypothesis

  • The input data is commonly characterized by a high dimensionality, consisting of a vector of N variables:

\[x = (x_1, x_2, \ldots, x_N)\]

  • The output data (information) has a much lower dimensionality (data reduction!), consisting of a vector of M variables:

\[y = (y_1, y_2, \ldots, y_M)\]

  • This means:

\[F: \mathbb{R}^N \rightarrow \mathbb{R}^M \quad \text{with} \quad M \ll N\]

  • Data reduction includes the pre-selection of suitable (high information entropy) data variables → Feature Selection
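
The pre-selection of high-entropy variables can be illustrated with a minimal Python sketch. It assumes discrete-valued variables and ranks the columns of a small data table by their Shannon entropy, keeping the M highest. The helper names (`shannon_entropy`, `select_features`) and the toy data are hypothetical and chosen only for illustration. Note that high entropy alone does not guarantee that a variable is relevant for the output.

```python
import math
from collections import Counter


def shannon_entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def select_features(rows, m):
    """Return the indices of the m columns with the highest entropy,
    together with the entropy of every column."""
    n_cols = len(rows[0])
    entropies = [shannon_entropy([row[j] for row in rows]) for j in range(n_cols)]
    ranked = sorted(range(n_cols), key=lambda j: entropies[j], reverse=True)
    return sorted(ranked[:m]), entropies


# Hypothetical toy data: 4 samples with N = 3 discrete variables.
# Column 0 is constant and therefore carries no information.
rows = [
    (0, "a", 1),
    (0, "b", 1),
    (0, "a", 2),
    (0, "b", 3),
]

selected, entropies = select_features(rows, m=2)
print("entropy per variable:", entropies)
print("selected variable indices:", selected)
```

Here the constant column is dropped, reducing the input from N = 3 to M = 2 variables.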


Fig. 1. A typical Artificial Intelligence System

Machine Learning - Technical Sciences

  • Often there is no known functional relation between two variables x and y.
    In technical applications x can be a camera image with 1 million pixels and y a digit from the set {0,1,2,..,9} that represents a handwritten character. Generally:

\[f(x): x \rightarrow y\]

  • Machine Learning (ML) can be used to derive such a relation from experimental/empirical training data!

  • Besides the derivation of such functional relations, the prediction of what will happen next or in the future is an important task of Machine Learning.
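
As a rough illustration of deriving f(x): x → y from training data, the following Python sketch classifies tiny binary "images" with a nearest-neighbour rule. This is only one of many possible learning methods and is not necessarily the one used in this lecture. The 3x3 bitmaps (instead of 1 million pixels), the three training examples, and the function names are assumptions made for the example.

```python
def hamming(a, b):
    """Number of pixel positions in which two images differ."""
    return sum(p != q for p, q in zip(a, b))


def nearest_neighbour(training, x):
    """Return the label of the training image closest to x."""
    _, label = min(training, key=lambda pair: hamming(pair[0], x))
    return label


# Hypothetical 3x3 toy images, flattened row by row (1 = ink, 0 = background).
ZERO = (1, 1, 1,
        1, 0, 1,
        1, 1, 1)
ONE = (0, 1, 0,
       0, 1, 0,
       0, 1, 0)
SEVEN = (1, 1, 1,
         0, 0, 1,
         0, 0, 1)

# Training data: pairs (x, y) of image and digit label.
training = [(ZERO, 0), (ONE, 1), (SEVEN, 7)]

# A new, noisy observation: a "1" with one flipped pixel.
noisy_one = (0, 1, 0,
             0, 1, 0,
             0, 1, 1)

predicted = nearest_neighbour(training, noisy_one)
print("predicted digit:", predicted)
```

The learned relation is never written down as a formula; it is represented implicitly by the stored training examples and the distance measure.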

Machine Learning - The Functional Approach

  • Machine learning means the derivation of a hypothesis of a simple input-output function from training data provided by humans (statistical data!)!


Fig. 2. A hypothesis of an input-output model function derived from training data
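
A minimal sketch of deriving such a hypothesis, assuming the simplest possible model (a straight line h(x) = a·x + b) and ordinary least squares as the fitting method. Both are assumptions for illustration; the training values are invented and the names `fit_line` and `hypothesis` are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Invented, noisy training data (input x, observed output y).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.1, 2.9, 5.2, 6.8]

slope, intercept = fit_line(xs, ys)


def hypothesis(x):
    """The derived input-output hypothesis h(x)."""
    return slope * x + intercept


print("h(x) = %.2f * x + %.2f" % (slope, intercept))
print("prediction for unseen input x = 4:", hypothesis(4.0))
```

The hypothesis only approximates the unknown function behind the data; its quality depends on the training data and on how well the chosen model class fits the real relation.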

Machine Learning - Medicine

Diagnosis of appendicitis from medical and personal data

Input Data x

Patient Details [weight,age,sex,pain left, pain right, temperature, ..]

Output Data y

Diagnosis Label {Appendicitis, Dyspepsia, Unknown, .. }

Decision Learner

Returns one of the labels matching a new input vector x (the test object)
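
To make the idea of a decision learner concrete, here is a small Python sketch of a one-level decision tree (a "decision stump"). This is an assumed, deliberately simple stand-in and not necessarily the learner meant here. The patient records are invented toy values without any clinical meaning, and the feature order (age, pain right, temperature) as well as the function names are hypothetical.

```python
from collections import Counter


def majority(labels):
    """Most frequent label in a list."""
    return Counter(labels).most_common(1)[0][0]


def train_stump(samples):
    """Find the single (feature, threshold) split with the fewest training errors.

    samples: list of (feature_vector, label) pairs.
    Returns (feature_index, threshold, label_if_low, label_if_high).
    Assumes at least one split separates the samples into two non-empty groups.
    """
    best = None
    n_features = len(samples[0][0])
    for j in range(n_features):
        for threshold in sorted({x[j] for x, _ in samples}):
            low = [y for x, y in samples if x[j] <= threshold]
            high = [y for x, y in samples if x[j] > threshold]
            if not low or not high:
                continue  # not a real split
            label_low, label_high = majority(low), majority(high)
            errors = (sum(y != label_low for y in low)
                      + sum(y != label_high for y in high))
            if best is None or errors < best[0]:
                best = (errors, j, threshold, label_low, label_high)
    return best[1:]


def classify(stump, x):
    """Return exactly one label for the test object x."""
    j, threshold, label_low, label_high = stump
    return label_low if x[j] <= threshold else label_high


# Invented training data: (age, pain right [0/1], temperature in deg C) -> label
samples = [
    ((25, 1, 38.6), "Appendicitis"),
    ((40, 1, 38.1), "Appendicitis"),
    ((33, 0, 36.9), "Dyspepsia"),
    ((58, 0, 37.2), "Dyspepsia"),
    ((19, 1, 37.9), "Appendicitis"),
    ((47, 0, 37.0), "Dyspepsia"),
]

stump = train_stump(samples)
label = classify(stump, (30, 1, 38.4))  # new test object x
print("learned split:", stump)
print("label for test object:", label)
```

The learner returns a label for any input vector, but gives no indication of how reliable that label is.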


Machine Learning - Medicine

  • Decision classifiers only return one (good or bad) matching label

  • No information about matching probability

Probabilistic Learner (Bayes' Theorem)
Feature: Probability forecast estimating the conditional probability of the best matching (or all) label(s) given an observed object x
\[P(y|x) = \frac{P(x|y)P(y)}{P(x)}\]
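
A small worked example of Bayes' theorem in Python. It assumes that the labels are mutually exclusive and exhaustive, so that P(x) can be computed as the sum of P(x|y)P(y) over all labels y. The prior and likelihood values are invented for illustration only and have no clinical meaning; the function name `posterior` is hypothetical.

```python
def posterior(priors, likelihoods):
    """Compute P(y|x) for every label y by Bayes' theorem.

    priors:      {label: P(y)}
    likelihoods: {label: P(x|y)} for one fixed observed object x
    """
    joint = {y: likelihoods[y] * priors[y] for y in priors}  # P(x|y) P(y)
    evidence = sum(joint.values())                           # P(x)
    return {y: joint[y] / evidence for y in joint}


# Invented numbers, for illustration only.
priors = {"Appendicitis": 0.2, "Dyspepsia": 0.5, "Unknown": 0.3}
likelihoods = {"Appendicitis": 0.9, "Dyspepsia": 0.1, "Unknown": 0.2}

post = posterior(priors, likelihoods)
best_label = max(post, key=post.get)

for y, p in post.items():
    print("P(%s | x) = %.3f" % (y, p))
print("best matching label:", best_label)
```

In contrast to a pure decision learner, the result contains a probability for every label, so a weak best match can be distinguished from a strong one.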