Machine Learning
Chapter 1: INTRODUCTION
Learning: to improve automatically with experience.

Machine Learning: the study of computer programs that improve automatically with experience. Machine Learning is the field of study that gives computers the capability to learn without being explicitly programmed.

It is the process of training a function on a machine using data, and judging the result by how efficiently and how well it performs; this is captured by a performance measure P.

It can also be defined as executing various methods, procedures and functions over datasets so as to perform a task effectively.
Datasets are divided into two parts before being used on a machine:

1. Training dataset - about 80% of the data, used to train the model.
2. Test dataset - the remaining 20%, held out and used to evaluate the trained model on unseen, real-world cases.

Examples - washing machines, mobile phones, etc.
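The 80/20 split described above can be sketched as follows (the dataset here is a hypothetical toy example, not data from the text):

```python
# A minimal sketch of an 80/20 train/test split on a toy dataset
# (the dataset and the exact split ratio are illustrative assumptions).
dataset = list(range(100))           # 100 hypothetical examples

split = int(0.8 * len(dataset))      # index marking the 80% boundary
training_set = dataset[:split]       # used to fit / train the model
test_set = dataset[split:]           # held out to evaluate on unseen cases

print(len(training_set), len(test_set))  # 80 20
```

In practice the data would be shuffled before splitting, so that both parts reflect the same distribution.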
Disciplines whose ideas influence machine learning:
1. Artificial Intelligence
2. Bayesian methods
3. Computational complexity theory
4. Control theory
5. Information theory
6. Philosophy
7. Psychology and Neurobiology
Applications of Machine Learning

Learning to recognize spoken words
Learning to drive an autonomous vehicle
Learning to play world-class computer games
Well-Posed Learning Problems

A program is said to learn from experience E with respect to a task T and performance measure P if its performance at T, as measured by P, improves with experience E.

A learning problem can therefore be specified by stating the task T, the performance measure P, and the training experience E.
1. A checkers learning problem:
Task T: playing checkers
Performance measure P: percent of games won against opponents
Training experience E: playing practice games against itself

2. A handwriting recognition learning problem:
Task T: recognizing and classifying handwritten words within images
Performance measure P: percent of words correctly classified
Training experience E: a database of handwritten words with given classifications

3. A robot driving learning problem:
Task T: driving on public four-lane highways using vision sensors
Performance measure P: average distance travelled before an error (as judged by a human overseer)
Training experience E: a sequence of images and steering commands recorded while observing a human driver
Designing a Learning System

There are five steps in designing a learning system:
1. Choosing the training experience
2. Choosing the target function
3. Choosing a representation for the target function
4. Choosing a function approximation algorithm
5. The final design
1. CHOOSING THE TRAINING EXPERIENCE

Here we want to teach a machine how to play checkers. The first design choice is the type of training experience from which our system will learn. There are three key attributes of the training experience.
First key attribute - whether the training experience provides direct or indirect feedback regarding the choices made by the system.

Direct feedback: individual board states together with the correct move for each board state.

Indirect feedback: move sequences and the final outcomes of various games played. (For each board state there are many legal moves; the learner must infer the quality of individual moves from the eventual outcome of the game.)
Second key attribute - the degree to which the learner controls the sequence of training examples:

Teacher to learner: the teacher selects the problems (board states) and gives them to the learner to solve.
Learner to teacher: the learner asks the teacher about particular problem cases it finds confusing.
Self-learning: the learner controls its training entirely, for example by playing games against itself.

Third key attribute - how well the training experience represents the distribution of examples over which the final system performance P will be measured.
2. CHOOSING THE TARGET FUNCTION

a) Determine exactly what type of knowledge will be learned and how it will be used by the performance program.

Example - a checkers program.
We can generate the legal moves from any board state, so the program needs only to learn how to choose the best move from among these legal moves. Let us call this function ChooseMove.
Notation: ChooseMove : B -> M, where B is the set of legal board states and M is the set of legal moves.
b) An alternative target function assigns a numerical score to any given board state.
Notation: V : B -> R, where V maps any legal board state in the set B to some real value in R (a higher score indicating a better board state).
We define the target value V(b) for an arbitrary board state b in B as follows:
1. if b is a final board state that is won, then V(b) = 100
2. if b is a final board state that is lost, then V(b) = -100
3. if b is a final board state that is drawn, then V(b) = 0
4. if b is not a final state in the game, then V(b) = V(b'), where b' is the best final board state that can be reached starting from b and playing optimally until the end of the game.
3. CHOOSING A REPRESENTATION FOR THE TARGET FUNCTION

Having specified the ideal target function V, we must choose a representation that the learning program will use to describe the function V' it will learn, as an approximation to V.
Let us choose a simple representation: for any given board state, V' will be calculated as a linear combination of the following board features:
x1: the number of white pieces on the board
x2: the number of black pieces on the board
x3: the number of black kings on the board
x4: the number of white kings on the board
x5: the number of black pieces threatened by white
x6: the number of white pieces threatened by black
The learning program will represent V'(b) as a linear function of the form

V'(b) = w0 + w1x1 + w2x2 + w3x3 + w4x4 + w5x5 + w6x6

where w0 through w6 are numerical coefficients (weights) chosen by the learning algorithm.
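To make the representation concrete, here is a small sketch that evaluates V'(b) = w0 + w1x1 + ... + w6x6 for one board state; the particular weights and feature values are illustrative assumptions, not values from the text:

```python
# Sketch: evaluate the linear representation V'(b) = w0 + w1*x1 + ... + w6*x6.
def v_hat(weights, features):
    """weights = [w0, w1, ..., w6]; features = [x1, ..., x6]."""
    w0, ws = weights[0], weights[1:]
    return w0 + sum(w * x for w, x in zip(ws, features))

# Hypothetical weights and board features (x1..x6), chosen for illustration:
weights = [0.5, 1.0, -1.0, 3.0, -3.0, 1.0, -1.0]
board = [12, 12, 0, 0, 1, 2]   # e.g. an early-game position
print(v_hat(weights, board))   # -0.5
```

Learning then amounts to finding good values for w0..w6, which is the subject of the next step.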
Partial Design of a Checkers Learning Program

Task T: playing checkers
Performance measure P: percent of games won against opponents
Training experience E: playing practice games against itself
Target function: V : B -> R
Target function representation: V'(b) = w0 + w1x1 + w2x2 + w3x3 + w4x4 + w5x5 + w6x6
4. CHOOSING A FUNCTION APPROXIMATION ALGORITHM

Learning the target function requires a set of training examples, each describing a specific board state b together with a training value Vtrain(b), and consists of two steps: estimating training values and adjusting the weights.

For instance, the following training example describes a board state b in which black has won the game (white has no ordinary pieces left, x1 = 0), with training value Vtrain(b) = +100:

<(x1 = 0, x2 = 3, x3 = 1, x4 = 1, x5 = 0, x6 = 0), +100>
1. Estimating training values - set the training value Vtrain(b) for any intermediate board state b to V'(Successor(b)), where Successor(b) denotes the next board state following b for which it is again the program's turn to move.

Rule for estimating training values:
Vtrain(b) <- V'(Successor(b))
2. Adjusting the weights - we must specify what it means to best fit the training data. One common approach is to define the best hypothesis, or set of weights, as the one that minimizes the squared error E between the training values and the values predicted by the hypothesis V':

E = Σ (Vtrain(b) - V'(b))^2

where the sum runs over all training examples <b, Vtrain(b)>.
Several algorithms are known for finding the weights of a linear function that minimize E. One such algorithm is the least mean squares, or LMS, training rule, defined as follows.

LMS weight update rule:
For each training example <b, Vtrain(b)>:
- Use the current weights to calculate V'(b)
- For each weight wi, update it as
  wi <- wi + η (Vtrain(b) - V'(b)) xi
where η is a small constant (the learning rate) that moderates the size of the update.
1. When the error (Vtrain(b) - V'(b)) is zero, no weights are changed.
2. When (Vtrain(b) - V'(b)) is positive, each weight is increased in proportion to the value of its corresponding feature. This raises the value of V'(b), reducing the error.
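The LMS update can be sketched as follows, assuming a learning rate η = 0.1 and one hypothetical training example (w0 is treated as the weight of a constant feature x0 = 1):

```python
# Sketch of one LMS weight update:
#   w_i <- w_i + eta * (Vtrain(b) - V'(b)) * x_i
def lms_update(weights, features, v_train, eta=0.1):
    xs = [1.0] + list(features)                       # x0 = 1 carries w0
    v_hat = sum(w * x for w, x in zip(weights, xs))   # current estimate V'(b)
    error = v_train - v_hat                           # Vtrain(b) - V'(b)
    return [w + eta * error * x for w, x in zip(weights, xs)]

weights = [0.0] * 7                      # w0..w6 start at zero
# hypothetical training example: features x1..x6 and Vtrain(b) = +100
weights = lms_update(weights, [0, 3, 1, 1, 0, 0], v_train=100.0)
```

With zero initial weights the error is +100, so every weight with a nonzero feature value is nudged upward, raising V'(b) toward the training value, as described above.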
5. FINAL DESIGN

Many learning systems have four distinct program modules:
1. The Performance System — takes a new problem (an initial board state) as input and outputs a trace of the game it played against itself.
2. The Critic — takes the trace of a game as input and outputs a set of training examples of the target function.
3. The Generalizer — takes training examples as input and outputs a hypothesis that estimates the target function. Good generalization to new cases is crucial.
4. The Experiment Generator — takes the current hypothesis (the currently learned function) as input and outputs a new problem (an initial board state) for the Performance System to explore.
(Figure: the final design of the checkers learning program, showing how the four modules interact.)
EVALUATING HYPOTHESES

It is important to evaluate the performance of a learned hypothesis; doing so is an integral component of many learning methods.

Two key difficulties arise when learning a hypothesis and estimating its future accuracy from only a limited set of data:
1. Bias in the estimate
2. Variance in the estimate
ESTIMATING HYPOTHESIS ACCURACY

The accuracy of a hypothesis can be estimated using two different notions of error:
1. Sample error
2. True error
Sample Error

The sample error of a hypothesis with respect to some sample S of instances drawn from X is the fraction of S that it misclassifies.

Definition: The sample error errorS(h) of hypothesis h with respect to target function f and data sample S is

errorS(h) = (1/n) Σ_{x in S} δ(f(x), h(x))

where n is the number of examples in S, and the quantity δ(f(x), h(x)) is 1 if f(x) ≠ h(x), and 0 otherwise.
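The definition above can be sketched directly in code (the labels here are hypothetical, chosen only to illustrate):

```python
# Sketch: sample error = fraction of the sample S that h misclassifies.
def sample_error(f_values, h_values):
    """f_values: true labels f(x); h_values: predictions h(x), over sample S."""
    n = len(f_values)
    mistakes = sum(1 for f, h in zip(f_values, h_values) if f != h)  # delta terms
    return mistakes / n

# hypothetical labels: h gets 2 of 4 examples wrong
print(sample_error([1, 0, 1, 1], [1, 1, 0, 1]))  # 0.5
```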
True Error

The true error of a hypothesis is the probability that it will misclassify a single instance drawn at random from the distribution D.

Definition: The true error errorD(h) of hypothesis h with respect to target function f and distribution D is the probability that h misclassifies an instance drawn at random according to D:

errorD(h) = Pr_{x in D} [ f(x) ≠ h(x) ]
Confidence Intervals for Discrete-Valued Hypotheses

Suppose we wish to estimate the true error of some discrete-valued hypothesis h, based on its observed sample error over a sample S, where:
- the sample S contains n examples drawn independently of one another, and independently of h, according to the probability distribution D
- n ≥ 30
- hypothesis h commits r errors over these n examples (i.e., errorS(h) = r/n)

Under these conditions, statistical theory allows us to make the following assertions:

1. Given no other information, the most probable value of errorD(h) is errorS(h).

2. With approximately 95% probability, the true error errorD(h) lies in the interval

errorS(h) ± 1.96 sqrt( errorS(h) (1 - errorS(h)) / n )
Example:
Suppose the data sample S contains n = 40 examples and hypothesis h commits r = 12 errors over this data.
The sample error is errorS(h) = r/n = 12/40 = 0.30.
Given no other information, the most probable value of the true error is errorD(h) = errorS(h) = 0.30.
The 95% confidence interval estimate for errorD(h) is
0.30 ± 1.96 sqrt( 0.30 × 0.70 / 40 ) = 0.30 ± (1.96 × 0.07) = 0.30 ± 0.14
3. A different constant, zN, is used to calculate the N% confidence interval. The general expression for an approximate N% confidence interval for errorD(h) is

errorS(h) ± zN sqrt( errorS(h) (1 - errorS(h)) / n )

where the constant zN depends on the desired confidence level:

Confidence level N%:  50    68    80    90    95    98    99
Constant zN:          0.67  1.00  1.28  1.64  1.96  2.33  2.58

This expression describes how to calculate confidence intervals, or error bars, for estimates of errorD(h) that are based on errorS(h).
Example:
Suppose again that the data sample S contains n = 40 examples and hypothesis h commits r = 12 errors, so errorS(h) = r/n = 12/40 = 0.30.
The 68% confidence interval estimate for errorD(h) is
0.30 ± (1.00 × 0.07) = 0.30 ± 0.07
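Both worked examples follow the same recipe, which can be sketched as (the zN constants are the standard ones; r = 12 and n = 40 repeat the example above):

```python
import math

# Standard zN constants for common two-sided confidence levels.
Z = {50: 0.67, 68: 1.00, 80: 1.28, 90: 1.64, 95: 1.96, 98: 2.33, 99: 2.58}

def confidence_interval(r, n, level=95):
    """Approximate N% confidence interval for errorD(h), given r errors over n examples."""
    e = r / n                                      # sample error errorS(h)
    half = Z[level] * math.sqrt(e * (1 - e) / n)   # zN * sqrt(e(1-e)/n)
    return e - half, e + half

lo, hi = confidence_interval(r=12, n=40, level=95)
print(round(lo, 2), round(hi, 2))  # 0.16 0.44, i.e. 0.30 +/- 0.14
```

Changing `level` to 68 reproduces the narrower 0.30 ± 0.07 interval.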
BASICS OF SAMPLING THEORY

Error Estimation and Estimating Binomial Proportions

Collect a random sample S of n independently drawn instances from the distribution D, and measure the sample error errorS(h). If we repeat this experiment many times, each time drawing a different random sample Si of size n, we would expect to observe different values for the various errorSi(h), depending on random differences in the makeup of the various Si. We say that errorSi(h), the outcome of the i-th such experiment, is a random variable.

Imagine that we run k such random experiments, measuring the random variables errorS1(h), errorS2(h), ..., errorSk(h), and plot a histogram displaying the frequency with which each possible error value is observed. As k grows, the histogram approaches a particular probability distribution called the Binomial distribution, shown in the figure below.
A Binomial distribution is defined by the probability function

P(r) = ( n! / ( r! (n - r)! ) ) p^r (1 - p)^(n - r)

If the random variable X follows a Binomial distribution, then the probability Pr(X = r) that X will take on the value r is given by P(r).
Consider the following problem for a better understanding of the Binomial distribution.

Toss a coin with unknown probability of heads p. Toss the coin n times and record the number of times r that it turns up heads. The estimate of p is then r / n.

The Binomial distribution describes, for each possible value of r (from 0 to n), the probability of observing exactly r heads in a sample of n independent tosses of a coin whose true probability of heads is p.
The general setting to which the Binomial distribution applies is:

1. There is a base experiment (e.g., a toss of the coin) whose outcome can be described by a random variable Y. The random variable Y can take on two possible values (e.g., Y = 1 if heads, Y = 0 if tails).

2. The probability that Y = 1 on any single trial of the base experiment is given by some constant p, independent of the outcome of any other experiment. The probability that Y = 0 is therefore (1 - p). Typically, p is not known in advance, and the problem is to estimate it.

3. A series of n independent trials of the underlying experiment is performed (e.g., n independent coin tosses), producing the sequence of independent, identically distributed random variables Y1, Y2, ..., Yn. Let R denote the number of trials for which Yi = 1 in this series of n experiments.

4. The probability that the random variable R will take on a specific value r (e.g., the probability of observing exactly r heads) is given by the Binomial distribution.
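The coin-tossing setting above can be sketched as (the particular r, n and p values are illustrative):

```python
import math

# P(r) = n! / (r! (n-r)!) * p^r * (1-p)^(n-r)
def binomial_p(r, n, p):
    """Probability of exactly r successes in n independent trials."""
    return math.comb(n, r) * p**r * (1 - p)**(n - r)

# probability of observing exactly 2 heads in 4 tosses of a fair coin
print(binomial_p(2, 4, 0.5))  # 0.375

# estimating an unknown p from an observed run: r = 7 heads in n = 10 tosses
r, n = 7, 10
print(r / n)  # 0.7
```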
Mean, Variance and Standard Deviation

1. The mean (expected value) is the average of the values taken on by repeatedly sampling the random variable.

Definition: Consider a random variable Y that takes on the possible values y1, ..., yn. The expected value (mean) of Y, E[Y], is

E[Y] = Σi yi Pr(Y = yi)

2. The variance captures how far the random variable is expected to vary from its mean value.

Definition: The variance of a random variable Y, Var[Y], is

Var[Y] = E[ (Y - E[Y])^2 ]

The variance describes the expected squared error in using a single observation of Y to estimate its mean E[Y].

3. The square root of the variance is called the standard deviation of Y, denoted σY.

Definition: The standard deviation of a random variable Y, σY, is

σY = sqrt( E[ (Y - E[Y])^2 ] )

If the random variable Y is governed by a Binomial distribution with parameters n and p, then the mean, variance and standard deviation are given by

E[Y] = np
Var[Y] = np(1 - p)
σY = sqrt( np(1 - p) )
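These Binomial formulas can be checked with a short sketch (the choice n = 40, p = 0.3 mirrors the earlier error-estimation example):

```python
import math

# For Y ~ Binomial(n, p): E[Y] = n*p, Var[Y] = n*p*(1-p), sigma_Y = sqrt(Var[Y]).
def binomial_stats(n, p):
    mean = n * p
    var = n * p * (1 - p)
    return mean, var, math.sqrt(var)

mean, var, sd = binomial_stats(40, 0.3)
print(mean, var, sd)  # roughly 12, 8.4 and 2.9 (up to float rounding)
```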
Estimators, Bias, and Variance

Let us describe errorS(h) and errorD(h) using the terms of the equation defining the Binomial distribution. We then have

errorS(h) = r / n
errorD(h) = p

where
n is the number of instances in the sample S,
r is the number of instances from S misclassified by h, and
p is the probability of misclassifying a single instance drawn from D.
Estimator: errorS(h) is an estimator for the true error errorD(h). An estimator is any random variable used to estimate some parameter of the underlying population from which the sample is drawn.

Estimation bias: the difference between the expected value of the estimator and the true value of the parameter.

Definition: The estimation bias of an estimator Y for an arbitrary parameter p is

E[Y] - p
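As a small sanity check (a sketch using exact enumeration rather than simulation): the estimator Y = r/n for p has E[Y] = p, so its estimation bias E[Y] - p is zero; this is why errorS(h) is an unbiased estimator of errorD(h). The n and p values below are illustrative.

```python
import math

# Exact E[r/n] for r ~ Binomial(n, p), by summing over all possible r.
def expected_sample_error(n, p):
    return sum((r / n) * math.comb(n, r) * p**r * (1 - p)**(n - r)
               for r in range(n + 1))

p = 0.3
bias = expected_sample_error(5, p) - p   # estimation bias E[Y] - p
print(abs(bias) < 1e-12)  # True: r/n is an unbiased estimator of p
```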