Monday, 12 April 2021

MACHINE LEARNING

Chapter 1: INTRODUCTION

 

Learn: to improve automatically with experience 

Machine Learning: programs that improve automatically with experience

Machine Learning is the field of study that gives computers the capability to learn without being explicitly programmed.

It is the process of training a model on data and evaluating how efficiently and effectively it performs.

Performance is denoted by a measure P.

It can also be defined as executing various methods, procedures, and functions on datasets so that a task is performed effectively.

Datasets are divided into two types when a model is built and evaluated on a machine:

1.      Training datasets – typically about 80% of the data; the model is fitted to these (e.g., at manufacture time).

2.      Test datasets – the remaining ~20%, used to evaluate the model on new data (e.g., during real-time use by the user).

Examples – washing machines, mobile phones, etc.
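To make the split concrete, here is a minimal sketch using scikit-learn's train_test_split; the arrays X and y are placeholder data, not part of the notes.

```python
# A minimal sketch of the 80%/20% train/test split described above,
# using scikit-learn's train_test_split (X and y are toy placeholders).
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(50, 2)   # toy feature matrix: 50 examples, 2 features
y = np.arange(50) % 2               # toy binary labels

# Hold out 20% of the examples for testing; train on the remaining 80%.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
print(len(X_train), len(X_test))    # -> 40 10
```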

Disciplines whose ideas influence machine learning:

1.      Artificial Intelligence

2.      Bayesian methods

3.      Computational complexity theory

4.      Control theory

5.      Information theory

6.      Philosophy

7.      Psychology and Neurobiology

 

Applications of Machine Learning

            Learning to recognize spoken words

            Learning to drive an autonomous vehicle

            Learning to play world-class computer games

Well Posed Learning Problems

Definition: A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.

A learning problem is therefore specified by identifying the task T, the performance measure P, and the source of training experience E.

1.      A checkers learning problem :

Task T :                       Playing Checkers

Performance measure P :         Percent of games won against opponents

Training experience E   :         playing practice games against itself

 

2.      A handwriting recognition learning problem

 

Task T :           recognizing and classifying handwritten words within images

Performance measure P :         percent of words correctly classified

Training experience E :           a database of handwritten words with given classifications

 

3.      A robot driving learning problem

Task T :           driving on public four-lane highways using vision sensors

Performance measure P :          average distance traveled before an error, as judged by a human overseer

Training experience E   : A sequence of images and steering commands recorded while observing a human driver.

Designing a Learning System

There are 5 steps in designing a learning system

1.      Choosing  the training experience

2.      Choosing   the target function

3.      Choosing  a representation for a target function

4.      Choosing  a function approximation algorithm

5.      Final design

 

1.      CHOOSING THE TRAINING EXPERIENCE

Suppose we want to teach a machine to play checkers.

The first design choice is the type of training experience from which our system will learn.

There are three key attributes that characterize the training experience.

 

 

 

First key attribute – whether the training experience provides direct or indirect feedback regarding the choices made by the performance system.

 

Direct feedback: consists of individual board states and the correct move for each board state.

 

Indirect feedback: consists of move sequences and the final outcomes of various games played. (Here the learner faces the problem of credit assignment: deciding how much credit or blame each move in the sequence deserves for the final outcome.)

 

 

Second key attribute – the degree to which the learner controls the sequence of training examples.

Teacher – learner: the teacher selects informative board states and provides the correct move for each.

Learner – teacher: the learner proposes board states it finds confusing and asks the teacher for the correct move.

Self-learning: the learner has complete control over its training experience, for example by playing games against itself.

(These cases differ in how much the learner, rather than the trainer, controls the degree and sequence of learning.)

 

Third key attribute – how well the training experience represents the distribution of examples over which the final system performance P will be measured.

2. CHOOSING THE TARGET FUNCTION

a)      The next design choice is to determine exactly what type of knowledge will be learned and how it will be used by the performance program.

Example – a checkers program:

            We can generate the legal moves from any board state. The program needs only to learn how to choose the best move from among the legal moves.

Let us call this target function ChooseMove.

            Notation:

            ChooseMove : B → M, where B is the set of legal board states and M is the set of legal moves.

b)      An alternative target function, which turns out to be easier to learn, assigns a numerical score to any given board state.

Notation:

V : B → R, where V maps any legal board state in the set B to some real value.

Therefore, let us define the target value V(b) for an arbitrary board state b in B as follows:

 

1.      If b is a final board state that is won, then V(b) = 100

2.      If b is a final board state that is lost, then V(b) = -100

3.      If b is a final board state that is a draw, then V(b) = 0

4.      If b is not a final state in the game, then V(b) = V(b'), where b' is the best final board state that can be reached from b, assuming both players play optimally until the end of the game. (This recursive definition is correct but not efficiently computable; it is nonoperational, which is why the program learns an approximation V'.)

 

3. CHOOSING A REPRESENTATION FOR THE TARGET FUNCTION

Having specified the ideal target function V, we must choose a representation that the learning program will use to describe the function V' it will actually learn.

Let us choose a simple representation: for any given board state, the function V' will be calculated as a linear combination of the following board features:

             x1 : the number of white pieces on the board

            x2 : the number of black pieces on the board

            x3 : the number of black kings on the board

            x4 : the number of white kings on the board

            x5 : the number of black pieces threatened by white (i.e., capturable on white's next move)

            x6 : the number of white pieces threatened by black

The learning program will represent V'(b) as a linear function of the form

            V'(b) = w0 + w1x1 + w2x2 + w3x3 + w4x4 + w5x5 + w6x6

where w0 through w6 are numerical coefficients, or weights, chosen by the learning algorithm.
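To make the representation concrete, here is a minimal sketch of how V'(b) could be computed once the six features have been extracted from a board state; the feature extraction itself and the weight values shown are assumptions for illustration.

```python
# A sketch of the linear evaluation function
#   V'(b) = w0 + w1*x1 + w2*x2 + w3*x3 + w4*x4 + w5*x5 + w6*x6
# The features x1..x6 are assumed to be already extracted from board state b.

def v_prime(features, weights):
    """Evaluate a board state.

    features: [x1, x2, x3, x4, x5, x6]
    weights:  [w0, w1, ..., w6], where w0 is the constant term
    """
    score = weights[0]                       # w0
    for w, x in zip(weights[1:], features):  # w1*x1 + ... + w6*x6
        score += w * x
    return score

# Illustrative weights only; the learning algorithm would choose these.
weights = [0.0, -2.0, 2.0, 3.0, -3.0, 1.0, -1.0]
features = [0, 3, 1, 0, 0, 0]       # the winning-for-black state used later
print(v_prime(features, weights))   # -> 9.0
```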

Partial Design of a Checkers Learning Program

Task T :                       Playing Checkers

Performance measure P :         Percent of games won against opponents

Training experience E   :         playing practice games against itself

Target function                       : V : B → R

Target function representation: V’(b) = w0 + w1x1+w2x2+w3x3+w4x4+w5x5+w6x6

 

4. CHOOSING A FUNCTION APPROXIMATION ALGORITHM

To learn the target function V' we need a set of training examples, each describing a specific board state b together with a training value Vtrain(b) for b.

For example, the following training example describes a checkers board state b in which black has won the game (x1 = 0 indicates that white has no remaining pieces), so Vtrain(b) = +100:

((x1 = 0, x2 = 3, x3 = 1, x4 = 0, x5 = 0, x6 = 0), +100)

1.      Estimating training values – a simple approach is to set the training value Vtrain(b) for any intermediate board state b to V'(Successor(b)), where Successor(b) denotes the next board state following b for which it is again the program's turn to move.

Rule for estimating training values:

Vtrain(b) ← V'(Successor(b))

2.      Adjusting the weights – we must specify what it means to best fit the training data. One common approach is to define the best hypothesis, or set of weights, as the one that minimizes the squared error E between the training values and the values predicted by the hypothesis V':

 

E = Σ_{(b, Vtrain(b)) ∈ training examples} (Vtrain(b) - V'(b))²

 

Several algorithms are known for finding the weights of a linear function that minimize E. One such algorithm is called the least mean squares, or LMS, training rule.

The LMS algorithm is defined as follows:

 

LMS weight update rule:

For each training example (b, Vtrain(b)):

·         Use the current weights to calculate V'(b)

·         For each weight wi, update it as

wi ← wi + η (Vtrain(b) - V'(b)) xi

where η is a small constant (the learning rate) that moderates the size of the weight update.

 

 

1.      When the error (Vtrain(b) - V'(b)) is zero, no weights are changed.

2.      When (Vtrain(b) - V'(b)) is positive, each weight is increased in proportion to the value of its corresponding feature. This raises the value of V'(b), reducing the error.

3.      When (Vtrain(b) - V'(b)) is negative, each weight is decreased in proportion to the value of its corresponding feature, lowering V'(b) and again reducing the error.
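The following sketch implements the LMS weight-update rule for this linear representation; the learning-rate value and the single repeated training example are illustrative assumptions, not part of the original design.

```python
# A sketch of the LMS weight-update rule:
#   wi <- wi + eta * (Vtrain(b) - V'(b)) * xi
# applied to V'(b) = w0 + w1*x1 + ... + w6*x6 (x0 = 1 handles w0).

def lms_update(weights, features, v_train, eta=0.1):
    xs = [1.0] + list(features)                      # prepend x0 = 1 for w0
    v_hat = sum(w * x for w, x in zip(weights, xs))  # current estimate V'(b)
    error = v_train - v_hat                          # Vtrain(b) - V'(b)
    return [w + eta * error * x for w, x in zip(weights, xs)]

# Illustrative training example: board features (0, 3, 1, 0, 0, 0), Vtrain = +100.
weights = [0.0] * 7
for _ in range(50):
    weights = lms_update(weights, [0, 3, 1, 0, 0, 0], 100.0)
print([round(w, 2) for w in weights])   # weights have moved so V'(b) ~ 100
```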


 

5. FINAL DESIGN

The final design consists of four distinct program modules found in many learning systems:

1.      The Performance System — Takes a new problem (an initial board state) as input and produces a trace of its solution (the game history) as output, choosing moves using the learned evaluation function V'.

2.      The Critic — Takes the trace of a game as an input and outputs a set of training examples of the target function.

3.      The Generalizer — Takes training examples as input and outputs a hypothesis that estimates the target function. Good generalization to new cases is crucial.

4.      The Experiment Generator — Takes the current hypothesis (currently learned function) as input and outputs a new problem (an initial board state) for the performance system to explore.

[Figure: Final design of the checkers learning program]

[Figure: Designing the checkers learning program — summary of design choices]

Chapter 5: EVALUATING HYPOTHESES

 

 

 

 

It is important to evaluate the performance of a learned hypothesis.

Evaluating hypotheses is an integral component of many learning methods.

 

 

Two key difficulties arise while learning a hypothesis and estimating its future accuracy given only a limited set of data:

 

1.      Bias in the estimate.

2.      Variance in the estimate.

 

ESTIMATING HYPOTHESIS ACCURACY

  

The accuracy of a hypothesis can be estimated using two different types of error:

 

1. Sample Error

2. True Error

 

Sample Error

 

The sample error of a hypothesis with respect to some sample S of instances drawn from X is the fraction of S that it misclassifies.

 

Definition: The sample error (errorS(h)) of hypothesis h with respect to target function f and data sample S is

            errorS(h) = (1/n) Σ_{x ∈ S} δ(f(x), h(x))

where n is the number of examples in S, and the quantity δ(f(x), h(x)) is 1 if f(x) ≠ h(x), and 0 otherwise.
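A minimal sketch of this definition, assuming the target function f and hypothesis h are available as Python functions and S is a list of instances:

```python
# A sketch of the sample error: the fraction of sample S that h misclassifies.

def sample_error(f, h, S):
    # delta(f(x), h(x)) is 1 when f(x) != h(x), 0 otherwise.
    disagreements = sum(1 for x in S if f(x) != h(x))
    return disagreements / len(S)

# Toy example: h says everything is positive, f only for x >= 0.
f = lambda x: x >= 0
h = lambda x: True
S = [-2, -1, 0, 1, 2]
print(sample_error(f, h, S))   # -> 0.4
```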

 

True Error

 

The true error of a hypothesis is the probability that it will misclassify a single randomly drawn instance from the distribution D.

 

Definition: The true error (errorD(h)) of hypothesis h with respect to target function f and distribution D is the probability that h will misclassify an instance drawn at random according to D:

            errorD(h) = Pr_{x ∈ D} [ f(x) ≠ h(x) ]

Confidence Intervals for Discrete-Valued Hypotheses

 

 

Suppose we wish to estimate the true error for some discrete-valued hypothesis h, based on its observed sample error over a sample S, where

 

The sample S contains n examples drawn independently of one another, and independently of h, according to the probability distribution D

 

n ≥ 30

 

Hypothesis h commits r errors over these n examples (i.e., errorS(h) = r/n).

 

 

Under these conditions, statistical theory allows us to make the following assertions:

1.      Given no other information, the most probable value of errorD(h) is errorS(h).

2.      With approximately 95% probability, the true error errorD(h) lies in the interval

errorS(h) ± 1.96 √( errorS(h) (1 - errorS(h)) / n )

Example:

 

Suppose the data sample S contains n = 40 examples and that hypothesis h commits r = 12 errors over this data.

 

The sample error is errorS(h) = r/n = 12/40 = 0.30.

Given no other information, the most probable value of the true error is errorD(h) = errorS(h) = 0.30.

The 95% confidence interval estimate for errorD(h) is

0.30 ± 1.96 √( 0.30 × 0.70 / 40 ) = 0.30 ± (1.96 × 0.07) = 0.30 ± 0.14


 

 

 

3.      A different constant, zN, is used to calculate the N% confidence interval. The general expression for approximate N% confidence intervals for errorD(h) is

errorS(h) ± zN √( errorS(h) (1 - errorS(h)) / n )

where the constant zN depends on the desired confidence level:

Confidence level N%:    50      68      80      90      95      98      99

Constant zN:            0.67    1.00    1.28    1.64    1.96    2.33    2.58

The above equation describes how to calculate the confidence intervals, or error bars, for estimates of errorD(h) that are based on errorS(h).
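A minimal sketch of this calculation, using the zN table above; the function name and interface are choices made for illustration.

```python
# A sketch of the N% confidence interval for errorD(h):
#   errorS(h) +/- zN * sqrt(errorS(h) * (1 - errorS(h)) / n)
import math

Z_N = {50: 0.67, 68: 1.00, 80: 1.28, 90: 1.64, 95: 1.96, 98: 2.33, 99: 2.58}

def confidence_interval(r, n, level=95):
    e = r / n                                            # sample error errorS(h)
    half_width = Z_N[level] * math.sqrt(e * (1 - e) / n)
    return e - half_width, e + half_width

# The worked example from these notes: n = 40 examples, r = 12 errors.
print(confidence_interval(12, 40, 95))   # ~ (0.16, 0.44), i.e. 0.30 +/- 0.14
print(confidence_interval(12, 40, 68))   # ~ (0.23, 0.37), i.e. 0.30 +/- 0.07
```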

 

 

Example:

 

Suppose the data sample S contains n = 40 examples and that hypothesis h commits r = 12 errors over this data.

 

The sample error is errorS(h) = r/n = 12/40 = 0.30.

The 68% confidence interval estimate for errorD(h) is

0.30 ± 1.00 √( 0.30 × 0.70 / 40 ) = 0.30 ± (1.00 × 0.07) = 0.30 ± 0.07

Note that this interval is narrower than the 95% interval because we demand less confidence in it.

 

 

 

 

BASICS OF SAMPLING THEORY

 

 

Error Estimation and Estimating Binomial Proportions

 

Collect a random sample S of n independently drawn instances from the distribution D, and then measure the sample error errorS(h). If we repeat this experiment many times, each time drawing a different random sample Si of size n, we would expect to observe different values for the various errorSi(h), depending on random differences in the makeup of the various Si. We say that errorSi(h), the outcome of the ith such experiment, is a random variable.

 

Imagine that we were to run k such random experiments, measuring the random variables errorS1(h), errorS2(h), . . . , errorSk(h), and that we plotted a histogram displaying the frequency with which each possible error value is observed.

 

As k grows, the histogram approaches a particular probability distribution called the Binomial distribution, shown in the figure below.

[Figure: The Binomial distribution]

A Binomial distribution is defined by the probability function

P(r) = ( n! / ( r! (n - r)! ) ) p^r (1 - p)^(n - r)

If the random variable X follows a Binomial distribution, then:

·         The probability Pr(X = r) that X will take on the value r is given by P(r)

·         The expected, or mean, value of X is E[X] = np

·         The variance of X is Var(X) = np(1 - p)

·         The standard deviation of X is σX = √( np(1 - p) )

The Binomial Distribution

 

Consider the following problem for a better understanding of the Binomial distribution.

 

Toss a coin and estimate the probability that the coin will turn up heads. The true probability of heads, p, is unknown. Toss the coin n times and record the number of times r that it turns up heads. A reasonable estimate of p is then r / n.

 

 

The Binomial distribution describes for each possible value of r (i.e., from 0 to n), the probability of observing exactly r heads given a sample of n independent tosses of a coin whose true probability of heads is p.
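A minimal sketch of the Binomial probability function P(r), using Python's math.comb for the binomial coefficient:

```python
# A sketch of the Binomial probability function:
#   P(r) = (n! / (r! * (n - r)!)) * p**r * (1 - p)**(n - r)
import math

def binom_pmf(r, n, p):
    return math.comb(n, r) * p**r * (1 - p)**(n - r)

# Probability of observing exactly 12 heads in 40 tosses when p = 0.3.
print(binom_pmf(12, 40, 0.3))                          # ~ 0.14
# The probabilities over all possible r sum to 1.
print(sum(binom_pmf(r, 40, 0.3) for r in range(41)))   # -> ~1.0
```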

 

 

 

 

 

The general setting to which the Binomial distribution applies is:

 

 

1.      There is a base experiment (e.g., the toss of a coin) whose outcome can be described by a random variable Y. The random variable Y can take on two possible values (e.g., Y = 1 if heads, Y = 0 if tails).

 

2.      The probability that Y = 1 on any single trial of the base experiment is given by some constant p, independent of the outcome of any other experiment. The probability that Y = 0 is therefore (1 - p). Typically, p is not known in advance, and the problem is to estimate it.

 

3.      A series of n independent trials of the underlying experiment is performed (e.g., n independent coin tosses), producing the sequence of independent, identically distributed random variables Y1, Y2, . . . , Yn. Let R denote the number of trials for which Yi = 1 in this series of n experiments.

 

 

 

 

 

4.      The probability that the random variable R will take on a specific value r (e.g., the probability of observing exactly r heads) is given by the Binomial distribution

Pr(R = r) = ( n! / ( r! (n - r)! ) ) p^r (1 - p)^(n - r)          (Equation 1)


 

Mean, Variance and Standard Deviation

 

1. The mean (expected value) is the average of the values taken on by repeatedly sampling the random variable.

 

Definition: Consider a random variable Y that takes on the possible values y1, . . . , yn. The expected value (mean) of Y, E[Y], is

E[Y] = Σ (i = 1 to n)  yi Pr(Y = yi)

2. The variance captures how far the random variable is expected to vary from its mean value.

 

Definition: The variance of a random variable Y, Var[Y], is

Var[Y] = E[ (Y - E[Y])² ]

The variance describes the expected squared error in using a single observation of Y to estimate its mean E[Y].

 

 

 

 

 

3. The square root of the variance is called the standard deviation of Y, denoted σY.

 

Definition: The standard deviation of a random variable Y, σY, is

σY = √( E[ (Y - E[Y])² ] )

If the random variable Y is governed by a Binomial distribution, then the mean, variance, and standard deviation of Y are given by

E[Y] = np

Var[Y] = np(1 - p)

σY = √( np(1 - p) )
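As a quick numeric check of these closed forms, the sketch below compares them with a Monte Carlo simulation for the running example (n = 40, p = 0.3); the number of simulated trials is an arbitrary choice.

```python
# Checking E[Y] = np, Var[Y] = np(1 - p), sigma = sqrt(np(1 - p)) by simulation.
import math
import random

n, p = 40, 0.3
print(n * p, n * p * (1 - p), math.sqrt(n * p * (1 - p)))  # 12.0 8.4 ~2.9

random.seed(0)
samples = [sum(random.random() < p for _ in range(n)) for _ in range(100_000)]
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
print(round(mean, 2), round(var, 2))   # close to 12.0 and 8.4
```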

Estimators, Bias, and Variance

 

Let us describe errorS(h) and errorD(h) using the terms in Equation (1) defining the Binomial distribution. We then have

errorS(h) = r / n

errorD(h) = p

Where,

 

n is the number of instances in the sample S,

 

r is the number of instances from S misclassified by h

 

p is the probability of misclassifying a single instance drawn from D

 

 

Estimator:

errorS(h) is called an estimator for the true error errorD(h). An estimator is any random variable used to estimate some parameter of the underlying population from which the sample is drawn.

 

 

 

Estimation bias is the difference between the expected value of the estimator and the true value of the parameter.

 

Definition: The estimation bias of an estimator Y for an arbitrary parameter p is

E[Y] - p

If the estimation bias is zero, Y is an unbiased estimator of p. Since the mean of a Binomially distributed r is np, the expected value of r/n is p, so errorS(h) is an unbiased estimator of errorD(h).
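As an illustration of this definition, the sketch below estimates the bias of the estimator r/n by simulation; averaging r/n over many samples approaches the true p, which is consistent with errorS(h) being an unbiased estimator of errorD(h).

```python
# Estimating the bias E[Y] - p of the estimator Y = r/n by simulation.
import random

def estimate_bias(p=0.3, n=40, trials=100_000, seed=0):
    rng = random.Random(seed)
    # Each trial: draw n instances, count the r "misclassified" ones, record r/n.
    estimates = [sum(rng.random() < p for _ in range(n)) / n
                 for _ in range(trials)]
    return sum(estimates) / trials - p   # empirical E[r/n] - p

print(estimate_bias())   # ~ 0: r/n is an unbiased estimator of p
```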