Neighborhood Component Analysis (NCA) Feature Selection


NCA Feature Selection for Classification

Consider a multi-class classification problem with a training set containing n observations:

$$S = \{(x_i, y_i),\; i = 1, 2, \ldots, n\},$$

where the $x_i \in \mathbb{R}^p$ are feature vectors, the $y_i \in \{1, 2, \ldots, c\}$ are class labels, and $c$ is the number of classes. The aim is to learn a classifier $f : \mathbb{R}^p \to \{1, 2, \ldots, c\}$ that accepts a feature vector $x$ and makes a prediction $f(x)$ for its true label $y$.
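
For concreteness, a toy instance of such a training set in MATLAB might look like the following (the data, feature count, and class count are made-up values):

```matlab
% Toy multi-class training set S = {(x_i, y_i)}: n observations with
% p features and c classes (all values are made up for illustration).
rng(1);
n = 30; p = 4; c = 3;
X = randn(n, p);            % feature vectors x_i as rows of X
y = randi(c, n, 1);         % class labels y_i in {1, ..., c}
```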

Consider a randomized classifier that:

• Randomly picks a point, $\mathrm{Ref}(x)$, from $S$ as the 'reference point' for $x$, and
• Labels $x$ using the label of the reference point $\mathrm{Ref}(x)$.

This scheme is similar to that of a 1-NN classifier, where the reference point is chosen to be the nearest neighbor of the new point $x$. In NCA, the reference point is chosen randomly and all points in $S$ have some probability of being selected as the reference point. The probability $P(\mathrm{Ref}(x) = x_j \mid S)$ that point $x_j$ is picked from $S$ as the reference point for $x$ is higher if $x_j$ is closer to $x$ as measured by the distance function $d_w$, where

$$d_w(x_i, x_j) = \sum_{r=1}^{p} w_r^2 \, |x_{ir} - x_{jr}|,$$

and the $w_r$ are the feature weights. Assume that

$$P(\mathrm{Ref}(x) = x_j \mid S) \propto k(d_w(x, x_j)),$$

where $k$ is some kernel or similarity function that assumes large values when $d_w(x, x_j)$ is small. Suppose it is

$$k(z) = \exp\left(-\frac{z}{\sigma}\right),$$

as suggested in [1]. The reference point for $x$ is chosen from $S$, so the sum of $P(\mathrm{Ref}(x) = x_j \mid S)$ over all $j$ must equal 1. Therefore, it is possible to write

$$P(\mathrm{Ref}(x) = x_j \mid S) = \frac{k(d_w(x, x_j))}{\sum_{j=1}^{n} k(d_w(x, x_j))}.$$
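
The following MATLAB sketch computes these reference-point probabilities for a new point $x$ and then draws one reference point from them; the toy data, the weight values, and $\sigma = 1$ are all assumptions made for illustration.

```matlab
% Reference-point probabilities P(Ref(x) = x_j | S) for a new point x.
% Toy data; the feature weights and kernel width are made-up values.
rng(1);
n = 10; p = 3;
X = rand(n, p);                     % training points x_j (rows)
y = randi(2, n, 1);                 % their class labels
w = [1.0; 0.5; 0.1];                % feature weights w_r (assumed)
sigma = 1;                          % kernel width

x = rand(1, p);                     % new point to classify

dists = abs(X - x) * (w.^2);        % d_w(x, x_j) = sum_r w_r^2 |x_r - x_jr|
k = exp(-dists / sigma);            % kernel k(z) = exp(-z/sigma)
probRef = k / sum(k);               % normalize so the probabilities sum to 1

% Randomized classifier: draw a reference point and use its label
jRef = find(rand <= cumsum(probRef), 1);
yPred = y(jRef);
```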

Now consider the leave-one-out application of this randomized classifier, that is, predicting the label of $x_i$ using the data in $S^{-i}$, the training set $S$ excluding the point $(x_i, y_i)$. The probability that point $x_j$ is picked as the reference point for $x_i$ is

$$p_{ij} = P(\mathrm{Ref}(x_i) = x_j \mid S^{-i}) = \frac{k(d_w(x_i, x_j))}{\sum_{j=1,\, j \neq i}^{n} k(d_w(x_i, x_j))}.$$

The leave-one-out probability of correct classification for observation $i$ is the probability $p_i$ that the randomized classifier correctly classifies observation $i$ using $S^{-i}$.

$$p_i = \sum_{j=1,\, j \neq i}^{n} P(\mathrm{Ref}(x_i) = x_j \mid S^{-i})\, I(y_i = y_j) = \sum_{j=1,\, j \neq i}^{n} p_{ij} y_{ij},$$

where

$$y_{ij} = I(y_i = y_j) = \begin{cases} 1 & \text{if } y_i = y_j, \\ 0 & \text{otherwise.} \end{cases}$$
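
A minimal sketch of the leave-one-out quantities $p_{ij}$, $y_{ij}$, and $p_i$ on toy data follows (the variable names are illustrative, not fscnca internals). It also checks that each row of the $p_{ij}$ matrix sums to 1, a fact used again further below.

```matlab
% Leave-one-out probabilities p_ij, indicators y_ij, and p_i on toy data.
% Variable names and values are illustrative, not fscnca internals.
rng(1);
n = 10; p = 3;
X = rand(n, p);
y = randi(3, n, 1);                 % class labels in {1, 2, 3}
w = rand(p, 1);                     % feature weights (assumed)
sigma = 1;

D = zeros(n);                       % pairwise distances d_w(x_i, x_j)
for i = 1:n
    D(i, :) = (abs(X - X(i, :)) * (w.^2))';
end

K = exp(-D / sigma);                % kernel values
K(1:n+1:end) = 0;                   % exclude j = i (leave-one-out)
P = K ./ sum(K, 2);                 % p_ij
assert(all(abs(sum(P, 2) - 1) < 1e-12))   % each row of P sums to 1

Yij = (y == y');                    % y_ij = I(y_i = y_j)
pi_vec = sum(P .* Yij, 2);          % p_i = sum_j p_ij * y_ij
F = mean(pi_vec);                   % F(w) = (1/n) * sum_i p_i
fprintf('F(w) = %.4f\n', F);
```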

The average leave-one-out probability of correct classification using the randomized classifier can be written as

$$F(w) = \frac{1}{n} \sum_{i=1}^{n} p_i.$$

The right-hand side of $F(w)$ depends on the weight vector $w$. The goal of neighborhood component analysis is to maximize $F(w)$ with respect to $w$. fscnca uses the regularized objective function introduced in [1]:

$$F(w) = \frac{1}{n} \sum_{i=1}^{n} p_i - \lambda \sum_{r=1}^{p} w_r^2 = \frac{1}{n} \sum_{i=1}^{n} \underbrace{\left[ \sum_{j=1,\, j \neq i}^{n} p_{ij} y_{ij} - \lambda \sum_{r=1}^{p} w_r^2 \right]}_{F_i(w)} = \frac{1}{n} \sum_{i=1}^{n} F_i(w),$$

where $\lambda$ is the regularization parameter. The regularization term drives many of the weights in $w$ to 0.

After fixing the kernel parameter $\sigma$ in $p_{ij}$ at 1, finding the weight vector $w$ can be expressed as the following minimization problem for a given $\lambda$:

$$\hat{w} = \arg\min_{w} f(w) = \arg\min_{w} \frac{1}{n} \sum_{i=1}^{n} f_i(w),$$

where $f(w) = -F(w)$ and $f_i(w) = -F_i(w)$.
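
To make the minimization concrete, here is a rough sketch that evaluates $f(w) = -F(w)$ on made-up data and minimizes it with MATLAB's generic fminsearch. This only illustrates the objective; it is not the solver fscnca uses, and the data, $\lambda$, and starting point are assumptions.

```matlab
% Sketch: minimize f(w) = -F(w) + lambda * sum(w_r^2) with a generic optimizer.
% Illustration only -- not the fscnca solver. Data and lambda are made up.
rng(1);
n = 40; p = 4;
X = [randn(n/2, p); randn(n/2, p) + 1];    % two overlapping classes
y = [ones(n/2, 1); 2*ones(n/2, 1)];
lambda = 1/n;                              % regularization parameter (assumed)
sigma  = 1;                                % kernel width fixed at 1

f = @(w) ncaObjective(w, X, y, lambda, sigma);
w0 = ones(p, 1);                           % starting point
wHat = fminsearch(f, w0);                  % weight vector minimizing f(w)
disp(wHat.^2')                             % effective feature weights w_r^2

function val = ncaObjective(w, X, y, lambda, sigma)
% f(w) = -(1/n) * sum_i p_i + lambda * sum_r w_r^2
n = size(X, 1);
D = zeros(n);
for i = 1:n
    D(i, :) = (abs(X - X(i, :)) * (w.^2))';   % d_w(x_i, x_j)
end
K = exp(-D / sigma);
K(1:n+1:end) = 0;                             % leave-one-out: exclude j = i
P = K ./ sum(K, 2);                           % p_ij
pi_vec = sum(P .* (y == y'), 2);              % p_i
val = -mean(pi_vec) + lambda * sum(w.^2);
end
```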

Note that

$$\frac{1}{n} \sum_{i=1}^{n} \sum_{j=1,\, j \neq i}^{n} p_{ij} = 1,$$

and the argument of the minimum does not change if you add a constant to an objective function. Therefore, you can rewrite the objective function by adding the constant 1.

$$
\begin{aligned}
\hat{w} &= \arg\min_{w} \{ 1 + f(w) \} \\
&= \arg\min_{w} \left\{ \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1,\, j \neq i}^{n} p_{ij} - \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1,\, j \neq i}^{n} p_{ij} y_{ij} + \lambda \sum_{r=1}^{p} w_r^2 \right\} \\
&= \arg\min_{w} \left\{ \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1,\, j \neq i}^{n} p_{ij} (1 - y_{ij}) + \lambda \sum_{r=1}^{p} w_r^2 \right\} \\
&= \arg\min_{w} \left\{ \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1,\, j \neq i}^{n} p_{ij}\, l(y_i, y_j) + \lambda \sum_{r=1}^{p} w_r^2 \right\},
\end{aligned}
$$

where the loss function is defined as

$$l(y_i, y_j) = \begin{cases} 1 & \text{if } y_i \neq y_j, \\ 0 & \text{otherwise.} \end{cases}$$

The argument of the minimum is the weight vector that minimizes the classification error. You can specify a custom loss function using the LossFunction name-value pair argument in the call to fscnca.
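
As a rough usage sketch of fscnca itself (Statistics and Machine Learning Toolbox): the toy data and the Lambda value below are made up, and the exact form expected for a custom LossFunction handle is an assumption here, so check it against the fscnca documentation before relying on it.

```matlab
% Rough sketch of calling fscnca; toy data and Lambda are made-up values.
rng(1);
n = 100; p = 6;
X = [randn(n/2, p); randn(n/2, p) + 0.8];   % two classes separated by a shift
y = [ones(n/2, 1); 2*ones(n/2, 1)];

mdl = fscnca(X, y, 'Lambda', 1/n);          % default misclassification loss
disp(mdl.FeatureWeights')                   % learned feature weights

% A custom 0-1 loss l(y_i, y_j) = I(y_i ~= y_j) passed as a function handle.
% The interface expected by 'LossFunction' is assumed here -- verify it
% against the fscnca documentation.
customLoss = @(yu, yv) double(yu(:) ~= yv(:)');
mdl2 = fscnca(X, y, 'Lambda', 1/n, 'LossFunction', customLoss);
```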