CUR Decomposition





Prerequisite: Matrix Multiplication Sampling



Computation of ordinary matrix multiplication: for

$$A\in M_{m\times n}, \ B\in M_{n\times p}$$

the product can be written as a sum of outer products:

$$AB=\sum_{k=1}^n A(:,k)B(k,:)$$

The time complexity is $O(n^3)$ (for square matrices), which is too slow. We hope to compute the product $AB$ by random sampling instead, reducing the time complexity.



Single sample:

Randomly pick one column of $A$ together with the corresponding row of $B$. Suppose the probability of picking the $k$-th column (and hence the $k$-th row of $B$) is $p_k$, so that

$$p_1+\cdots+p_n=1$$

Define the single-sample estimator

$$X=\dfrac{1}{p_k}A(:,k)B(k,:)$$
Then

$$E(X)=\sum_{k=1}^n p_k \cdot \frac{1}{p_k}A(:,k)B(k,:)=AB$$
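As a quick numerical sanity check (not part of the original post), the following NumPy sketch averages many independent single-sample estimates and compares the average with $AB$; the matrix sizes and the uniform choice $p_k = 1/n$ are assumptions made purely for illustration, since unbiasedness holds for any valid $p_k$.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, p = 5, 8, 4
A = rng.standard_normal((m, n))
B = rng.standard_normal((n, p))

p_k = np.full(n, 1.0 / n)        # any distribution with p_k > 0 works; uniform for simplicity
trials = 100_000
est = np.zeros((m, p))
for _ in range(trials):
    k = rng.choice(n, p=p_k)                   # pick index k with probability p_k
    est += A[:, [k]] @ B[[k], :] / p_k[k]      # X = A(:,k) B(k,:) / p_k
est /= trials

# Monte Carlo average of X should be close to AB (error shrinks like 1/sqrt(trials))
print(np.abs(est - A @ B).max())
```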











The variance of this matrix-valued estimator is defined entrywise:

$$Var(X) \xlongequal{\text{def}} \sum\limits_{ij}Var(X_{ij}) =\sum_{ij}E(X_{ij}^2)-\sum_{ij}E^2(X_{ij})$$
Note: under this definition of the variance of a matrix, we have

$$Var(X)=E\big(\|X-E(X)\|_F^2\big)$$
First part:

$$\begin{aligned} \sum_{ij}E(X_{ij}^2) = & \sum_{ij}\sum_k p_k \cdot \dfrac{1}{p_k^2}A_{ik}^2 B_{kj}^2 \\ = & \sum_{k} \dfrac{1}{p_k} \Big(\sum_{i} A_{ik}^2 \Big) \Big( \sum_j B_{kj}^2 \Big) \\ = & \sum_{k} \dfrac{1}{p_k} \|A(:,k)\|^2 \|B(k,:)\|^2 \\ \xlongequal{p_k=\frac{\|A(:,k)\|^2}{\|A\|_F^2}} & \|A\|_F^2\sum_k\|B(k,:)\|^2 \\ = & \|A\|_F^2 \|B\|_F^2 \end{aligned}$$
Note: for the second-to-last equality, the Cauchy-Schwarz inequality shows that the variance would actually be minimized when $p_k \propto \|A(:,k)\| \|B(k,:)\|$. But for computational convenience, we let $p_k \propto \|A(:,k)\|^2$ here, that is,

$$p_k=\dfrac{\|A(:,k)\|^2}{\|A\|_F^2}$$

Second part:

$$\sum_{ij}E^2(X_{ij}) = \|E(X)\|_F^2 = \|AB\|_F^2$$
Therefore

$$Var(X) = \|A\|_F^2 \|B\|_F^2 - \|AB\|_F^2 \le \|A\|_F^2 \|B\|_F^2$$
Multiple samples:

Core idea: reduce the variance by taking multiple samples.

Randomly pick $s$ columns of $A$ to form $C \in M_{m\times s}$, and the corresponding $s$ rows of $B$ to form $R \in M_{s\times p}$. Of course, the picked columns and rows must also be multiplied by the appropriate scaling factors; concretely:

$$C = \begin{bmatrix} \dfrac{A(:,k_1)}{\sqrt {sp_{k_1}}},\cdots,\dfrac{A(:,k_s)}{\sqrt {sp_{k_s}}} \end{bmatrix}$$

$$R = \begin{bmatrix} \dfrac{B(k_1,:)}{\sqrt {sp_{k_1}}} \\ \vdots \\ \dfrac{B(k_s,:)}{\sqrt {sp_{k_s}}} \end{bmatrix}$$

where $k_1,\cdots,k_s$ are the result of randomly picking $s$ indices from $\{1,2,\cdots,n\}$ (with probabilities $p_k$).

In this way,

$$\begin{aligned} CR & = \frac{1}{s} \sum _i \dfrac{1}{p_{k_i}}A(:,k_i)B(k_i,:) \\ & = \frac{1}{s}\sum_i X_i \end{aligned}$$
Therefore $CR$ is just the average of $X$ over $s$ repeated, independent samples, and correspondingly

$$Var(CR) = Var\Big(\frac{1}{s}\sum_i X_i\Big) = \frac{1}{s^2} \sum_i Var(X_i) \le \frac{1}{s} \|A\|_F^2 \|B\|_F^2$$
Final conclusion:

$$Var(CR) = E\left(\|AB-CR\|_F^2\right) \le \frac{1}{s} \|A\|_F^2 \|B\|_F^2$$
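To make the sampling scheme concrete, here is a minimal NumPy sketch (not part of the original post) that builds $C$ and $R$ by length-squared sampling exactly as above and compares the observed error $\|AB-CR\|_F^2$ with the bound $\frac{1}{s}\|A\|_F^2\|B\|_F^2$; the matrix sizes and the value of $s$ are arbitrary assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, p, s = 100, 200, 80, 20
A = rng.standard_normal((m, n))
B = rng.standard_normal((n, p))

# length-squared probabilities: p_k = ||A(:,k)||^2 / ||A||_F^2
probs = np.sum(A**2, axis=0) / np.sum(A**2)

# sample s indices (with replacement) and build the rescaled C and R
idx = rng.choice(n, size=s, p=probs)
scale = np.sqrt(s * probs[idx])
C = A[:, idx] / scale            # m x s
R = B[idx, :] / scale[:, None]   # s x p

err = np.linalg.norm(A @ B - C @ R, 'fro') ** 2
bound = np.linalg.norm(A, 'fro') ** 2 * np.linalg.norm(B, 'fro') ** 2 / s
print(err, bound)                # the bound holds for the expectation of err over the sampling
```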























CUR Decomposition



Motivation: Sketch of Matrix

Let $A \in M_{m\times n}$. The time complexity of computing $Ax$ is $O(mn)$. But if we decompose $A$ into $CUR$, where $C \in M_{m\times s}$, $U \in M_{s \times r}$, $R \in M_{r\times n}$, then the time complexity would be $O(ms+sr+rn)$, which is $O(m+n)$ if $s$ and $r$ are $O(1)$.
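For instance, a matrix-vector product with the decomposed form is just three small multiplications, as in the following sketch (the factors here are random placeholders of the stated shapes, used only to illustrate the cost):

```python
import numpy as np

m, n, s, r = 1000, 800, 10, 10
rng = np.random.default_rng(4)
C = rng.standard_normal((m, s))
U = rng.standard_normal((s, r))
R = rng.standard_normal((r, n))
x = rng.standard_normal(n)

y = C @ (U @ (R @ x))   # O(rn) + O(sr) + O(ms) work, instead of O(mn) for a dense A @ x
```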



Comparison with SVD



Pros
  • faster
  • preserves the actual data $\Rightarrow$ easier to interpret than the linear combinations of data as in SVD
  • preserves some properties, such as sparsity


Cons
  • less accurate approximation
  • weaker error bound



Intuitions



First Try: Identity Matrix

Just let $B=I$ and repeat the matrix multiplication sampling process.

However, $I$ is full rank and every row has the same weight, so sampling loses too much information, and the resulting bound is too weak:

$$E\left( \|AI-CR\|_F^2 \right) \le \frac{1}{s} \|A\|_F^2 \|I\|_F^2 = \frac{n}{s} \|A\|_F^2$$

Projection Matrix

We want a low-rank $B$, so that sampling won't cause information loss.

Consider the projection matrix $P = R^T(RR^T)^{-1}R$, where $R \in M_{r\times n}$ is a matrix of $r$ rows of $A$ picked according to length squared sampling. $P$ acts just like a pseudo identity.

Note: properties of the projection matrix

$Px=x$ if $x$ is in the row space of $R$

$Px=0$ if $x$ is orthogonal to the row space of $R$
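As a small illustration (not from the original post), the following NumPy sketch builds $P = R^T(RR^T)^{-1}R$ for a random $R$ and checks the two properties above; the sizes are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
r, n = 3, 8
R = rng.standard_normal((r, n))          # r rows; full row rank with probability 1

P = R.T @ np.linalg.inv(R @ R.T) @ R     # n x n projection onto the row space of R

x_in = R.T @ rng.standard_normal(r)      # a vector in the row space of R
x_perp = rng.standard_normal(n)
x_perp -= P @ x_perp                     # remove its row-space component

print(np.allclose(P @ x_in, x_in))       # True: Px = x on the row space
print(np.allclose(P @ x_perp, 0))        # True: Px = 0 on the orthogonal complement
print(np.allclose(P @ P, P))             # True: P is idempotent, as a projection should be
```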



Main Theorem

Let $A\in M_{m\times n},\ r,s \in \mathbb N^+$.

Let $C\in M_{m\times s}$ be a matrix of $s$ columns of $A$ picked according to length squared sampling, and let $R \in M_{r\times n}$ be a matrix of $r$ rows of $A$ picked according to length squared sampling again.

Then we can find a $U \in M_{s\times r}$ so that

$$E\left(\|A-CUR\|_2^2\right) \le \|A\|_F^2 \left( \frac{1}{\sqrt{r}} + \frac{r}{s} \right)$$

Note: unlike in matrix multiplication sampling, the rows are not necessarily corresponding to the columns.

And we find $U$ in this way: wanting the rows of $UR$ to be the rows of $P=R^T(RR^T)^{-1}R$ corresponding to the columns of $A$ picked for $C$, we just pick those rows of $R^T$ and right-multiply the result by $(RR^T)^{-1}$, which gives exactly $U$.

Proof





$$\|A-CUR\|_2 \le \|A-AP\|_2 + \|AP-CUR\|_2$$

Note: the two norm of a matrix $A$ is $\|A\|_2 \xlongequal{\text{def}}\max\limits_{\|x\|=1}\|Ax\|_2$, and it satisfies the triangle inequality $\|A+B\|_2\le\|A\|_2+\|B\|_2$.

first part:

$$\|A-AP\|_2^2=\max_{\|x\|=1} \|Ax-APx\|^2$$

For $x$ in the row space of $R$, $\|Ax-APx\|_2 = 0$, so just consider the unit vectors $x \bot$ the row space of $R$:

$$\begin{aligned} \|Ax-APx\|^2 & = \|Ax\|^2 \\ & = x^TA^TAx=x^T(A^TA-R^TR)x \\ & \le \|x\|\|(A^TA-R^TR)x\| \\ & \le \|A^TA-R^TR\|_2 \\ & \le \frac{\|A\|_F^2}{\sqrt r} \quad \text{(in expectation)} \end{aligned}$$

Note: the last step comes from $E\left(\|AB-CR\|_F^2\right) \le \dfrac{1}{s} \|A\|_F^2 \|B\|_F^2$. Replace $A$ by $A^T$, $B$ by $A$, and $s$ by $r$ to get $E\left(\|A^TA-R^TR\|_F^2\right) \le \dfrac{1}{r}\|A\|_F^4$; combined with $\|\cdot\|_2 \le \|\cdot\|_F$ and Jensen's inequality, this gives $E\left(\|A^TA-R^TR\|_2\right) \le \dfrac{\|A\|_F^2}{\sqrt r}$.
second part:

$$\begin{aligned} E(\|AP-CUR\|_2^2) & \le E(\|AP-CUR\|_F^2)\\ & \le \dfrac{1}{s} \|A\|_F^2 \|P\|_F^2 \\ & \le \frac{r}{s}\|A\|_F^2 \end{aligned}$$

Note: the last step comes from $\|P\|_F^2 \le r$.

That is because $\|P\|_F^2$ equals the sum of the squared singular values of $P$, of which there are $r$ in total (the rank of $P$), each $\le 1$ because $\|Pv\| \le 1$ for any unit vector $v$, since $P$ is a projection.



Computation Step by Step

  1. Pick $s$ columns of $A$ according to length squared sampling, which forms $C$.
  2. Pick $r$ rows of $A$ according to length squared sampling, which forms $R$.
  3. Compute $(RR^T)^{-1}$.
  4. Pick the $s$ rows of $R^T$ corresponding to step 1, and right-multiply the resulting matrix by $(RR^T)^{-1}$, which forms $U$.
Note: what if the matrix $RR^T$ doesn't have an inverse?

Use the pseudo-inverse, which can be represented and computed via a matrix decomposition such as the SVD. As for why the pseudo-inverse works, see this passage for the intuition behind it.

In any case, $RR^T$ is an $r\times r$ matrix, so the time complexity of this step is quite low.
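Putting the four steps together, here is a minimal NumPy sketch (not from the original post) of the whole procedure. The rescaling by $1/\sqrt{s p_k}$ follows the $C$ and $R$ construction from the matrix multiplication sampling section; the test matrix and the values of $s$ and $r$ are arbitrary assumptions, and `np.linalg.pinv` is used so that a singular $RR^T$ is also handled.

```python
import numpy as np

def cur_decomposition(A, s, r, rng):
    """Randomized CUR sketch of A via length-squared (squared-norm) sampling."""
    total = np.sum(A**2)

    # Step 1: pick s columns of A by length-squared sampling (rescaled) -> C (m x s)
    col_p = np.sum(A**2, axis=0) / total
    cols = rng.choice(A.shape[1], size=s, p=col_p)
    col_scale = np.sqrt(s * col_p[cols])
    C = A[:, cols] / col_scale

    # Step 2: pick r rows of A by length-squared sampling (rescaled) -> R (r x n)
    row_p = np.sum(A**2, axis=1) / total
    rows = rng.choice(A.shape[0], size=r, p=row_p)
    R = A[rows, :] / np.sqrt(r * row_p[rows])[:, None]

    # Steps 3-4: rows of R^T matching the sampled columns (same rescaling as C),
    # right-multiplied by the (pseudo-)inverse of R R^T  -> U (s x r)
    W = np.linalg.pinv(R @ R.T)
    U = (R.T[cols, :] / col_scale[:, None]) @ W
    return C, U, R

rng = np.random.default_rng(3)
A = rng.standard_normal((200, 30)) @ rng.standard_normal((30, 300))  # roughly low-rank test matrix
C, U, R = cur_decomposition(A, s=60, r=40, rng=rng)
print(np.linalg.norm(A - C @ U @ R, 2) / np.linalg.norm(A, 2))       # relative spectral-norm error
```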



Copyright notice: this article is an original work by w112348, released under the CC 4.0 BY-SA license; when reposting, please include a link to the original source and this notice.