Lately, in the course of writing my master's thesis, I've been participating in the development of a new method for solving the famous *multi-frame* or, as signal processors say, *multi-channel blind deconvolution* problem (MCBD). Its challenge is to recover the underlying signal from a sequence of its blurry and noisy observations. A well-known approach (see [1]) to the *denoising* of the observations is the recovery of the so-called *signal subspace* from the range of the observations' matrix. The identification of such low-rank embeddings of data samples is often addressed through their *singular value decompositions* (SVD), which also have many applications beyond the world of image and signal processing – e.g. PCA for video compression.

“The ability to factor matrices into triangular, diagonal, orthogonal, and other special forms has turned out to be extremely useful.”

– Barry A. Cipra, 'The Best of the 20th Century: Editors Name a Top 10 List of Algorithms'

While working on refinements of the deconvolution algorithm's implementation, I came across the necessity of quickly computing the SVD of huge matrices – matrices too large to be kept in system memory completely. Then I stumbled over [2]: The authors proposed a randomized SVD algorithm, claiming it to be incredibly fast, particularly in that use case. Seeing that there wasn't a Python implementation available yet, I decided to write my own. It's here where I'm sharing it.

See this notebook for details.

## Multi-pass vs. Single-pass Techniques

As the data matrix becomes very large, we decide to store it on disk. This is easily done through NumPy's `memmap` class, which supports the same interface as the `ndarray` class. This allows us to apply arbitrary SVD algorithms to it, like `linalg.svd` and in particular `randomized_svd` from `sklearn`: Whereas `linalg.svd` demands loading the whole matrix into memory at once, the latter only needs to compute `dot` products that involve the matrix. Thus, for very large matrices, we will still be able to use `randomized_svd`, as long as we inject a dedicated `dot` product implementation which prevents the whole matrix from being loaded into memory at once – an implementation that processes the matrix block-wise.
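Such a block-wise `dot` product can be sketched in a few lines of NumPy. The function name and the block size below are illustrative, not the exact implementation used for the benchmarks:

```python
import numpy as np

def blockwise_dot(A, B, block_rows=1024):
    """Compute A @ B without materializing all of A in memory.

    A may be a numpy.memmap; only `block_rows` rows of it are
    touched (and hence paged in) per iteration.
    """
    out = np.empty((A.shape[0], B.shape[1]))
    for start in range(0, A.shape[0], block_rows):
        stop = min(start + block_rows, A.shape[0])
        # np.asarray pulls just this slice of the memmap into RAM
        out[start:stop] = np.asarray(A[start:stop]) @ B
    return out
```

Injected in place of the default `dot`, this keeps the peak memory footprint proportional to the block size rather than to the full matrix.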

The motivation for dealing with single-pass decomposition techniques is as follows: As the matrices become too large to be stored in memory, access to slower storage becomes the computational bottleneck. However, the single-pass SVD has even more advantages:

- We do not require the data samples to be arranged in any specific manner, like a matrix structure. It's perfectly okay for them to be generated on the fly, downloaded over the web, or read in from distinct files.
- As a matter of course, we can start the decomposition as soon as the first observation is available. There is no need to wait for the last one to arrive. Only the dimensions of the data matrix must be known from the start.

Such a single-pass technique was proposed by Halko et al. in [2]. The `randomized_svd` routine from `sklearn` was also originally proposed by the authors of this paper.

## Implementation

For the sake of computational speed, we implement a particular subroutine natively. This is only fair, since the default `dot` implementation is not only implemented natively too, but also heavily optimized 🙂

```python
from scipy import weave

WEAVE_DEFAULT_KWARGS = {
    'extra_compile_args': ['-Wno-cpp', '-std=c++11', '-Ofast']
}
```

Note that we use the `-Ofast` compiler switch here, which may cause unexpected numerical results. However, experiments have shown that the optimization flags do not affect the benchmarks' results at all, so you may just leave the `-Ofast` flag out if you feel bad about it.

The subroutine `single_pass_dot` below computes the outer product of vectors `a` and `b` and the inner product of `a` and matrix `C` simultaneously:

```python
src_single_pass_dot = """
for( std::size_t k = 0; k < m; ++k )
{
    out2[ k ] = 0;
}
for( std::size_t i = 0; i < n; ++i )
{
    const auto ai = a[ i ];
    for( std::size_t k = 0; k < m; ++k )
    {
        out1[ i * m + k ] += ai * b[ k ];
        out2[ k ] += ai * C[ m * i + k ];
    }
}
"""

def single_pass_dot(a, b, C, out1, out2):
    if a.ndim > 1 or b.ndim > 1:
        raise ValueError('`a` and `b` must be vectorial')
    n, m = len(a), len(b)
    weave.inline(src_single_pass_dot,
                 ['a', 'b', 'C', 'out1', 'out2', 'n', 'm'],
                 **WEAVE_DEFAULT_KWARGS)
```
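For readers without a working `scipy.weave` (it was removed from SciPy after the 0.x series), the same update can be written in pure NumPy. It is slower, but shows exactly what the native code computes; `single_pass_dot_np` is my name for this sketch:

```python
import numpy as np

def single_pass_dot_np(a, b, C, out1, out2):
    """Pure-NumPy equivalent of the native `single_pass_dot`:
    accumulates the outer product of `a` and `b` into `out1`
    and writes the inner product of `a` and `C` into `out2`."""
    out1 += np.outer(a, b)   # out1[i, k] += a[i] * b[k]
    out2[:] = a @ C          # out2[k] = sum_i a[i] * C[i, k]
```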

Let me summarize the base algorithm by Halko et al. [2] for decomposing a matrix `A`:

### Excursus: the single-pass SVD algorithm

1. Generate matrices `O1` and `O2` of isotropic Gaussian samples.
2. Compute `Z1 = A·O1` and `Z2 = Aᵀ·O2` in a single pass over `A`. The idea is that `Z1` captures the structure of `A`'s column space by probing it with the isotropic `O1`: `A` squeezes the "spherical" `O1` into a "cigar" `Z1`, whose proportions reflect the scattering of the samples. Analogously, `Z2` captures the structure of `A`'s row space.
3. Compute QR factorizations `Z1 = Q1·R1` and `Z2 = Q2·R2`.
4. Find the matrix `B`, which minimizes the least-squares residuals of the linear systems `Q1ᵀ·Z1 ≈ B·(Q2ᵀ·O1)` and `Q2ᵀ·Z2 ≈ Bᵀ·(Q1ᵀ·O2)`. Since `A ≈ Q1·B·Q2ᵀ`, this means that we fit `B` s.t. it adapts the shape of `A` in the coordinates of `Q1` and `Q2`.
5. Applying the SVD to `B` yields the diagonal matrix `S`, which carries the singular values of `A`. Finally, compute `U = Q1·V`, whose columns turn out to be the eigenvectors of `A·Aᵀ` – the left singular vectors of `A`.

In this algorithm, step 4 is error-prone due to numerical instabilities. To overcome this, slight modifications (referred to as *oversampling*), suggested by the authors of the paper, are added to the implementation. A disadvantage of these modifications is that they require us to replace the QR decompositions in step 3 by two further SVDs, which are more expensive. Fortunately, the `randomized_svd` routine is almost as fast as `linalg.qr`.

Here goes the actual implementation. It takes the generator `x_gen` to iterate over the distinct columns of the matrix that the SVD is to be computed for:

```python
from numpy import copyto, empty, linalg, random, zeros
from sklearn.utils.extmath import randomized_svd


class SVD:

    def __init__(self, x_gen, m, n, k, oversample=10):
        """Computes SVD of `(m, n)` shaped matrix with columns `x_gen`
        and maximum rank `k`.
        """
        self.task = {'x_gen': x_gen, 'm': m, 'n': n, 'k': k,
                     'oversample': oversample}

    @property
    def U(self):
        return self.compute().U

    @property
    def S(self):
        return self.compute().S

    def compute(self):
        if hasattr(self, 'task'):
            O1, O2 = self.get_omega(**self.task)
            Z1, Z2 = self.data_dot(O1, O2, **self.task)
            Q1, Q2 = self.find_ranges(Z1, Z2, **self.task)
            B = self.approx_B(O1, O2, Z1, Z2, Q1, Q2, **self.task)
            V, self.S = self.small_svd(B, **self.task)
            self.U = Q1.dot(V)
            del self.task
        return self

    def get_omega(self, m, n, k, oversample, **kwargs):
        rs = random.RandomState(0)
        O1 = rs.normal(size=(n, k + oversample))
        O2 = rs.normal(size=(m, k + oversample))
        return (O1, O2)

    def data_dot(self, O1, O2, x_gen, **kwargs):
        Z1 = zeros((O2.shape[0], O1.shape[1]))
        Z2 = empty((O1.shape[0], O2.shape[1]))
        x = empty((O2.shape[0],))
        for i, x_view in enumerate(x_gen):
            copyto(x, x_view)
            single_pass_dot(x, O1[i, :], O2, out1=Z1, out2=Z2[i, :])
        return (Z1, Z2)

    def find_ranges(self, Z1, Z2, k, **kwargs):
        Q1 = randomized_svd(Z1, k, n_iter=0)[0]
        Q2 = randomized_svd(Z2, k, n_iter=0)[0]
        return (Q1, Q2)

    def approx_B(self, O1, O2, Z1, Z2, Q1, Q2, **kwargs):
        B1 = linalg.lstsq(O1.T.dot(Q2), Z1.T.dot(Q1))[0].T
        B2 = linalg.lstsq(O2.T.dot(Q1), Z2.T.dot(Q2))[0]
        return (B1 + B2) / 2

    def small_svd(self, B, **kwargs):
        return randomized_svd(B, B.shape[1], n_iter=0)[:2]
```
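To make the excursus above concrete without the `weave` dependency, here is a self-contained NumPy-only sketch of the base algorithm. It uses plain QR range finders (i.e. without the oversampling modifications discussed above), and `single_pass_svd` is an illustrative name of mine, not the class above:

```python
import numpy as np
from numpy.linalg import lstsq, qr, svd

def single_pass_svd(columns, m, n, k, oversample=10, seed=0):
    """Illustrative single-pass SVD of an (m, n) matrix whose
    columns arrive one by one from the iterable `columns`."""
    rs = np.random.RandomState(seed)
    O1 = rs.normal(size=(n, k + oversample))   # probes the column space
    O2 = rs.normal(size=(m, k + oversample))   # probes the row space
    Z1 = np.zeros((m, k + oversample))         # accumulates A @ O1
    Z2 = np.empty((n, k + oversample))         # becomes A.T @ O2
    for i, x in enumerate(columns):            # the single pass over A
        Z1 += np.outer(x, O1[i, :])
        Z2[i, :] = x @ O2
    Q1 = qr(Z1)[0][:, :k]                      # orthonormal range bases
    Q2 = qr(Z2)[0][:, :k]
    # fit B s.t. A ≈ Q1 @ B @ Q2.T, via both probe equations
    B1 = lstsq(O1.T @ Q2, Z1.T @ Q1, rcond=None)[0].T
    B2 = lstsq(O2.T @ Q1, Z2.T @ Q2, rcond=None)[0]
    V, S, _ = svd((B1 + B2) / 2)
    return Q1 @ V, S       # left singular vectors, singular values
```

For a matrix of exact rank `k`, `Z1` and `Z2` span its column and row spaces exactly, so the returned singular values match those of the full SVD up to numerical precision; for noisy data, the oversampled, SVD-based range finders of the class above are the more robust choice.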

## Evaluation

In the context of the MCBD problem, modeling each observation as the *convolution* of the sharp signal with some blurring filter has proven to be reasonable. This is how the data matrix was generated for which the following results were computed. Again, see this notebook for details.

### Accuracy

We start with a small data matrix, so we can use `linalg.svd` to verify the accuracy of the computed singular values:

The higher we choose the `oversample` parameter, the more accurate the results become.

Analogously, we verify the computed eigenvectors w.r.t. the `linalg.svd` routine:

It is okay that the trailing eigenvectors are disordered, as they only span the noise space. They do not belong to the low-rank embedding that we seek to find: this embedding is spanned by the leading eigenvectors alone, since the FIR filters were chosen accordingly.

### Benchmarks

The following results were all achieved with `oversample` set to 10. For the application of `randomized_svd`, the block-wise `dot` product was configured s.t. it considered the data matrix in blocks of bounded size at a time:

As we see from these charts, `randomized_svd` always outperforms the `linalg.svd` routine. The single-pass SVD isn't always faster than `randomized_svd`; but it is when the computation time becomes perceptible:

| | ✕ 500 data matrix | | | | ✕ 4000 data matrix | | | |
|---|---|---|---|---|---|---|---|---|
| `randomized_svd` | 0.011 | 0.083 | 1.007 | 23.523 | 0.076 | 0.577 | 6.977 | 182.782 |
| single-pass SVD | 0.033 | 0.112 | 1.000 | 11.481 | 0.212 | 0.689 | 4.809 | 63.131 |
| `linalg.svd` | 0.016 | 1.858 | 14.314 | – | 0.117 | 23.534 | – | – |

All times in seconds.

Whereas `randomized_svd` required 183 seconds to terminate for the largest data matrix, the single-pass SVD accomplished its task in only 63 seconds. **This is a speed-up by roughly a factor of three.**

## References

1. Eric Moulines, Pierre Duhamel, Jean-Francois Cardoso, and Sylvie Mayrargue. Subspace methods for the blind identification of multichannel FIR filters. *IEEE Transactions on Signal Processing*, 43(2):516–525, 1995.
2. Nathan Halko, Per-Gunnar Martinsson, and Joel A. Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. *SIAM Review*, 53(2):217–288, 2011.
