Streaming and Distributed Algorithms for Robust Column Subset Selection

Jiang, Shuli; Li, Dongyu; Li, Irene Mengze; Mahankali, Arvind V.; Woodruff, David P.

arXiv:2107.07657 (cs)

[Submitted on 16 Jul 2021]

Abstract: We give the first single-pass streaming algorithm for Column Subset Selection with respect to the entrywise $\ell_p$-norm with $1 \leq p < 2$. We study the $\ell_p$-norm loss since it is often considered more robust to noise than the standard Frobenius norm. Given an input matrix $A \in \mathbb{R}^{d \times n}$ ($n \gg d$), our algorithm achieves a multiplicative $k^{\frac{1}{p} - \frac{1}{2}}\text{poly}(\log nd)$-approximation to the error with respect to the best possible column subset of size $k$. Furthermore, the space complexity of the streaming algorithm is optimal up to a logarithmic factor. Our streaming algorithm also extends naturally to a 1-round distributed protocol with nearly optimal communication cost. A key ingredient in our algorithms is a reduction to column subset selection in the $\ell_{p,2}$-norm, which corresponds to the $p$-norm of the vector of Euclidean norms of each of the columns of $A$. This enables us to leverage strong coreset constructions for the Euclidean norm, which previously had not been applied in this context. We also give the first provable guarantees for greedy column subset selection in the $\ell_{1,2}$-norm, which can be used as an alternative, practical subroutine in our algorithms. Finally, we show that our algorithms give significant practical advantages on real-world data analysis tasks.
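
To make the two norms in the abstract concrete, the following is a minimal, dependency-free Python sketch. It is not the authors' implementation: the function names (`entrywise_lp_norm`, `lp2_norm`, `greedy_css_l12`), the list-of-columns matrix layout, and the `tol` threshold are all assumptions made for illustration. The greedy routine is a naive version of greedy column selection under the $\ell_{1,2}$ error, implemented with Gram-Schmidt deflation; it does not include the streaming, coreset, or distributed components described above.

```python
import math

def entrywise_lp_norm(cols, p):
    """Entrywise l_p norm: (sum over all entries of |a_ij|^p)^(1/p)."""
    return sum(abs(x) ** p for c in cols for x in c) ** (1.0 / p)

def lp2_norm(cols, p):
    """l_{p,2} norm: the p-norm of the vector of Euclidean column norms."""
    return sum(math.sqrt(sum(x * x for x in c)) ** p for c in cols) ** (1.0 / p)

def _deflate(cols, q):
    """Remove from every column its component along the unit vector q."""
    out = []
    for c in cols:
        dot = sum(a * b for a, b in zip(c, q))
        out.append([a - dot * b for a, b in zip(c, q)])
    return out

def greedy_css_l12(cols, k, tol=1e-12):
    """Naive greedy selection (illustrative only).

    At each step, pick the column whose addition to the selected set gives
    the smallest l_{1,2} norm of the residual, where the residual is each
    column minus its Euclidean projection onto the span of the selection.
    Returns (selected column indices, final l_{1,2} residual error).
    """
    residual = [list(c) for c in cols]
    chosen = []
    for _ in range(k):
        best = None  # (error, column index, deflated residual)
        for j, r in enumerate(residual):
            nrm = math.sqrt(sum(x * x for x in r))
            if j in chosen or nrm <= tol:
                continue  # already selected, or already in the span
            q = [x / nrm for x in r]
            cand = _deflate(residual, q)
            err = lp2_norm(cand, 1)
            if best is None or err < best[0]:
                best = (err, j, cand)
        if best is None:
            break  # every column is already (numerically) in the span
        chosen.append(best[1])
        residual = best[2]
    return chosen, lp2_norm(residual, 1)

# A 2 x 4 matrix stored as a list of columns (hypothetical toy data).
A = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [2.0, 0.0]]
selected, error = greedy_css_l12(A, 1)
```

Projecting each column in the Euclidean sense is the right inner step here because the $\ell_{p,2}$ error is a $p$-norm of per-column Euclidean residuals, so each column's residual can be minimized independently by least squares. The sketch costs roughly $O(k n^2 d)$ time and is meant for reading, not for large inputs.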

Comments:	Proceedings of the 38th International Conference on Machine Learning (ICML 2021)
Subjects:	Data Structures and Algorithms (cs.DS)
Cite as:	arXiv:2107.07657 [cs.DS]
	(or arXiv:2107.07657v1 [cs.DS] for this version)
	https://doi.org/10.48550/arXiv.2107.07657

Submission history

From: Shuli Jiang
[v1] Fri, 16 Jul 2021 01:05:08 UTC (426 KB)
