Boston-area Data Privacy

A web site for a Boston-area group of researchers working on data privacy.

Below is the schedule of previous and upcoming Boston-area data privacy seminars. Join the mailing list and Google calendar for more information, including the Zoom meeting links.

=== Upcoming Talks ===

Friday, April 9 at 11-12:30 ET
Hypothesis Selection with Privacy
Speaker: Gautam Kamath, University of Waterloo
Abstract: The Scheffe estimator is a classic and celebrated statistical tool, which provides a sample-efficient method for selecting the distribution from a set of hypotheses which best matches a dataset. It can be extended to the private setting, enabling near-optimal cover-based upper bounds, which tightly complement packing-based lower bounds. I will discuss applications of this method to distribution estimation and beyond, in both the central and local setting. Based on several related works, with Ishaq Aden-Ali, Hassan Ashtiani, Mark Bun, Sivakanth Gopi, Janardhan Kulkarni, Aleksandar Nikolov, Vikrant Singhal, Thomas Steinke, Jonathan Ullman, Zhiwei Steven Wu, and Huanyu Zhang.
Arxiv links: https://arxiv.org/abs/1905.13229, https://arxiv.org/abs/2002.09465, https://arxiv.org/abs/2002.09464, https://arxiv.org/abs/2010.09929

=== Previous Talks ===

Friday, April 2 at 11-12:30 ET
What Is The Sample Complexity of Differentially Private Learning?
Speaker: Shay Moran, Technion
View recording here
Abstract: The increase in machine learning applications which involve private and personal data highlights the need for algorithms that handle the data *responsibly*. While this need has been successfully addressed by the field of differentially private machine learning, the cost of privacy remains poorly understood: How much data is needed for differentially private learning? How much more data does private learning require compared to learning without privacy constraints? We will survey some of the recent progress towards answering these questions in the distribution-free PAC model, including the Littlestone-dimension-based *qualitative* characterization and the relationship with online learning. If time allows, we will also discuss this question in more general (distribution- and data-dependent) learning models.

Friday, March 26 at 11-12:30 ET
Security and Privacy Guarantees in Machine Learning with Differential Privacy
Speaker: Roxana Geambasu, Columbia University
View recording here
Abstract: Machine learning (ML) is driving many of our applications and life-changing decisions. Yet, it is often brittle and unstable, making decisions that are hard to understand or can be exploited. Tiny changes to an input can cause dramatic changes in predictions; this results in decisions that surprise, appear unfair, or enable attack vectors such as adversarial examples. Moreover, models trained on users' data can encode not only general trends from large datasets but also very specific, personal information from these datasets; this threatens to expose users' secrets through ML models or predictions. This talk positions differential privacy (DP) -- a rigorous privacy theory -- as a versatile foundation for building into ML much-needed guarantees of security, stability, and privacy. I first present PixelDP (S&P'19), a scalable certified defense against adversarial example attacks that leverages DP theory to guarantee a level of robustness against these attacks. I then present Sage (SOSP'19), a DP ML platform that bounds the cumulative leakage of secrets through models while addressing some of the most pressing challenges of DP, such as running out of privacy budget and the privacy-accuracy tradeoff. PixelDP and Sage are designed from a pragmatic, systems perspective and illustrate that DP theory is powerful but requires adaptation to achieve practical guarantees for ML workloads.

Friday, March 19 at 11-12:30 ET
On Distributed Differential Privacy and Counting Distinct Elements
Speaker: Lijie Chen, MIT
View recording here
Abstract: We study the setup where each of n users holds an element from a discrete set, and the goal is to count the number of distinct elements across all users, under the constraint of (eps, delta)-differential privacy:
(1) In the non-interactive local setting, we prove that the (additive) error of any protocol is Omega(n) for any constant eps and for any delta inverse polynomial in n.
(2) In the single-message shuffle setting, we prove a lower bound of n/polylog(n) on the error for any constant eps and for some delta inverse quasi-polynomial in n. We do so by building on the moment-matching method from the literature on distribution estimation.
(3) In the multi-message shuffle setting, we give a protocol with at most one message per user in expectation and with an error of sqrt{n} polylog(n) for any constant eps and for any delta inverse polynomial in n. Our protocol is also robustly shuffle private, and our error of sqrt{n} matches a known lower bound for such protocols.
Our proof technique relies on a new notion, which we call dominated protocols, that can also be used to obtain the first non-trivial lower bounds against multi-message shuffle protocols for the well-studied problems of selection and learning parity. Our lower bound for estimating the number of distinct elements provides the first omega(sqrt{n}) separation between global sensitivity and error in local differential privacy, thus answering an open question of Vadhan (2017). We also provide a simple construction that gives n/polylog(n) separation between global sensitivity and error in two-party differential privacy, thereby answering an open question of McGregor et al. (2011). This is joint work with Badih Ghazi, Ravi Kumar, and Pasin Manurangsi from Google Research.

Friday, March 12 at 11-12:30 ET
Algorithmic Challenges in Efficient Training of Private (Deep) Language Models
Speaker: Janardhan Kulkarni, Microsoft Research
View recording here
Abstract: Many attacks have shown that deep learning models trained on private data of users can leak sensitive information of the users. Differential Privacy is a provable way to prevent such attacks. However, training deep learning models using DP introduces several new challenges both in terms of privacy vs accuracy tradeoffs and in the resource cost of the process. In this talk, I will highlight some of the problems we encountered, our solutions for resolving them and mention many important open problems.

Friday, March 5 at 11-12:30 ET
Hiding Among the Clones: A Simple and Nearly Optimal Analysis of Privacy Amplification by Shuffling
Speaker: Audra McMillan, Apple
View recording here
Abstract: Recent work of Erlingsson, Feldman, Mironov, Raghunathan, Talwar, and Thakurta [EFMRTT19] demonstrates that random shuffling amplifies differential privacy guarantees of locally randomized data. Such amplification implies substantially stronger privacy guarantees for systems in which data is contributed anonymously [BEMMRLRKTS17] and has led to significant interest in the shuffle model of privacy [CSUZZ19, EFMRTT19]. In this talk, we will discuss a new result on privacy amplification by shuffling, which achieves the asymptotically optimal dependence in the local privacy parameter. Our result is based on a new proof strategy which is simpler than previous approaches, and extends to approximate differential privacy with nearly the same guarantees. We'll discuss this proof strategy, the extension to approximate differential privacy, and time permitting, some of the implications of this result.

Friday, February 26 at 11-12:30 ET
Sample-efficient proper PAC learning with approximate differential privacy
Speaker: Noah Golowich, MIT
View recording here
Abstract: An exciting recent development in the theory of differentially private machine learning is the connection between private learning and online learning, and in particular the result that a binary hypothesis class is privately learnable if and only if it is online learnable (i.e., has finite Littlestone dimension). In this talk I will discuss our work strengthening various aspects of the result that online learning implies private learning: first, we show that the sample complexity of properly learning a class of Littlestone dimension d with approximate differential privacy is Õ(d^6), ignoring privacy and accuracy parameters. This result answers a question of Bun et al. (2020) by improving upon their upper bound of 2^O(d) on the sample complexity. Prior to our work, finiteness of the sample complexity for privately learning a class of finite Littlestone dimension was only known for improper private learners, and the fact that our learner is proper answers another question of Bun et al., which was also asked by Bousquet et al. (2020). Using machinery developed by Bousquet et al., we also show that the sample complexity of sanitizing a binary hypothesis class is at most polynomial in its Littlestone dimension and dual Littlestone dimension. This implies that a class is sanitizable if and only if it has finite Littlestone dimension. An important ingredient of our proofs is a new property of binary hypothesis classes that we call irreducibility, which may be of independent interest.

Friday, February 19 at 11-12:30 ET
Leveraging Heuristics for Private Synthetic Data Release
Speaker: Steven Wu, CMU
View recording here
Abstract: This talk will focus on differentially private synthetic data---a privatized version of the dataset that consists of fake data records and that approximates the real dataset on important statistical properties of interest. I will present our recent results on private synthetic data that leverage practical optimization heuristics to circumvent the computational bottleneck in existing work. Our techniques are motivated by a modular, game-theoretic framework, which can flexibly work with methods such as integer program solvers and deep generative models.

Friday, February 12 at 11-12:30 ET
Towards Good Statistical Inference from Differentially Private Data
Speaker: Ruobin Gong, Rutgers University
View recording here
Abstract: Differential privacy (DP) brings provability and transparency to statistical disclosure limitation. When data users migrate their analysis to private data, there is no guarantee that a statistical model, otherwise good for non-private data, will still produce trustworthy conclusions. This talk contemplates two challenges faced by data users to draw good statistical inference from private data releases. When the DP mechanism is transparent, I discuss how approximate computation techniques (Monte Carlo EM, approximate Bayesian computation) can be systematically adapted to produce exact inference with respect to the joint specification of the intended model and the DP mechanism. In the presence of mandated invariants which the data curator must observe, I advocate for the congenial design of the DP mechanism via standard probabilistic conditioning on the invariant margins, as an alternative to optimization-based post-processing. This proposal preserves both the privacy guarantee of the output and its statistical intelligibility. A demonstration of restricted contingency table privatization is performed via a Markov chain algorithm.

Friday, February 5 at 11-12:30 ET
Private Mean Estimation of Heavy-Tailed Distributions
Speaker: Vikrant Singhal
Abstract: We give new upper and lower bounds on the minimax sample complexity of differentially private mean estimation of distributions with bounded $k$-th moments. Roughly speaking, in the univariate case, we show that
$$n = \Theta\left(\frac{1}{\alpha^2} + \frac{1}{\alpha^{\frac{k}{k-1}}\varepsilon}\right)$$
samples are necessary and sufficient to estimate the mean to $\alpha$-accuracy under $\varepsilon$-differential privacy, or any of its common relaxations. This result demonstrates a qualitatively different behavior compared to estimation absent privacy constraints, for which the sample complexity is identical for all $k \geq 2$. We also give algorithms for the multivariate setting whose sample complexity is a factor of $O(d)$ larger than the univariate case.

Monday, January 25 at 3PM ET
Local Differential Privacy is Equivalent to the Contraction of Hockey-Stick Divergence
Abstract: In this talk, we first show that approximate local differential privacy (LDP) can be equivalently expressed in terms of the contraction coefficient of “Hockey-Stick Divergence.” This result then enables us to relate the LDP guarantees of randomized mechanisms to contraction properties of arbitrary f-divergences. This is in fact a generalization (and improvement) of the main result in [Duchi, Jordan and Wainwright, FOCS’13] that led to information-theoretic lower bounds for private minimax estimation problems only in the high privacy regime (i.e., epsilon < 1 and delta = 0). Our result allows us to drop the high-privacy assumption and obtain lower bounds for any epsilon and delta. Time permitting, I will also discuss some implications for private Bayesian estimation problems. This is work in progress, based on a collaboration with Maryam Aliakbarpour (UMass) and Flavio Calmon (Harvard).
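
For readers unfamiliar with the hockey-stick divergence mentioned in the last abstract, the following is a minimal sketch of its standard definition for discrete distributions, E_gamma(P || Q) = sum_x max(P(x) - gamma * Q(x), 0), together with the well-known fact that a local randomizer is (epsilon, delta)-LDP exactly when E_{e^epsilon} between the output distributions of any two inputs is at most delta. It does not reproduce the contraction-coefficient result of the talk. The function names and the choice of binary randomized response as the example mechanism are assumptions made for illustration only.

```python
import math


def hockey_stick(p, q, gamma):
    """E_gamma(P || Q) = sum_x max(P(x) - gamma * Q(x), 0) for discrete P and Q."""
    return sum(max(pi - gamma * qi, 0.0) for pi, qi in zip(p, q))


def randomized_response(eps):
    """Binary randomized response (illustrative mechanism, not from the talk).

    Row i is the output distribution when the user's true bit is i:
    the true bit is reported with probability e^eps / (1 + e^eps).
    """
    keep = math.exp(eps) / (1.0 + math.exp(eps))
    return [[keep, 1.0 - keep], [1.0 - keep, keep]]


def ldp_delta(mechanism, eps):
    """Smallest delta for which the mechanism is (eps, delta)-LDP.

    This is the largest hockey-stick divergence of order e^eps over all
    ordered pairs of inputs.
    """
    gamma = math.exp(eps)
    return max(hockey_stick(p, q, gamma) for p in mechanism for q in mechanism)


rr = randomized_response(1.0)
delta_at_1 = ldp_delta(rr, 1.0)      # essentially 0: the mechanism is pure 1-LDP
delta_at_half = ldp_delta(rr, 0.5)   # positive: asking for eps = 0.5 costs some delta
print(delta_at_1, delta_at_half)
```

In this sketch, evaluating the same mechanism at a smaller target epsilon yields a strictly positive delta, which is the trade-off between epsilon and delta that the approximate-LDP regime in the abstract refers to.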