Boston-Area Data Privacy

A web site for a Boston-area group of researchers working on data privacy.

Below is the schedule of previous and upcoming Boston-area data privacy seminars. Join the mailing list and Google calendar for more information, including the Zoom meeting links.

=== Fall 2023 ===
Friday, December 8th, 3-4:30 pm ET On Provable Copyright Protection for Generative Models

Speaker: Nikhil Vyas, Harvard

Abstract: There is a growing concern that learned conditional generative models may output samples that are substantially similar to some copyrighted data C that was in their training set. We give a formal definition of near access-freeness (NAF) and prove bounds on the probability that a model satisfying this definition outputs a sample similar to C, even if C is included in its training set. Roughly speaking, a generative model p is k-NAF if for every potentially copyrighted data C, the output of p diverges by at most k bits from the output of a model q that did not access C at all. We also give generative model learning algorithms, which efficiently modify the original generative model learning algorithm in a black-box manner, that output generative models with strong bounds on the probability of sampling protected content. Furthermore, we provide promising experiments showing minimal degradation in output quality while ensuring strong protections against sampling protected content. Joint work with Sham Kakade and Boaz Barak (https://arxiv.org/abs/2302.10870).
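
For readers who want the definition in symbols, here is a minimal formalization of the verbal description above, with the divergence instantiated as the max-divergence (one of the choices considered; treat this as a reading aid rather than the paper's exact statement): $$p \text{ is } k\text{-NAF} \iff \forall C,\ \forall \text{ prompts } x:\ D_{\max}\!\left(p(\cdot \mid x)\,\|\,q_C(\cdot \mid x)\right) \le k, \qquad D_{\max}(P\|Q) = \sup_{S} \log_2 \frac{P(S)}{Q(S)},$$ where $q_C$ denotes a "safe" model trained without access to C. Under this instantiation, for any set S of outputs substantially similar to C, $\Pr[p(\cdot \mid x) \in S] \le 2^{k}\,\Pr[q_C(\cdot \mid x) \in S]$, and the latter probability is small because $q_C$ never saw C.
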
Friday, November 10th, 3-4:30 pm ET Triangle Counting under Local Differential Privacy: Algorithms and Techniques

Speaker: Jacob Imola, UCSD

Abstract: In many machine learning applications, users hold data on a local device and communicate with an untrusted central analyst. Local differential privacy is the state-of-the-art way to ensure that all data sent by a user will be protected regardless of who eventually sees the data. In this talk, we will focus on computing the number of triangles in a graph, a simple yet nontrivial query with real-world importance, under local DP. We will outline several private algorithms which improve upon each other and illustrate new ideas in this area. First, we will propose an algorithm with a strong error guarantee that uses two rounds of user interaction. Unfortunately, it may have a prohibitively large per-user download cost. Second, we will address this by using edge sampling, along with a selective download, to create an algorithm which smoothly trades off between download cost and error. Third, we will propose an algorithm which, using just a single round of interaction, privately counts triangles in the shuffled model of DP. The algorithm uses a novel edge sampling scheme and is also able to compute 4-cycles with low error. Finally, we will describe work on lower bounds for the problem, and further questions towards generalizing these bounds and algorithms.
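
As a concrete illustration of the basic building block behind these algorithms, here is a minimal sketch of randomized response on adjacency bits together with the standard debiasing step. The function names are hypothetical and the snippet is not the speaker's code, which adds the interaction rounds, edge sampling, and download-cost optimizations described above.

    import numpy as np

    def randomize_adjacency(adj_row, eps, rng):
        # Warner's randomized response on each adjacency bit: keep the true
        # bit with probability e^eps / (1 + e^eps), flip it otherwise.
        p_keep = np.exp(eps) / (1.0 + np.exp(eps))
        flip = rng.random(adj_row.shape) >= p_keep
        return np.where(flip, 1 - adj_row, adj_row)

    def debias(noisy_bits, eps):
        # Unbiased estimate of the true bits: E[noisy] = (2p - 1) * true + (1 - p).
        p_keep = np.exp(eps) / (1.0 + np.exp(eps))
        return (noisy_bits - (1.0 - p_keep)) / (2.0 * p_keep - 1.0)

Roughly speaking, summing products of debiased bits over vertex triples gives an unbiased but high-variance triangle estimate (each bit is randomized independently); the algorithms in the talk reduce this variance using a second interaction round, edge sampling, and shuffling.
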
Friday, October 27th, 3-4:30 pm ET Characterizing Minimax Generalization Error using Information-Theoretic Measures

Speaker: Mahdi Haghifam, Northeastern University

Abstract: Generalization error measures the extent to which a learning algorithm overfits to the training data. Tight bounds on generalization error are crucial for understanding and improving machine learning algorithms. In this talk, we provide an overview of recent research on bounding generalization error using information-theoretic measures of dependence between the training set and the output of a learning algorithm. Then, we answer the following question: For which learning problems are information-theoretic bounds expressive enough to estimate the optimal minimax performance? We will discuss the expressive power of information-theoretic generalization bounds by studying the connections between information-theoretic frameworks for generalization and classical minimax frameworks such as VC theory and uniform stability. This talk draws upon insights from the following works: arXiv:2212.13556, arXiv:2206.14800, and arXiv:2111.05275.
Friday, October 13th, 3-4:30 pm ET Learning Mixtures of Unrestricted Gaussians Privately

Speaker: Hassan Ashtiani, McMaster University

Abstract: We will talk about private parameter learning (https://arxiv.org/abs/2303.04288) and density estimation (https://arxiv.org/abs/2309.03847) for GMMs. In the parameter learning setting, we will show a polynomial-time (and polynomial-sample) reduction from the private problem to its non-private counterpart. For the density estimation setting, we show the first polynomial sample complexity upper bound. Through these examples, we will see some of the generic challenges of private distribution learning, and some techniques to tackle them.
=== Fall 2022 ===
Friday, September 16 at 10:30-11:45 am ET What Impacts (Empirical) Privacy Risk?

Speaker: Matthew Jagielski, Google Brain

Abstract: In this talk, I will discuss a number of factors that influence the success of privacy attacks on trained machine learning models. I'll start with a recent line of work on extracting training data from large language models: we'll see that data duplication and model size have a large impact on this memorization. Then, I'll turn to membership inference and discuss recent work showing that adding or removing examples from a training set can have a large impact on membership inference accuracy for other examples. We'll also see that *when* a point is used in training matters for membership inference attacks, too, in settings where models are updated repeatedly or where training sets are extremely large. This talk is based on joint works with many collaborators.
Friday, September 30 at 10:30-11:45 am ET Score Attack: A Lower Bound Technique for Differentially Private Learning

Speaker: Linjun Zhang, Rutgers University

Abstract: Privacy-preserving techniques have become a central focus of modern data analysis. In this talk, we propose a general technique, the score attack, for lower bounding the differential-privacy-constrained minimax risk of parameter estimation. Inspired by the tracing attack idea in differential privacy, the score attack method is applicable to any statistical model with a well-defined score statistic and is capable of optimally lower bounding the minimax risk of estimating unknown model parameters when the estimator is required to be differentially private. The effectiveness of this general method is demonstrated in a variety of examples: the generalized linear model in classical and high-dimensional sparse settings, the Bradley-Terry-Luce model for pairwise comparisons, and non-parametric function estimation in the Sobolev class.
Friday, October 14 at 10:30-11:45 am ET Policy impacts of statistical uncertainty and privacy

Speaker: Ryan Steed, Carnegie Mellon University

Abstract: The introduction of differential privacy to the 2020 Census---accomplished by adding random noise to statistics to preserve respondents' privacy---kindled an ongoing debate over privacy-utility trade-offs in federal statistics. We examine an overlooked aspect of this debate: the role of underlying statistical uncertainty. Simulating the effects of data error and noise injected for privacy, we find that the Title I education funding formula distributes error unequally across demographic groups---though existing data error impacts funding decisions much more than our privacy mechanism. We suggest simple policy changes to reduce disparities and compensate for additional privacy protections. Our results point the way towards reforms that would make evidence-based policy both more beneficial to marginalized communities and more privacy-protecting.
Friday, October 28 at 10:30-11:45 am ET Private Query Release via the Johnson-Lindenstrauss Transform

Speaker: Aleksandar (Sasho) Nikolov, University of Toronto

Abstract: One of the most basic tasks in differential privacy is answering statistical queries, which subsumes, for example, basic problems like answering counting queries, computing means and covariances, computing the CDF of the data, and computing the gradient of the empirical loss function of the data. Despite the importance of statistical queries, many basic questions about computing them with differential privacy remain unanswered. Perhaps the most embarrassing of these questions concerns the optimal tradeoff between error and data size for worst case queries in the setting of pure differential privacy. To address this question, we introduce a new method for releasing answers to statistical queries with differential privacy, based on the Johnson-Lindenstrauss lemma. The key idea is to randomly project the query answers to a lower dimensional space so that the distance between any two vectors of feasible query answers is preserved up to an additive error. Then we answer the projected queries using a simple noise-adding mechanism, and lift the answers up to the original dimension. Using this method, we give, for the first time, purely differentially private mechanisms with optimal worst case sample complexity under average error for answering a workload of k queries over a universe of size N. As other applications, we give the first efficient purely private mechanisms with optimal sample complexity for computing the covariance of a bounded high-dimensional distribution, and for answering 2-way marginal queries. We also show that, up to the dependence on the error, a variant of our mechanism is nearly optimal for every given query workload.
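
The projection-then-noise recipe described above can be sketched in a few lines. This is only a schematic with hypothetical names and an uncalibrated noise scale (the paper derives the scale from the sensitivity of the projected answers and specifies the exact reconstruction step), not the authors' mechanism.

    import numpy as np

    def jl_query_release(W, x, r, noise_scale, rng):
        # W: k x N workload matrix of statistical queries; x: dataset histogram.
        k = W.shape[0]
        P = rng.normal(size=(r, k)) / np.sqrt(r)   # random JL projection
        projected = P @ (W @ x)                    # project the k true answers down to r dims
        noisy = projected + rng.laplace(scale=noise_scale, size=r)  # pure-DP noise addition
        return P.T @ noisy                         # lift back (here simply via the transpose)

Because the projection approximately preserves distances between feasible vectors of query answers, noise calibrated in the low-dimensional space translates into small per-query error after lifting.
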
Friday, November 18 at 10:30-11:45 am ET Can we Reconcile the Computer Science and Legal Views of Privacy?

Speaker: Kobbi Nissim, Georgetown University

Abstract: Law and computer science interact in critical ways within sociotechnical systems, and recognition is growing among computer scientists, legal scholars, and practitioners of significant gaps between these disciplines that create potential risks for privacy and data protection. These gaps need to be bridged to ensure that computer systems are designed and implemented to correctly address applicable legal requirements and that interpretations of legal concepts accurately reflect the capabilities and limitations of technical systems. We will explore some of the gaps between the legal and technical views of privacy and suggest directions by which these gaps may be reconciled.
=== Spring 2022 ===
Friday, January 21 at 11-12:30 ET FriendlyCore: Practical Differentially Private Aggregation

Speaker: Eliad Tsfadia, Tel-Aviv University
View recording here


Abstract: Differentially private algorithms for common metric aggregation tasks, such as clustering or averaging, often have limited practicality due to their complexity or the large number of data points required for accurate results. We propose a simple and practical tool, FriendlyCore, that takes as input a set of elements D from an unrestricted (pseudo) metric space. When D has effective diameter r, FriendlyCore returns a "stable" subset C that includes all elements, except possibly a few outliers, and is certified to have diameter r. FriendlyCore can be used to preprocess the input before privately aggregating it, potentially simplifying the aggregation or boosting its accuracy. Surprisingly, FriendlyCore is lightweight, with no dependence on the dimension. We empirically demonstrate its advantages in boosting the accuracy of mean estimation and clustering problems such as k-means and k-GMM, outperforming tailored methods. The talk is based on joint work with Edith Cohen, Haim Kaplan, Yishay Mansour, and Uri Stemmer.
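
To make the "stable subset" idea concrete, here is a deliberately simplified, non-private caricature: keep the points that have many neighbors within distance r. The actual FriendlyCore uses a smooth, randomized filter so that the selection itself is differentially private; the names and threshold below are illustrative only.

    import numpy as np

    def friendly_core_caricature(points, r, min_friends):
        # points: (n, d) array. Keep points that have at least min_friends
        # other points within distance r; the survivors form a "core" on which
        # private averaging or clustering is easier to do accurately.
        dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
        friend_counts = (dists <= r).sum(axis=1) - 1   # exclude the point itself
        return points[friend_counts >= min_friends]
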
Monday, February 14 at 11:15-12:30 ET Privatizing One-shot and Continual Observation Histograms

Speaker: Ryan Rogers, LinkedIn
View recording here


Abstract: Member privacy is a priority at LinkedIn and a key consideration in product design. We aim to protect member privacy by using innovative privacy engineering techniques, such as differential privacy, to provide valuable insights to members and policy makers. The data privacy team at LinkedIn has developed new algorithms that work with existing real-time data analytics systems, allowing us to build systems at scale while protecting member privacy. These algorithms privatize histograms in both the one-shot setting, where the data is fixed, and the continual observation setting, where the data is streaming. We also showcase some deployments of differential privacy in LinkedIn products.
Monday, February 28 at 11:15-12:30 ET Practical and Private (Deep) Learning without Sampling or Shuffling

Speaker: Abhradeep Guha Thakurta, Google and Om Thakkar, Google
View recording here


Abstract: We consider training models with differential privacy (DP) using mini-batch gradients. The existing state-of-the-art, Differentially Private Stochastic Gradient Descent (DP-SGD), requires privacy amplification by sampling or shuffling to obtain the best privacy/accuracy/computation trade-offs. Unfortunately, the precise requirements on exact sampling and shuffling can be hard to obtain in important practical scenarios, particularly federated learning (FL). In this talk, we will present a DP variant of Follow-The-Regularized-Leader (DP-FTRL) that compares favorably (both theoretically and empirically) to amplified DP-SGD, while allowing for much more flexible data access patterns. DP-FTRL does not use any form of privacy amplification. The talk will assume basic understanding of differential privacy and machine learning. However, the talk will be self-contained w.r.t. most of the advanced concepts. This is a joint work with Peter Kairouz, Brendan McMahan, Shuang Song, and Zheng Xu from Google Research.
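
DP-FTRL replaces per-step amplified noise with noise added to prefix sums of gradients via the classic tree-aggregation (binary-tree) mechanism. The sketch below shows only that primitive, with hypothetical names and without the privacy calibration or the FTRL update itself; it is an illustration of the idea rather than the authors' implementation.

    import numpy as np

    def tree_noisy_prefix_sums(grads, sigma, rng):
        # grads: array of shape (T, d) with one (clipped, summed) gradient per round.
        # Each dyadic block of rounds gets one noisy partial sum; any prefix sum is
        # then assembled from O(log T) noisy blocks, so the noise in every prefix
        # grows only polylogarithmically in T.
        T, d = grads.shape
        noisy_block = {}
        size = 1
        while size <= T:
            for start in range(0, T, size):
                block_sum = grads[start:start + size].sum(axis=0)
                noisy_block[(start, size)] = block_sum + rng.normal(scale=sigma, size=d)
            size *= 2
        prefixes = []
        for t in range(1, T + 1):
            total, start, remaining = np.zeros(d), 0, t
            while remaining > 0:
                size = 1
                while size * 2 <= remaining:
                    size *= 2
                total = total + noisy_block[(start, size)]
                start += size
                remaining -= size
            prefixes.append(total)
        return prefixes

Roughly speaking, each round's gradient touches only O(log T) noisy blocks, which is what lets the overall noise be calibrated without relying on sampling or shuffling.
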
Monday, March 21 at 11:15-12:30 ET Private Convex Optimization via Exponential Mechanism

Speaker: Sivakanth Gopi, MSR
View recording here


Abstract: We study differentially private optimization of (non-smooth) convex functions F(x)=E_i[f_i(x)]. The classic exponential mechanism minimizes F(x) by sampling from pi(x) ~ exp(-kF(x)), but achieves a suboptimal privacy vs utility tradeoff. We show that modifying the exponential mechanism by adding an ell_2^2 regularizer to F(x) and sampling from pi(x) ~ exp(-k(F(x)+\mu ||x||_2^2/2)) recovers both the optimal empirical risk and the optimal population loss under (eps,delta)-DP. We also give an algorithm to efficiently sample from the exponential mechanism using an optimal number of oracle queries to f_i(x). We prove that the regularized exponential mechanism satisfies Gaussian Differential Privacy; our privacy bound is optimal (with tight constants), as it includes the analysis of the Gaussian mechanism as a special case. The privacy proof uses an isoperimetric inequality for strongly log-concave measures. The paper is available at https://arxiv.org/pdf/2203.00263.pdf.
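
One generic way to sample (approximately) from a density proportional to exp(-k(F(x) + mu*||x||_2^2/2)) is unadjusted Langevin dynamics. The sketch below is only that generic heuristic, with hypothetical names; it is not the paper's sampler, which comes with the oracle-complexity and privacy guarantees described above.

    import numpy as np

    def langevin_sample(subgrad_F, d, k, mu, step, n_steps, rng):
        # Target (up to normalization): exp(-k * (F(x) + mu * ||x||^2 / 2)).
        # Unadjusted Langevin: a (sub)gradient step on the negative log-density
        # plus Gaussian noise of matching scale.
        x = np.zeros(d)
        for _ in range(n_steps):
            g = k * (subgrad_F(x) + mu * x)
            x = x - step * g + np.sqrt(2.0 * step) * rng.normal(size=d)
        return x
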
Monday, April 4 at 11:15-12:30 ET Negotiating privacy and fitness for use in recent DP deployments at federal statistical agencies

Speaker: Ashwin Machanavajjhala, Tumult Labs and Duke University
View recording here


Abstract: In this talk we will describe two current and recent deployments of DP at the US Census Bureau and the US Internal Revenue Service that Tumult Labs is involved with. We will describe the data products and some of the considerations taken into account when designing DP algorithms for these deployments. We will highlight our methodology of engaging with key stakeholders to iteratively elicit requirements for privacy and fitness-for-use as well as design and tune a differentially private algorithm that meets these requirements. This talk will also briefly highlight Tumult Labs’ soon-to-be-open-source platform for SQL-like analytics with configurable differential privacy guarantees, which is currently used to publish these data products. We will end with lessons learned and identify some open challenges.
Monday, April 25 at 11:15-12:30 ET Differentially private inference via noisy optimization

Speaker: Marco Avella Medina, Columbia University
View recording here


Abstract: We propose a general optimization-based framework for computing differentially private M-estimators and a new method for the construction of differentially private confidence regions. Firstly, we show that robust statistics can be used in conjunction with noisy gradient descent and noisy Newton methods in order to obtain optimal private estimators with global linear or quadratic convergence, respectively. We establish global convergence guarantees, under both local strong convexity and self-concordance, showing that our private estimators converge with high probability to a neighborhood of the non-private M-estimators. The radius of this neighborhood is nearly optimal in the sense that it corresponds to the statistical minimax cost of differential privacy up to a logarithmic term. Secondly, we tackle the problem of parametric inference by constructing differentially private estimators of the asymptotic variance of our private M-estimators. This naturally leads to the use of approximate pivotal statistics for the construction of confidence regions and hypothesis testing. We demonstrate the effectiveness of a bias correction that leads to enhanced small-sample empirical performance in simulations. This is joint work with Casey Bradshaw and Po-Ling Loh.
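
For intuition, here is a generic noisy-gradient-descent skeleton of the kind this framework builds on: bound each observation's influence by clipping, average, add Gaussian noise, and step. The names are hypothetical and the noise scale is left uncalibrated (it must come from a privacy accountant); the estimators in the talk additionally use robust score functions and Newton-type updates.

    import numpy as np

    def noisy_gradient_descent(grad_fn, theta0, data, n_steps, lr, clip, sigma, rng):
        # grad_fn(theta, z) returns the gradient of the loss at observation z.
        theta = np.array(theta0, dtype=float)
        n = len(data)
        for _ in range(n_steps):
            total = np.zeros_like(theta)
            for z in data:
                g = grad_fn(theta, z)
                norm = np.linalg.norm(g)
                if norm > clip:
                    g = g * (clip / norm)   # bound each observation's influence
                total += g
            noisy_avg = total / n + rng.normal(scale=sigma * clip / n, size=theta.shape)
            theta = theta - lr * noisy_avg
        return theta
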
Monday, May 9 at 11:15-12:30 ET Low-communication Algorithms for Private Federated Data Analysis with Optimal Accuracy Guarantees

Speaker: Vitaly Feldman, Apple
View recording here


Abstract: Locally Differentially Private (LDP) reports are commonly used for collection of statistics and machine learning in the federated setting. Yet for both frequency estimation over a large domain and learning a high-dimensional model, the best known LDP algorithms require sending prohibitively large messages from each client device to the server. In this talk I'll describe several recent results on low-communication LDP protocols with optimal accuracy. In particular: (1) a general approach that, under standard cryptographic assumptions, gives a low-communication version of any efficient LDP randomizer with negligible loss in privacy and utility guarantees; (2) new algorithms for frequency estimation and high-dimensional mean estimation that achieve the optimal privacy/utility trade-off and have low communication cost. Based on joint works with Hilal Asi, Jelani Nelson, Huy Nguyen, and Kunal Talwar.
Monday, May 23 at 11:30-12:30 ET "Your responses will be retained until ...": The challenge of retaining information without retaining individual responses in a global COVID symptom survey.

Speaker: Frauke Kreuter, University of Maryland/LMU Munich and Steven Wu, Carnegie Mellon University
View recording here


Abstract: This presentation discusses the challenge of preserving the long-term utility of a global COVID-19 symptom survey. The global sample includes more than 240 countries and territories since the launch of data collection, of which 115 are weighted to better reflect the age and gender of the general adult population. CTIS individual-level data and aggregate estimates are available at the nation level as well as for administrative level-1 subnational regions. Aggregates have been released publicly. The sample is a daily repeated cross-section of Facebook app users, who can see an invitation as often as every month and as infrequently as every 5 months. This means the data can be used to detect both daily and weekly changes in COVID-19 symptoms as well as a variety of other COVID-related and public health outcomes, such as barriers to vaccination or mental health. This has enabled both local and national governments and non-governmental organizations to detect and respond to changes in trends and to evaluate the effectiveness of COVID policies and restrictions. Preserving the richness of this information for future use is key, though challenged by a promise made to EU respondents that their responses will be retained for a period of two years. The talk will discuss steps taken to apply disclosure control to the daily cross-sectional estimates and to create DP synthetic data that allow detailed time series and causal effect analyses going forward. We hope to discuss with the group how to handle weights when creating the DP synthetic files and how to evaluate synthetic data in the presence of weighted private files. This is joint work with Caro Haensch (LMU Munich) and Terrance Liu (CMU).
=== Fall 2021 ===
Friday, October 15 at 11-12:30 ET Learning with user-level differential privacy

Speaker: Ananda Theertha Suresh, Google
View recording here


Abstract: The classical setting of differential privacy assumes each user contributes a single sample to the dataset and preserves privacy by noising the output in a way that is commensurate with the maximum contribution of a single example. However, in many practical applications such as federated learning, each user can contribute multiple samples. In these applications, the goal is to provide user-level differential privacy which protects privacy of all the samples of the user. In this talk, we present algorithms and information-theoretic lower bounds for the problems of discrete distribution estimation, high-dimensional mean estimation, and empirical risk minimization under user-level differential privacy constraints.
Friday, October 22 at 11-12:30 ET Mean Estimation with User-level Privacy under Data Heterogeneity

Speaker: Rachel Cummings, Columbia University
View recording here


Abstract: A key challenge for data analysis in the federated setting is that user data is heterogeneous, i.e., it cannot be assumed to be sampled from the same distribution. Further, in practice, different users may possess vastly different numbers of samples. In this work we propose a simple model of heterogeneous user data that differs in both distribution and quantity of data, and we provide a method for estimating the population-level mean while preserving user-level differential privacy. We demonstrate asymptotic optimality of our estimator within a natural class of private estimators and also prove general lower bounds on the error achievable in our problem. In particular, while the optimal non-private estimator can be shown to be linear, we show that privacy constrains us to use a non-linear estimator.
Friday, October 29 at 11-12:30 ET Hyperparameter Tuning with Renyi Differential Privacy

Speaker: Thomas Steinke, Google
View recording here


Abstract: For many differentially private algorithms, such as the prominent noisy stochastic gradient descent (DP-SGD), the analysis needed to bound the privacy leakage of a single training run is well understood. However, few studies have reasoned about the privacy leakage resulting from the multiple training runs needed to fine-tune the training algorithm's hyperparameters. In this work, we first illustrate how simply setting hyperparameters based on non-private training runs can leak private information. Motivated by this observation, we then provide privacy guarantees for hyperparameter search procedures within the framework of Renyi Differential Privacy. Our results improve and extend the work of Liu and Talwar (STOC 2019). Our analysis supports our previous observation that tuning hyperparameters does indeed leak private information, but we prove that, under certain assumptions, this leakage is modest, as long as each candidate training run needed to select hyperparameters is itself differentially private.
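
The flavor of procedure analyzed in this line of work can be sketched as follows: repeat a differentially private training run a random number of times with randomly chosen hyperparameters and release only the best outcome. The skeleton below uses hypothetical names and geometric stopping for concreteness; it illustrates the idea rather than the exact procedures and distributions analyzed in the paper.

    import numpy as np

    def private_hyperparameter_search(train_once, candidates, stop_prob, rng):
        # train_once(hp) must itself be differentially private and return an
        # object whose .score was also computed in a DP way.
        best = None
        while True:
            hp = candidates[rng.integers(len(candidates))]
            result = train_once(hp)
            if best is None or result.score > best.score:
                best = result
            if rng.random() < stop_prob:   # random (geometric) stopping
                return best

The randomness of the stopping time is what keeps the selection step from leaking much more than a single run; the talk quantifies this within the Renyi DP framework.
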
Friday, November 5 at 11-12:30 ET User-Level Differentially Private Learning via Correlated Sampling

Speaker: Pasin Manurangsi, Google
View recording here


Abstract: Most works in learning with differential privacy (DP) have focused on the setting where each user has a single sample. In this work, we consider the setting where each user holds $m$ samples and the privacy protection is enforced at the level of each user's data. We show that, in this setting, we may learn with far fewer users. Specifically, we show that, as long as each user receives sufficiently many samples, we can learn any privately learnable class via an $(\epsilon, \delta)$-DP algorithm using only $O(\log(1/\delta)/\epsilon)$ users. For $\epsilon$-DP algorithms, we show that we can learn using only $O_{\epsilon}(d)$ users even in the local model, where $d$ is the probabilistic representation dimension. In both cases, we show a nearly-matching lower bound on the number of users required. A crucial component of our results is a generalization of global stability [Bun, Livni, Moran, FOCS 2020] that allows the use of public randomness. Under this relaxed notion, we employ a correlated sampling strategy to show that the global stability can be boosted to be arbitrarily close to one, at a polynomial expense in the number of samples.
Friday, November 12 at 11-12:30 ET Inference and Privacy

Speaker: Daniel Sheldon, University of Massachusetts Amherst
View recording here


Abstract: Differential privacy requires that an algorithm is randomized “just enough” to mask the contribution of any one individual in the input data set when viewing the algorithm’s output. This leads to new inference problems where we wish to reason about the truth given measurements with excess noise added for privacy. I will discuss recent work on inference and differential privacy, including probabilistic inference for more accurately answering database queries under privacy constraints, private Bayesian inference, and private confidence interval construction. Time permitting, I will briefly mention how underlying technical ideas were motivated by, and helped advance, inference about bird migration patterns from citizen science data.
Friday, November 19 at 11-12:30 ET Per-Query Error and Differential Privacy

Speaker: Daniel Kifer, Penn State University
View recording here


Abstract: In this talk we consider the effects of per-query error on the design of query answering algorithms under differential privacy. Per-query error is closer to stakeholder expectations than metrics that are typically considered in the literature. When differentially private systems are required to produce microdata, we show that this error metric reveals a tradeoff between query answer errors that is unrelated to privacy loss budget allocations between the queries. We also consider how to optimize the choice of linear queries to satisfy per-query error requirements.
Friday, December 3 at 11-12:30 ET Differential Privacy and Machine Unlearning

Speaker: Aaron Roth, University of Pennsylvania
View recording here


Abstract: The problem of data deletion or "machine unlearning" is to remove the influence of a data point on a trained model, with computational cost that is substantially better than the baseline solution of fully retraining the model. Whereas differential privacy asks that the same algorithm run on different (neighboring) inputs yield nearby distributions, the data deletion problem requires that two different algorithms (full retraining vs. a sequence of deletion operations) yield nearby distributions when run on the same input. So the two goals are not the same: nevertheless, techniques from differential privacy carry over in natural ways to the data deletion problem. In this talk, I'll walk through two simple vignettes that illustrate this point. The work I will discuss is from a pair of papers that are joint with Varun Gupta, Chris Jung, Seth Neel, Saeed Sharifi, and Chris Waites.
=== Spring 2021 ===
Friday, May 14 at 11-12:30 ET Near-Optimal Aggregation in the Shuffle DP Model

Speaker: Badih Ghazi, Google
View recording here

Abstract: The shuffle model of differential privacy has recently witnessed significant interest as an intermediate setting between the well-studied central and local differential privacy models. In this talk, we describe some algorithms for differentially private aggregation in the shuffle model, achieving near-central accuracy and small communication overhead. We also present some empirical evaluations and discuss some challenges towards obtaining more scalable practical implementations. Based on joint works with Ravi Kumar, Pasin Manurangsi, Rasmus Pagh, Amer Sinha and Ameya Velingker that appeared at Eurocrypt 2020, ICML 2020 and will appear at ICML 2021.
Friday, May 7 at 11-12:30 ET Identification and formal privacy guarantees

Speaker: Denis Nekipelov, University of Virginia
View recording here

Abstract: Empirical economic research crucially relies on highly sensitive individual datasets. At the same time, the increasing availability of public individual-level data from social networks, public government records, and directories makes it possible for adversaries to de-identify anonymized records in sensitive research datasets. The most commonly accepted formal definition of an individual non-disclosure guarantee is known as differential privacy. With differential privacy in place, the researcher interacts with the data by issuing queries that evaluate functions of the data. The differential privacy guarantee is achieved by replacing the actual outcome of the query with a randomized outcome, with the amount of randomness determined by the sensitivity of the outcome to individual observations in the data.

While differential privacy does provide formal non-disclosure guarantees, its impact on the identification of empirical economic models, as well as on the performance of estimators in nonlinear empirical econometric models, has not been sufficiently studied. Since privacy protection mechanisms are inherently finite-sample procedures, we define the notion of identifiability of the parameter of interest under differential privacy as a property of the limit of experiments. It is naturally characterized by concepts from the theory of random sets and is linked to the asymptotic behavior in measure of differentially private estimators. We demonstrate that particular instances of regression discontinuity design and average treatment effect may be problematic for inference with differential privacy. Those parameters turn out to be neither point nor partially identified. The set of differentially private estimators converges weakly to a random set. This result is clearly supported by our simulation evidence. Our analysis suggests that many other estimators that rely on nuisance parameters may have similar properties under the requirement of differential privacy. Identification becomes possible if the target parameter can be deterministically localized within the random set. In that case, a full exploration of the random set of the weak limits of differentially private estimators can allow the data curator to select a sequence of instances of differentially private estimators that is guaranteed to converge to the target parameter in probability. We provide a decision-theoretic approach to this selection.
Friday, April 30 at 11-12:30 ET Private Stochastic Convex Optimization

Speaker: Kunal Talwar, Apple
View slides here

Abstract: I will summarize some recent works on differentially private (DP) algorithms for stochastic convex optimization: the problem of minimizing the population loss given i.i.d. samples from a distribution over convex loss functions. In the standard l2/l2 setting, we will see two approaches to getting optimal rates for this problem. We show that for a wide range of parameters, privacy causes no additional overhead in accuracy or run time. In the process, we will develop techniques for private stochastic optimization that work for other geometries. For the LASSO setting when optimizing over the l1 ball, we will see private algorithms that achieve optimal rates. Based on joint works with various subsets of Hilal Asi, Raef Bassily, Vitaly Feldman, Tomer Koren and Abhradeep Thakurta.
Friday, April 23 at 11-12:30 ET Privacy is Not an Afterthought

Speaker: Xi He, University of Waterloo
View recording here

Abstract: Computing technology has enabled massive digital traces of our personal lives to be collected and stored. These datasets play an important role in numerous real-life applications and research analyses, such as contact tracing for COVID-19, but they contain sensitive information about individuals. When managing these datasets, privacy is usually addressed as an afterthought, engineered on top of a database system optimized for performance and usability. This has led to a plethora of unexpected privacy attacks in the news. Specialized privacy-preserving solutions usually require a group of privacy experts and are not directly transferable to other domains. There is an urgent need for a general trustworthy database system that offers end-to-end privacy guarantees. This talk will present the challenges in designing such a system and highlight our efforts to make the system efficient, robust, and usable for database clients while achieving provable privacy guarantees.
Friday, April 16 at 11-12:30 ET What is (and isn't) private learning?

Speaker: Florian Tramèr, Stanford University
View recording here

Abstract: In the first part of this talk, I'll motivate the need for training machine learning models with formal differential privacy guarantees, by drawing on recent examples of privacy failures. In the second part, I'll demonstrate that differentially private machine learning still has a long way to go to become practical. For many canonical vision tasks, I'll show that "old-school" computer vision algorithms with handcrafted features significantly outperform end-to-end deep neural networks for moderate privacy budgets.
Friday, April 9 at 11-12:30 ET Hypothesis Selection with Privacy

Speaker: Gautam Kamath, University of Waterloo
View recording here

Abstract: The Scheffe estimator is a classic and celebrated statistical tool, which provides a sample-efficient method for selecting the distribution from a set of hypotheses which best matches a dataset. It can be extended to the private setting, enabling near-optimal cover-based upper bounds, which tightly complement packing-based lower bounds. I will discuss applications of this method to distribution estimation and beyond, in both the central and local setting. Based on several related works, with Ishaq Aden-Ali, Hassan Ashtiani, Mark Bun, Sivakanth Gopi, Janardhan Kulkarni, Aleksandar Nikolov, Vikrant Singhal, Thomas Steinke, Jonathan Ullman, Zhiwei Steven Wu, and Huanyu Zhang. Arxiv links: https://arxiv.org/abs/1905.13229, https://arxiv.org/abs/2002.09465, https://arxiv.org/abs/2002.09464, https://arxiv.org/abs/2010.09929
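
For reference, here is the classic (non-private) Scheffe test between two discrete hypotheses, which the private selection algorithms in the talk build on; the helper below is illustrative only.

    def scheffe_choose(p1, p2, samples):
        # p1, p2: dicts mapping outcomes to probabilities. Compare each
        # hypothesis' mass on the Scheffe set A = {x : p1(x) > p2(x)} with the
        # empirical mass of the samples on A, and keep the closer hypothesis.
        support = set(p1) | set(p2)
        A = {x for x in support if p1.get(x, 0.0) > p2.get(x, 0.0)}
        emp = sum(1 for s in samples if s in A) / len(samples)
        m1 = sum(p1.get(x, 0.0) for x in A)
        m2 = sum(p2.get(x, 0.0) for x in A)
        return p1 if abs(m1 - emp) <= abs(m2 - emp) else p2

Roughly speaking, running such pairwise tests in a tournament over a cover of the hypothesis class, and privatizing the selection of the winner, is what yields the cover-based upper bounds mentioned above.
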
Friday, April 2 at 11-12:30 ET What Is The Sample Complexity of Differentially Private Learning?

Speaker: Shay Moran, Technion
View recording here

Abstract: The increase in machine learning applications which involve private and personal data highlights the need for algorithms that handle the data *responsibly*. While this need has been successfully addressed by the field of differentially private machine learning, the cost of privacy remains poorly understood: How much data is needed for differentially private learning? How much more data does private learning require compared to learning without privacy constraints? We will survey some of the recent progress towards answering these questions in the distribution-free PAC model, including the Littlestone-dimension-based *qualitative* characterization and the relationship with online learning. If time allows, we will also discuss this question in more general (distribution- and data-dependent) learning models.
Friday, March 26 at 11-12:30 ET Security and Privacy Guarantees in Machine Learning with Differential Privacy

Speaker: Roxana Geambasu, Columbia University
View recording here

Abstract: Machine learning (ML) is driving many of our applications and life-changing decisions. Yet, it is often brittle and unstable, making decisions that are hard to understand or can be exploited. Tiny changes to an input can cause dramatic changes in predictions; this results in decisions that surprise, appear unfair, or enable attack vectors such as adversarial examples. Moreover, models trained on users' data can encode not only general trends from large datasets but also very specific, personal information from these datasets; this threatens to expose users' secrets through ML models or predictions. This talk positions differential privacy (DP) -- a rigorous privacy theory -- as a versatile foundation for building into ML much-needed guarantees of security, stability, and privacy. I first present PixelDP (S&P'19), a scalable certified defense against adversarial example attacks that leverages DP theory to guarantee a level of robustness against these attacks. I then present Sage (SOSP'19), a DP ML platform that bounds the cumulative leakage of secrets through models while addressing some of the most pressing challenges of DP, such as running out of privacy budget and the privacy-accuracy tradeoff. PixelDP and Sage are designed from a pragmatic, systems perspective and illustrate that DP theory is powerful but requires adaptation to achieve practical guarantees for ML workloads.
Friday, March 19 at 11-12:30 ET On Distributed Differential Privacy and Counting Distinct Elements

Speaker: Lijie Chen, MIT
View recording here

Abstract: We study the setup where each of n users holds an element from a discrete set, and the goal is to count the number of distinct elements across all users, under the constraint of (eps, delta)-differential privacy: (1) In the non-interactive local setting, we prove that the (additive) error of any protocol is Omega(n) for any constant eps and for any delta inverse polynomial in n. (2) In the single-message shuffle setting, we prove a lower bound of n/polylog(n) on the error for any constant eps and for some delta inverse quasi-polynomial in n. We do so by building on the moment-matching method from the literature on distribution estimation. (3) In the multi-message shuffle setting, we give a protocol with at most one message per user in expectation and with an error of sqrt{n} polylog(n) for any constant eps and for any delta inverse polynomial in n. Our protocol is also robustly shuffle private, and our error of sqrt{n} matches a known lower bound for such protocols. Our proof technique relies on a new notion, which we call dominated protocols, that can also be used to obtain the first non-trivial lower bounds against multi-message shuffle protocols for the well-studied problems of selection and learning parity. Our lower bound for estimating the number of distinct elements provides the first omega(sqrt{n}) separation between global sensitivity and error in local differential privacy, thus answering an open question of Vadhan (2017). We also provide a simple construction that gives n/polylog(n) separation between global sensitivity and error in two-party differential privacy, thereby answering an open question of McGregor et al. (2011). This is joint work with Badih Ghazi, Ravi Kumar, and Pasin Manurangsi from Google Research.
Friday, March 12 at 11-12:30 ET Algorithmic Challenges in Efficient Training of Private (Deep) Language Models

Speaker: Janardhan Kulkarni, Microsoft Research
View recording here

Abstract: Many attacks have shown that deep learning models trained on private data of users can leak sensitive information of the users. Differential Privacy is a provable way to prevent such attacks. However, training deep learning models using DP introduces several new challenges both in terms of privacy vs accuracy tradeoffs and in the resource cost of the process. In this talk, I will highlight some of the problems we encountered, our solutions for resolving them and mention many important open problems.
Friday, March 5 at 11-12:30 ET Hiding Among the Clones: A Simple and Nearly Optimal Analysis of Privacy Amplification by Shuffling

Speaker: Audra McMillan, Apple
View recording here

Abstract: Recent work of Erlingsson, Feldman, Mironov, Raghunathan, Talwar, and Thakurta [EFMRTT19] demonstrates that random shuffling amplifies differential privacy guarantees of locally randomized data. Such amplification implies substantially stronger privacy guarantees for systems in which data is contributed anonymously [BEMMRLRKTS17] and has led to significant interest in the shuffle model of privacy [CSUZZ19, EFMRTT19]. In this talk, we will discuss a new result on privacy amplification by shuffling, which achieves the asymptotically optimal dependence in the local privacy parameter. Our result is based on a new proof strategy which is simpler than previous approaches, and extends to approximate differential privacy with nearly the same guarantees. We'll discuss this proof strategy, the extension to approximate differential privacy, and time permitting, some of the implications of this result.
Friday, February 26 at 11-12:30 ET Sample-efficient proper PAC learning with approximate differential privacy

Speaker: Noah Golowich, MIT
View recording here

Abstract: An exciting recent development in the theory of differentially private machine learning is the connection between private learning and online learning, and in particular the result that a binary hypothesis class is privately learnable if and only if it is online learnable (i.e., has finite Littlestone dimension). In this talk I will discuss our work strengthening various aspects of the result that online learning implies private learning: first, we show that the sample complexity of properly learning a class of Littlestone dimension d with approximate differential privacy is Õ(d^6), ignoring privacy and accuracy parameters. This result answers a question of Bun et al. (2020) by improving upon their upper bound of 2^O(d) on the sample complexity. Prior to our work, finiteness of the sample complexity for privately learning a class of finite Littlestone dimension was only known for improper private learners, and the fact that our learner is proper answers another question of Bun et al., which was also asked by Bousquet et al. (2020). Using machinery developed by Bousquet et al., we also show that the sample complexity of sanitizing a binary hypothesis class is at most polynomial in its Littlestone dimension and dual Littlestone dimension. This implies that a class is sanitizable if and only if it has finite Littlestone dimension. An important ingredient of our proofs is a new property of binary hypothesis classes that we call irreducibility, which may be of independent interest.
Friday, February 19 at 11-12:30 ET Leveraging Heuristics for Private Synthetic Data Release

Speaker: Steven Wu, CMU
View recording here


Abstract: This talk will focus on differentially private synthetic data---a privatized version of the dataset that consists of fake data records and that approximates the real dataset on important statistical properties of interest. I will present our recent results on private synthetic data that leverage practical optimization heuristics to circumvent the computational bottleneck in existing work. Our techniques are motivated by a modular, game-theoretic framework, which can flexibly work with methods such as integer program solvers and deep generative models.
Friday, February 12 at 11-12:30 ET Towards Good Statistical Inference from Differentially Private Data

Speaker: Ruobin Gong, Rutgers University
View recording here

Abstract: Differential privacy (DP) brings provability and transparency to statistical disclosure limitation. When data users migrate their analysis to private data, there is no guarantee that a statistical model, otherwise good for non-private data, will still produce trustworthy conclusions. This talk contemplates two challenges faced by data users to draw good statistical inference from private data releases. When the DP mechanism is transparent, I discuss how approximate computation techniques (Monte Carlo EM, approximate Bayesian computation) can be systematically adapted to produce exact inference with respect to the joint specification of the intended model and the DP mechanism. In the presence of mandated invariants which the data curator must observe, I advocate for the congenial design of the DP mechanism via standard probabilistic conditioning on the invariant margins, as an alternative to optimization-based post-processing. This proposal preserves both the privacy guarantee of the output and its statistical intelligibility. A demonstration of restricted contingency table privatization is performed via a Markov chain algorithm.
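
To make the idea of treating the DP mechanism as part of the model concrete, here is a minimal approximate-Bayesian-computation sketch in which the privatization step is simulated alongside the data. All names are hypothetical; this is a schematic of the approach, not the speaker's algorithms.

    def abc_with_dp(released_stat, sample_prior, simulate_data, dp_mechanism,
                    distance, tol, n_draws, rng):
        # Simulate (parameter, data, privatized statistic) triples and keep the
        # parameters whose privatized statistic lands within tol of the release.
        accepted = []
        for _ in range(n_draws):
            theta = sample_prior(rng)
            stat = dp_mechanism(simulate_data(theta, rng), rng)
            if distance(stat, released_stat) <= tol:
                accepted.append(theta)
        return accepted

Because the simulated statistics pass through the same (transparent) DP mechanism as the release, the accepted parameters approximate the posterior under the joint specification of the model and the privacy noise.
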
Friday, February 5 at 11-12:30 ET Private Mean Estimation of Heavy-Tailed Distributions

Speaker: Vikrant Singhal
Abstract: We give new upper and lower bounds on the minimax sample complexity of differentially private mean estimation of distributions with bounded $k$-th moments. Roughly speaking, in the univariate case, we show that $$n = \Theta\left(\frac{1}{\alpha^2} + \frac{1}{\alpha^{\frac{k}{k-1}}\varepsilon}\right)$$ samples are necessary and sufficient to estimate the mean to $\alpha$-accuracy under $\varepsilon$-differential privacy, or any of its common relaxations. This result demonstrates a qualitatively different behavior compared to estimation absent privacy constraints, for which the sample complexity is identical for all $k \geq 2$. We also give algorithms for the multivariate setting whose sample complexity is a factor of $O(d)$ larger than the univariate case.
Monday, January 25 at 3 PM ET Local Differential Privacy is Equivalent to the Contraction of Hockey-Stick Divergence

Abstract: In this talk, we first show that the approximate local differential privacy (LDP) can be equivalently expressed in terms of the contraction coefficient of “Hockey-Stick Divergence.” This result then enables us to relate the LDP guarantees of randomized mechanisms to contraction properties of any arbitrary f-divergences. This is in fact a generalization (and improvement) of the main result in [Duchi, Jordan and Wainwright, FOCS’13] that led to information-theoretic lower bounds for private minimax estimation problems only in the high privacy regime (i.e., epsilon<1 and delta =0). Our result allows us to drop the high-privacy assumption and obtain lower bounds for any epsilon and delta. Time permitting, I will also discuss some implications for the private Bayesian estimation problems. This is a work in progress and based on a collaboration with Maryam Aliakbarpour (UMass) and Flavio Calmon (Harvard).