Bostonarea Data Privacy
Below is the schedule of previous and upcoming Bostonarea data privacy seminars. Join the mailing list and Google calendar for more information, including the Zoom meeting links.
Friday, September 16 at 10:3011:45 am ET  What Impacts (Empirical) Privacy Risk? Speaker: Matthew Jagielski, Google Brain Abstract: In this talk, I will talk about a number of factors that influence the success of privacy attacks on trained machine learning models. I'll start with a recent line of work on extracting training data from large language models: we'll see that data duplication and model size have a large impact on this memorization, but our estimates. Then, I'll discuss membership inference. I'll discuss recent work showing that adding or removing examples from a training set can have a large impact on membership inference accuracy for other examples. Then, we'll see *when* a point is used in training matters for membership inference attacks, too, in settings where models are updated repeatedly, or when training sets are extremely large. This talk is based on joint works with many collaborators. 

Friday, September 30 at 10:3011:45 am ET  Score Attack: A Lower Bound Technique for Differentially Private Learning Speaker: Linjun Zhang, Rutgers University Abstract: Privacypreserving techniques have been a central focus in data analysis nowadays. In this talk, we propose a general technique, the score attack, for lower bounding the differentialprivacyconstrained minimax risk of parameter estimation. Inspired by the tracing attack idea in differential privacy, the score attack method is applicable to any statistical model with a welldefined score statistic and capable of optimally lower bounding the minimax risk of estimating unknown model parameters when the estimator is required to be differentially private. The effectiveness of this general method is demonstrated in a variety of examples: the generalized linear model in classical and highdimensional sparse settings, the BradleyTerryLuce model for pairwise comparisons, and nonparametric function estimation in the Sobolev class. 

Friday, October 14 at 10:3011:45 am ET  Policy impacts of statistical uncertainty and privacy Speaker: Ryan Steed, Carnegie Mellon University Abstract: The introduction of differential privacy to the 2020 Censusaccomplished by adding random noise to statistics to preserve respondents' privacykindled an ongoing debate over privacyutility tradeoffs in federal statistics. We examine an overlooked aspect of this debate: the role of underlying statistical uncertainty. Simulating the effects of data error and noise injected for privacy, we find that the Title I education funding formula distributes error unequally across demographic groupsthough existing data error impacts funding decisions much more than our privacy mechanism. We suggest simple policy changes to reduce disparities and compensate for additional privacy protections. Our results point the way towards reforms that would make evidencebased policy both more beneficial to marginalized communities and more privacyprotecting. 

Friday, October 28 at 10:3011:45 am ET  Private Query Release via the JohnsonLindenstrauss Transform Speaker: Aleksandar (Sasho) Nikolov, University of Toronto Abstract: One of the most basic tasks in differential privacy is answering statistical queries, which subsumes, for example, basic problems like answering counting queries, computing means and covariances, computing the CDF of the data, and computing the gradient of the empirical loss function of the data. Despite the importance of statistical queries, many basic questions about computing them with differential privacy remain unanswered. Perhaps the most embarrassing of these questions concerns the optimal tradeoff between error and data size for worst case queries in the setting of pure differential privacy. To address this question, we introduce a new method for releasing answers to statistical queries with differential privacy, based on the JohnsonLindenstrauss lemma. The key idea is to randomly project the query answers to a lower dimensional space so that the distance between any two vectors of feasible query answers is preserved up to an additive error. Then we answer the projected queries using a simple noiseadding mechanism, and lift the answers up to the original dimension. Using this method, we give, for the first time, purely differentially private mechanisms with optimal worst case sample complexity under average error for answering a workload of k queries over a universe of size N. As other applications, we give the first efficient purely private mechanisms with optimal sample complexity for computing the covariance of a bounded highdimensional distribution, and for answering 2way marginal queries. We also show that, up to the dependence on the error, a variant of our mechanism is nearly optimal for every given query workload. 

Friday, November 18 at 10:3011:45 am ET  Can we Reconcile the Computer Science and Legal Views of Privacy? Speaker: Kobbi Nissim, Georgetown University Abstract: Law and computer science interact in critical ways within sociotechnical systems, and recognition is growing among computer scientists, legal scholars, and practitioners of significant gaps between these disciplines that create potential risks for privacy and data protection. These gaps need to be bridged to ensure that computer systems are designed and implemented to correctly address applicable legal requirements and that interpretations of legal concepts accurately reflect the capabilities and limitations of technical systems. We will explore some of the gaps between the legal and technical views of privacy and suggest directions by which these gaps may be reconciled. 

Friday, January 21 at 1112:30 ET  FriendlyCore: Practical Differentially Private Aggregation Speaker: Eliad Tsfadia, TelAviv University View recording here Abstract: Differentially private algorithms for common metric aggregation tasks, such as clustering or averaging, often have limited practicality due to their complexity or a large number of data points that is required for accurate results. We propose a simple and practical tool FriendlyCore that takes a set of elements D from an unrestricted (pseudo) metric space as input. When D has effective diameter r, FriendlyCore returns a ``stable'' subset C that includes all elements, except possibly a few outliers, and is certified to have diameter r. FriendlyCore can be used to preprocess the input before privately aggregating it, potentially simplifying the aggregation or boosting its accuracy. Surprisingly, FriendlyCore is lightweight with no dependence on the dimension. We empirically demonstrate its advantages in boosting the accuracy of mean estimation and clustering problems such as kmeans and kGMM, outperforming tailored methods. The talk is based on the joint work with Edith Cohen, Haim Kaplan, Yishay Mansour, and Uri Stemmer. 

Monday, February 14 at 11:1512:30 ET  Privatizing Oneshot and Continual Observation Histograms Speaker: Ryan Rogers, LinkedIn View recording here Abstract: Member privacy is a priority at LinkedIn and a key consideration in product design. We aim to protect member privacy by using innovative privacy engineering techniques such as differential privacy to provide valuable insights to members and policy makers. The data privacy team at LinkedIn has developed new algorithms that work with existing real time data analytics systems, allowing us to develop systems at scale while protecting member privacy. These algorithms privatize histograms in both the oneshot setting, where data is fixed, and for the continual observation setting, for streaming data. We also showcase some deployments of differential privacy in LinkedIn products. 

Monday, February 28 at 11:1512:30 ET  Practical and Private (Deep) Learning without Sampling or Shuffling Speaker: Abhradeep Guha Thakurta, Google and Om Thakkar, Google View recording here Abstract: We consider training models with differential privacy (DP) using minibatch gradients. The existing stateoftheart, Differentially Private Stochastic Gradient Descent (DPSGD), requires privacy amplification by sampling or shuffling to obtain the best privacy/accuracy/computation tradeoffs. Unfortunately, the precise requirements on exact sampling and shuffling can be hard to obtain in important practical scenarios, particularly federated learning (FL). In this talk, we will present a DP variant of FollowTheRegularizedLeader (DPFTRL) that compares favorably (both theoretically and empirically) to amplified DPSGD, while allowing for much more flexible data access patterns. DPFTRL does not use any form of privacy amplification. The talk will assume basic understanding of differential privacy and machine learning. However, the talk will be selfcontained w.r.t. most of the advanced concepts. This is a joint work with Peter Kairouz, Brendan McMahan, Shuang Song, and Zheng Xu from Google Research. 

Monday, March 21 at 11:1512:30 ET  Private Convex Optimization via Exponential Mechanism Speaker: Sivakanth Gopi, MSR View recording here Abstract: We study differentially private optimization of (nonsmooth) convex functions F(x)=E_i[f_i(x)]. The classic exponential mechanism minimizes F(x) by sampling from pi(x) ~ exp(kF(x)), but achieves a suboptimal privacy vs utility tradeoff. We show that modifying the exponential mechanism by adding an ell_2^2 regularizer to F(x) and sampling from pi(x) ~ exp(k(F(x)+\mu x_2^2/2)) recovers both optimal empirical risk and population loss under (eps,delta)DP. We also give an algorithm to efficiently sample from the exponential mechanism using optimal number of oracle queries to f_i(x). We prove that the regularized exponential mechanism satisfies Gaussian Differential Privacy; our privacy bound is optimal (with tight constants), as it includes the analysis of Gaussian mechanism as a special case. The privacy proof uses isoperimetric inequality for strongly logconcave measures. The link to the paper is at https://arxiv.org/pdf/2203.00263.pdf. 

Monday, April 4 at 11:1512:30 ET  Negotiating privacy and fitness for use in recent DP deployments at federal statistical agencies Speaker: Ashwin Machanavajjhala, Tumult Labs and Duke University View recording here Abstract: In this talk we will describe two current and recent deployments of DP at the US Census Bureau and the US Internal Revenue Service that Tumult Labs is involved with. We will describe the data products and some of the considerations taken into account when designing DP algorithms for these deployments. We will highlight our methodology of engaging with key stakeholders to iteratively elicit requirements for privacy and fitnessforuse as well as design and tune a differentially private algorithm that meets these requirements. This talk will also briefly highlight Tumult Labs’ soontobeopensource platform for SQLlike analytics with configurable differential privacy guarantees, which is currently used to publish these data products. We will end with lessons learned and identify some open challenges. 

Monday, April 25 at 11:1512:30 ET  Differentially private inference via noisy optimization Speaker: Marco Avella Medina, Columbia University View recording here Abstract: We propose a general optimizationbased framework for computing differentially private Mestimators and a new method for the construction of differentially private confidence regions. Firstly, we show that robust statistics can be used in conjunction with noisy gradient descent and noisy Newton methods in order to obtain optimal private estimators with global linear or quadratic convergence, respectively. We establish global convergence guarantees, under both local strong convexity and selfconcordance, showing that our private estimators converge with high probability to a neighborhood of the nonprivate Mestimators. The radius of this neighborhood is nearly optimal in the sense it corresponds to the statistical minimax cost of differential privacy up to a logarithmic term. Secondly, we tackle the problem of parametric inference by constructing differentially private estimators of the asymptotic variance of our private Mestimators. This naturally leads to the use of approximate pivotal statistics for the construction of confidence regions and hypothesis testing. We demonstrate the effectiveness of a bias correction that leads to enhanced smallsample empirical performance in simulations. This is joint work with Casey Bradshaw and PoLing Loh. 

Monday, May 9 at 11:1512:30 ET  Lowcommunication Algorithms for Private Federated Data Analysis with Optimal Accuracy Guarantees Speaker: Vitaly Feldman, Apple View recording here Abstract: Locally Differentially Private (LDP) Reports are commonly used for collection of statistics and machine learning in the federated setting. Yet for both frequency estimation over a large domain and learning a highdimensional model the best known LDP algorithms require sending prohibitively large messages from each client device to the server. In this talk I’ll describe several recent results on lowcommunication LDP protocols with optimal accuracy. In particular, (1) A general approach that, under standard cryptographic assumptions, gives a low communication version of any efficient LDP randomizer with negligible loss in privacy and utility guarantees. (2) New algorithms for frequency estimation and highdimensional mean estimation that achieve the optimal privacy/utility tradeoff and have low communication cost. Based on joint works with Hilal Asi, Jelani Nelson, Huy Nguyen and Kunal Talwar 

Monday, May 23 at 11:3012:30 ET  "Your responses will be retained until ...": The challenge of retaining information without retaining individual responses in a global COVID symptom survey. Speaker: Frauke Kreuter, University of Maryland/LMU Munich and Steven Wu, Carnegie Mellon University View recording here Abstract: This presentation discussed the challenge of preserving longterm utility of a global COVID19 symptom survey. The global sample includes more than 240 countries and territories since the launch of data collection, of which 115 are weighted to better reflect the age and gender of the general adult population. CTIS individuallevel data and aggregate estimates are available for the nation level as well as for administrative level1 subnational regions. Aggregates have been released publicly. The sample is a daily repeated crosssection of Facebook app users who can see an invitation as often as every month and as infrequently as every 5 months. This means the data can be used to detect both daily and weekly changes in COVID19 symptoms as well as a variety of other COVIDrelated and public health outcomes, such as barriers to vaccination or mental health. This has enabled both local and national governments and nongovernmental organizations to detect and respond to changes in trends and to evaluate the effectiveness of COVID policies and restrictions. Preserving the richness of this information for future use is key, though challenged by a promise made to EU respondents, that their responses will be retained for a period of two years. The talk will discuss steps taken to apply disclosure control to the daily crosssectional estimates and to create DP synthetic data that allow detailed time series and causal effect analyses going forward. We hope to discuss with the group how to handle weights when creating the DP synthetic files and how to evaluate synthetic data in the presence of weighted private files. This is joint work with Caro Haensch (LMU Munich) and Terrance Liu (CMU). 

Friday, October 15 at 1112:30 ET  Learning with userlevel differential privacy Speaker: Ananda Theertha Suresh, Google View recording here Abstract: The classical setting of differential privacy assumes each user contributes a single sample to the dataset and preserves privacy by noising the output in a way that is commensurate with the maximum contribution of a single example. However, in many practical applications such as federated learning, each user can contribute multiple samples. In these applications, the goal is to provide userlevel differential privacy which protects privacy of all the samples of the user. In this talk, we present algorithms and informationtheoretic lower bounds for the problems of discrete distribution estimation, highdimensional mean estimation, and empirical risk minimization under userlevel differential privacy constraints. 

Friday, October 22 at 1112:30 ET  Mean Estimation with Userlevel Privacy under Data Heterogeneity Speaker: Rachel Cummings, Columbia University View recording here Abstract: A key challenge for data analysis in the federated setting is that user data is heterogeneous, i.e., it cannot be assumed to be sampled from the same distribution. Further, in practice, different users may possess vastly different number of samples. In this work we propose a simple model of heterogeneous user data that differs in both distribution and quantity of data, and we provide a method for estimating the populationlevel mean while preserving userlevel differential privacy. We demonstrate asymptotic optimality of our estimator within a natural class of private estimators and also prove general lower bounds on the error achievable in our problem. In particular, while the optimal nonprivate estimator can be shown to be linear, we show that privacy constrains us to use a nonlinear estimator. 

Friday, October 29 at 1112:30 ET  Hyperparameter Tuning with Renyi Differential Privacy Speaker: Thomas Steinke, Google View recording here Abstract: For many differentially private algorithms, such as the prominent noisy stochastic gradient descent (DPSGD), the analysis needed to bound the privacy leakage of a single training run is well understood. However, few studies have reasoned about the privacy leakage resulting from the multiple training runs needed to fine tune the value of the training algorithm's hyperparameters. In this work, we first illustrate how simply setting hyperparameters based on nonprivate training runs can leak private information. Motivated by this observation, we then provide privacy guarantees for hyperparameter search procedures within the framework of Renyi Differential Privacy. Our results improve and extend the work of Liu and Talwar (STOC 2019). Our analysis supports our previous observation that tuning hyperparameters does indeed leak private information, but we prove that, under certain assumptions, this leakage is modest, as long as each candidate training run needed to select hyperparameters is itself differentially private. 

Friday, November 5 at 1112:30 ET  UserLevel Differentially Private Learning via Correlated Sampling Speaker: Pasin Manurangsi, Google View recording here Abstract: Most works in learning with differential privacy (DP) have focused on the setting where each user has a single sample. In this work, we consider the setting where each user holds $m$ samples and the privacy protection is enforced at the level of each user's data. We show that, in this setting, we may learn with a much fewer number of users. Specifically, we show that, as long as each user receives sufficiently many samples, we can learn any privately learnable class via an $(\epsilon, \delta)$DP algorithm using only $O(\log(1/\delta)/\epsilon)$ users. For $\epsilon$DP algorithms, we show that we can learn using only $O_{\epsilon}(d)$ users even in the local model, where $d$ is the probabilistic representation dimension. In both cases, we show a nearlymatching lower bound on the number of users required. A crucial component of our results is a generalization of global stability [Bun, Livni, Moran, FOCS 2020] that allows the use of public randomness. Under this relaxed notion, we employ a correlated sampling strategy to show that the global stability can be boosted to be arbitrarily close to one, at a polynomial expense in the number of samples. 

Friday, November 12 at 1112:30 ET  Inference and Privacy Speaker: Daniel Sheldon, University of Massachusetts Amherst View recording here Abstract: Differential privacy requires that an algorithm is randomized “just enough” to mask the contribution of any one individual in the input data set when viewing the algorithm’s output. This leads to new inference problems where we wish to reason about the truth given measurements with excess noise added for privacy. I will discuss recent work on inference and differential privacy, including probabilistic inference for more accurately answering database queries under privacy constraints, private Bayesian inference, and private confidence interval construction. Time permitting, I will briefly mention how underlying technical ideas were motivated by, and helped advance, inference about bird migration patterns from citizen science data. 

Friday, November 19 at 1112:30 ET  PerQuery Error and Differential Privacy Speaker: Daniel Kifer, Penn State University View recording here Abstract: In this talk we consider the effects of perquery error on the design of query answering algorithms under differential privacy. Perquery error is closer to stakeholder expectations than metrics that are typically considered in the literature. When differentially private systems are required to produce microdata, we show that this error metric reveals a tradeoff between query answer errors that is unrelated to privacy loss budget allocations between the queries. We also consider how to optimize the choice of linear queries to satisfy perquery error requirements. 

Friday, December 3 at 1112:30 ET  Differential Privacy and Machine Unlearning Speaker: Aaron Roth, University of Pennsylvania View recording here Abstract: The problem of data deletion or "machine unlearning" is to remove the influence of a data point on a trained model, with computational cost that is substantially better than the baseline solution of fully retraining the model. Whereas differential privacy asks that the same algorithm run on different (neighboring) inputs yield nearby distributions, the data deletion problem requires that two different algorithms (full retraining vs. a sequence of deletion operations) yield nearby distributions when run on the same input. So the two goals are not the same: nevertheless, techniques from differential privacy carry over in natural ways to the data deletion problem. In this talk, I'll walk through two simple vignettes that illustrate this point. The work I will discuss is from a pair of papers that are joint with Varun Gupta, Chris Jung, Seth Neel, Saeed Sharifi, and Chris Waites. 

Friday, May 14 at 1112:30 ET  NearOptimal Aggregation in the Shuffle DP Model Speaker: Badih Ghazi, Google View recording here Abstract: The shuffle model of differential privacy has recently witnessed significant interest as an intermediate setting between the wellstudied central and local differential privacy models. In this talk, we describe some algorithms for differentially private aggregation in the shuffle model, achieving nearcentral accuracy and small communication overhead. We also present some empirical evaluations and discuss some challenges towards obtaining more scalable practical implementations. Based on joint works with Ravi Kumar, Pasin Manurangsi, Rasmus Pagh, Amer Sinha and Ameya Velingker that appeared at Eurocrypt 2020, ICML 2020 and will appear at ICML 2021. 

Friday, May 7 at 1112:30 ET  Identification and formal privacy guarantees Speaker: Denis Nekipelov, University of Virginia View recording here Abstract: Empirical economic research crucially relies on highly sensitive individual datasets. At the same time, increasing availability of public individuallevel data that comes from social networks, public government records and directories makes it possible for adversaries to potentially deidentify anonymized records in sensitive research datasets. Most commonly accepted formal definition of an individual nondisclosure guarantee is referred to as differential privacy. With differential privacy in place the researcher interacts with the data by issuing queries that evaluate the functions of the data. Differential privacy guarantee is achieved by replacing the actual outcome of the query with a randomized outcome with the amount of randomness determined by the sensitivity of the outcome to individual observations in the data. While differential privacy does provide formal nondisclosure guarantees, its impact on the identification of empirical economic models as well as its impact on the performance of estimators in nonlinear empirical Econometric models has not been sufficiently studied. Since privacy protection mechanisms are inherently finitesample procedures, we define the notion of identifiability of the parameter of interest under differential privacy as a property of the limit of experiments. It is naturally characterized by the concepts from the random sets theory and is linked to the asymptotic behavior in measure of differentially private estimators. We demonstrate that particular instances of regression discontinuity design and average treatment effect may be problematic for inference with differential privacy. Those parameters turn out to be neither point nor partially identified. The set of differentially private estimators converges weakly to a random set. This result is clearly supported by our simulation evidence. Our analysis suggests that many other estimators that rely on nuisance parameters may have similar properties with the requirement of differential privacy. Identification becomes possible if the target parameter can be deterministically localized within the random set. In that case, a full exploration of the random set of the weak limits of differentially private estimators can allow the data curator to select a sequence of instances of differentially private estimators that is guaranteed to converge to the target parameter in probability. We provide a decision theoretic approach to this selection. 

Friday, April 30 at 1112:30 ET  Private Stochastic Convex Optimization Speaker: Kunal Talwar, Apple View slides here Abstract: I will summarize some recent works on differentially private (DP) algorithms for stochastic convex optimization: the problem of minimizing the population loss given i.i.d. samples from a distribution over convex loss functions. In the standard l2/l2 setting, we will see two approaches to getting optimal rates for this problem. We show that for a wide range of parameters, privacy causes no additional overhead in accuracy or run time. In the process, we will develop techniques for private stochastic optimization that work for other geometries. For the LASSO setting when optimizing over the l1 ball, we will see private algorithms that achieve optimal rates. Based on joint works with various subsets of Hilal Asi, Raef Bassily, Vitaly Feldman, Tomer Koren and Abhradeep Thakurta. 

Friday, April 23 at 1112:30 ET  Privacy is Not an Afterthought Speaker: Xi He, University of Waterloo View recording here Abstract: Computing technology has enabled massive digital traces of our personal lives to be collected and stored. These datasets play an important role in numerous reallife applications and research analysis, such as contact tracing for COVID 19, but they contain sensitive information about individuals. When managing these datasets, privacy is usually addressed as an afterthought, engineered on top of a database system optimized for performance and usability. This has led to a plethora of unexpected privacy attacks in the news. Specialized privacypreserving solutions usually require a group of privacy experts and they are not directly transferable to other domains. There is an urgent need for a general trustworthy database system that offers endtoend privacy guarantees. This talk will present the challenges in designing such a system and highlight our efforts to make the system efficient, robust, and usable for database clients while achieving provable privacy guarantees. 

Friday, April 16 at 1112:30 ET  What is (and isn't) private learning? Speaker: Florian Tramèr, Stanford University View recording here Abstract: In the first part of this talk, I'll motivate the need for training machine learning models with formal differential privacy guarantees, by drawing on recent examples of privacy failures. In the second part, I'll demonstrate that differentially private machine learning still has a long way to go to become practical. For many canonical vision tasks, I'll show that "oldschool" computer vision algorithms with handcrafted features significantly outperform endtoend deep neural networks for moderate privacy budgets. 

Friday, April 9 at 1112:30 ET  Hypothesis Selection with Privacy Speaker: Gautam Kamath, University of Waterloo View recording here Abstract: The Scheffe estimator is a classic and celebrated statistical tool, which provides a sampleefficient method for selecting the distribution from a set of hypotheses which best matches a dataset. It can be extended to the private setting, enabling nearoptimal coverbased upper bounds, which tightly complement packingbased lower bounds. I will discuss applications of this method to distribution estimation and beyond, in both the central and local setting. Based on several related works, with Ishaq AdenAli, Hassan Ashtiani, Mark Bun, Sivakanth Gopi, Janardhan Kulkarni, Aleksandar Nikolov, Vikrant Singhal, Thomas Steinke, Jonathan Ullman, Zhiwei Steven Wu, and Huanyu Zhang. Arxiv links: https://arxiv.org/abs/1905.13229, https://arxiv.org/abs/2002.09465, https://arxiv.org/abs/2002.09464, https://arxiv.org/abs/2010.09929 

Friday, April 2 at 1112:30 ET  What Is The Sample Complexity of Differentially Private Learning? Speaker: Shay Moran, Technion View recording here Abstract: The increase in machine learning applications which involve private and personal data highlights the need for algorithms that handle the data *responsibly*. While this need has been successfully addressed by the field of differentially private machine learning, the cost of privacy remains poorly understood: How much data is needed for differentially private learning? How much more data does private learning require compared to learning without privacy constraints? We will survey some of the recent progress towards answering these questions in the distributionfree PAC model, including the Littlestonedimensionbased *qualitative* characterization and the relationship with online learning. If time allows, we will also discuss this question in more general (distribution and datadependent) learning models. 

Friday, March 26 at 1112:30 ET  Security and Privacy Guarantees in Machine Learning with Differential Privacy Speaker: Roxana Geambasu, Columbia University View recording here Abstract: Machine learning (ML) is driving many of our applications and lifechanging decisions. Yet, it is often brittle and unstable, making decisions that are hard to understand or can be exploited. Tiny changes to an input can cause dramatic changes in predictions; this results in decisions that surprise, appear unfair, or enable attack vectors such as adversarial examples. Moreover, models trained on users' data can encode not only general trends from large datasets but also very specific, personal information from these datasets; this threatens to expose users' secrets through ML models or predictions. This talk positions differential privacy (DP)  a rigorous privacy theory  as a versatile foundation for building into ML muchneeded guarantees of security, stability, and privacy. I first present PixelDP (S&P'19), a scalable certified defense against adversarial example attacks that leverages DP theory to guarantee a level of robustness against these attacks. I then present Sage (SOSP'19), a DP ML platform that bounds the cumulative leakage of secrets through models while addressing some of the most pressing challenges of DP, such as running out of privacy budget and the privacyaccuracy tradeoff. PixelDP and Sage are designed from a pragmatic, systems perspective and illustrate that DP theory is powerful but requires adaptation to achieve practical guarantees for ML workloads. 

Friday, March 19 at 1112:30 ET  On Distributed Differential Privacy and Counting Distinct Elements Speaker: Lijie Chen, MIT View recording here Abstract: We study the setup where each of n users holds an element from a discrete set, and the goal is to count the number of distinct elements across all users, under the constraint of (eps, delta)differentially privacy: (1) In the noninteractive local setting, we prove that the (additive) error of any protocol is Omega(n) for any constant eps and for any delta inverse polynomial in n. (2) In the singlemessage shuffle setting, we prove a lower bound of n/polylog(n) on the error for any constant eps and for some delta inverse quasipolynomial in n. We do so by building on the momentmatching method from the literature on distribution estimation. (3) In the multimessage shuffle setting, we give a protocol with at most one message per user in expectation and with an error of sqrt{n} polylog(n) for any constant eps and for any delta inverse polynomial in n. Our protocol is also robustly shuffle private, and our error of sqrt{n} matches a known lower bound for such protocols. Our proof technique relies on a new notion, that we call dominated protocols, and which can also be used to obtain the first nontrivial lower bounds against multimessage shuffle protocols for the wellstudied problems of selection and learning parity. Our lower bound for estimating the number of distinct elements provides the first omega(sqrt{n}) separation between global sensitivity and error in local differential privacy, thus answering an open question of Vadhan (2017). We also provide a simple construction that gives n/polylog(n) separation between global sensitivity and error in twoparty differential privacy, thereby answering an open question of McGregor et al. (2011). This is joint work with Badih Ghazi, Ravi Kumar, and Pasin Manurangsi from Google Research. 

Friday, March 12 at 1112:30 ET  Algorithmic Challenges in Efficient Training of Private (Deep) Language Models Speaker: Janardhan Kulkarni, Microsoft Research View recording here Abstract: Many attacks have shown that deep learning models trained on private data of users can leak sensitive information of the users. Differential Privacy is a provable way to prevent such attacks. However, training deep learning models using DP introduces several new challenges both in terms of privacy vs accuracy tradeoffs and in the resource cost of the process. In this talk, I will highlight some of the problems we encountered, our solutions for resolving them and mention many important open problems. 

Friday, March 5 at 1112:30 ET  Hiding Among the Clones: A Simple and Nearly Optimal Analysis of Privacy Amplification by Shuffling Speaker: Audra McMillan, Apple View recording here Abstract: Recent work of Erlingsson, Feldman, Mironov, Raghunathan, Talwar, and Thakurta [EFMRTT19] demonstrates that random shuffling amplifies differential privacy guarantees of locally randomized data. Such amplification implies substantially stronger privacy guarantees for systems in which data is contributed anonymously [BEMMRLRKTS17] and has led to significant interest in the shuffle model of privacy [CSUZZ19, EFMRTT19]. In this talk, we will discuss a new result on privacy amplification by shuffling, which achieves the asymptotically optimal dependence in the local privacy parameter. Our result is based on a new proof strategy which is simpler than previous approaches, and extends to approximate differential privacy with nearly the same guarantees. We'll discuss this proof strategy, the extension to approximate differential privacy, and time permitting, some of the implications of this result. 

Friday, February 26 at 1112:30 ET  Sampleefficient proper PAC learning with approximate differential privacy Speaker: Noah Golowich, MIT View recording here Abstract: An exciting recent development in the theory of differentially private machine learning is the connection between private learning and online learning, and in particular the result that a binary hypothesis class is privately learnable if and only if it is online learnable (i.e., has finite Littlestone dimension). In this talk I will discuss our work strengthening various aspects of the result that online learning implies private learning: first, we show that the sample complexity of properly learning a class of Littlestone dimension d with approximate differential privacy is Õ(d^6), ignoring privacy and accuracy parameters. This result answers a question of Bun et al. (2020) by improving upon their upper bound of 2^O(d) on the sample complexity. Prior to our work, finiteness of the sample complexity for privately learning a class of finite Littlestone dimension was only known for improper private learners, and the fact that our learner is proper answers another question of Bun et al., which was also asked by Bousquet et al. (2020). Using machinery developed by Bousquet et al., we also show that the sample complexity of sanitizing a binary hypothesis class is at most polynomial in its Littlestone dimension and dual Littlestone dimension. This implies that a class is sanitizable if and only if it has finite Littlestone dimension. An important ingredient of our proofs is a new property of binary hypothesis classes that we call irreducibility, which may be of independent interest. 

Friday, February 19 at 1112:30 ET  Leveraging Heuristics for Private Synthetic Data Release Speaker: Steven Wu, CMU View recording here Abstract: This talk will focus on differentially private synthetic dataa privatized version of the dataset that consists of fake data records and that approximates the real dataset on important statistical properties of interest. I will present our recent results on private synthetic data that leverage practical optimization heuristics to circumvent the computational bottleneck in existing work. Our techniques are motivated by a modular, gametheoretic framework, which can flexibly work with methods such as integer program solvers and deep generative models. 

Friday, February 12 at 1112:30 ET  Towards Good Statistical Inference from Differentially Private Data Speaker: Ruobin Gong, Rutgers University View recording here Abstract: Differential privacy (DP) brings provability and transparency to statistical disclosure limitation. When data users migrate their analysis to private data, there is no guarantee that a statistical model, otherwise good for nonprivate data, will still produce trustworthy conclusions. This talk contemplates two challenges faced by data users to draw good statistical inference from private data releases. When the DP mechanism is transparent, I discuss how approximate computation techniques (Monte Carlo EM, approximate Bayesian computation) can be systematically adapted to produce exact inference with respect to the joint specification of the intended model and the DP mechanism. In the presence of mandated invariants which the data curator must observe, I advocate for the congenial design of the DP mechanism via standard probabilistic conditioning on the invariant margins, as an alternative to optimizationbased postprocessing. This proposal preserves both the privacy guarantee of the output and its statistical intelligibility. A demonstration of restricted contingency table privatization is performed via a Markov chain algorithm. 

Friday, February 5 at 1112:30 ET  Private Mean Estimation of HeavyTailed Distributions Speaker: Vikrant Singhal Abstract: We give new upper and lower bounds on the minimax sample complexity of differentially private mean estimation of distributions with bounded $k$th moments. Roughly speaking, in the univariate case, we show that $$n = \Theta\left(\frac{1}{\alpha^2} + \frac{1}{\alpha^{\frac{k}{k1}}\varepsilon}\right)$$ samples are necessary and sufficient to estimate the mean to $\alpha$accuracy under $\varepsilon$differential privacy, or any of its common relaxations. This result demonstrates a qualitatively different behavior compared to estimation absent privacy constraints, for which the sample complexity is identical for all $k \geq 2$. We also give algorithms for the multivariate setting whose sample complexity is a factor of $O(d)$ larger than the univariate case. 

Monday, January 25 at 3PM ET 
Local Differential Privacy is Equivalent to the Contraction of HockeyStick Divergence Abstract: In this talk, we first show that the approximate local differential privacy (LDP) can be equivalently expressed in terms of the contraction coefficient of “HockeyStick Divergence.” This result then enables us to relate the LDP guarantees of randomized mechanisms to contraction properties of any arbitrary fdivergences. This is in fact a generalization (and improvement) of the main result in [Duchi, Jordan and Wainwright, FOCS’13] that led to informationtheoretic lower bounds for private minimax estimation problems only in the high privacy regime (i.e., epsilon<1 and delta =0). Our result allows us to drop the highprivacy assumption and obtain lower bounds for any epsilon and delta. Time permitting, I will also discuss some implications for the private Bayesian estimation problems. This is a work in progress and based on a collaboration with Maryam Aliakbarpour (UMass) and Flavio Calmon (Harvard). 