Author: Paul Lekas and Anton Van Seventer
Source: Techdirt
The release of a bipartisan draft of the American Privacy Rights Act (APRA) reinvigorated the effort to pass a federal consumer privacy law, only to sputter and stall amid concerns raised from across the political spectrum. All that is gone, however, is not forgotten: it is only a matter of time before Congress returns its institutional gaze to consumer privacy. When it does, Congress should pay careful attention to the implications of the APRA’s policy choices for AI development.
The APRA proposed to regulate AI development and use in two key ways. First, it required impact assessments and audits on algorithms used to make “consequential decisions” in areas such as housing, employment, healthcare, insurance, and credit, and provided consumers with rights to opt out of the use of such algorithms. House drafters subsequently struck these provisions. Second, perhaps more importantly – and the focus of this article – the APRA effectively prohibited the use of personal data to train multipurpose AI models. This prohibition is not explicit in the APRA text. Rather, it is a direct implication of the “data minimization” principle that serves as the bedrock of the entire bill.
Data Minimization as a Framework for Consumer Privacy
Data minimization is the principle that data collection should be limited to only what is required to fulfill a specific purpose. It has both procedural and substantive components. Procedural data minimization, which is a hallmark of both European Union and United States privacy law, focuses on disclosure and consumer consent. Virginia’s Consumer Data Protection Act, for example, requires data collected and processed to be “adequate, relevant, and reasonably necessary” for its purposes as disclosed to the consumer. Privacy statutes modeled on procedural data minimization might make it difficult to process certain kinds of personal information, but ultimately, with sufficient evidence of disclosure, they tend to remain agnostic about the data’s ultimate use.
Substantive data minimization goes further by limiting the ability of controllers to use consumer data for purposes beyond those expressly permitted under the law. Maryland’s Online Data Privacy Act, enacted earlier this year, is an example of this. The Maryland law permits covered businesses to collect, process or share sensitive data when it is “reasonably necessary and proportionate to provide or maintain a specific product or service requested by the consumer.” Although Maryland permits consumers to consent to additional uses, practices that are by default legal under Virginia’s and similar statutes — such as a local boat builder using data on its current customers’ employment or hobbies to predict who else in the area is likely to be interested in its business — would generally not be permissible in Maryland.
The APRA adopts a substantive data minimization approach, but it goes further than Maryland. The APRA mandates that covered entities shall not collect or process covered data “beyond what is necessary, proportionate, and limited to provide or maintain a specific product or service requested by the individual to whom the data pertains,” or alternatively “for a purpose other than those expressly permitted.” The latter category would then permit data to be used only for purposes explicitly authorized in the legislation — described as “permitted purposes” — but does not permit consumers to consent to additional uses, or even to several such “permitted purposes” at the same time.
The APRA proposes what is essentially a white list approach to data collection and processing. It does not permit personal data to be used for a range of socially beneficial purposes that are essential to a functioning economy, such as detecting and preventing identity theft, fraud, and harassment. And because the development of AI models is not among the permitted purposes, no personal data could be used to train AI models – even if consumers were to consent and even if the information was never disclosed. In contrast, current U.S. laws permit collection and processing of personal data subject to a series of risk-based regulations.
The substantive data minimization approach reflected in the APRA represents a potential sea change in norms for consumer privacy law in the United States. Each of the 19 state consumer privacy laws now in effect has by and large adopted a procedural data minimization approach in which data collection and processing is presumptively permissible. They have generally avoided substantive minimization restrictions. Even Maryland, the most stringent of these, has stopped well short of the APRA’s proposal to restrict data collection and processing to only those uses specified in the bill itself.
The GDPR’s Minimization Approach
The APRA’s approach to data minimization has more in common with the EU General Data Protection Regulation (GDPR) than with U.S. state privacy laws. The GDPR follows a substantive data minimization model, allowing collection only for a set of “specified, explicit, and legitimate” purposes. Unlike under the APRA, however, a data controller under the GDPR may use data if a consumer provides affirmative express consent. As such, compliance practitioners typically advise companies operating in Europe that intend to “reuse” data for multiple purposes, such as to train multimodal AI models, to simply obtain a consumer’s consent to use any data sets that would undergird future technological development of these models.[1]
Even with the permission to use data pursuant to consumer consent, the GDPR framework has been widely criticized for slowing innovation that relies on data. Some have attributed the slow pace of European AI development, compared to the United States and China, to the GDPR’s restriction of data use. Notably, enforcement actions by EU regulators, as well as general uncertainty over the legality of training multimodal AI under the GDPR, have already forced even large companies operating in the EU to stop offering their consumer AI applications within the jurisdiction altogether.
How the APRA Would Cut Off AI Development
The APRA, if enacted in its current form, would have a starker impact on AI development than even the GDPR. This is because the APRA would not permit any “reuse” of data, nor permit the use of data for any purpose outside the bill’s white list, even in cases where a consumer affirmatively consents.
That policy choice moves the APRA from the GDPR’s already restrictive framework into a new kind of exclusively substantive privacy regulation that will hamstring AI development. Multifaceted requests by end users form the foundation of generative AI. Flexibility in consumer applications is these models’ purpose and promise. If data collected and processed for one purpose may never be reused for another purpose, regardless of consumer consent or even clear criteria, training and offering multipurpose generative AI applications are rendered facially illegal. The AI developer that could comply with the GDPR by obtaining affirmative consent in order to enable the reuse of data for multiple productive applications could not do so under the APRA.
Training entire AI models to serve only one purpose would have negative effects on both safety and reliability. Responsible AI practices include a multitude of safeguards that build off each other and their underlying data set to optimize machine learning applications for accuracy, consumer experience, and even data minimization itself. These improvements would not be feasible if every model used for a new purpose were forced to “start from scratch.” For example, filtering for inaccurate data and efforts to avoid duplicative datasets, both of which depend on well-developed training data, would be rendered ineffective. Consumers would also need to reset preferences, parameters and data output safeguards for each model, leading to user fatigue.
Moreover, the APRA approach would prevent developers from building AI tools designed to enhance privacy. For example, creating synthetic data from well-developed datasets and substituting it for consumers’ personal data, a privacy-protective goal, is impossible in the absence of well-developed underlying data. Paradoxically, consumers’ personal data would instead need to be duplicated to serve each model and each purpose.
The sole provision in the APRA that would generally permit personal data to be used in technological development is a specific permitted purpose that allows covered entities to “develop or enhance a product or service.” This subsection, however, applies only to de-identified data. Filtering out all personal data from AI training data sets presents an impossible challenge at scale. Models are not capable of distinguishing whether, for example, a word is a name, or what data may be linked to it. Implementing filters attempting to weed out all personal data from a training data set would inevitably also remove large swaths of non-personal data – a phenomenon known as “false positives.” High false positive rates are especially detrimental to training data sets because they strip out large amounts of valuable training data that are not personal data, leading to unpredictable and potentially biased results.
Even if this were feasible, filtering all personal data out from training data would itself lower the quality of the data set, further biasing outputs. Furthermore, many AI models include anti-bias output safeguards that would also be diminished in the absence of the data they use to control for bias. Thus, a lack of relevant training data can bias outputs, yet so too can an inherently biased model whose output safeguards are rendered ineffective because they lack the necessary personal information to accomplish their task. Unfortunately, both of these harms are almost certain to materialize under a regime that wholly excludes personal information from training data.
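The false-positive problem described above can be illustrated with a minimal Python sketch. It is an illustration only, not a description of any real filtering pipeline: the name list, the sample sentences, and the functions `looks_personal` and `false_positive_rate` are all hypothetical and chosen for this example.

```python
import re

# Hypothetical name list. Many real given names are also ordinary words.
NAME_LIST = {"rose", "mark", "grace", "bill", "will", "may"}


def looks_personal(sentence: str) -> bool:
    """Flag a sentence if any lowercase token matches the hypothetical name list."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return any(tok in NAME_LIST for tok in tokens)


# Tiny hand-labeled sample: (text, actually_contains_personal_data).
# The labels are assumptions made for illustration.
SAMPLE = [
    ("Rose Miller lives at 12 Elm Street.", True),
    ("Prices rose sharply in the second quarter.", False),
    ("The House passed the bill on Tuesday.", False),
    ("Mark the checkbox to continue.", False),
    ("Water boils at 100 degrees Celsius.", False),
]


def false_positive_rate(sample) -> float:
    """Share of non-personal sentences that the filter would wrongly remove."""
    negatives = [text for text, is_personal in sample if not is_personal]
    flagged = [text for text in negatives if looks_personal(text)]
    return len(flagged) / len(negatives)


if __name__ == "__main__":
    print(f"False positive rate: {false_positive_rate(SAMPLE):.2f}")
```

On this toy sample, three of the four sentences that contain no personal data are flagged and would be dropped, because ordinary words such as “rose,” “bill,” and “mark” double as names. A production filter would be more sophisticated, but the same trade-off between catching personal data and discarding useful text applies.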
Where to Go From Here
As the APRA falters and Congress looks forward to a likely redraft of federal privacy legislation, it is critical to avoid mothballing domestic AI development with a poorly scoped overhaul of U.S. privacy norms. For several years preceding the APRA’s introduction, privacy advocates advanced a narrative that the U.S. experiment with “notice and choice,” or notifying consumers and presenting an opportunity to opt out of data collection, has failed to protect consumer data. Improving this framework in a way that gives consumers greater control over their data is possible, and even desirable, via federal legislation. Yet a framework built around permitting only predetermined uses of data would have unintended, unforeseen and potentially disastrous consequences both for domestic technological development and U.S. competitiveness on the world stage.
[1] The GDPR does not generally permit data collected for one permitted purpose to be used for others, except subject to a series of vague criteria. These criteria include 1) a link between the new and original purpose, 2) the context of collection, “in particular regarding the relationship between data subjects and the controller,” 3) the nature and sensitivity of the personal data, 4) the possible consequences of the new processing to data subjects, and 5) appropriate technical safeguards. The GDPR also specifically articulates that these criteria may not include contextual considerations, rendering compliance uncertain in the majority of cases.
Paul Lekas is Senior Vice President and Head of Global Public Policy and Government Affairs at the Software & Information Industry Association (SIIA). Anton van Seventer is Counsel for Privacy and Data Policy at SIIA.