power and (de)centralization
In a conversation with a friend this week, I was finally able to express something I’ve been wrestling with for a while: whether open’s core assumptions about the impact of distribution on power still hold.
To somewhat oversimplify, one of the foundational assumptions of the original generation of free/open thinking was that distributing power away from large corporations to individuals at “the edge” was a good thing.
One way to think about the next generation of open(ish) activism is to go deep on whether this assumption still holds, or at least under what conditions it does and doesn’t.
As just two examples: if we take “garbage in, garbage out” more seriously (say, because of the use of such tools for sexual harassment, or for racist imprisonment), what controls might we as a society want on training and usage? Similarly, how might governments enforce such controls if computation is centralized—or not?
None of this is to say we should be centralizing; but the prior that decentralization is always a net good—which was an underpinning of traditional free/open—now at least needs to be actively defended by advocates of open. (James Boyle attempts that defense, at least partially, but I’m not fully convinced.)
Data on model usage
The first section of these talk slides from Nazneen Rajani on usage of models in the wild (as viewed through the lens of Hugging Face) is extremely rich with data. A few highlights:
- The number of models on HF is exploding, with 100X growth since mid-2020.
- We keep talking about models that generate images and text, but three of the top five model categories are classification/recognition, not generation, so those use cases are still a big deal (slide 17).
- Like many things, the winners win a lot: 0.2% of models make up 80% of usage (slide 22).
- Despite Hugging Face’s noble efforts to integrate documentation into their user experience, “newer models are less likely to have model cards” (slide 52). This is disappointing to me, since model cards had seemed to be a good step towards (otherwise difficult) open(ish) transparency in this space. And 80+% of models lack information on data or model evaluation 😬 (slide 71).
- However, a randomized controlled trial suggests that introducing documentation increases usage, which is good (slides 54-69).
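As an aside, the “0.2% of models make up 80% of usage” concentration statistic above is easy to compute for any usage distribution. Here is a toy sketch with made-up numbers (not Hugging Face data); `share_of_models_for_usage` is a hypothetical helper name, not anything from the talk:

```python
def share_of_models_for_usage(downloads, target=0.80):
    """Return the smallest fraction of models (by count) whose combined
    downloads reach `target` share of total downloads."""
    counts = sorted(downloads, reverse=True)  # most-used models first
    total = sum(counts)
    running = 0
    for i, c in enumerate(counts, start=1):
        running += c
        if running >= target * total:
            return i / len(counts)
    return 1.0

# A power-law-ish toy distribution: a couple of hits, then a long tail.
toy = [10_000, 5_000, 200, 100, 50, 20, 10, 5, 5, 5]
print(share_of_models_for_usage(toy))  # → 0.2 (top 20% cover 80% of usage)
```

With a distribution as skewed as the real one, that fraction drops to the 0.2% the slides report.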
The second section of the talk (on evaluation of models) also looks amazing, but is a bit harder to interpret without the talk audio.
(This may simply reflect Nazneen’s research interests, but I do find it interesting that data on usage by license was not pulled. Draw your own conclusions on that!)
Litigation continues to pick up steam, with two lawsuits since I last wrote.
Butterick on Stable Diffusion
The same team that is suing over Copilot is now also suing Stability AI (maker of Stable Diffusion), Midjourney, and DeviantArt.
Interestingly, this case directly alleges copyright infringement, and not just the copyright-management-information claims that were brought against Copilot. While I don’t know exactly why this is the case, I think it’s suggestive that the complaint links repeatedly to haveibeentrained.com—which uses the public nature of the LAION training set to show whether specific pieces of content are in the training set.
This sets up a dynamic that in my opinion is potentially quite negative—public information about data sets, which we should encourage for purposes of public learning and accountability, is here being used to attempt to shut down the entire venture. The incentive, therefore, could be for model creators to obfuscate and use only private data sets—reducing accountability and transparency. As CU Law’s Blake Reid points out in this good thread, for the moment (facing lack of action from US regulators) we’re attempting to address very deep policy questions with copyright tools—so I would expect we’ll see many such unintended outcomes.
Getty—in the UK
Getty Images, which is headquartered in Seattle, is suing Stability AI (the company behind Stable Diffusion) in the UK. Stability AI is UK-headquartered, which is presumably the rationale for bringing the case there. We don’t have filings yet; I’m looking forward to learning more about the angle this case takes under UK law.
I’ve seen it suggested that this is primarily the opening step in licensing negotiations, which makes some sense. Unlike the broad class in the Butterick lawsuits, Getty can settle for a licensing fee—which would be quite consistent with their existing business model.
the storm is here for moderators
This piece by two technologists in the New York Times, on the possibility of using LLMs to write lobbying letters to politicians, was widely panned by policy people. In short, the critique was that most of the hypothetical problems identified in the article either (1) already occur with humans (millions of fraudulent “letters” have already been filed in FCC proceedings, for example) or (2) already have solutions (policy offices deal with high volume already).
In contrast, this piece from Joe Reagle (a careful, thoughtful writer on Wikipedia and other online communities) has me spinning. Specifically, Joe says “the storm is already here” for moderators of online communities, particularly those with reputation systems, like Reddit. Unlike the reputations of lobbyists, these systems are optimized for high-volume, highly quantifiable throughput, with explicit reputational rewards (e.g., upvotes). These reputational rewards, if accrued to bots, can later be used for scams or other forms of influence-peddling. And even if a bot doesn’t earn reputation, its output can still help train the next iteration of the bot! So this essay is a worthy—and troubling—entry in the “what happens when we have bots that write legible text” canon.
Me on podcast
As usual, I enjoyed being on Go Time talking about IP—this time with a focus on the interplay of AI and IP.
- RAIL call for participation: This is more important than I have time to write about today, but RAIL has put out a call for participation across a variety of areas.
- Model and PyTorch optimizations: two long, interesting, highly-technical reads on the many different possibilities for further optimizing large transformer models and the interaction of PyTorch improvements and GPU/compute requirements. Takeaway: still lots of possibility for all of this to get faster and cheaper, which is good for open(ish).
- Information Commons and AIs: The Open Future folks have published a followup to their 2022 work on governance, commons, and training. Section 3 is particularly valuable as a summary of the state of the various levers/tools available for data commons governance.
- Wikipedia and ChatGPT: I want to write more on this, but in the meantime this is a great general-interest piece on the topic with a lot of links readers may find interesting. Those who want to go deeper may want to start with long-time Wikipedian Andrew Lih’s experiments.
I found this interview on radicalism, class, race, respectability, and philosophy in the work of MLK excellent. I share it here because it is provocative—in a good way—for anyone pondering how much to rock their own comfortable boats.