I attended a (successful) Tidelift all-hands this week, so short week for this newsletter even if very long week in open(ish)ML. Also, I liked the way I organized the newsletter last week but I haven't had time to make a template of it so... watch for it to be back next week :)
Litigation getting real
Search was heavily shaped by Google's early no-holds-barred litigation strategy against anyone who didn't want their content indexed. This created a decade's worth of pro-fair-use holdings in the US courts, allowing for many different uses of material on the web. Someone is going to have to play a similar role in defending training as a fair use—probably, given their leading position on the commercial side, Microsoft and OpenAI.
The time for that is apparently going to be on us soon:
- California lawyers are threatening to sue GitHub over Copilot (and Bradley Kuhn of the Software Freedom Conservancy says here they're still discussing the case with litigation counsel) and
- the RIAA is monitoring and making vaguely threatening noises about use of AI to re-use styles in music.
The core legal intuition is the same in AI training and the traditional search model. However, I think MS/OpenAI/whoever else is going to have a much harder time of it than Google did—the law hasn't changed, but this tech will be harder to explain to judges, and tech's reputation among US elites has changed. And the output is often going to be more clearly infringing (or at least arguably infringing) than early search.
Related: Copilot alternatives
This week saw another Copilot/OpenAI alternative open(ish) model announced, this time from carper.ai. (This follows the announce a few weeks back of the upcoming BigCode project, with related aims.) "Two projects are announced" is not the same as "two projects are sustainable and competitive against state-of-the-art proprietary models," but it's nevertheless interesting, and runs counter to the pessimism (which I shared) about open's ability to compete with Copilot when it was first announced.
Van Lindberg, sharp open licensing lawyer, has a long Twitter thread on whether AI models are protectable and what the implications are.
I'm a little less skeptical than he is that a model is copyrightable; there are a lot of choices in model training design, so I think training is less like a photocopier and more like a high-end camera, where the creator has a lot of flexibility and artistic choice. Among other things: what selection/curation is done from the universe of possible training data? What preprocessing do you do? What model tuning is done? Etc. (For an example of what can be done to tune a model, Google's "FLAN" announcement from this week goes into a lot of detail.) But it's definitely an open question and something those pondering open-for-ML need to think about.
Open to not-so-open?
The licensing challenge
Copyleft licenses in traditional free/open software licensing create a level playing field in part because the core thing being licensed (the source code) is hard for a bad-faith actor to recreate. So if you can create a good license for a good project, people will stick with it.
Models, on the other hand, can be recreated from scratch by large players (assuming they have data, code, and training hardware). Don't like the license on model 1.0? Just re-train and voila, model 1.1, sans license—unless the license also governs all the training code and/or the training data. This is something to keep an eye on when evaluating model licenses, and when figuring out commitment of a company to future releases under open licenses. In related news...
Stability.ai, stable-diffusion 1.5, and "truly open"
Last week, I said "The demands of [venture capital return on investment] tend to push towards enclosure, so it'll be important to monitor [the impact of stability.ai's $100M funding round]."
Perhaps unrelated, but on Oct. 20th runway.ml (a stability.ai partner) released model weights for version 1.5 of stable-diffusion, and stability.ai apparently promptly told huggingface to take them down. Stability's CIO then put up a blog post/substack saying that the company was going to slow releases until it figures out how to do open responsibly. While I think "going slow" is certainly reasonable, I really want to interrogate what this means:
We are forming an open source committee to decide on major issues like cleaning data, NSFW policies and formal guidelines for model release. This framework has a long history in open source and it's worked tremendously well.
I'm... really not sure what this means? Release teams have a very long history in open source, but there's basically no precedent in open source for anything with trust-and-safety overtones like NSFW policies.
Help us make AI truly open, rather than open in name only.
As I said last week in my open(ish) essay, we can no longer move fast without taking more responsibility for the "break things" that inevitably results. But unfortunately it's not clear what stability.ai means by "truly open" here—I hope they'll clear that up in the near future.
Transparency and auditability
Authenticity tooling at forefront of Adobe plans?
Adobe and Microsoft have both announced incorporation of image-focused AI in their commercial, non-alpha products in the past 11 days. Adobe's announce was interesting to me in part because of how much it emphasized Adobe's support of the Content Authenticity Initiative (mentioned here last month). It will be interesting to see what other technical counterweights like this emerge as we move forward.
A group of folks are sponsoring a new set of bounties (partially analogous to bug bounties) to help identify bias issues in publicly-accessible AI systems, building on similar work of the past few years. One wonders whether "open(ish)" in ML should use language like GPL v3's to protect auditors from retribution.
- Oct. 24: "Generative and Open Source AI", Emad Mostaque of Stability.AI, live in SF and streamed
- Oct. 27: TrustworthyML symposium
- "Generative AI" seems to be a growing catchphrase; here's a map of the space. Would be interesting to see how many of the models in it are open(ish) or have open(ish) competitors.
- Data still matters a lot; here's Facebook releasing an open translation model for Hokkien—which has 20 million speakers but basically isn't written. Exactly how they got the audio is a little unclear to me.