Links / 2022-09-23

My second email to friends on open+ML, originally sent to a private mailing list and archived here with only light edits.

Note that I am learning as I go here, so you may see older links, not just things that happened this week.

If you’re just catching up…

This space is moving as fast as anything I can remember. So you can be forgiven if you were doing your actual job, looked up, and were completely lost.

This article on prompts is a good place to start if you’ve been busy the past few months and are trying to understand the tech and how it is being used. It also includes a good roundup of various responsive initiatives.

OSI-open speech-to-text model from OpenAI

Despite the name, OpenAI has not been very open as a general rule: it either does not release things at all, or releases them under licenses that are not open as we understand the term. This week they released a speech-to-text model under an actual MIT license: https://openai.com/blog/whisper/

The model cards for this release indicate few ethical risks. So this may be a change in license strategy for OpenAI, or simply a change for this particular low-risk model.

(If you’re curious about OpenAI’s history in this area, this article about BLOOM starts with a good summary of OpenAI’s history on ‘open’.)

Guide to the RAIL licenses

I was still on summer sabbatical when the RAIL “ethical AI” licenses were released, and still haven’t had a chance to read them, but here’s their guide to them. Even if you’re not a fan of non-OSI-open licenses, it is worth reading for the technical analysis (e.g., distinguishing between the various places where license restrictions might bite), and simply as a pretty good model for how a new class of public license can be explained. There’s also a more academic-style paper on them.

Striking note, BTW: I see zero familiar names in any of this work. Is that a bad sign for this community (broadly: not just this list, but FSF, FSF-E, OSI, etc.) and the success (or lack thereof) of our outreach? Or just a natural sign of growth?

The UK published a consultation on IP and copyright that generally says “we don’t need to do anything, yet”. As one exception, it says it plans to create a copyright exception permitting text and data mining, which would likely make ML model creation more clearly legal under copyright in the UK. (Training may still be subject to non-copyright rules, and outputs from the model would still be subject to copyright…)

For an older analysis of how the EU’s similar exception works, Felix Reda’s post on Copilot is a useful primer, concluding that Copilot’s training was not infringing in the EU. I would love to hear from any EU attorneys how this and the database directive interact!

Introduction to participation in AI governance

OSI-open and FSF-free often imply public or transparent governance, but neither organization has ever formally included that in its definition. Many AI orgs are at least talking about participation, and pondering what a formal definition of good practices might look like.

Here’s a good primer about some of the issues, with authors from Mozilla and Google. While it’s a good framework, I do have to wonder if there will be a race-to-the-bottom in this area as in others: projects with (B?)DFL-style leadership will almost certainly iterate more quickly.

An artist registered a US copyright in a graphic novel (example page), under their own name, noting in the application that they’d used AI to generate the art. The artist also said, on Instagram, that a lawyer friend had told them that this was “precedent-setting”.

Facially, this is uninteresting: it’s an artist, using a tool, and so not at odds with recent Copyright Office statements rejecting a filing “in the name of” an AI.

I share it anyway because it is going to be common to confuse “I did it with ML” and “the ML did it”, apparently in this case for both the artist and a lawyer. Don’t make that mistake!

Reminder: good lawyers know that copyright is not the only legal issue here (privacy, rights of publicity, etc.), but many clients do not. They’re used to copyright being the primary regulatory modality for software.

Example: an apparent medical privacy violation that was surfaced when a curious patient used https://haveibeentrained.com/ , which searches a commonly-used training set. Some smart programmer friends went immediately to a copyright analysis, and were surprised when I said other laws might apply! Remind clients appropriately :)


  • Getty bans AI-authored uploads. The reason given is copyright, but I suspect lots of search tools are going to be simply overwhelmed with ML-authored content, and this may give Getty some breathing room to figure out how to handle that from a search and UX perspective.
  • Critical PyTorch project spins out of Meta, lands at LF.
  • The Content Authenticity Initiative released (a few months back) open source tools to help validate image authenticity. This is not a specifically anti-ML initiative, but it will obviously become more relevant if deepfake tools become more accessible and effective.
  • Wikimedia’s EU policy groups (a mix of paid staff and volunteers, mostly independent of WMF’s US HQ) held an online essay-debate in March on the use of CC-licensed photos for AI training.