I've started on a light FAQ. The main thing in there so far is "why openish"? The short version is that I think ethics and collaboration are important, so I prefer "open" to closed, but also after 20 years I am bored of arguing the binary question "is this open". So we'll mostly not have that argument here.
Doing this together!
Collaboration is fun! Please use the ghost comments to ask questions, suggest material for next week, etc. (This is very much a part-time side project, so your suggestions might help a lot.) Or if ghost comments aren't your thing, just email me directly. I'm pretty sure that I'm on a first-name basis with all subscribers right now; if I'm not, I'd like to be—introduce yourself!
ML 101(ish) things
Getting the terminology right
I told a group of lawyers yesterday that it doesn't make sense to say "Copilot is (il)legal", because different parts of Copilot have to be evaluated independently. For purposes of that conversation, I called out three different things to evaluate:
- The training of the model – when copying lots of material in order to train, what rules might affect that (copyright, privacy, competition, etc.)?
- The distribution of the model – if you're making the model available to the public or to partners, what rules govern that, including the distinct possibility that the model is not protectable under US copyright law, only as a trade secret?
- The creation of outputs – if you use the model to make a new work (art, code, etc.), what regulations might apply to that work? For example, if you use Copilot in a cleanroom to re-implement an API, that may be very problematic for you even if Copilot's training was perfectly legal for GitHub.
Similarly, if you're evaluating whether ML is "open", you have to evaluate the training code and the trained model separately. For example, the recently announced BigCode project (link below) has Apache-licensed code, but the models produced by that Apache-licensed code are under the not-OSI-open-but-ethical RAIL license.
We're all going to need to learn (and harder: consistently apply) the new terminology to avoid confusion, because the old open terminology maps closely onto the new concepts, but not exactly.
Social and collaborative bits
The current wave of ML is about making art because art is cool and fun. This is (1) such a relief from the past few years of crypto "art" speculation and (2) willllldly popular:
Some good percentage of those users are probably software developers of various sorts, which feeds into:
Network effect of open
Traditional open has been very good at creating strong network effects around successful pieces of software. AI software has mostly not had such effects, in part because of hardware/training costs, and in part because many of the leading researchers have ethical concerns about widespread adoption. This article suggests that Stable Diffusion is good enough, and open enough, that researchers and hackers are flooding to it:
"The challenge OpenAI are facing is that they're not just competing against the team behind Stable Diffusion, they're competing against thousands of researchers and engineers who are building new tools on top of Stable Diffusion," [Simon] Willison told The Register.
"The rate of innovation there in just the last five weeks has been extraordinary. DALL-E is a powerful piece of software but it's only being improved by OpenAI themselves. It's hard to see how they'll be able to keep up."
Is this the same flood of hacker interest that propelled the Linux kernel to a tipping point early in modern open? If it is, look out. (Or: is this a sign of an ethical race to the bottom? If it is, also: look out.)
Related tools: the easier it is to run code, (1) the easier it is for people to play with it and (2) the better it is for open/part-time developers. onnx.ai is an Apache-licensed LF toolkit to make models more shareable/interoperable. If successful, this will be yet another step towards increased velocity in open ML.
Mostly legal bits
End of NO WARRANTY?
ML is likely to encourage a lot more regulation of software, which in turn is likely to make the "open" legal situation a lot more complex. This week we got a new EU proposed regulation on AI product liability. The rationale:
The specific characteristics of AI, including complexity, autonomy and opacity... may make it difficult or prohibitively expensive for victims to identify the liable person and prove the requirements for a successful liability claim. In particular, when claiming compensation, victims could incur very high up-front costs and face significantly longer legal proceedings, compared to cases not involving AI. (source: explanatory memorandum)
Replace "AI" with "software" and much of that still holds true—see, for example, the extensive software expertise required in the Toyota braking cases.
The proposal also says:
In order not to hamper innovation or research, this Directive should not apply to free and open-source software developed or supplied outside the course of a commercial activity. This is in particular the case for software...that is openly shared and freely accessible, usable, modifiable and redistributable. (source: explanatory memorandum; emphasis mine)
Much AI code is not "freely modifiable" under the OSI definition, and much FOSS (especially in AI, given the still-high cost of training) is now developed in the course of commercial activity—so how broadly this safe harbor will apply is unclear.
The actual legal mechanism is a rebuttable presumption of liability. I've found no good commentary on this since the draft was published; I will probably come back to that next week either with my own commentary, or hopefully an actual litigator will have done it for me by then :)
Related tool: https://interpret.ml is an MIT-licensed tool (using GitHub's Minimum Viable Governance model!) for building 'interpretable' AI; interpretability, already required by other EU regulations, is going to be pretty important to handling the AI product liability act's rebuttable presumption of liability.
Former Wikipedia legal intern, now law professor, Tiffany C. Li has written a paper on "algorithmic destruction". tldr: if privacy law requires deletion of training material, but a trained model has the 'shadow' of the now-deleted training material, what responsibilities do we have (legally or otherwise) to retrain the model, delete old copies of the model, etc.? The answer is (of course) "it is complicated".
Specifically for open ML, I wonder if this confounds open software's expectation that old binaries do not create legal liability, creating new work/obligations in open ML that don't exist in open software.
I'm going to try to avoid "ooh, this is new/nifty" in the newsletter, especially for demos that aren't even the slightest bit open, but (1) lots of it is cool, and we need reminders of joy/awe in our tech; (2) lots of it is terrifying, and we need to be reminded of the ethical implications of all this; (3) even if not in any way open now, some variants of it may well be open soon and the sooner we can wrap our heads around it the better. And it's a nice palate-cleanser after a lot of dense words :) So, some demos that came across my radar this week:
Second, wild progress on specific image generation—not just a statue of a dog but a statue of a specific dog. Obvious implications for deepfakes.
Make 3d objects from 2d training material:
This is fun for me to write, I hope fun for you to read! Let me know what you think :) Feel free, of course, to share and forward.
Bonus: cover image prompt
midjourney, "regulators hard at work, in watercolor".