Brace yourself, it’s going to be a long one.
Because my goal here is to help myself (and you, dear readers!) puzzle out what open means now, let’s start with this: the Open Knowledge Foundation is going to relaunch the Open Data Definition.
Because my goal here is not to bore you to tears, I've placed my deeper thoughts on this (and one other question) as micro-essays at the end of the newsletter.
(All streaming unless otherwise noted)
- Top copyright scholar Pam Samuelson will lecture on Generative AI and copyright law on March 26th.
- Relaunching the Open Data Definition at Mozilla Festival on March 23rd
In this section: what values have helped define open? are we seeing them in ML?
Lowers barriers to participation
Lots of small, cumulative performance and availability wins this week.
- Stanford’s Alpaca Q&A model, trained from LLaMA, suggests that tuning foundation models into small, specific-purpose models can be both effective and cheap. It was reproduced in open code on consumer hardware (Alpaca-LoRA) in days.
- WebGPU, a new browser standard, can now host model inference entirely in the browser! Requires new browser and modern PC hardware, but still a big step in ease of distribution and implementation.
- Simon Willison dives into the LLaMA paper, and guesses that it is possible to train a GPT-3 class model for less than $100K. Compare to state of the art last year, where that might have been a few million.
Makes governance accessible
When I wrote that open “makes governance accessible”, I meant mostly that contributor-users could participate. But we continue to see that the combination of public data and public laws mean that governance tools are available very broadly in this space:
This is in some ways closer to software patents than software copyright: third-parties can come after you, so getting a permissive license from the authors is not enough. (The difference, of course, is that random members of the public generally don’t own software patent rights, while they often have ML-regulatory relevant rights, like PII.)
The big news in AI legibility this week is that OpenAI regrets having been “open”, and launched GPT-4 without even the most basic information about what its training data sets were—citing safety and competitive reasons.
Since the AI safety community overwhelmingly thinks that disclosure of this information is an important part of the public's ability to evaluate a model, this reticence to share basic information has prompted jokes, lots of confusion about OpenAI’s tax status (about which more below), and one early donor threatening to sue.
Hopes to make the world better
This week a new survey was released on AI experts and what they expect from the industry. 31% of surveyed AI researchers think that the impact of ML will be on balance bad or extremely bad. I… can’t imagine any survey of open source developers, ever, having that number above single digits. It may be wrong, but it is a very striking result.
- SaaS trend continues: The primary competitor to open in ML will likely be SaaS, largely but not completely based on open source code, delivered to users via a metered API. This week’s ML entry in this open<->SaaS pendulum is Google, which will make its models available as SaaS.
- Labor power is still real: Work at OpenAI, but don’t like OpenAI’s switch away from open? There are plenty of jobs available, including a blank check at Stability.
Open software was defined in part by, and in turn helped define, new techniques in software development. What parallels are happening in ML?
I expected ML to lead to new forms of collaboration, just like open software did, but did not entirely expect it might be collaboration… with the ML?
- "AI in a loop": LangChain is an MIT-licensed project to bridge text LLMs to real-world interactions, like web searches. This blog post is great on how it works: in essence the model is prompted to explain what it thinks a good next step might be, pause, and then use something else (either a person or API) to take that next step. This allows real-world interactivity that models can’t do by themselves, including potentially by including a human in the loop.
- Pair Programming is a practice where programmers work with a partner while programming; advocates believe it increases productivity and knowledge transfer. It’s long been somewhat controversial, partially because some people don’t like it, and partially because it’s obviously time-consuming. I am increasingly seeing programmers refer to what they do with Copilot and similar tools as pair programming, just with an ML. (example) This does make me wonder what other "pair" activities that we currently assume require two humans will transition, in whole or in part, to ML.
On the flip side: what happens when, by pairing in private with computers, you end up impoverishing various commons that rely on public collaboration? Researchers are starting to look at the impact of ChatGPT on Stack Overflow, and whether it is seeing a decrease or change in questions as a result of ChatGPT. Still too early to tell, but an interesting question to ask.
We don’t usually talk much about censorship in open software. But it will be relevant in machine learning, because from many government and regulatory perspective certain types of censorship are improvements. (We had to delete child pornography at Wikipedia, for example, and lots of privacy-law enforcement treads that line as well.)
So, here’s a new technique for, essentially, censoring models by removing an entire concept or image.
How the California-tech-libertarian ethos intersects with these kinds of techniques will be interesting to watch—early signs point to "not well!"
- The Author’s Guild released a “model clause” for author contracts that prohibits AI training. There are a few EU lawyers reading this; I’d be very curious if you feel this meets muster as a Text and Data Mining Exception opt-out!
- This tweet is something I've struggled to articulate. But it is very important! We need to think about not just hypothetical problems (which are admittedly more fun) but also very much the ones we're already causing.
I enjoyed this llama.cpp community discussion of the potential interest in open-source inference “at the edge”. It is old-school open in the best sense: simply being optimistic that “hey, with a little code, we can change lots of things!” (And, indeed, llama.cpp is underlying tech in several of this week’s earlier notes, especially around performance—and today, on a weekend, the "repo was buzzing".)
Creating new things
This week in things that might not have been possible before ML:
- fun: It’s neat to see OSM unlock new ways to interact by using LLMs to make it simpler to query.
- not fun: I’ve written before about what happens to moderators when faced with ML-amplified volume; this week’s similar example is “one-click lawsuits”. I have no idea how court systems can possibly handle this sort of influx.
Changing regulatory landscape
- Reducing risk by reducing transparency: Copyright law is regulation, so making it harder to sue yourself by being less transparent is a way of evading regulation. It’s not clear that OpenAI is evading copyright litigation intentionally, but it’s certainly going to weigh into the risk analysis for others.
- UK policy? the UK government has issued a pro-AI policy paper; among other things, it suggests that the UK should adopt a text and data mining exception so that model trainers can use data without copyright concerns.
- Medical software: This analysis of flawed software to help doctors predict sepsis is both scary (software that, when it goes bad, can literally kill you) and asks a great question: why do we regulate medical software that can kill you much differently than we regulate medicines that can kill you?
- PyTorch continues to roll along with a new release.
- New org launch: Data Tank, an EU-based data-stewardship organization, is extremely interesting. I particularly like the idea of treating data stewardship as a coherent discipline; I could easily see this in the long run paralleling the rise of Trust and Safety as a discipline. (Via Tim Davies’ also-interesting notes from a data community meeting in Paris.)
- Kyle Mitchell, license guru, has done an overview of the NVidia non-commercial license mentioned here last week.
- Mozilla has launched a “Responsible AI Challenge” with grant money attached. It’s not much ($50K total) but interesting to see Moz start to move in this space.
- A whole magazine issue on sustainable AI issues. Haven’t had time to peruse, but production value looks high and topics are interesting.
- Man experiments with telling ChatGPT “go make money”. Other man experiments with telling ChatGPT "get 10,000 stars on GitHub". Nifty and creative, and (so far) harmless. Does make you wonder, though, what OpenAI is doing to staff up a Trust and Safety team.
- Model publicly available, but not in your language? Just… retrain it. In some sense these models may end up being very translatable to other languages, which will help with accessibility—but there is an open question of who gets to do that.
What should OKFN be doing?
As the Open Knowledge Foundation works to relaunch its Open Data Definition, some questions I would offer as discussion starters. These are based on my experience as a license steward, license-review process participant at OSI, and drafter of most of version 2.0 of the ODD.
- Who is relying on the ODD right now? What for? Are new uses/users anticipated/hoped for?
Basic product design (and the ODD is, in important ways, a product) requires understanding your users. Is the audience activists? lawyers? governments? data stewards?
You don’t have to have a perfect idea of this—sometimes it is OK to just build and see what sticks!—but having some theory is really useful. As I have almost never seen use of the ODD in the wild, it’s hard for me to have an intuition here.
- What is OKFN’s theory of how open data’s role in the world has changed (or hasn’t) since ODD 2? How does that require a new ODD?
Obviously I think the answer here is “it has changed a lot” or I wouldn’t be writing this newsletter. But again, my goals are different—I’m explicitly not trying to write a formal definition! OKFN hopefully has a strong perspective on this, to help drive the process.
- What surveys and typologies of projects that identify as open data will the OKFN team be using? (Note that “identify as open” is not the same as “projects that comply with the current ODD”!)
The Open Source Initiative was grounded from day 1 in the practical experience of Debian, Linux, Mozilla, Perl and Apache. This set a useful framework for the Open Source Definition, helping ensure that it was not (just) an ideological flight of fancy.
What projects does ODD intend to partner with and/or use to test any changes to the definition? This seems especially important to catalog in a time of flux, where many things self-identify as open data without much thought to any underlying theory of open. Picking at least some of those and explicitly saying “nope, not open” will make some enemies—but also create engagement and clarity.
I wish OKFN well—but they're also chasing a moving target right now.
Elon hasn’t reached out to me, but lots of other people are asking me variants on this question:
I should preface this by saying that I’ve represented a lot of non-profits, but am not a specialist in non-profit law. If any of you reading this are specialists, please drop me a note! I would love to pick your brain in the newsletter’s first interview ;)
The basic facts:
OpenAI was formed as a non-profit, aiming to do research into AI and make that research publicly available. Their Delaware certificate of incorporation (via) says:
The specific purpose of this corporation is to provide funding for research, development and distribution of technology related to artificial intelligence. The resulting technology will benefit the public and the corporation will seek to open source technology for the public benefit when applicable.
In 2019, their last available public IRS filing, they reported transferring most of the non-profit’s IP assets to a new for-profit, wholly owned and controlled by the non-profit.
In that filing, they (not implausibly) justified the activities of the for-profit as aligned with their mission:
All these advances in AI technology [made by the for-profit] are moving the Organization closer to achieving its mission, which is the development of Artificial General Intelligence in the public interest.
To the best of my knowledge, OpenAI has given no transparency into what control its for-profit investors have into the for-profit, except to say that the non-profit retains control.
As one final piece of context, it's important to know that US non-profits can do profitable things. First, they are allowed to have “unrelated business income" from profitable lines of business not directly related to their non-profit, though that income may be subject to normal for-profit taxation. (The details of the rules are a little murky.) Second, non-profits can sell assets to for-profits. The law around that is also complicated, so it can't be done easily, but it can be legal.
- Transparency advocates should acknowledge that traditional open source might not be in the public interest here. But if not that, what? Alternatives might be “publish a lot more transparency information”, “enforce safety-related terms of service very aggressively”, “do public, non-discriminatory licensing of APIs rather than early-access for investors”, etc.
- The most basic answer to “why doesn’t everyone do this” is that most non-profits don’t have profitable ideas ;) More seriously, note that in this situation the for-profit still pays things like sales taxes. So it is not a tax-avoidance mechanism, as Elon seems to imply in his tweet.
- If the for-profit pays dividends to the non-profit that would immediately be an IRS problem, because non-profits can’t be overly dependent on a single source of funding. I have no reason to think that is the case here, though some of the dual-salary approaches that have been reported seem sketchy. We’ll know more when we get the next 1099.
- The IRS, for a while, was extremely skeptical of open source software 501(c)3s, before retreating. This could easily reopen that question for everyone, which would be a shame.
- As best as I can tell, OpenAI invented the term “profit-capped”, and is attempting to argue that this eventual reversion to a non-profit means it is still in the public benefit. A journalist or regulator looking into this would want to ask whether this means “profit-capped” (meaningless, as profit can easily be avoided if desired) or “revenue-capped” (more meaningful, though perhaps still problematic).
If I were a journalist, a major donor like Elon, or a board member of a California/US tech non-profit worried about the impact on its own tax status, I’d absolutely be urging the State of California’s non-profit regulators to look into whether the non-profit is still serving its chartered goals. Maybe they really have done the right thing here!
I remain unclear on what open is in this space, but I’m increasingly clear on what open isn’t. That’s progress, I guess? Until next week…