I’ve been doing a lot of speaking on ML of late (I think four streams/podcasts and one conference track in four weeks?) so the newsletter has suffered. Thanks to all of you who have referred me to speaking opportunities; it’s been fun! Today I’m coming to you live from Göteborg, Sweden, where tomorrow I’ll co-lead a track on machine learning and open.
(All streaming unless otherwise noted)
- I’ll be moderating a panel with panelists from Internet Archive, Creative Commons, and Wikimedia on May 2. Registration for the stream at the link.
In this section: What values have helped define open? Are we seeing them in ML?
Improves public ability to understand
- The Washington Post has a great visualization and report on what data is used in one of the key semi-public data sets, C4. This is the kind of democratic oversight that (1) is extremely necessary and (2) can only happen when data sets are open(ish) enough to be accessible and legible to the media.
- On the flip side, #5 in this terrific list of “Eight Things To Know About LLMs” is that “Experts are not yet able to interpret the inner workings of LLMs”. This is a nice, concise summary of the research in this area—and suggests that, at least for the moment, making models available to researchers is not a panacea for interpretability.
I’ve mused here before that the “foundation models” approach is an important one to understand, not just technically but because whoever provides and controls those models will have a lot of power.
Daniel Jeffries, formerly (briefly?) of StabilityAI, muses at length on who will “win” in foundation models. His take: there will be “Foundation Model as a Service companies who basically offer intelligence as a service but even more importantly they offer piece of mind: Guardrails, logic fixes, safety measures, upgrades, rapid response to problems.” But getting there will be costly, and error-prone, because making the wrong choices at the beginning will mean throwing everything away to retrain. The essay ends with a long section on open business models in this space that is particularly worth reading.
One oversight in the Jeffries essay is the regulatory environment. I think this may push towards open (or at least transparent) in a way that regulation of traditional software has not. Besides the safety considerations I’ve already covered here repeatedly, there’s also a growing push within academia to do research based on open models. If you’re interested in reading more on that, here’s a long read focused on natural language processing research, and a more recent editorial in Nature. It will be interesting to see if this advocacy succeeds and tips the general policy balance in favor of open foundation models.
In this section: open software defined, and was defined by, new techniques in software development. What parallels are happening in ML?
This paper is a very deep dive (with an excellent, short executive summary) on what terminology and techniques we might use to discuss safety and security in ML models. Highly recommended for anyone thinking about this; the comparisons to old techniques are problematic and we need to build better vocabulary if we want to get this right.
New sub-section here; modularity is a key open source software technique, enabled by low-friction licensing. Are we seeing it in ML?
- The HuggingFace team has published a paper demonstrating the chaining of multiple models to create powerful outcomes. This may end up being an alternative (or complement) to specialized training.
- Langchain, an open source toolchain for interacting with LLMs, continues to be very actively developed, including rapidly implementing techniques from academic papers. One to keep an eye on.
In this section: ML is going to change open—not just how we understand it, but how we practice it.
Creating new things
New unquestionably open models continue to proliferate. From the past few weeks:
- MIT-licensed text-to-image model
- instruction-tuned open text LLM based on MIT-licensed EleutherAI Pythia
- FB image segmentation model
And data sets too. This week it is RedPajama, a new data set explicitly aimed at duplicating the Facebook LLaMA data set, so that others can reproduce the LLaMA model. Note that funding is a mix of academic, government, and startup, suggesting that the “everyone finds something” economic model of open source software will have at least some applicability in open(ish) ML.
- David Widder and Dawn Nafus interviewed developers and wrote a paper on how those developers think about (or don’t think about) accountability. The key, they find, is modularity. By treating pieces of software as just one step in a long software supply chain, we effectively always say “ethics is someone else’s problem”. The paper is about AI, but it is also a good summary of a literature that applies to traditional open source as well.
- This history of how academic computer science started to grapple with ethics in its curriculum is good (if too short!). It surfaces critiques that parallel Widder and Nafus: specifically, that much ethics education focuses on the individual’s role in a way that elides institutional responsibility.
Changing regulatory landscape
- This piece argues that security- and privacy-preserving models in the current technical paradigm are impossible; there’s just too much uncertainty in how they work. The author is quite serious about this, having chosen to quit Google in order to publish it. I’m very curious how this ends up interacting with the coming AI regulatory regimes and existing regimes like GDPR.
- EPIC has a very deep dive on proposed US government regulations, specifically NIST’s AI Risk Management Framework. This one covers 102 specific actions across five recommendation areas, and several detailed followups are available here.
- In anti-collaboration news, I increasingly think that before LLMs create impactful “misinfo” themselves, they’ll accidentally create a misinfo crisis by burning out every human moderator on every platform, allowing human misinfo to flourish. Relevant to open, GitHub will be one of the first victims of moderator burnout.
- We continue to see more small-human-language LLMs, this time from South Africa.
- Good, short essay on why general-purpose AI tends to do better than special-purpose AI. Relevant to a traditional open approach of “build small pieces that can be reused”.
- A team including Mark Lemley has written a thorough summary of the state of American law on “Foundation Models and Fair Use”.
- StabilityAI, advised by Mark Lemley, has filed a motion to dismiss the copyright case against it. I have not had a chance to read all of it, but critically it alleges that the plaintiffs did not file for copyrights on the allegedly infringed works—which may end the case fairly quickly, without teaching us much about what the law is in this case.
- Recordings from DAIR’s “Stochastic Parrots Day” are now available.
I’ve been re-reading classic machine-learning related science fiction; please leave comments or ping me if you have suggestions!
One thing that has jumped out at me is that in Stephenson’s The Diamond Age, the ML-like software is referred to as “pseudo-intelligence”. I really like this; it captures the almost-but-not-quite-thereness.