Inventive robots and accidental tyrannies

In Marginalia 23 I mentioned Julia Reda’s “GitHub Copilot is not infringing your copyright”. Reda noted:

Those who argue that Copilot’s output is a derivative work of the training data may do so because they hope it will place those outputs under the licensing terms of the GPL. But the unpleasant side effect of such an extension of copyright would be that all other AI-generated content would henceforth also be protected by copyright.

Well, the day has already arrived in relation to copyright’s close cousin, the patent. This is a blow against open society and democratic public affairs. In short: the people most likely to control the machine learning power capable of spitting out patentable “discoveries” are already rich. This legal precedent means they can simply buy some compute and further monopolise “innovation”. It’s certainly not going to incentivise any human creativity.

This story gets top billing in today’s Marginalia because it combines two strands that have been on my radar recently: technology monopolies, and machine learning aka “artificial intelligence”.

In “The rise of community-owned monopolies”, Konrad Hinsen writes: “One question I have been thinking about in the context of reproducible research is this: Why is all stable software technology old, and all recent technology fragile?”

One response would be to point Hinsen to selection bias, and the (in)famous Second World War “bullet holes in planes that returned to base” diagram. Old technology that is still around is likely to be still around because it is stable. New software could be expected to be unstable in part because it’s new. But this isn’t the whole story, and he does have some interesting things to say about how, even in an “open” and “free” project, a type of community monopoly can evolve:

While in theory Open Source is good for supporting diversity (“just fork the code and adapt it to your needs”), the reality of today’s major Open Source communities is exactly the opposite: a focus on “let’s all work together”. Combine this with the chronic lack of funding, and thus also a lack of incentives for developing the structured governance that would administrate funding and create activity reports, and you end up with large number of users depending on the work of a small number of inexperienced developers in precarious positions who cannot reasonably be expected to make an effort to even understand the needs of the user base at large.
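As an aside on the selection bias point above, here is a minimal sketch of why “old” and “stable” tend to go together. Everything in it is an assumption made up for illustration: the `simulate` function, the idea that each hypothetical project has a fixed yearly chance of surviving, and the numbers chosen. It is a toy model, not a claim about any real software ecosystem.

```python
import random


def simulate(n_projects=10_000, years=20, seed=0):
    """Toy model: each hypothetical project has a fixed yearly survival chance."""
    rng = random.Random(seed)
    # Assumed for illustration: "stability" is a survival probability
    # drawn uniformly between 0.5 and 1.0 for every project.
    stabilities = [rng.uniform(0.5, 1.0) for _ in range(n_projects)]
    survivors = []
    for stability in stabilities:
        # A project is still around only if it survives every single year.
        if all(rng.random() < stability for _ in range(years)):
            survivors.append(stability)
    return stabilities, survivors


def mean(values):
    return sum(values) / len(values)


all_projects, survivors = simulate()
print(f"projects started:        {len(all_projects)}")
print(f"projects still around:   {len(survivors)}")
print(f"mean stability (all):    {mean(all_projects):.2f}")
print(f"mean stability (old):    {mean(survivors):.2f}")
```

In this toy model nothing makes a project more stable as it ages. The old projects look more stable purely because the fragile ones have already disappeared, which is the same trap as counting bullet holes only on the planes that made it home.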

Another line that caught my eye is “Standards-based markets can only form when there are multiple competing producers right from the start”. The same problem arises when there were competing producers at the start but, for whatever reason, there are fewer and fewer over time. A clear example at present is web browsers, where there are essentially three competing browser engines (Blink, WebKit, and Gecko), but Mozilla is so reliant on funding from Google/Alphabet that arguably there are only two independently funded endeavours. Web rendering engines are so complicated that there is no realistic opportunity for new competitors, and the computing technology giants control the standards, so that’s not likely to change.

Rich Harris sounds the alarm on where this control has led with his post “Stay Alert”:

A short while ago, Chrome broke the web by disabling alert(), confirm() and prompt() dialogs from cross-origin iframes. The justification was that “the current UX is confusing, and has previously led to spoofs where sites pretend the message comes from Chrome or a different website”; removing the feature was deemed preferable to fixing the UX.

One may or may not agree with Harris’ stance on alert(), but that’s not really the point. Chrome, and more importantly Blink (which also drives Microsoft’s Edge browser), has such a large market share that what they decide effectively determines where web technology goes. Even if you think they’re right this time, it’s extremely dangerous.

We also see this in library “science”, at least in the English-speaking world. You’ve read me complaining about this before, but why are the extremely weird and particular needs of the United States Congress used as the basis for both classification and controlled vocabularies in libraries across the United States, let alone those in the UK, Australia, and many other countries? Standards monopolisation.

Graham Lee provides an interesting take on how software develops and who gets left behind and why, in “Majoring in versions”. He also has some great and funny lines:

“Scripting language” does not actually mean anything. It is said by people who want to imply that a programming language is less worthy somehow because it is easier to use.

But beyond that, Lee provides a really great explanation of what has happened in Python with the migration from Python 2 to Python 3, and more importantly, the incredible complexities of trying to reach consensus in a community-managed software project used by literally millions of people (my emphasis in the below):

The last release of the 2.7 lineage was in April 2020, two decades after the first release, two decades after the discussions of py3k started, 14 years after the migration path was published, and 11 years after the release of version 3.

And people still felt that they had not had enough warning.

In fairness, some of them had not. They were not Pythonistas per se, they were computer users who happened to engage with Python when using a computer. Climate scientists, perhaps, who relied on their library vendors and their site administrators to keep everything ticking over. But they did not realise that their library vendors did not have funding for maintenance, and that their site administrators were relying on the operating system package maintainers.

The operating system package maintainers did not dare upgrade the default Python package, because that would break people’s scripts. Better to ship version 2 tomorrow so that everybody’s programs from yesterday carry on running.

Nobody was responsible for migrating to Python 3, so nobody did. It was not until 2.7, when people were finally told that this was the final release of Python 2, that these Pythonistas noticed the corner that they were painted into.

Ed Summers, in a much shorter post (Opinionated), covers some of the same ground:

Software always takes sides, and expresses opinions–and in fact often embodies multiple opinions in multiple arguments or controversies, rather than just one. The question is, do you understand the opinions it is expressing, and the decisions that are being made to express them? How can these decisions be negotiated as a group that includes the designers and users of the software?

So that was a bunch of reading about monopolies and sameness. My last few links are about splitting things into more discrete focuses – for better or for worse.

First up, two really cool things.

Brigham Young University wanted to analyse their chat logs to spot any issues with how they were responding to student queries, or any patterns in the problems students were experiencing. The problem was that nine years of chat history amounted to about 90,000 different transcripts. Enter machine learning. What I like about this project is (a) they did the analysis themselves with downloaded logs rather than some cloud service and (b) they shared exactly how they did it in an open access journal.

I have also been meaning to share this interesting podcast episode from What’s new? about The Women Writers Project:

Since the dawn of the printing press, women have written and published works of prose and poetry, and yet these texts have almost always received less attention than books written by men. In the early years of the internet, one project sought to redress this imbalance, and to make women writers not only more visible, but available for students and researchers to study in entirely new ways.

This project is basically the opposite of the chat analysis project – it uses humans to manually code text from nineteenth-century women’s novels, allowing researchers to find links between texts, authors, and styles that would otherwise be missed. A fascinating metadata story.

Finally, there’s this garbage from Clarivate earlier in the year. In short, they’ve found a new way to artificially slice knowledge about the world into arbitrary categories. But allegedly it’s “bottom up” (though why this is “bottom up” rather than “top down” or “sideways in” for that matter isn’t really explained). Don’t get confused. There are no “responsible” research metrics. This is not about creating new knowledge or more “natural divisions” or any other marketing rubbish. This is about creating new “specialties” and therefore new journals and therefore new profit streams. In an interview with Emergence Magazine earlier this year Suzanne Simard talked about the way the academia-publishing complex conspires to reduce our understanding of how things are connected:

In academia, you get rewarded for the number of papers that you publish. They still count the number of papers. You get more money, you get more grants, you get more recognition, especially if you’re the lead author. Then you see, in areas like microbiology or even satellite imagery and remote sensing, if you can dissect your paper in these little bits and bites and publish these small ideas and have many, many, many papers, you’re much further ahead than writing that one big, seminal paper that integrates everything together, that’s going to be really hard to publish.

And so academics do. They put them in these little bite-sized pieces. I find myself doing it too. It’s how you can survive in that environment. And so it is a self-fulfilling system of always having these little bits of papers. It’s the antithesis of holistic work.

I’ve given you some bite-sized pieces here, but hopefully they add up to a holistic work.