LLMs and nukes: not so different, says Russ Swan.
The Internet Archive (archive.org) passed a significant milestone this autumn on its mission to capture and preserve digital culture: it has now indexed over a trillion web pages.
A trillion. That’s big numbers. If each page were printed on a single sheet of photocopy paper, the stack would be 100,000 km high, a quarter of the distance to the moon.
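For anyone who wants to check the stack figure, here is a rough back-of-envelope sketch. It rests on assumptions that are not in the column: one sheet per web page, a sheet thickness of about 0.1 mm (typical for office paper), and a mean Earth-Moon distance of roughly 384,400 km. The function and constant names are purely illustrative.

```python
# Back-of-envelope check of the paper-stack comparison.
# Assumptions (not from the article): one sheet per page,
# 0.1 mm per sheet, mean Earth-Moon distance of 384,400 km.
PAGES = 1_000_000_000_000      # one trillion archived pages
SHEET_THICKNESS_MM = 0.1       # assumed thickness of one sheet
MOON_DISTANCE_KM = 384_400     # assumed mean Earth-Moon distance
MM_PER_KM = 1_000_000


def stack_height_km(pages: int, thickness_mm: float) -> float:
    """Height in kilometres of a stack of `pages` sheets."""
    return pages * thickness_mm / MM_PER_KM


height_km = stack_height_km(PAGES, SHEET_THICKNESS_MM)
fraction_to_moon = height_km / MOON_DISTANCE_KM

print(f"Stack height: {height_km:,.0f} km")
print(f"Share of the distance to the Moon: {fraction_to_moon:.0%}")
```

Under those assumptions the sum lands on the column's figures; thicker or thinner paper would shift the result in proportion.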
It’s a terrific resource, although not without controversy. The archive, home of the Wayback Machine, was the target of a major cyberattack in 2024 and, a year earlier, was found liable for copyright infringement for making copies of books available during the Covid lockdowns. Naughty.
Nevertheless, its existence is evolving from a thing of casual interest and novelty to one of critical international importance. It is becoming the digital equivalent of low background steel.
Those of you who have worked with radiation detection instruments may be familiar with this material. All steel produced since 1945 is contaminated by the products of the bombs on Hiroshima and Nagasaki, and the couple of thousand nuclear weapon tests performed before and since.
Stuff like plutonium-239 and strontium-90, which did not previously exist on this planet, found its way into every gram of steel made in the second half of the 20th century and, consequently, made that steel useless for many scientific instruments. You can’t measure radiation if your instrument is itself radioactive.
One solution has been to retrieve so-called low background steel, made before 1945, most notably from sunken ships.
There is a lot of blather about just how valuable this stuff is, and it’s hard to find reliable numbers. In truth, the quantities of steel needed by the worldwide scientific community are relatively low and the price of the low background variety reflected the difficulty in its retrieval rather than actual rarity (a battleship could make quite a lot of Geiger counters). Modern steelmaking produces uncontaminated metal anyway.
What has all this to do with the Internet Archive and its trillion-page landmark?
With the release into the wild of ChatGPT, 2022 is to the digital world what 1945 was to the radiological world. It was when the bomb went off.
Since then, the growth of AI has been cancerous. A little bogus information here becomes the feedstock for the next iteration of contaminated data, round and round like the ouroboros of ancient mythology: a snake eating its own tail.
Coupled with the growing enshittification of the internet, courtesy of the tech giants who are weirdly killing the very thing that made them (see Lab News, 29 May 2024), the validity of much online content is now questionable. For decades we’ve enjoyed having the collected knowledge of the human race available to us, instantly, often for free, and mostly readily verifiable. That is no longer the case.
Factor in a White House intent on erasing any parts of history that don’t accord with its twisted world view, and we are only just beginning to realise what we’ve lost.
This means that certified pre-2022 digital content is quite precious.
It contains a lot of crap, yet it is the unpolluted low background material that we will need to rebuild everything.