Want to wade into the sandy surf of the abyss? Have a sneer percolating in your system but not enough time/energy to make a whole post about it? Go forth and be mid: Welcome to the Stubsack, your first port of call for learning fresh Awful you’ll near-instantly regret.
Any awful.systems sub may be subsneered in this subthread, techtakes or no.
If your sneer seems higher quality than you thought, feel free to cut’n’paste it into its own post — there’s no quota for posting and the bar really isn’t that high.
The post-Xitter web has spawned so many “esoteric” right-wing freaks, but there’s no appropriate sneer-space for them. I’m talking redscare-ish, reality-challenged “culture critics” who write about everything but understand nothing. I’m talking about reply-guys who make the same 6 tweets about the same 3 subjects. They’re inescapable at this point, yet I don’t see them mocked (as much as they should be).
Like, there was one dude a while back who insisted that women couldn’t be surgeons because they didn’t believe in the moon or in stars? I think each and every one of these guys is uniquely fucked up and if I can’t escape them, I would love to sneer at them.
(Credit and/or blame to David Gerard for starting this.)


Don’t be so sure.
These things consist of up to a trillion real numbers, ganged together in a big ‘network’ of numbers flowing through the system and being influenced by the trained numbers along the way.
They are trained by gradient descent. You start off with a huge pile of real numbers, a set of inputs, and a set of desired outputs. Because it’s all, ultimately, a bunch of matrix multiplication and smooth differentiable functions, you just do some calculus on all trillion numbers to find the derivative of how good the output is with respect to them - as this number goes up, the closeness of the output to what you want slightly goes up or slightly goes down. You repeat that for every variable, and take a step in that direction for all the variables. Repeat a billion or so times over.
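To make the loop above concrete, here is a minimal sketch in plain Python with two trained numbers instead of a trillion. The data, the names (`loss`, `gradient`, `train`) and the learning rate are all made up for illustration, and it minimizes an error rather than maximizing a "goodness" score, which is the same thing with the sign flipped. Real training code computes the derivatives automatically; this just writes them out by hand.

```python
# Toy gradient descent: fit y = w*x + b to four points.
# Everything here is hypothetical and chosen only for illustration.
inputs = [0.0, 1.0, 2.0, 3.0]
targets = [1.0, 3.0, 5.0, 7.0]  # generated by y = 2x + 1


def loss(params):
    """Mean squared error: how far the outputs are from the desired outputs."""
    w, b = params
    return sum((w * x + b - y) ** 2 for x, y in zip(inputs, targets)) / len(inputs)


def gradient(params):
    """Derivative of the loss with respect to each number, worked out by hand."""
    w, b = params
    n = len(inputs)
    dw = sum(2 * (w * x + b - y) * x for x, y in zip(inputs, targets)) / n
    db = sum(2 * (w * x + b - y) for x, y in zip(inputs, targets)) / n
    return [dw, db]


def train(params, learning_rate=0.05, steps=2000):
    """Take a small step downhill for every number, then repeat."""
    for _ in range(steps):
        grads = gradient(params)
        params = [p - learning_rate * g for p, g in zip(params, grads)]
    return params


fitted = train([0.0, 0.0])
print(fitted, loss(fitted))
```

Note that nothing in `train` knows or cares what the numbers mean; each one just moves a little in whatever direction reduces the error.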
Every single step in training is entirely local with respect to every single number. At no point is there a step that produces legible abstractions about how it works; at every step, every number just moves to become a little better. It is true that the basic topology of the network (the famous ‘transformer model’) pushes it towards certain KINDS of functional units (the famous ‘attention heads’), but working out much more detail than that takes a lot of work. There is very interesting math to the effect that with a large number of parameters you are unlikely to get stuck in a local maximum where you can’t get any better; instead you keep turning, with different variables becoming important for the improvement, along a labyrinthine path towards better performance. That means at no point does anyone have to look into the process and figure out what is being built. The process is not unlike biological evolution, and produces things that are at least as inscrutable without detailed deep examination. We’ve been poking at molecular biology for more than fifty years in great detail with a world’s worth of biomedical researchers, and at these things for a much shorter time.
When people manage to peel these things apart and find the ‘functional units’ within them, they’re pretty wild. Most of this work has, unfortunately, been funded by cultists at Anthropic, but some of the ‘mechanistic interpretability’ literature is fascinating. You get ‘features’ represented by subsets of numbers in a particular layer, in superposition with other ‘features’ - each layer is like a huge vector sum of lots of smaller vectors, each of which does something. When you get maps of what ‘features’ activate or repress each other you get horrible spiderweb messes that look like charts of metabolism in cells.
EDIT: And even when people manage to find features, finding an individual feature takes a lot of effort, and there is reason to think that every layer contains more features than there are numbers in it, because (to oversimplify) every feature is a large set of numbers that can overlap. It is utterly unsurprising, and not a sign of magical thinking or ‘bad code’, that large fractions of behavior cannot be mechanistically understood at this time.
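A rough sketch of the "more features than numbers" idea, in plain Python. This is a toy, not how any real model or interpretability tool works: the sizes (`DIM`, `N_FEATURES`) are hypothetical, and the feature directions are just random unit vectors rather than anything learned. It only shows that you can point many more directions through a space than it has dimensions, and still read a single active one back out.

```python
import math
import random

random.seed(0)  # arbitrary seed, only so the run is repeatable

DIM = 32          # numbers in the "layer" (hypothetical size)
N_FEATURES = 128  # features packed into it: four times as many as DIM


def random_unit_vector(dim):
    v = [random.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Each feature is a direction that uses ALL the numbers in the layer,
# so the features overlap instead of owning one number each.
features = [random_unit_vector(DIM) for _ in range(N_FEATURES)]


def encode(active):
    """Layer activation: the vector sum of the active features' directions."""
    layer = [0.0] * DIM
    for i in active:
        layer = [a + f for a, f in zip(layer, features[i])]
    return layer


def strongest(layer):
    """Project the layer onto every feature direction and pick the best match."""
    scores = [dot(layer, f) for f in features]
    return max(range(N_FEATURES), key=lambda i: scores[i])


# Interference: how much any one feature leaks into another's readout.
max_overlap = max(
    abs(dot(features[i], features[j]))
    for i in range(N_FEATURES)
    for j in range(i + 1, N_FEATURES)
)

recovered = [strongest(encode([i])) for i in range(N_FEATURES)]
print(recovered == list(range(N_FEATURES)), max_overlap)
```

The overlap is never zero, which is the catch: activate several features at once and the readouts start to smear into each other, and that is part of why pulling real features apart is so much work.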
@BioMan @BurgersMcSlopshot I’ve recently had the chance to look at the work of someone who was really proud that they used a neural net to create a forward/backward mapping from a space of 3 controls to the ~50 controls that actually drove the system.
I took their files, loaded them into ONNX, and … they would have been way better off using PCA, because the neural net is approximating a simple linear system.
I think this is relevant, and the sort of “don’t understand” we’re talking about.
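For anyone wanting to run the same kind of sanity check, here is a hedged sketch in plain Python. It is not PCA and it does not touch ONNX: `black_box` is a hypothetical stand-in for the trained network, and the probe simply asks whether a 3-in, 50-out mapping is reproduced by an offset plus three columns. If it is, a linear method would have done the job.

```python
import random

random.seed(1)  # arbitrary seed, only so the run is repeatable
N_IN, N_OUT = 3, 50

# Hypothetical stand-in for the trained network: secretly just an affine map.
_W = [[random.uniform(-1, 1) for _ in range(N_IN)] for _ in range(N_OUT)]
_B = [random.uniform(-1, 1) for _ in range(N_OUT)]


def black_box(x):
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(_W, _B)]


def curved(x):
    """A genuinely nonlinear mapping, for contrast."""
    return [x[0] * x[1], x[2] ** 2]


def affine_probe(f, n_in):
    """Estimate offset and columns of an affine map from n_in + 1 evaluations."""
    origin = f([0.0] * n_in)
    columns = []
    for i in range(n_in):
        unit = [1.0 if j == i else 0.0 for j in range(n_in)]
        columns.append([y - o for y, o in zip(f(unit), origin)])
    return origin, columns


def affine_predict(origin, columns, x):
    return [
        o + sum(xi * col[k] for xi, col in zip(x, columns))
        for k, o in enumerate(origin)
    ]


def max_deviation(f, n_in, trials=200):
    """Worst disagreement between f and its affine reconstruction on random inputs."""
    origin, columns = affine_probe(f, n_in)
    worst = 0.0
    for _ in range(trials):
        x = [random.uniform(-1, 1) for _ in range(n_in)]
        got, want = f(x), affine_predict(origin, columns, x)
        worst = max(worst, max(abs(g - w) for g, w in zip(got, want)))
    return worst


linear_gap = max_deviation(black_box, N_IN)
curved_gap = max_deviation(curved, N_IN)
print(linear_gap, curved_gap)
```

A gap at the level of floating-point noise means the network has learned nothing a matrix could not express; a large gap means there is real curvature in there.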
That’s an indication that the problem is not well served by a neural network. They are useful for approximating highly nonlinear functions with lots of inputs (and will not work well outside the range of inputs that you approximate within), not simple linear systems. The goal of recent ML has been to reduce as many problems as possible to high-dimensional, highly nonlinear curve fitting, with some great successes (machine translation, image recognition) and some not so great (shhhhh don’t tell the investors!)
@BioMan exactly. And yet here we are hammering square pegs into round holes.
If this product makes it to market in its current shape, that’s gonna increase hardware costs, all because of the “blindly throw ML at everything” bandwagon.
I’m reminded of people back in the day using map/reduce via Hadoop to solve problems that could just as well be handled with Postgres or even SQLite and a sprinkling of SQL, because that’s how Google did it and no-one has any idea what “big data” really is.
Similarly, turning simple network applications into a hideous armada of microservices on a distributed Kubernetes cluster, because that’s how Google did it and people outside of giant tech companies don’t really know what that sort of scalability is for.
And here we are in the age of readily accessible neural network software. This too will pass, and we’ll get a new sledgehammer for walnut-opening in due course.
@danlyke @BioMan That’s right, this one goes in the square hole.
@jackwilliambell @BioMan @BurgersMcSlopshot
The standard for scientific study is “Is it reproducible?”
OpenAI & others of its ilk only rarely spit out reproducible results on anything but their original data set.
In the meantime, a wholesale attack on privacy is being waged to gather data to feed LLMs.
That data is enormously useful for creeps stalking dissidents, imposing surge pricing & “personalized pricing”, enabling ICE raids, spreading disinformation & for fraudsters.
The most recent iteration of this is “Functional genetic programming with combinators” (2007), previously, on Lobsters; the generated programs have structured subprograms which can be extracted and analyzed on their own.
Try “neuroevolution”
There’s some really cool work comparing evolution-type algorithms with gradient descent, showing that training a network through gradient descent creates a training ‘trajectory’ (how it changes over time during the training process, in a very high dimensional space) that is basically the ‘average’ central-tendency trajectory in the middle of the ‘cloud’ of trajectories that individual replicates of an evolutionary process create. Of course, something like code is made of discrete chunks rather than real numbers you can calculate a gradient of, and that kind of necessitates such an evolutionary process.
Sorry if I just get super nerdy technical here, I am in the middle of a project at work about the relationship between evolutionary processes and machine learning processes that’s resulting in a lot of very interesting math about the nature of both and the kinds of things that they can learn.
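For the discrete case, where there is no gradient to follow, a minimal sketch of what "such an evolutionary process" can look like, in plain Python. This is a toy mutate-and-select loop on a bit string, not neuroevolution proper and not the work described above; `TARGET`, the genome length, and the generation count are all hypothetical.

```python
import random

GENOME_LENGTH = 20
TARGET = [1] * GENOME_LENGTH  # hypothetical "desired behavior", encoded as bits


def fitness(genome):
    """Count the positions that match the target. No derivative exists for this."""
    return sum(1 for g, t in zip(genome, TARGET) if g == t)


def mutate(genome, rng):
    """Flip one randomly chosen bit."""
    child = list(genome)
    position = rng.randrange(len(child))
    child[position] = 1 - child[position]
    return child


def evolve(generations=2000, seed=0):
    """Keep a mutated child whenever it is at least as fit as its parent."""
    rng = random.Random(seed)
    parent = [rng.randint(0, 1) for _ in range(GENOME_LENGTH)]
    history = [fitness(parent)]
    for _ in range(generations):
        child = mutate(parent, rng)
        if fitness(child) >= fitness(parent):
            parent = child
        history.append(fitness(parent))
    return parent, history


best, history = evolve()
print(fitness(best), history[0], history[-1])
```

Each value of `seed` gives a different replicate with its own trajectory through `history`; a cloud of trajectories in the sense above would be many such runs side by side.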
Edited to note that I am referring to the trajectory the system takes as it changes during training/learning/evolving.
Neat! Is that new? Reckon you could get it published?
Doing a LOT of python. Here’s hoping.
For fun, take a look at this older work from someone else
https://www.nature.com/articles/s41467-021-26568-2
I would say it’s more that the relationship between a text prediction model’s output and real text is precisely mathematically the relationship between a leaf bug and a leaf, down to being made by very different processes, optimized by different forces over their origin, and doing very different things inside.
Trying to force an LLM to produce true statements is like trying to get a leaf bug to photosynthesize. What they do is unrelated to that, they just happen to have been optimized over time to resemble something that does do that as seen by a certain mode of inspection.
I see I was indeed being presumptuous. Based on replies to my original (and somewhat incorrectly formed) statement, it seems that while the parts of the process are understood in the abstract, there are points in an actual running implementation that become, I suppose, unfathomable or incomprehensible. Is that fair to say, or am I being wrong in a different direction now?
@BioMan @BurgersMcSlopshot
> The process is not unlike biological evolution
It reminds me of simulated annealing.