No shit “this decline [in the capability of LLMs to perform logical reasoning through multiple clauses] is due to the fact that current LLMs are not capable of genuine logical reasoning; instead, they attempt to replicate the reasoning steps observed in their training data.”
Paper from Apple engineers showing that genAI models don’t actually understand what they read but just regurgitate what they’ve seen like a puppy wanting to please its owner:
https://arxiv.org/pdf/2410.05229
This is apparently doing some numbers so I want to make it clear that I think this paper is purely part of a strategic move by Apple to save face and gracefully exit from the genAI hype cycle. Hopefully other companies will do the same.
@CatherineFlick but the puppy is actually more "intelligent"
@yvan @CatherineFlick and cute. And it consumes way less resources.
@cm @yvan @CatherineFlick
And craps in far fewer inboxes
@CatherineFlick This paper should have been rejected for lack of novelty and relevance. Ah, self-published.
@tessarakt @CatherineFlick It's not "novel", in the sense that we theorists have known this since the GPT-2 days at least – but it's certainly relevant, and I haven't seen a published paper on it before.
@wizzwizz4 @tessarakt @CatherineFlick I don’t have a bibliography handy but there are previous papers that cover this. The one that comes to mind gave an LLM a word problem isomorphic to one it had “solved” because the solution was in the training set. But the new problem was in a different domain and the bot failed b
@MartyFouts @wizzwizz4 @tessarakt I think the paper is 100% strategic, Apple higher ups wouldn’t sign off on this unless they were fully behind it. They’re already pulling out of funding AI companies, this is all part of the strategy. It’s a face saving operation at this point.
@wizzwizz4
Here are some papers from 2023 and this past summer that cast doubt on the math, planning, and reasoning abilities of LLMs:
planning: https://openreview.net/pdf?id=2g4m5S_knF
"Sparks of intelligence" are really "embers of autoregression": https://www.pnas.org/doi/10.1073/pnas.2322420121
Math: https://arxiv.org/pdf/2405.14804
"Emergence a mirage": https://arxiv.org/pdf/2304.15004
@tessarakt @CatherineFlick
@hobs @wizzwizz4 @tessarakt Oh it's not novel indeed. But it's novel that Apple has signed off on a paper essentially undermining a core capability that genAI is supposed to have in this race. That's the key aspect here. It's a paper that shows Apple's strategy, and it's going to have an effect.
@CatherineFlick
+100 very important, relevant, and impactful
It likely had to be self-published because it is impactful and novel in its stronger claims. LLM big tech funds research that ignores the obvious, and those bought researchers hate it when others point out that their "emperor" is naked.
Same reason Google squashed Timnit Gebru's "Stochastic Parrots" paper at the very beginning of all this, by firing her and her team co-lead (Margaret Mitchell) and disbanding the AI ethics team.
@wizzwizz4 @tessarakt
@CatherineFlick I miss when we treated chatbots like silly flawed games
@Lyle @CatherineFlick I wonder how ChatGPT is at rolling arbitrarily-sided dice on IRC...
@CatherineFlick Sounds like a newly trained LLM wrote that.
@CatherineFlick Is anyone surprised by this? GenAI is just statistics. There is no intelligence in there.
@CatherineFlick @Wolfshund most people even tangentially informed in relation to the tech sector know this and are unsurprised. The problem is that we have utter shitehawks like Altman breathlessly hyping their tech to people who have never been informed of the reality, coupled with so-called “journalists” doing his PR for him
@mark_f_lynch
And researchers funded by Microsoft, Google, Facebook, and even Apple are paid to corroborate Altman's nonsensical claims.
@CatherineFlick @Wolfshund
@Wolfshund
Researchers pretend to be surprised by it because 90% of published papers are flawed studies (data leakage) that imply the opposite: that intelligence can emerge if you scale up. All that emerges is better memorization.
@CatherineFlick
@hobs @CatherineFlick I have not read a single scientific paper that suggests scaling genAI would lead to general intelligence. I have not even read a decent newspaper suggesting this. It is obviously wrong. The reason the public tends to be awed by the current LLM feats is that they can appear intelligent. Good enough for most; just look at current elections….
@CatherineFlick And they needed a study for that because of course it was not obvious at all that LLMs and genAI are terrible at reasoning.
@Erebus_Amauro @CatherineFlick So are the proponents of "AI".
@CatherineFlick it’s just fancy autocorrect
@CatherineFlick it's the equivalent to developers copying and pasting from stack overflow.
@CatherineFlick I feel like this understanding is implicit in the nature of the technology itself - they predict what’s likely to be the next word based on a big set of words. That’s all. Any “reasoning” is anthropomorphizing the technology.
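For anyone who wants to see how little machinery that claim involves, here is a minimal sketch of next-word prediction using a toy bigram model (the corpus and names are made up, purely illustrative; real LLMs use neural networks over tokens, but the basic idea of predicting the next word from observed statistics is the same):

```python
from collections import defaultdict
import random

# Hypothetical toy "training data", purely for illustration.
corpus = "the cat sat on the mat the cat ate the fish".split()

# Count which words followed each word: a bigram "model".
following = defaultdict(list)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev].append(nxt)

def next_word(prev):
    # Sample a continuation in proportion to how often it followed
    # `prev` in the corpus: no understanding, just observed statistics.
    options = following.get(prev)
    return random.choice(options) if options else None

word = "the"
generated = [word]
for _ in range(5):
    word = next_word(word)
    if word is None:
        break
    generated.append(word)

print(" ".join(generated))
```

The output looks vaguely like the training text because it literally is a reshuffling of the training text; scale that up by many orders of magnitude and it gets much harder to tell the reshuffling from reasoning.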
@StephanieMoore @CatherineFlick For folks who understand the technology, yes. But the general non-technical person does NOT understand this. School children (including teens) already trust LLMs more than other forms of research and reasoning. (Source: anecdotal, via friends who work in schools and friends who are youth mental health professionals.)
@CatherineFlick Exactly in line with what one would predict from their construction.
@CatherineFlick Why did we need a paper from Apple to tell us what junior engineers could have told us on day 1?
@CatherineFlick I cannot imagine what bizarre free radical reactions in the human brain of a computer engineer would make them ever think that a computer program is capable of logical reasoning - someone should explain to them the difference between complicated systems and complex systems. Smoke and mirrors will only get you so far.
@frankcat
Tell that to researchers working on mathematical theorem-proving algorithms. They can demonstrate strong reasoning ability in constrained domains. But big tech stopped pursuing this path when they discovered they could make more money duping the public than proving theorems like Fermat's Last Theorem.
@CatherineFlick
@CatherineFlick Microsoft definitely won't, they've bet the entire company on this garbage and will not weather its collapse well. Google and Facebook and Amazon are actually in somewhat better positions to back away from it, but it still feels like whether they do or not will mostly be a question of executive ego.
@jplebreton yep, classic big CEO energy
@CatherineFlick but humans can't hold and utilise all that data. We need something to provide the knowledge on-demand.
I don't know about the hype aspects but this is damn good technology to pursue.
@marksun do we really need genAI to do this though? Why not just build good search engines that actually work well?
@CatherineFlick I think genAI like ChatGPT functions like such an engine. It is just that we are finding many more uses for them, and it gets harder to define what they are in simple terms.
@marksun @CatherineFlick except a search index can be updated daily or hourly with fresh content. You can't retrain an LLM at that pace just to update it; the costs would be astronomical. This and so many more problems
@agorakit @CatherineFlick indeed fresh content is a thing. But remember there is a lot more data in history than a few hours or days. Still we are free to find ways around issues like you mentioned. It is not compulsory to use the same thing for everything.
@CatherineFlick ...why? With the proper implementation and safeguards, current models can do amazing things (even if they are dumb and cannot reason); good examples are NotebookLM and ChatGPT o1 (try it with math).
@ErikJonker they’re very, very expensive and environmentally destructive toys. Their “amazing things” are not worth the cost, and the safeguards are not sustainable: keeping safeguards up to date will be yet another forever game of whack-a-mole that companies won’t take seriously. They return bad results too often for use as decision-making tools, even with humans in the loop; humans suck at challenging them. Are these harms worth it? Really?
@ErikJonker @CatherineFlick
The chain-of-thought approach that the o1 models have adopted illustrates the diminishing returns of the LLM approach. You now have to wait minutes for it to generate multiple token sequences and pick the best chain in order to get a significant improvement over plain GPT-4. And still the results cannot always be relied upon.
https://youtu.be/PH-qIyqwY4U
Will the o2 models have to generate multiple chains of multiple chains of thought, squaring the thinking time?
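A rough sketch of that "sample several chains, keep the best" idea, with `generate_chain` and `score_chain` as hypothetical stand-ins for a model call and a verifier (OpenAI's actual implementation isn't public, so this is only an illustration of the general best-of-N pattern):

```python
import random

def generate_chain(prompt: str) -> str:
    # Hypothetical stand-in: a real system would sample a reasoning
    # trace (a chain of thought) from the model for this prompt.
    return f"{prompt} -> candidate reasoning #{random.randint(0, 999)}"

def score_chain(chain: str) -> float:
    # Hypothetical stand-in: a real system would score the trace with
    # a verifier or reward model rather than a random number.
    return random.random()

def best_of_n(prompt: str, n: int = 8) -> str:
    # Sample n chains and keep the highest-scoring one.
    # Latency and cost grow roughly linearly with n.
    chains = [generate_chain(prompt) for _ in range(n)]
    return max(chains, key=score_chain)

print(best_of_n("What is 17 * 24?"))
```

Since cost scales with the number of chains sampled, nesting it (chains of chains) would multiply the waiting time again rather than fixing the underlying reliability problem.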
@bornach @CatherineFlick …for me NotebookLM is an example of why those things aren’t necessarily a problem with the right application and the right context
@bornach @CatherineFlick kinda reminds me of the late days of 3dfx, the once-world-beating 3D graphics accelerator card company in the late 90s. for a while their cards were the best around, but they didn't really improve their architecture and the final card they didn't quite manage to ship was this absurd monster with 4 of the same chip crammed onto a single card. meanwhile, Nvidia's chips were doing more per clock and getting better architecturally with each generation.
@bornach @CatherineFlick (fwiw, i don't think there is an Nvidia analog here, i agree with various experts that LLMs are going to run their course and have already more or less hit the limits of what they can do, and it doesn't seem like these companies have anything else up their sleeves at the moment.)
@CatherineFlick as much as I don't like corporate face-saving, I'm relieved to see Apple being unenthusiastic about genAI. I use a Mac for music work, and I would not be happy to see "AI" features keep getting incorporated into the OS. It would be extra cool if they backpedal for macOS 16 and remove or at least lessen the "Apple Intelligence" features.
That said, I've been hearing people like Timnit Gebru use the phrase "stochastic parrot" for years, so this paper is definitely funny
@CatherineFlick the scary thing is seeing people post responses from Grok or Google’s “AI Overview” as evidence of whatever they’re arguing online about. People who aren’t in the IT field are already treating these tools’ hallucinations as gospel.
@CatherineFlick this is unfair to puppies who are all the bestest and smarterest.
@brunogirin @CatherineFlick Puppies could probably write a smarter #RSS #podcast #feed poller than the current #Apple shambles!
@CatherineFlick could this obvious paper be the end of AI?
@f4grx lol I doubt it
@CatherineFlick I've said it before - LLMs are just parrots with large dictionaries.
@CatherineFlick i hope openAI responds with a press release that literally contains the phrase “stop FUDding my projects”
@CatherineFlick Finally, people are beginning to realize that just because a husky can vocalize "I love you" doesn't mean it even understands the concept of verbally expressing affection. It just does what it was trained to do.
https://www.youtube.com/watch?v=qXo3NFqkaRM
@CatherineFlick "AI Trained To Do One Thing Doesn't Do Another Thing, Experts Discover. People Shocked."
@CatherineFlick There was real debate that it did?
@crazyeddie
Yes, 90% of LLM studies purported to find evidence of reasoning, with papers like "chain of thought reasoning", "discrimination of faithful and unfaithful error recovery", "flow of thought reasoning", and "cooperative LLM agents".
@CatherineFlick
@hobs @CatherineFlick What's the purported mechanism?
@CatherineFlick dumb, but genuine question here..
how is this different from what developing children do?
@elan @CatherineFlick Think of it this way. Imagine a group of children that learned to talk, and then about at age 4 they never had any interaction with any more outside info- no school, no parents, no community. Not a perfect analogy, but what came to mind.
@scrummy @elan @CatherineFlick the children didn't just learn to talk. They memorised much of the public knowledge.