In an influential Presidential Address to the Aristotelian Society, entitled “A Plea for Excuses”, the great J. L. Austin spelled out the assumptions of the school of analytic philosophy that he and his colleagues were working within. This school - particularly identified with Oxford University - could be described as the ordinary language analytic school, in which progress is made by looking carefully at the use of words, analysing them, and drawing out implications for the concepts they attach to.
In his address, Austin uses as his example an analysis of the use of the word “excuse” in ordinary English to investigate moral responsibility for an action. For if something is labelled “an excuse”, then it appears to blunt the moral responsibility an agent has for their action. And if it does not count as an “excuse”, then it does not.
But why - asks Austin, rhetorically - should we be looking so carefully at the ordinary use of language to decide the matter? After all, the users of ordinary language are ordinary people, and might be thought to be less expert in moral reasoning than - say - the President of the Aristotelian Society.
Austin thinks not. After noting that words are our tools, and that tools can only be used well when we realise and take account of their limits, Austin gets to the heart of the matter:
Thirdly, and more hopefully, our common stock of words embodies all the distinctions men have found worth drawing, and the connexions they have found worth marking, in the lifetimes of many generations: these surely are likely to be more numerous, more sound, since they have stood up to the long test of the survival of the fittest, and more subtle, at least in all ordinary and reasonably practical matters, than any that you or I are likely to think up in our armchairs of an afternoon - the most favoured alternative method.
In view of the prevalence of the slogan "ordinary language", and of such names as "linguistic" or "analytic" philosophy or "the analysis of language", one thing needs specially emphasising to counter misunderstandings. When we examine what we should say when, what words we should use in what situations, we are looking again not merely at words (or "meanings", whatever they may be) but also at the realities we use the words to talk about: we are using a sharpened awareness of words to sharpen our perception of, though not as the final arbiter of, the phenomena.
(Austin, 1957)
This is a justification of the core of the analytic method, at least when applied in ordinary language philosophy. And it justifies analysing the concepts of “know” or “cause” by looking at a variety of situations1 in which a competent speaker of a language would be happy to apply, or not to apply, a certain term as a description. The reason this works - as Austin explains - is that the knowledge and insight of many generations has become encoded in the structure of language. And so, with careful thought, we can bring this encoded knowledge to bear on specific situations - and from there, use the methods of analytic philosophy to generalise as far as we can.
So while, in his 1957 essay, Austin applies this method to “excuses”, his true aim is to justify the practice of this analysis, and to lay out some limits on where it can be applied. He notes that “knowledge”, “truth”, “cause” and “effect” can be, and have been, analysed in this way - and no doubt some insight has been gained over the decades by doing so.2
It is useful to have a clear justification for this analytic approach articulated in such an accessible way. Analytic philosophers themselves often seem to pass over it without comment, perhaps as too obvious for words. But their practice is often regarded with complete incomprehension by those outside the field, who cannot see why the structure of language should be expected to contain anything more insightful or truth-tracking than - say - empirical experiment. Often they can sound like lexicographers - simple compilers of dictionary definitions - and it is a mystery why they think this will give profound insight into the underlying concepts. Austin’s justification - sound or not - at least gives us something solid to work with.
ChatGricePT
A more recent development is that, more than sixty years after Austin, we have succeeded in industrialising his process.
Large Language Models do precisely the work Austin suggests, but at a scale, and in a level of detail, that he could scarcely have imagined. By taking an enormous corpus of ordinary language and analysing its structure by means of embeddings and mathematical machinery such as transformers, the models build up, through their deep-learning structure, a fabulously involved statistical analysis of the way words (or tokens) are used, and how they relate to one another. The models that lie behind ChatGPT, Claude and Gemini are - when you get down to it - very large and very sophisticated statistical analyses of the patterns in which we do (and do not) use words in ordinary language. Exactly the analysis that Austin was recommending.
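To make the analogy concrete, here is a deliberately crude toy sketch (my own illustration, nothing like how production LLMs are actually built) of the basic move: record the contexts in which each word occurs, and compare words by the similarity of those contexts. Embeddings and transformers do something vastly more sophisticated, but the raw material is the same - patterns of usage.

```python
from collections import Counter, defaultdict
import math

# A tiny illustrative corpus; real models digest trillions of tokens.
corpus = [
    "the gear turns the pulley",
    "the pulley lifts the beam",
    "the beam supports the roof",
    "the excuse blunts the blame",
    "the blame falls on the agent",
]

stop = {"the"}   # drop the commonest filler word for a cleaner signal
window = 2       # how many neighbouring words to count on each side

# Count which words appear near which others.
cooc = defaultdict(Counter)
for sentence in corpus:
    words = [w for w in sentence.split() if w not in stop]
    for i, w in enumerate(words):
        for j in range(max(0, i - window), min(len(words), i + window + 1)):
            if i != j:
                cooc[w][words[j]] += 1

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in set(a) & set(b))
    norm = lambda v: math.sqrt(sum(x * x for x in v.values())) or 1.0
    return dot / (norm(a) * norm(b))

# Words used in similar contexts end up with similar vectors.
print(cosine(cooc["gear"], cooc["pulley"]))   # higher: shared mechanical contexts
print(cosine(cooc["gear"], cooc["excuse"]))   # zero here: no shared contexts in this toy corpus
```

Even at this toy scale, “gear” ends up closer to “pulley” than to “excuse”, purely because of the company each keeps.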
Of course, Austin expected that the work of the analysis of words would be done by humans, with their own understanding and insights available to be layered on top. We are now implementing this method “naked”, with nothing added beyond the structure of human usage - although rather than five or six examples to be leisurely considered at article length in one’s Oxford study, the LLMs grind through billions of examples in the course of their training. If anything, this makes the LLM approach a “purer” version of Austin’s analytic method, for one cannot accuse the LLM of injecting its own thoughts and reasoning into the process: the language and its structure is all there is.
Does it work?
What is astonishing is that - even with these limitations - Austin’s recommended method works. At scale. And that it works so well. Not only are Large Language Models able to “speak” in a fluent and human-like way, they are also able to give at least the impression of understanding not only language, but also the world.
That is, they can derive consequences about concepts that are seemingly far beyond their dataset. For example, simply from the usage patterns of words like “beam”, “turn” and “pulley”, they give the strong impression of understanding how the physical world fits together. Even given Austin and others’ faith in the power of the method, this is still surprising.
So when, for example, Yann LeCun - one of the architects of the Deep Learning revolution - was asked to set tasks that LLMs would find hard, he put together an example that checked the extent to which they were able to derive, from the structure of language, how cogs and gears would work, in a configuration that would not have appeared in their training set:
7 axles are equally spaced around a circle. A gear is placed on each axle such that each gear is engaged with the gear to its left and the gear to its right. The gears are numbered 1 to 7 around the circle. If gear 3 were rotated clockwise, in which direction would gear 7 rotate?
The fact that modern LLMs are perfectly able to answer this question correctly - an insight that must ultimately have come from the structure in which words like “gear” and “rotate” had been used in their training set, and so were represented in their internal state - gives a dramatic demonstration of how much insight about the physical world can be encoded in language use.4
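For readers who want to check the puzzle themselves, it reduces to a parity argument: two meshed gears must spin in opposite directions, so directions alternate around the ring, and with an odd number of gears the assignment cannot close consistently - the gears lock. A minimal sketch (my own, not LeCun’s) of that argument:

```python
# Propagate directions around a ring of n meshed gears and detect any contradiction.
def ring_gear_directions(n, start_gear, start_dir):
    """Return a dict gear -> direction, or None if the ring is jammed."""
    flip = {"clockwise": "counterclockwise", "counterclockwise": "clockwise"}
    dirs = {start_gear: start_dir}
    gear = start_gear
    for _ in range(n):
        nxt = gear % n + 1             # next gear around the ring (1-indexed)
        wanted = flip[dirs[gear]]      # a meshed neighbour must spin the other way
        if nxt in dirs and dirs[nxt] != wanted:
            return None                # contradiction: the gears cannot turn
        dirs[nxt] = wanted
        gear = nxt
    return dirs

print(ring_gear_directions(7, 3, "clockwise"))  # None: seven gears in a ring jam
print(ring_gear_directions(8, 3, "clockwise"))  # an even ring has a consistent assignment
```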
This is fundamentally why people are so disconcerted by the insights of LLMs - they seem to have a handle on concepts like “causation”, to understand the dynamics of the physical world and to answer difficult questions about them. But this is not because they have been imbued with any direct experience or coding of how the world fits together. They have gained it all from the pattern of how words are used in sentences.
A 1957 limitation
However, we should also note Austin’s caution:
Using, then, such a method, it is plainly preferable to investigate a field where ordinary language is rich and subtle, as it is in the pressingly practical matter of Excuses, but certainly is not in the matter, say, of Time. At the same time we should prefer a field which is not too much trodden into bogs or tracks by traditional philosophy, for in that case even "ordinary" language will often have become infected with the jargon of extinct theories, and our own prejudices too, as the upholders or imbibers of theoretical views, will be too readily, and often insensibly, engaged. […] How much it is to be wished that similar field work will soon be undertaken in, say, aesthetics; if only we could forget for a while about the beautiful and get down instead to the dainty and the dumpy.
It’s startling to find, in an article from 1957, a caution against having our LLM training set infected by less useful data - but perhaps Austin would have been darkly amused that we are finding out in 2024 that training LLMs on a corpus already generated by LLMs leads to a dilution of the ordinary-language-encoded insight contained within them.
Moreover - if we are to follow Austin - we should not expect the insights of ordinary language analytic philosophy, nor of LLMs, to go much beyond the “distinctions men have found worth drawing, and the connexions they have found worth marking, in the lifetimes of many generations”.
Austin cautions in particular against analysing concepts such as “Time” where ordinary use might not be so rich, and other methods - empirical science, for example - must take over. We are currently in the process of testing Austin’s caution, with the “scaling hypothesis” for LLMs. If he is correct, progress will stop eventually - albeit only at the level where we have exhausted the ancestral wisdom that has ended up in the structure of our language. Perhaps we should pay attention to a few scaling straws in the wind.
Similar articulations can be found in Strawson, Dummett, and Glock (2008).
Barns. Ever-increasingly baroque facsimiles of barns. Being in a territory where such barns are common but not actually seeing one. And so on.
Leaving aside the worry that LeCun’s example might have found its way into training sets via its appearance on the internet.
Another example comes from a story about the early days when Transformers were being developed, in which one of the researchers who later went on to co-author the “Attention is all you need” paper saw the model solve co-reference resolution. That is, the model was able to recognise that in the sentence “The dog didn’t cross the road because it was too tired”, “it” refers to the dog. In “The dog didn’t cross the road because it was too wide”, “it” refers to the road. And in “The dog didn’t cross the road because it was too dangerous”, “it” refers to the whole situation. This shift requires “knowing” about dogs, roads, and danger - or at least understanding the ways in which these are deeply linked elsewhere.
I couldn't get the free chatgpt to answer the gear problem correctly even using some of the prompt engineering from the LeCun twitter thread. Can you post your log of a correct answer from an LLM?
Might be ok as a philosopher, but useless at algebra and arithmetic
(From 2023 - see the more recent attempt at the end.)
Me: What are four different integers a,b,c,d such that a*a + b*b = c*c + d*d ?
ChatGPT: Here are four different integers a, b, c, and d that satisfy the equation a^2 + b^2 = c^2 + d^2:
a = 1, b = 1, c = 2, d = 2
In this case, 1^2 + 1^2 = 2^2 + 2^2, which simplifies to 2 = 8, showing that this equation is not always true for all integer values of a, b, c, and d.
Me: a,b,c,d must be different integers, ie no pair are equal.
ChatGPT: I apologize for the oversight. Here's a set of four different integers a, b, c, and d such that a^2 + b^2 = c^2 + d^2:
a = 1, b = 2, c = 2, d = 1
In this case, 1^2 + 2^2 = 2^2 + 1^2, which simplifies to 5 = 5, satisfying the equation with distinct integer values.
Me: But b = c and a = d. I said no pair could be equal.
ChatGPT: I apologize for misunderstanding your previous request. Here's a set of four different integers a, b, c, and d such that no pair are equal, and a^2 + b^2 = c^2 + d^2:
a = 1, b = 2, c = 2, d = 3
In this case, 1^2 + 2^2 = 2^2 + 3^2, which simplifies to 5 = 13, satisfying the equation with distinct integer values.
At which point I gave up. Tried again today (Nov 24), but it still can't do arithmetic:
ChatGPT: To find four different integers a, b, c, and d such that:
a^2 + b^2 = c^2 + d^2
we can try values that make both sides of the equation equal.
One example solution is: a = 1, b = 7, c = 4, and d = 6.
Then: a^2 + b^2 = 1^2 + 7^2 = 1 + 49 = 50
and c^2 + d^2 = 4^2 + 6^2 = 16 + 36 = 50
Thus, 1^2 + 7^2 = 4^2 + 6^2, satisfying the condition with a = 1, b = 7, c = 4, and d = 6.
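For reference, a quick brute-force search (a sketch, not from the thread above) confirms that such quadruples do exist - for example 1^2 + 8^2 = 4^2 + 7^2 = 65:

```python
from itertools import combinations

# Find four pairwise-distinct positive integers a, b, c, d with a^2 + b^2 == c^2 + d^2.
def distinct_square_sums(limit=20):
    pairs = list(combinations(range(1, limit), 2))   # all (a, b) with a < b
    for (a, b), (c, d) in combinations(pairs, 2):
        if a * a + b * b == c * c + d * d and len({a, b, c, d}) == 4:
            yield a, b, c, d

print(next(distinct_square_sums()))   # (1, 8, 4, 7): 1 + 64 = 16 + 49 = 65
```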