Tools for developers

Gaelic word frequencies

When we were working on Adaptxt, the predictive texting tool for Android phones, we had to do a bit of statistical analysis on the frequency of Gaelic words. The result looks a bit like this:

36451 a
20934 an
13882 air
12409 na
10009 tha
9602 agus
7180 ann
5086 gu
4434 e
4235 airson
4004 o
3841 am
3408 seo
3133 le
3014 aig
2757 gàidhlig
2720 is
2621 bha
2439 do
2354 mu
2353 ri
2280 mar
2265 de
2137 mi
2034 iad
2034 nan
1961 anns
1782 bho
1737 eile
1704 bhith
1654 bheil
1594 sin
1559 chaidh
1472 sinn

The number tells you how many times a particular word form (not a lemma) occurs in a Gaelic web corpus. We’re posting the data here just in case it’s of any use to other people too.
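If you want to work with the list programmatically, here is a minimal Python sketch. It assumes the data has been saved as plain text with one "count word" pair per line, as in the excerpt above; the actual file may be laid out differently, and the function names parse_frequency_lines and relative_share are made up for illustration.

```python
def parse_frequency_lines(lines):
    """Turn 'count word' lines into a list of (word, count) pairs."""
    entries = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Assumed layout: the count comes first, then the word form.
        count_text, word = line.split(maxsplit=1)
        entries.append((word, int(count_text)))
    return entries


def relative_share(entries):
    """Share of each form among the entries given (not the whole corpus)."""
    total = sum(count for _, count in entries)
    return {word: count / total for word, count in entries}


# The first five rows of the excerpt above.
sample = ["36451 a", "20934 an", "13882 air", "12409 na", "10009 tha"]
entries = parse_frequency_lines(sample)
shares = relative_share(entries)
```

Note that the shares are relative to whatever rows you pass in, so on a five-row sample they say nothing about the corpus as a whole.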

A thing or two about it:

  • These figures are based on written online Gaelic, so they’re a bit skewed, as any such corpus is (i.e. a written corpus as opposed to a spoken one). For example, the word pàrlamaid occurs much more frequently than you’d expect in conversation.
  • This is a frequency list of individual word forms, not lemmata. That means that (for example) both gàidhlig and ghàidhlig crop up in the list. That’s both good and bad news. The advantage is that you can see how common each form is (for example, that chaidh is more common than deach); the drawback is that the list doesn’t tell you how common the verb rach is overall.
  • Before we analysed the frequencies, we ran a spellchecker over the corpus to filter out typos and English words. So even though a word like Coop will often occur in Gaelic, spoken or otherwise, it’s not in there.
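
To make the last two points concrete, here is a rough Python sketch of counting word forms, filtering them, and then rolling forms up into lemmata. It is not the pipeline we used: a plain set of accepted words stands in for the spellchecker, the form-to-lemma table is a tiny hand-made example, and the names count_forms and count_lemmata are invented for illustration. The tokeniser is also simplified and would split forms written with hyphens or apostrophes.

```python
import re
import unicodedata
from collections import Counter


def count_forms(text, known_words):
    """Count word forms, keeping only those found in known_words."""
    # NFC keeps accented letters such as 'à' as single characters.
    text = unicodedata.normalize("NFC", text).lower()
    # Runs of letters only; a deliberate simplification.
    tokens = re.findall(r"[^\W\d_]+", text)
    return Counter(t for t in tokens if t in known_words)


def count_lemmata(form_counts, form_to_lemma):
    """Add up form counts per lemma; unmapped forms count as themselves."""
    lemma_counts = Counter()
    for form, count in form_counts.items():
        lemma_counts[form_to_lemma.get(form, form)] += count
    return lemma_counts


# Toy token sequence, not real corpus data.
toy_text = "chaidh deach chaidh gàidhlig ghàidhlig coop"
known_words = {"chaidh", "deach", "gàidhlig", "ghàidhlig"}  # stand-in word list
form_to_lemma = {
    "chaidh": "rach",
    "deach": "rach",
    "ghàidhlig": "gàidhlig",
}

form_counts = count_forms(toy_text, known_words)
lemma_counts = count_lemmata(form_counts, form_to_lemma)
```

Here form_counts keeps chaidh and deach apart, which is what our list does, while lemma_counts adds them together under rach. The token coop is dropped because it is not in the word list.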

The file is here (opens best in the Gaelic version of LibreOffice).
