When we were working on Adaptxt, the predictive texting tool for Android phones, we had to do a bit of statistical analysis on the frequency of Gaelic words. The result looks a bit like this:
The number tells you how many times a particular word form (not a lemma) occurs in a Gaelic web corpus. We’re posting the data here just in case it’s of any use to other people too.
A thing or two about it:
- These figures are based on written online Gaelic, so they’re a bit skewed, as any written corpus is when compared with a spoken one. For example, the word pàrlamaid occurs much more frequently than you’d expect in conversation.
- This is a frequency list of individual word forms, not lemmata. That means that both gàidhlig and ghàidhlig, say, crop up in the list. That’s both good and bad news. The advantage is that you can see how common each form is (for example, that chaidh is more common than deach); the drawback is that it doesn’t tell you how common the verb rach is overall.
- Before we analysed the frequencies, we ran a spellchecker over the corpus to filter out typos and English words. So even though words like Coop will often occur in Gaelic, spoken or otherwise, they’re not in there.
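
If you want to try something similar on your own texts, here is a rough Python sketch of the idea described above: filter tokens against a word list, count the surviving word forms, and optionally roll forms up into lemmata. It is not the code we used. The word list `KNOWN_WORDS`, the mapping `FORM_TO_LEMMA`, the tokenising pattern, the lowercasing step and the sample string are all assumptions made up for illustration; a real spellchecker lexicon would stand in for the small set here.

```python
import re
import unicodedata
from collections import Counter

# Hypothetical stand-in for a spellchecker's lexicon (assumption, not our real list).
KNOWN_WORDS = {"chaidh", "deach", "gàidhlig", "ghàidhlig", "pàrlamaid"}

# Hypothetical form-to-lemma mapping; only the two forms of "rach" named above.
FORM_TO_LEMMA = {"chaidh": "rach", "deach": "rach"}

# Runs of letters (accented ones included), optionally joined by a hyphen or apostrophe.
TOKEN_RE = re.compile(r"[^\W\d_]+(?:[-'’][^\W\d_]+)*")


def form_frequencies(text, known_words):
    """Count word forms, keeping only tokens found in known_words."""
    # NFC keeps letters such as 'à' as one code point so the regex sees them as letters.
    text = unicodedata.normalize("NFC", text)
    tokens = (token.lower() for token in TOKEN_RE.findall(text))
    return Counter(token for token in tokens if token in known_words)


def lemma_frequencies(form_counts, form_to_lemma):
    """Sum form counts per lemma; unmapped forms count as their own lemma."""
    lemma_counts = Counter()
    for form, n in form_counts.items():
        lemma_counts[form_to_lemma.get(form, form)] += n
    return lemma_counts


# Made-up token string for demonstration only; not taken from the corpus.
sample = "Chaidh Coop chaidh deach Gàidhlig Ghàidhlig pàrlamaid"

form_counts = form_frequencies(sample, KNOWN_WORDS)
lemma_counts = lemma_frequencies(form_counts, FORM_TO_LEMMA)

for form, n in form_counts.most_common():
    print(form, n)
```

In this sketch gàidhlig and ghàidhlig stay separate in `form_counts`, Coop is dropped because it is not in the word list, and chaidh and deach are only added together once you go through `lemma_frequencies`.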