Words words words

This past week I was on vacation, and found myself with about 4 hours of free time in which I could write a little program that I've thought about for years. It was a simple idea. Assuming that I have learned about 2000 words per year on average since I was a child, my vocabulary is probably around 70k words right now. On that note, I doubt very much that I use even a small fraction of that when writing on a regular basis, but exactly what fraction that amounts to, I could not say. Also, as I have a particular style of writing, and a tendency to link words together in a peculiar sort of way, if I mapped out all of the arrangements of words that I use on a regular basis, I should be able to discover a sort of fingerprint for what makes my writing my own. Having a model of English syntax as understood by me would allow me to explore the bigger questions about not only my writing but also that of others.

Consider the fact that while one can string any two words together in order, only a small fraction of those orderings make sense. While English may or may not have 1,000,000+ words, and the OED 600,000 definitions, and the average American a basic grasp of only 40k of those, the number of combinations which make sense might be only a few orders of magnitude greater than the number of original words. While two words may make sense in isolation, "lol" and "cats", it takes a peculiar context to infer meaning from "lol cats", and adding the word "spork" to that list probably makes it even harder to grok "lol spork cats". Mind you, even Google Images struggles to combine these concepts into a single image. If one were to consider the number of possible uses of a word like "antidisestablishmentarianism" outside of discussions of the longest word in the English language, or perhaps discussions about the social and political structure of English society in the 19th century, one would invariably draw a complete and utter blank. Certainly it is hard to wedge into a conversation about much of anything really. So we can safely assume that most of the possible combinations of "x antidisestablishmentarianism" and "antidisestablishmentarianism y" will simply fail for nearly all x and y in the English language.

So taking these two assumptions, I put together a program that could parse through all of my writings for the past few years and determine how frequently each word occurs and the orders in which the words occur. For this blog, from 2009 to 2010, including words made up for programming examples, I used a total of 10,116 distinct words. Of those words, only 1,254 were used 10 or more times. The most frequently used words are:

888 it
890 for
1180 that
1433 in
1820 is
2195 and
3036 to
3087 of
3114 a
4600 the

And none of these are really all that surprising. Where things get more interesting is when you skip past all of the articles, pronouns, prepositions, and conjunctions, and end up in the realms of nouns and verbs. If you look at an arbitrary set of words from this list, say those in the 120s-130s in terms of frequency (i.e. words I use in nearly every post at least once):

123 method
125 things
129 does
129 model
131 may
132 build
133 application
135 there
136 design
137 using

It almost makes sense. Something like this can only occur if one writes a lot about Object Oriented Application Design. It is also worth noting that I've mentioned the programming language C almost as many times as I've used the word "no":

140 c
141 no
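Counts like the ones in the lists above could be produced along these lines. This is only a minimal sketch, not the program I actually wrote: the `tokenize` and `word_frequencies` names, the letters-and-apostrophes tokenizer, and the two sample posts are all assumptions made up for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and pull out runs of letters and apostrophes.

    Assumption: digits and punctuation are simply dropped.
    """
    return re.findall(r"[a-z']+", text.lower())


def word_frequencies(texts):
    """Count how often each distinct word occurs across many texts."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return counts


# Hypothetical stand-ins for blog posts.
posts = [
    "The model is the design of the application.",
    "A method does things; the method may build the model.",
]
counts = word_frequencies(posts)

# Print the most frequent words in ascending order, like the lists above.
for word, n in reversed(counts.most_common(10)):
    print(n, word)

# Words used at least twice (the post uses a threshold of 10 on real data).
frequent = sorted(w for w, n in counts.items() if n >= 2)
```

Here `len(counts)` is the number of distinct words, and `frequent` is the equivalent of the "used 10 or more times" subset with a smaller threshold.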

Even a nonsense word like "foo" has appeared 60 times since I started writing a blog in 2009. Taking into consideration that I tend to talk about programming languages a lot, some come up far more often than others:

11 ruby
23 python
33 perl
38 erlang
39 lisp
52 java
54 smalltalk
65 forth
322 javascript

Which is not terribly surprising, since most of the code mentioned in this blog is JavaScript, and I have linked in the past to a couple of implementations of Forth written in JavaScript. Some earlier posts go into a considerable amount of detail about experiments which tried to identify the structure of a then-current Smalltalk image as well. And let's not forget I often use Java as a negative example of best technique. So what happens if we look at how I typically use a word like Erlang in a sentence? Erlang is often preceded by one of:

a, in, of, to, and, about, problems, more, each, like, safe, than, makes, meets, implemented, complicated, because, industry, networked, behind

And is followed by one of:

apps, parallel, in
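The "preceded by" and "followed by" lists above boil down to recording every adjacent pair of words. A rough sketch of that idea follows; the function names and sample sentences are invented for illustration, and it naively lets pairs span sentence boundaries within a text.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and pull out runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def neighbor_maps(texts):
    """Return (preceded_by, followed_by), each mapping a word to a set of neighbors."""
    preceded_by = defaultdict(set)
    followed_by = defaultdict(set)
    for text in texts:
        words = tokenize(text)
        # Walk over each adjacent (left, right) pair of words in order.
        for left, right in zip(words, words[1:]):
            followed_by[left].add(right)
            preceded_by[right].add(left)
    return preceded_by, followed_by


# Hypothetical sentences, not taken from the blog.
sample = [
    "Problems in Erlang apps are like problems in Erlang in general.",
    "Safe Erlang makes parallel code simple.",
]
preceded_by, followed_by = neighbor_maps(sample)

print(sorted(preceded_by["erlang"]))
print(sorted(followed_by["erlang"]))

# Words that appear on both sides of "erlang" form a two-node cycle.
cycle = preceded_by["erlang"] & followed_by["erlang"]
```

Using sets drops the pair counts; swapping them for `Counter` objects would keep how often each ordering occurs.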

If we modeled this as a directed graph of my use of the word Erlang, we'd find that, until this article, it has a cycle with the word "in", and the rest is a mash of typical prepositions and context words like "problems", "safe", "implemented", "complicated", "industry", "networked", "apps", and "parallel". It is as if we conjured up a Venn diagram of all the concerns an Erlang hacker would likely be addressing when thinking about Erlang. It becomes obvious that when you start filtering the context words surrounding any of these core concepts, you paint a very clear picture of what language is about. The reason that no two arbitrary words are likely to make sense is that their meaning is tied to a complex external reality which is being modeled by these symbolic expressions. The language we use is constrained by the experiences we have as much as those experiences are constrained by the vocabulary we have to describe them. Hopefully, when I get some more free time, I'll start adding some Graphviz directed graph representations for various concepts that occupy this space. This little experiment has sparked ideas for a whole host of interesting applications of this technology, and I've already turned it loose on Project Gutenberg. Charles Dickens really had an interesting list of quirks.
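As for the Graphviz idea, one way it might look is to emit plain DOT text for a single word and its neighbors. This is a sketch only: `to_dot` is a hypothetical helper, and the neighbor sets are a small subset of the Erlang lists quoted earlier.

```python
def to_dot(word, preceded_by, followed_by):
    """Render one word and its neighbors as Graphviz DOT text (a directed graph)."""
    lines = ["digraph words {"]
    for left in sorted(preceded_by):
        lines.append(f'  "{left}" -> "{word}";')
    for right in sorted(followed_by):
        lines.append(f'  "{word}" -> "{right}";')
    lines.append("}")
    return "\n".join(lines)


# A subset of the words that precede and follow "erlang" in the lists above.
before = {"in", "safe", "networked"}
after = {"apps", "parallel", "in"}

dot = to_dot("erlang", before, after)
print(dot)
```

Saving the output to a file such as `words.dot` and running `dot -Tpng words.dot -o words.png` should draw the graph, with the "in" cycle showing up as a pair of opposing edges.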