DAVE'S LIFE ON HOLD

Occasionally Forth

Over the past few weeks, I have found myself typing Python, Perl, or Ruby less and less. Sometimes I fire up an interactive REPL in one of these environments, just to prove out a theory. Sometimes I want to parse some strings, examine some binary data in a file, or use it as a fancy calculator. I have a host of programs I can use from the command line: strings, hexdump, bc. But sometimes the things you want to do are more experimental than dd + hexdump + less allow for.

The motivation for this change was something small: I added a convenience word to Gforth that makes mmap easy to use:

  s" filename " mmap

Where mmap is defined as:

  : mmap ( string -- length address ) ... ;

Where the ... is replaced by some C code I had written for the Firth bootstrap environment. What this did was make accessing the contents of a file from Forth dirt simple. With two variables squirreling away the result, it becomes trivial to loop over the contents of the file, or poke around with "@" and ".". And since, by default, I set the mapping to request:

  PROT_READ|PROT_WRITE|PROT_EXEC

there is very little one can't do with memory mapped into those pages. And what I started to find was that most of the things I would do in Perl or Python just weren't necessary.
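
To make this concrete, the two-variable pattern looks something like the following. This is a minimal sketch assuming the mmap word above; the filename and variable names are my own examples:

  variable flen   variable fbase        \ squirrel away the mapping
  : map-notes ( -- ) s" notes.txt" mmap fbase ! flen ! ;

  \ walk the mapped file one byte at a time, printing each value
  : dump-notes ( -- ) flen @ 0 ?do fbase @ i + c@ . loop ;

From there, "@" and "." work directly against the mapped addresses.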

The crux of the issue is data types. In most "scripting" languages, there is native support for strings, arrays, and hash maps. Numbers are supported, but are typically not terribly efficient as they are either boxed or tagged, and a considerable amount of the run time when performing calculations is dedicated to maintaining the facade. Languages like Perl and Python are useful for these sorts of ad hoc exploratory jobs because they translate simple strings of text into more complex structures. These structures then shape everything one does in the language.

Forth doesn't have that out of the box. By default it comes with rather limited string manipulation and raw buffers of memory. You can build anything you want, including Perl-style data structures, but it never feels quite like the right thing to do. It's not that those structures aren't useful, only that there are other ways to think about things.

Take, for example, my mmap from earlier. Recently, I processed all of the words I've typed in the past few years. The total number of discrete words I have used was less than 20k. This is significantly less than the 65k words one could index with a 16-bit table. With no word that I have used being greater than 28 characters in length, I can easily build a table of 32 * 20k = 640k bytes holding all the words I am likely to use. Even if I go on a vocabulary spree, using a new word a day, I still wouldn't hit the 65k limit in the next decade; I am exceedingly unlikely to exhaust this dictionary in my lifetime.
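
The table itself can be as dumb as a fixed-width array. A minimal sketch, under my own assumptions about layout (32-byte rows, each holding a counted string):

  create word-table 20000 32 * allot    \ 640k: room for 20k words
  variable #words                       \ entries used so far
  : row ( n -- c-addr ) 32 * word-table + ;
  : add-word ( c-addr u -- ) #words @ row place  1 #words +! ;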

What this means is I really only need one sparse table, and 16 bits per word, to represent everything I am likely to write. This already compresses everything I have written over the past 3-4 years to a very manageable image file, which I can easily mmap with a single word. It also means that with another word, I can convert any word I type to an index value. At that point the value of Perl or Python starts to diminish dramatically. While the documents I'm editing aren't ASCII, I can analyze the structure and contents of the data much more easily. Building a search index for the document is trivially simple, and exporting documents to UTF-8 HTML isn't any effort either.
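
That word-to-index conversion can start out as a naive linear scan over the table sketched above; the names here are my own, and a real version would want the table sorted or hashed:

  : word>index ( c-addr u -- n )        \ -1 if the word is unknown
    #words @ 0 ?do
      2dup i row count compare 0= if 2drop i unloop exit then
    loop 2drop -1 ;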

Beyond the whole-word encoding, I have a spell checker which autocompletes words I am likely to type as I type them, and can accurately predict my own word choices. I can even represent source code this way, and in fact a considerable number of words are punctuation marks. Representing " " as 16 bits is no more costly, but representing "function" as 16 bits is an awesome savings, and makes it easy to search a document for every occurrence.

One thing I do a lot of is grabbing other people's documents online using curl and parsing them for useful info. I have found that maintaining a second, larger index of a few MB is very handy for these docs. In these cases I increase the index size to 24 bits. For parsing HTML, < and > denote a word boundary and work like whitespace. Having all of your HTML tags as 24-bit values makes even the short tags a reasonable trade-off; you lose a bit on words like "a" and "I", but those losses are pretty small compared to the savings on the longer tags. This makes balance checking cheap, as well as extracting plain text from the document.
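
The delimiter rule is the only unusual part; a sketch, with names of my own choosing, that treats angle brackets like whitespace:

  : delim? ( c -- f )                   \ true for whitespace, <, or >
    dup bl <=  over [char] < =  rot [char] > =  or or ;
  : skip-delims ( c-addr u -- c-addr' u' )
    begin dup 0> if over c@ delim? else false then
    while 1 /string repeat ;
  : token-end ( c-addr u -- c-addr' u' )
    begin dup 0> if over c@ delim? 0= else false then
    while 1 /string repeat ;

Alternating skip-delims and token-end walks a buffer token by token, with tags and plain words falling out the same way.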

Juggling a 24-bit packed document becomes a little tricky, but a single "expand" command can pad these numbers out to 32-bit representations which are quick to read and manipulate in registers. And this is the thing I started to notice: once I stopped burning cycles trying to maintain a facade, I stopped worrying about the facade. One representation of a concept or word can be better or worse depending on the context. In practically every application I write, I am finding 16-bit indexes into a dictionary table to be superior to flat text. Editing a word swaps out one 16-bit value for another. Adding a word to the middle of a document shifts each following word up two bytes in memory, and copy and paste select whole words.
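
The expand step itself is just unpacking three bytes at a time into full cells. A rough sketch, assuming little-endian byte order within each packed entry and names of my own invention:

  : c24@ ( c-addr -- u )                \ fetch one 24-bit index
    dup c@  over 1+ c@ 8 lshift or  swap 2 + c@ 16 lshift or ;
  : expand ( src dst n -- )             \ unpack n 24-bit entries into cells
    0 ?do
      over i 3 * + c24@                 \ the i'th packed value
      over i cells + !                  \ stored as a full cell
    loop 2drop ;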

Even linked-list representations are easier, as I can guarantee no list ever appears beneath the 64k barrier in addressing. Basically, I can do cons and get tagging for free by means of a simple "<" check, meaning I can represent all the structured data I'd like without the overhead of tag bits. But what about numbers? Why would I want to store numbers in linked lists? I have perfectly nice native number support, with double-precision integer math operations, and the ones I've typed are already in my dictionary. (Hint: even in config files we use very few discrete values.)
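
A tag-free cons then needs nothing more than allocation and a comparison. A sketch under the assumption from the text that all pairs live above the 64k boundary, with hypothetical names:

  : cons ( x y -- pair )
    here >r 2 cells allot               \ grab two cells for the pair
    r@ cell+ !  r@ !  r> ;
  : car ( pair -- x ) @ ;
  : cdr ( pair -- y ) cell+ @ ;
  : pair? ( n -- f ) 65535 u> ;         \ pointers above 64k, word indexes below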

So as my vocabulary in Forth grows, and my useful database of words grows, and my documentation becomes better indexed and searchable, I find myself powering up Perl and Python less, and Forth more. Maybe it is because I can search docs at the ok prompt, or don't have to spell correctly. Maybe it is because I can read binary data, or code loops in assembler when I need speed, without firing up GCC. Maybe it is because, when it comes right down to it, I just have less use for regular expressions with more regular data. Or maybe it is just because I'm tired of slurp, and have better ways to read a file.