Release 5 of the 12dicts word lists

This file describes release 5 of the 12dicts word list package, released on June 3, 2007. Almost no changes have been made to the files from previous editions, and so you should refer to the readme.html document from the previous release for information on them. The only changes to existing files were to correct a small number of embarrassing errors.

The matrix of the lists and their features, updated for release 5, is worth presenting here.

	neol2007	3esl	6of12	2of12	2of4brif	5desk	2+2lemma/2+2gfreq	2of12inf
Size	373	21877	32153	41236	60387	61406	80431	81536
Abbreviations	Y	Y	Y	N	N	N	N	N
Acronyms	Y	Y	Y	N	N	Y	N	N
American English	Y	Y	Y	Y	N	Y	Y	Y
British English	N	N	N	N	Y	N	Y	N
Hyphenations	Y	Y	Y	Y	N	N	N	N
Inflections	Y	N	N	N	Y	N	Y	Y
Names	Y	Y	Y	N	N	Y	N	N
Phrases	Y	Y	Y	N	N	N	N	N

The new lists, in brief, are as follows:

The 2+2lemma list is composed of the words in the 2of12inf and 2of4brif lists, lemmatized. The word "lemmatized" is a rare word, which you will find in none of these lists, but what it means is that this list is formatted as a collection of word sets, each set composed of a headword and some number (possibly zero) of closely related words.
The 2+2gfreq list contains exactly the same words as 2+2lemma, but they have been arranged by frequency groups, using data supplied by Google on the frequency of English words on the World Wide Web.
The neol2007 list contains a number of new and/or trendy words which you might choose to add as appropriate to the other lists, if you are concerned about including the coolest (or is it hottest?) buzzwords of the 21st century.

The 2+2lemma list

The list 2+2lemma.txt contains the words in the 2of12inf.txt and 2of4brif.txt lists, plus a few additional words from 3esl.txt. Also, the new words from the neol2007.txt list (see below) have been added, marked with a + if they would not have otherwise been included. (Marking the new words permits them to be removed if it is preferred for these lists to be in synch with the older 12dicts lists.) Finally, British forms of words in the 2of12inf list not already in the 2of4brif list have been added. Words marked with a % in the 2of12inf list ("Scrabble inflections") have however been omitted, with the result that, despite augmentation from other lists, this list in fact contains fewer words than 2of12inf.txt.

The 2+2lemma list is not formatted as a simple list of words. It is composed of entries of 1 or 2 lines each. The first line contains a headword, and the second line, which is indented if present, contains an alphabetized list of related words. A simple example:

funny
    funnier, funnies, funniest, funnily, funniness
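
For readers who want to process the file programmatically, here is a minimal sketch in Python of grouping its lines into entries. It assumes only what is stated above: a headword line starts in the first column, and the optional second line begins with whitespace. The function name read_entries and the sample data are made up for illustration (in particular, "afoot" is simply assumed to be a headword with no related words), and the real file may have details this sketch does not handle.

```python
def read_entries(lines):
    """Group lines into (headword_line, related_line) pairs.

    Assumption: a related-word line begins with whitespace and a
    headword line does not. related_line is "" when there is none.
    """
    entries = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # ignore blank lines
        if line[0].isspace():
            if not entries:
                raise ValueError("related-word line before any headword")
            head, _ = entries[-1]
            entries[-1] = (head, line.strip())
        else:
            entries.append((line.strip(), ""))
    return entries


# Invented sample in the layout described above.
sample = [
    "funny\n",
    "    funnier, funnies, funniest, funnily, funniness\n",
    "afoot\n",
]
for head, related in read_entries(sample):
    print(head, "|", related)
```

The related-word line is kept as a raw string here, because some entries carry cross-references that need more careful splitting than a plain comma split.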

The list of related words contains three sorts of entries.

1. Inflections.
2. Variant spellings.
3. Words formed with certain suffixes.

In addition to true variant spellings such as "grey" for "gray" and "thru" for "through", item 2 also includes words which, though pronounced differently, are clearly variants of the headword. Thus, "hooray" is considered a variant of "hurrah" (but mere synonyms like "furze" and "gorse" remain independent).

Item 3 is based on a small list of suffixes, producing closely and consistently related words. These suffixes are -ful, -ish, -less, -like, -ly, -more and -ness. -ally is also allowed, if there is no -al word to apply the -ly suffix to. (For instance, "basically" is considered to be derived from "basic", because there is no word "basical".) When one of these suffixes is used in an unusual way, the resulting word is considered independent. For instance, "likely" is not considered to be derived from "like", nor "bashful" from "bash". There are some rather difficult questions here, such as how closely "slavish" is related to "slave", or "sluggish" to "slug". In general, I have chosen the course of least surprise by treating such pairs as independent.
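
The -ally rule can be sketched in Python as follows. This is only an illustration of the rule as stated, not the procedure actually used to build the list, which also involved case-by-case judgment about unusual uses of a suffix. The function name ally_base and the tiny lexicon are hypothetical.

```python
def ally_base(word, lexicon):
    """Return the word an -ally adverb would be filed under, or None.

    Hypothetical helper: prefers an existing -al word (ordinary -ly
    suffix) and falls back to the bare stem only when no -al word exists.
    """
    if not word.endswith("ally"):
        return None
    al_form = word[:-2]   # "musically" -> "musical", "basically" -> "basical"
    if al_form in lexicon:
        return al_form    # plain -ly applied to an -al word
    stem = word[:-4]      # "basically" -> "basic"
    if stem in lexicon:
        return stem       # -ally allowed because there is no -al word
    return None


# Invented mini-lexicon for the example.
lexicon = {"basic", "music", "musical"}
print(ally_base("basically", lexicon))
print(ally_base("musically", lexicon))
```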

Here are some other notes on the determination of what words are related.

Certain uses of the suffixes -ed and -s are treated as inflections, even though technically they are not. Thus, "talented" is treated as derived from "talent", and "optics" from "optic".

Words ending with the suffix -ability/ibility are treated as relatives of the corresponding -able/ible word.

Sometimes, the choice of which variant to treat as the headword is somewhat arbitrary. I have consistently chosen an American spelling over a British spelling here. This has some effect on the number of headwords. I treat "cheque" as a variant of "check", whereas, to an observer with a British bias, they would no doubt be separate headwords.

No distinction is made of different meanings of the same word, even when they are so different that dictionaries list them separately. "wind" the noun and "wind" the verb are considered as a single word, as are "second" the adjective, "second" the noun and "second" the verb.

It may sometimes happen that two different words have the same inflection ("putting" derives both from "putt" and "put"; "holier" relates to "holey" as well as "holy"), or that an inflection is a headword in its own right (as with "wound", the past tense of "wind", or "crooked", the past tense of "crook"). These situations are noted in the 2+2lemma list as cross-references to the alternate headword. There are two specific, and perhaps not obvious, situations in which inflections are treated as separate words. These occur when a present participle or a -ness word has a plural inflection, as with "meaning" ("meanings") and "kindness" ("kindnesses"). Such words are always made headwords, even when the relationship to the original root is very close. Here is an example showing how cross-references are indicated:

base
    based, baseless, basely, baseness, baser, bases -> [basis], basest, basing

Almost always, a given word has only one cross-reference - the exception is the incredible tangle shown in the example below:

slue -> [slough]
    slew -> [slay, slew, slough], slewed, slewing, slews -> [slew, slough], slued, slues -> [slough], sluing

where 4 uncommon words mostly pronounced sloo have become thoroughly confused.
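
Because the bracketed cross-references themselves contain commas, a related-word line cannot be split on commas alone. The following Python sketch shows one way to handle this, assuming the notation is exactly as in the examples above: a word, optionally followed by "->" and a bracketed, comma-separated list of headwords. The function names are hypothetical, and the sketch has not been checked against the whole file.

```python
def split_top_level(text):
    """Split on commas that are not inside square brackets."""
    parts, depth, current = [], 0, []
    for ch in text:
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    tail = "".join(current).strip()
    if tail:
        parts.append(tail)
    return parts


def parse_item(item):
    """Turn 'word -> [a, b]' into ('word', ['a', 'b'])."""
    word, arrow, refs = item.partition("->")
    if not arrow:
        return word.strip(), []
    targets = [t.strip() for t in refs.strip().strip("[]").split(",")]
    return word.strip(), targets


def parse_related(line):
    """Parse a whole related-word line into (word, cross_refs) pairs."""
    return [parse_item(item) for item in split_top_level(line)]


line = ("slew -> [slay, slew, slough], slewed, slewing, "
        "slews -> [slew, slough], slued, slues -> [slough], sluing")
for word, refs in parse_related(line):
    print(word, refs)
```

A headword line such as "slue -> [slough]" can be passed to parse_item directly.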

The 2+2gfreq list

The 2+2gfreq.txt file contains exactly the same words as the 2+2lemma list, but with the headwords arranged approximately by the order of their frequency of use. The "g" in the name stands for both "grouped" and "Google". Here's how it was put together.

In 2006, Google published a mammoth corpus of word and phrase frequency data extracted from the English language Web. (See the announcement.) The 2+2gfreq list was made by accumulating the frequency counts for all of the words associated with a single headword of 2+2lemma.txt, in all of their spellings. (There were in general multiple spellings for each word because Google distinguished words on the basis of capitalization, so that "price", "Price" and "PRICE" were counted separately.) The resulting data was sorted by frequency, and then grouped into bands based on powers of 2. That is, the band of least frequency contained words which occurred 200 to 400 times in the Google data, the next band contained words occurring 400 to 800 times, and so on. After the words were grouped in this fashion, each band was sorted alphabetically by headword, and a separator line was inserted between adjacent bands. There were 27 bands in all, plus a small number of words which did not appear in Google's data at all.
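
The banding step can be sketched in Python as below. The sketch assumes that each band is a half-open range (200 to 399, 400 to 799, and so on), which the description above does not spell out, and it numbers bands from the least frequent upward starting at 0, which is not necessarily the numbering used in the file. The function names and the counts in the example are invented.

```python
from collections import defaultdict


def band_index(count, base=200):
    """Return 0 for [200, 400), 1 for [400, 800), and so on.

    Returns None for counts below the base, e.g. words absent from
    the frequency data. Integer arithmetic avoids rounding problems.
    """
    if count < base:
        return None
    return (count // base).bit_length() - 1


def group_into_bands(totals, base=200):
    """Map band index -> alphabetically sorted list of headwords."""
    bands = defaultdict(list)
    for headword, count in totals.items():
        bands[band_index(count, base)].append(headword)
    return {band: sorted(words) for band, words in bands.items()}


# Invented totals, for illustration only.
totals = {"zebra": 250, "apple": 300, "cat": 500, "missing": 0}
print(group_into_bands(totals))
```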

One might reasonably inquire why the data was not simply presented in frequency order. The reason is that I think this would have implied more significance to the data than is actually there. As I will explain in the following paragraphs, the Google data is only loosely representative of true English word frequencies, and further inaccuracies have been introduced by my own procedures. I think that dividing the words into frequency bands as I have done here is less prone to misinterpretation than some other procedure purporting to greater accuracy.

Let me explore here some of the reasons for not taking the Google frequency data, and my procedures for processing it, too seriously.

The Web is full of gibberish. The phrases "NEVERENDING sweetie - animal thread" and "REDRUM REDRUM REDRUM REDRUM REDRUM" each were found by Google slightly more often than the phrase "over the past 10 years" (all about 160,000 times). I speculate that at least some of this is explained by the technique of setting up many replicated web pages linking to one another, with the hope of creating the semblance of a frequently referenced site. At any rate, one suspects after seeing this example that the frequency of the word "sweetie" in the Google data might be somewhat higher than the frequency in literature and conversation. One quarter of the total use of that word as counted by Google came from repetitions of the above not-exceptionally-lucid phrase!
The Web is biased towards certain kinds of content, and the vocabulary of that content is overrepresented. Three such biases are towards advertising and marketing, computers and pornography. The advertising bias is illustrated by the surprisingly high frequency of words such as "credit", "sale", "brand" and "discount". The computer bias is illustrated by words such as "click", "online", "icon" and "network". And the pornographic bias is illustrated by the high frequency of "anal", "nude", "teen" and other words less savory. Perhaps my favorite example is that "nostdinc" (a compiler option under the Linux operating system) occurs more frequently on the Web than the common word "responsibility".
Google's techniques for identification of English language text are somewhat imperfect, and additionally, some pages contain text in multiple languages. As a result, extremely common words in other languages, such as "la", "el" and "en", show up in the Google data with considerably higher frequency than credible for English.
When I correlated Google's data with the 2+2lemma list, I chose to ignore capitalization. This was necessary - as it would appear that capitalization on the Web is random, or at least beyond simple explanation. According to Google, the most common form of the word "borscht" on the Web is "Borscht", and of "mesh" is "MeSH". But there is a side-effect to this, which is that some words seem to be unnaturally frequent because of having the same form as commonly used names. Words in which this effect can be observed include "john", "china", "bush", "yahoo" and "august". And, perhaps surprisingly, the frequency of the headword "we" is exaggerated, as the most frequent form of "us" is "US", and probably most occurrences of "US" refer to the country rather than to the pronoun.
As noted above, the lemmatization of 2+2lemma introduces certain ambiguities - should the word "putting" count for the "put" or the "putt" headword? Since there is no way of knowing, when I accumulated the frequencies I used the expedient technique of dividing the count evenly between all the possible headwords. This assumes that "put" and "putt" are equally probable interpretations, which is of course wrong. Two excellent examples involving words of very high frequency are two forms of the verb "to be": "are" and "art". "are", as a noun, is an obscure unit of measurement, but it is credited with half the total count for the word, and thereby ends up in frequency band 5. (Not only are hardly any uses of "are" noun uses, but, almost certainly, most occurrences of the plural "ares" are actually references to the Greek god of war rather than to the unit.) "art" illustrates the other side of the problem. It is an archaic form of the verb "to be", and in this form is likely quite uncommon on the Web. The noun "art" shows up in band 8, but if half its count had not been credited to "be", it would be in band 7.
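
The last two points, ignoring capitalization and dividing an ambiguous count evenly, can be illustrated with a short Python sketch. It is a reconstruction from the description above rather than the code actually used. The name accumulate, the shape of the two dictionaries, and all of the counts are assumptions made for the example.

```python
from collections import defaultdict


def accumulate(form_to_headwords, raw_counts):
    """Sum counts per headword.

    form_to_headwords: lowercase word form -> list of possible headwords.
    raw_counts: spelling as found in the data (any capitalization) -> count.
    A form with several possible headwords has its count split evenly.
    """
    totals = defaultdict(float)
    for spelling, count in raw_counts.items():
        heads = form_to_headwords.get(spelling.lower(), [])
        for head in heads:
            totals[head] += count / len(heads)
    return dict(totals)


# Invented data: "putting" is shared between "put" and "putt".
form_to_headwords = {
    "put": ["put"],
    "putt": ["putt"],
    "putting": ["put", "putt"],
}
raw_counts = {"put": 1000, "PUT": 10, "putt": 6, "putting": 100, "Putting": 20}
print(accumulate(form_to_headwords, raw_counts))
```

With these made-up numbers, "putt" receives 60 of its 66 from "putting", which shows on a small scale how the even split can inflate a rare headword.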

Now, after all that, you may be thinking that the Google frequency data, and this use of it, is silly beyond measure, and I don't want to leave you with that impression. With a few glaring exceptions, like "are" and "john", I find the division of the 2+2lemma data into frequency bands to be quite reasonable, and I'm making it available for exactly that reason. I don't know whether there are any practical uses for this data or not, but English word frequency information has always been of interest to "word nerds" like myself, and I offer this approximation on that basis.

The neol2007 list

As I noted above, the existing 12dicts lists have not been updated in the last 4 years. I have been working on other projects, and I think it is unlikely that these lists will see any further changes, except perhaps for minor error corrections. However, language does not remain static, and the 2007 editions of the 12dicts source dictionaries will contain some words which either did not exist, or were not important enough to list, when their previous editions were printed.

In lieu of trying to bring these lists up to date, I am publishing the file neol2007.txt, which contains newly popular words and phrases obtained from various sources which seem to me to be possibly worth adding. (Some of these words are already in one or more of the larger lists, but are now common enough to belong in the smaller ones as well.) Many of these words relate to two of the most important trends of recent years, the conflict between the developed world and Islamic extremism (called by many "the war on terror") and the growing importance of the Internet in our daily lives. A few of the words, such as "break-dance" and "dotcom", actually date back to the previous century - their omission from previous releases of 12dicts reflects the fact that lexicographers never manage to keep up. After all, it took them 20 years to recognize the word "mosh".

neol2007.txt is divided into two parts, a section of individual uncapitalized words and their inflections (as recorded in 2of12inf.txt and 2+2lemma.txt), and a section of additional hyphenated words, phrases and acronyms. (Observe the use of the % suffix to denote "Scrabble inflections".) Depending on your application for the 12dicts lists, you can choose to ignore these words, or add them in part or in whole to the other lists. (I note again that these words have already been added to the 2+2lemma and 2+2gfreq lists, marked with a plus sign to facilitate their removal.) I intend, if there are further revisions to 12dicts, to provide an appropriate neol20xx file each time.

My other projects

Since the previous release of 12dicts, I have been fooling around with English spelling reform. One of the results of this activity is the development of CAAPR and ABCD, both of which may be downloaded from my website, www.wyrdplay.org. CAAPR is the Combined Anglo-American Pronunciation Reference, a fancy name for a bi-dialectal pronunciation dictionary whose word list is derived primarily from the 12dicts 6of12 list. ABCD, Alan's Basic Codes with Diacritics, is also a pronunciation dictionary, of a somewhat different sort - the notation is designed to clarify when a word is spelled in accordance with normal English spelling patterns (as with "fault" or "tunnel"), and when it is not (as with "fought" or "colonel"). Though these files were developed as a result of my interest in spelling reform, they may be of interest to other "word nerds" unconcerned with that particular quixotic pastime.

Conclusions

In the previous editions of 12dicts, I suggested that you write to me (biljir@pobox.com) and let me know what use you were making of 12dicts. I will repeat that request now. I have been delighted to see the interest in these lists for projects ranging from interactive games to literacy programs. And I have been particularly pleased to occasionally hear of first-year Computer Science assignments specifying a 12dicts list rather than /usr/dict/words for their input. Keep up the good work, and let me know what you're doing. (Oh, and please put "12dicts" in the subject line when you email me. This will allow me to easily notice your mail even if it is misclassified by an overzealous filter as spam. Speaking of spam, the publication of my email address in this package has led to a marked increase in the amount of spam I receive and, ironically, much of it contains subject lines which appear to have been extracted at random from my own lists. This is a use of 12dicts of which I do not approve!)

A note on "licensing": 2+2lemma.txt and 2+2gfreq.txt were derived from 2of12inf.txt, which was itself derived in part from Kevin Atkinson's AGID, described in the file agid.txt. I place no additional restrictions on the use of these files beyond those imposed by agid.txt. I release neol2007.txt into the public domain.

- Alan Beale -