First Word Instance

Monday, January 18th, 02021 at 11:11 UTC

There is a Twitter bot called @NYT_first_said which pretty much does what it says on the tin.

Tweets words when they appear in the New York Times for the first time.
@NYT_first_said

It seems to be a fairly straightforward bit of code. It fetches the articles published each day and splits the text into words. It skips some words, such as proper nouns or anything containing numbers. Finally, it checks the archive to see whether the word has been used before. If not, it sends the word out as a tweet.
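We have not seen the bot's source, so the following is only a minimal sketch of the steps as we understand them. The function names, the punctuation handling and the use of a leading capital as a proxy for proper nouns are all our own assumptions.

```python
def candidate_words(text):
    """Yield words worth checking (assumed filtering rules, not the bot's)."""
    for token in text.split():
        word = token.strip(".,;:!?\"'()[]")
        if not word or not word.isalpha():
            continue  # drops tokens with digits, hyphens or other symbols
        if word[0].isupper():
            continue  # rough proxy for proper nouns; also drops sentence starts
        yield word


def first_said(text, archive):
    """Return words in text that are not yet in archive, and record them."""
    new_words = []
    for word in candidate_words(text):
        if word not in archive:
            archive.add(word)
            new_words.append(word)
    return new_words


archive = {"the", "of", "a"}  # stand-in for the real archive lookup
print(first_said("The legendness of 3D Paris is a thing.", archive))
```

Here the archive is just an in-memory set. The real bot presumably queries a search index of past articles instead.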

The most recent tweet was:

legendness
— New New York Times (@NYT_first_said) January 18, 2021

This got us thinking that it wouldn’t be hard to replicate the concept. We built a small application to fetch every published post in our WordPress database. The archive isn’t huge, which means we can walk through every article in publication order and list the words each one uses for the first time, compared with everything published before it.
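The core of that comparison can be sketched in a few lines. This is an illustration rather than our actual application: it assumes the posts have already been fetched and are passed in as (title, text) pairs, oldest first, and the function name is made up for the example.

```python
import re


def first_instances(articles):
    """Map each article to the words it uses for the first time.

    articles: iterable of (title, text) pairs in publication order.
    Returns a list of (title, new_words) pairs in the same order.
    """
    seen = set()
    result = []
    for title, text in articles:
        new_words = []
        # Lowercase everything, then split into runs of letters and digits.
        # Apostrophes split words too, which is a simplification.
        for word in re.findall(r"[^\W_]+", text.lower()):
            if word.isalpha() and word not in seen:  # skip tokens with digits
                seen.add(word)
                new_words.append(word)
        result.append((title, new_words))
    return result


posts = [
    ("first", "We ride bikes"),
    ("second", "We ride ebikes for 30min"),
]
for title, words in first_instances(posts):
    print(title, words)
```

Because the set of seen words only ever grows, the order of the input matters: the same word is credited to whichever article comes first.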

Over time, each new article is less and less likely to use a new word, which makes the new words more notable when they do appear.

We have published 260 articles which contain 14,100+ unique words. A sample of recent words used for the first time includes:

omnibuses
consultancy
competencies
instalment
swept
ebikes
tug
commodity
bespoke
cobblers
dc
terrorists
impeached
tripled
headsets
heath
exceptionally

You can see that we haven’t exhausted regular English words yet. We have a long way to go before the only new words we use are newly coined ones.

Potential Improvements

There is a lot of room for improvement, but over time some of these issues correct themselves. For instance, we could try to lemmatise words. This means that words like jump, jumps, jumped and jumping are all reduced to their base form: jump. Then we wouldn’t have 4 unique words, just one. With enough articles, we’ll have written all the different forms, and it is unlikely any future occurrence will be considered ‘unique’ with or without lemmatisation.
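To show the idea with the jump example, here is a deliberately crude stand-in. It is plain suffix stripping, not real lemmatisation, which relies on a dictionary and part-of-speech information. The function name and the three suffixes are assumptions made for the illustration.

```python
def crude_base_form(word):
    """Strip a few common English suffixes (naive, not a real lemmatiser)."""
    for suffix in ("ing", "ed", "s"):
        # Keep at least three letters so short words like "is" are untouched.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


forms = ["jump", "jumps", "jumped", "jumping"]
print({crude_base_form(w) for w in forms})
```

The four forms collapse into one entry, but the same rule turns tripled into tripl and leaves swept alone, which is why a proper lemmatiser would be needed for anything beyond a demonstration.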

Words with numbers are ignored. We mix letters and numbers when talking about units: 5h or 10h or 30min. All of these will be ignored. That’s probably fine, otherwise we’d find an endless supply of new words which are simply number references.
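The filter itself is a one-liner. A small sketch, with a hypothetical helper name:

```python
def has_digit(token):
    """True if any character in the token is a digit."""
    return any(ch.isdigit() for ch in token)


tokens = ["5h", "10h", "30min", "commodity", "bespoke"]
kept = [t for t in tokens if not has_digit(t)]
print(kept)
```

Dropping the whole token, rather than stripping the digits out of it, keeps fragments like h and min from being counted as words in their own right.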

Spelling mistakes, while mistakes, are new words! (Unless we are consistent misspellers.) There could be some list of common misspellings or a dictionary cross-reference, but this is probably pushing it too far for a fun hobby project.
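If we ever did go that far, one cheap approach would be to flag a new word that is suspiciously close to a word we already know. The sketch below uses difflib from the Python standard library. The helper name, the tiny word list and the 0.85 cutoff are assumptions picked for the example, not tuned values.

```python
import difflib


def likely_misspelling(word, known_words, cutoff=0.85):
    """Return the closest known word if word looks like a typo of it, else None."""
    if word in known_words:
        return None
    matches = difflib.get_close_matches(word, known_words, n=1, cutoff=cutoff)
    return matches[0] if matches else None


known = ["consultancy", "commodity", "bespoke"]
print(likely_misspelling("comodity", known))
print(likely_misspelling("ebikes", known))
```

A genuinely new word that happens to resemble an existing one would be flagged as well, so this could only ever suggest candidates for a human to check.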

The NYTimes bot skips capitalised words. These include the words that start sentences, but those probably appear elsewhere in lowercase too, so they will eventually get picked up and indexed. What it does mean is that the bot skips proper nouns like place names and surnames used in articles. For them, that probably makes sense as they get quotes or references from people. We decided to make all input lowercase and then look for unique words.

Much like the @NYT_first_said bot that posts to Twitter, we can reuse the @optionalBot to post any new words found in our articles.

Googlewhack

A Googlewhack is when you google a term and it only appears on a single page.

Ironically, as we find words used for the first time (which appear on only a single page) and then tweet or write about them in some Omnibus, they will appear on two pages, immediately disqualifying them from ever being a Googlewhack.

Categories: Applications, Briefs, Meta

Tags: nlp, text, words

(optional.is)
