Thursday, March 1, 2012

A List of 377 Common, Relatively Insignificant Words - First Cut

Any word can have significance.  I needed a list of words that would be least likely to have much significance in most situations.

I assembled this list to help identify words that, when used in the name of a file, would be relatively unlikely to provide insight into the contents of that file.  I hoped that eliminating these words would make it easier to notice important emphases among the words that remained.  This blog should have a post on that project at about the same time as this post.

I assembled this list from several sources.  I started with the list of the 100 most common words in English -- a list that, according to Wikipedia, was computed from an analysis of the billion-word Oxford English Corpus.  I reduced that list to remove about 20 words that, although common, seemed especially capable of having significance even in a filename or other brief communication (e.g., "person," "time," "know").

Among the remaining words on that list, I expanded some to include other forms in normal usage.  For example, along with "be" as the second most common word in English, I added "is," "am," "are," "was," "were," and "been."  Some such alternate forms were already on the list (e.g., "we" and "us"); others may have been much less frequently used.  I added some common words that were substantially similar to, or that would arise from combinations of, words on the list (e.g., "whenever," "wouldn't").  I also expanded the several numerical words on the list to include all single-word cardinal and ordinal numbers up to "ninety" and "ninetieth," along with other counting adjectives (e.g., "many").

After starting with the list of the 100 most common words, I turned to certain specific types of words.  I focused on commonly recognized parts of speech, particularly conjunctions (e.g., "but"), pronouns (e.g., "she"), conjunctive adverbs (e.g., "however"), and prepositions (e.g., "across").  There was a lot of overlap; lists of these kinds of words expanded my list but also tended to confirm what was already on it.  I added some relatively generic adverbs (e.g., "actually") and adjectives (e.g., "actual").  I also drew from Wikipedia's lists of the first and second hundred English basic words, and threw in some other frequently used words of relatively minor significance (e.g., "evidently").

These steps produced the following list of words that I intended to apply to my project.

a
aboard
about
above
according
accordingly
across
actual
actually
additional
additionally
after
again
against
all
almost
along
alongside
also
although
always
am
amid
amidst
among
amongst
an
and
another
anti
any
anybody
anyone
anything
anyway
apparent
apparently
are
around
as
astride
at
atop
away
barring
be
because
been
before
behind
below
beneath
beside
besides
between
beyond
both
but
by
can
cannot
can't
certain
certainly
circa
clear
clearly
commonly
comparable
comparative
comparatively
concerning
consequent
consequently
considering
contrarily
conversely
could
couldn't
cum
despite
did
didn't
different
do
does
doesn't
done
down
during
each
eight
eighteen
eighteenth
eighth
eightieth
eighty
either
eleven
eleventh
elsewhere
equally
especially
even
every
everybody
everyone
everything
evident
evidently
except
excepting
excluding
few
fifteen
fifteenth
fifth
fiftieth
fifty
finally
first
five
following
for
fortieth
forty
four
fourteen
fourteenth
fourth
from
further
furthermore
generally
get
gets
getting
go
going
gone
got
had
has
have
he
hence
henceforth
her
here
hers
herself
him
himself
his
honestly
how
however
I
if
I'll
I'm
important
in
incidentally
including
inside
instead
into
is
isn't
it
its
it's
itself
I've
just
less
likely
likewise
little
many
may
me
meanwhile
might
mine
minus
more
moreover
most
much
must
my
myself
namely
near
nearly
neither
never
nevertheless
next
nine
nineteen
nineteenth
ninetieth
ninety
ninth
no
nobody
none
nonetheless
nor
not
nothing
notwithstanding
now
of
off
often
on
once
one
only
onto
or
other
others
otherwise
our
ours
ourselves
out
outside
over
particular
particularly
per
plus
prior
provided
rather
re
really
regard
regarding
regardless
relatively
same
seem
seemingly
seems
seven
seventeen
seventeenth
seventh
seventieth
seventy
several
she
should
similar
similarly
since
six
sixteen
sixteenth
sixth
sixtieth
sixty
small
so
some
somebody
someone
something
soon
specific
specific
specifically
still
subsequent
subsequently
such
ten
tenth
than
that
the
their
theirs
them
themselves
then
there
thereafter
therefore
these
they
third
thirteen
thirteenth
thirtieth
thirty
this
those
though
three
through
throughout
thru
thus
till
to
together
too
toward
towards
truly
twelfth
twelve
twentieth
twenty
twice
two
ultimately
under
underneath
undoubtedly
unless
unlike
until
up
upon
us
versus
very
via
vis-a-vis
vs
was
way
we
well
went
were
what
whatever
when
whenever
where
whereas
wherever
whether
which
whichever
while
who
whoever
whom
whomever
whose
why
will
with
within
without
worse
worst
would
wouldn't
yes
yet
you
your
yours
yourself
yourselves
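
As a rough illustration of how a list like the one above might be applied to a filename, here is a minimal Python sketch.  It is not the method used in the project.  The function name `significant_words`, the small `INSIGNIFICANT` subset, and the tokenizing rules (drop a short trailing extension, then split on anything other than letters, digits, and apostrophes) are all assumptions made for this example.

```python
import re

# Illustrative subset only (an assumption for this sketch); a real run
# would load every word from the list above.
INSIGNIFICANT = {
    "a", "about", "actually", "and", "can't", "for", "from", "I",
    "in", "my", "of", "the", "to", "with",
}


def significant_words(filename, stop_words=INSIGNIFICANT):
    """Return the words in a filename that are not on the stop list."""
    # Drop a short trailing extension such as ".doc" or ".txt".
    stem = re.sub(r"\.[A-Za-z0-9]{1,5}$", "", filename)
    # Treat runs of letters, digits, and apostrophes as words, so that
    # contractions like "can't" survive while spaces, underscores, and
    # hyphens act as separators.
    words = re.findall(r"[A-Za-z0-9']+", stem)
    # Compare case-insensitively, because the list mixes cases ("I", "a").
    stop = {w.lower() for w in stop_words}
    return [w for w in words if w.lower() not in stop]
```

With this subset, `significant_words("Notes about the Budget for 2012.doc")` returns `['Notes', 'Budget', '2012']`.  Because hyphens are treated as separators here, a hyphenated entry such as "vis-a-vis" would never match as a whole word; a different tokenizer would be needed to catch it.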

2 comments:

raywood

Depending on the situation, it appeared that I could have added some other kinds of items to the list. One such category: numbers written as numerals rather than words (1, 2, ... and 1st, 2nd, ... ) up to some point. Another possibility: years and months. Also: terms that are common in filenames (e.g., "spreadsheet" and variations on "email" and "doc").
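
A minimal Python sketch of how those extra categories might be tested, offered only as an illustration.  Everything specific in it is an assumption: the function name `is_extra_stop_token`, the cutoff of 90 (chosen to mirror the "ninety" limit of the word list), the 1900-2099 range for years, and the contents of `MONTHS` and `FILE_TERMS`.

```python
import re

# Month names and common abbreviations (assumed set for this sketch).
MONTHS = {
    "january", "february", "march", "april", "may", "june", "july",
    "august", "september", "october", "november", "december",
    "jan", "feb", "mar", "apr", "jun", "jul", "aug", "sep", "sept",
    "oct", "nov", "dec",
}

# Hypothetical examples of generic filename terms.
FILE_TERMS = {"spreadsheet", "email", "emails", "e-mail", "doc", "docs"}

# Digits with an optional ordinal suffix.  The suffix is not checked
# against the number, so a malformed token like "2th" also matches.
_NUMERAL = re.compile(r"(\d+)(st|nd|rd|th)?")


def is_extra_stop_token(token, max_number=90, year_range=(1900, 2099)):
    """Return True if token is a small numeral, a year, a month, or a file term."""
    t = token.lower()
    if t in MONTHS or t in FILE_TERMS:
        return True
    m = _NUMERAL.fullmatch(t)
    if m is None:
        return False
    value = int(m.group(1))
    if value <= max_number:
        return True
    # Larger numbers count only as plain years, never as ordinals.
    return m.group(2) is None and year_range[0] <= value <= year_range[1]
```

Under these assumptions, "2nd", "2012", "March", and "spreadsheet" would all be treated as insignificant, while "budget" and "451" would not.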

raywood

The project in which I attempted to use this list was one that sought to identify multiword phrases.