Definition of “word”

See also words from string (Z13402). Tokenization by whitespace could be generalized to tokenization by one or more delimiters. If punctuation is suppressed, either by replacing it with whitespace or by including it among the delimiters, the two approaches converge on a common function.
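
To illustrate the convergence, here is a minimal Python sketch. It is not the implementation of Z13402; the function names and the default punctuation set are assumptions made up for this example. One function treats punctuation as delimiters, the other substitutes whitespace for punctuation and then splits on whitespace.

```python
def split_on_delimiters(text, delimiters=" \t\n.,;:!?"):
    """Split text on any character in `delimiters`, dropping empty tokens.

    Hypothetical helper: the default delimiter set is an assumption.
    """
    tokens = []
    current = []
    for ch in text:
        if ch in delimiters:
            # A delimiter closes the token being built, if there is one.
            if current:
                tokens.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        tokens.append("".join(current))
    return tokens


def split_with_substitution(text, punctuation=".,;:!?"):
    """Replace each punctuation character with a space, then split on whitespace."""
    table = str.maketrans({ch: " " for ch in punctuation})
    return text.translate(table).split()


sample = "Hello, world. Goodbye!"
print(split_on_delimiters(sample))
print(split_with_substitution(sample))
```

For the same punctuation set, both calls should yield the same token list, which is the sense in which the two approaches become one function with a delimiter parameter.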

In the domain of lexical forms, conventions vary by language. In English we have a particular difficulty with hyphens and apostrophes (occasionally described by the misnomer “interpunction”).

  • The string “don’t” is generally regarded as equivalent to “do not”, which is two words, not one.
  • The string “can’t” is generally regarded as equivalent to “cannot”, which might be considered a single word.
  • Contraction of “is” to “’s” may be indistinguishable from a possessive, so a whitespace-delimited string ending ’s may be considered either one word or two (whereas such a string ending s’ is always a single word, if correct).
  • Compound words are typically hyphenated in some contexts and left as separate words in others. A “well-known” distinction is one that is well known. Sometimes a form with neither hyphens nor spaces may be used (see, for example, https://books.google.com/ngrams/graph?content=wellknown%2Cwell-known%2Cwell+known&year_start=1800&year_end=2000&corpus=en-2019&smoothing=3).
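
The cases above can be made concrete with a rough Python sketch showing how the word count shifts with the convention chosen. Everything here is an assumption for illustration: the function name, its two flags, the tiny contraction table, and the restriction to ASCII letters. It deliberately leaves the ’s ambiguity (contraction or possessive) unresolved, since that cannot be decided from the string alone.

```python
import re

# Hypothetical, deliberately tiny table; a real one would be much larger.
CONTRACTIONS = {"don't": ["do", "not"], "can't": ["cannot"]}


def words(text, split_hyphens=False, expand_contractions=False):
    """Return the words of `text` under the chosen conventions (ASCII letters only)."""
    # Normalise the typographic apostrophe to the ASCII one.
    text = text.replace("\u2019", "'")
    # Either keep hyphens inside tokens or let them act as delimiters.
    pattern = r"[A-Za-z']+" if split_hyphens else r"[A-Za-z'-]+"
    result = []
    for token in re.findall(pattern, text):
        # Drop stray quotes and hyphens at the edges of a token.
        token = token.strip("'-")
        if not token:
            continue
        if expand_contractions and token.lower() in CONTRACTIONS:
            result.extend(CONTRACTIONS[token.lower()])
        else:
            result.append(token)
    return result


sentence = "I don\u2019t know a well-known rule"
print(len(words(sentence)))
print(len(words(sentence, split_hyphens=True)))
print(len(words(sentence, expand_contractions=True)))
```

The same sentence gives a different count depending on the flags, so any “word” function probably needs to state which convention it follows, or expose the choice as a parameter.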

GrounderUK (talk) 13:40, 30 March 2024 (UTC)
