Talk:Z13402
Latest comment: 3 months ago by GrounderUK in topic Definition of “word”
Definition of “word”
See also words from string (Z13402). Tokenization by whitespace could be generalized to tokenization by delimiter(s). If punctuation is suppressed by whitespace substitution or inclusion within delimiters, we converge on a common function.
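As a rough illustration of that convergence, here is a minimal Python sketch. The function names and the sample punctuation set are hypothetical, not taken from any existing implementation. It assumes ASCII whitespace only. One function treats punctuation as extra delimiters, the other replaces punctuation with spaces and then splits on whitespace.

```python
import re
import string


def split_on_delimiters(text, punctuation=""):
    """Split on runs of whitespace plus any extra punctuation characters."""
    # Assumption: only ASCII whitespace counts as a delimiter here.
    delimiters = string.whitespace + punctuation
    pattern = "[" + re.escape(delimiters) + "]+"
    # re.split can yield empty strings at the edges, so drop them.
    return [piece for piece in re.split(pattern, text) if piece]


def split_after_substitution(text, punctuation=""):
    """Replace each punctuation character with a space, then split on whitespace."""
    table = str.maketrans({ch: " " for ch in punctuation})
    return text.translate(table).split()


sample = "one, two;three  four"
print(split_on_delimiters(sample, ",;"))
print(split_after_substitution(sample, ",;"))
```

For input containing only ASCII whitespace, both calls give the same list, which is the sense in which the two approaches converge on a common function.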
In the domain of lexical forms, conventions vary by language. In English we have a particular difficulty with hyphens and apostrophes (occasionally described by the misnomer “interpunction”).
- The string “don’t” is generally regarded as equivalent to “do not”, which is two words, not one.
- The string “can’t” is generally regarded as equivalent to “cannot”, which might be considered a single word.
- Contraction of “is” to “’s” may be indistinguishable from a possessive, so a whitespace-delimited string ending ’s may be considered either one word or two (whereas such a string ending s’ is always a single word, if correct).
- Compound words are typically hyphenated in some contexts and left as separate words in others. A “well-known” distinction is one that is well known. Sometimes a form with neither hyphens nor spaces may be used (see, for example, https://books.google.com/ngrams/graph?content=wellknown%2Cwell-known%2Cwell+known&year_start=1800&year_end=2000&corpus=en-2019&smoothing=3).
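To make the cases above concrete, here is a small Python sketch of how the resulting word list depends on policy choices. Everything in it is an assumption for illustration: the function name, the two flags, and the tiny contraction table are hypothetical, and strings ending ’s are deliberately left untouched because the contraction and the possessive cannot be told apart without context.

```python
# Hypothetical expansion table; a real one would be far larger and language-specific.
CONTRACTIONS = {"don't": ["do", "not"], "can't": ["cannot"]}


def words(text, expand_contractions=False, split_hyphens=False):
    """Split on whitespace, then optionally expand contractions or split hyphens."""
    # Normalise the typographic apostrophe so the table lookup works.
    tokens = text.replace("\u2019", "'").split()
    result = []
    for token in tokens:
        token = token.strip(".,;:!?\"()")  # drop surrounding punctuation only
        if not token:
            continue
        if expand_contractions and token.lower() in CONTRACTIONS:
            result.extend(CONTRACTIONS[token.lower()])
        elif split_hyphens and "-" in token:
            result.extend(part for part in token.split("-") if part)
        else:
            result.append(token)
    return result


print(len(words("I don\u2019t know")))                            # one policy
print(len(words("I don\u2019t know", expand_contractions=True)))  # another
print(words("a well-known fact", split_hyphens=True))
```

The same input gives different word counts under different flags, so a definition of “word” would need to state which policy applies.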