Back in elementary school you learnt the difference between nouns, verbs, adjectives, and adverbs.

These "word classes" are not just the idle invention of grammarians, but are useful categories for many language processing tasks.

You might wonder what justification there is for introducing this extra level of information.

Many of these categories arise from a superficial analysis of the distribution of words in text.
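One rough way to see how distribution hints at word classes is to look for words that appear between the same neighbours. The sketch below is only an illustration under stated assumptions: the function name `shared_context_words` and the toy sentence are hypothetical, and a realistic analysis would use a large corpus and a graded similarity score rather than exact context matches.

```python
from collections import defaultdict


def shared_context_words(tokens, target):
    """Return words that occur in at least one (previous, next) context
    in which `target` also occurs. Hypothetical helper for illustration."""
    contexts = defaultdict(set)  # word -> set of (previous word, next word)
    for prev, word, nxt in zip(tokens, tokens[1:], tokens[2:]):
        contexts[word].add((prev, nxt))
    target_contexts = contexts.get(target, set())
    return sorted(
        word
        for word, seen in contexts.items()
        if word != target and seen & target_contexts
    )


# Toy data, invented for this example.
tokens = "the woman bought the book and the man bought the car".split()
print(shared_context_words(tokens, "woman"))  # ['man']
```

Here "man" turns up because it occurs in the same slot as "woman" (between "the" and "bought"), which is the kind of evidence that groups the two words into one class.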

Consider the following analysis involving [...]

By convention in NLTK, a tagged token is represented using a tuple consisting of the token and the tag.

We can create one of these special tuples from the standard string representation of a tagged token, using a function that NLTK provides for this purpose.

Other corpora use a variety of formats for storing part-of-speech tags.
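To make the tuple convention concrete, here is a minimal standalone sketch that turns a string such as `fly/NN` into a (token, tag) pair. NLTK has its own helper for this, `nltk.tag.str2tuple`; the function `parse_tagged_token` below is a hypothetical stand-in written so the example runs without NLTK, and it is not NLTK's implementation. The slash separator is an assumption about the string format.

```python
def parse_tagged_token(text, sep="/"):
    """Split 'word/TAG' at the last separator into a (word, tag) tuple.

    Hypothetical helper: splitting at the last separator keeps tokens
    that themselves contain the separator (e.g. '1/2/CD') intact.
    """
    word, found, tag = text.rpartition(sep)
    if not found:
        # No separator present: return the whole string with no tag.
        return (text, None)
    return (word, tag)


tagged_token = parse_tagged_token("fly/NN")
print(tagged_token)     # ('fly', 'NN')
print(tagged_token[0])  # fly
print(tagged_token[1])  # NN
```

Because the result is an ordinary tuple, the token and the tag can be unpacked or indexed like any other pair.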

Tagged corpus readers also provide a method that divides up the tagged words into sentences rather than presenting them as one big list.
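The difference between a flat list of tagged words and a list of tagged sentences can be sketched without any corpus data. The example below assumes a made-up input format (one sentence per line, tokens written as `word/TAG`); the function `read_tagged_sentences` and the sample text are hypothetical and only mimic the shape of the data, not NLTK's corpus readers.

```python
def read_tagged_sentences(raw, sep="/"):
    """Parse text with one sentence per line into a list of sentences,
    each a list of (word, tag) tuples. Hypothetical helper."""
    sentences = []
    for line in raw.strip().splitlines():
        sentence = []
        for item in line.split():
            word, _, tag = item.rpartition(sep)
            sentence.append((word, tag))
        if sentence:
            sentences.append(sentence)
    return sentences


# Sample text invented for this example.
raw = """
The/AT jury/NN said/VBD ./.
It/PPS rained/VBD ./.
"""

tagged_sents = read_tagged_sentences(raw)
# Flattening the sentences gives the "one big list" view.
tagged_words = [pair for sent in tagged_sents for pair in sent]

print(len(tagged_sents))  # 2
print(len(tagged_words))  # 7
print(tagged_sents[1])    # [('It', 'PPS'), ('rained', 'VBD'), ('.', '.')]
```

Keeping sentence boundaries matters for any later step that works one sentence at a time, while the flat list is convenient for counting words and tags.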

NLTK's corpus readers provide a uniform interface so that you don't have to be concerned with the different file formats.

In contrast with the file fragment shown above, the corpus reader for the Brown Corpus represents the data as shown below.

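The process of classifying words into their parts of speech is known as part-of-speech tagging, or POS-tagging.

A single spelling can correspond to differently pronounced words, one of which may be a noun meaning "trash" while the other belongs to a different word class. Thus, we need to know which word is being used in order to pronounce the text correctly.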

(For this reason, text-to-speech systems usually perform POS-tagging.)

Word classes and part-of-speech tags seem to have their uses, but the details will be obscure to many readers.

