Dating i letland
172 For Tweets in Dutch, we first look at the official user interface for the Twi NL data set, Among other things, it shows gender and age statistics for the users producing the tweets found for user specified searches.
These statistics are derived from the users profile information by way of some heuristics.
2004), with and without preprocessing the input vectors with Principal Component Analysis (PCA; (Pearson 1901); (Hotelling 1933)).
We also varied the recognition features provided to the techniques, using both character and token n-grams.
Gender recognition has also already been applied to Tweets. (2010) examined various traits of authors from India tweeting in English, combining character N-grams and sociolinguistic features like manner of laughing, honorifics, and smiley use.
The authors do not report the set of slang words, but the non-dictionary words appear to be more related to style than to content, showing that purely linguistic behaviour can contribute information for gender recognition as well.Later, in 2004, the group collected a Blog Authorship Corpus (BAC; (Schler et al.2006)), containing about 700,000 posts to (in total about 140 million words) by almost 20,000 bloggers. Slightly more information seems to be coming from content (75.1% accuracy) than from style (72.0% accuracy). We see the women focusing on personal matters, leading to important content words like love and boyfriend, and important style words like I and other personal pronouns.Two other machine learning systems, Linguistic Profiling and Ti MBL, come close to this result, at least when the input is first preprocessed with PCA. Introduction In the Netherlands, we have a rather unique resource in the form of the Twi NL data set: a daily updated collection that probably contains at least 30% of the Dutch public tweet production since 2011 (Tjong Kim Sang and van den Bosch 2013).However, as any collection that is harvested automatically, its usability is reduced by a lack of reliable metadata.