Event Description
Recently, large-scale language data combined with modern machine learning techniques has shown strong value as a means of studying human psychology and behavior. For example, language alone has been shown to be predictive of mental health, personality, and health behaviors. However, many applications of such language-based assessments have readily available and important data beyond language (i.e., extra-linguistic data). For instance, when predicting the subjective well-being of a community from tweets, one can also take into account the community's age, education, and other demographic attributes. Language may capture some characteristics while extra-linguistic variables capture others. We believe that effectively integrating linguistic and extra-linguistic data can yield benefits beyond what either provides independently.
In this thesis, we develop methods that effectively integrate extra-linguistic data with language data, focusing primarily on social scientific applications. The central challenge is reconciling the size and heterogeneity of language data, which is often sparse and noisy, with extra-linguistic variables, which are often low-dimensional and non-sparse. First, we consider structured extra-linguistic data, such as socioeconomic variables (income and education rates) and demographic variables (age, gender, etc.), and propose two integration methods, residualized controls (RC) and residualized factor adaptation (RFA), for use in county-wise prediction tasks. These techniques integrate information at both the model level and the data level, and we found consistently strong improvements over naively combining features, for example, improving county-level well-being prediction by over 12%. Next, we consider unstructured extra-linguistic data. In the first part, we incorporate social network connections and language over time to propose a novel metric for quantifying the stickiness of words: their ability to spread across friendship connections in a social network over time (in other words, to stick in one's vocabulary after seeing friends use them). We identify which language features are more likely to disseminate through friendship and show that such a metric is useful for predicting who will be friends and what content will spread. In addition, we analyze language content over time by proposing a novel dynamic content-specific topic modeling technique that can help identify different sub-domains of a thematic scope and can be used to track societal shifts in concerns or views over time.