Lexica
Age and Gender Lexica
Our data-driven age and gender lexica were generated from about 97,000 Facebook, Blogger and Twitter users. [.zip]
Link to Publication
APA Citation
Sap, M., Park, G., Eichstaedt, J. C., Kern, M. L., Stillwell, D. J., Kosinski, M., Ungar, L. H., & Schwartz, H. A. (2014). Developing age and gender predictive lexica over social media. EMNLP.
Bibtex Citation
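@inproceedings{sap2014developing, author={Sap, Maarten and Park, Gregory and Eichstaedt, Johannes C and Kern, Margaret L and Stillwell, David J and Kosinski, Michal and Ungar, Lyle H and Schwartz, H Andrew}, title={Developing age and gender predictive lexica over social media}, booktitle={EMNLP}, year={2014} }
Refined Lexica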
Please email Andrew Schwartz to request a refined lexicon such as LIWC. It may be posted shortly.
Link to Publication
APA Citation
Schwartz, H. A., Eichstaedt, J. C., Dziurzynski, L., Kern, M. L., Blanco, E., Ramones, S., Seligman, M. E. P., & Ungar, L. H. (2013). Choosing the right words: Characterizing and reducing error of the Word Count Approach. Proceedings of *SEM-2013: Second Joint Conference on Lexical and Computational Semantics, Atlanta, GA, USA, 296-305.
Bibtex Citation
@inproceedings{schwartz2013choosing, author={Schwartz, H Andrew and Eichstaedt, Johannes C and Dziurzynski, Lukasz and Kern, Margaret L and Blanco, Eduardo and Ramones, Stephanie and Seligman, Martin E P and Ungar, Lyle H}, title={Choosing the right words: Characterizing and reducing error of the Word Count Approach.}, booktitle={Proceedings of *SEM-2013: Second Joint Conference on Lexical and Computational Semantics, Atlanta, Georgia, USA}, year={2013}, location={Atlanta, GA, USA}, pages={296-305} }
PERMA Lexicon
Our lexicon to predict well-being as measured through PERMA scales. [.zip] [Usage license]
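Link to Publication
APA Citation
Schwartz, H. A., Sap, M., Kern, M. L., Eichstaedt, J. C., Kapelner, A., Agrawal, M., Blanco, E., Dziurzynski, L., Park, G., Stillwell, D., Kosinski, M., Seligman, M. E. P., & Ungar, L. H. (2016). Predicting Individual Well-Being Through the Language of Social Media. Pacific Symposium on Biocomputing, 21, 516-527.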
Bibtex Citation
@{h. andrew schwartz2016predicting, author={H. Andrew Schwartz, Maarten Sap, Margaret L. Kern, Johannes C. Eichstaedt, Adam Kapelner, Megha Agrawal, Eduardo Blanco, Lukasz Dziurzynski, Gregory Park, David Stillwell, Michal Kosinski, Martin E.P. Seligman, Lyle H. Ungar.}, title={Predicting Individual Well-Being Through the Language of Social Media}, year={2016}, pages={516-527} }
Spanish PERMA Lexicon
Our lexicon to measure PERMA in Spanish, derived from Spanish tweets annotated with PERMA. [.zip]
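Link to Publication
APA Citation
Schwartz, H. A., Sap, M., Kern, M. L., Eichstaedt, J. C., Kapelner, A., Agrawal, M., Blanco, E., Dziurzynski, L., Park, G., Stillwell, D., Kosinski, M., Seligman, M. E. P., & Ungar, L. H. (2016). Predicting Individual Well-Being Through the Language of Social Media. Pacific Symposium on Biocomputing, 21, 516-527.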
Bibtex Citation
@{h. andrew schwartz2016predicting, author={H. Andrew Schwartz, Maarten Sap, Margaret L. Kern, Johannes C. Eichstaedt, Adam Kapelner, Megha Agrawal, Eduardo Blanco, Lukasz Dziurzynski, Gregory Park, David Stillwell, Michal Kosinski, Martin E.P. Seligman, Lyle H. Ungar.}, title={Predicting Individual Well-Being Through the Language of Social Media}, year={2016}, pages={516-527} }
Prospection Lexicon: Temporal Orientation
Optimism Lexicon
Combining the affect lexicon with the future orientation lexicon produces an optimism lexicon (positive future-oriented thinking). To use it, first keep only the messages that the future orientation lexicon identifies as future-oriented, then apply the affect lexicon to those messages.
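The two-step recipe above might look like the following Python sketch. Everything concrete here is an assumption made for illustration: the toy word weights, the `score` helper, the whitespace tokenizer, and especially the rule that a message counts as future-oriented when its future-orientation score is above a threshold of zero. The released lexica may use a different scoring scheme or cutoff.

```python
from collections import Counter

def score(text, weights, intercept=0.0):
    """Weighted relative word frequency plus intercept (naive tokenizer)."""
    tokens = text.lower().split()
    if not tokens:
        return intercept
    counts = Counter(tokens)
    return intercept + sum(w * counts[t] / len(tokens) for t, w in weights.items())

# Toy stand-ins for the two lexica; the words and weights are made up.
future_lexicon = {"will": 1.0, "tomorrow": 1.5, "yesterday": -1.5}
affect_lexicon = {"great": 2.0, "awful": -2.0}

def optimism_scores(messages, threshold=0.0):
    """Keep future-oriented messages, then score them with the affect lexicon."""
    future = [m for m in messages if score(m, future_lexicon) > threshold]
    return {m: score(m, affect_lexicon) for m in future}

messages = [
    "tomorrow will be great",
    "yesterday was awful",
    "tomorrow will be awful",
]
result = optimism_scores(messages)
```

In this toy run the past-oriented message is dropped by the filter, and the two remaining messages receive a positive and a negative affect score respectively.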
Empathic Concern Lexicon
Lexicon and clusters included. [.zip]
Link to Publication
APA Citation
Sedoc, J., Buechel, S., Nachmany, Y., Buffone, A., & Ungar, L. (2019). Learning Word Ratings for Empathy and Distress from Document-Level User Responses. https://arxiv.org/abs/1912.01079
Bibtex Citation
@{sedoc2019learning, author={João Sedoc, Sven Buechel, Yehonathan Nachmany, Anneke Buffone, Lyle Ungar}, title={Learning Word Ratings for Empathy and Distress from Document-Level User Responses}, year={2019}, eprint={1912.01079}, archivePrefix={arXiv}, primaryClass={cs.CL} }
Personal Distress Lexicon
Lexicon and clusters included. [.zip]
Link to Publication
APA Citation
Sedoc, J., Buechel, S., Nachmany, Y., Buffone, A., & Ungar, L. (2019). Learning Word Ratings for Empathy and Distress from Document-Level User Responses. https://arxiv.org/abs/1912.01079
Bibtex Citation
@{sedoc2019learning, author={João Sedoc, Sven Buechel, Yehonathan Nachmany, Anneke Buffone, Lyle Ungar}, title={Learning Word Ratings for Empathy and Distress from Document-Level User Responses}, year={2019}, eprint={1912.01079}, archivePrefix={arXiv}, primaryClass={cs.CL} }
Additional Lexica
Please visit [LexHub] for more resources.
When Applying the Lexica
To calculate lexicon usage, sum over all words the word's weight in that lexicon multiplied by the word's relative frequency, then add the intercept value to correct for the model bias (found under '_intercept' in the lexicon CSVs).
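The calculation described above can be sketched in Python. This is a minimal sketch rather than the authors' code: the CSV layout (a `term` column and a `weight` column, with the intercept stored as a row whose term is `_intercept`), the whitespace tokenizer, and the function names `load_lexicon` and `lexicon_usage` are all assumptions made for illustration. Check the header of the actual lexicon CSV before adapting it.

```python
import csv
import io
from collections import Counter

def load_lexicon(csv_text):
    """Parse lexicon CSV text into (weights, intercept).

    Assumed layout (hypothetical): header 'term,weight', with the
    intercept stored in a row whose term is '_intercept'.
    """
    weights = {}
    intercept = 0.0
    for row in csv.DictReader(io.StringIO(csv_text)):
        value = float(row["weight"])
        if row["term"] == "_intercept":
            intercept = value
        else:
            weights[row["term"]] = value
    return weights, intercept

def lexicon_usage(text, weights, intercept):
    """Sum of weight * relative frequency over lexicon words, plus intercept."""
    tokens = text.lower().split()  # naive tokenizer, for illustration only
    if not tokens:
        return intercept
    counts = Counter(tokens)
    total = len(tokens)
    weighted = sum(w * counts[t] / total for t, w in weights.items() if t in counts)
    return weighted + intercept

# Made-up lexicon values, not taken from any released lexicon.
example_csv = "term,weight\n_intercept,20.0\nhomework,-8.0\nmortgage,12.0\n"
weights, intercept = load_lexicon(example_csv)
score = lexicon_usage("my mortgage is due and the mortgage is big", weights, intercept)
```

Keeping the intercept separate from the word weights avoids accidentally treating `_intercept` as a word that could match a token in the text.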
Walk-through Example
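A weighted lexicon is often applied as the sum of all weighted word relative frequencies over a document:

$$\mathrm{usage}_{lex}(doc) = \sum_{word \in lex} w_{lex}(word) \cdot \frac{freq(word, doc)}{freq(*, doc)}$$

where $w_{lex}(word)$ is the lexicon weight for the word, $freq(word, doc)$ is the frequency of the word in the document (or for a given user), and $freq(*, doc)$ is the total word count for that document (or user).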
For example, suppose a lexicon assigns weights to the words a, b, and c, and two documents use those words with different frequencies. First, compute the total word count of each document. Each document's lexicon usage is then the sum of its weighted relative frequencies. Once the usages have been computed, add the intercept of the lexicon to each of them.
If the lexicon represents age, the two resulting values are the predicted ages for the two documents. If it represents gender, take the sign of the result: if it is positive, the document is predicted to be female; otherwise it is predicted to be male.
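The walk-through above can be reproduced in a few lines of Python. The weights, word counts, and intercept below are hypothetical values chosen only to keep the arithmetic easy to follow; they do not come from any released lexicon, and the helper names `usage` and `gender_label` are likewise illustrative.

```python
# Hypothetical lexicon weights for words a, b, c and a hypothetical intercept.
weights = {"a": 10.0, "b": -5.0, "c": 2.0}
intercept = 25.0

# Hypothetical word counts for two documents; words outside the lexicon
# ("other") still count toward the document's total word count.
doc_counts = {
    "doc1": {"a": 2, "b": 1, "c": 1, "other": 4},
    "doc2": {"a": 0, "b": 3, "c": 2, "other": 5},
}

def usage(counts, weights, intercept):
    """Sum of weighted relative frequencies, plus the intercept."""
    total = sum(counts.values())  # total word count of the document
    weighted = sum(weights[w] * counts.get(w, 0) / total for w in weights)
    return weighted + intercept

def gender_label(score):
    """Sign rule from the text: positive means female, otherwise male."""
    return "female" if score > 0 else "male"

# If this were an age lexicon, these would be the predicted ages.
# doc1: (10*2 - 5*1 + 2*1) / 8 + 25 = 27.125
scores = {doc: usage(c, weights, intercept) for doc, c in doc_counts.items()}
```

For a gender lexicon the same `usage` value would be passed to `gender_label` instead of being read as an age.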