- Age and Gender Lexica
Our data-driven age and gender lexica were generated from about 97,000 Facebook, Blogger and Twitter users. [.zip]
Link to Publication
APA Citation: Sap, M., Park, G., Eichstaedt, J. C., Kern, M. L., Stillwell, D. J., Kosinski, M., Ungar, L. H., & Schwartz, H. A. (2014). Developing age and gender predictive lexica over social media. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP).
Bibtex Citation:
@inproceedings{sap2014developing,
  author={Sap, Maarten and Park, Greg and Eichstaedt, Johannes C and Kern, Margaret L and Stillwell, David J and Kosinski, Michal and Ungar, Lyle H and Schwartz, H Andrew},
  title={Developing age and gender predictive lexica over social media},
  booktitle={Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
  year={2014}
}
- Refined Lexica
Please email Andrew Schwartz to request refined lexica, such as a refined LIWC. These may be posted shortly.
Link to Publication
APA Citation: Schwartz, H. A., Eichstaedt, J. C., Dziurzynski, L., Kern, M. L., Blanco, E., Ramones, S., Seligman, M. E. P., & Ungar, L. H. (2013). Choosing the right words: Characterizing and reducing error of the Word Count Approach. Proceedings of *SEM-2013: Second Joint Conference on Lexical and Computational Semantics, Atlanta, Georgia, USA, 296-305.
Bibtex Citation:
@inproceedings{schwartz2013choosing,
  author={Schwartz, H Andrew and Eichstaedt, Johannes C and Dziurzynski, Lukasz and Kern, Margaret L and Blanco, Eduardo and Ramones, Stephanie and Seligman, Martin E P and Ungar, Lyle H},
  title={Choosing the right words: Characterizing and reducing error of the Word Count Approach},
  booktitle={Proceedings of *SEM-2013: Second Joint Conference on Lexical and Computational Semantics, Atlanta, Georgia, USA},
  year={2013},
  location={Atlanta, GA, USA},
  pages={296-305}
}
- PERMA Lexicon
Our lexicon to predict well-being as measured through PERMA scales. [.zip] [Usage license]
Link to Publication
APA Citation: Schwartz, H. A., Sap, M., Kern, M. L., Eichstaedt, J. C., Kapelner, A., Agrawal, M., Blanco, E., Dziurzynski, L., Park, G., Stillwell, D., Kosinski, M., Seligman, M. E. P., & Ungar, L. H. (2016). Predicting Individual Well-Being Through the Language of Social Media. Pacific Symposium on Biocomputing, 21, 516-527.
Bibtex Citation:
@article{schwartz2016predicting,
  author={H. Andrew Schwartz and Maarten Sap and Margaret L. Kern and Johannes C. Eichstaedt and Adam Kapelner and Megha Agrawal and Eduardo Blanco and Lukasz Dziurzynski and Gregory Park and David Stillwell and Michal Kosinski and Martin E.P. Seligman and Lyle H. Ungar},
  title={Predicting Individual Well-Being Through the Language of Social Media},
  journal={Pacific Symposium on Biocomputing},
  volume={21},
  year={2016},
  pages={516-527}
}
- Spanish PERMA Lexicon
Our lexicon to measure PERMA in Spanish, derived from Spanish tweets annotated with PERMA. [.zip]
Link to Publication
APA Citation: Schwartz, H. A., Sap, M., Kern, M. L., Eichstaedt, J. C., Kapelner, A., Agrawal, M., Blanco, E., Dziurzynski, L., Park, G., Stillwell, D., Kosinski, M., Seligman, M. E. P., & Ungar, L. H. (2016). Predicting Individual Well-Being Through the Language of Social Media. Pacific Symposium on Biocomputing, 21, 516-527.
Bibtex Citation:
@article{schwartz2016predicting,
  author={H. Andrew Schwartz and Maarten Sap and Margaret L. Kern and Johannes C. Eichstaedt and Adam Kapelner and Megha Agrawal and Eduardo Blanco and Lukasz Dziurzynski and Gregory Park and David Stillwell and Michal Kosinski and Martin E.P. Seligman and Lyle H. Ungar},
  title={Predicting Individual Well-Being Through the Language of Social Media},
  journal={Pacific Symposium on Biocomputing},
  volume={21},
  year={2016},
  pages={516-527}
}
- Prospection Lexicon: Temporal Orientation
- Affect and Intensity Lexicon
- Optimism Lexicon
- Empathic Concern Lexicon
Lexicon and clusters included. [.zip]
Link to Publication
APA Citation: Sedoc, J., Buechel, S., Nachmany, Y., Buffone, A., & Ungar, L. (2019). Learning Word Ratings for Empathy and Distress from Document-Level User Responses. https://arxiv.org/abs/1912.01079
Bibtex Citation:
@misc{sedoc2019learning,
  author={João Sedoc and Sven Buechel and Yehonathan Nachmany and Anneke Buffone and Lyle Ungar},
  title={Learning Word Ratings for Empathy and Distress from Document-Level User Responses},
  year={2019},
  eprint={1912.01079},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
- Personal Distress Lexicon
Lexicon and clusters included. [.zip]
Link to Publication
APA Citation: Sedoc, J., Buechel, S., Nachmany, Y., Buffone, A., & Ungar, L. (2019). Learning Word Ratings for Empathy and Distress from Document-Level User Responses. https://arxiv.org/abs/1912.01079
Bibtex Citation:
@misc{sedoc2019learning,
  author={João Sedoc and Sven Buechel and Yehonathan Nachmany and Anneke Buffone and Lyle Ungar},
  title={Learning Word Ratings for Empathy and Distress from Document-Level User Responses},
  year={2019},
  eprint={1912.01079},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
- Additional Lexica
Please visit LexHub for more resources.
- When Applying the Lexica
To calculate lexicon usage, multiply each word's weight in the lexicon by that word's relative frequency and sum over all words, then add the intercept value (found under '_intercept' in the lexicon CSVs) to correct for the model bias.
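The calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not code from the lexicon distribution: the function name `lexicon_usage`, the `weights` dictionary, and the example values are all hypothetical. It assumes the document has already been tokenized and that the lexicon CSV has been loaded into a word-to-weight mapping, with the '_intercept' value passed separately.

```python
from collections import Counter


def lexicon_usage(tokens, weights, intercept=0.0):
    """Score a tokenized document with a weighted lexicon.

    tokens    -- list of words in the document (or all words of one user)
    weights   -- dict mapping word -> lexicon weight (hypothetical layout)
    intercept -- the value stored under '_intercept' in the lexicon CSV
    """
    total = len(tokens)  # total word count, including words not in the lexicon
    if total == 0:
        raise ValueError("cannot score an empty document")
    counts = Counter(tokens)
    # Sum of weight * relative frequency over every word in the lexicon.
    usage = sum(w * counts[word] / total for word, w in weights.items())
    return usage + intercept


# Hypothetical weights and intercept, chosen only for illustration.
example_weights = {"a": 0.5, "b": -1.0, "c": 2.0}
example_score = lexicon_usage(["a", "a", "b", "c"], example_weights, intercept=23.0)
print(example_score)
```

Note that the denominator is the document's full word count, so words missing from the lexicon still dilute the score even though they contribute no weight.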
Walk-through example:
A weighted lexicon is often applied as the sum of all weighted word relative frequencies over a document:
$$\mathrm{usage}_{lex}(doc) = \sum_{word \in lex} w_{lex}(word) \cdot \frac{freq(word, doc)}{freq(*, doc)}$$
where $w_{lex}(word)$ is the lexicon weight for the word, $freq(word, doc)$ is the frequency of the word in the document (or for a given user), and $freq(*, doc)$ is the total word count for that document (or user).
For example, take a lexicon that assigns weights to the words a, b, and c, and two documents in which the frequency of each of those words is known. First compute the total word count of each document. Each document's lexicon usage is then given by summing the weighted relative frequencies of its words. Once the usages have been computed, the intercept of the lexicon needs to be added to each of them.
If the lexicon represents age, the two resulting values are the predicted ages for the two documents. If it represents gender, take the sign of the result: if it is positive, the document is predicted to be female; otherwise, male.
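To make the steps concrete, here is a worked calculation with made-up numbers. Every value below is hypothetical and chosen only to keep the arithmetic simple; none of it comes from the published lexica. Assume weights $w(a) = 0.5$, $w(b) = -1.0$, $w(c) = 2.0$, a document $d_1$ containing a twice and b and c once each (4 words in total), and a document $d_2$ containing a once, b three times, and c once (5 words in total).

```latex
\begin{align*}
\mathrm{usage}(d_1) &= 0.5 \cdot \tfrac{2}{4} + (-1.0) \cdot \tfrac{1}{4} + 2.0 \cdot \tfrac{1}{4}
                     = 0.25 - 0.25 + 0.5 = 0.5 \\
\mathrm{usage}(d_2) &= 0.5 \cdot \tfrac{1}{5} + (-1.0) \cdot \tfrac{3}{5} + 2.0 \cdot \tfrac{1}{5}
                     = 0.1 - 0.6 + 0.4 = -0.1
\end{align*}
```

If these were weights from an age lexicon with a hypothetical intercept of 23, the predicted ages would be $0.5 + 23 = 23.5$ and $-0.1 + 23 = 22.9$. If they were weights from a gender lexicon with a hypothetical intercept of 0, the signs of $0.5$ and $-0.1$ would label $d_1$ as female and $d_2$ as male.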