Lexica

Age and Gender Lexica
Our data-driven age and gender lexica were generated from about 97,000 Facebook, Blogger and Twitter users. [.zip]
Link to Publication

APA Citation

Sap, M., Park, G., Eichstaedt, J. C., Kern, M. L., Stillwell, D. J., 
Kosinski, M., Ungar, L. H., & Schwartz, H. A. (2014). Developing age 
and gender predictive lexica over social media. EMNLP.

Bibtex Citation

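@inproceedings{sap2014developing,
  author={Sap, Maarten and Park, Gregory and Eichstaedt, Johannes C and Kern, Margaret L and Stillwell, David J and Kosinski, Michal and Ungar, Lyle H and Schwartz, H Andrew},
  title={Developing age and gender predictive lexica over social media},
  booktitle={EMNLP},
  year={2014}
}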

Refined Lexica
Please email Andrew Schwartz to request a refined lexicon such as LIWC. This may be posted shortly.
Link to Publication

APA Citation

Schwartz, H. A., Eichstaedt, J. C., Dziurzynski, L., Kern, M. L., Blanco, E., Ramones, S., Seligman, M. E. P., & Ungar, L. H. (2013). Choosing the right words: Characterizing and reducing error of the Word Count Approach. Proceedings of *SEM-2013: Second Joint Conference on Lexical and Computational Semantics, Atlanta, GA, USA, 296-305.

Bibtex Citation

@inproceedings{schwartz2013choosing,
  author={Schwartz, H Andrew and Eichstaedt, Johannes C and Dziurzynski, Lukasz and Kern, Margaret L and Blanco, Eduardo and Ramones, Stephanie and Seligman, Martin E P and Ungar, Lyle H},
  title={Choosing the right words: Characterizing and reducing error of the Word Count Approach},
  booktitle={Proceedings of *SEM-2013: Second Joint Conference on Lexical and Computational Semantics, Atlanta, Georgia, USA},
  year={2013},
  location={Atlanta, GA, USA},
  pages={296-305}
 }

PERMA Lexicon
Our lexicon to predict well-being as measured through PERMA scales. [.zip] [Usage license]
Link to Publication

APA Citation

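Schwartz, H. A., Sap, M., Kern, M. L., Eichstaedt, J. C., Kapelner, A., Agrawal, M., Blanco, E., Dziurzynski, L., Park, G., Stillwell, D., Kosinski, M., Seligman, M. E. P., & Ungar, L. H. (2016). Predicting Individual Well-Being Through the Language of Social Media. Pacific Symposium on Biocomputing, 21, 516-527.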

Bibtex Citation

@inproceedings{schwartz2016predicting,
  author={Schwartz, H Andrew and Sap, Maarten and Kern, Margaret L and Eichstaedt, Johannes C and Kapelner, Adam and Agrawal, Megha and Blanco, Eduardo and Dziurzynski, Lukasz and Park, Gregory and Stillwell, David and Kosinski, Michal and Seligman, Martin E P and Ungar, Lyle H},
  title={Predicting Individual Well-Being Through the Language of Social Media},
  booktitle={Pacific Symposium on Biocomputing},
  volume={21},
  year={2016},
  pages={516-527}
}

Spanish PERMA Lexicon
Our lexicon to measure PERMA in Spanish, derived from Spanish tweets annotated with PERMA. [.zip]
Link to Publication

APA Citation

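Schwartz, H. A., Sap, M., Kern, M. L., Eichstaedt, J. C., Kapelner, A., Agrawal, M., Blanco, E., Dziurzynski, L., Park, G., Stillwell, D., Kosinski, M., Seligman, M. E. P., & Ungar, L. H. (2016). Predicting Individual Well-Being Through the Language of Social Media. Pacific Symposium on Biocomputing, 21, 516-527.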

Bibtex Citation

@inproceedings{schwartz2016predicting,
  author={Schwartz, H Andrew and Sap, Maarten and Kern, Margaret L and Eichstaedt, Johannes C and Kapelner, Adam and Agrawal, Megha and Blanco, Eduardo and Dziurzynski, Lukasz and Park, Gregory and Stillwell, David and Kosinski, Michal and Seligman, Martin E P and Ungar, Lyle H},
  title={Predicting Individual Well-Being Through the Language of Social Media},
  booktitle={Pacific Symposium on Biocomputing},
  volume={21},
  year={2016},
  pages={516-527}
}

Prospection Lexicon: Temporal Orientation
Affect and Intensity Lexicon
Optimism Lexicon

Combining the affect lexicon with the future orientation lexicon produces an optimism lexicon (positive future-oriented thinking). Simply filter messages to those future-oriented using the future orientation lexicon, then apply the affect lexicon.
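The filter-then-score procedure described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the toy weights, the whitespace tokenizer, the helper names (`lexicon_score`, `optimism_scores`), and the rule "a future-orientation score above a threshold of 0 counts as future-oriented" are all hypothetical and are not taken from the released lexica, whose file layout and decision rule may differ.

```python
from collections import Counter

# Hypothetical toy lexica; the real lexica contain many more terms and weights.
FUTURE_LEX = {"will": 1.0, "tomorrow": 1.0, "was": -1.0}
AFFECT_LEX = {"great": 2.0, "awful": -2.0}


def lexicon_score(lexicon, tokens, intercept=0.0):
    """Weighted sum of relative word frequencies plus an optional intercept."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return intercept
    usage = sum(w * counts[word] / total for word, w in lexicon.items())
    return usage + intercept


def optimism_scores(messages, future_lex, affect_lex, future_threshold=0.0):
    """Keep messages scored as future-oriented, then apply the affect lexicon."""
    results = []
    for message in messages:
        tokens = message.lower().split()  # assumed tokenizer, for illustration
        # Step 1: filter to future-oriented messages (assumed threshold rule).
        if lexicon_score(future_lex, tokens) <= future_threshold:
            continue
        # Step 2: score the surviving message with the affect lexicon.
        results.append((message, lexicon_score(affect_lex, tokens)))
    return results


messages = [
    "tomorrow will be great",
    "yesterday was awful",
    "tomorrow will be awful",
]
scored = optimism_scores(messages, FUTURE_LEX, AFFECT_LEX)
```

In this toy run, the second message is dropped by the future-orientation filter, and the remaining two receive positive and negative affect scores respectively.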

Empathic Concern Lexicon
Lexicon and clusters included. [.zip]
Link to Publication

APA Citation

Sedoc, J., Buechel, S., Nachmany, Y., Buffone, A., & Ungar, L. (2019). Learning Word Ratings for Empathy and Distress from Document-Level User Responses. https://arxiv.org/abs/1912.01079

Bibtex Citation

@misc{sedoc2019learning,
  author={Sedoc, João and Buechel, Sven and Nachmany, Yehonathan and Buffone, Anneke and Ungar, Lyle},
  title={Learning Word Ratings for Empathy and Distress from Document-Level User Responses},
  year={2019},
  eprint={1912.01079},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}

Personal Distress Lexicon
Lexicon and clusters included. [.zip]
Link to Publication

APA Citation

Sedoc, J., Buechel, S., Nachmany, Y., Buffone, A., & Ungar, L. (2019). Learning Word Ratings for Empathy and Distress from Document-Level User Responses. https://arxiv.org/abs/1912.01079

Bibtex Citation

@misc{sedoc2019learning,
  author={Sedoc, João and Buechel, Sven and Nachmany, Yehonathan and Buffone, Anneke and Ungar, Lyle},
  title={Learning Word Ratings for Empathy and Distress from Document-Level User Responses},
  year={2019},
  eprint={1912.01079},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}

Additional Lexica
Please visit [LexHub] for more resources.

When Applying the Lexica
To calculate lexicon usage, sum over all words the word's weight in that lexicon multiplied by the word's relative frequency, then add the intercept value (found under '_intercept' in the lexicon csvs) to correct for the model bias.

Walk-through Example

A weighted lexicon is often applied as the sum of all weighted word relative frequencies over a document:
$usage_{lex}=\sum_{word\in lex}w_{lex}(word)\cdot\frac{freq(word,doc)}{freq(*,doc)}$
where $w_{lex}(word)$ is the lexicon $(lex)$ weight for the word, $freq(word, doc)$ is the frequency of the word in the document (or for a given user), and $freq(*, doc)$ is the total word count for that document (or user).
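As a minimal sketch, the usage formula above can be written in Python like this. The function name `lexicon_usage`, the token-list input, and the toy weights are assumptions made for illustration; they are not part of the released lexica or of any accompanying tooling.

```python
from collections import Counter


def lexicon_usage(weights, tokens):
    """Return sum over lexicon words of weight * relative frequency in one document.

    weights: dict mapping word -> lexicon weight, i.e. w_lex(word)
    tokens:  list of word tokens for a document (or for a user)
    """
    counts = Counter(tokens)       # freq(word, doc)
    total = sum(counts.values())   # freq(*, doc)
    if total == 0:
        return 0.0                 # empty document: no usage
    return sum(w * counts[word] / total for word, w in weights.items())


# Hypothetical toy lexicon and document.
toy_weights = {"good": 2.0, "bad": -1.0}
toy_tokens = "good good bad ok".split()
toy_usage = lexicon_usage(toy_weights, toy_tokens)  # 2.0*(2/4) + (-1.0)*(1/4)
```

Words that are in the document but not in the lexicon (here `ok`) still count toward the total word count in the denominator, which matches the definition of $freq(*, doc)$.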
For example, let's say a lexicon has the following weights for words a, b, and c:
$\textbf{lex:}\ a: 3, b: 87, c: -15$

and two documents with the following frequencies of words:
$document_1: a: 2, b: 10, c: 3, d: 0, e: 6, f: 4$
$document_2: a: 5, b: 3, c: 8, d: 4, e: 0, f: 10$

therefore the total word uses in the documents are:
$document_1: 2 + 10 + 3 + 0 + 6 + 4 = 25$
$document_2: 5 + 3 + 8 + 4 + 0 + 10 = 30$

The documents' lexicon usages are given by summing the weighted relative frequencies:
$doc_{1,lex} = (2/25)\cdot3 + (10/25)\cdot87 + (3/25)\cdot(-15) = 33.24$
$doc_{2,lex} = (5/30)\cdot3 + (3/30)\cdot87 + (8/30)\cdot(-15) = 5.2$

Once the usages have been computed, the intercept of the lexicon needs to be added to the usages:
$intercept_{lex} = 23.2189$
$lex_{doc1} = 33.24 + 23.2189 = 56.4589$
$lex_{doc2} = 5.2 + 23.2189 = 28.4189$

If the lexicon represents age, $lex_{doc1}$ and $lex_{doc2}$ are the predicted ages for the two documents. If it represents gender, take the sign of the result: a positive value predicts a female author, otherwise a male author.
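The walk-through numbers can be checked with a short Python script. This is a sketch under assumptions: the dictionary layout and the helper names (`usage`, `score`, `gender_label`) are hypothetical, and the weights, counts, and intercept are the example values from the walk-through rather than values from a real lexicon.

```python
# Example values from the walk-through above.
LEX = {"a": 3, "b": 87, "c": -15}
INTERCEPT = 23.2189

DOC1 = {"a": 2, "b": 10, "c": 3, "d": 0, "e": 6, "f": 4}
DOC2 = {"a": 5, "b": 3, "c": 8, "d": 4, "e": 0, "f": 10}


def usage(lex, counts):
    """Sum of weight * relative frequency over the lexicon words."""
    total = sum(counts.values())  # 25 for DOC1, 30 for DOC2
    return sum(w * counts.get(word, 0) / total for word, w in lex.items())


def score(lex, intercept, counts):
    """Lexicon usage with the intercept added."""
    return usage(lex, counts) + intercept


def gender_label(value):
    """Sign rule for a gender lexicon: positive is female, otherwise male."""
    return "female" if value > 0 else "male"


for name, doc in (("doc1", DOC1), ("doc2", DOC2)):
    # Rounded to 4 decimals to hide floating-point noise.
    print(name, round(usage(LEX, doc), 4), round(score(LEX, INTERCEPT, doc), 4))
```

Rounded to four decimals, the script reports usages of 33.24 and 5.2 and final values of 56.4589 and 28.4189. If the same values came from a gender lexicon, `gender_label` would return "female" for both documents, since both results are positive.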