Data Sets

Data sets used to mine data from natural language (e.g. Twitter).

ocean.1to3grams.gender_age.rmatrix.csv

via

http://wwbp.org/data.html

excerpt

In psychology, the Big Five personality traits are five broad domains or dimensions of personality that are used to describe human personality. The theory based on the Big Five factors is called the five-factor model (FFM). The five factors are openness, conscientiousness, extraversion, agreeableness, and neuroticism. Acronyms commonly used to refer to the five traits collectively are OCEAN, NEOAC, or CANOE. Beneath each global factor, a cluster of correlated and more specific primary factors are found; for example, extraversion includes such related qualities as gregariousness, assertiveness, excitement seeking, warmth, activity, and positive emotions.

sql table

https://alaning.me/index.php/Twitter_Project_Database_Tables#PERSONABLE

	[word] [varchar](40) NOT NULL,
	[ext] [real] NULL,
	[agr] [real] NULL,
	[con] [real] NULL,
	[neu] [real] NULL,
	[ope] [real] NULL

c# sql loader

http://git.alaning.me/root/load_db_personable
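
As a rough illustration of how a table shaped like PERSONABLE could be used, the sketch below sums the five trait weights over the words of a text. It is not the C# loader linked above. The in-memory SQLite table, the function names `build_db` and `trait_totals`, and the weights in `ROWS` are all made up for the example, and only single-word entries are matched even though the source file also contains 2-grams and 3-grams.

```python
import sqlite3

# Toy rows with made-up weights; real values come from the wwbp.org CSV.
ROWS = [
    ("party", 0.5, 0.0, -0.25, 0.0, 0.0),
    ("book", -0.25, 0.0, 0.0, 0.0, 0.5),
]


def build_db(rows):
    """Create an in-memory stand-in for the PERSONABLE table."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE PERSONABLE (word TEXT NOT NULL, "
        "ext REAL, agr REAL, con REAL, neu REAL, ope REAL)"
    )
    db.executemany("INSERT INTO PERSONABLE VALUES (?, ?, ?, ?, ?, ?)", rows)
    return db


def trait_totals(db, text):
    """Sum each trait weight over the tokens of `text` found in the table."""
    totals = {"ext": 0.0, "agr": 0.0, "con": 0.0, "neu": 0.0, "ope": 0.0}
    for token in text.lower().split():
        row = db.execute(
            "SELECT ext, agr, con, neu, ope FROM PERSONABLE WHERE word = ?",
            (token,),
        ).fetchone()
        if row is not None:
            for trait, weight in zip(totals, row):
                totals[trait] += weight or 0.0  # NULL weights count as zero
    return totals
```

Calling `trait_totals(build_db(ROWS), "Party with a book")` returns one running total per trait column.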

NRCEmotional

via

http://www.saifmohammad.com/WebPages/ResearchInterests.html

excerpt

The lexicon has human annotations of emotion associations for more than 24,200 word senses (about 14,200 word types). The annotations include whether the target is positive or negative, and whether the target has associations with eight basic emotions (joy, sadness, anger, fear, surprise, anticipation, trust, disgust).

sql table

https://alaning.me/index.php/Twitter_Project_Database_Tables#NRCEmotional

[word] [varchar](20) NOT NULL,
[positive] [real] NULL,
[negative] [real] NULL,
[anger] [real] NULL,
[anticipation] [real] NULL,
[disgust] [real] NULL,
[fear] [real] NULL,
[joy] [real] NULL,
[sadness] [real] NULL,
[surprise] [real] NULL,
[trust] [real] NULL

c# sql loader

http://git.alaning.me/root/load_db_nrcemotional
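
The NRCEmotional table holds one row per word with a column per category. The sketch below shows one way to get there, assuming the word-level lexicon file stores one tab-separated `word, category, flag` triple per line. That file layout and the name `pivot_lexicon` are assumptions for illustration, not taken from the loader above.

```python
from collections import defaultdict

# Column order of the NRCEmotional table (minus the word column).
COLUMNS = ["positive", "negative", "anger", "anticipation", "disgust",
           "fear", "joy", "sadness", "surprise", "trust"]


def pivot_lexicon(lines):
    """Turn 'word<TAB>category<TAB>flag' lines into one wide row per word."""
    rows = defaultdict(lambda: dict.fromkeys(COLUMNS, 0.0))
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        word, category, flag = parts
        if category in COLUMNS:
            rows[word][category] = float(flag)
    return dict(rows)
```

Each value in the returned dictionary maps directly onto one row of the table, with categories that never appear for a word left at 0.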

Sentiment140 Lexicon

via

http://www.saifmohammad.com/WebPages/ResearchInterests.html

http://www.umiacs.umd.edu/~saif/WebPages/Abstracts/NRC-SentimentAnalysis.htm

excerpt

The Sentiment140 Lexicon is also a list of words with associations to positive and negative sentiments. It has the same format as the NRC Hashtag Sentiment Lexicon. However, it was created from the sentiment140 corpus of 1.6 million tweets, and emoticons were used as positive and negative labels (instead of hashtagged words).

sql table

https://alaning.me/index.php/Twitter_Project_Database_Tables#NRCSentimental

[word] [varchar](225) NOT NULL,
[sentimentScore] [real] NULL,
[numPositive] [real] NULL,
[numNegative] [real] NULL

c# sql loader

http://git.alaning.me/root/load_db_nrcsentimental

NRC Hashtag Sentiment Lexicon

via

http://www.saifmohammad.com/WebPages/ResearchInterests.html

http://www.umiacs.umd.edu/~saif/WebPages/Abstracts/NRC-SentimentAnalysis.htm

excerpt

The NRC Hashtag Sentiment Lexicon is a list of words with associations to positive and negative sentiments. The lexicon is distributed in three files: unigrams-pmilexicon.txt, bigrams-pmilexicon.txt, and pairs-pmilexicon.txt. The hashtag lexicon was created from a collection of tweets that had a positive or a negative word hashtag such as #good, #excellent, #bad, and #terrible.

sql table

https://alaning.me/index.php/Twitter_Project_Database_Tables#NRCSentimentalTags

[word] [varchar](40) NOT NULL,
[sentimentScore] [real] NULL,
[numPositive] [real] NULL,
[numNegative] [real] NULL

c# sql loader

http://git.alaning.me/root/load_db_nrcsentimentaltags
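
The column list above lines up with a four-field record per term. The sketch below parses such records and sets aside terms that would not fit the `[word] [varchar](40)` column. It assumes each line of the pmilexicon files is `term<TAB>sentimentScore<TAB>numPositive<TAB>numNegative`. The function names and the sample line in the usage note are illustrative only.

```python
MAX_WORD_LEN = 40  # width of [word] [varchar](40) in the table above


def parse_pmi_line(line):
    """Split 'term<TAB>score<TAB>numPositive<TAB>numNegative' into a row."""
    term, score, num_pos, num_neg = line.rstrip("\n").split("\t")
    return (term, float(score), float(num_pos), float(num_neg))


def split_by_length(lines, max_len=MAX_WORD_LEN):
    """Return (rows that fit the word column, terms that are too long)."""
    rows, too_long = [], []
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        row = parse_pmi_line(line)
        if len(row[0]) <= max_len:
            rows.append(row)
        else:
            too_long.append(row[0])
    return rows, too_long
```

For instance, `split_by_length(["#good\t1.5\t30\t10\n"])` yields one insertable row and an empty list of rejected terms. Bigram and pair entries are longer than unigrams, which may be why the other sentiment table uses a wider word column.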

labMT-1.0.txt

via

http://www.uvm.edu/storylab/2011/12/08/hedonometrics/

https://pypi.python.org/pypi/labMTsimple/2.1.4

excerpt

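labMT stands for language assessment by Mechanical Turk. Judging by the table below, each word carries a crowd-sourced happiness rating (rank, average, standard deviation) and its frequency rank in Twitter, Google Books, New York Times, and music-lyrics corpora.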

sql table

https://alaning.me/index.php/Twitter_Project_Database_Tables#LABMIT

	[word] [varchar](20) NOT NULL,
	[happiness_rank] [real] NULL,
	[happiness_average] [real] NULL,
	[happiness_standard_deviation] [real] NULL,
	[twitter_rank] [real] NULL,
	[google_rank] [real] NULL,
	[nyt_rank] [real] NULL,
	[lyrics_rank] [real] NULL

c# sql loader

http://git.alaning.me/root/load_db_labmit
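
One common use of a word list like this is to average the per-word happiness over a text, weighting each word by how often it occurs. The sketch below does that with a plain dictionary standing in for the `word` and `happiness_average` columns. The function name and the scores in the usage note are made up for illustration.

```python
from collections import Counter


def average_happiness(happiness, text):
    """Frequency-weighted mean of happiness_average over known words."""
    counts = Counter(tok for tok in text.lower().split() if tok in happiness)
    total = sum(counts.values())
    if total == 0:
        return None  # no scored words in the text
    return sum(happiness[word] * n for word, n in counts.items()) / total
```

With the made-up scores `{"love": 8.0, "hate": 2.0}`, the text "love love hate" averages to (8 + 8 + 2) / 3 = 6.0. Words missing from the list are ignored rather than counted as neutral.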

LIWC2007.dic

via

http://www.liwc.net

https://code.google.com/p/negotiations-ling773/source/browse/trunk/resources/LIWC2007.dic?r=2

excerpt

Linguistic Inquiry and Word Count (LIWC) is a text analysis software program designed by James W. Pennebaker, Roger J. Booth, and Martha E. Francis. LIWC calculates the degree to which people use different categories of words across a wide array of texts, including emails, speeches, poems, or transcribed daily speech. With a click of a button, you can determine the degree any text uses positive or negative emotions, self-references, causal words, and 70 other language dimensions.

sql table

https://alaning.me/index.php/Twitter_Project_Database_Tables#TEDIOUS

[word] [varchar](255) NULL,
[funct] [int] NULL,
[pronoun] [int] NULL,
[ppron] [int] NULL,
[i] [int] NULL,
[we] [int] NULL,
[you] [int] NULL,
[shehe] [int] NULL,
[they] [int] NULL,
[ipron] [int] NULL,
[article] [int] NULL,
[verb] [int] NULL,
[auxverb] [int] NULL,
[past] [int] NULL,
[present] [int] NULL,
[future] [int] NULL,
[adverb] [int] NULL,
[preps] [int] NULL,
[conj] [int] NULL,
[negate] [int] NULL,
[quant] [int] NULL,
[number] [int] NULL,
[swear] [int] NULL,
[social] [int] NULL,
[family] [int] NULL,
[friend] [int] NULL,
[humans] [int] NULL,
[affect] [int] NULL,
[posemo] [int] NULL,
[negemo] [int] NULL,
[anx] [int] NULL,
[anger] [int] NULL,
[sad] [int] NULL,
[cogmech] [int] NULL,
[insight] [int] NULL,
[cause] [int] NULL,
[discrep] [int] NULL,
[tentat] [int] NULL,
[certain] [int] NULL,
[inhib] [int] NULL,
[incl] [int] NULL,
[excl] [int] NULL,
[percept] [int] NULL,
[see] [int] NULL,
[hear] [int] NULL,
[feel] [int] NULL,
[bio] [int] NULL,
[body] [int] NULL,
[health] [int] NULL,
[sexual] [int] NULL,
[ingest] [int] NULL,
[relativ] [int] NULL,
[motion] [int] NULL,
[space] [int] NULL,
[time] [int] NULL,
[work] [int] NULL,
[achieve] [int] NULL,
[leisure] [int] NULL,
[home] [int] NULL,
[money] [int] NULL,
[relig] [int] NULL,
[death] [int] NULL,
[assent] [int] NULL,
[nonfl] [int] NULL,
[filler] [int] NULL

https://alaning.me/index.php/Twitter_Project_Database_Tables#DEVIOUS

[tweet_id] [varchar](255) NULL,
[funct] [int] NULL,
[pronoun] [int] NULL,
[ppron] [int] NULL,
[i] [int] NULL,
[we] [int] NULL,
[you] [int] NULL,
[shehe] [int] NULL,
[they] [int] NULL,
[ipron] [int] NULL,
[article] [int] NULL,
[verb] [int] NULL,
[auxverb] [int] NULL,
[past] [int] NULL,
[present] [int] NULL,
[future] [int] NULL,
[adverb] [int] NULL,
[preps] [int] NULL,
[conj] [int] NULL,
[negate] [int] NULL,
[quant] [int] NULL,
[number] [int] NULL,
[swear] [int] NULL,
[social] [int] NULL,
[family] [int] NULL,
[friend] [int] NULL,
[humans] [int] NULL,
[affect] [int] NULL,
[posemo] [int] NULL,
[negemo] [int] NULL,
[anx] [int] NULL,
[anger] [int] NULL,
[sad] [int] NULL,
[cogmech] [int] NULL,
[insight] [int] NULL,
[cause] [int] NULL,
[discrep] [int] NULL,
[tentat] [int] NULL,
[certain] [int] NULL,
[inhib] [int] NULL,
[incl] [int] NULL,
[excl] [int] NULL,
[percept] [int] NULL,
[see] [int] NULL,
[hear] [int] NULL,
[feel] [int] NULL,
[bio] [int] NULL,
[body] [int] NULL,
[health] [int] NULL,
[sexual] [int] NULL,
[ingest] [int] NULL,
[relativ] [int] NULL,
[motion] [int] NULL,
[space] [int] NULL,
[time] [int] NULL,
[work] [int] NULL,
[achieve] [int] NULL,
[leisure] [int] NULL,
[home] [int] NULL,
[money] [int] NULL,
[relig] [int] NULL,
[death] [int] NULL,
[assent] [int] NULL,
[nonfl] [int] NULL,
[filler] [int] NULL

c# sql loader

http://git.alaning.me/root/load_db_tedious
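
Judging by the column names, TEDIOUS holds one row per dictionary word and DEVIOUS one row per tweet, both with a column per LIWC category. The sketch below shows how per-tweet counts could be derived from a dictionary file. It assumes the `.dic` layout is a `%` line, `id<TAB>name` category lines, a second `%` line, and then `word<TAB>id<TAB>id...` entries where a trailing `*` matches any suffix. That layout, the function names, and the sample entries in the usage note are assumptions, not a description of the linked loader.

```python
def parse_dic(lines):
    """Parse a LIWC-style .dic file into {pattern: [category names]}."""
    categories, entries, section = {}, {}, 0
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line == "%":
            section += 1  # 1 = category list, 2 = word entries
            continue
        parts = line.split()
        if section == 1:
            categories[parts[0]] = parts[1]
        elif section == 2:
            # ids that are not plain category numbers are skipped
            entries[parts[0]] = [categories[i] for i in parts[1:] if i in categories]
    return entries


def lookup(entries, token):
    """Exact match first, then the longest wildcard pattern that fits."""
    if token in entries:
        return entries[token]
    best = None
    for pattern in entries:
        if pattern.endswith("*") and token.startswith(pattern[:-1]):
            if best is None or len(pattern) > len(best):
                best = pattern
    return entries[best] if best else []


def count_categories(entries, text):
    """Count category hits per text, i.e. one DEVIOUS-style row."""
    counts = {}
    for token in text.lower().split():
        for name in lookup(entries, token):
            counts[name] = counts.get(name, 0) + 1
    return counts
```

Given entries such as `the` tagged `funct` and `happ*` tagged `posemo`, the text "The happy dog" would count one hit in each of those two categories.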

AFINN-111.txt

via

https://github.com/uwescience/datasci_course_materials/tree/master/assignment1

excerpt

AFINN is a list of English words rated for valence with an integer between minus five (negative) and plus five (positive). The words have been manually labeled by Finn Årup Nielsen in 2009-2011.

sql table

https://alaning.me/index.php/Twitter_Project_Database_Tables#FANATICAL

	[word] [varchar](20) NOT NULL,
	[value] [real] NULL

c# sql loader

http://git.alaning.me/root/load_db_fanatical
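
Since every AFINN entry is an integer valence between -5 and +5, a loader can check that range while reading the file, and a text can then be scored by summing the valences of its words. The sketch below assumes one `word<TAB>value` pair per line. The function names and the sample lines in the usage note are illustrative, and multi-word entries are loaded but never matched by the simple whitespace tokenizer.

```python
def load_afinn(lines):
    """Parse 'word<TAB>value' lines, rejecting values outside -5..5."""
    scores = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        word, value = line.rsplit("\t", 1)
        value = int(value)
        if not -5 <= value <= 5:
            raise ValueError(f"valence out of range for {word!r}: {value}")
        scores[word] = value
    return scores


def valence(scores, text):
    """Sum the valence of every token that appears in the word list."""
    return sum(scores.get(token, 0) for token in text.lower().split())
```

For example, with sample lines `"good\t3"` and `"bad\t-3"`, the text "good good bad" sums to 3.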

Twitter API

Table that holds individual tweets as pulled from the API.

	[contributors] [nvarchar](MAX) NULL,
	[coordinates] [nvarchar](MAX) NULL,
	[created_at] datetime NULL,
	[entities] [nvarchar](MAX) NULL,
	[favourite_count] [nvarchar](MAX) NULL,
	[favorited] [nvarchar](MAX) NULL,
	[geo] [nvarchar](MAX) NULL,
	[id_str] [nvarchar](MAX) NULL,
	[in_reply_to_screen_name] [nvarchar](MAX) NULL,
	[in_reply_to_status_id_str] [nvarchar](MAX) NULL,
	[in_reply_to_user_id_str] [nvarchar](MAX) NULL,
	[lang] [nvarchar](MAX) NULL,
	[place] [nvarchar](MAX) NULL,
	[retweet_count] [nvarchar](MAX) NULL,
	[retweeted] [nvarchar](MAX) NULL,
	[source] [nvarchar](MAX) NULL,
	[text] [nvarchar](MAX) NULL,
	[truncated] [nvarchar](MAX) NULL,
	[user_id_str] [nvarchar](MAX) NULL
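
A minimal sketch of flattening one tweet into a row for this table follows. It assumes the tweet arrives as a decoded JSON object in the shape of the v1.1 REST API, with `created_at` formatted like `Wed Aug 27 13:08:45 +0000 2008`, the user id nested under `user`, and the favorite count spelled `favorite_count` on the API side. The function name `tweet_to_row` and the choice to serialize non-string values as JSON text are assumptions for illustration.

```python
import json
from datetime import datetime

# Columns of the tweet table, in the order listed above.
COLUMNS = [
    "contributors", "coordinates", "created_at", "entities",
    "favourite_count", "favorited", "geo", "id_str",
    "in_reply_to_screen_name", "in_reply_to_status_id_str",
    "in_reply_to_user_id_str", "lang", "place", "retweet_count",
    "retweeted", "source", "text", "truncated", "user_id_str",
]

CREATED_AT_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def as_text(value):
    """Keep strings and None as-is; serialize everything else as JSON."""
    if value is None or isinstance(value, str):
        return value
    return json.dumps(value)


def tweet_to_row(tweet):
    """Map a decoded tweet object onto the table's columns."""
    row = {}
    for column in COLUMNS:
        if column == "created_at":
            row[column] = datetime.strptime(tweet["created_at"], CREATED_AT_FORMAT)
        elif column == "user_id_str":
            row[column] = (tweet.get("user") or {}).get("id_str")
        elif column == "favourite_count":
            # the table spells the column differently from the API field
            row[column] = as_text(tweet.get("favorite_count"))
        else:
            row[column] = as_text(tweet.get(column))
    return row
```

Fields missing from the tweet come back as `None`, which matches the nullable columns.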