
Festival Chatter (Part 3) – Bonnaroo Analysis in the Fourth Dimension


In this series of posts (part 1, part 2), I have been showing how to use Python and other data science tools to analyze a collection of tweets related to the 2014 Bonnaroo Music and Arts Festival. So far, the investigation has been limited to summary data on the full dataset. The beauty of Twitter is that it happens in realtime, so we can now peer into the fourth dimension and study these tweets as a function of time.

More Organic

Before we view the Bonnaroo tweets as a time series, I would like to make a quick remark about the organic-ness of the tweets. If you recall from the previous post, I removed duplicates and retweets from my collection in order to make the tweet database more indicative of true audience reactions. On further investigation, it seems that there were still many spammy media sources in the collection. To make the tweets even more organic, I decided to look at the source of the tweets.

Because Kanye West was the most popular artist in the previous posts' analysis, I decided to look at the top 15 sources of tweets that mentioned him (a sketch of how a count like this can be pulled follows the list):

twitterfeed                    1585
dlvr.it                         749
Twitter for iPhone              366
IFTTT                           256
Hootsuite                       201
Twitter for Websites            188
Twitter Web Client              127
Facebook                        120
Twitter for Android             119
WordPress.com                   102
Tumblr                           81
Instagram                        73
iOS                              42
TweetDeck                        38
TweetAdder v4                    37

twitterfeed and dlvr.it are social media platforms for deploying mass tweets, and a look at some of these tweets confirms this. So, I decided to create a list of "organic sources", consisting mainly of mobile Twitter clients, and use it to cull the tweet collection:

organic_sources = ['Twitter for iPhone', 'Twitter Web Client',
                    'Facebook', 'Twitter for Android', 'Instagram']
organics = organics[organics['source'].isin(organic_sources)]

With this new dataset, I re-ran the band popularity histogram from the previous post, and I was surprised to see that Kanye got bumped down to third place! It looks like Kanye is popular with the media, but Jack White and Elton John were more popular with the Bonnaroo audience.
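
Just to sketch how that re-ranking can be reproduced on the filtered data (this is not the buildMentionHist code from the previous post, and it assumes alias_dict maps each artist name to a list of alias strings):

# Rough sketch: count organic tweets mentioning each artist by name or alias.
# Assumed alias_dict structure: {artist_name: [alias, alias, ...], ...}
def mentions(text, names):
    """True if any of the given names/aliases appears in the tweet text."""
    return any(n.lower() in text for n in names)

mention_counts = pd.Series({
    artist: organics['text'].str.lower().apply(mentions, names=[artist] + list(aliases)).sum()
    for artist, aliases in alias_dict.items()
})
print(mention_counts.sort_values(ascending=False).head(10))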

Let's now look at the time dependence of the tweets. For this, we want to use the created_at field as our index and tell pandas to treat its elements as datetime objects.

from pandas import Series, DataFrame

# Clean up the created_at field (each entry is a dict keyed by '$date')
organics['created_at'] = [tweetTime['$date'] for tweetTime in organics['created_at']]

# Convert to datetimes, index on them, and shift from UTC to EST
organics['created_at'] = pd.to_datetime(Series(organics['created_at']))
organics = organics.set_index('created_at', drop=False)
organics.index = organics.index.tz_localize('UTC').tz_convert('EST')

To look at the number of tweets per hour, we have to resample our tweet collection.

ts_hist = organics['created_at'].resample('60t', how='count')

The majority of my time spent creating this blog post consisted of fighting with matplotlib, trying to get decent looking plots. I thought it would be cool to try to make a "fill between" plot, which took way longer to figure out than it should have. The key is that fill_between takes 3 inputs: an array for the x-axis and two y-axis arrays between which the function fills color. If one just wants to plot a regular curve and fill down to the x-axis, one must create an array of zeros that is the same length as the curve. Also, I get pretty confused about which commands should be called on ax, plt, and fig. Anyway, the code and corresponding figure are below.

import numpy as np
import matplotlib.pyplot as plt

# Prettier pandas plot settings
# Not sure why 'default' is not the default...
pd.options.display.mpl_style = 'default'
x_date = ts_hist.index
zero_line = np.zeros(len(x_date))

fig, ax = plt.subplots()
ax.fill_between(x_date, zero_line, ts_hist.values, facecolor='blue', alpha=0.5)
# Format plot
plt.setp(ax.get_xticklabels(), fontsize=12, family='sans-serif')
plt.setp(ax.get_yticklabels(), fontsize=12, family='sans-serif')
plt.xlabel('Date', fontsize=30)
plt.ylabel('Counts', fontsize=30)

plt.show()

Tweets per Hour

As you can see, tweet frequency was fairly consistent across each day of the festival and persisted until the early hours of each morning.
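
One quick way to check that late-night claim is to aggregate by hour of day. This is only a sketch, reusing the EST-indexed organics frame from above:

# Rough check: total tweet counts by hour of day (EST), across the whole festival.
by_hour = organics['created_at'].groupby(organics.index.hour).count()
by_hour.plot(kind='bar')
plt.xlabel('Hour of day (EST)')
plt.ylabel('Counts')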

Band Popularity Time Series

We can now return to questions from the previous post and look at how the top five bands' popularity changed with time. Using my program from the previous post, buildMentionHist, we can add a column for each band to our current organics dataframe. Each row of a band's column will contain a True or False value corresponding to whether or not the artist was mentioned in that tweet. We resample the columns like above, but this time in bins of 10 minutes.

import buildMentionHist as bmh
import json

path = 'bonnaroooAliasList.json'
alias_dict = [json.loads(line) for line in open(path)][0]
bandPop = organics['text'].apply(bmh.build_apply_fun(alias_dict),
                                 alias_dict)
top_five = bandPop.index.tolist()[:5] # Get top five artists' names
bandPop = pd.concat([organics, bandPop], axis=1)

top_five_ts = DataFrame()
for band in top_five:
    top_five_ts[band] = bandPop[bandPop[band] == True]['text'].resample('10min', how='count')

We now have a dataframe called top_five_ts that contains the time series information for the five most popular bands at Bonnaroo. All we have to do now is plot these time series. I again wanted to make some fill between plots, but with different colors for each band. I used the prettyplotlib library to help with this because it has nicer looking default colors. I plot both the full time series and a "zoomed-in" time series that is closer to when the artists' popularities peaked on Twitter. I ran into a lot of trouble trying to get the dates and times formatted correctly on the x-axis of the zoomed-in plot, so I've included that code below. There is probably a better way to do it, but at least this finally worked.

import pytz
from datetime import datetime

import prettyplotlib as ppl
from prettyplotlib import brewer2mpl
from matplotlib import dates

for band in top_five_ts:
    ppl.fill_between(top_five_ts.index.tolist(), 0., top_five_ts[band])

ax = plt.gca()
fig = plt.gcf()
set2 = brewer2mpl.get_map('Set2', 'qualitative', 8).mpl_colors

# Note: have to make the legend by hand for fill_between plots.
# BEGIN making legend
legendProxies = []
for color in set2:
    legendProxies.append(plt.Rectangle((0, 0), 1, 1, fc=color))

leg = plt.legend(legendProxies, top_five, loc=2)
leg.draw_frame(False)
# END making legend


# BEGIN formatting xaxis
datemin = datetime(2014, 6, 13, 12, 0, 0)
datemax = datetime(2014, 6, 16, 12, 0, 0)
est = pytz.timezone('EST')
plt.axis([est.localize(datemin), est.localize(datemax), 0, 80])
fmt = dates.DateFormatter('%m/%d %H:%M', tz=est)
ax.xaxis.set_major_formatter(fmt)
ax.xaxis.set_tick_params(direction='out')
# END formatting xaxis

plt.xlabel('Date', fontsize=30)
plt.ylabel('Counts', fontsize=30)

Here is the full time series:

Top Five TS Popularity

And here is the zoomed-in time series:

Top Five TS Popularity Zoom

If we look at when each band went on stage, we can see that each band's popularity spiked while they were performing. This is good – it looks like we are measuring truly "organic" interest on Twitter! The set times are listed below, and a quick sketch of overlaying them on the plot follows the table.

Band               Performance Time
Jack White         6/14 10:45PM – 12:15AM
Elton John         6/15 9:30PM – 11:30PM
Kanye West         6/13 10:00PM – 12:00AM
Skrillex           6/14 1:30AM – 3:30AM
Vampire Weekend    6/13 7:30PM – 8:45PM
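
As a rough check (a minimal sketch only, not code from the original analysis), the set times can be shaded onto the zoomed-in plot with matplotlib's axvspan. The timestamp strings below are simply the table's times written out as EST datetimes, and ax is the axis from the plotting code above.

# Shade each band's set time on the existing time-series axis (times assumed EST).
set_times = {
    'Jack White':      ('2014-06-14 22:45', '2014-06-15 00:15'),
    'Elton John':      ('2014-06-15 21:30', '2014-06-15 23:30'),
    'Kanye West':      ('2014-06-13 22:00', '2014-06-14 00:00'),
    'Skrillex':        ('2014-06-14 01:30', '2014-06-14 03:30'),
    'Vampire Weekend': ('2014-06-13 19:30', '2014-06-13 20:45'),
}
for band, (start, end) in set_times.items():
    ax.axvspan(pd.Timestamp(start, tz='EST'),
               pd.Timestamp(end, tz='EST'),
               color='gray', alpha=0.15)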

Delving into the Text

Up until now, I have not looked too closely at the actual text of the tweets, other than to find a mention of an artist. Using the nltk library, we can learn a little more about some general qualities of the text. The simplest quantity to look at is the most frequently used words. To do this, I go through every tweet and break all of the words up into individual elements of a list. In the language of natural language processing, we are "tokenizing" the text. Common English stopwords are omitted, as well as any mentions of the artists or artists' aliases. I use a regular expression to only grab words from the sentences and ignore punctuation (except for apostrophes). I also take our alias_dict from the previous post and make sure that those words are not collected when tokenizing the tweets.

from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
import re

def custom_tokenize(text, custom_words=None, clean_custom_words=False):
    """
    This routine takes an input "text" and strips punctuation
    (except apostrophes), converts each word to lowercase,
    removes standard english stopwords, removes a set of
    custom_words (optional), and returns a list of all of the
    leftover words.

    INPUTS:
    text = text string that one wants to tokenize
    custom_words = custom list or dictionary of words to omit
                    from the tokenization.
    clean_custom_words = Flag as True if you want to clean
                            these words.
                         Flag as False if mapping this function
                            to many keys. In that case,
                            pre-clean the words before running
                            this function.
    OUTPUTS:
    words = This is a list of the tokenized version of each word
            that was in "text"
    """
    tokenizer = RegexpTokenizer(r"[\w']+")
    stop_url = re.compile(r'http[^\s]+')
    stops = stopwords.words('english')

    if custom_words is None:
        custom_words = set()  # nothing custom to remove
    if clean_custom_words:
        custom_words = tokenize_custom_words(custom_words)

    # Drop URLs and lowercase, then tokenize and filter out stopwords
    # and any custom words (e.g. artist aliases)
    words = [w.lower() for w in text.split() if not re.match(stop_url, w)]
    words = tokenizer.tokenize(' '.join(words))
    words = [w for w in words if w not in stops and w not in custom_words]

    return words

def tokenize_custom_words(custom_words):
    tokenizer = RegexpTokenizer(r"[\w']+")
    custom_tokens = []
    stops = stopwords.words('english')

    if type(custom_words) is dict: # Useful for alias_dict
        for k, v in custom_words.iteritems():
            k_tokens = [w.lower() for w in k.split() if w.lower() not in stops]
            # Remove all punctuation
            k_tokens = tokenizer.tokenize(' '.join(k_tokens))
            # Remove apostrophes
            k_tokens = [w.replace("'", "") for w in k_tokens]
            # Below takes care of nested lists, then tokenizes
            v_tokens = [word for listwords in v for word in listwords]
            v_tokens = tokenizer.tokenize(' '.join(v_tokens))
            # Remove apostrophes
            v_tokens = [w.replace("'", "") for w in v_tokens]
            custom_tokens.extend(k_tokens)
            custom_tokens.extend(v_tokens)
    elif type(custom_words) is list:
        # Tokenize each entry, flatten into one list, and strip apostrophes
        custom_tokens = [w for words in custom_words
                         for w in tokenizer.tokenize(words)]
        custom_tokens = [w.replace("'", "") for w in custom_tokens]

    custom_tokens = set(custom_tokens)
    return custom_tokens

Using the above code, I can apply the custom_tokenize function to each row of my organics dataframe. Before doing this, though, I make sure to run the tokenize_custom_words function on the alias dictionary. Otherwise, I would end up cleaning the aliases for every row in the dataframe, which is a waste of time.

import custom_tokenize as tk

clean_aliases = tk.tokenize_custom_words(alias_dict)
token_df = organics['text'].apply(tk.custom_tokenize,
                                  custom_words=clean_aliases,
                                  clean_custom_words=False)

Finally, I collect all of the tokens into one big list and use nltk's FreqDist function to get the word frequency distribution.

import nltk

# Need to flatten all tokens into one big list:
big_tokens = [y for x in token_df.values for y in x]
distr = nltk.FreqDist(big_tokens)
distr.pop('bonnaroo')  # Obviously the highest frequency word; drop it
distr.plot(25)

Word Frequency Distribution

A couple of things caught my eye – the first being that people like to talk about themselves (see the popularity of "i'm"). Also, it was quite popular to misspell "Bonnaroo" (see the popularity of "bonaroo"). I wanted to see if there was any correlation between misspellings and, perhaps, people being intoxicated at night, but the time series behavior of the misspellings looks similar in shape (though not magnitude) to the full tweet time series plotted earlier in the post.

misspell = token_df.apply(lambda x: 'bonaroo' in x)
misspell = misspell[misspell].resample('60t', how='count')

Misspelling Time Series

The other thing that struck me was that the word "best" was one of the top 25 most frequent words. Assuming that "best" correlates with happiness, we can see that people got happier and happier as the festival progressed:
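
A time series for "best" can be built the same way as the misspelling one above; here is a minimal sketch (not necessarily the exact code behind the figure below):

# Hourly counts of tweets whose tokens include "best".
best = token_df.apply(lambda x: 'best' in x)
best = best[best].resample('60t', how='count')
best.plot()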

“Best” Time Series

This is, of course, a fairly simplistic measure of text sentiment. In my next post, I would like to quantify more robust measures of Bonnaroo audience sentiment.

By the way, the code used in this whole series on Bonnaroo is available on my GitHub.
