
Festival Chatter (Part 2) – Evaluating Band Popularity from Bonnaroo Tweets


In my previous post, I wrote about how I collected tweets about the Bonnaroo Music and Arts Festival during the entirety of the festival. There are a number of questions that could be answered with this dataset, like

  • Do people spell worse as they become more intoxicated throughout the night?
  • Does text sentiment decline as people go more days without bathing?
  • Who in the world tweets from a laptop during a music festival?

I would really like to answer the above questions (and plan to), but I'll focus on the most obvious question for this post:

Which band was most popular?

And while this question seems simple to answer, there are many reasons this blog post is so long. To start, we don't even have a decent definition of the question!

What does it mean for an artist to be the most popular as measured by tweets? For now, let's work off of the oldest rule of PR: "Any publicity is good publicity". In that case, we can rank band popularity simply by the number of tweets that mention each artist. Let's try this and see what happens.

Of Pythons and Pandas

From the previous post, I have my dataset of Bonnaroo tweets sitting in a MongoDB database. I want to get these tweets out of the database and into IPython, a software package for interactive computing in Python. I exported the database as a giant JSON file and then loaded it into IPython with a list comprehension.
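For reference, one way to do that export from Python is with pymongo; the snippet below is just a sketch, and the database and collection names are placeholders (only the output filename matches the loading code that follows).

from pymongo import MongoClient
import json

client = MongoClient()  # assumes MongoDB is running locally
collection = client['bonnaroo']['tweetCollection']  # placeholder database/collection names

with open('tweetCollection.json', 'w') as fout:
    for tweet in collection.find():
        tweet['_id'] = str(tweet['_id'])  # ObjectId is not JSON-serializable
        fout.write(json.dumps(tweet, default=str) + '\n')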

The next step is to get the JSON records into pandas, a Python library used primarily for manipulating tabular data. The main object for dealing with tabular data, the DataFrame, nicely reads in such a JSON record.

import json
import pandas as pd
from pandas import DataFrame, Series

path = 'tweetCollection.json'
record = [json.loads(line) for line in open(path)]
df = DataFrame(record)

If I type df.count() I see that, sure enough, all 157,600 tweets are present and that 8,656 of them contain location data.

_id           157600
created_at    157600
geo             8656
source        157600
text          157600
dtype: int64

While watching the tweets stream in, I noticed that there are a lot of retweets. To me, these do not seem as "organic" as a bona-fide, original tweet. People and software can spam twitter all they want with retweets, but it is harder to spam with original tweets. So, I think a more robust measure of band popularity is unique tweets. Pandas allows us to easily check this:
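The summary below is just a describe() of the text column (the same call shows up again a little further down):

print df['text'].describe()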

count                                                157600
unique                                               107773
top       RT @502michael502: Islam The Religion Of Truth...
freq                                                   1938
Name: text, dtype: object

Of the 157,600 original tweets, only about two-thirds of them are unique tweets. And wait, what is that most popular tweet that is repeated 1,938 times?

print df['text'].describe()['top']
RT @502michael502: Islam The Religion Of Truth

http://t.co/BO7Sjw6pSl

#FathersDay #AFLDonsDees  #Bonnaroo #Brasil2014 #WorldCup #Jewish #…

Wow. Sure enough, if you go to the twitter page for @502michael502, you will see that random pro-Islam messages are tweeted and retweeted thousands of times with an assorted collection of hashtags containing trending and religious words. I guess Bonnaroo was popular enough to make it onto @502michael502's trending hashtags! And here I thought he was just a big jam band fan.

Okay, now we can try to remove retweets. We start by grabbing only unique tweets.

# Retain only text-unique tweets
uniques = df.drop_duplicates(inplace=False, cols='text')

# I don't know how to make .startswith() case insensitive,
# so check both cases:
organics = uniques[ uniques['text'].str.startswith('RT')==False ]
organics = organics[ organics['text'].str.startswith('rt')==False ]

# In case RT was placed further into the text than the beginning.
# Include spaces around ' RT ' to prevent grabbing words like start
organics = organics[ organics['text'].str.contains(' RT ', case=False)==False ]

print organics.count()
_id           93311
created_at    93311
geo            8537
source        93311
text          93311
dtype: int64

There we go: we have gone from 157,600 total tweets to 93,311 "organic" tweets. There is still more work that we could do to get more organic tweets. For example, I would argue that news media sources tweeting about artists at Bonnaroo are not a good measure of band popularity. Such tweets are harder to detect, though. One method could be to look at the source of the tweet – maybe tweets from cell phones are more likely to be individuals than media organizations? I'll save this for another post because we still have a lot of work to do!
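Just to sketch the idea before moving on (the client-name substrings here are guesses rather than values I've checked against this dataset), the source column we already have would make such a filter easy to prototype:

# See which clients people tweet from
print organics['source'].value_counts()[:10]

# Hypothetical filter: keep only tweets whose source looks like a phone app
phone_organics = organics[ organics['source'].str.contains('iPhone|Android', case=False) ]
print phone_organics.count()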

Most BeautifulSoup in the Room

Now that I’ve grouped collectively all the tweets that we care to research, we should seek for mentions of every Bonnaroo artist. However I’m lazy. There are 189 totally different artists acting at Bonnaroo, and certainly not do I really feel like typing all of them out.

Enter BeautifulSoup, a Python library for scraping websites. All I have to do is check out the band lineup on the Bonnaroo website, figure out which div elements correspond to the listed bands, and BeautifulSoup will grab the contents.

import urllib2
from BeautifulSoup import BeautifulSoup

url = 'http://lineup.bonnaroo.com/'
ufile = urllib2.urlopen(url)
soup = BeautifulSoup(ufile)
bandList = soup.find('div', {'class':'ds-lineup ds-player'}).findAll('a')
fout = open('bonnarooBandList.txt', 'w')
for row in bandList:
    band = row.renderContents()
    fout.write(band + '\n')

fout.close()

With a Little Help from my (API) Friends

When I wrote the above script, I thought I was done. Later on, I thought about the fact that people don't always call bands by their full name. For example, the Red Hot Chili Peppers are often abbreviated RHCP. I was amazed when I found that MusicBrainz, an online music encyclopedia, not only keeps track of bands' aliases and misspellings, but MusicBrainz actually has an API for accessing this information. Even better, somebody created a Python wrapper for the API.
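To get a feel for what the wrapper returns before diving into the full script, here is a minimal sketch using the RHCP example (the user-agent values are placeholders; search_artists hands back a dictionary whose 'artist-list' entries may contain an 'alias-list'):

import musicbrainzngs as mbrainz

mbrainz.set_useragent('bonnaroo-analysis', '0.1', 'me@example.com')  # placeholder app name, version, contact

result = mbrainz.search_artists(artist='red hot chili peppers', limit=1)
for alias in result['artist-list'][0]['alias-list']:
    print alias['alias']  # hopefully 'RHCP' shows up in here; some artists return no 'alias-list' at all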

I also had to perform some "scrubbing" of the aliases that are retrieved from the MusicBrainz API. I consider a band to be "mentioned" in a tweet if all words in any of the band's aliases are present in the tweet text. For example, a match for both "arctic" and "monkeys" in the text would be a mention of "Arctic Monkeys". However, I don't want to miss a mention of "The Flaming Lips" just because "the" isn't included.

I ameliorated this issue by using nltk, a Natural Language Processing library. The library contains a list of English stopwords (common words like "the") which I used as a filter. Note: this is a problem for bands like "The Head and the Heart" where the filter would leave behind "head" and "heart". Both of these words could easily be in a tweet and not relate to the band.
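To make that concrete, here is a quick sketch of the stopword filter applied to both examples (you may need to run nltk.download('stopwords') once beforehand):

from nltk.corpus import stopwords

english_stops = stopwords.words('english')

print [w for w in 'the flaming lips'.split() if w not in english_stops]
# ['flaming', 'lips']
print [w for w in 'the head and the heart'.split() if w not in english_stops]
# ['head', 'heart']  <- the problem case: both words can easily appear in unrelated tweets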

The code below shows how I used the public APIs and nltk in order to get searchable aliases.

import musicbrainzngs as mbrainz
import json
import string
import re
import nltk
from nltk.corpus import stopwords


mbrainz.auth(username, password) # Use a username and password
mbrainz.set_useragent(program_version, email_address) # Include name
                                # of your program and email address

# To be used for removing punctuation
regex = re.compile('[%s]' % re.escape(string.punctuation))

################################
def clean_aliases(alias, regex):
    """
    This function converts each alias to lowercase, removes
    punctuation, and removes stop words. The output is a
    list containing the remaining words.
    """
    alias = alias.lower().replace(' &', '') # Remove ampersands
    alias = regex.sub('', alias) # Remove punctuation
    alias_words = \
        [w for w in alias.split() if w not in stopwords.words('english')]
    return alias_words
################################

with open('bonnarooBandList.txt', 'r') as fin:
    with open('bonnarooAliasList.json', 'w') as fout:
        aliasDict = {} # Initialize alias dictionary
        for band in fin:
            band = band.rstrip('\n')
            # Remove ampersands
            band_query = band.lower().replace(' &', '')
            # Remove punctuation
            band_query = regex.sub('', band_query)
            # Only grab the first result
            result = \
                mbrainz.search_artists(artist=band_query, limit=1)
            # Remove stopwords
            band_query = clean_aliases(band_query, regex)
            # Initialize with the stripped version of the name
            # listed on the Bonnaroo website
            aliasList = [band_query]
            try:
                for alias in result['artist-list'][0]['alias-list']:
                    alias = clean_aliases(alias['alias'], regex)
                    aliasList.append(alias) # Build alias list
            except: # Some artists do not return aliases
                pass # So don't do anything!
            aliasDict[band] = aliasList
        json.dump(aliasDict, fout)
    fout.close()
fin.close()

The Final Histogram

Okay, so we now have a dictionary of easily searchable aliases for all artists that performed at Bonnaroo. All we have to do now is go through each tweet and see if any of the aliases for any of the artists are mentioned. We can then build a histogram of "mentions" for each artist by adding up all of the mentions in all of the tweets for a given artist.

In the code below, I do just that. By running the function at the bottom, get_bandPop, we get back a pandas Series that contains each artist and the number of times they were mentioned in all of the tweets.

# To be used for removing punctuation
regex = re.compile('[%s]' % re.escape(string.punctuation))

def clean_sentence(sentence):
    """
    Converts each sentence to lowercase and removes
    punctuation.
    """
    sentence = sentence.lower().replace(' &', '') # Remove ampersands
    sentence = regex.sub('', sentence) # Remove punctuation
    # sentence_words = [w for w in sentence.split() if w not in stopwords.words('english')]
    return sentence

def find_mention(sentence, phrase_list):
    """
    Takes a phrase_list, which is a list of phrases where
    each phrase corresponds to a list of the words in the phrase, and
    checks to see whether all of the words of any of the phrases are
    present in "sentence".
    """
    for words in phrase_list:
        words = set(words)
        if words.issubset(sentence):
            return True
    return False # None of the phrase lists were subsets

def check_each_alias(sentence, alias_dict):
    """
    Checks to see whether any of the aliases for
    each band in alias_dict are mentioned
    in "sentence".

    band_bool is a dictionary that contains all band
    names as keys and True or False as values corresponding
    to whether or not the band was mentioned in the sentence.
    """
    band_bool = {}
    sentence = set(clean_sentence(sentence).split())
    for k, v in alias_dict.iteritems():
        band_bool[k] = find_mention(sentence, v)
    return pd.Series(band_bool)

def build_apply_fun(alias_dict):
    """
    Turn check_each_alias into an anonymous function.
    """
    apply_fun = lambda x: check_each_alias(x, alias_dict)
    return apply_fun

def get_bandPop(df, alias_dict):
    """
    For tweet DataFrame input "df", build a histogram of mentions
    for each band in alias_dict.
    """
    bandPop = df['text'].apply(build_apply_fun(alias_dict), alias_dict)
    bandPop = bandPop.sum(axis=0)
    bandPop.sort(ascending=False)
    return bandPop

And now, finally, all we have to do is type bandPop[:10].plot(kind='bar') (and maybe fiddle around in matplotlib for an hour adjusting properties of the figure) and we get a histogram of mentions for the top ten most popular bands at Bonnaroo:
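Spelled out, that last step looks something like this (re-loading the alias file saved earlier and feeding the organic tweets through get_bandPop; all the matplotlib fiddling is omitted):

import json
import matplotlib.pyplot as plt

with open('bonnarooAliasList.json', 'r') as fin:
    alias_dict = json.load(fin)

bandPop = get_bandPop(organics, alias_dict)
bandPop[:10].plot(kind='bar')
plt.show()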

And of course it's Kanye! Is anybody surprised?

Wow, that was a lot of work for one measly histogram! However, we now have a bunch of data analysis machinery that we can use to delve deeper into this dataset. In my next post, I'll do just that!


