Saturday, May 4, 2024

Likes Out! Guerilla Dataset! | Ethan Rosenthal


Zack de la Rocha

tl;dr -> I collected an implicit feedback dataset along with side-information about the items. This dataset contains around 62,000 users and 28,000 items. All the data lives here inside this repo. Enjoy!

In a previous post, I wrote about how to use matrix factorization and explicit feedback data in order to build recommendation systems. This is data where a user has given a clear preference for an item such as a star rating for an Amazon product or a numerical rating for a movie like in the MovieLens data. A natural next step is to discuss recommendation systems for implicit feedback, which is data where a user has shown a preference for an item like “number of minutes listened” for a song on Spotify or “number of times clicked” for a product on a website.

Implicit feedback-based systems likely constitute the majority of modern recommender systems. When I set out to write a post on these systems, I found it difficult to find suitable data. This makes sense – most companies are loath to share users’ click or usage data (and for good reasons). A cursory google search revealed a couple of datasets that people use, but I kept finding issues with these datasets. For example, the million song database was shown to have some issues with data quality, while many other people just repurposed the MovieLens or Netflix data as if it was implicit (which it’s not).

This started to feel like one of those “fuck it, I’ll do it myself” things. And so I did.

All code for collecting this data is located on my github. The actual collected data lives in this repo, as well.

Back when I was a graduate student, I thought for some time that maybe I would work in the hardware space (or at a museum, or the government, or a gazillion other things). I wanted to have public, digital proof of my (shitty) CAD skills, and I stumbled upon Sketchfab, a website which allows you to share 3D renderings that anybody else with a browser can rotate, zoom, or watch animate. It’s kind of like YouTube for 3D (and now VR!).

Users can “like” 3D models, which is an excellent implicit signal. It turns out you can actually see which user liked which model. This presumably allows one to reconstruct the classic recommendation system “ratings matrix” of users as rows and 3D models as columns with likes as the elements in the sparse matrix.
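To make the “users as rows, models as columns” idea concrete, here is a minimal sketch that turns (user, model) like pairs into sparse coordinates. The IDs and the build_coo helper are made up for illustration; nothing here comes from the repo's code.

```python
# Hypothetical (user, model) like pairs; real IDs would come from the likes data.
likes = [
    ("user_a", "model_1"),
    ("user_a", "model_2"),
    ("user_b", "model_2"),
    ("user_c", "model_3"),
]


def build_coo(likes):
    """Map raw IDs to contiguous indices and return COO-style coordinates."""
    user_to_idx, model_to_idx = {}, {}
    rows, cols = [], []
    seen = set()
    for user, model in likes:
        # setdefault assigns the next free index the first time an ID shows up.
        r = user_to_idx.setdefault(user, len(user_to_idx))
        c = model_to_idx.setdefault(model, len(model_to_idx))
        if (r, c) not in seen:  # a like is binary, so drop duplicate pairs
            seen.add((r, c))
            rows.append(r)
            cols.append(c)
    return rows, cols, user_to_idx, model_to_idx


rows, cols, user_to_idx, model_to_idx = build_coo(likes)
n_users, n_models = len(user_to_idx), len(model_to_idx)

# Fraction of the users x models grid that actually holds a like.
density = len(rows) / (n_users * n_models)
```

Each (rows[i], cols[i]) pair marks one nonzero entry. If you have scipy installed, these lists can be handed to scipy.sparse.coo_matrix along with a list of ones as the data.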

Okay, I can see the likes on the website, but how do I actually get the data?

When I was at Insight Data Science, I built an ugly script to scrape a tutoring website. This was relatively easy. The site was largely static, so I used BeautifulSoup to simply parse through the HTML.

Sketchfab is a more modern site with extensive javascript. One must wait for the javascript to render the HTML before parsing through it. One way of automating this is to use Selenium. This software essentially lets you write code to drive an actual web browser.

To get up and running with Selenium, you must first download a driver to run your browser. I went here to get a Chrome driver. The Python Selenium package can then be installed using anaconda on the conda-forge channel:

conda install --channel https://conda.anaconda.org/conda-forge selenium

Opening a browser window with Selenium is quite simple:

from selenium import webdriver

chromedriver = '/path/to/chromedriver'
BROWSER = webdriver.Chrome(chromedriver)

Now we must decide where to point the browser.

Sketchfab has over 1 million 3D models and more than 600,000 users. However, not every user has liked a model, and not every model has been liked by a user. I decided to limit my search to models that had been liked by at least 5 users. To start my crawling, I went to the “all” page for popular models (sorted by number of likes, descending) and started crawling from the top.

BROWSER.get('https://sketchfab.com/models?sort_by=-likeCount&page=1')

Upon opening the main models page, you can open the chrome developer tools (ctrl-shift-i in linux) to reveal the HTML structure of the page.

Looking through the HTML reveals that all of the displayed 3D models are housed in a <div> of class infinite-grid. Each 3D model is inside of a <li> element with class item. One can grab the list of all these list elements as follows:

elem = BROWSER.find_element_by_xpath("//div[@class='infinite-grid']")
item_list = elem.find_elements_by_xpath(".//li[@class='item']")

It turns out that each Sketchfab model has a unique ID associated with it which we shall call its model ID, or mid. This mid can be found in each list element through the data-uid attribute.

item = item_list[0]
mid = item.get_attribute('data-uid')

The url for the model is then simply https://sketchfab.com/models/mid where you replace mid with the actual unique ID.

I’ve written a script which automates this collection of each mid. This script is called crawl.py in the main repo. To log all model urls, one runs

python crawl.py config.yml --type urls

All told, I ended up with 28,825 models (from October 2016). The model name and associated mid are in the file model_urls.psv here.
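If you want to load that file yourself, a pipe-separated file can be read with Python's csv module by setting the delimiter. The sketch below assumes a header row with columns named name and mid, which is a guess at the layout rather than something checked against the real file, and it reads from an in-memory sample instead of the repo.

```python
import csv
import io

# Hypothetical sample; assumes one "name|mid" row per model plus a header line.
sample = io.StringIO(
    "name|mid\n"
    "Example model|522e811044bc4e09bf84431e6c1cc109\n"
)


def read_model_urls(handle):
    """Return {mid: (name, url)} from a pipe-separated file handle."""
    reader = csv.DictReader(handle, delimiter="|")
    models = {}
    for row in reader:
        # The model url is the base models path followed by the mid.
        url = "https://sketchfab.com/models/" + row["mid"]
        models[row["mid"]] = (row["name"], url)
    return models


models = read_model_urls(sample)
```

For the real file you would pass an open file handle in place of sample, and adjust the column names if the header differs.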

In order to log which user liked which model, I originally wrote a Selenium script to go to every model’s url and scroll through the users that had liked the model. This took for-fucking-ever. I realized that maybe Sketchfab serves up this information via an API. I did a quick Google search and stumbled upon Greg Reda’s blog post which described how to use semi-secret APIs for collecting data. Sure enough, this worked perfectly for my task!

With a mid in hand, one can hit the API by passing the following parameters

import requests

mid = '522e811044bc4e09bf84431e6c1cc109'
count = 24
params = {'model': mid, 'count': count, 'offset': 0}

url = 'https://sketchfab.com/i/likes'
response = requests.get(url, params=params).json()

Within response['results'] is a list of information about each user that liked the model. crawl.py has a function to read in the model urls file output by crawl.py and then collect every user that liked each model.

python crawl.py config.yml --type likes
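A single request only returns count users, so collecting every like for a model means stepping the offset forward. Here is a rough sketch of that loop. It assumes offset counts users rather than pages and that the last page comes back shorter than count; both are guesses about the endpoint, and crawl.py may well do it differently. The fetch function is passed in so the sketch runs offline against a fake response, and the uid field in that fake response is made up.

```python
def collect_likes(mid, fetch_page, count=24):
    """Page through the likes endpoint until a page comes back short."""
    users, offset = [], 0
    while True:
        page = fetch_page({'model': mid, 'count': count, 'offset': offset})
        results = page.get('results', [])
        users.extend(results)
        if len(results) < count:  # assumed end-of-data signal
            break
        offset += count
    return users


# Stand-in for the real endpoint so the sketch runs without a network.
# A live version might instead be:
#   lambda params: requests.get('https://sketchfab.com/i/likes', params=params).json()
def fake_fetch(params):
    fake_users = [{'uid': str(i)} for i in range(50)]  # 'uid' is hypothetical
    start = params['offset']
    return {'results': fake_users[start:start + params['count']]}


likes = collect_likes('522e811044bc4e09bf84431e6c1cc109', fake_fetch)
```

If you point this at a real site, sleep between requests so you are not hammering someone else's server.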

After running this script collecting likes on 28,825 models in early October 2016, I ended up with data on 62,583 users and 632,840 model-user-like combinations! This data is thankfully small enough to still fit in a github repo (52 MB) and lives here.

Even though these likes are public, I felt a little bad about making this data so easy to publicly parse. I wrote a small script called anonymize.py which hashes the user IDs for the model likes. Running this script is simple (just make sure to provide your own secret key):

python anonymize.py unanonymized_likes.csv anonymized_likes.csv "SECRET KEY"

The likes data in the main repo has been anonymized.
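For a sense of what hashing IDs with a secret key can look like, here is a minimal sketch using a keyed hash (HMAC-SHA256) from the standard library. This is one reasonable way to do it, not necessarily what anonymize.py does; the function name and the sample IDs are invented.

```python
import hashlib
import hmac


def anonymize_user_id(user_id, secret_key):
    """Return a keyed SHA-256 digest of user_id as a hex string."""
    return hmac.new(
        secret_key.encode('utf-8'),
        user_id.encode('utf-8'),
        hashlib.sha256,
    ).hexdigest()


# Hypothetical IDs; the same ID always maps to the same digest under one key.
hashed_a = anonymize_user_id('user_a', 'SECRET KEY')
hashed_a_again = anonymize_user_id('user_a', 'SECRET KEY')
hashed_b = anonymize_user_id('user_b', 'SECRET KEY')
```

The same user always maps to the same digest, so the user-model structure of the likes survives. Without the key, nobody can hash a known user ID and look it up in the published data.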

An exciting area of recommendation research is the combination of user and item side information with implicit or explicit feedback. In later posts, I will address this space, but, for now, let’s just try to grab some side information. Sketchfab users are able to categorize models that they upload (e.g. “Characters”, “Places & scenes”, etc...) as well as tag their models with relevant labels (e.g. “bird”, “maya”, “blender”, “sculpture”, etc...). Presumably, this extra information about models could be useful in making more accurate recommendations.

crawl.py has another function for grabbing the associated categories and tags of a model. I could not find an API way to do this, and the Selenium crawl is extremely slow. Thankfully, I’ve already got the data for you 🙂 The model “features” file is called model_feats.psv and is in the /data directory of the main repo.

python crawl.py config.yml --type features

With all of our data in hand, subsequent blog posts will dive into the wild west of implicit feedback recommendation systems. I’ll show you how to train these models, use these models, and then build a simple Flask app, called Rec-a-Sketch, for serving 3D Sketchfab recommendations.
