
HashingVectorizer vs. CountVectorizer – Kavita Ganesan, PhD


Previously, we learned how to use CountVectorizer for text processing. Instead of CountVectorizer, you also have the option of using HashingVectorizer.

In this tutorial, we will learn how HashingVectorizer differs from CountVectorizer and when to use which.

CountVectorizer vs. HashingVectorizer

HashingVectorizer and CountVectorizer are meant to do the same thing: convert a collection of text documents to a matrix of token occurrences. The difference is that HashingVectorizer does not store the resulting vocabulary (i.e. the unique tokens).

With HashingVectorizer, each token directly maps to a column position in a matrix whose size is pre-defined. For example, if you have 10,000 columns in your matrix, each token maps to one of those 10,000 columns. This mapping happens through hashing; the hash function used is MurmurHash3.

See Figure 1 for a visual example of how HashingVectorizer works.

Figure 1: How HashingVectorizer Works
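To make the mapping concrete, here is a minimal sketch of the idea using scikit-learn's murmurhash3_32 utility. It is a simplified illustration only; the real HashingVectorizer handles seeds and signs a bit differently under the hood.

from sklearn.utils import murmurhash3_32

n_features = 10000

def token_to_column(token, n_features=n_features):
    # hash the token with MurmurHash3, then fold the hash into the fixed number of columns
    return murmurhash3_32(token, positive=True) % n_features

for token in ["cent", "cat", "hat"]:
    print(token, "->", token_to_column(token))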

The benefit of not storing the vocabulary (dictionary of tokens) is twofold. First, this is very efficient for a large dataset.

Holding a 300M-token vocabulary in memory can be a challenge in certain computing environments, as the tokens are essentially strings, which demand more memory than their integer counterparts.

Second, by not having to store the vocabulary, the resulting HashingVectorizer object, when saved, will be much smaller and thus faster to load back into memory when needed.

The downside is that it will not be possible to retrieve the actual token given the column position. This matters especially in tasks like keyword extraction, where you want to retrieve and use the actual tokens.
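For example, here is a quick check (an illustrative sketch, not from the original tutorial): after fitting, CountVectorizer keeps a vocabulary_ mapping from tokens to column indices, while HashingVectorizer has no such attribute to look tokens up in.

from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer

docs = ["the cat in the hat"]

cv = CountVectorizer().fit(docs)
print(cv.vocabulary_)               # e.g. {'the': 3, 'cat': 0, 'in': 2, 'hat': 1}

hv = HashingVectorizer(n_features=16).fit(docs)
print(hasattr(hv, "vocabulary_"))   # False: nothing to map columns back to tokens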

Now let's look at how to use HashingVectorizer to generate raw term counts without any type of normalization.

HashingVectorizer Usage

Toy Dataset and Imports

Here we are using the same five Cat in the Hat book titles as in the CountVectorizer tutorial.

from sklearn.feature_extraction.text import HashingVectorizer
# dataset
cat_in_the_hat_docs=[
      "One Cent, Two Cents, Old Cent, New Cent: All About Money (Cat in the Hat's Learning Library",
      "Inside Your Outside: All About the Human Body (Cat in the Hat's Learning Library)",
      "Oh, The Things You Can Do That Are Good for You: All About Staying Healthy (Cat in the Hat's Learning Library)",
      "On Beyond Bugs: All About Insects (Cat in the Hat's Learning Library)",
      "There's No Place Like Space: All About Our Solar System (Cat in the Hat's Learning Library)" 
]

Generate Raw Term Counts

Now we will use HashingVectorizer to compute token counts on our toy dataset.

# Compute raw counts using hashing vectorizer
# A small n_features can cause hash collisions
hvectorizer = HashingVectorizer(n_features=10000, norm=None, alternate_sign=False)
# compute counts without any term frequency normalization
X = hvectorizer.fit_transform(cat_in_the_hat_docs)

Notice that the matrix size has to be pre-specified. Also, if your vocabulary is huge and your matrix size is small, you will end up with hash collisions: two different tokens (e.g. `coffee` and `caffe`) may map to the same column position, distorting your counts. So you want to be careful during initialization. We are also turning off normalization with norm=None.
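To see what a collision looks like, here is a small sketch (illustrative only, not part of the original tutorial) that forces collisions by shrinking the matrix to just 4 columns. With more distinct tokens than columns, some tokens must share a column and their counts get merged.

from sklearn.feature_extraction.text import HashingVectorizer

# 7 distinct tokens but only 4 columns, so some tokens must share a column
tiny = HashingVectorizer(n_features=4, norm=None, alternate_sign=False)
M = tiny.fit_transform(["one cent two cents old cent new cent money"])

print(M.toarray())        # fewer populated columns than distinct tokens
print(M.toarray().sum())  # 9.0 -- the total word count is preserved, just merged per column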

Now if you check the shape, you should see:

(5, 10000)

Five documents, and a 10,000-column matrix. In this example, most of the columns will be empty since the toy dataset is really small.
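If you want to confirm how sparse the result is, the returned SciPy sparse matrix keeps track of its non-zero entries (a small illustrative check):

# the result is a SciPy sparse matrix; only the non-zero cells are stored
print(X.shape)   # (5, 10000)
print(X.nnz)     # number of populated cells -- tiny compared to 5 x 10,000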

Let's print the columns that are populated for the first document.

# print populated columns of first doc
# format: (doc id, pos_in_matrix)  raw_count
print(X[0])
  (0, 93)	3.0
  (0, 689)	1.0
  (0, 717)	1.0
  (0, 1664)	1.0
  (0, 2759)	1.0
  (0, 3124)	1.0
  (0, 4212)	1.0
  (0, 4380)	1.0
  (0, 5044)	1.0
  (0, 7353)	1.0
  (0, 8903)	1.0
  (0, 8958)	1.0
  (0, 9376)	1.0
  (0, 9402)	1.0
  (0, 9851)	1.0

Fifteen unique tokens, one with a count of 3 and the rest all 1. Notice that the position ranges from 0 to 9999. All other arguments that you use with CountVectorizer, such as stop words, n-gram range, and so on, also apply here.
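For example, here is a hedged sketch (parameter values chosen just for illustration, not from the original tutorial) applying stop word removal and bigrams with HashingVectorizer:

from sklearn.feature_extraction.text import HashingVectorizer

# the same preprocessing options CountVectorizer accepts: stop words, n-gram range, etc.
hv_ngrams = HashingVectorizer(
    n_features=10000,
    norm=None,
    alternate_sign=False,
    stop_words="english",   # drop common English stop words
    ngram_range=(1, 2),     # unigrams and bigrams
)
X_ngrams = hv_ngrams.fit_transform(cat_in_the_hat_docs)
print(X_ngrams.shape)       # still (5, 10000): the width is fixed by n_features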

Same Results With CountVectorizer

Now we can achieve the same results with CountVectorizer.

Generate Raw Term Counts

from sklearn.feature_extraction.text import CountVectorizer
cvectorizer = CountVectorizer()
# compute counts without any term frequency normalization
X = cvectorizer.fit_transform(cat_in_the_hat_docs)

If you print the shape, you will see:

(5, 43)

Notice that instead of (5, 10000) as in the HashingVectorizer example, you see (5, 43). That is because we did not force a matrix size with CountVectorizer; the matrix size is based on how many unique tokens were found in your vocabulary, which in this case is 43.
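As a quick sanity check (assuming the cvectorizer fitted above is still in scope), the learned vocabulary has exactly one entry per column:

# CountVectorizer stores the learned vocabulary: token -> column index
print(len(cvectorizer.vocabulary_))   # 43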

Now if we print the counts for the first document, this is what we would see:

# print populated columns of first doc
# format: (doc id, pos_in_matrix)  raw_count
print(X[0])
  (0, 28)	1
  (0, 8)	3
  (0, 40)	1
  (0, 9)	1
  (0, 26)	1
  (0, 23)	1
  (0, 1)	1
  (0, 0)	1
  (0, 22)	1
  (0, 7)	1
  (0, 16)	1
  (0, 37)	1
  (0, 13)	1
  (0, 19)	1
  (0, 20)	1

Just like the HashingVectorizer example, we have 15 populated columns. However, the column position ranges from 0 to 42.
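Because CountVectorizer stores the vocabulary, you can also map those column positions back to the actual tokens, which is exactly what HashingVectorizer cannot do. A minimal sketch, assuming scikit-learn 1.0+ where get_feature_names_out() is available (older versions use get_feature_names()):

# map each populated column of the first document back to its token
feature_names = cvectorizer.get_feature_names_out()
for col in X[0].nonzero()[1]:
    print(col, "->", feature_names[col])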

When to Use HashingVectorizer?

If you are working with a large dataset in your machine learning tasks and have no use for the resulting dictionary of tokens, then HashingVectorizer is a good candidate.

However, if you worry about hash collisions (which are bound to happen if the size of your matrix is too small), then you may want to stick with CountVectorizer until you feel you have maxed out your computing resources and it is time to optimize. Also, if you need access to the actual tokens, then again CountVectorizer is the more appropriate choice.

