Thursday, November 24, 2022
An AI-powered Personal Identifiable Information (PII) Scanner

Octopii is an open-source, AI-powered Personal Identifiable Information (PII) scanner that can look for image assets such as government IDs, passports, photos and signatures in a directory.


Octopii uses Tesseract’s Optical Character Recognition (OCR) and Keras’ Convolutional Neural Network (CNN) models to detect various forms of personal identifiable information that may be leaked on a publicly facing location. This is done in the following steps:

1. Importing and cleaning image(s)

The image is imported via OpenCV and the Python Imaging Library (PIL) and is cleaned, deskewed and rotated for scanning.

2. Performing image classification and Optical Character Recognition (OCR)

A directory is looped over and searched for images. These images are scanned for unique features via the image classifier (by comparing them to a trained model), along with OCR for finding substrings within the image. This may have one of the following outcomes:

  • Best case (score >=90): The image is sent into the image classifier algorithm to be scanned for features such as an ISO/IEC 7810 card specification, colors, location of text, photos, holograms etc. If it is successfully classified as a type of PII, OCR is performed on it, looking for particular words and strings as a final check. When both of these are confirmed, the result from Octopii is extremely reliable.

  • Average case (score >=50): The image is partially/incorrectly identified by the image classifier algorithm, but an OCR check finds contradicting substrings and reclassifies it.

  • Worst case (score >=0): The image is only identified by the image classifier algorithm, but an OCR scan returns no results.

  • Incorrect classification: False positives due to a very small model or OCR list may incorrectly classify PIIs, giving inaccurate results.

As a final verification method, images are scanned for certain strings to verify the accuracy of the model.

The accuracy of the scan can be determined via the confidence scores in the output. If all the mentioned conditions are met, a score of 100.0 is returned.
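The score bands listed above can be read as a simple lookup. The sketch below is illustrative only: the function name `describe_score` and the returned labels are assumptions and are not taken from Octopii's code; only the thresholds (90, 50, 0) come from the description above.

```python
def describe_score(score):
    """Map a confidence score to the outcome bands described in the text."""
    if not 0.0 <= score <= 100.0:
        raise ValueError(f"score must be between 0 and 100, got {score}")
    if score >= 90:
        # Classifier result confirmed by OCR keywords
        return "best case: classifier and OCR agree"
    if score >= 50:
        # OCR found contradicting substrings and reclassified the image
        return "average case: OCR reclassified the image"
    # Classifier result only, OCR returned nothing
    return "worst case: classifier only, no OCR matches"


for s in (100.0, 72.5, 10.0):
    print(s, "->", describe_score(s))
```

A helper like this could be used to triage scan output, for example by reviewing only results that fall below the best-case band.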

To train the model, data can also be fed into the script, and the newly improved h5 file can then be used.


  1. Install all dependencies via pip install -r requirements.txt.
  2. Install the Tesseract helper locally via sudo apt install tesseract-ocr -y (for Ubuntu/Debian).
  3. To run Octopii, type python3 octopii.py <location name>, for example python3 octopii.py pii_list/
python3 octopii.py <location to scan> <additional flags>

Octopii currently supports local scanning, as well as scanning S3 directories and open directory listings via their URLs.
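For the local case, looping over a directory and picking out images can be sketched with the standard library alone. This is not Octopii's implementation: the function name `find_images` and the extension set are assumptions, and the formats Octopii actually accepts may differ.

```python
import os

# Assumed set of image extensions, for illustration only
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg"}


def find_images(root):
    """Return sorted paths of files under root that look like images."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # Compare extensions case-insensitively (.PNG and .png both match)
            if os.path.splitext(name)[1].lower() in IMAGE_EXTENSIONS:
                found.append(os.path.join(dirpath, name))
    return sorted(found)


for path in find_images("pii_list/"):
    print(path)
```

If the directory does not exist, `os.walk` yields nothing, so the loop simply prints no paths.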



Open-source projects like these thrive on community support. Since Octopii relies heavily on machine learning and optical character recognition, contributions are much appreciated. Here’s how to contribute:

1. Fork

Fork the official Octopii repository.

2. Understand

There are 3 files in the models/ directory.
– The keras_model.h5 file is the Keras h5 model that can be obtained from Google’s Teachable Machine or via Keras in Python.
– The labels.txt file contains the list of labels corresponding to the index that the model returns.
– The ocr_list.json file consists of keywords to search for during an OCR scan, as well as other miscellaneous information such as country of origin, regular expressions etc.
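To make the role of labels.txt concrete, here is a minimal sketch of turning it into an index-to-label lookup. It assumes each line has the form `<index> <label>` (the layout Teachable Machine exports typically use); the function name `parse_labels` and the sample content are illustrative, not taken from the repository.

```python
def parse_labels(text):
    """Parse lines of the assumed form '<index> <label>' into a dict."""
    labels = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split on the first space only, since labels may contain spaces
        index, _, name = line.partition(" ")
        labels[int(index)] = name.strip()
    return labels


# Sample content for illustration; real files hold your own class names
sample = "0 German Passport\n1 California Driver License\n"
labels = parse_labels(sample)
print(labels)
```

The index the model returns for an image can then be looked up in this dictionary to get the class name.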

Generating models via Teachable Machine

Since our current dataset is quite small, we could benefit from a large Keras model of international PII for this project. If you do not have expertise in Keras, Google provides an extremely easy-to-use model generator called the Teachable Machine. To use it:

  • Go to Teachable Machine and select ‘Image Project’ → ‘Standard Image Model’.
  • A few classes are visible. Rename a class to an asset type you’d like to add, such as “German Passport” or “California Driver License”.
  • Add images by clicking the ‘Upload’ button and uploading some image assets. Note: images have to be square.

Tip: segregate your image assets into folders with the folder name being the same as the class name. You can then drag and drop a folder into the upload dialog.

  • Click ‘+ Add a class’ at the bottom of the page to add more classes with data, and repeat. You can make the classes more specific, such as “Goa Driver License Old Format”.

Note: Only upload images that match the class name; for example, the German Passport class must contain only German Passport images. Uploading the wrong data to the wrong class will confuse the machine learning algorithms.

  • Verify the classes and images one last time. Once you’re ready, click the ‘Train Model’ button. You can increase the number of epochs (for example, to 5000) to improve model accuracy.
  • To test the model, click the Input dropdown, select ‘File’, then upload a sample image.
  • Once you’re ready, click the ‘Export Model’ button. In the dialog that pops up, select the ‘Tensorflow’ tab (not Tensorflow.js) and select the ‘Keras’ radio button, then click ‘Download my model’ to export the newly generated model. Extract the downloaded zip file and paste the keras_model.h5 file and labels.txt file into the models/ directory in Octopii.

The images used for the model above are not visible to us since they’re in a proprietary format. You can use both dummy and actual PII. Make sure the images are square-ish.

Updating OCR list

Once you generate models using Teachable Machine, you can improve Octopii’s accuracy via OCR. To do this:

  • Open the existing ocr_list.json file. Create a JSONObject whose key has the same name as the asset class. NOTE: The key name must be exactly the same as the asset class name from Teachable Machine.
  • For the keywords, use as many unique terms from your asset as possible, such as “Income Tax Department”. Store them in a JSONArray.
  • (Advanced) You can also add regexes for things like ID numbers and the MRZ on passports if they are unique enough. Test your regexes before adding them.
  • Save/overwrite the existing ocr_list.json file.
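The steps above can be sketched as follows. The exact schema of ocr_list.json is not shown here, so the field names `keywords` and `regex`, the class name “Example Tax Card”, and the ID pattern are all hypothetical; check the existing file for the real structure before editing it.

```python
import json
import re

# Hypothetical entry: the key must equal the Teachable Machine class name.
# Field names and the regex are assumptions for illustration.
entry_json = """
{
  "Example Tax Card": {
    "keywords": ["Income Tax Department"],
    "regex": "[A-Z]{2}[0-9]{6}"
  }
}
"""


def match_ocr_text(ocr_text, entry):
    """Return (matched keywords, whether the regex matched) for OCR output."""
    lowered = ocr_text.lower()
    # Keyword check is case-insensitive, since OCR casing is unreliable
    hits = [kw for kw in entry["keywords"] if kw.lower() in lowered]
    pattern = entry.get("regex")
    regex_hit = bool(pattern and re.search(pattern, ocr_text))
    return hits, regex_hit


ocr_list = json.loads(entry_json)
hits, regex_hit = match_ocr_text(
    "INCOME TAX DEPARTMENT\nNo. AB123456", ocr_list["Example Tax Card"]
)
print(hits, regex_hit)
```

Running a candidate regex against a few real and a few unrelated OCR strings in this way is a quick check that it is specific enough to be worth adding.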

3. Edit

You can replace each file you modify in the models/ directory after you create or edit it via the above methods.

4. Pull request

Submit a pull request from your forked repo and we’ll pick it up and replace our current model with it if the changes are large enough.

Note: Please take the following steps to ensure quality

  • Make sure the model returns extremely accurate results by testing it locally first.
  • Use proper text casing for label names in both the Keras model and ocr_list.json.
  • Make sure all JSON is valid, with appropriate character escapes and no duplicate keys, regexes or keywords.
  • For country names, please use the ISO 3166-1 alpha-2 code of the country.
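Some of these checks can be automated before opening a pull request. The following is a rough sketch using only the standard library; the helper names are made up for illustration, and the country-code check only verifies the two-uppercase-letter shape, not that the code is actually assigned in ISO 3166-1.

```python
import json
import re


def reject_duplicate_keys(pairs):
    """object_pairs_hook that raises if a JSON object repeats a key."""
    seen = {}
    for key, value in pairs:
        if key in seen:
            raise ValueError(f"duplicate key: {key!r}")
        seen[key] = value
    return seen


def load_strict(text):
    """Parse JSON, failing on invalid syntax or duplicate keys."""
    return json.loads(text, object_pairs_hook=reject_duplicate_keys)


def has_duplicates(items):
    """True if a list (for example, keywords) contains repeated entries."""
    return len(set(items)) != len(items)


def looks_like_alpha2(code):
    """Shape check only: exactly two uppercase ASCII letters."""
    return re.fullmatch(r"[A-Z]{2}", code) is not None


print(load_strict('{"German Passport": {"keywords": ["Reisepass"]}}'))
print(looks_like_alpha2("DE"), looks_like_alpha2("DEU"))
```

By default `json.loads` silently keeps the last value for a repeated key, which is why the duplicate check needs the `object_pairs_hook`.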



MIT License

(c) Copyright 2022 RedHunt Labs Private Limited

Author: Owais Shaikh


