Subscribe to DSC Newsletter

You can search Google for pictures similar to a given image, for plagiarism detection or to find people that look like you, read article recently posted this week. This is more sophisticated than a traditional image search based on metadata / keywords.

Here we ask you to

  1. Reverse-engineer Google's algorithm, or come up with a better algorithm, for bitmap images
  2. Same questions for vector images or graphs (a vector image is represented by a graph). How do you define the concept of similarity between graphs (once represented in their canonical form)? 

The image matching/comparison algorithm (for bitmap images) could work as follows. Let's say you have a database  of 10,000 square images. For each image, you first normalize it (reduction to 64 by 64 pixels, contrast, brightness adjustment, histogram equalization, and binarization). Then you compute a few core metrics such as texture (from a list of 20 pre-selected textures), category (from a list of 50 pre-selected categories), entropy, median color in various parts of the image etc. This multivariate feature attached to each image is called the signature. The signature could include the entire normalized image itself, consisting of only 64 x 64 = 4,096 bits or 4 kilobytes, which is relatively small. 

When you process a new image, you compute its signature and compare it with signatures available in your database. The closest signature represents the best image match. Signatures can be stored in memory as hash tables for fast retrieval, to speed up the image matching algorithm. You need a metric to measure the distance or similarity between two signatures, but that's trivial.

Note: Google's algorithm is still primitive, it fails to recognize gender. But interesting nevertheless. If you use an original image (not a screenshot), it has metadata (probably stored in the Google image index database) and Google will tell you which websites are displaying your image, based on the metadata. The screenshot tool alters the metadata, making image forensics (plagiarism detection) far more difficult, and could be used by pirates to upload copyrighted images and eluding infringement detection. 

Related articles

DSC Resources

Additional Reading

Follow us on Twitter: @DataScienceCtrl | @AnalyticBridge

Views: 2145

Reply to This

Replies to This Discussion

A cropped image would definitely make it more difficult to detect matches even if the image is the same. This is especially true if the subset image is nothing similar to the original. I'll present two solutions to this problem. The first will only consider a non-cropped image of any size. The second will present standard computer vision techniques that can be used to find similar images.

First Approach:

Four quadrant can be created by dividing the image into four parts: upper right, upper left, lower lower, and lower right. I found that quadrant brightness along with overall brightness, down to the decimal place, are relatively unique to each photo. In fact, with 1 decimal place there are 2550 possible values for each quadrant. Naively, there is a 9.274683493E-18 probability that there exist photos which are exact duplicates by chance. An image could be decomposed into identifiers and could be used to query the database for matches and reference an image located on a file system. For example, the actual pictures would be located in a folder corresponding to the scrape number: Image_Dump/Scrape_num/picture.jpg.

Please see my github markdown document for more information.

Second Approach:

Satellite images that have overlapping geographic regions can be stitched together using a feature matching technique. This preprocessing technique can be used for matching a person's unique features across different images. A picture of a person could be compared with a set of new pictures for similarity. If enough features are in common between the images, the picture can be warped to match the new picture where both pictures are similar. A subtracting method could both picture shows whether the images are close enough. Various algorithms and parameters can be used to make this process less prone to false positive or false negatives.

Please see my github markdown document for more information.

My name is Rohan Kotwani. I have a Bachelors in Electrical Engineering and a Masters in Analytics. If you are interested in discuss about potential job/project opportunities, please send me an email at [email protected]


On Data Science Central

© 2021   TechTarget, Inc.   Powered by

Badges  |  Report an Issue  |  Privacy Policy  |  Terms of Service