How did the machine read nutritional facts?

How it was possible to read nutritional tables with OCR, Tesseract and a lot of computer vision!

Some time ago I worked on a project to build a data lake of food data, covering everything from general product details to nutritional information for a large number of foods. At some point we realized that most of the nutritional information was embedded in images rather than text, which made it hard to collect through web scraping with Python and the Scrapy framework.

This problem was an opportunity to learn about a technology that has been around for a long time and gained prominence through the work of David H. Shepard: Optical Character Recognition (OCR). The following video gives a good sense of the early days of OCR:

So the challenge was to take an image of a nutritional table and extract its data as text.

While researching OCR, I came across Tesseract, one of the most widely used OCR engines today and one known for its efficiency.

This engine and package were essential for the development of the project, to the point that they became a requirement (which is covered in the project repository).

Calories on the New Nutrition Facts Label. Source: FDA.

As mentioned at the beginning, the challenge was to read nutritional tables. Unlike text in a controlled environment, a nutritional table contains many words that are sometimes disconnected from each other, plus horizontal and vertical lines that separate the words from the values. So the question is: how do we make it easier for the machine to read them? The mission was to remove these lines and leave only the words and values, so that we would not "divert" the machine's attention to things it did not need to know.

To solve this problem two things were done.

Image of binarized lines. Source: Author.

But once you have these lines identified, how do you delete them from the image?

The answer was none other than K-means, an unsupervised machine learning algorithm commonly used for clustering. It needs no labeled data; the only thing you have to choose is K, the number of clusters your problem requires.

But what does K-means have to do with the problem of lines?

As described above, K-means is used for grouping. So, instead of deleting the lines, we ran a color clustering on the image and painted over the lines with the image's predominant color. In a nutrition table, the main colors usually belong to the background, the lines, the letters and, sometimes, decorative details.

For the same reason, the well-known elbow method was not used to choose K: in this situation the number of clusters needed is known in advance.

As I said, EAST alone only locates where there is text; it does not read it. We can compare it to a person who cannot read: they know there is text there but have no idea what is written. Its result can be seen in the image below:

EAST: An Efficient and Accurate Scene Text Detector. Source: YouTube.

There are excellent tutorials and publications on the Internet that teach how to use this kind of technology, of which I can mention:

After applying EAST together with a series of morphological filters, we read the words in Western reading order, that is, from top to bottom and from left to right, which is close to how a human reads, as the following image shows.
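Putting detected text boxes into reading order can be sketched in a few lines. The example below assumes each detection is an `(x, y, width, height)` tuple and groups boxes into rows when their vertical centers are close; the helper `reading_order` and its `row_tolerance` parameter are hypothetical and only illustrate the idea, since the project's own ordering logic is not shown here.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height), top-left origin


def center_y(box: Box) -> float:
    """Vertical center of a box."""
    return box[1] + box[3] / 2


def reading_order(boxes: List[Box], row_tolerance: float = 0.5) -> List[Box]:
    """Sort boxes top to bottom, then left to right within each row.

    Two boxes share a row when their vertical centers differ by at most
    `row_tolerance` times the height of the row's first box.
    """
    rows: List[List[Box]] = []
    for box in sorted(boxes, key=center_y):
        if rows:
            first = rows[-1][0]
            if abs(center_y(box) - center_y(first)) <= row_tolerance * first[3]:
                rows[-1].append(box)
                continue
        rows.append([box])

    # Within a row, the leftmost box is read first.
    return [box for row in rows for box in sorted(row, key=lambda b: b[0])]


if __name__ == "__main__":
    detections = [(120, 12, 40, 20), (10, 10, 80, 20), (10, 50, 60, 20), (120, 48, 30, 20)]
    for box in reading_order(detections):
        print(box)
```

Grouping by row first matters because a plain sort on `y` alone would let a value that sits one pixel higher than its label jump ahead of it.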

At this stage, the comparison is no longer with a person who cannot read, but with a child who is learning to read and understands a few things. But how can "this child's" reading be corrected or improved?

SymSpell is a spelling correction library based on the Symmetric Delete algorithm, which its author reports to be about 1000x faster than conventional approaches to this task. It works with a dictionary that is loaded into memory and supports a number of languages. Its repository is available on GitHub.

Continuing the analogy, SymSpell is like a teacher for our reader: it applies corrections based on a similarity distance between words.
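To make the idea concrete, here is a toy, pure-Python illustration of the symmetric delete principle: index every dictionary word under all the strings obtained by deleting a few characters, then do the same to the misread word and see where the two sets meet. This is not the SymSpell library or its API; the class `TinySymmetricDelete`, the sample vocabulary, and the use of plain Levenshtein distance for ranking are all assumptions made for the sketch.

```python
from collections import defaultdict
from typing import Dict, Set


def deletes(word: str, max_distance: int) -> Set[str]:
    """All strings obtained from `word` by removing up to `max_distance` characters."""
    results = {word}
    frontier = {word}
    for _ in range(max_distance):
        frontier = {w[:i] + w[i + 1:] for w in frontier for i in range(len(w))}
        results |= frontier
    return results


def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + (char_a != char_b)))  # substitution
        previous = current
    return previous[-1]


class TinySymmetricDelete:
    """Toy corrector: delete-variants of dictionary words are indexed up front."""

    def __init__(self, frequencies: Dict[str, int], max_distance: int = 2):
        self.frequencies = frequencies
        self.max_distance = max_distance
        self.index: Dict[str, Set[str]] = defaultdict(set)
        for word in frequencies:
            for variant in deletes(word, max_distance):
                self.index[variant].add(word)

    def correct(self, term: str) -> str:
        """Return the closest dictionary word, or `term` itself if none is close."""
        candidates: Set[str] = set()
        for variant in deletes(term, self.max_distance):
            candidates |= self.index.get(variant, set())
        # Rank by distance first, then prefer the more frequent word.
        scored = [(edit_distance(term, c), -self.frequencies[c], c) for c in candidates]
        scored = [s for s in scored if s[0] <= self.max_distance]
        return min(scored)[2] if scored else term


if __name__ == "__main__":
    vocabulary = {"protein": 50, "sodium": 40, "calories": 60, "carbohydrate": 30}
    corrector = TinySymmetricDelete(vocabulary)
    for misread in ["protien", "sodiun", "calor1es"]:
        print(misread, "->", corrector.correct(misread))
```

Because only deletions are generated, both at indexing time and at lookup time, the number of candidates stays small compared with generating every possible insertion and substitution for the misread word.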

The whole process, from reading the image of the nutrition table to having corrected, ready-to-use text, is summarized in the following image.

It is worth saying that this project is open source and welcomes new contributors, including you! Feel free to open issues and submit pull requests.

Logo of the project. Source: Author.

Thank you very much for reading. I hope it has added something to your life. Feel free to contact me for more information!
