Utilizing EasyOCR for text extraction, Sage for grammar correction, and RuBERT for understanding and classifying the text.

StrangePineAplle/OCR-Marketplace-Fraud-Image-Classifier

OCR-Marketplace-Fraud-Image-Classifier

Python CLIP CatBoost

OCR-based solution

Overview

The Marketplace Fraud Detection with EasyOCR and RuBERT project detects fraudulent images in online marketplaces by classifying each image as either fraudulent or non-fraudulent. The model is trained to flag images that contain deceptive text, such as "деньги за отзывы телеграм @mony_tg" ("money for reviews, Telegram @mony_tg"), which is commonly associated with scams.

Data

The dataset used for training the Marketplace Fraud Image Classifier consists of two main categories: fraudulent images and non-fraudulent (normal) images.

Fraudulent Images

The fraudulent data examples contain images that feature deceptive text, such as scam messages or suspicious offers. These images are crucial for training the model to accurately identify potential fraud attempts. Some examples of fraudulent images are shown below:

Non-Fraudulent (Normal) Images

The non-fraudulent data examples represent legitimate images that do not contain any suspicious content. These images serve as a contrast to the fraudulent data, helping the model distinguish between genuine and fraudulent listings. Examples of non-fraudulent images are provided below:

By training the model on a balanced dataset consisting of both fraudulent and non-fraudulent images, the Marketplace Fraud Image Classifier can learn to effectively distinguish between legitimate and suspicious content, ultimately enhancing the safety and trust in online marketplaces.

Model Architecture

EasyOCR

EasyOCR (Easy Optical Character Recognition) is a powerful library for extracting text from images. It supports a wide range of languages and is known for its accuracy and ease of use. In this project, EasyOCR is employed to extract text from the input images.
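As a rough illustration of the extraction step, the sketch below reads an image with EasyOCR and joins the confident detections into one string. The helper names `join_ocr_text` and `extract_text`, the 0.3 confidence cut-off, the lowercasing, and the `("ru", "en")` language list are assumptions made for this example; the repository may configure EasyOCR differently.

```python
from typing import Iterable, Sequence, Tuple

# One EasyOCR detection in the default detailed output:
# (bounding box, recognised text, confidence in [0, 1]).
Detection = Tuple[Sequence, str, float]


def join_ocr_text(detections: Iterable[Detection], min_confidence: float = 0.3) -> str:
    """Keep confident, non-empty detections and join them into one lowercase string.

    The 0.3 threshold is an illustrative assumption, not a value from the project.
    """
    kept = [
        text.strip()
        for _bbox, text, confidence in detections
        if confidence >= min_confidence and text.strip()
    ]
    return " ".join(kept).lower()


def extract_text(image_path: str, languages: Sequence[str] = ("ru", "en")) -> str:
    """Run EasyOCR on one image and return the joined text."""
    import easyocr  # imported lazily so the helper above works without the library

    # Building a Reader loads the models; in real use, create it once and reuse it.
    reader = easyocr.Reader(list(languages))
    return join_ocr_text(reader.readtext(image_path))
```

Calling `extract_text("listing.jpg")` on a hypothetical image file would return a single string ready for the correction and classification stages.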

Sage

Sage is a grammar correction library that helps improve the quality of extracted text. It analyzes the text and suggests corrections based on grammatical rules and context. By integrating Sage, the project ensures that the extracted text is grammatically correct and easier to process for further analysis.

RuBERT

RuBERT (Russian BERT) is a pre-trained language model based on BERT (Bidirectional Encoder Representations from Transformers). It is specifically designed for the Russian language and is trained on a large corpus of Russian text. RuBERT is used in this project for understanding the meaning of the extracted text and classifying it as either fraudulent or non-fraudulent.
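A minimal sketch of the classification step with the Hugging Face `transformers` library is shown below. The checkpoint name `DeepPavlov/rubert-base-cased`, the label order, the 0.5 threshold, and the 128-token limit are all assumptions for illustration; the README does not state which RuBERT checkpoint or settings the project uses. Note that a freshly loaded classification head is untrained, so the output is only meaningful after fine-tuning on labelled data.

```python
import math
from typing import List, Sequence, Tuple

# Assumed label order: index 0 = non-fraudulent, index 1 = fraudulent.
LABELS = ("non-fraudulent", "fraudulent")


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a short list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logits_to_label(logits: Sequence[float], threshold: float = 0.5) -> Tuple[str, float]:
    """Map two-class logits to (label, probability of fraud)."""
    p_fraud = softmax(logits)[1]
    label = LABELS[1] if p_fraud >= threshold else LABELS[0]
    return label, p_fraud


def classify_text(text: str, model_name: str = "DeepPavlov/rubert-base-cased") -> Tuple[str, float]:
    """Score one text with a RuBERT sequence classifier (checkpoint name is an assumption)."""
    import torch  # heavy dependencies are imported lazily
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    # num_labels=2 adds a binary head; it is randomly initialised until fine-tuned.
    model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
    model.eval()

    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        logits = model(**inputs).logits[0].tolist()
    return logits_to_label(logits)
```

Keeping `logits_to_label` separate from model loading makes the decision threshold easy to tune without reloading the model.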

Features

  • Text Extraction: EasyOCR is used to extract text from input images.
  • Grammar Correction: Sage is employed to correct grammatical errors in the extracted text.
  • Text Understanding: RuBERT is utilized to understand the meaning of the corrected text and classify it as either fraudulent or non-fraudulent.
  • Fraud Detection: The model identifies fraudulent images based on the extracted and analyzed text.
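The four features above form a simple chain: extract, correct, classify, decide. The sketch below wires them together with pluggable callables so that each stage can be swapped or stubbed. The function name `classify_image`, the parameter names, and the choice to treat images without any text as non-fraudulent are assumptions for this example, not the repository's actual code.

```python
from typing import Callable, Dict


def classify_image(
    image_path: str,
    extract_text: Callable[[str], str],   # e.g. an EasyOCR wrapper
    correct_text: Callable[[str], str],   # e.g. a Sage wrapper
    classify_text: Callable[[str], str],  # e.g. a fine-tuned RuBERT wrapper
) -> Dict[str, str]:
    """Run OCR, text correction, and text classification on one image."""
    raw = extract_text(image_path)
    if not raw.strip():
        # Assumption: an image with no readable text cannot carry a scam message.
        return {"text": "", "label": "non-fraudulent"}
    corrected = correct_text(raw)
    return {"text": corrected, "label": classify_text(corrected)}
```

For a quick dry run, each stage can be replaced by a stub (for example, a lambda returning a fixed string) before the real models are plugged in.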

Results

The performance of the Marketplace Fraud Image Classifier was evaluated using various metrics, with a primary focus on the F1 score, which balances precision and recall.
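For reference, F1 is the harmonic mean of precision and recall. The sketch below computes it from confusion-matrix counts, with fraudulent images as the positive class; the function name and the counts in the usage note are illustrative and are not the project's evaluation data.

```python
def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Return F1 = 2 * P * R / (P + R), or 0.0 when there are no true positives."""
    if true_positives == 0:
        return 0.0
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)
```

With 8 true positives, 2 false positives, and 2 false negatives, precision and recall are both 0.8, so `f1_score(8, 2, 2)` is also 0.8 (up to floating-point rounding).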

After thorough training and validation, the final model achieved an F1 score of 0.96. This high score indicates that the model distinguishes fraudulent from non-fraudulent images effectively, keeping both false positives and false negatives low. OCR-based pipelines are accurate but relatively slow, which may make them unsuitable for some applications. For a faster alternative, consider my CLIP-based solution.

Performance Metrics

  • F1 Score: 0.96
  • Speed: more than 5 minutes per 3,000 images
