我想从这个image中提取文本。我试着删除矩形轮廓,所以我开始检测形成盒子的水平线和垂直线。但我发现了一个问题,一些字符像素被错误地识别为垂直线。以获得一个没有矩形框的干净图像,只包含行文本,所以我可以应用pytesseract进行文本提取。
你能提供任何建议来移除矩形框吗?
谢谢你!
import cv2
from PIL import Image
import matplotlib.pylab as plt
image = io.imread("sample.png")
result = image.copy()
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]
#Remove horizontal lines
horizontal_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (40,1))
remove_horizontal = cv2.morphologyEx(thresh, cv2.MORPH_OPEN, horizontal_kernel, iterations=2)
cnts = cv2.findContours(remove_horizontal, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
cnts = cnts[0] if len(cnts) == 2 else cnts[1]
for c in cnts:
cv2.drawContours(result, [c], -1, (255,255,255), 5)
plt.imshow(result)
# Remove vertical lines
vertical_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1,40))
remove_vertical = cv2.morphologyEx(thresh, cv2.MORPH_OPEN, vertical_kernel, iterations=2)
cnts = cv2.findContours(remove_vertical, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
cnts = cnts[0] if len(cnts) == 2 else cnts[1]
for c in cnts:
cv2.drawContours(result, [c], -1, (255,255,255), 5)
plt.imshow(result)
1条答案
按热度按时间5hcedyr01#
您可以尝试在图像中查找连接的组件,并过滤掉那些太宽或太高的组件。例如: