pdf에서 txt로 변환 2[pyocr]

11977 단어 파이썬 pyocr PDF

소개

마지막으로 pdfminer를 사용하여 pdf에서 txt로 변환했습니다.
그러나, 대상 pdf의 문제인가, 작동하지 않았다.
이번에는 pyocr를 통해 문제 해결에 종사한다.

마지막 기사 htps : // 코 m / pt / ms / 4180035bd0cd789c858

목적

PDF에서 텍스트를 추출합니다.

사용한 것

이번에는 pyocr을 사용하여 텍스트를 추출하는 것을 생각했습니다.
그러나 pyocr은 이미지에서 텍스트 추출을 위해 pdf2image를 사용하여 pdf에서 image로 변환도 수행했습니다.

[pyocr]htps : //기 tぁb. g의 음. rg / rld / openpape r rk / ppcr
[pdf2image]htps : // 기주 b. 코 m / 벨 ゔぁ l / pdf 2 속눈썹

tesseract와 pyocr의 도입은 아래의 기사를 참고로 했다.
htps : // 이 m / 나베치 6011 / ms / 3, 367 또는 94dbd208에 fc7
htps : // 기주 b. 코 m / 테세라 ct-cr / 테세라 ct / 우키

처리 흐름

Please input pdf name 다음에 대상 pdf 파일 이름 입력

pdf 파일을 이미지로 변환

결과를 출력하기위한 result 디렉토리 작성

pyocr에서 이미지에서 텍스트 추출

결과의 txt 파일에 결과 출력

이다.
프로그램은 기사의 끝에 보여준다.

결과

입력: object1.pdf

출력

object1.txt

8 CLASSES AND OBJECT-ORIENTED PROGRAMMING

We now turn our attention to our last major topic related to writing programs in
Python: using classes to organize programs around modules and data
abstractions.

Classes can be used in many different ways. In this book we emphasize using
them in the context of object-oriented programming. The key to object-
oriented programming is thinking about objects as collections of both data and
the methods that operate on that data.

The ideas underlying object-oriented programming are about forty years old, and
have been widely accepted and practiced over the last twenty years or so, In the
mid-1970s people began to write articles explaining the benefits of this approach
to programming. About the same time, the programming languages SmallTalk
(at Xerox PARC) and CLU (at MIT) provided linguistic support for the ideas. But
it wasn’t until the arrival of C++ and Java that it really took off in practice.

We have been implicitly relying on object-oriented programming throughout
most of this book. Back in Section 2.1.1 we said “Objects are the core things
that Python programs manipulate. Every object has a type that defines the
kinds of things that programs can do with objects of that type.” Since Chapter
5, we have relied heavily upon built-in types such as list and dict and the
methods associated with those types. But just as the designers of a
programming language can build in only a small fraction of the useful functions,
they can only build in only a small fraction of the useful types. We have already
looked at a mechanism that allows programmers to define new functions; we
now look at a mechanism that allows programmers to define new types.



8.1

Abstract Data Types and Classes

The notion of an abstract data type is quite simple. An abstract data type is a
set of objects and the operations on those objects. These are bound together so
that one can pass an object from one part of a program to another, and in doing
so provide access not only to the data attributes of the object but also to
operations that make it easy to manipulate that data.

The specifications of those operations define an interface between the abstract
data type and the rest of the program. The interface defines the behavior of the
operations—what they do, but not how they do it. The interface thus provides
an abstraction barrier that isolates the rest of the program from the data
structures, algorithms, and code involved in providing a realization of the type
abstraction.

Programming is about managing complexity in a way that facilitates change.
There are two powerful mechanisms available for accomplishing this:
decomposition and abstraction. Decomposition creates structure in a program,
and abstraction suppresses detail. The key is to suppress the appropriate

10행째의 years or so, In the　의 , 가 . 가 되었지만, 그 이외는 맞고 있었다.
뭐, 완벽하게 할 수 있었다고 할 수 있을 것이다.

전회의 pdfminer로는 할 수 없었던 것도, pyocr(tesseract)를 이용하는 것으로 할 수 있었다!

이번 입력 이미지로 되었기 때문에, 인쇄물을 스캔하여 만든 pdf에서도 텍스트 추출할 수 있다고 생각한다.
또한 이번에는이 추출한 텍스트를 google 번역에 붙일 예정이지만,
다음 번에는 googletrans을 사용하여 프로그램에서 일본어로 바꾸고 싶습니다.

프로그램

이 프로그램은 github에 게시되었습니다 htps : // 기주 b. 코 m / pt 쓰 / pdf2 xt

실행하는 것은 다음의 pdf2text_pyocr.py이다.
입력된 PDF 파일을 convert_from_path로 image로 변경.
그리고 그 image를 1장씩 pyocr_read에 건네준다.

pdf2text_pyocr.py


from pdf2image import convert_from_path
from pyocr_read import pyocr_read

path = input("Please input pdf name\n")
images = convert_from_path(path)

i = 0
path,e = path.split(".")
pdf2read = pyocr_read(path)

for image in images:
    pdf2read.oneshot_read(image)
    i += 1

다음 pyocr_read.py는 위의 pdf2text_pyocr에서 호출됩니다.
init()에서는 pyocr 도구가 결정되고 결과를 저장하는 디렉토리가 생성됩니다.
또한 인식 할 언어를 결정합니다. "Available languages"에 표시된 언어를 선택할 수 있습니다.
예를 들어 영어는 eng, 일본어는 jpn
그리고 pdf2text_pyocr에서받은 이미지에서 pyocr로 텍스트를 추출로 출력을 파일에 씁니다.

pyocr_read.py

import pyocr
import pyocr.builders
import os

class pyocr_read(object):
    def __init__(self,path):
        self.path = path
        tools = pyocr.get_available_tools()
        if len(tools) == 0:
            print("No OCR tool found")
            sys.exit(1)
        self.tool = tools[0]

        langs = self.tool.get_available_languages()
        print("Available languages: %s" % ", ".join(langs))
        self.lang = input("Please input language you want to recognize : ")

        if os.path.exists("./result") != True:
            os.mkdir("./result")
        return

    def oneshot_read(self,img):
        txt = self.tool.image_to_string(img, lang=self.lang, builder=pyocr.builders.TextBuilder())
        print(txt)
        file = open("./result/"+ self.path + ".txt",mode = "a",encoding = "utf-8")
        file.write(txt+"\n")

Reference

이 문제에 관하여(pdf에서 txt로 변환 2[pyocr]), 우리는 이곳에서 더 많은 자료를 발견하고 링크를 클릭하여 보았다 https://qiita.com/ptxyasu/items/858277074d9aa656f226

텍스트를 자유롭게 공유하거나 복사할 수 있습니다.하지만 이 문서의 URL은 참조 URL로 남겨 두십시오.

우수한 개발자 콘텐츠 발견에 전념 (Collection and Share based on the CC Protocol.)

PHP로 출력되는 웹 페이지를 서버 측에서 PDF로 다운로드

AsciiDoc으로 작성된 문서를 PDF로 변환

좋은 웹페이지 즐겨찾기

개발자 우수 사이트 수집

개발자가 알아야 할 필수 사이트 100선 추천 우리는 당신을 위해 100개의 자주 사용하는 개발자 학습 사이트를 정리했습니다