[PDFMiner] Extracting text from a PDF


This article uses PDFMiner.six, the fork of PDFMiner that supports Python 3.

Installation

$ pip install pdfminer.six
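
One way to confirm that the install worked is to check whether the package can be found by the interpreter. This is a minimal sketch; the helper name `pdfminer_installed` is made up for illustration and is not part of PDFMiner.six.

```python
import importlib.util


def pdfminer_installed():
    """Return True if the `pdfminer` package can be located in this environment."""
    # find_spec() returns None for a top-level package that is not installed.
    return importlib.util.find_spec("pdfminer") is not None


if __name__ == "__main__":
    print("pdfminer importable:", pdfminer_installed())
```

If this prints `False`, the package was probably installed into a different Python environment than the one running the script.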

If the command above does not work, install from the source archive:

wget https://pypi.python.org/packages/source/p/pdfminer.six/pdfminer.six-20160202.zip
unzip pdfminer.six-20160202.zip
cd pdfminer.six-20160202
python setup.py install

With Anaconda

Diagram

Reference: Programming with PDFMiner
Class               Role
PDFParser           Fetches data from the PDF file
PDFDocument         Stores the fetched data
PDFPageInterpreter  Processes the page contents
PDFDevice           Converts the result into the format you need
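
The four roles above can be wired together roughly as follows. This is a sketch under assumptions, not the article's sample: the function name `iter_page_layouts` is hypothetical, `PDFPageAggregator` is used as the concrete `PDFDevice`, and the imports are deferred into the function only so that the file can be loaded on a machine without pdfminer.six.

```python
import sys


def iter_page_layouts(path):
    """Yield one layout object (LTPage) per page of the PDF at `path`."""
    # Deferred imports: pdfminer.six is only required once iteration starts.
    from pdfminer.converter import PDFPageAggregator
    from pdfminer.layout import LAParams
    from pdfminer.pdfdocument import PDFDocument
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage
    from pdfminer.pdfparser import PDFParser

    with open(path, 'rb') as f:
        parser = PDFParser(f)            # fetches data from the file
        document = PDFDocument(parser)   # stores the fetched data
        resource_manager = PDFResourceManager()
        # The device turns interpreted pages into layout objects.
        device = PDFPageAggregator(resource_manager, laparams=LAParams())
        interpreter = PDFPageInterpreter(resource_manager, device)
        for page in PDFPage.create_pages(document):
            interpreter.process_page(page)  # processes the page contents
            yield device.get_result()


if __name__ == "__main__" and len(sys.argv) > 1:
    for number, layout in enumerate(iter_page_layouts(sys.argv[1]), start=1):
        print(number, layout.bbox)
```

The sample later in this article uses `PDFPage.get_pages()`, which takes the file object directly instead of an explicit parser and document.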

Processing flow

Layout

Sample

Reference: http://gihyo.jp/book/2017/978-4-7741-8367-1
Here we parse the Oculus Best Practices PDF and write its text to a text file.
print_pdf_textboxes.py

import sys

from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTContainer, LTTextBox
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage


def find_textboxes_recursively(layout_obj):
    """
    Recursively search for text boxes (LTTextBox) and return them as a list.
    """
    # An object that inherits from LTTextBox is returned as a one-element list.
    if isinstance(layout_obj, LTTextBox):
        return [layout_obj]

    # An object that inherits from LTContainer has children, so search them recursively.
    if isinstance(layout_obj, LTContainer):
        boxes = []
        for child in layout_obj:
            boxes.extend(find_textboxes_recursively(child))

        return boxes

    return []  # Anything else yields an empty list.

# Set the layout analysis parameters. Enable detection of vertical writing.
laparams = LAParams(detect_vertical=True)

# Create the resource manager that manages shared resources.
resource_manager = PDFResourceManager()

# Create the PDFPageAggregator object that collects pages.
device = PDFPageAggregator(resource_manager, laparams=laparams)

# Create the interpreter object.
interpreter = PDFPageInterpreter(resource_manager, device)

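# Text file for the output (UTF-8, so characters such as © are written the same way on every platform).
output_txt = open('output.txt', 'w', encoding='utf-8')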

def print_and_write(txt):
    print(txt)
    output_txt.write(txt)
    output_txt.write('\n')

with open(sys.argv[1], 'rb') as f:
    # Pass the file object to PDFPage.get_pages() to obtain PDFPage objects one at a time.
    # For files that take a long time, pass a list of page numbers (0-based) as the keyword argument pagenos.
    for page in PDFPage.get_pages(f):
        print_and_write('\n====== ページ区切り ======\n')
        interpreter.process_page(page)  # Process the page.
        layout = device.get_result()  # Get the LTPage object.

        # Get the list of text boxes on the page.
        boxes = find_textboxes_recursively(layout)

        # Sort the text boxes by the coordinates of their top-left corner.
        # y1 (the Y coordinate) grows upward, so its sign is flipped.
        boxes.sort(key=lambda b: (-b.y1, b.x0))

        for box in boxes:
            print_and_write('-' * 10)  # Print a separator line for readability.
            print_and_write(box.get_text().strip())  # Print the text inside the text box.

output_txt.close()
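
The sort key `(-b.y1, b.x0)` in the script can be tried without a PDF. In the sketch below, `Box` is a hypothetical stand-in for `LTTextBox` that carries only the two coordinates the key reads, and the coordinates are invented for illustration.

```python
from collections import namedtuple

# Stand-in for LTTextBox: just a label plus the bounding-box fields used by the key.
Box = namedtuple("Box", ["name", "x0", "y1"])


def reading_order(boxes):
    """Sort top to bottom (larger y1 first), then left to right (smaller x0 first)."""
    return sorted(boxes, key=lambda b: (-b.y1, b.x0))


boxes = [
    Box("footer", x0=50, y1=40),
    Box("right column", x0=300, y1=700),
    Box("title", x0=50, y1=780),
    Box("left column", x0=50, y1=700),
]
ordered_names = [b.name for b in reading_order(boxes)]
print(ordered_names)
# ['title', 'left column', 'right column', 'footer']
```

Because PDF coordinates have their origin at the bottom left, negating `y1` is what puts the topmost box first. Boxes whose top edges differ slightly are not treated as being on the same row, so multi-column pages may still come out interleaved.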

Run

$ python print_pdf_textboxes.py TARGET_PDF

output.txt


====== ページ区切り ======

----------
Oculus Best Practices
----------
Version 310-30000-02

====== ページ区切り ======

----------
2 | Introduction | Best Practices
----------
Copyrights and Trademarks
----------
© 2017 Oculus VR, LLC. All Rights Reserved.
----------
OCULUS VR, OCULUS, and RIFT are trademarks of Oculus VR, LLC. (C) Oculus VR, LLC. All rights reserved.
BLUETOOTH is a registered trademark of Bluetooth SIG, Inc. All other trademarks are the property of their
respective owners. Certain materials included in this publication are reprinted with the permission of the
copyright holder.
----------
2 |  |

====== ページ区切り ======

----------
Best Practices | Contents | 3
----------
Contents
----------
Introduction to Best Practices..............................................................................4
----------
Binocular Vision, Stereoscopic Imaging and Depth Cues................................. 10
----------
Field of View and Scale.....................................................................................13
----------
Rendering Techniques....................................................................................... 15
----------
Motion................................................................................................................ 17
----------
Tracking.............................................................................................................. 20
----------
Simulator Sickness..............................................................................................23
----------
User Interface..................................................................................................... 30
----------
User Input and Navigation.................................................................................34
----------
Closing Thoughts............................................................................................... 36

====== ページ区切り ======

----------
4 | Introduction to Best Practices | Best Practices
----------
Introduction to Best Practices
----------
VR is an immersive medium. It creates the sensation of being entirely transported into a virtual (or real, but
digitally reproduced) three-dimensional world, and it can provide a far more visceral experience than screen-
based media. These best practices are intended to help developers produce content that provides a safe and
enjoyable consumer experience on Oculus hardware. Developers are responsible for ensuring their content
conforms to all standards and industry best practices on safety and comfort, and for keeping abreast of all
relevant scientific literature on these topics.
・
・
・
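
If the text needs further processing, the separators written by the script make the file easy to split back into pages and text boxes. The following is a minimal sketch; `parse_output` is a hypothetical helper, and it assumes that no text box contains a line consisting of exactly ten hyphens.

```python
PAGE_SEPARATOR = "====== ページ区切り ======"  # same string the script writes
BOX_SEPARATOR = "-" * 10


def parse_output(text):
    """Return a list of pages, each page being a list of text-box strings."""
    pages = []
    for line in text.splitlines():
        if line == PAGE_SEPARATOR:
            pages.append([])          # start a new page
        elif line == BOX_SEPARATOR:
            if pages:
                pages[-1].append([])  # start a new text box on the current page
        elif pages and pages[-1]:
            pages[-1][-1].append(line)  # line belongs to the current text box
    return [["\n".join(lines).strip() for lines in page] for page in pages]


sample = (
    "\n====== ページ区切り ======\n\n"
    "----------\nOculus Best Practices\n"
    "----------\nVersion 310-30000-02\n"
    "\n====== ページ区切り ======\n\n"
    "----------\nCopyrights and Trademarks\n"
)
parsed = parse_output(sample)
print(parsed)
# [['Oculus Best Practices', 'Version 310-30000-02'], ['Copyrights and Trademarks']]
```

To apply it to the real file, read `output.txt` with the same encoding that was used when writing it and pass the contents to `parse_output`.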

Bonus

Opening a file stored in Google Drive with Google Docs also appears to convert it to text. The result is accurate and works well.
Convert PDF and photo files to text

Reference

Original article for this topic ([PDFminer] Extracting text from a PDF): https://qiita.com/mczkzk/items/894110558fb890c930b5
