Converting PDFs to Text with OCR (Cloud Vision)

I used GCP's Cloud Vision from Python to convert PDFs to text.

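Introduction

I recently tried extracting text from PDFs with Python and Node.js modules, and for some files the result came out garbled.
For example, if you copy page 2 of this PDF into a text editor, you will likely get either nothing at all or a meaningless string of characters.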

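This apparently depends on how the PDF was created, so OCR looks like the better approach.
See: "Causes of garbled text in PDFs and how to deal with them"
Note: the approach that uses Google Drive did not recognize much of the text.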

GCP's "Cloud Vision" service is said to have a good recognition rate, so I gave it a try.
See: "Comparison of the GCP/Azure/AWS OCR services"

Other articles have covered trying out Cloud Vision, but simply running it produces a hard-to-read JSON file¹ that cannot be used as a text file, so one more step was needed.

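Caution

Processing a large number of PDFs costs a fair amount of money. It also takes a fair amount of time.²
For example, 100 PDFs of 100 pages each come to 10,000 units. The first 1,000 units each month are free, so the remaining 9,000 units cost 9 x $1.50 = $13.50, or roughly 1,500 yen³.
See: "Cloud Vision API pricing"
Note: that said, the GCP free trial comes with $300 in credit³, so if you still have some left you can use quite a lot for free.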
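
The arithmetic above can be written as a small helper. This is only a sketch: the free allowance (1,000 units per month) and the rate ($1.50 per 1,000 units) are the figures quoted in this article as of 2020-05-23, one page is assumed to be one unit, and the function name is made up for illustration.

```python
def estimate_cost_usd(total_pages, free_units=1_000, usd_per_1000_units=1.50):
    """Estimate one month's charge, treating one page as one billable unit.

    The defaults are the figures quoted in this article, not live pricing.
    """
    billable_units = max(0, total_pages - free_units)
    return billable_units / 1_000 * usd_per_1000_units


# 100 PDFs of 100 pages each, as in the example above.
cost = estimate_cost_usd(100 * 100)
print(f"${cost:.2f}")
```

Check the official pricing page before relying on the result, since both the rate and the free allowance may have changed.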

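If you want to avoid the cost, approaches like the following come to mind.

Use Azure Computer Vision instead, since it has a larger free tier.
Use Windows.Media.Ocr, a free OCR engine built into Windows that is reportedly quite good. (It does seem to have some quirks in how it is used.)
Split the processing so that each month stays within the free tier.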
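
As a rough illustration of the last option, the sketch below groups files into monthly batches by page count. It assumes you already know the page count of each PDF and that one page is one unit; the function name and the 1,000-unit default are illustrative, not part of any Google API.

```python
def plan_monthly_batches(page_counts, monthly_free_units=1_000):
    """Group files greedily so each month's total pages fits the free tier.

    page_counts: dict mapping file name -> number of pages.
    A single file larger than the allowance still gets its own month
    (and would exceed the free tier on its own).
    """
    batches = []
    current, used = [], 0
    for name, pages in page_counts.items():
        # Start a new month when adding this file would go over the limit.
        if current and used + pages > monthly_free_units:
            batches.append(current)
            current, used = [], 0
        current.append(name)
        used += pages
    if current:
        batches.append(current)
    return batches


plan = plan_monthly_batches(
    {"a.pdf": 400, "b.pdf": 500, "c.pdf": 300, "d.pdf": 700}
)
for month, names in enumerate(plan, 1):
    print(month, names)
```

The greedy grouping keeps the input order and is not optimal, but it is simple to reason about for a one-off job.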

Preparation

1. GCP setup

You need to enable the "Cloud Vision API" and issue a "service account key".

Download the key, then set an environment variable to the path where it is stored.

export GOOGLE_APPLICATION_CREDENTIALS="[PATH]"
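
Before running the scripts, it can help to confirm that the variable is actually set and points at a real file. The check below is a minimal sketch using only the standard library; the function name is made up for illustration, and it does not validate the contents of the key.

```python
import os


def check_credentials_env(env=os.environ):
    """Return the key path if GOOGLE_APPLICATION_CREDENTIALS points at a file."""
    path = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not path:
        raise RuntimeError("GOOGLE_APPLICATION_CREDENTIALS is not set")
    if not os.path.isfile(path):
        raise RuntimeError(f"Key file not found: {path}")
    return path
```

Calling `check_credentials_env()` at the top of a script turns a confusing authentication error later on into an immediate, readable one.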

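2. Setting up the runtime environment

3. Storing the PDFs on GCP

The API works on files stored in Cloud Storage (and writes its output to Cloud Storage as well), so create a bucket and upload the files.
I used a directory layout like this so that everything can be processed in one go.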

test
- pdf
  - test1.pdf
  - test2.pdf
- json
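
With this layout, each input object maps to one source URI and one output prefix. The helper below is a small sketch of that mapping, mirroring the string concatenation used in the script later in this article; the function name and the bucket name `my-bucket` are placeholders.

```python
def build_gcs_uris(bucket_name, object_name,
                   input_prefix="test/pdf/", output_prefix="test/json/"):
    """Return (source_uri, destination_uri) for one PDF object.

    The destination ends with "/" so that each PDF gets its own
    output "directory" under output_prefix.
    """
    filename = object_name.rsplit("/", 1)[-1]
    source_uri = f"gs://{bucket_name}/{input_prefix}{filename}"
    destination_uri = f"gs://{bucket_name}/{output_prefix}{filename}/"
    return source_uri, destination_uri


src, dst = build_gcs_uris("my-bucket", "test/pdf/test1.pdf")
print(src)
print(dst)
```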

Python code

1. Output JSON

It is long, but apart from adding a loop it is almost identical to the sample code.

from google.cloud import storage

# Name of the bucket that holds the PDFs
bucket_name = "your-bucket-name"

input_prefix = "test/pdf/"
output_prefix = "test/json/"


def async_detect_document(gcs_source_uri, gcs_destination_uri):
    """OCR with PDF/TIFF as source files on GCS"""
    import re
    from google.cloud import vision
    from google.cloud import storage
    from google.protobuf import json_format

    # Supported mime_types are: 'application/pdf' and 'image/tiff'
    mime_type = "application/pdf"

    # How many pages should be grouped into each json output file.
    batch_size = 100

    client = vision.ImageAnnotatorClient()

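    # NOTE: vision.types and vision.enums exist only in google-cloud-vision 1.x;
    # both modules were removed in 2.0.
    feature = vision.types.Feature(
        type=vision.enums.Feature.Type.DOCUMENT_TEXT_DETECTION
    )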

    gcs_source = vision.types.GcsSource(uri=gcs_source_uri)
    input_config = vision.types.InputConfig(gcs_source=gcs_source, mime_type=mime_type)

    gcs_destination = vision.types.GcsDestination(uri=gcs_destination_uri)
    output_config = vision.types.OutputConfig(
        gcs_destination=gcs_destination, batch_size=batch_size
    )

    async_request = vision.types.AsyncAnnotateFileRequest(
        features=[feature], input_config=input_config, output_config=output_config
    )

    operation = client.async_batch_annotate_files(requests=[async_request])

    print("Waiting for the operation to finish.")
    operation.result(timeout=300)

    # Once the request has completed and the output has been
    # written to GCS, we can list all the output files.
    storage_client = storage.Client()

    match = re.match(r"gs://([^/]+)/(.+)", gcs_destination_uri)
    bucket_name = match.group(1)
    prefix = match.group(2)

    bucket = storage_client.get_bucket(bucket_name)

    # List objects with the given prefix.
    blob_list = list(bucket.list_blobs(prefix=prefix))
    print("Output files:")
    for blob in blob_list:
        print(blob.name)

    # Process the first output file from GCS.
    # Since we specified batch_size=100, the first output file contains
    # up to the first 100 pages of the input file.
    output = blob_list[0]

    json_string = output.download_as_string()
    response = json_format.Parse(json_string, vision.types.AnnotateFileResponse())

    # The actual response for the first page of the input file.
    first_page_response = response.responses[0]
    annotation = first_page_response.full_text_annotation

    # Here we print the full text from the first page.
    # The response contains more information:
    # annotation/pages/blocks/paragraphs/words/symbols
    # including confidence scores and bounding boxes
    print(u"Full text:\n{}".format(annotation.text))


# Instantiates a client
storage_client = storage.Client()

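# Keep only PDF objects: the listing may also contain a zero-byte
# "folder" placeholder object for the prefix itself.
blobs = storage_client.list_blobs(bucket_name, prefix=input_prefix)
files = [blob for blob in blobs if blob.name.lower().endswith(".pdf")]

for i, file in enumerate(files, 1):
    filename = file.name.rsplit("/", 1)[-1]
    print(str(i) + " " + filename)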

    async_detect_document(
        "gs://" + bucket_name + "/" + input_prefix + filename,
        "gs://" + bucket_name + "/" + output_prefix + filename + "/",
    )

The JSON is written under test/json/xxx.pdf/.
Note: this is a little confusing, but a PDF with many pages can produce several JSON files, which is why each PDF gets its own directory.
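
When one PDF does produce several JSON files, a plain string sort can put them in the wrong page order once page numbers pass 1,000. The sketch below sorts by the first page number instead. It assumes the output objects are named like `output-1-to-100.json`; that pattern is an assumption here, so check the names in your own bucket before relying on it.

```python
import re

# Assumed naming pattern of the output objects (verify against your bucket).
_PAGE_RANGE = re.compile(r"output-(\d+)-to-(\d+)\.json$")


def sort_by_first_page(blob_names):
    """Sort output object names by the first page number in the name.

    Names that do not match the assumed pattern are placed at the end.
    """
    def first_page(name):
        match = _PAGE_RANGE.search(name)
        return int(match.group(1)) if match else float("inf")

    return sorted(blob_names, key=first_page)


names = [
    "test/json/a.pdf/output-1001-to-1100.json",
    "test/json/a.pdf/output-1-to-100.json",
    "test/json/a.pdf/output-201-to-300.json",
]
ordered = sort_by_first_page(names)
```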

Even if you download one and open it, it is just a stream of characters and is not readable.
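
The part that matters for plain text is small. The sketch below runs on a hand-written, heavily trimmed sample rather than real API output; only the `responses`, `fullTextAnnotation`, and `text` keys are taken from the files this article works with, and everything else in a real file is omitted.

```python
import json

# Hand-written sample with the same nesting as the output JSON.
sample = """
{
  "responses": [
    {"fullTextAnnotation": {"text": "page one"}},
    {},
    {"fullTextAnnotation": {"text": "page three"}}
  ]
}
"""


def extract_text(json_string):
    """Return the recognized text of each page that has any."""
    data = json.loads(json_string)
    return [
        response["fullTextAnnotation"]["text"]
        for response in data.get("responses", [])
        if "fullTextAnnotation" in response
    ]


pages = extract_text(sample)
print(pages)
```

Pages with no recognized text have no `fullTextAnnotation` key, which is why the membership check is needed.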

2. Output to text

Using the JSON files on Cloud Storage, write the text out to a local directory.

import json
from google.cloud import storage

bucket_name = "your-bucket-name"
json_prefix = "test/json/"

# Local output directory (create it beforehand)
text_dir = "outdata"

storage_client = storage.Client()
bucket = storage_client.bucket(bucket_name)


def ext_txt(filename):
    blob = bucket.blob(filename)
    content = blob.download_as_string()

    # NOTE: bytes -> str -> dict (parsed as JSON)
    data = json.loads(content.decode("utf-8"))

    out_filename = filename.rsplit("/", 2)[-2]
    out_path = text_dir + "/" + out_filename + ".txt"

    # NOTE: the JSON for one PDF may be split across files, so append.
    # Clear text_dir before re-running, or the text will be duplicated.
    with open(out_path, "a", encoding="utf-8") as f:
        for response in data["responses"]:
            if "fullTextAnnotation" in response:
                fulltext = response["fullTextAnnotation"]
                print(fulltext["text"])
                print(fulltext["text"], file=f)


# Keep only JSON objects: the listing may also contain a zero-byte
# "folder" placeholder object for the prefix itself.
blobs = storage_client.list_blobs(bucket_name, prefix=json_prefix)
files = [blob for blob in blobs if blob.name.endswith(".json")]

for i, file in enumerate(files, 1):
    filename = file.name
    print(str(i) + " " + filename)
    ext_txt(filename)

Even the PDF that could not be copied and pasted at the beginning comes out like this.

A3-03
サービスメッシュは本当に必要なのか
何を解決するのか
Yasuhiro Hara
Solutions Architect
Amazon Web Services Japan
aws SUMMIT
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.

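¹ It actually outputs all sorts of information, such as the position of each character, but I was taken aback when I first saw it.
² Roughly 50 pages x 200 files took about two hours. The code is the sample as-is and runs serially, so writing it differently could probably make it faster.
³ Both figures are as of 2020-05-23. Please refer to the official documentation for exact numbers.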