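【PowerShell】 Morphological analysis with SudachiPy

I came across SudachiPy, an excellent morphological analyzer, so I made it callable from PowerShell, the shell I use every day.

The finished result

When you pipe strings into the cmdlet, it returns objects that hold the input string in the line property and the analysis result in the parsed property.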

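Code

The main analysis is written in Python and invoked from PowerShell.
Strings could be passed in as command-line arguments and returned through print on standard output, but because of the problems below I use temporary files instead.

Argument length limit

A few hundred lines seems to be about the limit?

String escaping

Handling input that contains quotation marks or tab characters is cumbersome.

Character encoding

On Windows, trying to print a character that CP932 cannot represent raises UnicodeEncodeError.

The only way around it is to drop such characters or replace them with ?.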
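The encoding problem can be reproduced without a Windows console by encoding to CP932 explicitly. The following is a minimal sketch, not part of the original article; the sample string is made up, and it assumes only the standard cp932 codec that ships with CPython.

```python
# An emoji has no CP932 representation, so strict encoding fails.
text = "絵文字😀入り"

try:
    text.encode("cp932")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# The two workarounds mentioned above: replace with "?" or drop the character.
replaced = text.encode("cp932", errors="replace").decode("cp932")
ignored = text.encode("cp932", errors="ignore").decode("cp932")

# Writing UTF-8 to a file instead keeps the text intact.
utf8_round_trip = text.encode("utf-8").decode("utf-8")

print(strict_ok)                 # strict CP932 encoding did not succeed
print(replaced)                  # emoji replaced by "?"
print(ignored)                   # emoji silently dropped
print(utf8_round_trip == text)   # UTF-8 loses nothing
```

Both workarounds lose information, which is why exchanging UTF-8 files is the safer route here.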

Processing on the Python side

Installing Python through Scoop is convenient because it takes care of the PATH setup for you. As preparation, install SudachiPy and fire with pip.

pip install sudachipy
pip install fire

The task "run morphological analysis on each line of a text file and write the result to another text file" is wrapped in a single function and turned into a CLI tool with fire.Fire().

sudachi_tokenizer.py

import json
import re

import fire
from sudachipy import dictionary
from sudachipy import tokenizer


def main(input_file_path, output_file_path, ignore_paren=False):
    """Tokenize each line of input_file_path and write the results to output_file_path."""
    tokenizer_obj = dictionary.Dictionary().create()
    # SplitMode.C produces the longest token units.
    mode = tokenizer.Tokenizer.SplitMode.C

    with open(input_file_path, "r", encoding="utf-8") as input_file:
        lines = input_file.read().splitlines()

    json_style_list = []
    for line in lines:
        if not line:
            # Keep empty lines so the output stays aligned with the input.
            json_style_list.append({"line": "", "parsed": []})
        else:
            if ignore_paren:
                # Drop text enclosed in half- or full-width round/square brackets.
                target = re.sub(r"\(.+?\)|\[.+?\]|（.+?）|［.+?］", "", line)
            else:
                target = line
            tokens = tokenizer_obj.tokenize(target, mode)
            parsed = []
            for t in tokens:
                pos = t.part_of_speech()
                parsed.append({
                    "surface": t.surface(),
                    "pos": pos[0],
                    "reading": t.reading_form(),
                    "c_type": pos[4],  # conjugation type
                    "c_form": pos[5]   # conjugation form
                })
            json_style_list.append({"line": line, "parsed": parsed})

    with open(output_file_path, mode="w", encoding="utf-8") as output_file:
        # json.dumps writes strict JSON; str() would write the Python repr instead.
        output_file.write(json.dumps(json_style_list, ensure_ascii=False))


if __name__ == "__main__":
    fire.Fire(main)

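For work reasons, the ignore_paren option removes text enclosed in round parentheses （） () or square brackets ［］ [] before the analysis.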
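The bracket-stripping pattern from the script above can be tried on its own. This is a minimal sketch for illustration: the helper name strip_brackets and the sample sentence are hypothetical; only the regular expression is taken from sudachi_tokenizer.py.

```python
import re

# Same pattern as in sudachi_tokenizer.py:
# half- and full-width round parentheses and square brackets.
BRACKET_PATTERN = re.compile(r"\(.+?\)|\[.+?\]|（.+?）|［.+?］")


def strip_brackets(line: str) -> str:
    """Remove bracketed spans; the match is non-greedy, so each pair is removed separately."""
    return BRACKET_PATTERN.sub("", line)


sample = "今日(きょう)は［注1］晴れ（はれ）です[要確認]"
print(strip_brackets(sample))
```

Because the pattern uses .+?, a bracket pair must contain at least one character to match, so a lone empty pair such as () is left untouched.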

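Processing on the PowerShell side

Create a PowerShell script like the one below alongside the sudachi_tokenizer.py above and load it into your session; the cmdlet can then be used from the console. As written, the function looks for sudachi_tokenizer.py in a python folder under the script's own directory ($PSScriptRoot).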

function Invoke-SudachiTokenizer {
    param (
        [switch]$ignoreParen
    )

    try {
        (Get-Command "sudachipy.exe" -ErrorAction Stop) > $null
    }
    catch {
        Write-Host "  'sudachipy' is not found in this computer!`n  install by pip: " -ForegroundColor Magenta -NoNewline
        Write-Host "pip install sudachipy"
        Write-Host "https://github.com/WorksApplications/SudachiPy" -ForegroundColor White
        return
    }

    $outputTmp = New-TemporaryFile
    $inputTmp = New-TemporaryFile
    $input | Out-File -Encoding utf8NoBOM -FilePath $inputTmp.FullName

    $sudachiPath = "$($PSScriptRoot)\python\sudachi_tokenizer.py"
    # Pass the arguments as an array so that no command string has to be quoted and re-parsed.
    $pythonArgs = @('-B', $sudachiPath, $inputTmp.FullName, $outputTmp.FullName)
    if ($ignoreParen) {
        $pythonArgs += '--ignore_paren=True'
    }

    & python @pythonArgs
    $parsed = Get-Content -Path $outputTmp.FullName -Encoding utf8NoBOM

    @($inputTmp, $outputTmp) | Remove-Item

    return $($parsed | ConvertFrom-Json)
}

A Python list of dicts has the same shape as a JSON array, so the PowerShell side converts it into objects with ConvertFrom-Json.
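One caveat: the str() form of a Python list only resembles JSON, because strings come out single-quoted. The sketch below is an illustration and not part of the original article. It compares str() with json.dumps on hand-written sample records (not real tokenizer output) and checks both with Python's strict JSON parser. A lenient parser may still accept the str() form.

```python
import json

# Hand-written sample records in the same shape the script produces.
records = [
    {"line": "", "parsed": []},
    {"line": "it's", "parsed": []},
]

as_repr = str(records)                              # Python repr: single-quoted strings
as_json = json.dumps(records, ensure_ascii=False)   # strict JSON, non-ASCII kept readable

try:
    json.loads(as_repr)
    repr_is_strict_json = True
except json.JSONDecodeError:
    repr_is_strict_json = False

print(as_repr)
print(as_json)
print(repr_is_strict_json)
```

Writing the file with json.dumps keeps the output valid for any JSON consumer, not only for parsers that tolerate single quotes.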

Reference

Original article: 【PowerShell】 Morphological analysis with SudachiPy, https://qiita.com/AWtnb/items/eb778aba1cc2e335e581
