I Built a Sushi Image Scraper

Motivation

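I came across the article "SushiGAN ~ Can Artificial Intelligence Make Sushi? ~" and thought, "Maybe I could supply some sushi images." That was the trigger. I'm strongly in favor of articles like that one, the kind that say "the practicality is questionable, but I did something fun with Deep Learning!"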

What I did

I wrote a program that collects sushi images from sushi restaurant websites. Most sushi restaurant websites are static, so scraping them was easy.

Libraries used

The long-standing favorite, Beautiful Soup
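As a rough illustration of why static pages are easy to scrape, here is a minimal sketch that pulls image URLs and names out of an HTML string with a CSS selector. The markup, the `menu-item` class, the `example.com` URLs, and the `extract_images` helper are all hypothetical and not taken from any of the real sites; the built-in `html.parser` is used here only to avoid an extra dependency.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical menu page markup; real sites use their own class names.
SAMPLE_HTML = """
<div class="menu-item"><img src="/img/maguro.jpg" alt="maguro"></div>
<div class="menu-item"><img src="//cdn.example.com/img/ebi.jpg" alt="ebi"></div>
<div class="menu-item"><img src="https://example.com/img/tamago.jpg" alt="tamago"></div>
"""


def extract_images(html, page_url):
    """Return (absolute image URLs, file names) found in the given HTML."""
    soup = BeautifulSoup(html, "html.parser")
    imgs = soup.select("div.menu-item > img")
    # urljoin resolves relative and protocol-relative src values
    # against the URL of the page they were found on.
    img_srcs = [urljoin(page_url, img["src"]) for img in imgs]
    img_names = [img["alt"] + ".jpg" for img in imgs]
    return img_srcs, img_names


srcs, names = extract_images(SAMPLE_HTML, "https://example.com/menu/")
print(srcs)
print(names)
```

Resolving every `src` with `urljoin` means the same helper copes with absolute, root-relative, and protocol-relative image paths without per-site string concatenation.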

Code

sushicraper.py

import requests
from bs4 import BeautifulSoup
import re
import shutil
import os
import multiprocessing
from itertools import repeat

"""
Each sushi scraper function takes a soup object and returns image source urls and names
"""


def download_img(img_src, img_name, save_dir):
    print('Downloading', img_src)
    try:
        # A timeout keeps one unresponsive server from stalling a worker forever.
        r = requests.get(img_src, stream=True, timeout=10)
        if r.status_code == 200:
            img_path = os.path.join(save_dir, img_name)
            with open(img_path, 'wb') as f:
                r.raw.decode_content = True
                shutil.copyfileobj(r.raw, f)
    except Exception as e:
        print('Could not download the image due to an error:', img_src)
        print(e)


def multi_download(img_srcs, img_names, save_dir):
    num_cpu = multiprocessing.cpu_count()
    with multiprocessing.Pool(num_cpu) as p:
        p.starmap(download_img, zip(img_srcs, img_names, repeat(save_dir)))


def sushibenkei(soup):
    img_srcs = [x['src'] for x in soup.select('div.sushiitem > img')]
    regex = re.compile(r'[0-9]+円')
    parser = lambda x: regex.sub('', x.replace('\n', '').replace('\u3000', ''))
    img_names = [x.text + '.png' for x in soup.select('div.sushiitem')]
    img_names = list(map(parser, img_names))
    return img_srcs, img_names


def oshidorizushi(soup):
    img_srcs = [x['src'] for x in soup.select('div.menu-a-item > img')]
    img_names = [x.text + '.jpg' for x in soup.select('p.menu-a-name')]
    return img_srcs, img_names


def nigiri_chojiro(soup):
    uls = soup.select('ul.menu-list')
    img_srcs = ['https:' + li.find('img')['src'] for ul in uls for li in ul.find_all('li')]
    img_names = [li.find('dt').text for ul in uls for li in ul.find_all('li')]
    parser = lambda x: x.split('／')[0].lower().replace(' ', '_') + '.jpg'
    img_names = list(map(parser, img_names))
    return img_srcs, img_names


def nigiri_no_tokubei(soup):
    img_srcs = [x['href'] for x in soup.select('a.item_link')]
    img_names = [x.text + '.jpg' for x in soup.select('dt.item_title')]
    return img_srcs, img_names


def sushi_value(soup):
    img_srcs = [x['src'] for x in soup.select('img.attachment-full')]
    img_names = [x['alt'] + '.jpg' for x in soup.select('img.attachment-full')]
    return img_srcs, img_names


def daiki_suisan(soup):
    # Escape the dot so that only a literal ".jpg" extension matches.
    regex = re.compile(r'.+grandmenu/menu.+\.jpg')
    img_srcs = [x['src'] for x in soup.find_all('img', {'src': regex})]
    img_names = [x['alt'] + '.jpg' for x in soup.find_all('img', {'src': regex})]
    return img_srcs, img_names


def main():
    img_dir = 'images'
    os.makedirs(img_dir, exist_ok=True)

    funcs_urls = [
        (sushibenkei, 'http://www.sushibenkei.co.jp/sushimenu/'),
        (oshidorizushi, 'http://www.echocom.co.jp/menu'),
        (nigiri_chojiro, 'https://www.chojiro.jp/menu/menu.php?pid=1'),
        (nigiri_no_tokubei, 'http://www.nigirinotokubei.com/menu/551/'),
        (sushi_value, 'https://www.sushi-value.com/menu/'),
        (daiki_suisan, 'http://www.daiki-suisan.co.jp/sushi/grandmenu/'),
    ]

    for func, url in funcs_urls:
        soup = BeautifulSoup(requests.get(url).text, 'lxml')
        img_srcs, img_names = func(soup)
        save_dir = os.path.join(img_dir, func.__name__)
        os.makedirs(save_dir, exist_ok=True)
        multi_download(img_srcs, img_names, save_dir)


if __name__ == '__main__':
    main()
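
Since downloading is mostly waiting on the network, a thread pool is a possible alternative to the process pool used in `multi_download`. The following is a minimal sketch of that idea, not the method used above: `threaded_download`, `fake_download`, and `max_workers=8` are hypothetical names and values, and `fake_download` stands in for `download_img` so the example runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import repeat


def threaded_download(download, img_srcs, img_names, save_dir, max_workers=8):
    """Call download(src, name, save_dir) for every image using a thread pool."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map takes parallel iterables, like the built-in map;
        # list() waits for all calls and re-raises the first exception, if any.
        return list(executor.map(download, img_srcs, img_names, repeat(save_dir)))


def fake_download(img_src, img_name, save_dir):
    """Stand-in for download_img that only reports where a file would go."""
    return f"{img_src} -> {save_dir}/{img_name}"


results = threaded_download(
    fake_download,
    ["https://example.com/a.jpg", "https://example.com/b.jpg"],
    ["a.jpg", "b.jpg"],
    "images/demo",
)
print(results)
```

In real use, `download_img` would be passed in place of `fake_download`.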

Results

A folder is created for each sushi restaurant, and that restaurant's images are collected inside it.
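
Because the file names are built from text scraped off each page, they can contain whitespace, newlines, or path separators. A minimal sketch of one way to clean them before saving is shown below; `safe_filename` and its replacement rules are assumptions for illustration and are not part of the script above.

```python
import re

# Characters that are invalid or awkward in file names on common file systems,
# plus any whitespace (including newlines and full-width spaces).
_UNSAFE = re.compile(r'[\\/:*?"<>|\s]+')


def safe_filename(name, ext=".jpg"):
    """Turn scraped text into a file name, replacing unsafe runs with '_'."""
    stem = _UNSAFE.sub("_", name).strip("_.")
    # Fall back to a placeholder if nothing usable is left.
    return (stem or "unnamed") + ext


print(safe_filename("salmon / ikura"))
print(safe_filename("tuna\n100"))
print(safe_filename("  "))
```

Names that collapse to the same string would still overwrite each other, so a real run might also need a counter or a hash suffix.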

Future ideas

It would be fun to build something like SushiNet or Sushi-Mask-RCNN.

Reference

Original article: "I Built a Sushi Image Scraper", https://qiita.com/harupy/items/660b60f89cf943733d76
