Python Web Crawlers: A Tutorial on Setting Up the Scrapy Environment
How do you set up a Scrapy environment?
First you need a working Python installation. For setting up the Python environment, see: https://blog.csdn.net/alice_tl/article/details/76793590
Next, install Scrapy itself.
1. Install Scrapy: in a terminal, run pip install Scrapy (this works best on a network with direct, unrestricted access to PyPI).
The installation output looks like this:
alicedeMacBook-Pro:~ alice$ pip install Scrapy
Collecting Scrapy
Using cached https://files.pythonhosted.org/packages/5d/12/a6197eaf97385e96fd8ec56627749a6229a9b3178ad73866a0b1fb377379/Scrapy-1.5.1-py2.py3-none-any.whl
Collecting w3lib>=1.17.0 (from Scrapy)
Using cached https://files.pythonhosted.org/packages/37/94/40c93ad0cadac0f8cb729e1668823c71532fd4a7361b141aec535acb68e3/w3lib-1.19.0-py2.py3-none-any.whl
Collecting six>=1.5.2 (from Scrapy)
xxxxxxxxxxxxxxxxxxxxx
File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/setuptools/dist.py", line 380, in fetch_build_egg
return cmd.easy_install(req)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/setuptools/command/easy_install.py", line 632, in easy_install
raise DistutilsError(msg)
distutils.errors.DistutilsError: Could not find suitable distribution for Requirement.parse('incremental>=16.10.1')
----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /private/var/folders/v1/9x8s5v8x74v86vnpqyttqy280000gn/T/pip-install-U_6VZF/Twisted/
The install fails while building Twisted (pip cannot resolve its incremental>=16.10.1 dependency): Command "python setup.py egg_info" failed with error code 1 in /private/var/folders/v1/9x8s5v8x74v86vnpqyttqy280000gn/T/pip-install-U_6VZF/Twisted/
2. Install Twisted manually; in the terminal, run: sudo pip install twisted==13.1.0
alicedeMacBook-Pro:~ alice$ pip install twisted==13.1.0
Collecting twisted==13.1.0
Downloading https://files.pythonhosted.org/packages/10/38/0d1988d53f140ec99d37ac28c04f341060c2f2d00b0a901bf199ca6ad984/Twisted-13.1.0.tar.bz2 (2.7MB)
100% || 2.7MB 398kB/s
Requirement already satisfied: zope.interface>=3.6.0 in /System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python (from twisted==13.1.0) (4.1.1)
Requirement already satisfied: setuptools in /System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python (from zope.interface>=3.6.0->twisted==13.1.0) (18.5)
Installing collected packages: twisted
Running setup.py install for twisted ... error
Complete output from command /usr/bin/python -u -c "import setuptools, tokenize;__file__='/private/var/folders/v1/9x8s5v8x74v86vnpqyttqy280000gn/T/pip-install-inJwZ2/twisted/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /private/var/folders/v1/9x8s5v8x74v86vnpqyttqy280000gn/T/pip-record-OmuVWF/install-record.txt --single-version-externally-managed --compile:
running install
running build
running build_py
creating build
creating build/lib.macosx-10.13-intel-2.7
creating build/lib.macosx-10.13-intel-2.7/twisted
copying twisted/copyright.py -> build/lib.macosx-10.13-intel-2.7/twisted
copying twisted/_version.py -> build/li
3. Run sudo pip install Scrapy again. It fails with a different error; this time lxml is missing: Could not find a version that satisfies the requirement lxml (from Scrapy) (from versions: )
No matching distribution found for lxml (from Scrapy)
alicedeMacBook-Pro:~ alice$ sudo pip install Scrapy
The directory '/Users/alice/Library/Caches/pip/http' or its parent directory is not owned by the current user and the cache has been disabled. Please check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
The directory '/Users/alice/Library/Caches/pip' or its parent directory is not owned by the current user and caching wheels has been disabled. check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
Collecting Scrapy
Downloading https://files.pythonhosted.org/packages/5d/12/a6197eaf97385e96fd8ec56627749a6229a9b3178ad73866a0b1fb377379/Scrapy-1.5.1-py2.py3-none-any.whl (249kB)
100% || 256kB 210kB/s
Collecting w3lib>=1.17.0 (from Scrapy)
xxxxxxxxxxxx
Downloading https://files.pythonhosted.org/packages/90/50/4c315ce5d119f67189d1819629cae7908ca0b0a6c572980df5cc6942bc22/Twisted-18.7.0.tar.bz2 (3.1MB)
100% || 3.1MB 59kB/s
Collecting lxml (from Scrapy)
Could not find a version that satisfies the requirement lxml (from Scrapy) (from versions: )
No matching distribution found for lxml (from Scrapy)
4. Install lxml with: sudo pip install lxml
alicedeMacBook-Pro:~ alice$ sudo pip install lxml
The directory '/Users/alice/Library/Caches/pip/http' or its parent directory is not owned by the current user and the cache has been disabled. Please check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
The directory '/Users/alice/Library/Caches/pip' or its parent directory is not owned by the current user and caching wheels has been disabled. check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
Collecting lxml
Downloading https://files.pythonhosted.org/packages/a1/2c/6b324d1447640eb1dd240e366610f092da98270c057aeb78aa596cda4dab/lxml-4.2.4-cp27-cp27m-macosx_10_6_intel.macosx_10_9_intel.macosx_10_9_x86_64.macosx_10_10_intel.macosx_10_10_x86_64.whl (8.7MB)
100% || 8.7MB 187kB/s
Installing collected packages: lxml
Successfully installed lxml-4.2.4
5. Install Scrapy once more with sudo pip install Scrapy; this time the installation succeeds:
alicedeMacBook-Pro:~ alice$ sudo pip install Scrapy
The directory '/Users/alice/Library/Caches/pip/http' or its parent directory is not owned by the current user and the cache has been disabled. Please check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
The directory '/Users/alice/Library/Caches/pip' or its parent directory is not owned by the current user and caching wheels has been disabled. check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
Collecting Scrapy
Downloading https://files.pythonhosted.org/packages/5d/12/a6197eaf97385e96fd8ec56627749a6229a9b3178ad73866a0b1fb377379/Scrapy-1.5.1-py2.py3-none-any.whl (249kB)
100% || 256kB 11.5MB/s
Collecting w3lib>=1.17.0 (from Scrapy)
xxxxxxxxx
Requirement already satisfied: lxml in /Library/Python/2.7/site-packages (from Scrapy) (4.2.4)
Collecting functools32; python_version < "3.0" (from parsel>=1.1->Scrapy)
Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ReadTimeoutError("HTTPSConnectionPool(host='pypi.org', port=443): Read timed out. (read timeout=15)",)': /simple/functools32/
Downloading https://files.pythonhosted.org/packages/4b/2a/0276479a4b3caeb8a8c1af2f8e4355746a97fab05a372e4a2c6a6b876165/idna-2.7-py2.py3-none-any.whl (58kB)
100% || 61kB 66kB/s
Installing collected packages: w3lib, cssselect, functools32, parsel, queuelib, PyDispatcher, attrs, pyasn1-modules, service-identity, zope.interface, constantly, incremental, Automat, idna, hyperlink, PyHamcrest, Twisted, Scrapy
Running setup.py install for functools32 ... done
Running setup.py install for PyDispatcher ... done
Found existing installation: zope.interface 4.1.1
Uninstalling zope.interface-4.1.1:
Successfully uninstalled zope.interface-4.1.1
Running setup.py install for zope.interface ... done
Running setup.py install for Twisted ... done
Successfully installed Automat-0.7.0 PyDispatcher-2.0.5 PyHamcrest-1.9.0 Scrapy-1.5.1 Twisted-18.7.0 attrs-18.1.0 constantly-15.1.0 cssselect-1.0.3 functools32-3.2.3.post2 hyperlink-18.0.0 idna-2.7 incremental-17.5.0 parsel-1.5.0 pyasn1-modules-0.2.2 queuelib-1.5.0 service-identity-17.0.0 w3lib-1.19.0 zope.interface-4.5.0
6. Confirm the installation by running scrapy --version. If Scrapy's version information appears, e.g. Scrapy 1.5.1 - no active project, everything is in place:
alicedeMacBook-Pro:~ alice$ scrapy --version
Scrapy 1.5.1 - no active project
Usage:
scrapy <command> [options] [args]
Available commands:
bench Run quick benchmark test
fetch Fetch a URL using the Scrapy downloader
genspider Generate new spider using pre-defined templates
runspider Run a self-contained spider (without creating a project)
settings Get settings values
shell Interactive scraping console
startproject Create new project
version Print Scrapy version
view Open URL in browser, as seen by Scrapy
[ more ] More commands available when run from project directory
Use "scrapy <command> -h" to see more info about a command
PS: If you cannot reach pypi.org reliably, or you install without sudo administrator privileges, you will see errors like the one below (note that pip's warnings above also suggest running sudo with its -H flag):
Exception:
Traceback (most recent call last):
File "/Library/Python/2.7/site-packages/pip/_internal/basecommand.py", line 141, in main
status = self.run(options, args)
File "/Library/Python/2.7/site-packages/pip/_internal/commands/install.py", line 299, in run
resolver.resolve(requirement_set)
File "/Library/Python/2.7/site-packages/pip/_internal/resolve.py", line 102, in resolve
self._resolve_one(requirement_set, req)
File "/Library/Python/2.7/site-packages/pip/_internal/resolve.py", line 256, in _resolve_one
abstract_dist = self._get_abstract_dist_for(req_to_install)
File "/Library/Python/2.7/site-packages/pip/_internal/resolve.py", line 209, in _get_abstract_dist_for
self.require_hashes
File "/Library/Python/2.7/site-packages/pip/_internal/operations/prepare.py", line 283, in prepare_linked_requirement
progress_bar=self.progress_bar
File "/Library/Python/2.7/site-packages/pip/_internal/download.py", line 836, in unpack_url
progress_bar=progress_bar
File "/Library/Python/2.7/site-packages/pip/_internal/download.py", line 673, in unpack_http_url
progress_bar)
File "/Library/Python/2.7/site-packages/pip/_internal/download.py", line 897, in _download_http_url
_download_url(resp, link, content_file, hashes, progress_bar)
File "/Library/Python/2.7/site-packages/pip/_internal/download.py", line 617, in _download_url
hashes.check_against_chunks(downloaded_chunks)
File "/Library/Python/2.7/site-packages/pip/_internal/utils/hashes.py", line 48, in check_against_chunks
for chunk in chunks:
File "/Library/Python/2.7/site-packages/pip/_internal/download.py", line 585, in written_chunks
for chunk in chunks:
File "/Library/Python/2.7/site-packages/pip/_internal/download.py", line 574, in resp_read
decode_content=False):
File "/Library/Python/2.7/site-packages/pip/_vendor/urllib3/response.py", line 465, in stream
data = self.read(amt=amt, decode_content=decode_content)
File "/Library/Python/2.7/site-packages/pip/_vendor/urllib3/response.py", line 430, in read
raise IncompleteRead(self._fp_bytes_read, self.length_remaining)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/contextlib.py", line 35, in __exit__
self.gen.throw(type, value, traceback)
File "/Library/Python/2.7/site-packages/pip/_vendor/urllib3/response.py", line 345, in _error_catcher
raise ReadTimeoutError(self._pool, None, 'Read timed out.')
ReadTimeoutError: HTTPSConnectionPool(host='files.pythonhosted.org', port=443): Read timed out.
That completes the Scrapy environment setup, following the manual. Next up: common errors when running a Scrapy spider, and how to resolve them.
Working through the first Spider example, save the following code as dmoz_spider.py under the tutorial/spiders directory:
import scrapy

class DmozSpider(scrapy.Spider):
    name = "dmoz"                  # the name passed to `scrapy crawl`
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        # write each page's body to a file named after the last URL path segment
        filename = response.url.split("/")[-2]
        with open(filename, 'wb') as f:
            f.write(response.body)
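To see why parse() takes split("/")[-2] for the filename: a URL that ends with a slash splits into a list whose last element is the empty string, so the second-to-last element is the final path segment. A quick illustration:
# ['http:', '', 'www.dmoz.org', ..., 'Books', ''] -- the trailing slash leaves ''
url = "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"
print(url.split("/")[-2])  # 'Books' -> the file the response body is saved to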
Run scrapy crawl dmoz in the terminal to start the spider. Error 1:
Scrapy 1.6.0 - no active project
Unknown command: crawl
alicedeMacBook-Pro:~ alice$ scrapy crawl dmoz
Scrapy 1.6.0 - no active project
Unknown command: crawl
Use "scrapy" to see available commands
Cause: when you create a project with the startproject command, a scrapy.cfg file is generated automatically. When you then launch a spider from the command line, crawl searches the current directory for that scrapy.cfg file; as the official documentation explains, if no scrapy.cfg is found, Scrapy assumes there is no project. Solution: cd into the root directory of the dmoz project, i.e. the directory containing scrapy.cfg, and run scrapy crawl dmoz from there. (A diagnostic sketch for checking this is shown below.)
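If you are unsure whether the current directory counts as "inside" a project, you can ask Scrapy directly. This sketch relies on closest_scrapy_cfg, an internal helper in scrapy.utils.conf (present in Scrapy 1.x but not a documented public API), so treat it as a diagnostic aid rather than guaranteed behavior:
# Mirrors how Scrapy decides whether it is inside a project: it walks up
# from the current working directory looking for a scrapy.cfg file.
from scrapy.utils.conf import closest_scrapy_cfg  # internal helper, Scrapy 1.x

cfg = closest_scrapy_cfg()
print(cfg if cfg else "no scrapy.cfg found; `scrapy crawl` will not work here")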
Under normal circumstances, the output would look like this:
2014-01-23 18:13:07-0400 [scrapy] INFO: Scrapy started (bot: tutorial)
2014-01-23 18:13:07-0400 [scrapy] INFO: Optional features available: ...
2014-01-23 18:13:07-0400 [scrapy] INFO: Overridden settings: {}
2014-01-23 18:13:07-0400 [scrapy] INFO: Enabled extensions: ...
2014-01-23 18:13:07-0400 [scrapy] INFO: Enabled downloader middlewares: ...
2014-01-23 18:13:07-0400 [scrapy] INFO: Enabled spider middlewares: ...
2014-01-23 18:13:07-0400 [scrapy] INFO: Enabled item pipelines: ...
2014-01-23 18:13:07-0400 [dmoz] INFO: Spider opened
2014-01-23 18:13:08-0400 [dmoz] DEBUG: Crawled (200)
2014-01-23 18:13:09-0400 [dmoz] DEBUG: Crawled (200)
In practice, however, that is not what happened.
Error 2:
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/scrapy/spiderloader.py", line 71, in load
raise KeyError("Spider not found: {}".format(spider_name))
KeyError: 'Spider not found: dmoz'
alicedeMacBook-Pro:tutorial alice$ scrapy crawl dmoz
2019-04-19 09:28:23 [scrapy.utils.log] INFO: Scrapy 1.6.0 started (bot: tutorial)
2019-04-19 09:28:23 [scrapy.utils.log] INFO: Versions: lxml 4.3.3.0, libxml2 2.9.9, cssselect 1.0.3, parsel 1.5.0, w3lib 1.19.0, Twisted 18.7.0, Python 3.7.3 (v3.7.3:ef4ec6ed12, Mar 25 2019, 16:39:00) - [GCC 4.2.1 (Apple Inc. build 5666) (dot 3)], pyOpenSSL 18.0.0 (OpenSSL 1.1.0i 14 Aug 2018), cryptography 2.3.1, Platform Darwin-17.3.0-x86_64-i386-64bit
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/scrapy/spiderloader.py", line 69, in load
return self._spiders[spider_name]
KeyError: 'dmoz'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/scrapy/spiderloader.py", line 71, in load
raise KeyError("Spider not found: {}".format(spider_name))
KeyError: 'Spider not found: dmoz'
Cause: wrong working directory; the command must be run from inside the project that contains the dmoz spider. Solution: simple enough — double-check which directory you are in and cd into the right one. (A sketch for listing the spiders Scrapy can actually see follows below.)
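Running scrapy list from the project root prints every registered spider name; if dmoz is missing, Scrapy is not picking up the spider module at all. The same check can be done from Python, as in this sketch (it assumes it is run from the project root, where scrapy.cfg lives):
# Equivalent to the command-line `scrapy list`: load the project settings
# and ask the spider loader which spider names it knows about.
from scrapy.spiderloader import SpiderLoader
from scrapy.utils.project import get_project_settings

loader = SpiderLoader.from_settings(get_project_settings())
print(loader.list())  # expect ['dmoz'] once the spider file is in place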
Error 3:
File "/Library/Python/2.7/site-packages/twisted/internet/_sslverify.py", line 15, in
from OpenSSL._util import lib as pyOpenSSLlib
ImportError: No module named _util
alicedeMacBook-Pro:tutorial alice$ scrapy crawl dmoz
2018-08-06 22:25:23 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: tutorial)
2018-08-06 22:25:23 [scrapy.utils.log] INFO: Versions: lxml 4.2.4.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.5.0, w3lib 1.19.0, Twisted 18.7.0, Python 2.7.10 (default, Jul 15 2017, 17:16:57) - [GCC 4.2.1 Compatible Apple LLVM 9.0.0 (clang-900.0.31)], pyOpenSSL 0.13.1 (LibreSSL 2.2.7), cryptography unknown, Platform Darwin-17.3.0-x86_64-i386-64bit
2018-08-06 22:25:23 [scrapy.crawler] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'tutorial.spiders', 'SPIDER_MODULES': ['tutorial.spiders'], 'ROBOTSTXT_OBEY': True, 'BOT_NAME': 'tutorial'}
Traceback (most recent call last):
File "/usr/local/bin/scrapy", line 11, in <module>
sys.exit(execute())
File "/Library/Python/2.7/site-packages/scrapy/cmdline.py", line 150, in execute
_run_print_help(parser, _run_command, cmd, args, opts)
File "/Library/Python/2.7/site-packages/scrapy/cmdline.py", line 90, in _run_print_help
func(*a, **kw)
File "/Library/Python/2.7/site-packages/scrapy/cmdline.py", line 157, in _run_command
t/ssl.py", line 230, in <module>
from twisted.internet._sslverify import (
File "/Library/Python/2.7/site-packages/twisted/internet/_sslverify.py", line 15, in <module>
from OpenSSL._util import lib as pyOpenSSLlib
ImportError: No module named _util
I searched online for a long time without finding a fix. Several bloggers suggested the pyOpenSSL or Scrapy installation was broken and recommended reinstalling them; I reinstalled both, yet the same error came back, and I had no idea what else to try. After reinstalling pyOpenSSL and Scrapy once more, the error finally seems to be gone (a quick import check is sketched below), and the crawl now runs:
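To confirm that the reinstall actually fixed things, you can reproduce the failing line from the traceback by hand; if this sketch runs without an ImportError, the pyOpenSSL installation is intact:
# Reproduces the exact import that twisted/internet/_sslverify.py performs.
from OpenSSL._util import lib as pyOpenSSLlib

import OpenSSL
print(OpenSSL.__version__)  # pyOpenSSL version, e.g. 18.0.0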
2019-04-19 09:46:37 [scrapy.core.engine] INFO: Spider opened
2019-04-19 09:46:37 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-04-19 09:46:39 [scrapy.core.engine] DEBUG: Crawled (403) <GET http://www.dmoz.org/robots.txt> (referer: None)
2019-04-19 09:46:39 [scrapy.core.engine] DEBUG: Crawled (403) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: None)
2019-04-19 09:46:40 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>: HTTP status code is not handled or not allowed
2019-04-19 09:46:40 [scrapy.core.engine] DEBUG: Crawled (403) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/> (referer: None)
2019-04-19 09:46:40 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/>: HTTP status code is not handled or not allowed
2019-04-19 09:46:40 [scrapy.core.engine] INFO: Closing spider (finished)
2019-04-19 09:46:40 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 737,
'downloader/request_count': 3,
'downloader/request_method_count/GET': 3,
'downloader/response_bytes': 2103,
'downloader/response_count': 3,
'downloader/response_status_count/403': 3,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2019, 4, 19, 1, 46, 40, 570939),
'httperror/response_ignored_count': 2,
'httperror/response_ignored_status_count/403': 2,
'log_count/DEBUG': 3,
'log_count/INFO': 9,
'log_count/WARNING': 1,
'memusage/max': 65601536,
'memusage/startup': 65597440,
'response_received_count': 3,
'robotstxt/request_count': 1,
'robotstxt/response_count': 1,
'robotstxt/response_status_count/403': 1,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2019, 4, 19, 1, 46, 37, 468659)}
2019-04-19 09:46:40 [scrapy.core.engine] INFO: Spider closed (finished)
alicedeMacBook-Pro:tutorial alice$
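One caveat about this "successful" run: every request above came back with HTTP 403, which Scrapy's httperror middleware ignores by default, so no pages were actually saved. Sites sometimes return 403 simply because of Scrapy's default User-Agent; a hedged sketch of a per-spider override is below (the UA string is just an example, and no setting helps for dmoz.org specifically, since the site itself has since shut down):
import scrapy

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    # Per-spider setting override: a browser-like User-Agent sometimes gets
    # past UA-based filtering (it cannot revive a dead site, though).
    custom_settings = {"USER_AGENT": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13)"}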
That is the end of this tutorial on setting up a Scrapy environment for Python web crawlers. For more on setting up Scrapy, please search our previous articles, and we hope you will continue to support us!