Cache

The cache settings make it easier to restart spiders. Once a cache directory is set, the response to every HTTP request is stored there, so when the engine sends the same request again it reads the response from the local cache instead of going over the network.

Note: To collect fresh data, delete the cache first. With HTTPCACHE_EXPIRATION_SECS = 0, as in the settings below, cached responses never expire.
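Deleting the cache can be done by hand, or with a small helper like the following. This is only a minimal sketch: the function name clear_http_cache is hypothetical (it is not part of Scrapy), and it assumes you pass it the same directory you configured as HTTPCACHE_DIR.

```python
import shutil
from pathlib import Path


def clear_http_cache(cache_dir: str) -> bool:
    """Delete the cache directory and everything in it.

    Hypothetical helper, not a Scrapy API. Returns True if a directory
    was removed, False if there was nothing to delete.
    """
    path = Path(cache_dir)
    if not path.is_dir():
        return False  # nothing has been cached yet
    shutil.rmtree(path)  # removes the directory tree recursively
    return True
```

Calling clear_http_cache('/path/to/cache') before starting a spider would force every request back onto the network on the next run.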

You can configure the cache directory in settings.py, like this:

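HTTPCACHE_ENABLED = True  # turn the HTTP cache on
HTTPCACHE_EXPIRATION_SECS = 0  # 0 means cached responses never expire
HTTPCACHE_DIR = '/path/to/cache'  # set your cache path here!
HTTPCACHE_IGNORE_HTTP_CODES = []  # status codes whose responses are not cached
HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
HTTPCACHE_GZIP = True  # compress cached data with gzip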

If you use a relative path, it is resolved inside the project's .scrapy data directory, so the cache will appear there.
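The relative-path rule can be illustrated with a few lines of Python. This is a sketch of the behaviour described above, not Scrapy's own code: resolve_cache_dir and the project_root parameter are hypothetical names, and it assumes the .scrapy directory sits in the project root.

```python
from pathlib import Path


def resolve_cache_dir(project_root: str, cache_dir: str) -> Path:
    """Return where the cache is expected to end up on disk.

    Hypothetical helper for illustration only. Absolute paths are used
    as given; relative paths are placed under <project_root>/.scrapy.
    """
    path = Path(cache_dir)
    if path.is_absolute():
        return path
    return Path(project_root) / ".scrapy" / path
```

For example, under these assumptions a setting of HTTPCACHE_DIR = 'httpcache' in a project at /home/me/project would resolve to /home/me/project/.scrapy/httpcache.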

