Scrapy feed export storage backend for Azure Storage.
Requires Python 3.8+. Install with:

```bash
pip install git+https://github.com/scrapy-plugins/scrapy-feedexporter-azure-storage
```
Add this storage backend to the `FEED_STORAGES` Scrapy setting. For example:

```python
# settings.py
FEED_STORAGES = {'azure': 'scrapy_azure_exporter.AzureFeedStorage'}
```
Configure authentication via any of the following settings:

- `AZURE_CONNECTION_STRING`
- `AZURE_ACCOUNT_URL_WITH_SAS_TOKEN`
- `AZURE_ACCOUNT_URL` and `AZURE_ACCOUNT_KEY` - If using this method, specify both of them.

For example:

```python
AZURE_ACCOUNT_URL = "https://<your-storage-account-name>.blob.core.windows.net/"
AZURE_ACCOUNT_KEY = "Account key for the Azure account"
```
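Alternatively, a single connection string can be used. A minimal sketch, assuming a standard Azure Storage connection string (the value below is a placeholder, not a real credential):

```python
# settings.py
# Connection-string authentication (alternative to URL + account key).
AZURE_CONNECTION_STRING = (
    "DefaultEndpointsProtocol=https;"
    "AccountName=<your-storage-account-name>;"
    "AccountKey=<your-account-key>;"
    "EndpointSuffix=core.windows.net"
)
```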
Configure in the `FEEDS` Scrapy setting the Azure URI where the feed needs to be exported:

```python
FEEDS = {
    "azure://<account_name>.blob.core.windows.net/<container_name>/<file_name.extension>": {
        "format": "json",
    }
}
```
The `overwrite` feed option is `False` by default when using this feed export storage backend.

An extra feed option is also provided, `blob_type`, which can be `"BlockBlob"` (default) or `"AppendBlob"`. See *Understanding blob types* in the Azure documentation.

The feed options `overwrite` and `blob_type` can be combined to set the write mode of the feed export (see the sketch after this list):

- `overwrite=False` and `blob_type="BlockBlob"` create the blob if it does not exist, and fail if it exists.
- `overwrite=False` and `blob_type="AppendBlob"` append to the blob if it exists and it is an `AppendBlob`, and create it otherwise.
- `overwrite=True` overwrites the blob, even if it exists. The `blob_type` must match that of the target blob.
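For example, a feed that appends new items on every crawl could combine these options as follows (a sketch; the account, container, and file names are placeholders):

```python
# settings.py
FEEDS = {
    "azure://<account_name>.blob.core.windows.net/<container_name>/items.jsonl": {
        "format": "jsonlines",
        # Keep existing data and append to it on each run.
        "overwrite": False,
        "blob_type": "AppendBlob",
    }
}
```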
To use Azure Blob Storage with Scrapy media pipelines, just add the pipeline to your Scrapy settings:

```python
ITEM_PIPELINES = {
    "scrapy_azure_exporter.AzureFilesPipeline": 1,
}
```
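A fuller sketch of the media pipeline configuration, assuming the pipeline honors Scrapy's standard `FILES_STORE` setting with an `azure://` URI, the way `FilesPipeline` does for `s3://` URIs (the account, container, and key are placeholders):

```python
# settings.py
ITEM_PIPELINES = {
    "scrapy_azure_exporter.AzureFilesPipeline": 1,
}

# Assumption: downloaded files are stored under this azure:// URI.
FILES_STORE = "azure://<account_name>.blob.core.windows.net/<container_name>/files"

# Authentication is configured with the same settings as for feed exports.
AZURE_ACCOUNT_URL = "https://<your-storage-account-name>.blob.core.windows.net/"
AZURE_ACCOUNT_KEY = "Account key for the Azure account"
```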
You can use Azurite as a storage emulator for Azure Blob Storage and test your application locally. Just add (or set) a feed storage entry under the `azurite` scheme:

```python
# settings.py
FEED_STORAGES = {'azurite': 'scrapy_azure_exporter.AzureFeedStorage'}
```
And add the Azurite URI to the `FEEDS` setting:

```python
FEEDS = {
    "azurite://<ip>:<port>/<account_name>/<container_name>/[<file_name.extension>]": {
        # ...
    }
}
```
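For instance, with Azurite running locally on its defaults, a concrete configuration might look like this (a sketch; it assumes connection-string authentication works against the emulator, and uses Azurite's well-known development account and key, which are publicly documented defaults rather than secrets):

```python
# settings.py
# Azurite's default blob endpoint is 127.0.0.1:10000 and its default
# development account is "devstoreaccount1" with a publicly documented key.
AZURE_CONNECTION_STRING = (
    "DefaultEndpointsProtocol=http;"
    "AccountName=devstoreaccount1;"
    "AccountKey=Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IFsuFq2UVErCz4I6tq/K1SZFPTOtr/KBHBeksoGMGw==;"
    "BlobEndpoint=http://127.0.0.1:10000/devstoreaccount1;"
)

FEEDS = {
    "azurite://127.0.0.1:10000/devstoreaccount1/test-container/items.json": {
        "format": "json",
    }
}
```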
Finally, run your Scrapy project as you usually would with `FilesPipeline` or `ImagesPipeline`.
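As a usage illustration, a minimal spider that feeds the files pipeline might look like this (the spider name, start URL, and CSS selector are hypothetical):

```python
import scrapy


class PdfSpider(scrapy.Spider):
    # Hypothetical spider, for illustration only.
    name = "pdf_spider"
    start_urls = ["https://example.com/reports"]

    def parse(self, response):
        # Standard FilesPipeline contract: URLs to download go in the
        # "file_urls" field; download results are added under "files".
        yield {
            "file_urls": response.css("a.pdf::attr(href)").getall(),
        }
```

Run it with `scrapy crawl pdf_spider`; downloaded files end up under the configured `FILES_STORE` location.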