Scrapy feed export storage backend for Azure Storage.
Requires Python 3.8+. Install with:

```bash
pip install git+https://github.com/scrapy-plugins/scrapy-feedexporter-azure-storage
```
Add this storage backend to the `FEED_STORAGES` Scrapy setting. For example:

```python
# settings.py
FEED_STORAGES = {'azure': 'scrapy_azure_exporter.AzureFeedStorage'}
```
Configure authentication via any of the following settings:

- `AZURE_CONNECTION_STRING`
- `AZURE_ACCOUNT_URL_WITH_SAS_TOKEN`
- `AZURE_ACCOUNT_URL` and `AZURE_ACCOUNT_KEY` - If using this method, specify both of them.

For example:

```python
AZURE_ACCOUNT_URL = "https://<your-storage-account-name>.blob.core.windows.net/"
AZURE_ACCOUNT_KEY = "Account key for the Azure account"
```
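Alternatively, a single connection string can be used. A minimal sketch, assuming a standard Azure Storage connection string (the value below is a placeholder, not a real credential):

```python
# settings.py
# Connection-string authentication (alternative to URL + account key).
AZURE_CONNECTION_STRING = (
    "DefaultEndpointsProtocol=https;"
    "AccountName=<your-storage-account-name>;"
    "AccountKey=<your-account-key>;"
    "EndpointSuffix=core.windows.net"
)
```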
Configure in the `FEEDS` Scrapy setting the Azure URI where the feed needs to be exported:

```python
FEEDS = {
    "azure://<account_name>.blob.core.windows.net/<container_name>/<file_name.extension>": {
        "format": "json",
    }
}
```
The `overwrite` feed option is `False` by default when using this feed export storage backend.

An extra feed option is also provided, `blob_type`, which can be `"BlockBlob"` (default) or `"AppendBlob"`. See *Understanding blob types* in the Azure documentation.

The feed options `overwrite` and `blob_type` can be combined to set the write mode of the feed export (see the sketch after this list):

- `overwrite=False` and `blob_type="BlockBlob"` create the blob if it does not exist, and fail if it exists.
- `overwrite=False` and `blob_type="AppendBlob"` append to the blob if it exists and it is an `AppendBlob`, and create it otherwise.
- `overwrite=True` overwrites the blob, even if it exists. The `blob_type` must match that of the target blob.
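For example, a feed that appends new items on every crawl could combine these options as follows (a sketch; the account, container, and file names are placeholders):

```python
# settings.py
FEEDS = {
    "azure://<account_name>.blob.core.windows.net/<container_name>/items.jsonl": {
        "format": "jsonlines",
        # Keep existing data and append to it on each run.
        "overwrite": False,
        "blob_type": "AppendBlob",
    }
}
```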
To use Azure Blob Storage with Scrapy media pipelines, just add the pipeline to your Scrapy settings:

```python
ITEM_PIPELINES = {
    "scrapy_azure_exporter.AzureFilesPipeline": 1,
}
```
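A fuller sketch of the media pipeline configuration, assuming the pipeline honors Scrapy's standard `FILES_STORE` setting with an `azure://` URI, the way `FilesPipeline` does for `s3://` URIs (the account, container, and key are placeholders):

```python
# settings.py
ITEM_PIPELINES = {
    "scrapy_azure_exporter.AzureFilesPipeline": 1,
}

# Assumption: downloaded files are stored under this azure:// URI.
FILES_STORE = "azure://<account_name>.blob.core.windows.net/<container_name>/files"

# Authentication is configured with the same settings as for feed exports.
AZURE_ACCOUNT_URL = "https://<your-storage-account-name>.blob.core.windows.net/"
AZURE_ACCOUNT_KEY = "Account key for the Azure account"
```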
You can use Azurite as a storage emulator for Azure Blob Storage and test your application locally. Just add (or set) a feed storage entry under the `azurite` scheme:

```python
# settings.py
FEED_STORAGES = {'azurite': 'scrapy_azure_exporter.AzureFeedStorage'}
```
And add the Azurite URI to the `FEEDS` setting:

```python
FEEDS = {
    "azurite://<ip>:<port>/<account_name>/<container_name>/[<file_name.extension>]": {
        # ...
    }
}
```
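For instance, with Azurite running locally on its defaults, a concrete configuration might look like this (a sketch; it assumes connection-string authentication works against the emulator, and uses Azurite's well-known development account and key, which are publicly documented defaults rather than secrets):

```python
# settings.py
# Azurite's default blob endpoint is 127.0.0.1:10000 and its default
# development account is "devstoreaccount1" with a publicly documented key.
AZURE_CONNECTION_STRING = (
    "DefaultEndpointsProtocol=http;"
    "AccountName=devstoreaccount1;"
    "AccountKey=Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IFsuFq2UVErCz4I6tq/K1SZFPTOtr/KBHBeksoGMGw==;"
    "BlobEndpoint=http://127.0.0.1:10000/devstoreaccount1;"
)

FEEDS = {
    "azurite://127.0.0.1:10000/devstoreaccount1/test-container/items.json": {
        "format": "json",
    }
}
```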
Finally, run your Scrapy project as you usually would with `FilesPipeline` or `ImagesPipeline`.
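As a usage illustration, a minimal spider that feeds the files pipeline might look like this (the spider name, start URL, and CSS selector are hypothetical):

```python
import scrapy


class PdfSpider(scrapy.Spider):
    # Hypothetical spider, for illustration only.
    name = "pdf_spider"
    start_urls = ["https://example.com/reports"]

    def parse(self, response):
        # Standard FilesPipeline contract: URLs to download go in the
        # "file_urls" field; download results are added under "files".
        yield {
            "file_urls": response.css("a.pdf::attr(href)").getall(),
        }
```

Run it with `scrapy crawl pdf_spider`; downloaded files end up under the configured `FILES_STORE` location.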