Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HTTP cache not working in some cases #125

Open
Gallaecio opened this issue Aug 21, 2023 · 5 comments
Open

HTTP cache not working in some cases #125

Gallaecio opened this issue Aug 21, 2023 · 5 comments

Comments

@Gallaecio
Copy link
Contributor

I have seen 2 people now having trouble with HTTP cache in combination with scrapy-zyte-api.

They set HTTPCACHE_ENABLED to True, and they get NotSupported("Response content isn't text").

I could not reproduce the issue myself with the tutorial spider, so I am not sure why it happens.

If someone affected by this issue can provide a minimal, reproducible example, that would be great!

@Gallaecio
Copy link
Contributor Author

I got someone else last week reporting the same issue. I could not reproduce it either.

I wonder if it is a simple matter of getting a random empty body response due to an undetected ban. The default cache policy will not ignore an HTTP 200 response even if its body is empty, so it will be cached and fail as reported when trying to use response.text (which is a separate Scrapy issue, empty responses should be considered text responses).

In those cases, creating a custom cache policy that ignores empty responses should be relatively easy (subclass the DummyPolicy, override should_cache_response).

@Gallaecio
Copy link
Contributor Author

One thing that does happen is that responses from the cache do not use the original response class, so you cannot access raw_api_response from cached responses.

@Gallaecio
Copy link
Contributor Author

Gallaecio commented Jan 31, 2024

@kmike @wRAR I see 2 possible approaches here:

  • Modify upstream Scrapy to support custom response classes declaring how they are serialized.
  • Implement a custom file system storage class for scrapy-zyte-api, to be configured as HTTPCACHE_STORAGE.

The first approach seems the best to me, but I am not 100% sure how to best implement such an API in a way that plays well with the existing storage classes (DBM and file system) in a backward-compatible way.

The second approach would be easier to implement, and cover most practical scenarios, but make corner case scenarios rather difficult (e.g. supporting additional Response classes beyond those from scrapy-zyte-api, or storing the cache somewhere other than the file system).

Any thoughts?

@jpabbuehl
Copy link

@Gallaecio I'm hitting this exact issue right now, and would love to know if going with option 2 is the preferred option for now. If yes, where to start? i see the raw_api_response is not cached here, so it should be straightforward? https://github.com/scrapy/scrapy/blob/67ab8d4650c1e9212c9508803c7b5265e166cbaa/scrapy/extensions/httpcache.py#L371-L382

@Gallaecio
Copy link
Contributor Author

Gallaecio commented Sep 10, 2024

Yes, you could implement option 2 on your own, subclass FilesystemCacheStorage:

  1. Extend store_response to include the serialization of raw_api_response.
  2. Override retrieve_response to build a scrapy_zyte_api.responses.ZyteAPITextResponse that includes it, instead of a Response that does not.
  3. Configure your custom class as the HTTPCACHE_STORAGE backend.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants