HTTP cache not working in some cases #125

Gallaecio · 2023-08-21T09:19:37Z

I have seen 2 people now having trouble with HTTP cache in combination with scrapy-zyte-api.

They set HTTPCACHE_ENABLED to True, and they get NotSupported("Response content isn't text").

I could not reproduce the issue myself with the tutorial spider, so I am not sure why it happens.

If someone affected by this issue can provide a minimal, reproducible example, that would be great!

The text was updated successfully, but these errors were encountered:

Gallaecio · 2023-11-06T08:47:42Z

I got someone else last week reporting the same issue. I could not reproduce it either.

I wonder if it is a simple matter of getting a random empty body response due to an undetected ban. The default cache policy will not ignore an HTTP 200 response even if its body is empty, so it will be cached and fail as reported when trying to use response.text (which is a separate Scrapy issue, empty responses should be considered text responses).

In those cases, creating a custom cache policy that ignores empty responses should be relatively easy (subclass the DummyPolicy, override should_cache_response).

Gallaecio · 2023-12-29T14:24:02Z

One thing that does happen is that responses from the cache do not use the original response class, so you cannot access raw_api_response from cached responses.

Gallaecio · 2024-01-31T13:40:52Z

@kmike @wRAR I see 2 possible approaches here:

Modify upstream Scrapy to support custom response classes declaring how they are serialized.
Implement a custom file system storage class for scrapy-zyte-api, to be configured as HTTPCACHE_STORAGE.

The first approach seems the best to me, but I am not 100% sure how to best implement such an API in a way that plays well with the existing storage classes (DBM and file system) in a backward-compatible way.

The second approach would be easier to implement, and cover most practical scenarios, but make corner case scenarios rather difficult (e.g. supporting additional Response classes beyond those from scrapy-zyte-api, or storing the cache somewhere other than the file system).

Any thoughts?

jpabbuehl · 2024-09-10T06:58:10Z

@Gallaecio I'm hitting this exact issue right now, and would love to know if going with option 2 is the preferred option for now. If yes, where to start? i see the raw_api_response is not cached here, so it should be straightforward? https://github.com/scrapy/scrapy/blob/67ab8d4650c1e9212c9508803c7b5265e166cbaa/scrapy/extensions/httpcache.py#L371-L382

Gallaecio · 2024-09-10T07:09:57Z

Yes, you could implement option 2 on your own, subclass FilesystemCacheStorage:

Extend store_response to include the serialization of raw_api_response.
Override retrieve_response to build a scrapy_zyte_api.responses.ZyteAPITextResponse that includes it, instead of a Response that does not.
Configure your custom class as the HTTPCACHE_STORAGE backend.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

HTTP cache not working in some cases #125

HTTP cache not working in some cases #125

Gallaecio commented Aug 21, 2023

Gallaecio commented Nov 6, 2023

Gallaecio commented Dec 29, 2023

Gallaecio commented Jan 31, 2024 •

edited

Loading

jpabbuehl commented Sep 10, 2024

Gallaecio commented Sep 10, 2024 •

edited

Loading

HTTP cache not working in some cases #125

HTTP cache not working in some cases #125

Comments

Gallaecio commented Aug 21, 2023

Gallaecio commented Nov 6, 2023

Gallaecio commented Dec 29, 2023

Gallaecio commented Jan 31, 2024 • edited Loading

jpabbuehl commented Sep 10, 2024

Gallaecio commented Sep 10, 2024 • edited Loading

Gallaecio commented Jan 31, 2024 •

edited

Loading

Gallaecio commented Sep 10, 2024 •

edited

Loading