Implement upload API #1085

psilabs-dev · 2024-10-05T23:37:53Z

Implementation of an archive upload API, that uploads the content of an archive and all metadata, if any, to the LRR server. (#1005)

While it is possible to write a LRR plugin for a website W, it is often the case to use dedicated, feature-rich, and actively maintained downloaders (e.g. PixivUtil2 for Pixiv) that specifically pulls metadata/archives from W. However, moving downloaded content from these downloaders to LRR is still a human process. In the case of PixivUtil2, metadata is stored in a sqlite db, and the storage/language/architecture is different for each downloader.

This API implementation (using multipart/form-data encoding) allows users to perform language-agnostic archive migrations or mirrors from their existing archive servers and downloaders to the LRR server via the API, or write comparatively thin HTTP client wrappers around their downloaders that negotiate with and push to LRR during that application's runtime. The user may wish to attach the SHA1 hash of the file to check for file integrity and avoid corruptions during transit.

PUT /api/archives/upload

Return values (see documentation)

Include in body: file content (binary), filename, title (optional), summary (optional), tags (optional), category ID (optional), plugin calls (?)

Example of a PUT /api/archives/upload in Python with the "example.zip" archive:

import hashlib
import requests

def calculate_sha1(file_path):
    sha1 = hashlib.sha1()
    with open(file_path, 'rb') as fb:
        while chunk := fb.read(8192):
            sha1.update(chunk)
    return sha1.hexdigest()

file_path = "example.zip" # required
file_name = "example.zip" # required, this will be where the file is stored in the server
upload_mime = "application/zip"

# metadata
title = "example title"
tags = "example_date,key:value,artist:example artist"
summary = "example summary"
category_id = None

# calculate checksum (optional)
file_checksum = calculate_sha1(file_path)

data = {
    "title": title,
    "tags": tags,
    "summary": summary,
    "file_checksum": file_checksum
}

with open(file_path, 'rb') as fb:
    files = {'file': (file_name, fb, upload_mime)}
    response = requests.put(
        "http://localhost:3000/api/archives/upload",
        files=files,
        data=data
    )
    print(response.json())

This is a synchronous API implementation. Theoretically, we can also run this API as a Minion job but this is not a long-running task and doesn't justify creating a job, also error handling is more streamlined and the API should not take too many requests to begin with (TBD).

implementation of upload API logic

api exception handling validation and integrity checks

reflect mojo upload object

Thumbnail generation for upload API

psilabs-dev · 2024-10-14T02:22:34Z

Open for review and testing. I've tested this on ~50k cbz archives with metadata (without obvious problems). I've also tested this in combination with/wo automatic metadata plugins, checksum validation, optional metadata (title, tags, summaries), mandatory fields (filename, file data), localhost/reverse proxy.

I'll add that the checksum validation is for when a file is corrupted during transit, not validating a file corrupted at the origin, if you upload a corrupted file it will still be corrupted and not checked.

Difegue

This looks good at a glance to me - I'll have to play with it for a bit.

Would it be possible to get the documentation updated?
This would land in https://github.com/Difegue/LANraragi/blob/dev/tools/Documentation/api-documentation/archive-api.md - Please duplicate an existing {swagger} element and edit it to match. Even if it's not perfect I'll help with suggestions afterwards 👍

lib/LANraragi/Controller/Api/Archive.pm

Difegue · 2024-10-17T02:14:01Z

I guess one additional thought: You mention only one upload can be done at a time with this API, but I'm not seeing any obvious limitations that would prevent running it sequentially.
Is there something that breaks when you call it multiple times in parallel? If so, it might be good to add a lock of some sort to prevent further API calls when an upload is in progress.

I wouldn't put it behind client implementations to manhandle this endpoint to death, so we should take precautions if needed 🤔

psilabs-dev · 2024-10-18T10:02:32Z

Ahh you're right, LRR does support concurrent API calls now that I tested it again. I don't know why last time for some reason my second worker would always hang and lose connections...

After testing concurrent uploads, I'm seeing two (probably fixable) issues with concurrency.

Type 1: when the same archive is uploaded concurrently by two clients, on rare occasions the server will throw a "The file couldn't be renamed in your content folder: No such file or directory" error to the later client. This can happen when the two uploads differ by as much as 30ms.

this is the wierder bug characterized by this response:

<!DOCTYPE html>

<head>
	<title>LANraragi Server Error</title>

	<meta name="viewport" content="width=device-width" />
	<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<meta name="apple-mobile-web-status-bar-style" content="black" />
<meta name="mobile-web-app-capable" content="yes" />
<meta name="apple-mobile-web-app-capable" content="yes" />

	<link type="image/png" rel="icon" href="favicon.ico" />
	<link rel="manifest" href="app.webappmanifest" />
</head>

<body style='background: none repeat scroll 0% 0% brown; color: white; font-family: sans-serif; text-align: center'>

	<img src='/img/flubbed.gif' />
	<br />
	<h2>I flubbed it while rendering your page!</h2>
	It's likely there's a bug with the server at this point.<br />
	<p>
		 at /home/koyomi/lanraragi/script/../lib/LANraragi/Utils/Archive.pm line 336.

	</p>

	<h3>Some more info below if available:</h3>
	<pre>{
  &quot;action&quot; =&gt; &quot;create_archive&quot;,
  &quot;controller&quot; =&gt; &quot;api-archive&quot;
}
</pre>

</body>

</html>

I've noticed that when one worker encounters this error, the other worker will encounter a concurrency error of the first type. This is more rare than the first type of error

psilabs-dev · 2024-10-18T10:04:08Z

This looks good at a glance to me - I'll have to play with it for a bit.

Would it be possible to get the documentation updated? This would land in https://github.com/Difegue/LANraragi/blob/dev/tools/Documentation/api-documentation/archive-api.md - Please duplicate an existing {swagger} element and edit it to match. Even if it's not perfect I'll help with suggestions afterwards 👍

Sure! I'll take a look at this

remove utf downgrade

implementation B

resolve merge conflicts

lib/LANraragi/Controller/Api/Archive.pm

Co-authored-by: Difegue <[email protected]>

psilabs-dev · 2024-11-10T07:50:03Z

Ok, don't think I'll have anything else to add, I'll be doing final builds and stress tests and opening it up for review later.

Difegue · 2024-11-10T20:51:29Z

Sounds good, feel free to undraft once you think it's ready and I'll give it one last passover.

psilabs-dev · 2024-11-12T10:42:41Z

One thing that came to mind before I forget, there are no file size upload limits. I tested whether this handles large files well and I was able to upload a 18gb archive without issue. Although I can't read it due to temp folder limitations.

That said, do we want a 500mb filesize limit?

Difegue · 2024-11-12T16:06:46Z

No, if there's no limitations on the server I wouldn't implement an artificial one in the API.
I imagine people running this behind nginx or similar will encounter the same session size limits they already had with the non-API version. (And they can fix it the same way)

Difegue

One last round of comments from me after reviewing the whole thing anew.

Difegue · 2024-11-14T22:03:29Z

lib/LANraragi/Controller/Api/Archive.pm

+    my $summary         = $self->req->param('summary');
+
+    # return error if archive is not supported.
+    if ( !is_archive($filename) ) {


You don't really need this bit if it's already done in handle_incoming_file, right?

(I'm aware Controller->Upload does the same, but I think I'll eventually deprecate that code path in favor of just using the API in the webclient)

is_archive ensures that the filename has a valid extension. I don't know how removing this will affect the tempdir logic, I'll need to test it out

When the archive check block is removed, an unhandled exception will be thrown when a filename with an invalid extension or no extension is passed:

... fileparse(): need a valid pathname at /home/koyomi/lanraragi/script/../lib/LANraragi/Utils/Archive.pm line 38.

which is the is_pdf subroutine... which is really weird

huh, I guess is_pdf wasn't written to handle invalid filenames at all and the issue just never popped up - good to know.

Difegue · 2024-11-14T22:04:24Z

lib/LANraragi/Controller/Api/Archive.pm

+    my $tempdir                 = tempdir();
+
+    my ( $fn, $path, $ext )     = fileparse( $filename, qr/\.[^.]*/ );
+    my $byte_limit              = LANraragi::Model::Config->enable_cryptofs ? 143 : 255;


I wonder if this shouldn't be enforced at the handle_incoming_file level rather than both in here, in Model->Upload->Download_url, and the old Controller upload.

Might be one for later tho

What was the initial purpose of the byte limit? I only kept it there to mirror, I don't know how this will affect temporary file creation if a filename exceeds this limit.

Well, it's mostly there for synology hardware which has that 255 byte cap, and the notably insane who use cryptoFS on top of that, which limits it down to a further 143 bytes.

I believe this would break even temporary filenames if exceeded.

lib/LANraragi/Controller/Api/Archive.pm

tools/Documentation/api-documentation/archive-api.md

Co-authored-by: Difegue <[email protected]>

holopin-bot · 2024-11-19T11:06:31Z

Congratulations @psilabs-dev, the maintainer of this repository has issued you a holobyte! Here it is: https://holopin.io/holobyte/cm3ocndk629080cmk40k0mke9

This badge can only be claimed by you, so make sure that your GitHub account is linked to your Holopin account. You can manage those preferences here: https://holopin.io/account.
Or if you're new to Holopin, you can simply sign up with GitHub, which will do the trick!

Difegue · 2024-11-19T11:07:43Z

This is good for me - the leftover comments aren't essential and I might get to them at some point during a refactor.

Thanks again for the work!

psilabs-dev added 17 commits October 5, 2024 15:34

set up upload framework

bc72346

initial implementation

89bc617

add title and summary

b1492b8

include category id in upload API

03bd9d2

complete title summary args for handling_incoming_file

6092acf

clean and refactor

f20af34

rename upload to create

e0a9f2e

Merge pull request #1 from psilabs-dev/feature/upload-dev

8a23168

implementation of upload API logic

add exception to handle missing payload

28a0b15

handle missing file better

ca11644

implement sha1 verification

16e87e2

Merge pull request #3 from psilabs-dev/feature/upload-dev

fe3fe09

api exception handling validation and integrity checks

reflect mojo upload object

c9f6544

Merge pull request #4 from psilabs-dev/feature/upload-dev

19b711a

reflect mojo upload object

add thumbnail generation for upload api

d86637d

Merge pull request #5 from psilabs-dev/feature/upload-dev

acecc72

Thumbnail generation for upload API

Merge branch 'dev' into feature/upload

74d5576

psilabs-dev marked this pull request as ready for review October 14, 2024 02:04

Difegue requested changes Oct 17, 2024

View reviewed changes

lib/LANraragi/Controller/Api/Archive.pm Outdated Show resolved Hide resolved

Difegue linked an issue Oct 17, 2024 that may be closed by this pull request

Feature Request: Add api for upload archive #1005

Closed

psilabs-dev marked this pull request as draft October 18, 2024 10:10

psilabs-dev added 5 commits October 19, 2024 16:22

remove utf downgrade

fed1f5b

Merge pull request #6 from psilabs-dev/feature/upload-dev

f0c50fd

remove utf downgrade

Merge branch 'dev' into feature/upload

87cf4ae

standardize api response

1a0fd65

put upload API behind login

6a6f259

psilabs-dev added 5 commits November 1, 2024 02:56

handle_incoming_file return http status code

22a9708

be more explicit with responses

13f677f

Merge pull request #16 from psilabs-dev/feature/upload-dev-b

aa2510c

implementation B

Merge branch 'feature/upload' into dev-resolve-mc

a707fc4

Merge pull request #18 from psilabs-dev/dev-resolve-mc

57077b7

resolve merge conflicts

Difegue requested changes Nov 9, 2024

View reviewed changes

lib/LANraragi/Controller/Api/Archive.pm Outdated Show resolved Hide resolved

psilabs-dev and others added 4 commits November 9, 2024 22:11

Update lib/LANraragi/Controller/Api/Archive.pm

d626832

Co-authored-by: Difegue <[email protected]>

edit response when fail move file

27ae2eb

no need for success status

0fa9d57

upload api docs

5e3ef51

Merge branch 'dev' into feature/upload

5376e0e

this is why i have a chatgpt subscription

3c80f53

psilabs-dev marked this pull request as ready for review November 11, 2024 05:11

psilabs-dev requested a review from Difegue November 11, 2024 05:12

add redis lock response code

0fde3d1

Difegue requested changes Nov 14, 2024

View reviewed changes

psilabs-dev and others added 5 commits November 15, 2024 11:07

Update tools/Documentation/api-documentation/archive-api.md

389511e

Co-authored-by: Difegue <[email protected]>

Update tools/Documentation/api-documentation/archive-api.md

3d1bf1b

Co-authored-by: Difegue <[email protected]>

Update tools/Documentation/api-documentation/archive-api.md

228490a

Co-authored-by: Difegue <[email protected]>

Merge branch 'dev' into feature/upload

7e93d28

replace utf downgrade with redis encode

9d1e06e

Difegue approved these changes Nov 19, 2024

View reviewed changes

Difegue merged commit 53dc4d8 into Difegue:dev Nov 19, 2024
1 check passed

Difegue added the holobyte label Nov 19, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implement upload API #1085

Implement upload API #1085

psilabs-dev commented Oct 5, 2024 •

edited

Loading

psilabs-dev commented Oct 14, 2024 •

edited

Loading

Difegue left a comment

Difegue commented Oct 17, 2024 •

edited

Loading

psilabs-dev commented Oct 18, 2024

psilabs-dev commented Oct 18, 2024

psilabs-dev commented Nov 10, 2024

Difegue commented Nov 10, 2024

psilabs-dev commented Nov 12, 2024

Difegue commented Nov 12, 2024 •

edited

Loading

Difegue left a comment

Difegue Nov 14, 2024

psilabs-dev Nov 15, 2024

psilabs-dev Nov 18, 2024

Difegue Nov 19, 2024

Difegue Nov 14, 2024

psilabs-dev Nov 15, 2024 •

edited

Loading

Difegue Nov 19, 2024

holopin-bot bot commented Nov 19, 2024

Difegue commented Nov 19, 2024

Implement upload API #1085

Implement upload API #1085

Conversation

psilabs-dev commented Oct 5, 2024 • edited Loading

psilabs-dev commented Oct 14, 2024 • edited Loading

Difegue left a comment

Choose a reason for hiding this comment

Difegue commented Oct 17, 2024 • edited Loading

psilabs-dev commented Oct 18, 2024

psilabs-dev commented Oct 18, 2024

psilabs-dev commented Nov 10, 2024

Difegue commented Nov 10, 2024

psilabs-dev commented Nov 12, 2024

Difegue commented Nov 12, 2024 • edited Loading

Difegue left a comment

Choose a reason for hiding this comment

Difegue Nov 14, 2024

Choose a reason for hiding this comment

psilabs-dev Nov 15, 2024

Choose a reason for hiding this comment

psilabs-dev Nov 18, 2024

Choose a reason for hiding this comment

Difegue Nov 19, 2024

Choose a reason for hiding this comment

Difegue Nov 14, 2024

Choose a reason for hiding this comment

psilabs-dev Nov 15, 2024 • edited Loading

Choose a reason for hiding this comment

Difegue Nov 19, 2024

Choose a reason for hiding this comment

holopin-bot bot commented Nov 19, 2024

Difegue commented Nov 19, 2024

psilabs-dev commented Oct 5, 2024 •

edited

Loading

psilabs-dev commented Oct 14, 2024 •

edited

Loading

Difegue commented Oct 17, 2024 •

edited

Loading

Difegue commented Nov 12, 2024 •

edited

Loading

psilabs-dev Nov 15, 2024 •

edited

Loading