[bug fix] Added fit=true parameter when setting a dataframe #1318

Merged
merged 1 commit into from
Dec 1, 2023


linglp
Contributor

@linglp linglp commented Oct 19, 2023

Related to: https://sagebionetworks.jira.com/browse/FDS-919

Context:

We ran into a problem when generating a manifest for HTAN. See the parameters used for manifest generation below:
schema_url: https://raw.githubusercontent.com/ncihtan/data-models/main/HTAN.model.jsonld
data_type: ImagingLevel1
dataset_id: syn27782965
output_format: google_sheet

The error:

  File "/Users/lpeng/Library/Caches/pypoetry/virtualenvs/schematicpy-9OomxyhV-py3.10/lib/python3.10/site-packages/googleapiclient/_helpers.py", line 130, in positional_wrapper
    return wrapped(*args, **kwargs)
  File "/Users/lpeng/Library/Caches/pypoetry/virtualenvs/schematicpy-9OomxyhV-py3.10/lib/python3.10/site-packages/googleapiclient/http.py", line 938, in execute
    raise HttpError(resp, content, uri=self.uri)
googleapiclient.errors.HttpError: <HttpError 400 when requesting https://sheets.googleapis.com/v4/spreadsheets/1RwqsmRz8Hlub-zZpLP_vy7Nad9_LHLrSvh0Ek0kZFyg/values/%27Sheet1%27%21A11539%3AN15385?valueInputOption=USER_ENTERED&alt=json returned "Range (Sheet1!A11539:N15385) exceeds grid limits. Max rows: 11538, max columns: 25". Details: "Range (Sheet1!A11539:N15385) exceeds grid limits. Max rows: 11538, max columns: 25">

Why did this error occur?

In schematic, the final step is essentially copying and pasting the generated dataframe into a Google Sheet. The library we use is pygsheets, and the function is set_dataframe (see the pygsheets documentation for details). The error happened because the Google Sheet that we initially generate has only 1000 rows (see the documentation), while the dataframe we paste is much bigger, so the write range falls outside the sheet's grid (refer to the linked comment).
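The failure condition can be illustrated with a small check. This is only a sketch of the comparison the Sheets API appears to make, using the numbers from the error message above; the helper names `col_letter_to_index` and `exceeds_grid` are hypothetical and are not part of pygsheets or schematic.

```python
def col_letter_to_index(letters: str) -> int:
    """Convert an A1-style column label (e.g. "N") to a 1-based column index."""
    index = 0
    for ch in letters.upper():
        index = index * 26 + (ord(ch) - ord("A") + 1)
    return index


def exceeds_grid(last_row: int, last_col: int, max_rows: int, max_cols: int) -> bool:
    """Return True if a write ending at (last_row, last_col) falls outside the grid."""
    return last_row > max_rows or last_col > max_cols


# Numbers taken from the error: range A11539:N15385, max rows 11538, max columns 25.
range_is_too_big = exceeds_grid(15385, col_letter_to_index("N"), 11538, 25)
```

Here the column bound is fine (N is within 25 columns), so it is the row bound alone that triggers the error.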

Solution

Luckily, the fix is simple. We can resize the Google Sheet that we initially generated to fit the bigger dataframe by passing fit=True to set_dataframe, as described in the documentation: https://pygsheets.readthedocs.io/en/stable/worksheet.html?highlight=set_dataframe#pygsheets.Worksheet.set_dataframe
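A minimal sketch of what the change amounts to is shown here. The wrapper name `paste_manifest` is hypothetical and is not the actual schematic function; the only real API used is pygsheets' `Worksheet.set_dataframe` with its `fit` keyword.

```python
from typing import Any, Tuple


def paste_manifest(worksheet: Any, manifest_df: Any, start: Tuple[int, int] = (1, 1)) -> None:
    """Write manifest_df into the worksheet, resizing the sheet to fit it.

    `worksheet` is expected to be a pygsheets Worksheet (or any object with a
    compatible set_dataframe method); `manifest_df` is a pandas DataFrame.
    """
    # fit=True asks pygsheets to resize the worksheet to the dataframe's
    # dimensions before writing, instead of keeping the initial grid size.
    worksheet.set_dataframe(manifest_df, start, fit=True)
```

With a real worksheet this would be called as `paste_manifest(wks, manifest_df)`. Note that resizing is still subject to the workbook cell limit discussed under "Cell limit".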

To test this PR

Use the parameters listed above to generate a Google Sheet manifest for HTAN.

Cell limit

To understand the cell limit, try the script in the comment below.

@linglp
Contributor Author

linglp commented Oct 19, 2023

To experiment with the cell limit:

import json
import os
from typing import Any, Dict

import numpy as np
import pandas as pd
import pygsheets as ps
from google.oauth2 import service_account

from schematic.configuration.configuration import CONFIG

# Scopes needed to create and edit spreadsheets.
SCOPES = [
    "https://www.googleapis.com/auth/spreadsheets",
    "https://www.googleapis.com/auth/drive",
]


def build_service_account_creds() -> Dict[str, Any]:
    """Load service account credentials from the environment, else from the schematic config."""
    if "SERVICE_ACCOUNT_CREDS" in os.environ:
        dict_creds = json.loads(os.environ["SERVICE_ACCOUNT_CREDS"])
        credentials = service_account.Credentials.from_service_account_info(
            dict_creds, scopes=SCOPES
        )
    else:
        credentials = service_account.Credentials.from_service_account_file(
            CONFIG.service_account_credentials_path, scopes=SCOPES
        )

    return {"creds": credentials}


services_creds = build_service_account_creds()

# Authorize pygsheets and create a fresh spreadsheet to experiment with.
gc = ps.authorize(custom_credentials=services_creds["creds"])
sh = gc.create("my google sheet")
wks = sh[0]

# Play around with the worksheet limit using set_dataframe.
# The limit seems to be 10000000 cells when using the set_dataframe function.
mock_manifest_df = pd.DataFrame(index=np.arange(999999), columns=np.arange(1))
wks.set_dataframe(mock_manifest_df, (1, 1), fit=True)

The above runs without errors. But if you increase the number of rows by 1, so that the mock dataframe becomes:

mock_manifest_df = pd.DataFrame(index=np.arange(1000000), columns=np.arange(1))

You get an error related to the cell limit:

  File "/Users/lpeng/Library/Caches/pypoetry/virtualenvs/schematicpy-9OomxyhV-py3.10/lib/python3.10/site-packages/googleapiclient/http.py", line 938, in execute
    raise HttpError(resp, content, uri=self.uri)
googleapiclient.errors.HttpError: <HttpError 400 when requesting https://sheets.googleapis.com/v4/spreadsheets/1n8CLrthzK3SHSU74Ud3E_Rgq6KfzJSJBVoesM8UM8qQ:batchUpdate?fields=%2A&alt=json returned "Invalid requests[0].updateSheetProperties: This action would increase the number of cells in the workbook above the limit of 10000000 cells.". Details: "Invalid requests[0].updateSheetProperties: This action would increase the number of cells in the workbook above the limit of 10000000 cells.">

@linglp
Contributor Author

linglp commented Oct 24, 2023

I also opened an issue in the pygsheets repo: nithinmurali/pygsheets#589

Contributor

@mialy-defelice mialy-defelice left a comment

I checked that I was able to generate the HTAN manifest.

@mialy-defelice
Contributor

@linglp I thought I had approved. If you re-request my review, I'll approve.

@linglp linglp merged commit 7f5fd6d into develop Dec 1, 2023
3 checks passed
@linglp linglp deleted the develop-fix-pygsheet-fds919 branch December 1, 2023 15:56