Data access: add and clarify the documentation #27

troyraen · 2024-04-26T22:37:23Z

Specific suggestions and questions to think about:

State clearly that all (public) NASA data is accessible from Fornax, regardless of where the data lives (in an archive's in-house storage or in cloud storage).
Explain when and why users should care about where the datasets live.
Use case 1: "I really care about X dataset and already know how to access it."
- More than likely, they're accessing in-house data. What basic instructions can we provide to help them determine when and why they should go to the effort of looking for it somewhere else (i.e., in cloud storage)?
Use case 2: "I really care about Y targets (stars, galaxies, ...) and am doing a mass search for data in NASA archives."
- Can we provide a listing of all NASA data that is available in cloud storage, and basic info about how to get it? (Similar to IRSA's listing of datasets available in cloud storage and related Cloud Access Intro tutorial.)

Some background:

Currently ~all of the data being put in cloud storage by NASA archives is a copy of what they're already serving from their in-house storage. If the user wants to access the cloud copy, they'll usually have to make an explicit choice to do this. But they've probably never had to make this kind of choice before and the current documentation is not very clear about what is available from where. Users may assume that if the data is available in cloud storage, that's what they'll automatically be accessing without having to do anything different or proactive.

This confusion is compounded when we point users to the NASA-NAVO Workshops Notebooks. They are a very useful overview of NASA data, but AFAIK they don't contain any information about accessing data from cloud storage. Since the Fornax documentation emphasizes cloud-hosted data and the NAVO documentation doesn't even mention it, users may assume that by following the NAVO tutorials they are already accessing cloud-hosted data.

jkrick · 2024-04-29T21:27:14Z

> Users may assume that if the data is available in cloud storage, that's what they'll automatically be accessing without having to do anything different or proactive.

Can we set things up so that all users in Fornax automatically access the cloud copy of the data if one exists? Maybe something like an environment variable? I think that would let us keep much of this information away from the user.
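
To make the environment-variable idea concrete, here is a minimal sketch of what such a switch might look like. Everything in it is an assumption for illustration: the variable name `FORNAX_PREFER_CLOUD`, the `LOCATIONS` table, the `resolve_location` helper, and the placeholder paths do not exist in Fornax or in any archive.

```python
import os

# Hypothetical lookup table: dataset name -> known locations.
# The names and paths are placeholders, not real archive locations.
LOCATIONS = {
    "example-catalog": {
        "in_house": "https://archive.example.org/example-catalog.parquet",
        "cloud": "s3://example-bucket/example-catalog.parquet",
    },
    "example-image-set": {
        # No cloud copy exists for this dataset.
        "in_house": "https://archive.example.org/example-image.fits",
    },
}


def resolve_location(dataset, env=None):
    """Return the cloud copy if the user opted in and one exists,
    otherwise fall back to the in-house copy."""
    env = os.environ if env is None else env
    locations = LOCATIONS[dataset]
    # FORNAX_PREFER_CLOUD is a hypothetical variable name.
    prefer_cloud = env.get("FORNAX_PREFER_CLOUD", "0") == "1"
    if prefer_cloud and "cloud" in locations:
        return locations["cloud"]
    return locations["in_house"]
```

A sketch like this only chooses a location. It still depends on someone maintaining the table, and the caller still has to open the returned path with whatever arguments that storage system needs.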

troyraen · 2024-04-30T00:58:09Z

That would be great, though I don't see a solution right now. The ways I know how to tell people to load data all involve having to know the path (and thus the location), such as `pd.read_parquet('path-to-catalog')` or `astropy.io.fits.open('path-to-image')`. Also, accessing cloud storage sometimes requires different arguments to handle the different filesystem and/or the permissions/credentials.
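
As a rough illustration of the "different arguments" point, the sketch below builds the extra keyword arguments that the two readers accept for a path in a public S3 bucket. The helper names are hypothetical, and the sketch assumes anonymous access is allowed, that `s3fs`/`fsspec` are installed, and that the astropy version is recent enough to support `use_fsspec`. Other cloud providers and non-public buckets need different options.

```python
from urllib.parse import urlparse

# Only S3 is handled here; the "anon" option is specific to s3fs.
CLOUD_SCHEMES = {"s3"}


def is_cloud_path(path):
    """True if the path points at cloud object storage (S3 in this sketch)."""
    return urlparse(path).scheme in CLOUD_SCHEMES


def parquet_kwargs(path):
    """Extra keyword arguments for pandas.read_parquet (hypothetical helper)."""
    if is_cloud_path(path):
        return {"storage_options": {"anon": True}}
    return {}


def fits_kwargs(path):
    """Extra keyword arguments for astropy.io.fits.open (hypothetical helper)."""
    if is_cloud_path(path):
        return {"use_fsspec": True, "fsspec_kwargs": {"anon": True}}
    return {}


# Intended use, with the placeholder paths from the comment above:
#   pd.read_parquet(path, **parquet_kwargs(path))
#   astropy.io.fits.open(path, **fits_kwargs(path))
```

Even with helpers like these, the user still has to supply the path, so this does not remove the need to know where the data lives.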

I don't think any one solution will ever work for all use cases because there are so many different ways to access data. But potentially in a more narrow context... I know there is ongoing cloud-access related work happening in the astropy + VO universe. Maybe some option for this is envisioned? I'm not up on the details enough to know.
