Migrate treenode module. #8757

flamingbear · 2024-02-15T21:04:11Z

Migrates datatree's treenode module, which contains the base class representing a node of a tree, with methods for accessing nodes and for traversal.

Completes step 2 (datatree/treenode.py) of "Track merging datatree into xarray" #8572
Tests added or updated
[N/A] User visible changes (including notable bug fixes) are documented in whats-new.rst
Internal Changes (including notable bug fixes) are documented in whats-new.rst
[N/A] New functions/methods are listed in api.rst

xarray/tests/datatree/test_treenode.py

PR (#8702) added an nbytes representation to the DataArray and Dataset reprs; this adds it to the datatree tests.

Moves treenode.py and test_treenode.py. Updates some typing. Updates imports from treenode.

Add test tree structure for easier understanding.

There must be a better way, but I don't know it, particularly for the list comprehension casts.

This test was broken because only the root node was being tested and none of the previous nodes were represented in the __str__.

doc/whats-new.rst

TomNicholas · 2024-02-16T18:33:23Z

Thanks @flamingbear. I am just wondering if you have any overall thoughts on the design / implementation here? e.g. the TreeNode -> NamedNode -> DataTree inheritance hierarchy, or the way that the tree is constructed as connections between individual class instances (rather than representing the entire tree internally via a single nested dict, for example).

EDIT: If you think it's all fine then that's great, I'm just trying to create an explicit opportunity for you to ask "why didn't you do it this other way?"

xarray/tests/datatree/test_treenode.py

flamingbear · 2024-02-16T21:14:12Z

> overall thoughts on the design / implementation here?

So my overall thought as I was moving it / grokking it was that the separation of concerns between the movement, linking of nodes and parenting rules vs the finding by names and paths was pretty clear. The hooks for before and after attaching children seem thoughtful. I haven't dived deep into the datatree code yet but this all makes sense so far. If you were to scrap this for a nested dictionary, wouldn't you end up with pretty complicated code to handle moving/adding nodes? My gut says this is a solid choice to represent tree data.
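To make the trade-off concrete, here is a minimal sketch of a tree built from linked node instances. The `Node` class, its `attach` method and its `path` property are hypothetical names chosen for illustration; this is not xarray's `TreeNode` implementation, only an approximation of the idea that each node knows its parent and children.

```python
from typing import Dict, Optional


class Node:
    """Hypothetical stand-in for a tree node; not xarray's TreeNode."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.parent: Optional["Node"] = None
        self.children: Dict[str, "Node"] = {}

    def attach(self, child: "Node") -> None:
        # Parenting rule: a node cannot be attached beneath itself or its descendants.
        ancestor: Optional["Node"] = self
        while ancestor is not None:
            if ancestor is child:
                raise ValueError("cannot attach a node beneath itself")
            ancestor = ancestor.parent
        # Detach from the old parent so the node has exactly one parent.
        if child.parent is not None:
            del child.parent.children[child.name]
        child.parent = self
        self.children[child.name] = child

    @property
    def path(self) -> str:
        # The path is derived from the links, so it never goes stale after a move.
        if self.parent is None:
            return "/"
        return self.parent.path.rstrip("/") + "/" + self.name


root, a, b, c = Node("root"), Node("a"), Node("b"), Node("c")
root.attach(a)
root.attach(b)
a.attach(c)
path_before = c.path  # "/a/c"
b.attach(c)  # moving a subtree is a single re-attachment
path_after = c.path  # "/b/c"
```

With a single nested dict, the same move would require locating the subtree by walking keys from the root, popping it, and re-inserting it elsewhere, and the cycle check would have to be written against paths rather than object identity.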

keewis · 2024-02-16T09:56:38Z

xarray/core/treenode.py

@@ -25,7 +21,7 @@ class InvalidTreeError(Exception):


 class NotFoundInTreeError(ValueError):
-    """Raised when operation can't be completed because one node is part of the expected tree."""
+    """Raised when operation can't be completed because one node is not part of the expected tree."""


 class NodePath(PurePosixPath):


If we're exposing it (and keep it as a pure path, which I think we should), should we move this class to a separate module?

Are we exposing NodePath? I suggested it, but after talking with @etienneschalk decided maybe it wasn't particularly helpful (xarray-contrib/datatree#205).

What's the logic for making it a pure path? The posix path is to ensure the slashes are in a consistent direction (remember this isn't a real filepath, so it shouldn't change automatically on Windows systems).

I am not sure about this, but I believe the difference is that PurePosixPath doesn't have methods that require read/write access (like touch, remove, mkdir, iterdir, ...), while PosixPath does.

The main difference between filesystems and datatree is that there's only a single filesystem instance at a time, while you can have multiple DataTree objects at the same time. This means that a NodePath inheriting from PosixPath would have to be created from the DataTree object and would require a lot of discipline to not be confusing.

On the other hand, a NodePath that inherits from PurePosixPath would help a bit with path manipulations (name, root, joinpath(), parts, parent, parents, ...) that we would otherwise have to use posixpath for. I didn't check if posixpath has the same functionality as PurePosixPath, though, and they might also not be as convenient to use.
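As a rough illustration of the points above, the following sketch compares a few `PurePosixPath` manipulations with their string-based `posixpath` counterparts, and checks that the pure class lacks filesystem methods. The example path is made up; note that the deletion method on concrete paths is `unlink`, so that is the name checked here.

```python
import posixpath
from pathlib import PurePosixPath

path = PurePosixPath("/group/subgroup/variable")

# Pure paths support lexical manipulation only; nothing here touches a disk.
name = path.name
parent = path.parent
parts = path.parts
sibling = path.parent.joinpath("other")
ancestors = [str(p) for p in path.parents]

# posixpath offers string-based equivalents for some of these operations.
same_name = posixpath.basename(str(path)) == name
same_parent = posixpath.dirname(str(path)) == str(parent)
same_join = posixpath.join(str(parent), "other") == str(sibling)

# Methods that need read/write access exist on concrete paths only.
io_methods = ["touch", "mkdir", "iterdir", "unlink"]
missing_on_pure = [m for m in io_methods if not hasattr(PurePosixPath, m)]
```

Because `PurePosixPath` never consults the operating system, the forward-slash separator stays the same on every platform.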

Sorry, I wrote my above comment too quickly - we both agree it should stay as a pure path.

We could also consider adding extra methods to this NodePath class if it simplifies other parts of the code.

Related issue: xarray-contrib/datatree#283 (Consistency between DataTree methods and pathlib.PurePath methods)

To me the holy grail would be to have datatree more and more compatible with PurePosixPath (what NodePath already does by inheriting from it, but we can imagine having an API accepting both strings and PurePosixPath, without the need to expose NodePath publicly).

As @keewis mentioned, there is a single instance of the filesystem whereas there can be multiple DataTrees. So a single DataTree could act like a "small filesystem on its own" and implement the PosixPath methods.

NodePath ~ PurePosixPath
DataTree ~ PosixPath

The idea here is to delegate to pathlib the difficult task of choosing the best names for methods that help manipulate trees. For an audience of developers who are already used to writing scripts with pathlib, this would make working with datatree automatic! And this also gives plenty of ideas for features. Also, with hierarchical formats such as Zarr that rely on the filesystem and can be opened by datatree, the proximity is innate.

Concrete example: working with multi-resolution rasters (a use case of datatree). By using f-strings and glob, you can write resolution-independent code once, like you would have done by manipulating a data structure on the file system.
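A sketch of what such resolution-independent code could look like, using only the standard library. The node paths, the `select` helper and the `band_path` helper are all hypothetical; no DataTree API is assumed, and the tree is represented here as a plain list of path strings.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# Hypothetical node paths of a multi-resolution raster tree (names are made up).
node_paths = [
    "/measurements/r10m/b02",
    "/measurements/r10m/b03",
    "/measurements/r20m/b02",
    "/measurements/r20m/b05",
    "/quality/r10m/mask",
]


def select(paths, pattern):
    """Return the paths matching a glob-style pattern.

    fnmatch's "*" also matches "/", unlike shell globbing, which is
    acceptable for this small illustration.
    """
    return [p for p in paths if fnmatchcase(p, pattern)]


def band_path(resolution, band):
    """Build a node path for any resolution with an f-string."""
    return PurePosixPath("/measurements") / f"r{resolution}m" / band


# The same pattern works regardless of which resolutions are present.
all_b02 = select(node_paths, "/measurements/r*m/b02")
b05_at_20m = str(band_path(20, "b05"))
```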

The schema on the pathlib doc (arrows are inheritance):

flowchart BT
    PurePosixPath --> PurePath
    Path --> PurePath
    PureWindowsPath --> PurePath
    PosixPath --> PurePosixPath
    PosixPath --> Path
    WindowsPath --> PureWindowsPath
    WindowsPath --> Path

How I see it:

flowchart BT
    subgraph Pure
        PurePosixPath --> PurePath
        NodePath --> PurePosixPath
    end
    subgraph Concrete
        DataTree --> NodePath
        PosixPath --> PurePosixPath
    end

(I removed Path for simplification)

I am not 100% sure it fully makes sense... The nuance is that when you instantiate a Path, it is always implicit that it is bound to the filesystem, while each DataTree instantiation creates its own data representation.
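The pathlib part of these diagrams can be checked directly. The sketch below defines a bare `NodePath` subclass purely for illustration (a hypothetical stand-in, not the class from the treenode module) and confirms that it sits on the pure side of the hierarchy.

```python
from pathlib import (
    Path,
    PosixPath,
    PurePath,
    PurePosixPath,
    PureWindowsPath,
    WindowsPath,
)


class NodePath(PurePosixPath):
    """Hypothetical minimal stand-in for the NodePath discussed here."""


# Inheritance arrows from the pathlib documentation.
pathlib_edges = [
    (PurePosixPath, PurePath),
    (Path, PurePath),
    (PureWindowsPath, PurePath),
    (PosixPath, PurePosixPath),
    (PosixPath, Path),
    (WindowsPath, PureWindowsPath),
    (WindowsPath, Path),
]
edges_hold = all(issubclass(child, base) for child, base in pathlib_edges)

# NodePath is a pure path: lexical operations only, no filesystem binding.
node_is_pure = issubclass(NodePath, PurePosixPath) and not issubclass(NodePath, Path)
node_has_io = hasattr(NodePath, "exists")

# Path arithmetic preserves the subclass.
child = NodePath("/root") / "group"
```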

Thank you @etienneschalk for all your thoughts here!

> So a single DataTree could act like a "small filesystem on its own" and implement the PosixPath methods.

> I am not 100% sure it fully makes sense... The nuance is that when you instantiate a Path, it is always implicit that it is bound to the filesystem, while each DataTree instantiation creates its own data representation.

I strongly feel that DataTree / NodePath objects should not be associated with concrete filesystem paths.

However this

> To me the holy grail would be to have datatree more and more compatible with PurePosixPath (what NodePath already does by inheriting from it, but we can imagine having an API accepting both strings and PurePosixPath, without the need to expose NodePath publicly).

seems like a reasonable idea, and I think we should continue the discussion of that in xarray-contrib/datatree#283.

For now I don't think there is anything we need to do in this PR specifically - we can revisit the API choices / accepting path-like types in a future PR.

TomNicholas · 2024-02-19T17:22:19Z

> personally, would probably rename test_io.py to something related to datatree io and move all of the tests into xarray/tests at the top level.

That seems fine to me. Integrate the IO tests because the backends are integrated. Separate the tests of tree functionality so that it's easy to test different parts of the codebase independently during development (that's why having a separate test_treenode.py is useful). But whether test_treenode.py lives in a subdirectory or not just doesn't really matter, I think.

On Mon, Feb 19, 2024, 11:02 AM Justus Magin wrote, on xarray/tests/datatree/test_treenode.py:

> let's discuss this in the meeting tomorrow, but yeah, I don't have any strong opinions.

Question: I did update below the released line to give Tom some credit. I hope that's allowable.

doc/whats-new.rst

Integrated tests.

flamingbear · 2024-02-20T17:56:07Z

So that failure looks like a common one as opposed to something from this PR. Is there anything specific to address in the PR before it can be merged?

TomNicholas · 2024-02-20T18:41:24Z

> But whether test_treenode.py lives in a subdirectory or not just doesn't really matter I think.

We made an executive decision in the datatree meeting today to not bother with subdirectories (for tests or code). We can always reorganise this later if we want to.

TomNicholas · 2024-02-20T18:43:21Z

> Is there anything specific to address in the PR before it can be merged?

I don't think so - @keewis what say you?



flamingbear and others added 6 commits February 15, 2024 14:18

Update the formatting tests

065db0e

PR (#8702) added an nbytes representation to the DataArray and Dataset reprs; this adds it to the datatree tests.

Migrate treenode module

f167b99

Moves treenode.py and test_treenode.py. Updates some typing. Updates imports from treenode.

Update NotFoundInTreeError description.

6bd492f

Reformat some comments

b62a21a

Add test tree structure for easier understanding.

Updates whats-new.rst

32053b6

mypy typing. (terrible?)

32e7453

There must be a better way, but I don't know it, particularly for the list comprehension casts.

flamingbear force-pushed the mhs/migrate_treenode branch from 182a5ac to 32e7453 on February 15, 2024 21:19

TomNicholas added the topic-DataTree label (related to the implementation of a DataTree class) Feb 15, 2024

flamingbear added 2 commits February 16, 2024 10:57

Adds __repr__ to NamedNode and updates test

7f2a178

This test was broken because only the root node was being tested and none of the previous nodes were represented in the __str__.

Merge remote-tracking branch 'pydata/main' into mhs/migrate_treenode

b4fb773


flamingbear marked this pull request as ready for review February 16, 2024 18:04

flamingbear added 3 commits February 16, 2024 12:06

Adds quotes to NamedNode __str__ representation.

548536b

swaps " for ' in NamedNode __str__ representation.

b9685d7

Adding Tom in so he gets blamed properly.

59da654


keewis reviewed Feb 17, 2024


flamingbear added 2 commits February 19, 2024 12:27

Merge remote-tracking branch 'pydata/main' into mhs/migrate_treenode

b821fca

resolve conflict whats-new.rst

4530973

Question: I did update below the released line to give Tom some credit. I hope that's allowable.

flamingbear added 2 commits February 19, 2024 15:08

Moves test_treenode.py to xarray/tests.

c03d373

Integrated tests.

Merge branch 'main' into mhs/migrate_treenode

ccd5374

TomNicholas approved these changes Feb 20, 2024


flamingbear and others added 10 commits February 20, 2024 16:27

refactors backend tests for datatree IO

4830cd4

Add explicit engine back in test_to_zarr

e8459bc

Removes OrderedDict from treenode

4238ce2

Renames tests/test_io.py -> tests/test_backends_datatree.py

db264c0

typo

b4acb0d

Add types

0f4b38a

Merge pull request #7 from flamingbear/mhs/migrate-datatree-tests

e7ea2b4

Pass mypy for 3.9

f2f327f

Merge branch 'main' into mhs/migrate_treenode

9cbaf3b

Merge branch 'main' into mhs/migrate_treenode

9db7040

shoyer approved these changes Feb 27, 2024


TomNicholas merged commit dfdd631 into pydata:main Feb 27, 2024
29 checks passed

flamingbear deleted the mhs/migrate_treenode branch February 28, 2024 22:02

TomNicholas mentioned this pull request Apr 9, 2024

Track merging datatree into xarray #8572

Closed

27 tasks
