
A Dataless Cube #4447

Open
3 tasks
bjlittle opened this issue Dec 2, 2021 · 10 comments · May be fixed by #6253
Assignees
Labels
Dragon 🐉 https://github.com/orgs/SciTools/projects/19?pane=info

Comments

@bjlittle
Member

bjlittle commented Dec 2, 2021

✨ Feature Request

I think it's healthy to challenge established norms...

I want the ability to create a dataless cube. By this I mean the ability to create a hyper-space defined only by metadata, i.e., with no data payload.

Once data is added to the cube, then the dimensionality is established and locked down, as we traditionally know and accept.

Motivation

Such hyper-spaces could be used in various ways e.g.,

  • as a factory to manufacture fully formed cubes for test data
  • as the target hyper-space in a regridding or interpolation transformation

I'm sure there are more concrete use cases... Please do share them on this issue if you know of any 🙏

There are many situations where a cube's insistence that it must have data is simply an inconvenience. Given the natural progression of model resolutions, it seems "just wrong" to abuse dask to create lazy data that will never be used. It reeks of something not being quite right to me.

Let's do something about that 😉

Please up vote this issue if you'd like to see this happen 👍

Steps

  • Complete work described in A Dataless Cube #4447 (comment)
  • Write up remaining 'visions' for a dataless Cube in separate issues. Encourage the users that have upvoted this issue to vote on one/more of the new issues if that outstanding work is important to them
  • Close this issue
@DPeterK
Member

DPeterK commented Dec 2, 2021

@bjlittle supermegahypercubes! That is, a cube that describes how huge numbers of incoming datasets would tile together to make an n-dimensional hyperstructure - think, for example, of representing an entire model run in a single object. This would ideally be represented as a metadata-only cube, with individual data payloads very much fetched on demand only, given the vast quantities of data such an object would represent.

We've considered this idea from a variety of different perspectives in the Informatics Lab, and we think it has legs. We've also given the idea a bunch of different names, but supermegahypercubes is the best, most whimsical and original name we came up with for the concept 🙂

@pp-mo
Member

pp-mo commented Dec 19, 2021

@bjlittle are you including here the idea that possibly only some of the data might be "filled", with the rest left unspecified?
So, that might be closer to the idea previously suggested, which I think was maybe called a "hypercube", probably in the Informatics Lab? IIRC that was certainly raised before, but we never managed to get around to seriously considering it.
( @DPeterK I can't find an issue link for this -- maybe can you help ? )

P.S. as a name, for that idea at least, I think "hypothicube" is neater (though for language purists that should probably be "hypothecube" 😉 )

@edmundhenley-mo

edmundhenley-mo commented Mar 17, 2022

@bjlittle - re your concrete use-cases:
If useful to see some (~pedestrian, non hyp[er|o]cube-y) code-in-wild examples of target hyperspace for interpolation/regridding, I've got a couple here (sorry, only viewable internally@MO).
Almost certainly not optimal, but guessing poss still useful to see non-expert usage!

  • Adding a np.zeros .data cargo when defining target cube for a model->model regrid, cargo immediately getting discarded when interpolation/regridding applied. Used here. In this case just a very small cargo, so not especially wasteful to create/discard, but can see that in other cases would be!
  • Alternatively, a similar ~pointless NaNing of data here, for some obs->model comparisons, where the target cube (to eventually accept some interpolated observations) was based off another (model data). Subsequent wrangling of metadata to remove entries irrelevant in the obs data case.

@edmundhenley-mo

@pp-mo - dunno re issue, but wonder if you're recalling the part-filled example in Jacob's hypotheticube article? Or possibly another Informatics Lab article? (Here's @DPeterK's one on supermegahypercubes.)

@philip-brohan

I feed streams of cubes through Machine Learning software (TensorFlow - TF). This requires throwing away the metadata and operating only on the data arrays, and then laboriously reconstructing metadata around the output data. It would be great to be able to cut a cube into data and metadata components, process them separately and recombine them later.
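The split/process/recombine pattern described above can be sketched in a few lines. This is a numpy-only illustration: the `CubeLike` class is a deliberately minimal stand-in for an Iris cube (in real Iris you would work with `cube.data` and rebuild via something like `Cube.copy(data=...)`), not a proposal for an actual API.

```python
import numpy as np
from dataclasses import dataclass


@dataclass
class CubeLike:
    """Stand-in for an Iris cube: some metadata plus a data payload."""
    name: str
    units: str
    data: np.ndarray


def split(cube):
    """Separate a cube into a metadata-only cube and the raw data array."""
    meta = CubeLike(cube.name, cube.units, data=None)
    return meta, cube.data


def recombine(meta, data):
    """Rebuild a full cube from a metadata-only cube and a processed array."""
    return CubeLike(meta.name, meta.units, data=np.asarray(data))


# Run the raw array through some ML-style processing, then restore
# the metadata around the output.
cube = CubeLike("air_temperature", "K", np.arange(6.0).reshape(2, 3))
meta, raw = split(cube)
processed = raw * 2.0  # stands in for e.g. a TensorFlow model call
result = recombine(meta, processed)
```

A dataless cube would make the metadata-only half of this round trip a first-class object rather than an ad-hoc container.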

@trexfeathers trexfeathers added the Dragon 🐉 https://github.com/orgs/SciTools/projects/19?pane=info label Jul 10, 2023
@trexfeathers trexfeathers moved this to 📌 Prioritised in 🐉 Dragon Taming Aug 3, 2023
@scitools-ci scitools-ci bot removed this from 🚴 Peloton Dec 15, 2023
@trexfeathers trexfeathers moved this from 📌 Prioritised to 🛡 Championed in 🐉 Dragon Taming Feb 23, 2024
@pp-mo
Member

pp-mo commented Jul 1, 2024

In Dragon Taming ™️ discussion today, I suggested that we should, as far as possible, "contain" code changes within the DataManager class, i.e. no or minimal change should be required in Cube code.

Just as a hint for implementation, it is also very simple to make a lazy array which has no data, so can participate normally in any lazy operations, but can't be fetched.
You just need an object which supports : shape, dtype, ndim and __getitem__, and you wrap it with dask.array.from_array :
I've written code like this a few times, now !

Here's a simple working example.

import dask.array as da
import numpy as np

class FakeArray:
    def __init__(self, shape, dtype):
        if not isinstance(dtype, np.dtype):
            dtype = np.dtype(dtype)
        self.dtype = dtype
        self.shape = shape
        self.ndim = len(shape)  # Dask requires ndim as well as shape, for some reason

    def __getitem__(self, keys):
        raise ValueError("FakeArray cannot be read.")

def lazy_fake(shape, dtype=np.float64):
    """A functional lazy array with known shape and dtype, but no actual data."""
    arr = FakeArray(shape, dtype)
    # Note: must pass 'meta' to from_array, to prevent it making a test data access
    meta = np.zeros((), dtype=arr.dtype)
    return da.from_array(arr, meta=meta)
>>> my_fake = lazy_fake((3, 4), 'i2')
>>> print('fake = ', my_fake)
fake =  dask.array<array, shape=(3, 4), dtype=int16, chunksize=(3, 4), chunktype=numpy.ndarray>
>>> print('fake.meta = ', repr(my_fake._meta))
fake.meta =  array([], shape=(0, 0), dtype=int16)
>>> print('fake[0] = ', my_fake[0])
fake[0] =  dask.array<getitem, shape=(4,), dtype=int16, chunksize=(4,), chunktype=numpy.ndarray>
>>> print(my_fake.compute())
Traceback (most recent call last):
  File "/home/h05/itpp/Support/periods/period_20240710_ugridsprintx1/dev/fake_arrays.py", line 29, in <module>
    print(my_fake.compute())
          ^^^^^^^^^^^^^^^^^
  File "/tmp/persistent/newconda-envs/iris3/lib/python3.11/site-packages/dask/base.py", line 342, in compute
    (result,) = compute(self, traverse=False, **kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/persistent/newconda-envs/iris3/lib/python3.11/site-packages/dask/base.py", line 628, in compute
    results = schedule(dsk, keys, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/h05/itpp/Support/periods/period_20240710_ugridsprintx1/dev/fake_arrays.py", line 14, in __getitem__
    raise ValueError("FakeArray cannot be read.")
ValueError: FakeArray cannot be read.
>>> 

@edmundhenley-mo

To clarify my (mis)understanding of what you mean @pp-mo - the DataManager class is in user-space code?
i.e. user-written and maintained, not part of iris?

@pp-mo
Member

pp-mo commented Jul 1, 2024

To clarify my (mis)understanding of what you mean @pp-mo - the DataManager class is in user-space code? i.e. user-written and maintained, not part of iris?

Ah no, not that actually.
The DataManager is absolutely a part of Iris.
It encapsulates the different types of array content that we can have in a cube.data or coord.points/bounds + gives them a common API. For now, that basically means real or lazy array.

So I was just hoping that, since we already have this class encapsulating the possible array types, it would be neat if we can support "dataless" purely by extending what a DataManager can do, rather than by making a bunch of changes elsewhere, e.g. in the Cube class.
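To make the encapsulation idea concrete, here is a minimal sketch of the role a DataManager plays: one object presenting a common API over the different payload kinds. This is not Iris's actual implementation; a zero-argument callable stands in for a dask lazy array so the example stays numpy-only.

```python
import numpy as np


class DataManager:
    """Sketch: one API over real and "lazy" data payloads.

    Real Iris supports numpy arrays and dask lazy arrays; here a
    zero-argument callable plays the part of a lazy array.
    """

    def __init__(self, data):
        self._data = data

    @property
    def has_lazy_data(self):
        # A callable payload has not been realised yet.
        return callable(self._data)

    @property
    def data(self):
        if self.has_lazy_data:
            # Realise the payload on first access, then cache it.
            self._data = np.asarray(self._data())
        return self._data

    @property
    def shape(self):
        return self.data.shape


real = DataManager(np.zeros((2, 3)))
lazy = DataManager(lambda: np.ones((4,)))
```

Extending a class like this with a third, "dataless" payload kind (shape and dtype known, data absent) is the containment being suggested: the Cube would keep talking to the same API.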

@pp-mo
Member

pp-mo commented Jul 1, 2024

P.S. further clarification (hopefully)
My previous code example is also suggesting that it might be possible to implement dataless content as "just a special lazy array".
It's not yet clear if it can be quite that simple, though.
And even if it can, we might still want to distinguish "dataless" content in a more definite way.

@ESadek-MO
Contributor

We have looked into dataless cubes. We've decided that the first step into dataless cubes is to create a cube with coords, but no data.
You can create a cube with nothing in it, but creating an empty cube with coords throws an error; coords need dimensions.

This is checked via ndim, which has no setter; it is calculated in the DataManager, using shape.

We believe that shape should be settable, but only (and non-optionally) if data hasn't been set. This will require changing the DataManager.

DataManager(data, shape: optional)
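A minimal sketch of what that signature change might look like: shape is accepted only when data is absent, and required in that case, so ndim stays derivable (and setter-free) either way. The names here (`is_dataless` in particular) are illustrative, not taken from the linked PR.

```python
import numpy as np


class DataManager:
    """Sketch of the proposed change: accept a shape instead of data.

    Exactly one of `data` or `shape` must be given; a shape-only
    manager describes a dataless payload but still defines ndim.
    """

    def __init__(self, data=None, shape=None):
        if (data is None) == (shape is None):
            raise ValueError("Provide exactly one of 'data' or 'shape'.")
        self._data = None if data is None else np.asarray(data)
        self._shape = tuple(shape) if shape is not None else self._data.shape

    @property
    def shape(self):
        return self._shape

    @property
    def ndim(self):
        # ndim still has no setter: it is derived from shape either way.
        return len(self._shape)

    @property
    def is_dataless(self):
        return self._data is None
```

With this in place, a cube could attach coords against a shape-only DataManager, since the dimensionality question ("coords need dimensions") is answered without any data payload.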

@trexfeathers trexfeathers moved this from 🛡 Championed to 🦎 Tamed in 🐉 Dragon Taming Sep 26, 2024
@HGWright HGWright assigned ESadek-MO and trexfeathers and unassigned HGWright Dec 5, 2024
@ESadek-MO ESadek-MO linked a pull request Dec 11, 2024 that will close this issue
4 tasks
8 participants