A Dataless Cube #4447

bjlittle · 2021-12-02T15:48:11Z

✨ Feature Request

I think it's healthy to challenge established norms...

I want the ability to create a dataless cube. By this I mean the ability to create a hyper-space defined only by metadata, i.e. with no data payload.

Once data is added to the cube, then the dimensionality is established and locked down, as we traditionally know and accept.

Motivation

Such hyper-spaces could be used in various ways e.g.,

as a factory to manufacture fully formed cubes for test data
as the target hyper-space in a regridding or interpolation transformation

I'm sure there are more concrete use cases... Please do share them on this issue if you know of any 🙏

Traditionally, there are many situations where a cube enforcing that it must have data is simply an inconvenience. Given the natural progression of model resolutions it seems "just wrong" to abuse dask to create lazy data that will never be used. It reeks of something not being quite right to me.

Let's do something about that 😉

Please upvote this issue if you'd like to see this happen 👍

Steps

Complete work described in A Dataless Cube #4447 (comment)
Write up remaining 'visions' for a dataless Cube in separate issues. Encourage the users that have upvoted this issue to vote on one/more of the new issues if that outstanding work is important to them
Close this issue

DPeterK · 2021-12-02T17:06:35Z

@bjlittle supermegahypercubes! That is, a cube that describes how huge numbers of incoming datasets would tile together to make an n-dimensional hyperstructure - think, for example, of representing an entire model run in a single object. This would ideally be represented as a metadata-only cube, with individual data payloads very much fetched on demand only, given the vast quantities of data such an object would represent.

We've considered this idea from a variety of different perspectives in the Informatics Lab, and we think it has legs. We've also given the idea a bunch of different names, but supermegahypercubes is the best, most whimsical and original name we came up with for the concept 🙂

pp-mo · 2021-12-19T10:33:08Z

@bjlittle are you including here the idea that possibly only some of the data might be "filled", with some of it left unidentified?
So, that might be closer to the idea previously suggested which I think was maybe called a "hypercube", probably in the Informatics Lab? IIRC that was certainly raised before but we never managed to get around to seriously considering it.
( @DPeterK I can't find an issue link for this -- can you maybe help? )

P.S. as a name, for that idea at least, I think "hypothicube" is neater (though for language purists that should probably be "hypothecube" 😉 )

edmundhenley-mo · 2022-03-17T16:10:31Z

@bjlittle - re your concrete use-cases:
If useful to see some (~pedestrian, non hyp[er|o]cube-y) code-in-wild examples of target hyperspace for interpolation/regridding, I've got a couple here (sorry, only viewable internally@MO).
Almost certainly not optimal, but guessing poss still useful to see non-expert usage!

Adding a np.zeros .data cargo when defining target cube for a model->model regrid, cargo immediately getting discarded when interpolation/regridding applied. Used here. In this case just a very small cargo, so not especially wasteful to create/discard, but can see that in other cases would be!
Alternatively, a similar ~pointless NaNing of data here, for some obs->model comparisons, where the target cube (to eventually accept some interpolated observations) was based off another (model data). Subsequent wrangling of metadata to remove entries irrelevant in the obs data case.
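
For a sense of scale, the cost of such a throwaway payload can be estimated from the target shape alone. The sketch below is illustrative only: the helper names and both grid shapes are made up for this example, and the zero-stride trick via np.broadcast_to is a generic NumPy workaround, not something proposed in this thread or checked against the Iris regridders.

```python
import numpy as np


def placeholder_nbytes(shape, dtype=np.float64):
    """Bytes that a dense placeholder array of this shape would occupy."""
    n_points = 1
    for length in shape:
        n_points *= int(length)
    return n_points * np.dtype(dtype).itemsize


def zero_stride_placeholder(shape, dtype=np.float64):
    """Read-only zeros with the full shape, backed by a single element."""
    return np.broadcast_to(np.zeros((), dtype=dtype), shape)


# Hypothetical grid shapes, chosen only to contrast small and large targets.
small_grid = (145, 192)
large_grid = (70, 7680, 10240)

for shape in (small_grid, large_grid):
    gigabytes = placeholder_nbytes(shape) / 1e9
    print(shape, f"{gigabytes:.3f} GB as a dense float64 array")

# The broadcast view reports the large shape without allocating it.
cheap = zero_stride_placeholder(large_grid)
print(cheap.shape, cheap.strides, cheap.flags.writeable)
```

A dataless cube would make both the dense allocation and this kind of workaround unnecessary, since the shape would come from the metadata alone.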

edmundhenley-mo · 2022-03-17T16:59:46Z

@pp-mo - dunno re issue, but wonder if you're recalling the part-filled example in Jacob's hypotheticube article? Or poss another Informatics Lab article? (here's @DPeterK's one on supermegahypercubes)

philip-brohan · 2022-10-06T09:31:00Z

I feed streams of cubes through Machine Learning software (TensorFlow - TF). This requires throwing away the metadata and operating only on the data arrays, and then laboriously reconstructing metadata around the output data. It would be great to be able to cut a cube into data and metadata components, process them separately and recombine them later.
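
The split / process / recombine workflow described here might look roughly like the sketch below. It deliberately avoids Iris: ToyCube, split and recombine are hypothetical names invented for illustration, with a small dataclass standing in for a cube, and the temperature conversion standing in for the machine learning step. In Iris today, the closest existing equivalent of the recombine step is probably cube.copy(data=...), which still needs a data-bearing cube to start from.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple

import numpy as np


@dataclass(frozen=True, eq=False)
class ToyCube:
    """Hypothetical stand-in for a cube: metadata plus an optional payload."""

    name: str
    units: str
    dim_names: Tuple[str, ...]
    shape: Tuple[int, ...]
    data: Optional[np.ndarray] = None


def split(cube):
    """Return a metadata-only template and the bare data array."""
    return replace(cube, data=None), cube.data


def recombine(template, data):
    """Attach an array to a template, checking that it fits the metadata."""
    data = np.asarray(data)
    if data.shape != template.shape:
        raise ValueError(f"Expected shape {template.shape}, got {data.shape}.")
    return replace(template, data=data)


cube = ToyCube(
    name="air_temperature",
    units="K",
    dim_names=("latitude", "longitude"),
    shape=(2, 3),
    data=np.full((2, 3), 280.0),
)

template, payload = split(cube)

# Stand-in for the array-only processing step (e.g. a TensorFlow model).
processed = payload - 273.15

result = recombine(replace(template, units="degC"), processed)
print(result.name, result.units, result.data.shape)
```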

pp-mo · 2024-07-01T11:05:21Z

In Dragon Taming ™️ discussion today, I suggested that we should, as far as possible, "contain" code changes within the DataManager class, i.e. no or minimal change should be required in Cube code.

Just as a hint for implementation, it is also very simple to make a lazy array which has no data, so can participate normally in any lazy operations, but can't be fetched.
You just need an object which supports shape, dtype, ndim and __getitem__, and you wrap it with dask.array.from_array.
I've written code like this a few times now!

Here's a simple working example.

import dask.array as da
import numpy as np

class FakeArray:
    def __init__(self, shape, dtype):
        if not isinstance(dtype, np.dtype):
            dtype = np.dtype(dtype)
        self.dtype = dtype
        self.shape = shape
        self.ndim = len(shape)  # Dask requires ndim as well as shape, for some reason

    def __getitem__(self, keys):
        raise ValueError("FakeArray cannot be read.")

def lazy_fake(shape, dtype=np.float64):
    """A functional lazy array with known shape and dtype, but no actual data."""
    arr = FakeArray(shape, dtype)
    # Note: must pass 'meta' to from_array, to prevent it making a test data access
    meta = np.zeros((), dtype=arr.dtype)
    return da.from_array(arr, meta=meta)

>>> my_fake = lazy_fake((3, 4), 'i2')
>>> print('fake = ', my_fake)
fake =  dask.array<array, shape=(3, 4), dtype=int16, chunksize=(3, 4), chunktype=numpy.ndarray>
>>> print('fake.meta = ', repr(my_fake._meta))
fake.meta =  array([], shape=(0, 0), dtype=int16)
>>> print('fake[0] = ', my_fake[0])
fake[0] =  dask.array<getitem, shape=(4,), dtype=int16, chunksize=(4,), chunktype=numpy.ndarray>
>>> print(my_fake.compute())
Traceback (most recent call last):
  File "/home/h05/itpp/Support/periods/period_20240710_ugridsprintx1/dev/fake_arrays.py", line 29, in <module>
    print(my_fake.compute())
          ^^^^^^^^^^^^^^^^^
  File "/tmp/persistent/newconda-envs/iris3/lib/python3.11/site-packages/dask/base.py", line 342, in compute
    (result,) = compute(self, traverse=False, **kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/persistent/newconda-envs/iris3/lib/python3.11/site-packages/dask/base.py", line 628, in compute
    results = schedule(dsk, keys, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/h05/itpp/Support/periods/period_20240710_ugridsprintx1/dev/fake_arrays.py", line 14, in __getitem__
    raise ValueError("FakeArray cannot be read.")
ValueError: FakeArray cannot be read.
>>>
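
As a further check on the claim that such an array "can participate normally in any lazy operations", the sketch below pushes a fake array through a transpose, a slice and a stack, and only then tries to compute. It re-declares FakeArray and lazy_fake so that it runs on its own, and it assumes a dask version that behaves like the one in the transcript above; nothing here has been tested inside Iris.

```python
import dask.array as da
import numpy as np


class FakeArray:
    """Array-like with shape and dtype but no readable content."""

    def __init__(self, shape, dtype):
        self.shape = tuple(shape)
        self.dtype = np.dtype(dtype)
        self.ndim = len(self.shape)

    def __getitem__(self, keys):
        raise ValueError("FakeArray cannot be read.")


def lazy_fake(shape, dtype=np.float64):
    arr = FakeArray(shape, dtype)
    # Passing 'meta' stops from_array from sampling the array contents.
    meta = np.zeros((), dtype=arr.dtype)
    return da.from_array(arr, meta=meta)


fake = lazy_fake((3, 4), "i2")

# Each of these only extends the task graph; no data is touched.
transposed = fake.T
window = fake[1:, :2]
stacked = da.stack([fake, fake])

for name, lazy in [("transposed", transposed), ("window", window), ("stacked", stacked)]:
    print(name, lazy.shape, lazy.dtype)

# Only an actual fetch reaches FakeArray.__getitem__ and fails.
message = None
try:
    window.compute()
except ValueError as err:
    message = str(err)
print("compute failed with:", message)
```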

edmundhenley-mo · 2024-07-01T11:31:59Z

To clarify my (mis)understanding of what you mean @pp-mo - the DataManager class is in user-space code?
i.e. user-written and maintained, not part of iris?

pp-mo · 2024-07-01T15:01:33Z

> To clarify my (mis)understanding of what you mean @pp-mo - the DataManager class is in user-space code? i.e. user-written and maintained, not part of iris?

Ah no, not that actually.
The DataManager is absolutely a part of Iris.
It encapsulates the different types of array content that we can have in a cube.data or coord.points/bounds + gives them a common API. For now, that basically means real or lazy array.

So I was just hoping that, since we already have this class encapsulating the possible array types, it would be neat if we can support "dataless" purely by extending what a DataManager can do, rather than by making a bunch of changes elsewhere, e.g. in the Cube class.

pp-mo · 2024-07-01T15:05:41Z

P.S. further clarification (hopefully)
My previous code example is also suggesting that it might be possible to implement dataless content as "just a special lazy array".
It's not yet clear if it can be quite that simple, though.
And even if it can, we might still want to distinguish "dataless" content in a more definite way.

ESadek-MO · 2024-09-25T15:05:20Z

We have looked into dataless cubes. We've decided that the first step into dataless cubes is to create a cube with coords, but no data.
You can create a cube with nothing in it, but creating an empty cube with coords throws an error; coords need dimensions.

This is checked via ndim, which has no setter. This is calculated in the DataManager, using shape.

We believe that shape should be settable, but only (and non-optionally) if data hasn't been set. This will require changing the DataManager.

DataManager(data, shape:optional):
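
One way to read that signature is sketched below. This is a hypothetical stand-in written for illustration, not the Iris DataManager and not the implementation in any pull request: the class name, the is_dataless method and the rule that later data must match the declared shape are all assumptions. It encodes the behaviour described above: shape is required when there is no data, settable only while there is no data, and derived from the array once data exists.

```python
import numpy as np


class SketchDataManager:
    """Hypothetical stand-in for illustration; NOT the real Iris DataManager."""

    def __init__(self, data=None, shape=None):
        if (data is None) == (shape is None):
            raise ValueError("Give exactly one of 'data' or 'shape'.")
        self._data = None if data is None else np.asanyarray(data)
        self._shape = None if shape is None else tuple(shape)

    @property
    def shape(self):
        # With data present, the array itself is the source of truth.
        return self._shape if self._data is None else self._data.shape

    @shape.setter
    def shape(self, value):
        if self._data is not None:
            raise AttributeError("'shape' is read-only once data has been set.")
        self._shape = tuple(value)

    @property
    def ndim(self):
        return len(self.shape)

    def is_dataless(self):
        return self._data is None

    @property
    def data(self):
        if self._data is None:
            raise ValueError("This manager holds no data.")
        return self._data

    @data.setter
    def data(self, value):
        value = np.asanyarray(value)
        # Assumption: data attached later must match the current shape.
        if value.shape != self.shape:
            raise ValueError(f"Expected shape {self.shape}, got {value.shape}.")
        self._data = value
        self._shape = None


manager = SketchDataManager(shape=(3, 4))
manager.shape = (2, 5)           # allowed: still dataless
manager.data = np.zeros((2, 5))  # dimensionality is now locked
print(manager.ndim, manager.shape, manager.is_dataless())
```

Because ndim is computed from shape in both states, code that only needs dimensionality (such as attaching coords) would not have to care whether data is present.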

bjlittle added the New: Feature label Dec 2, 2021

trexfeathers removed the New: Feature label Jun 15, 2022

trexfeathers added this to 🚴 Peloton Jun 23, 2023

trexfeathers added the Dragon 🐉 https://github.com/orgs/SciTools/projects/19?pane=info label Jul 10, 2023

trexfeathers added this to 🐉 Dragon Taming Jul 10, 2023

trexfeathers moved this to 📌 Prioritised in 🐉 Dragon Taming Aug 3, 2023

scitools-ci bot removed this from 🚴 Peloton Dec 15, 2023

scitools-ci bot added this to 🚴 Peloton Dec 15, 2023

trexfeathers mentioned this issue Feb 22, 2024

Convenience function for making a regridding target Cube of specified shape #5770

Open

trexfeathers assigned HGWright and ESadek-MO Feb 23, 2024

trexfeathers moved this from 📌 Prioritised to 🛡 Championed in 🐉 Dragon Taming Feb 23, 2024

trexfeathers unassigned ESadek-MO Sep 26, 2024

trexfeathers moved this from 🛡 Championed to 🦎 Tamed in 🐉 Dragon Taming Sep 26, 2024

HGWright assigned ESadek-MO and trexfeathers and unassigned HGWright Dec 5, 2024

ESadek-MO linked a pull request Dec 11, 2024 that will close this issue

Implement dataless cubes #6253

Open

4 tasks

trexfeathers assigned bjlittle and unassigned trexfeathers Dec 18, 2024

pp-mo mentioned this issue Dec 18, 2024

A User Guide page on creating a Cube from scratch #6267

Open
