scheduler.get_comm_cost a significant portion of runtime in merge benchmarks #6899
Comments
That's clearly this line: distributed/distributed/scheduler.py Line 2624 in 1d0701b
which redirects to distributed/distributed/scheduler.py Lines 512 to 521 in 1d0701b
Essentially this performs a set difference between an actual set and a key-view of a dictionary. I assume the key view is converted internally to an actual set, such that all keys are rehashed. Just guessing here, though.
Might actually be better off simply looping here:

for dts in ts.dependencies:
    if dts not in ws.has_what:
        nbytes += dts.nbytes
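As a minimal illustration of the hypothesis above (that the key-view is turned into a real set, rehashing every key), here is a small self-contained microbenchmark, not from the thread. The names deps and has_what are stand-ins for ts.dependencies and ws.has_what, and the sizes are made up.

import timeit

# Stand-ins for ts.dependencies and ws.has_what; sizes are illustrative only.
deps = {f"dep-{i}" for i in range(10_000)}
has_what = {f"dep-{i}": None for i in range(0, 10_000, 2)}

def via_set_difference():
    # Analogue of the current code path: materialise the full difference first.
    missing = deps.difference(has_what.keys())
    return len(missing)

def via_loop():
    # Analogue of the proposed loop: membership tests only, no new set is built.
    count = 0
    for d in deps:
        if d not in has_what:
            count += 1
    return count

print("set.difference:", timeit.timeit(via_set_difference, number=200))
print("explicit loop: ", timeit.timeit(via_loop, number=200))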
@wence- off-topic, but you might find this handy for sharing speedscope profiles: https://gist.github.com/gjoseph92/7bfed4d5c372c619af03f9d22e260353
I will try this out.
It could also be reasonable to have a cutoff for both values and, if they're too large, switch to some O(1) estimate.
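A hypothetical sketch of that cutoff idea follows; this is not the actual scheduler code. ts.dependencies, ws.has_what, and .nbytes follow the names used in the thread, while the threshold, function name, and fallback estimate are invented for illustration.

# Hypothetical sketch, not the scheduler implementation.
CUTOFF = 10_000  # invented threshold; would need tuning against real workloads

def approx_missing_dep_bytes(ts, ws, avg_dep_nbytes=1_000_000):
    deps = ts.dependencies
    if len(deps) > CUTOFF or len(ws.has_what) > CUTOFF:
        # O(1)-style fallback: assume every dependency is missing and has an
        # average size, rather than walking the whole collection.
        return len(deps) * avg_dep_nbytes
    nbytes = 0
    for dts in deps:
        if dts not in ws.has_what:
            nbytes += dts.nbytes
    return nbytes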
OK, here's the result of my experiments:

[Results table comparing total rows/worker against current distributed — data not recoverable from this extraction]
@wence- probably should be a separate issue, but I'd be curious to see what the next thing is that pops out on the profile once …
diff --git a/dask/sizeof.py b/dask/sizeof.py
index f31b0660e..a36874778 100644
--- a/dask/sizeof.py
+++ b/dask/sizeof.py
@@ -141,10 +141,12 @@ def register_pandas():
     @sizeof.register(pd.DataFrame)
     def sizeof_pandas_dataframe(df):
         p = sizeof(df.index)
-        for name, col in df.items():
-            p += col.memory_usage(index=False)
-            if col.dtype == object:
-                p += object_size(col._values)
+        mgr = df._mgr
+        blocks = mgr.blocks
+        n = len(df)
+        for i in mgr.blknos:
+            dtype = blocks[i].dtype
+            p += n * dtype.itemsize
         return int(p) + 1000

     @sizeof.register(pd.Series)

Produces the following results (this is now with the TCP rather than UCX protocol and all pandas dataframes):

[Results table not recoverable from this extraction]
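As an illustrative aside (not from the thread), one can sanity-check an estimator like the patched one above by comparing dask's sizeof dispatch against pandas' own deep memory accounting. The block-based estimate counts only n * dtype.itemsize per column, so object/string data is no longer measured element-wise and the two numbers can diverge for string-heavy frames.

import numpy as np
import pandas as pd
from dask.sizeof import sizeof

df = pd.DataFrame(
    {
        "x": np.random.random(100_000),
        "y": np.random.randint(0, 10, size=100_000),
        "s": ["abcdefgh"] * 100_000,  # object column: under-counted by a block-based estimate
    }
)

print("dask sizeof:      ", sizeof(df))
print("pandas deep usage:", df.memory_usage(deep=True, index=True).sum())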
Just checking in here, did #6931 close this issue, or is there more that folks would like to do?
Yes, it did. I'm about to follow up more coherently on @gjoseph92's last query in a separate issue.
I've been profiling distributed workflows in an effort to understand where there are potential performance improvements to be made (this is ongoing with @gjoseph92, amongst others). I'm particularly interested in scale-out scenarios, where the number of workers is large. I've also been looking at cases where the number of workers is quite small but the dataframes have many partitions: this produces many tasks at a scale where debugging/profiling is a bit more manageable.
The benchmark setup I have builds two dataframes and then merges them on a key column with a specified matching fraction. Each worker gets P partitions with N rows per partition. I use 8 workers. I'm using cudf dataframes (so the merge itself is fast, which means that I notice sequential overheads sooner).
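A rough sketch of that benchmark shape follows; it is not the actual script. The real runs used cudf dataframes on an 8-worker cluster, whereas this sketch builds pandas partitions via dask.dataframe, and the way the matching fraction is produced is illustrative only.

# Sketch only: parameters follow the issue description; the construction is illustrative.
import numpy as np
import pandas as pd
import dask.dataframe as dd
from distributed import Client, wait

N_WORKERS = 8          # workers, as in the issue
P = 100                # partitions per worker
N = 500_000            # rows per partition
MATCH_FRACTION = 0.5   # illustrative; the real benchmark parameterises this

def make_partition(i, n, key_space):
    rng = np.random.default_rng(i)
    return pd.DataFrame(
        {"key": rng.integers(0, key_space, n), "payload": rng.random(n)}
    )

if __name__ == "__main__":
    client = Client(n_workers=N_WORKERS)
    nparts = N_WORKERS * P
    meta = make_partition(0, n=0, key_space=1)
    # A smaller key space on the right frame makes more left keys find a match.
    left = dd.from_map(make_partition, range(nparts), n=N,
                       key_space=nparts * N, meta=meta)
    right = dd.from_map(make_partition, range(nparts), n=N,
                        key_space=int(nparts * N * MATCH_FRACTION), meta=meta)
    merged = left.merge(right, on="key").persist()
    wait(merged)
    client.close()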
Attached are two speedscope plots (and data) of py-spy-based profiling of the scheduler in a scenario with eight workers, P=100, and N=500,000. In a shuffle, the total number of tasks peaks at about 150,000 per the dashboard. The second profile is very noisy since I'm using benfred/py-spy#497 to avoid filtering out Python builtins (so that we can see in more detail what is happening). Interestingly, at this scale we don't see much of a pause for GC (but I am happy to try out more scenarios that might be relevant to #4987).
In this scenario, a single merge takes around 90s. If I do the minimal thing of letting Scheduler.get_comm_cost return 0 immediately, this drops to around 50s (using pandas it drops from 170s to around 130s). From the detailed profile, we can see that the majority of this time is spent in set.difference. I'm sure there's a more reasonable fix that isn't quite such a large hammer.

merge-scheduler-100-chunks-per-worker-no-filter.json.gz
merge-scheduler-100-chunks-per-worker.json.gz
(cc @pentschev, @quasiben, and @rjzamora)
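A minimal sketch of the "large hammer" described above, assuming one patches the scheduler directly: make Scheduler.get_comm_cost return 0 so the dependency scan is skipped. This has to run in the scheduler process (for example via a scheduler preload script), and it removes communication cost from scheduling decisions entirely, so it is only useful for measurement.

# Sketch only: for profiling, not a real fix.
from distributed.scheduler import Scheduler

def _zero_comm_cost(self, ts, ws):
    # Skip the per-dependency scan entirely.
    return 0

Scheduler.get_comm_cost = _zero_comm_cost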