
[FR] PredictNumItersNeeded() 1.4 correction factor #1848

Open
ruben-laso opened this issue Sep 6, 2024 · 23 comments

@ruben-laso

Problem description
In the function PredictNumItersNeeded() there is a 1.4 correction factor.
This causes the experiment's running time to exceed the time specified by --benchmark_min_time by ~40%.
Of course, --benchmark_min_time denotes the minimum amount of time to run the benchmark, but an overrun of 40% seems excessive.
This is particularly relevant on supercomputers, where CPU time is expensive.

  • Why is this estimation done, instead of stopping the iterations when the accumulated "iteration time" exceeds the target time?
  • Is there a reason for selecting 1.4 as a correction factor?
  • In cases where the execution times are not stable, could this prediction be wrong by a large margin?
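For reference, here is a simplified sketch of the prediction logic in question. The constant and the 10% "significance" threshold follow the discussion in this issue; the rest is an approximation for illustration, not the library's exact code:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Simplified sketch of the iteration-count prediction under discussion.
// kFudge is the 1.4 correction factor; everything else approximates the
// behavior described in this thread and is not the library's exact code.
constexpr double kFudge = 1.4;

int64_t PredictNumItersNeeded(double min_time_s, double seconds, int64_t iters) {
  // Only trust the measurement once it covers >10% of the min time;
  // otherwise just multiply the iteration count by 10.
  const bool significant = (seconds / min_time_s) > 0.1;
  double multiplier = significant
                          ? min_time_s * kFudge / std::max(seconds, 1e-9)
                          : 10.0;
  if (multiplier <= 1.0) multiplier = 2.0;  // always make forward progress
  return static_cast<int64_t>(std::ceil(multiplier * static_cast<double>(iters)));
}
```

With --benchmark_min_time=1s, a run that took 0.5s at N iterations would be followed by a run of 2.8x N iterations (~1.4s), which is where the ~40% overrun comes from.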

Suggested solution
I suggest either removing the correction factor or making it configurable (with a default value of 1.0).

Example
As shown in the following output (executed with --benchmark_min_time=1s), the real execution time is ~1.4s: $7039 \times 198481 = 1397107759$, $8775 \times 160984 = 1412634600$, ...

------------------------------------------------------------------------------------------------------------------------
Benchmark                                                              Time             CPU   Iterations UserCounters...
------------------------------------------------------------------------------------------------------------------------
GNU-TBB/std::adjacent_difference/double/1024/manual_time            7039 ns         7022 ns       198481 bytes_per_second=6.50968Gi/s
GNU-TBB/std::adjacent_find/double/1024/manual_time                  8775 ns         8702 ns       160984 bytes_per_second=2.61075Gi/s
GNU-TBB/std::all_of/double/1024/manual_time                         7704 ns         7529 ns       199585 bytes_per_second=2.97387Gi/s
GNU-TBB/std::any_of/double/1024/manual_time                         4707 ns         4625 ns       301878 bytes_per_second=4.86754Gi/s

When executing the same code with --benchmark_min_time=0.71s ($1/1.4 \simeq 0.71$), the execution times are much closer to 1s: $7681 \times 134530 = 1033324930$, $9265 \times 103410 = 958093650$, ...

------------------------------------------------------------------------------------------------------------------------
Benchmark                                                              Time             CPU   Iterations UserCounters...
------------------------------------------------------------------------------------------------------------------------
GNU-TBB/std::adjacent_difference/double/1024/manual_time            7681 ns         7615 ns       134530 bytes_per_second=5.96565Gi/s
GNU-TBB/std::adjacent_find/double/1024/manual_time                  9265 ns         9175 ns       103410 bytes_per_second=2.47288Gi/s
GNU-TBB/std::all_of/double/1024/manual_time                         7996 ns         7572 ns       110333 bytes_per_second=2.86519Gi/s
GNU-TBB/std::any_of/double/1024/manual_time                         4656 ns         4680 ns       192865 bytes_per_second=4.92025Gi/s
@ruben-laso ruben-laso changed the title [FR] PredictNumItersNeeded 1.4 correction factor [FR] PredictNumItersNeeded() 1.4 correction factor Sep 6, 2024
@dmah42
Member

dmah42 commented Sep 6, 2024

the multiplier is applied when we've run for less than min time to figure out how many iterations we need to get over the minimum time. we could go with a smaller multiplier, but it will take longer to get up to the minimum time, which (i believe) nets out at the same amount of computation at the end of the day.

@ruben-laso
Author

The confusing point is that the library will (internally) aim to run the benchmark for "min_time" + 40%.
Users might expect the benchmark to run just enough iterations to exceed the 'min_time' threshold.

What would happen if the estimation is wrong? Imagine an extreme situation where the first iteration(s) have very low performance due to a cold cache, but it is very fast after that.

What would be the implications of setting this number to 1.0? Would the function be called several times until the "min_time" is reached?

@dmah42
Copy link
Member

dmah42 commented Sep 6, 2024

if it was set to 1.0 then we would never hit mintime.

is it true we'd run for min_time + 40%? i think if the first run we try runs longer than mintime then we'd be done. the only time the multiplier would kick in is if an attempt runs for shorter than that time, at which point we'd increase the time of the next run by 40%. the only time we'd run for a time mintime * 1.4 is if the initial run is almost-but-not-quite mintime.

@LebedevRI
Collaborator

LebedevRI commented Sep 6, 2024

As

double multiplier = GetMinTimeToApply() * 1.4 / std::max(i.seconds, 1e-9);
notes, we:

  1. Run N (start=1) iterations
  2. See if that is >= min time, if it is then we're done
  3. If not, is it less-or-equal than 10% of min time? in which case we goto step 1 with N*=10
  4. But if it is greater than 10% of min time, then we run somewhere between 10xN and 14xN iterations

So yes, we do need that fudge factor, and yes, we clearly can run for 1.4x of the min time.
In reality even more, because of those initial runs with exponentially growing iteration counts.
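The steps above can be sketched as a loop. With a constant per-iteration cost, the total time spent (including all the discarded shorter runs) is easy to trace; this is an illustrative approximation, not the library's code:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Sketch of the re-run loop described in the steps above, with a constant
// per-iteration cost. Returns the total time spent (in seconds) across all
// runs before a single run finally reaches min_time.
double TotalTimeSpent(double min_time_s, double iter_time_s, double fudge) {
  int64_t iters = 1;
  double total = 0.0;
  for (;;) {
    const double t = static_cast<double>(iters) * iter_time_s;  // this run
    total += t;
    if (t >= min_time_s) return total;  // step 2: authoritative run, done
    double multiplier;
    if (t / min_time_s <= 0.1) {
      multiplier = 10.0;  // step 3: measurement not significant yet
    } else {
      multiplier = min_time_s * fudge / std::max(t, 1e-9);  // step 4
    }
    if (multiplier <= 1.0) multiplier = 2.0;
    iters = static_cast<int64_t>(std::ceil(multiplier * static_cast<double>(iters)));
  }
}
```

For example, with min_time = 1s and 0.7ms iterations, the probe runs cost 0.0007 + 0.007 + 0.07 + 0.7 seconds, and the final run then takes ~1.4s, for ~2.18s total.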

I'm guessing the factor of 1.4 is "derived" from the statistical distribution of the iteration timings.
If we underestimate it, and the actual iteration times happen to be faster than the previous ones,
we may fall slightly short of the min time (say 0.9x of min time); we'll then need to re-run,
and that will end up being a ~1.9x overshoot, which is more wasteful than a 1.4x overshoot.

@ruben-laso
Author

According to my calculations, in step 4 you would run somewhere between $1.4\times N$ and $14\times N$.

Since GetMinTimeToApply() $>$ i.seconds, we always have GetMinTimeToApply() / i.seconds $>1$.
Given that i.seconds / GetMinTimeToApply() $> 0.1$, we have GetMinTimeToApply() / i.seconds $< 10$.
Without any correction factor, that means that multiplier would be $1 <$ multiplier $< 10$.
When a correction factor is applied, $1.4$ in this case, the range shifts to $1.4 <$ multiplier $< 14$.

Let's follow the example that we fall slightly short of the min time, 0.9x of the min time. Without any correction factor, we could just hit the 1.0x of min time with (roughly) multiplier $= 1/0.9 \simeq 1.11 \ll 1.4$.

@LebedevRI
Collaborator

According to my calculations, in step 4 you would run somewhere between 1.4 × N and 14 × N .

Since GetMinTimeToApply() > i.seconds, we always have GetMinTimeToApply() / i.seconds > 1 . Given that i.seconds / GetMinTimeToApply() > 0.1 , we have GetMinTimeToApply() / i.seconds < 10 . Without any correction factor, that means that multiplier would be 1 < multiplier < 10 . When a correction factor is applied, 1.4 in this case, the range shifts to 1.4 < multiplier < 14 .

Right.

Let's follow the example that we fall slightly short of the min time, 0.9x of the min time. Without any correction factor, we could just hit the 1.0x of min time with (roughly) multiplier = 1 / 0.9 ≃ 1.11 ≪ 1.4 .

But you do see the problem, right? We will then have spent 1.9x min_time.

@LebedevRI
Collaborator

LebedevRI commented Sep 6, 2024

Very quick-n-dirty back of the napkin proof:

Julia code:
using Unitful
using Distributions
using Random

MIN_TIME = 1u"s"
IDEAL_NUM_ITERATIONS = 1000
ITERATION_MEAN_TIME = MIN_TIME / IDEAL_NUM_ITERATIONS

ITERATION_CoV = 0.01
ITERATION_STDEV = ITERATION_CoV * ITERATION_MEAN_TIME

ITERATION_STEP = 10
# TIME_FUDJE = 1.1

SYSTEM_BACKGROUND_JITTER_LIKELYHOOD = 0.25
SYSTEM_BACKGROUND_JITTER_RUNTIME_IMPACT = 0.10

function run_test(;TIME_FUDJE) 
    iterations_total = missing
    time_total = missing
    
    iterations = missing
    time = missing

    while true
        prev_iterations = iterations
        if iterations isa Missing
            iterations = 1
        elseif (time / MIN_TIME) <= (ITERATION_STEP/100)
            iterations *= ITERATION_STEP
        else
            multiplier = MIN_TIME * TIME_FUDJE / max(time, 1e-9u"s")
            @show multiplier
            iterations *= multiplier
        end
        iterations = convert(Int64, ceil(iterations))
        @assert (prev_iterations isa Missing) || iterations > prev_iterations
        
        ds = Bernoulli(SYSTEM_BACKGROUND_JITTER_LIKELYHOOD)
        di = Normal(ustrip(u"s", ITERATION_MEAN_TIME), ustrip(u"s", ITERATION_STDEV))
        
        background_jitter_scale = [ (!b ? 1 : (1 + SYSTEM_BACKGROUND_JITTER_RUNTIME_IMPACT)) for b in rand(ds, iterations)]
        iteration_times = background_jitter_scale .* (rand(di, iterations).*u"s")
        time = sum(iteration_times)
        
        time_total = sum(skipmissing([time, time_total]))
        iterations_total = sum(skipmissing([iterations, iterations_total]))

        if time > MIN_TIME
            break
        end
    end
    @assert time > MIN_TIME
    return (iterations_total, time_total)
end

@show run_test(TIME_FUDJE=1.4)
@show run_test(TIME_FUDJE=1.0)
Output:

multiplier = 13.672021839106499
run_test(TIME_FUDJE = 1.4) = (1479, 1.5175746366315466 s)
multiplier = 9.686340899121179
multiplier = 1.0067496719348774
run_test(TIME_FUDJE = 1.0) = (2056, 2.1082144456986898 s)

So the 1.0 fudge factor ends up being slower.

NOTE: this model is not meant to be authoritative.
I'm not sure if modelling system background load as a Bernoulli distribution
is the right approach, but i do believe it is a reasonable first-order approximation.

@LebedevRI
Collaborator

LebedevRI commented Sep 6, 2024

Some plotting:

Julia code:
module MyUnits
    using Unitful
    #export pt, mm, cm
    @unit fudgeFactor "×" FudgeFactor 1 false
    @unit iterations "iters" Iterations 1 false

    #const mm = u"mm"
    #const cm = u"cm"

    Unitful.register(@__MODULE__)
    function __init__()
        return Unitful.register(@__MODULE__)
    end               
end

using Unitful
using Distributions
using Random
using Measurements
using Plots

plotlyjs()

MIN_TIME = 1u"s"
IDEAL_NUM_ITERATIONS = 100
ITERATION_MEAN_TIME = MIN_TIME / IDEAL_NUM_ITERATIONS

ITERATION_CoV = 1u"percent"
ITERATION_STDEV = ITERATION_CoV * ITERATION_MEAN_TIME

ITERATION_STEP = 10
AUTHORITATIVE_TIME = 10u"percent"

SYSTEM_BACKGROUND_JITTER_LIKELYHOOD = 25u"percent"
SYSTEM_BACKGROUND_JITTER_RUNTIME_IMPACT = 0.10

NUM_REPETITIONS = 1000

function run_test_repetition(;TIME_FUDJE) 
    iterations_total = missing
    time_total = missing
    
    iterations = missing
    time = missing

    while true
        prev_iterations = iterations
        if iterations isa Missing
            iterations = 1
        elseif ((time / MIN_TIME)*100u"percent") <= AUTHORITATIVE_TIME
            iterations *= ITERATION_STEP
        else
            multiplier = MIN_TIME * TIME_FUDJE / max(time, 1e-9u"s")
            iterations *= multiplier
        end
        iterations = convert(Int64, ceil(iterations))
        @assert (prev_iterations isa Missing) || iterations > prev_iterations
        
        ds = Bernoulli(ustrip(u"percent", SYSTEM_BACKGROUND_JITTER_LIKELYHOOD)/100)
        di = Normal(ustrip(u"s", ITERATION_MEAN_TIME), ustrip(u"s", ITERATION_STDEV))
        
        background_jitter_scale = [ (!b ? 1 : (1 + SYSTEM_BACKGROUND_JITTER_RUNTIME_IMPACT)) for b in rand(ds, iterations)]
        iteration_times = background_jitter_scale .* (rand(di, iterations).*u"s")
        time = sum(iteration_times)
        
        iterations_total = sum(skipmissing([iterations, iterations_total]))
        time_total = sum(skipmissing([time, time_total]))

        if time > MIN_TIME
            break
        end
    end
    @assert time > MIN_TIME
    return (iterations_total, time_total)
end

function run_test(TIME_FUDJE)
    v = [run_test_repetition(TIME_FUDJE=TIME_FUDJE) for r in 1:NUM_REPETITIONS]
    all_iterations = getfield.(v, 1)
    all_times = getfield.(v, 2)
   
    iterations_mean = mean(all_iterations)
    time_mean = mean(all_times)
    
    iterations_std = std(all_iterations; mean=iterations_mean)
    time_std = std(all_times; mean=time_mean)
    
    return (measurement(iterations_mean, iterations_std),
            measurement(time_mean, time_std))
end

FUDGE_MIN = 1.0
FUDGE_MAX = 1.5
FUDGE_STEP = 0.01

TIME_FUDJES = collect(range(FUDGE_MIN,stop=FUDGE_MAX,length=1+ceil(Int64, (FUDGE_MAX-FUDGE_MIN)/FUDGE_STEP)))
RESULTS = run_test.(TIME_FUDJES)
TIME_FUDJES = TIME_FUDJES.*u"fudgeFactor"
# ITERATIONS = getfield.(RESULTS, 1).*u"iterations"
TIMES = getfield.(RESULTS, 2)

plot(TIME_FUDJES, TIMES, xlabel="fudge factor", ylabel="total time", size=(800,600))

Without touching the rest of the params there ^,
with 10ms iterations (x 100 repetitions), we get:

[plot: total time vs. fudge factor, 10ms iterations]

but with 1ms iterations (x 1000 repetitions), we get:

[plot: total time vs. fudge factor, 1ms iterations]

Maybe the factor of 1.4 is too much, but i'm certain that 1.0 is just wrong.

@LebedevRI
Collaborator

LebedevRI commented Sep 7, 2024

And some more fancy-ness (most things are now sampled from distributions
instead of being hardcoded. again, not 100% confident in particular choices):

Julia code:
module MyUnits
    using Unitful

    @unit fudgeFactor "×" FudgeFactor 1 false
    @unit iterations "iters" Iterations 1 false

    Unitful.register(@__MODULE__)
    function __init__()
        return Unitful.register(@__MODULE__)
    end               
end

using Base.Threads
using Base.Iterators
using Unitful
using Distributions
using Random
using Measurements
using ProgressMeter
using Plots

plotlyjs()

MIN_TIME = 1u"s"
ITERATION_STEP = 10
AUTHORITATIVE_TIME = 10u"percent"

#dist_ITERATION_MEAN_TIME = LogUniform(ustrip(u"s", 1u"μs"), ustrip(u"s", 10u"s"))
#dist_ITERATION_CoV = LogUniform(0.1/100, 10/100)
#dist_SYSTEM_BACKGROUND_JITTER_LIKELYHOOD = LogUniform(0.1/100, 10/100)
#dist_SYSTEM_BACKGROUND_JITTER_RUNTIME_IMPACT = LogUniform(0.1/100, 10/100)

dist_ITERATION_MEAN_TIME = censored(LogNormal(0, 2.5), ustrip(u"s", 1u"μs"), ustrip(u"s", 1u"s"))
dist_ITERATION_CoV = censored(LogNormal(1, 1), 0.1/100, 10/100)
dist_SYSTEM_BACKGROUND_JITTER_LIKELYHOOD = censored(LogNormal(2.5, 1), 0.1/100, 10/100)
dist_SYSTEM_BACKGROUND_JITTER_RUNTIME_IMPACT = censored(LogNormal(2.5, 1), 0.1/100, 10/100)

function run_test_repetition_impl(;TIME_FUDJE,
                                    ITERATION_MEAN_TIME,
                                    ITERATION_STDEV,
                                    SYSTEM_BACKGROUND_JITTER_LIKELYHOOD,
                                    SYSTEM_BACKGROUND_JITTER_RUNTIME_IMPACT)
    dist_SYSTEM_BACKGROUND_JITTER = Bernoulli(SYSTEM_BACKGROUND_JITTER_LIKELYHOOD)
    dist_ITERATION_TIME = Normal(ITERATION_MEAN_TIME, ITERATION_STDEV)
     
    iterations_total = missing
    time_total = missing
    
    iterations = missing
    time = missing

    while true
        prev_iterations = iterations
        if iterations isa Missing
            iterations = 1
        elseif ((time / MIN_TIME)*100u"percent") <= AUTHORITATIVE_TIME
            iterations *= ITERATION_STEP
        else
            multiplier = MIN_TIME * TIME_FUDJE / max(time, 1e-9u"s")
            iterations *= multiplier
        end
        iterations = convert(Int64, ceil(iterations))
        @assert (prev_iterations isa Missing) || iterations > prev_iterations

        background_jitter_scale = [ (!b ? 1 : (1 + SYSTEM_BACKGROUND_JITTER_RUNTIME_IMPACT)) for b in rand(dist_SYSTEM_BACKGROUND_JITTER, iterations)]
        iteration_times = background_jitter_scale .* (rand(dist_ITERATION_TIME, iterations).*u"s")
        time = sum(iteration_times)
        
        iterations_total = sum(skipmissing([iterations, iterations_total]))
        time_total = sum(skipmissing([time, time_total]))

        if time > MIN_TIME
            break
        end
    end
    @assert time > MIN_TIME
    real_iteration_mean_time = time_total / iterations_total
    ideal_iteration_count = ceil(Int64, MIN_TIME / real_iteration_mean_time)
    ideal_run_time = real_iteration_mean_time * ideal_iteration_count
    time_overspent = time_total / ideal_run_time
    return (iterations_total, time_total, time_overspent)
end

NUM_INNER_REPETITIONS = 100

function run_test_repetition(;TIME_FUDJE) 
    ITERATION_MEAN_TIME = rand(dist_ITERATION_MEAN_TIME)
    ITERATION_CoV = rand(dist_ITERATION_CoV)
    SYSTEM_BACKGROUND_JITTER_LIKELYHOOD = rand(dist_SYSTEM_BACKGROUND_JITTER_LIKELYHOOD)
    SYSTEM_BACKGROUND_JITTER_RUNTIME_IMPACT = rand(dist_SYSTEM_BACKGROUND_JITTER_RUNTIME_IMPACT)

    ITERATION_STDEV = ITERATION_CoV * ITERATION_MEAN_TIME

    v = [run_test_repetition_impl(TIME_FUDJE=TIME_FUDJE,
                                  ITERATION_MEAN_TIME=ITERATION_MEAN_TIME,
                                  ITERATION_STDEV=ITERATION_STDEV,
                                  SYSTEM_BACKGROUND_JITTER_LIKELYHOOD=SYSTEM_BACKGROUND_JITTER_LIKELYHOOD,
                                  SYSTEM_BACKGROUND_JITTER_RUNTIME_IMPACT=SYSTEM_BACKGROUND_JITTER_RUNTIME_IMPACT)
                                for r in 1:NUM_INNER_REPETITIONS]
    next!(p)
    return v
end

NUM_REPETITIONS = 10^5

function run_test(TIME_FUDJE)
    chunk_size = max(1, ceil(Int64, NUM_REPETITIONS // nthreads()))
    data_chunks = partition(1:NUM_REPETITIONS, chunk_size)
    tasks = map(data_chunks) do chunk
        @spawn begin
            state = []
            for x in chunk
                push!(state, run_test_repetition(TIME_FUDJE=TIME_FUDJE))
            end
            return state
        end
    end
    
    v = fetch.(tasks)
    v = vcat(v...)
    v = vcat(v...)
    all_iterations = getfield.(v, 1)
    all_times = getfield.(v, 2)
    all_times_overspent = getfield.(v, 3)
   
    iterations_mean = mean(all_iterations)
    time_mean = mean(all_times)
    time_overspent_mean = mean(all_times_overspent)

    iterations_std = std(all_iterations; mean=iterations_mean)
    time_std = std(all_times; mean=time_mean)
    time_overspent_std = std(all_times_overspent; mean=time_overspent_mean)
    
    return (measurement(iterations_mean, iterations_std),
            measurement(time_mean, time_std),
            measurement(time_overspent_mean, time_overspent_std))
end

FUDGE_MIN = 1.0
FUDGE_MAX = 1.5
FUDGE_STEP = 0.01

TIME_FUDJES = collect(range(FUDGE_MIN,stop=FUDGE_MAX,length=1+ceil(Int64, (FUDGE_MAX-FUDGE_MIN)/FUDGE_STEP)))

chunk_size = max(1, ceil(Int64, length(TIME_FUDJES) // nthreads()))
data_chunks = partition(TIME_FUDJES, chunk_size)
p = Progress(NUM_REPETITIONS*length(TIME_FUDJES); dt=1)
tasks = map(data_chunks) do chunk
    @spawn begin
        state = []
        for x in chunk
            push!(state, run_test(x))
        end
        return state
    end
end
RESULTS = fetch.(tasks)
RESULTS = vcat(RESULTS...)
TIME_FUDJES = TIME_FUDJES.*u"fudgeFactor"
# ITERATIONS = getfield.(RESULTS, 1).*u"iterations"
TIMES = getfield.(RESULTS, 2)
TIMES_OVERSPENT = getfield.(RESULTS, 3)

scatter(TIME_FUDJES, Measurements.value.(TIMES_OVERSPENT), #series_annotations = ann,
        xticks=TIME_FUDJES, yticks=0:0.005:3, xlabel="fudge factor", ylabel="time overspenditure (factor)", size=(1920,1080))
EDIT: the distributions above are wrong; these might be better (graphs not updated):
using SpecialFunctions

function lognormal_σ(;μ,v,num_σ=3)
    p = erf(num_σ/sqrt(2))
    return (log(v) - μ)/(sqrt(2)*erfinv(2*p-1))
end

# 0σ = 10ms, +3σ = 1s
# NOTE: in reality, mean should be 1ms, but that makes this simulation much slower...
μ=log(10/1000)
dist_ITERATION_MEAN_TIME = censored(LogNormal(μ, lognormal_σ(μ=μ, v=1000/1000)), ustrip(u"s", 1u"μs"), ustrip(u"s", 1u"s"))

# 0σ = 1%, +3σ = 25%
μ=log(1/100)
dist_ITERATION_CoV = censored(LogNormal(μ, lognormal_σ(μ=μ, v=25/100)), 0.1/100, 25/100)

# 0σ = 10%, +3σ = 25%
μ=log(10/100)
dist_SYSTEM_BACKGROUND_JITTER_LIKELYHOOD = censored(LogNormal(μ, lognormal_σ(μ=μ, v=25/100)), 0.1/100, 50/100)

# 0σ = 5%, +3σ = 25%
μ=log(5/100)
dist_SYSTEM_BACKGROUND_JITTER_RUNTIME_IMPACT = censored(LogNormal(μ, lognormal_σ(μ=μ, v=25/100)), 0.1/100, 50/100)

[plot: time overspenditure (factor) vs. fudge factor]

Again, i'm not quite confident that my "statistical model" is sufficiently precise for anything,
but maybe 1.4 could be lowered somewhat, perhaps to 1.10, theoretically.

@ruben-laso
Author

I agree that the value 1.0 falls into the theoretical (non-real) world.

Looking at your plots, I also agree that something in the 1.02 - 1.10 range might be more sensible.

In any case, I think the best option could be to have an environment variable (or a program argument like --benchmark_min_time_thres) so the user can tweak this if needed instead of a hardcoded constant value.
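A minimal sketch of that suggestion, assuming a hypothetical environment variable. The name BENCHMARK_MIN_TIME_MULTIPLIER and the fallback policy are invented here for illustration; no such knob currently exists:

```cpp
#include <cstdlib>
#include <string>

// Hypothetical sketch: read the correction factor from an environment
// variable instead of hardcoding 1.4. Both the variable name and the
// parsing/fallback policy are assumptions, not existing library behavior.
double GetMinTimeMultiplier() {
  constexpr double kDefault = 1.4;
  const char* env = std::getenv("BENCHMARK_MIN_TIME_MULTIPLIER");
  if (env == nullptr) return kDefault;
  try {
    const double v = std::stod(env);
    return v > 0.0 ? v : kDefault;  // ignore nonsensical values
  } catch (...) {
    return kDefault;  // unparsable input falls back to the default
  }
}
```

The same value could equally be wired up as a --benchmark_... command-line flag; the environment-variable form is just the smallest possible change.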

@LebedevRI
Collaborator

@dmah42 question: does google retain benchmark .json data for long-term performance tracking?
If it does, we could maybe at least derive (with noise) dist_ITERATION_MEAN_TIME from that real-world data.

We don't encode nearly enough info into that json (we are missing min_time and use_real_time),
so it would be somewhat lossy, but i think we could side-step that issue.
I think, only the following parts of said json output would be necessary:

{
"run_name": <?>,
"run_type": <?>,
"threads": <?>, % we were scaling by this factor until recently
"iterations": <?>,
"aggregate_name": <?>,
"real_time": <?>,
"cpu_time": <?>,
"time_unit": <?>
}

I'm guessing you'll want to censor "run_name"; i think it would be fine to drop everything but `/min_time:<>` and `/use_real_time` from that string.

Out of all the benchmarks that are tracked, have all of them been run at least once since #1836?
I think it would be least confusing to only look at the benchmarks //with// that change
(or without it).

Is that possible?

@LebedevRI
Collaborator

LebedevRI commented Sep 9, 2024

@dmah42 (consider my previous question to be a theoretical question, not a request.)

@ruben-laso

I agree that the value 1.0 falls into the theoretical (non-real) world.

Looking at your plots, I also agree that something in the 1.02 - 1.10 range might be more sensible.

In any case, I think the best option could be to have an environment variable (or a program argument like --benchmark_min_time_thres) so the user can tweak this if needed instead of a hardcoded constant value.

That feels very much placebo-y; as far as i can tell, even with that fudge at 1.1, we still spend 1.5x the time
we ideally need to (as per the final iteration time), which is not significantly better than what we'd spend at 1.4:

Julia code:
module MyUnits
    using Unitful

    @unit fudgeFactor "×" FudgeFactor 1 false
    @unit iterations "iters" Iterations 1 false

    Unitful.register(@__MODULE__)
    function __init__()
        return Unitful.register(@__MODULE__)
    end               
end

using Base.Threads
using Base.Iterators
using Unitful
using Distributions
using Random
using Measurements
using ProgressMeter
using Plots
using SpecialFunctions
using DataFrames

plotlyjs()

MIN_TIME = 1u"s"

#dist_ITERATION_MEAN_TIME = LogUniform(ustrip(u"s", 1u"μs"), ustrip(u"s", 10u"s"))
#dist_ITERATION_CoV = LogUniform(0.1/100, 10/100)
#dist_SYSTEM_BACKGROUND_JITTER_LIKELYHOOD = LogUniform(0.1/100, 10/100)
#dist_SYSTEM_BACKGROUND_JITTER_RUNTIME_IMPACT = LogUniform(0.1/100, 10/100)

function lognormal_σ(;μ,v,num_σ=3)
    p = erf(num_σ/sqrt(2))
    return (log(v) - μ)/(sqrt(2)*erfinv(2*p-1))
end

# 0σ = 1ms, +3σ = 1s
# NOTE: in reality, mean should be 1ms, but that makes this simulation much slower...
μ=log(1/1000)
dist_ITERATION_MEAN_TIME = censored(LogNormal(μ, lognormal_σ(μ=μ, v=1000/1000)), ustrip(u"s", 1u"μs"), ustrip(u"s", 1u"s"))

# 0σ = 10%, +3σ = 25%
μ=log(10/100)
dist_ITERATION_CoV = censored(LogNormal(μ, lognormal_σ(μ=μ, v=25/100)), 0.1/100, 25/100)

struct BenchmarkResult
    AUTHORITATIVE_TIME::Float64
    ITERATION_COUNT_MULTIPLIER::Int64
    TIME_FUDJE::Float64
    time_overspent::Any
end

function run_test_repetition_impl(; AUTHORITATIVE_TIME,
                                    ITERATION_COUNT_MULTIPLIER,
                                    TIME_FUDJE)
    ITERATION_MEAN_TIME = rand(dist_ITERATION_MEAN_TIME)
    ITERATION_CoV = rand(dist_ITERATION_CoV)

    ITERATION_STDEV = ITERATION_CoV * ITERATION_MEAN_TIME

    dist_ITERATION_TIME = Normal(ITERATION_MEAN_TIME, ITERATION_STDEV)
     
    iterations_total = missing
    time_total = missing
    
    iterations = missing
    time = missing

    while true
        prev_iterations = iterations
        if iterations isa Missing
            iterations = 1
        elseif (time / MIN_TIME) <= AUTHORITATIVE_TIME
            iterations *= ITERATION_COUNT_MULTIPLIER
        else
            multiplier = MIN_TIME * TIME_FUDJE / max(time, 1e-9u"s")
            iterations *= multiplier
        end
        iterations = convert(Int64, ceil(iterations))
        @assert (prev_iterations isa Missing) || iterations > prev_iterations

        iteration_times = rand(dist_ITERATION_TIME,iterations).*u"s"
        time = sum(iteration_times)
        
        iterations_total = sum(skipmissing([iterations, iterations_total]))
        time_total = sum(skipmissing([time, time_total]))

        if time >= MIN_TIME
            break
        end
    end
    @assert time > MIN_TIME
    real_iteration_mean_time = time_total / iterations_total
    ideal_iteration_count = ceil(Int64, MIN_TIME / real_iteration_mean_time)
    ideal_run_time = real_iteration_mean_time * ideal_iteration_count
    time_overspent = time_total / ideal_run_time
    return time_overspent
end

NUM_BENCHMARK_REPETITIONS = 10^6

function run_test_configuration(; AUTHORITATIVE_TIME,
                                  ITERATION_COUNT_MULTIPLIER,
                                  TIME_FUDJE) 
    chunk_size = max(1, ceil(Int64, NUM_BENCHMARK_REPETITIONS // nthreads()))
    data_chunks = partition(1:NUM_BENCHMARK_REPETITIONS, chunk_size)
    tasks = map(data_chunks) do chunk
        @spawn begin
            state = []
            for x in chunk
                push!(state, run_test_repetition_impl(
                                  AUTHORITATIVE_TIME=AUTHORITATIVE_TIME,
                                  ITERATION_COUNT_MULTIPLIER=ITERATION_COUNT_MULTIPLIER,
                                  TIME_FUDJE=TIME_FUDJE))
                next!(p)
            end
            return state
        end
    end
    
    v = fetch.(tasks)
    v = vcat(v...)
    v = vcat(v...)

    time_overspent = v

    time_overspent_mean = mean(time_overspent)
    time_overspent_std = std(time_overspent; mean=time_overspent_mean)
    time_overspent = measurement(time_overspent_mean, time_overspent_std)
    #time_overspent = time_overspent_mean + 3*time_overspent_std
    
    v = BenchmarkResult(AUTHORITATIVE_TIME,
                        ITERATION_COUNT_MULTIPLIER,
                        TIME_FUDJE,
                        time_overspent)
    return v
end

function run_test(CONFIGURATIONS)    
    chunk_size = max(1, ceil(Int64, length(CONFIGURATIONS) // nthreads()))
    data_chunks = partition(CONFIGURATIONS, chunk_size)
    tasks = map(data_chunks) do chunk
        @spawn begin
            state = []
            for x in chunk
                push!(state, run_test_configuration(AUTHORITATIVE_TIME=x[1],ITERATION_COUNT_MULTIPLIER=x[2],TIME_FUDJE=x[3]))
            end
            return state
        end
    end

    v = fetch.(tasks)
    v = vcat(v...)
    v = vcat(v...)
    return v
end

#AUTHORITATIVE_TIME = [10/100]
#ITERATION_COUNT_MULTIPLIER = [10]
#TIME_FUDGE = [1.4]

AUTHORITATIVE_TIME = [10/100]
ITERATION_COUNT_MULTIPLIER = [10]
TIME_FUDGE = [1.0,1.1,1.4]

#AUTHORITATIVE_TIME = 1/100:1/100:5/100
#ITERATION_COUNT_MULTIPLIER = 2:1:20
#TIME_FUDGE = 1.0:0.01:1.1

CONFIGURATIONS = [(x,y,z) for x in AUTHORITATIVE_TIME
                          for y in ITERATION_COUNT_MULTIPLIER
                          for z in TIME_FUDGE]
NUM_CONFIGURATIONS = length(CONFIGURATIONS)

p = Progress(NUM_BENCHMARK_REPETITIONS*NUM_CONFIGURATIONS; dt=1)
R = run_test(CONFIGURATIONS)

X = [r.ITERATION_COUNT_MULTIPLIER for r in R]
Y = [r.AUTHORITATIVE_TIME for r in R]
Y2 = [r.TIME_FUDJE for r in R]
Z = [r.time_overspent for r in R]

df = DataFrame(ITERATION_COUNT_MULTIPLIER=X, AUTHORITATIVE_TIME=Y, TIME_FUDJE=Y2, time_overspent=Z)
sort!(df, [:time_overspent])
sort!(df, [order(:time_overspent), order(:AUTHORITATIVE_TIME), order(:ITERATION_COUNT_MULTIPLIER, rev=true), order(:TIME_FUDJE, rev=true)])
@show first(df, 100)
@assert false

Z = Measurements.value.(Z)

#heatmap(X,Y,Z; size=(800,600),
#        xlabel="ITERATION_COUNT_MULTIPLIER (x)", ylabel="AUTHORITATIVE_TIME (%)", zlabel="time_overspent (x)",
#        xticks=ITERATION_COUNT_MULTIPLIER, yticks=AUTHORITATIVE_TIME)

scatter(X,Y,Z; zcolor=Z, size=(800,600),
        xlabel="ITERATION_COUNT_MULTIPLIER (x)", ylabel="AUTHORITATIVE_TIME (%)", zlabel="time_overspent (x)",
        xticks=ITERATION_COUNT_MULTIPLIER, yticks=AUTHORITATIVE_TIME)

#scatter(Y2,Z; size=(800,600),
#        xlabel="TIME_FUDGE", ylabel="time_overspent (x)",
#        xticks=TIME_FUDGE)
first(df, 100) = 3×4 DataFrame
 Row │ ITERATION_COUNT_MULTIPLIER  AUTHORITATIVE_TIME  TIME_FUDJE  time_overspent
     │ Int64                       Float64             Float64     Measurement…
─────┼────────────────────────────────────────────────────────────────────────────
   1 │                         10                 0.1         1.1       1.53±0.28
   2 │                         10                 0.1         1.4       1.82±0.28
   3 │                         10                 0.1         1.0       2.04±0.85

We could also lower AUTHORITATIVE_TIME and maybe change ITERATION_COUNT_MULTIPLIER,
and lower the total time spent to as low as 1.2x of the ideal,
but it really does feel ad hoc to change anything without a better statistical model...

@dmah42
Member

dmah42 commented Sep 9, 2024

the right solution to this is to run until we reduce the confidence interval for the iterations within a run to a particular threshold (or max time to avoid infinite runs). anything other than this is just fine-tuning and is unlikely to scale across all the various use-cases.

if someone wants to reduce 1.4 to 1.2 or 1.1 or whatever then create a PR but i really don't think it matters all that much (and thank you @LebedevRI for basically proving that :D ).

@LebedevRI
Collaborator

LebedevRI commented Sep 9, 2024

the right solution to this is to run until we reduce the confidence interval for the iterations within a run to a particular threshold (or max time to avoid infinite runs). anything other than this is just fine-tuning and is unlikely to scale across all the various use-cases.

if someone wants to reduce 1.4 to 1.2 or 1.1 or whatever then create a PR but i really don't think it matters all that much (and thank you @LebedevRI for basically proving that :D ).

Yeah, that's my general understanding, too. I haven't yet tried modelling said different convergence approach.
Trivia bit: this time_overspent factor for the current implementation seems to be best described by an inverse Gaussian distribution.

@ruben-laso
Author

Hello again,

I was playing with the internals of the library, and I think I now understand how it works.
From what I understood, every time we get a new prediction from PredictNumItersNeeded(), we "start from scratch":
whatever measurements we had before are forgotten.

Let's use an example:
Imagine we want to benchmark the function foo() for 1s. At some point, PredictNumItersNeeded() says that we should run the function 900 times. Unfortunately, that only got us to 0.9s, so PredictNumItersNeeded() is called again, saying that we should have run foo() 1000 times to get to 1s.
Now, instead of doing 1000-900 = 100 iterations to fill the gap, we run the 1000 iterations in a single go, so, by the end of the process, we have run foo() >1900 times.

@LebedevRI That fits with what you mentioned here

Is this correct?

If so, isn't this approach very inefficient? We are forced to either overestimate or start from scratch many times.

@dmah42
Copy link
Member

dmah42 commented Sep 11, 2024

yes that is correct, we start from scratch. i suppose we could change it to do an incremental run and check the running total of iterations and time instead. it would be a reasonably complex change but would improve the efficiency.

efficiency (in terms of how much we run) hasn't been a concern as we've focused on making the inner timing loop as fast as possible to get out of the way of the code under test.
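The incremental variant mentioned here could look something like the following sketch: keep a running total of iterations and time across attempts, and each round run only the estimated shortfall. All names are assumptions (`run(n)` stands in for a callback that times n iterations), and note it gives up the property of having one consecutive min_time run:

```cpp
#include <algorithm>
#include <cstdint>

struct Totals { int64_t iters; double seconds; };

// Incremental sketch: never discard earlier measurements; each round only
// runs the iterations still missing according to the current time-per-
// iteration estimate.
Totals run_incremental(double (*run)(int64_t), double min_time) {
    Totals t{0, 0.0};
    int64_t next = 1;
    while (t.seconds < min_time) {
        t.seconds += run(next);  // only the missing iterations, never a redo
        t.iters += next;
        double per_iter = t.seconds / static_cast<double>(t.iters);
        int64_t needed = static_cast<int64_t>(min_time / per_iter) + 1;
        next = std::max<int64_t>(needed - t.iters, 1);
    }
    return t;
}
```

With a stable per-iteration time the running total converges with essentially no wasted work: total time ends up just over min_time instead of ~1.4x it.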

@LebedevRI
Copy link
Collaborator

@LebedevRI That fits with what you mentioned here

Is this correct?

That is what i said, yes.

I'm still looking at that model, i need better estimates, but i think we might
be able to get closer to the efficiency sweet-spot.

yes that is correct, we start from scratch. i suppose we could change it to do an incremental run and check the running total of iterations and time instead. it would be a reasonably complex change but would improve the efficiency.

Can we, though? I was under the distinct impression that the current way it's done is intentional,
and that it is a design guarantee that the benchmark-under-test will run for
at least min_time consecutive seconds, and that we will report the measurements from that specific run.

@ruben-laso
Copy link
Author

As an idea, why not estimate the number of iterations to get to 10% of the min time?

The way it is now, you multiply the iterations by 10 until surpassing the 10% threshold. Let's say you are between 2-9% of your min time. Then, the multiplier will be 10, and you will move to the range of 20 to 90% of your min time.

Instead of multiplying by 10 "blindly", what about making an estimation of the iterations to get to, let's say 15%? In that case, you "wasted" only ~15% of CPU time, but it is big enough to be considered "significant" in the current algorithm and make a better prediction later (reducing the 40% overestimation to something smaller).

Maybe not the best solution, but I think this would alleviate the problem and should be easy to implement.

@dmah42
Copy link
Member

dmah42 commented Sep 11, 2024

Can we, though? I was under the distinct impression that the current way it's done is intentional, and that it is a design guarantee that the benchmark-under-test will run for at least min_time consecutive seconds, and that we will report the measurements from that specific run.

it's intentional in that we built it that way. but like i said above, if efficiency had been a concern we probably would have built it differently. this was the simplest approach at the time that worked.

why do you think the consecutive aspect matters, or reporting from a single run matters? it shouldn't, right?

@ruben-laso
Copy link
Author

I don't think it is really relevant in terms of the report, but it is in terms of computation time.

Not only about the CPU time (and energy) spent without a real outcome but also the human time you need to wait to have the full report.
In my case, I have to benchmark a lot of cases and functions, and the full set of benchmarks might take several hours to complete. Saving time would be significant.

For example, below you can see the output of a little example to prove my point.
Here I store:

  • LastRunTime: execution time of the last batch (after the last call to PredictNumItersNeeded()). This number is approx Time * Iterations.
  • TimesCalled: times that the benchmark was executed. E.g.: 8 would mean that we tried 8 different numbers for iters within the library.
  • TotalRunTime: the total amount of time running each function. This is the summation of all the calls to VectorLookup, not only the reported ones.

You can see that, to obtain a 1s benchmark (actually ~1.4s), we had to run each function for up to 2.50s. That's a 150% overhead!
Several times we had to run a function for more than 2s.
In my opinion, this overhead is far from negligible.

------------------------------------------------------------------------------------------------
Benchmark                 Time             CPU   Iterations LastRunTime TimesCalled TotalRunTime
------------------------------------------------------------------------------------------------
VectorLookup/10         504 ns          504 ns      2752537     1.38822           8      1.96039
VectorLookup/20        3217 ns         3217 ns       453119     1.45753           7      1.80181
VectorLookup/30       10233 ns        10227 ns       140387     1.43665           7      2.54488
VectorLookup/40       22553 ns        22548 ns        61905     1.39615           6      1.64711
VectorLookup/50       42935 ns        42933 ns        31745     1.36298           6      1.85481
VectorLookup/60       73800 ns        73790 ns        19437     1.43445           6      2.23461
VectorLookup/70      119281 ns       119281 ns        11924     1.42231           5      1.55267
VectorLookup/80      173258 ns       173253 ns         8347     1.44619           5      1.63244
VectorLookup/90      244639 ns       244630 ns         5709     1.39665           5      1.66971
VectorLookup/100     336135 ns       336133 ns         4196     1.41043           5      1.78169
VectorLookup/110     429649 ns       429649 ns         3255     1.39851           5       1.8789
VectorLookup/120     575516 ns       575507 ns         2514     1.44685           5      2.06573
VectorLookup/130     720009 ns       719941 ns         1956     1.40834           5      2.20416
VectorLookup/140     965566 ns       965355 ns         1560     1.50629           5      2.50385
VectorLookup/150    1517023 ns      1516741 ns          911     1.38201           4      1.55215
VectorLookup/160    1603279 ns      1603062 ns          856     1.37241           4      1.55395
VectorLookup/170    1655967 ns      1655238 ns          837     1.38605           4      1.57171
VectorLookup/180    1932482 ns      1932454 ns          720     1.39139           4      1.60685
VectorLookup/190    2259932 ns      2259914 ns          621     1.40342           4      1.65399
VectorLookup/200    2631596 ns      2631435 ns          532     1.40001           4      1.69203

@LebedevRI
Copy link
Collaborator

As i have said, i'm still looking at the current approach; we may be able to do better without any drastic changes.

@LebedevRI
Copy link
Collaborator

LebedevRI commented Sep 11, 2024

julia code
module MyUnits
    using Unitful

    @unit fudgeFactor "×" FudgeFactor 1 false
    @unit iterations "iters" Iterations 1 false

    Unitful.register(@__MODULE__)
    function __init__()
        return Unitful.register(@__MODULE__)
    end               
end

module Recyclers
    using Base.Threads: SpinLock

    struct Recycler{V}
        garbage_bin::Vector{V} # Guarded by: lock
        lock::SpinLock
    
        function Recycler{V}() where {V}
            garbage_bin = Vector{V}()
            lock = SpinLock()
    
            return new(garbage_bin, lock)
        end
    end
    
    function Base.get!(recycler::Recycler{V}) where {V}
        Base.@lock recycler.lock begin
            if !isempty(recycler.garbage_bin)
                return pop!(recycler.garbage_bin)
            end
        end
        return V()
    end
    
    function Base.put!(recycler::Recycler{V}, trash::V) where {V}
        Base.@lock recycler.lock begin
            push!(recycler.garbage_bin, trash)
        end
    end
end

using Base.Threads
using Base.Iterators
using Unitful
using Distributions
using Random
using Measurements
using ProgressMeter
using Plots
using SpecialFunctions
using DataFrames

Recycler = Recyclers.Recycler

plotlyjs()

MIN_TIME = 1u"s"

#dist_ITERATION_MEAN_TIME = LogUniform(ustrip(u"s", 1u"μs"), ustrip(u"s", 10u"s"))
#dist_ITERATION_CoV = LogUniform(0.1/100, 10/100)
#dist_SYSTEM_BACKGROUND_JITTER_LIKELYHOOD = LogUniform(0.1/100, 10/100)
#dist_SYSTEM_BACKGROUND_JITTER_RUNTIME_IMPACT = LogUniform(0.1/100, 10/100)

function tmapreduce(f, op, itr; tasks_per_thread::Int = 1, kwargs...)
    chunk_size = max(1, length(itr) ÷ (tasks_per_thread * nthreads()))
    tasks = map(Iterators.partition(itr, chunk_size)) do chunk
        @spawn mapreduce(f, op, chunk; kwargs...)
    end
    mapreduce(fetch, op, tasks; kwargs...)
end

p(num_σ) = erf(num_σ/sqrt(2))

function lognormal_σ(;μ,v,num_σ=3)
    # σ such that v sits at the +num_σ tail of LogNormal(μ, σ)
    return (log(v) - μ)/(sqrt(2)*erfinv(2*p(num_σ)-1))
end

# 0σ = 1ms, +3σ = 1s
# NOTE: in reality, mean should be 1ms, but that makes this simulation much slower...
MEAN_TIME=1/1000
μ=log(MEAN_TIME)
dist_ITERATION_MEAN_TIME = censored(LogNormal(μ, lognormal_σ(μ=μ, v=1000/1000)), ustrip(u"s", 1u"μs"), ustrip(u"s", 1u"s"))

# 0σ = 10%, +3σ = 25%
μ=log(10/100)
dist_ITERATION_CoV = censored(LogNormal(μ, lognormal_σ(μ=μ, v=25/100)), 0.1/100, 25/100)

struct BenchmarkResult
    ITERATION_COUNT_MULTIPLIER::Int64
    AUTHORITATIVE_TIME::Float64
    TIME_FUDGE::Float64
    time_overspent::Distribution
end

function run_test_configuration_instance!(iteration_times::Vector{Float64},
                                          rng::AbstractRNG;
                                          dist_ITERATION_TIME::Distribution,
                                          AUTHORITATIVE_TIME::Number,
                                          ITERATION_COUNT_MULTIPLIER::Number,
                                          TIME_FUDGE::Number)::Float64
    iterations_total = missing
    time_total = missing
    
    iterations = missing
    time = missing

    while true
        prev_iterations = iterations
        if iterations isa Missing
            iterations = 1
        else
            ratio = iterations / time
            if (time / MIN_TIME) <= AUTHORITATIVE_TIME
                iterations *= ITERATION_COUNT_MULTIPLIER
            else
                multiplier = MIN_TIME * TIME_FUDGE / max(time, 1e-9u"s")
                iterations *= multiplier
            end
        end
        iterations = convert(Int64, ceil(iterations))
        @assert (prev_iterations isa Missing) || iterations > prev_iterations

        resize!(iteration_times, iterations)
        rand!(rng, dist_ITERATION_TIME, iteration_times)
        time = sum(iteration_times)*u"s"

        iterations_total = sum(skipmissing([iterations, iterations_total]))
        time_total = sum(skipmissing([time, time_total]))

        if time >= MIN_TIME
            break
        end
    end
    @assert time >= MIN_TIME
    real_iteration_mean_time = time_total / iterations_total
    ideal_iteration_count = ceil(Int64, MIN_TIME / real_iteration_mean_time)
    ideal_run_time = real_iteration_mean_time * ideal_iteration_count
    time_overspent = time_total / ideal_run_time
    return time_overspent
end

NUM_BENCHMARK_INSTANCES = 10^5
NUM_BENCHMARK_INSTANCES = nthreads() * ceil(Int64, NUM_BENCHMARK_INSTANCES // nthreads())

function run_test_configuration(iteration_times_cache::Recycler{Vector{Float64}}, config, instances)::BenchmarkResult
    AUTHORITATIVE_TIME, ITERATION_COUNT_MULTIPLIER, TIME_FUDGE = config

    v::Vector{Float64} = tmapreduce(vcat, enumerate(instances), tasks_per_thread=1) do instance
        rng = Xoshiro(instance[1])
        iteration_times = get!(iteration_times_cache)
        state = run_test_configuration_instance!(
                    iteration_times, rng,
                    dist_ITERATION_TIME=instance[2],
                    AUTHORITATIVE_TIME=AUTHORITATIVE_TIME,
                    ITERATION_COUNT_MULTIPLIER=ITERATION_COUNT_MULTIPLIER,
                    TIME_FUDGE=TIME_FUDGE)
        put!(iteration_times_cache, iteration_times)
        push!(progress, 1)
        return state
    end

    dist::Distribution = fit(InverseGaussian, v)

    return BenchmarkResult(ITERATION_COUNT_MULTIPLIER,
                           AUTHORITATIVE_TIME,
                           TIME_FUDGE,
                           dist)
end

function run_test(CONFIGURATIONS)::Vector{BenchmarkResult}
    instances = [
        begin
            ITERATION_MEAN_TIME = rand(dist_ITERATION_MEAN_TIME)
            ITERATION_CoV = rand(dist_ITERATION_CoV)
                
            ITERATION_STDEV = ITERATION_CoV * ITERATION_MEAN_TIME
                
            Normal(ITERATION_MEAN_TIME, ITERATION_STDEV)
        end
        for i in 1:NUM_BENCHMARK_INSTANCES]

    iteration_times_cache = Recycler{Vector{Float64}}()

    return tmapreduce(vcat, CONFIGURATIONS, tasks_per_thread=1) do configuration
        run_test_configuration(iteration_times_cache, configuration, view(instances, :))
    end
end

#AUTHORITATIVE_TIME = [10/100]
#ITERATION_COUNT_MULTIPLIER = [10]
#TIME_FUDGE = [1.4]

#AUTHORITATIVE_TIME = 1/100:1/100:20/100
#ITERATION_COUNT_MULTIPLIER = 2:1:20
#TIME_FUDGE = 1.0:0.1:2.0

AUTHORITATIVE_TIME = vcat([0.01/100, 0.1/100], 1/100:1/100:10/100)
ITERATION_COUNT_MULTIPLIER = 2:1:10
TIME_FUDGE = reverse(1.05:0.05:1.4)

CONFIGURATIONS = [(a,b,c) for a in AUTHORITATIVE_TIME
                          for b in ITERATION_COUNT_MULTIPLIER
                          for c in TIME_FUDGE]
NUM_CONFIGURATIONS = length(CONFIGURATIONS)

progress = Channel{Union{UInt8,Nothing}}(nthreads()^2, spawn=true) do ch
    p = Progress(NUM_BENCHMARK_INSTANCES*NUM_CONFIGURATIONS; dt=1.0)

    if false
        progress_refresh = Channel{Nothing}(0, spawn=true) do ch
            while true
                state = timedwait(() -> isready(ch), 1.0, pollint=0.1)
                if state == :timed_out
                    next!(p, step=0, force=true)
                elseif state == :ok
                    @assert take!(ch) === nothing
                    return
                end
            end
        end    
    end
    
    while (step = take!(ch)) !== nothing
        next!(p, step=convert(Int, step))
    end
    #push!(progress_refresh, nothing)
end;

@timev R = run_test(CONFIGURATIONS)
@show R
push!(progress, nothing)

A = [r.ITERATION_COUNT_MULTIPLIER for r in R]
B = [r.AUTHORITATIVE_TIME for r in R]
C = [r.TIME_FUDGE for r in R]
V = [r.time_overspent for r in R]

#--

ITERATION_COUNT_MULTIPLIER = collect(ITERATION_COUNT_MULTIPLIER)
AUTHORITATIVE_TIME = collect(AUTHORITATIVE_TIME)
TIME_FUDGE = collect(TIME_FUDGE)

V_2D::Vector{Matrix{Any}} = fill(fill(missing, (length(AUTHORITATIVE_TIME), length(TIME_FUDGE))), length(ITERATION_COUNT_MULTIPLIER))
idx(v,s) = findfirst(x->x==v, s)
for r in R
    v = r.time_overspent
    V_2D[idx(r.ITERATION_COUNT_MULTIPLIER, ITERATION_COUNT_MULTIPLIER)][idx(r.AUTHORITATIVE_TIME, AUTHORITATIVE_TIME), idx(r.TIME_FUDGE, TIME_FUDGE)] = v
end

V_2D_val = [ quantile.(m, p(3)) for m in V_2D ]
V_2D_std = [ std.(m) for m in V_2D ]

function normalize(m, min, max)
    m = m .- min
    max -= min
    m = m ./ max
    return m
end

function normbounds(m)
    b = extrema.(m)
    q = extrema([(b...)...])
    return normalize.(b, q...)
end

normbounds_val = normbounds(V_2D_val)
normbounds_std = normbounds(V_2D_std)

print("\n")

#--

using ColorSchemes
using Plots
using Plots.PlotMeasures

plotlyjs()

function cntr(v, normbounds, digits, stepmult, color, title)
    best = findmin(v)
   
    color = cgrad([get(color, p) for p in normbounds])

    step = 10.0^(-digits)/stepmult
    levels = round(minimum(v), RoundDown, digits=digits)-step:step:round(maximum(v), RoundUp, digits=digits)+step

    c = contour(AUTHORITATIVE_TIME,TIME_FUDGE,transpose(v), title=title, levels=levels, contour_labels=true, fill=true, color=color, alpha = 0.25, bottom_margin=40px)
    c = scatter!([AUTHORITATIVE_TIME[best[2][1]]],[TIME_FUDGE[best[2][2]]], mark=:diamond, color=:gold, legend=false)
    return c
end

plots = []
for i in 1:length(ITERATION_COUNT_MULTIPLIER)
    push!(plots, cntr(V_2D_val[i], normbounds_val[i], 1, 2, cgrad(:diverging_gwr_55_95_c38_n256), "+3σ, ITERATION_COUNT_MULTIPLIER="*string(ITERATION_COUNT_MULTIPLIER[i])))
    push!(plots, cntr(V_2D_std[i], normbounds_std[i], 2, 1, cgrad(:diverging_bky_60_10_c30_n256), "std, ITERATION_COUNT_MULTIPLIER="*string(ITERATION_COUNT_MULTIPLIER[i])))
end

# size = (1400,400*length(ITERATION_COUNT_MULTIPLIER))
size = (1920, 4*1080)
plot(plots...,
     size=size, layout = grid(length(ITERATION_COUNT_MULTIPLIER), 2),
     xlabel="AUTHORITATIVE_TIME", ylabel="TIME_FUDGE (x)",
     xticks=AUTHORITATIVE_TIME, yticks=TIME_FUDGE)

(We are currently at top-right corner of the bottom graph)

One thing that i haven't thought about is, there may be costly setup code
outside of the benchmarked section, and its timings are not at all modelled here.
Without accounting for that, decreasing ITERATION_COUNT_MULTIPLIER is perhaps a footgun.

@ruben-laso
Copy link
Author

ruben-laso commented Sep 12, 2024

One thing that i haven't thought about is, there may be costly setup code outside of the benchmarked section, and its timings are not at all modelled here. Without accounting for that, decreasing ITERATION_COUNT_MULTIPLIER is perhaps a footgun.

@LebedevRI that's a very good point; in that case, the aim should be to reduce the number of times PredictNumItersNeeded() is called, right?

I implemented this idea:

Instead of multiplying by 10 "blindly", what about making an estimation of the iterations to get to, let's say 15%? In that case, you "wasted" only ~15% of CPU time, but it is big enough to be considered "significant" in the current algorithm and make a better prediction later (reducing the 40% overestimation to something smaller).

3-phase approach to determine the number of iterations needed.

  1. Multiply by 10 while we are below 1% of the min time to reach a "semi-significant time".
  2. Use "semi-significant time" to find the iterations needed to reach 10% of the min time (+5% buffer) -> "significant time".
  3. Use "significant time" to find the iterations to reach 100% of the min time (+5% buffer).
New implementation of `PredictNumItersNeeded()`
IterationCount BenchmarkRunner::PredictNumItersNeeded(
    const IterationResults& i) const {
  // Try a 3-phase approach to determine the number of iterations needed.
  // 1. Multiply by 10 while we are below 1% of the min time -> semi-significant
  // time.
  // 2. Find the iterations needed to reach 10% of the min time (+5% buffer) ->
  // significant time.
  // 3. Find the iterations to reach 100% of the min time + (5% buffer)

  static constexpr auto kSemiSignificantTime = 0.01;
  static constexpr auto kSignificantTime = 0.10;
  static constexpr auto kSafetyBuffer = 0.05;

  const auto significance = i.seconds / GetMinTimeToApply();

  // See how much "iters" should be increased by.
  double multiplier = 1.0;

  if (significance < kSemiSignificantTime) {
    // 1. Multiply by 10 while we are below 1% of the min time.
    multiplier = 10.0;
  } else if (significance < kSignificantTime) {
    // 2. Find the number of iterations needed to reach 10% of the min time.
    // Use +5% buffer to avoid running short.
    static constexpr auto scale = kSignificantTime + kSafetyBuffer;
    multiplier = (GetMinTimeToApply() * scale) / std::max(i.seconds, 1e-9);
  } else {
    // 3. Find the number of iterations to get to the 100% of the min time.
    // Use +5% buffer to avoid running short and needing to run again.
    static constexpr auto scale = 1.0 + kSafetyBuffer;
    multiplier = (GetMinTimeToApply() * scale) / std::max(i.seconds, 1e-9);
  }

  // So what seems to be the sufficiently-large iteration count? Round up.
  const IterationCount max_next_iters = static_cast<IterationCount>(
      std::llround(std::max(multiplier * static_cast<double>(i.iters),
                            static_cast<double>(i.iters) + 1.0)));
  // But we do have *some* limits though..
  const IterationCount next_iters = std::min(max_next_iters, kMaxIterations);

  BM_VLOG(3) << "Next iters: " << next_iters << ", " << multiplier << "\n";
  return next_iters;  // round up before conversion to integer.
}

With the previous benchmark, it seems that the overhead is reduced from 50-150% to ~20%.

------------------------------------------------------------------------------------------------
Benchmark                 Time             CPU   Iterations LastRunTime TimesCalled TotalRunTime
------------------------------------------------------------------------------------------------
VectorLookup/10         379 ns          379 ns      2755924     1.04467           8       1.2279
VectorLookup/20        2783 ns         2783 ns       361921     1.00727           7       1.1888
VectorLookup/30        8494 ns         8494 ns       123341     1.04762           7      1.29101
VectorLookup/40       20736 ns        20736 ns        49884      1.0344           6      1.20582
VectorLookup/50       39235 ns        39235 ns        27109     1.06364           6      1.25163
VectorLookup/60       68139 ns        68136 ns        15541     1.05895           6      1.28128
VectorLookup/70      104222 ns       104219 ns         9835     1.02502           5      1.19017
VectorLookup/80      160129 ns       160127 ns         6548     1.04853           5      1.21012
VectorLookup/90      218037 ns       218035 ns         4742     1.03394           5      1.20603
VectorLookup/100     305099 ns       305098 ns         3424     1.04466           5      1.22355
VectorLookup/110     401033 ns       401023 ns         2644     1.06033           5      1.25001
VectorLookup/120     540107 ns       540099 ns         2010     1.08562           5      1.28743
VectorLookup/130     652292 ns       652282 ns         1595     1.04041           5      1.26183
VectorLookup/140     835440 ns       835427 ns         1268     1.05934           5      1.29917
VectorLookup/150    1262388 ns      1262372 ns          808     1.02001           4      1.18841
VectorLookup/160    1380183 ns      1380157 ns          745     1.02824           4      1.18951
VectorLookup/170    1454011 ns      1453913 ns          707     1.02799           4      1.19032
VectorLookup/180    1740274 ns      1740252 ns          589     1.02502           4      1.19431
VectorLookup/190    2029587 ns      2029546 ns          510     1.03509           4       1.2063
VectorLookup/200    2393218 ns      2393183 ns          432     1.03387           4      1.20711

I would even say that step 1 could use the same kind of approach as the other steps, to reduce the number of times the iteration count has to be recalculated.
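That last idea could be sketched by having every phase extrapolate toward its own target instead of phase 1 multiplying blindly. The constants mirror the earlier snippet, but this is an illustration with assumed names, not a patch against the library:

```cpp
#include <algorithm>

// Illustrative sketch: each phase extrapolates toward its own target
// (with the same +5% buffer), so even the warm-up phase can converge in
// one step when the timing estimate is good enough.
double NextMultiplier(double seconds, double min_time) {
    constexpr double kSemiSignificantTime = 0.01;
    constexpr double kSignificantTime = 0.10;
    constexpr double kSafetyBuffer = 0.05;
    const double significance = seconds / min_time;
    double target = 0.0;
    if (significance < kSemiSignificantTime)
        target = kSemiSignificantTime * (1.0 + kSafetyBuffer);  // aim for ~1%
    else if (significance < kSignificantTime)
        target = kSignificantTime + kSafetyBuffer;              // aim for ~15%
    else
        target = 1.0 + kSafetyBuffer;                           // aim for ~105%
    return (min_time * target) / std::max(seconds, 1e-9);
}
```

The obvious risk is that extrapolating from a single, very short first run is noisy, so the 1% target in phase 1 would need to be validated against unstable timings.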
