Avoid frequent memory allocation/deallocation by memory pool #20

Open
jserv opened this issue Mar 18, 2018 · 16 comments
@jserv
Member

jserv commented Mar 18, 2018

The current PoW internals call malloc and free frequently, which hurts performance. A memory pool is a common technique to speed up allocation and make execution time more consistent.

I have a preliminary memory pool implementation: https://github.com/jserv/dcurl/tree/memory-pool
NOTE: we might have to deal with thread-safety issues, and it is worth checking out existing implementations such as philip-wernersbach/memory-pool-allocator.
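
As a rough illustration of the technique (not the code in the memory-pool branch), a fixed-size block pool can be sketched as below. The names block_pool_t, block_pool_init, block_pool_alloc, block_pool_free and block_pool_destroy are hypothetical. The sketch assumes one pool per PoW thread, which sidesteps locking. A pool shared between threads would need a mutex or a lock-free free list.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical fixed-size block pool; assumes one pool per thread (no locking). */
typedef struct {
    void *arena;     /* single backing allocation made at init time */
    void *free_list; /* singly linked list threaded through the free blocks */
} block_pool_t;

/* Carve `count` blocks of `block_size` bytes out of one malloc call. */
int block_pool_init(block_pool_t *p, size_t block_size, size_t count)
{
    /* A free block stores the "next" pointer in its first bytes. */
    if (block_size < sizeof(void *) || block_size % sizeof(void *) != 0)
        return -1;
    if (count == 0 || count > SIZE_MAX / block_size)
        return -1;

    p->arena = malloc(block_size * count);
    if (!p->arena)
        return -1;

    p->free_list = NULL;
    for (size_t i = 0; i < count; i++) {
        void *blk = (char *) p->arena + i * block_size;
        *(void **) blk = p->free_list; /* push block onto the free list */
        p->free_list = blk;
    }
    return 0;
}

/* Pop one block; returns NULL when the pool is exhausted. */
void *block_pool_alloc(block_pool_t *p)
{
    void *blk = p->free_list;
    if (blk)
        p->free_list = *(void **) blk;
    return blk;
}

/* Push a block back; `blk` must have come from this pool. */
void block_pool_free(block_pool_t *p, void *blk)
{
    *(void **) blk = p->free_list;
    p->free_list = blk;
}

/* Release the backing allocation. */
void block_pool_destroy(block_pool_t *p)
{
    free(p->arena);
    p->arena = NULL;
    p->free_list = NULL;
}
```

Allocation and release are each a couple of pointer operations with no system calls, which is where the more consistent timing is expected to come from.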

@jserv
Member Author

jserv commented Mar 18, 2018

After applying enable-rdtsc.patch, I got the following time-stamp numbers:

*** Validating build/test_trinary ***
=== trits_from_trytes: 42320 ===
=== trytes_from_trits: 5103 ===

*** Validating build/test_curl ***
=== trits_from_trytes: 220208 ===
=== trytes_from_trits: 3171 ===

*** Validating build/test_pow_sse ***
=== trits_from_trytes: 76903 ===
=== trits_from_trytes: 32099 ===
=== trytes_from_trits: 2245 ===
=== trits_from_trytes: 33132 ===
=== trytes_from_trits: 2674 ===
=== trits_from_trytes: 2651 ===
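
The contents of enable-rdtsc.patch are not shown in this thread. As a hedged sketch of how such time-stamp numbers can be collected, the helper below reads the time-stamp counter around a single malloc call. The names read_ticks and malloc_ticks are hypothetical. It assumes GCC or Clang, where `__rdtsc()` is provided by `<x86intrin.h>`, and falls back to clock_gettime on non-x86 targets.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <stdint.h>
#include <stdlib.h>

#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>
/* Read the x86 time-stamp counter (GCC/Clang intrinsic). */
static inline uint64_t read_ticks(void)
{
    return __rdtsc();
}
#else
#include <time.h>
/* Fallback for non-x86 targets: monotonic clock in nanoseconds. */
static inline uint64_t read_ticks(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t) ts.tv_sec * 1000000000ull + (uint64_t) ts.tv_nsec;
}
#endif

/* Hypothetical helper: ticks spent in one malloc call of `size` bytes. */
uint64_t malloc_ticks(size_t size)
{
    uint64_t start = read_ticks();
    void *volatile p = malloc(size); /* volatile keeps the call from being elided */
    uint64_t end = read_ticks();
    free(p);
    return end - start;
}
```

A single reading like this is noisy, so many samples are needed before drawing conclusions.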

@jserv
Member Author

jserv commented Mar 19, 2018

To illustrate the impact of the memory allocator, TCMalloc is used for comparison. The test environment is an Intel Xeon E5 class server running Ubuntu Linux 17.04.

First, install TCMalloc: $ sudo apt install libtcmalloc-minimal4

  • without TCMalloc
$ make check
*** Validating build/test_trinary ***
=== trits_from_trytes: 5460 ===
=== trytes_from_trits: 4286 ===

*** Validating build/test_curl ***
=== trits_from_trytes: 120820 ===
=== trytes_from_trits: 3940 ===

*** Validating build/test_pow_sse ***
=== trits_from_trytes: 68535 ===
=== trits_from_trytes: 61490 ===
=== trytes_from_trits: 1221 ===
=== trits_from_trytes: 31277 ===
=== trytes_from_trits: 1617 ===
=== trits_from_trytes: 2668 ===
  • with TCMalloc
$ LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so.4 make check
*** Validating build/test_trinary ***
=== trits_from_trytes: 29244 ===
=== trytes_from_trits: 3920 ===

*** Validating build/test_curl ***
=== trits_from_trytes: 108290 ===
=== trytes_from_trits: 2940 ===

*** Validating build/test_pow_sse ***
=== trits_from_trytes: 79500 ===
=== trits_from_trytes: 57566 ===
=== trytes_from_trits: 3028 ===
=== trits_from_trytes: 31777 ===
=== trytes_from_trits: 1654 ===
=== trits_from_trytes: 1980 ===

dcurl would benefit from a pre-allocated memory pool, especially when its size is tuned for the trytes/trits representations.

@furuame
Member

furuame commented Mar 19, 2018

It is worth mentioning that the heap memory each PoW task allocates is fixed in size. Maybe we can implement a special memory pool for allocating trytes and trits.

@furuame furuame added the enhancement New feature or request label Mar 22, 2018
@furuame furuame self-assigned this Aug 4, 2018
@furuame
Member

furuame commented Aug 7, 2018

In our scenario, the memory each PoW task (thread) uses is fixed in size and the variables can be reused. I think we can declare all the variables in advance rather than allocating them from a memory pool every time.
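
A minimal sketch of that idea, assuming the buffers are sized for one IOTA transaction (2673 trytes, 3 trits per tryte). The type pow_ctx_t and the function pow_ctx_load are hypothetical names, not part of dcurl.

```c
#include <stdint.h>
#include <string.h>

#define TX_TRYTES 2673           /* assumed: trytes in one IOTA transaction */
#define TX_TRITS (TX_TRYTES * 3) /* 3 trits per tryte */

/* Hypothetical per-task context: every buffer is declared up front. */
typedef struct {
    char trytes[TX_TRYTES + 1]; /* NUL-terminated input trytes */
    int8_t trits[TX_TRITS];     /* scratch space for the trit form */
} pow_ctx_t;

/* Reuse an existing context for a new task; no heap allocation involved. */
int pow_ctx_load(pow_ctx_t *ctx, const char *trytes)
{
    size_t len = strlen(trytes);
    if (len != TX_TRYTES)
        return -1; /* reject input that does not fit the fixed buffers */
    memcpy(ctx->trytes, trytes, len + 1);
    memset(ctx->trits, 0, sizeof ctx->trits);
    return 0;
}
```

Each PoW thread could own one such context (for example as a `_Thread_local` variable) and reload it for every task.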

@marktwtn
Collaborator

The tool heaptrack can show information about dynamic memory allocation, such as:

  • number of allocations
  • bytes allocated
  • memory leaks

[Screenshot: 2018-10-12 10-08-27]

This information gives us a blueprint for the memory pool design.
It also helps us determine the size of the memory pool.

@jserv
Member Author

jserv commented Oct 13, 2018

Dynamic memory allocation tends to be non-deterministic. Is it possible to eliminate the existing dynamic allocation inside dcurl?

@marktwtn
Collaborator

Dynamic memory allocation tends to be non-deterministic. Is it possible to eliminate the existing dynamic allocation inside dcurl?

Yes, we can reduce the dynamic allocation to a single call, or even use a declared char array as the memory pool.
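
The char-array variant can be sketched as a simple bump allocator over a static buffer. This is only an illustration: POOL_BYTES, bump_alloc and bump_reset are hypothetical names, the 4096-byte size is arbitrary, and the sketch is not thread-safe.

```c
#include <stddef.h>

#define POOL_BYTES 4096 /* arbitrary size chosen for illustration */

/* Statically declared pool: no malloc at all. */
static _Alignas(max_align_t) unsigned char pool_buf[POOL_BYTES];
static size_t pool_used;

/* Hand out the next `n` bytes, rounded up to keep every block aligned. */
void *bump_alloc(size_t n)
{
    const size_t align = _Alignof(max_align_t);
    size_t rounded = (n + align - 1) & ~(align - 1);

    /* Reject overflow in the rounding and requests that do not fit. */
    if (rounded < n || rounded > POOL_BYTES - pool_used)
        return NULL;

    void *p = pool_buf + pool_used;
    pool_used += rounded;
    return p;
}

/* Release everything at once, e.g. after one PoW task finishes. */
void bump_reset(void)
{
    pool_used = 0;
}
```

Individual blocks cannot be freed in this scheme. It fits workloads where all allocations of a task can be dropped together.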

@marktwtn
Collaborator

I have implemented a memory pool mechanism and integrated it into the SSE implementation of dcurl.

Here are the problems:

  1. Experiment result
    I ran test-pow, executing the PoW 100 times.
    The execution time does not differ much.
    The time stamp difference for allocating memory in the trits_from_trytes and trytes_from_trits functions may be even worse.

    To solve the problem
    (1) Use perf or gprof to analyze the memory pool code and improve the performance.
    (2) Run the program multiple times to see the execution time distribution.

  2. Allocation size
    Take SSE as an example.
    Most allocation sizes are fixed.
    However, some allocation sizes depend on the maximum thread number and maximum core number.
    I left these memory allocations unchanged.
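
For step (2) under "To solve the problem", the distribution of repeated measurements can be summarized with a small helper like the one below. The names sample_stats_t and summarize are hypothetical. The sketch reports the minimum, the (upper) median and the maximum of the collected samples.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Summary of a set of timing samples. */
typedef struct {
    uint64_t min, median, max;
} sample_stats_t;

/* qsort comparator for uint64_t that avoids overflow from subtraction. */
static int cmp_u64(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *) a;
    uint64_t y = *(const uint64_t *) b;
    return (x > y) - (x < y);
}

/* Sorts `samples` in place; `n` must be greater than zero.
 * For even `n` the upper of the two middle values is reported. */
sample_stats_t summarize(uint64_t *samples, size_t n)
{
    qsort(samples, n, sizeof *samples, cmp_u64);
    sample_stats_t s = {samples[0], samples[n / 2], samples[n - 1]};
    return s;
}
```

Comparing the median and the maximum of the malloc samples against those of the pool samples shows both the typical cost and the worst-case spikes.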

@marktwtn
Collaborator

marktwtn commented Oct 25, 2018

  1. Experiment result
    Forget about the execution time. It is not related to the memory pool.
    I use rdtsc to read the time stamp counter difference of each memory allocation.

    Sample points 0 ~ 150
    [Figure: timestamp0-150]

    Sample points 2000 ~ 2150
    [Figure: timestamp2000-2150]

    The graphs show the time stamp difference of allocating memory with the malloc function and with the memory pool when running PoW 100 times.
    The memory pool looks better than dynamic memory allocation.
    However, there is a strange peak for the memory pool.
    It happens when allocating 16 bytes of memory right after the PoW finishes.
    I am still looking for the reason for this weird behavior.

    The previous comment said the result was worse. That was caused by reading the time stamp counter value at the wrong line of the source code.

  2. Allocation size
    As noted in the previous comment, some allocations depend on the maximum thread and core numbers.
    If these numbers can be determined in advance, there would be no problem at all.

@jserv
Member Author

jserv commented Oct 25, 2018

rdtsc is not accurate for SMP.

@marktwtn
Collaborator

rdtsc is not accurate for SMP.

However, even when I use the clock_gettime function to measure the time difference, the result is still the same.

@marktwtn
Collaborator

marktwtn commented Nov 2, 2018

When I used an analysis tool such as perf, I found that the PoW part took most of the computation time.
Therefore, it was hard to see the behaviour of the other functions such as memory pool allocation.

However, the suggestion to empty the PoW function did not work properly, since the time stamp counter difference of each memory allocation is somehow affected by the PoW function.

@jserv
Member Author

jserv commented Nov 2, 2018

Ouch! It is a pity. I look forward to the migration to other memory allocators.

@marktwtn
Collaborator

marktwtn commented Nov 7, 2018

Since rdtsc can be affected by out-of-order execution and variable CPU clock frequency,
the measurement now uses the clock_gettime function instead.
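
A minimal sketch of such a measurement, assuming CLOCK_MONOTONIC (the thread does not say which clock was used). The names timespec_diff_ns and malloc_elapsed_ns are hypothetical.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

/* Difference end - start in nanoseconds. */
int64_t timespec_diff_ns(const struct timespec *start,
                         const struct timespec *end)
{
    return (int64_t) (end->tv_sec - start->tv_sec) * 1000000000LL +
           (int64_t) (end->tv_nsec - start->tv_nsec);
}

/* Hypothetical helper: wall time of one malloc call of `size` bytes. */
int64_t malloc_elapsed_ns(size_t size)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    void *volatile p = malloc(size); /* volatile keeps the call from being elided */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    free(p);
    return timespec_diff_ns(&t0, &t1);
}
```

The clock_gettime call has its own overhead, so very short operations are better measured in batches.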

The following charts were produced by running on different hardware, with and without the specific function transformXXX() commented out.

  • My desktop
    with transformXXX(): [Figure: cpu-desktop-nano]
    without transformXXX(): [Figure: cpu-desktop-nano-notrans]

  • My laptop
    with transformXXX(): [Figure: cpu-laptop-nano]
    without transformXXX(): [Figure: cpu-laptop-nano-notrans]

  • node.deviceproof.org (with DCURL_CPU_NUM=3)
    with transformXXX(): [Figure: cpu-device-nano-3core]
    without transformXXX(): [Figure: cpu-device-nano-notrans-3core]

Question and conclusion:
Commenting out the key function transformXXX() in PoW does reduce its impact on memory allocation time.
However, the reason is not clear. (I guess it is caused by the cache.)
Also, the memory allocation time is not stable, which means either the memory allocator is not good enough or something else in dcurl affects it.

I will keep investigating.

@jserv
Member Author

jserv commented Jan 31, 2019

After #95 is resolved, we can continue the memory pool work.

@jserv jserv added the feature Outstanding features we should implement label Feb 11, 2019
@jserv
Member Author

jserv commented Nov 4, 2019

Cc. @JulianaTa
