Commit message
|
Explicitly use iallocztm for internal allocations. ialloc could trigger arena
creation, which may cause lock order reversal (narenas_mtx and log_mtx).
|
This fixes a compiler warning.
|
We should allow a way to easily disable the feature (e.g. not reserving the
arena id at all).
|
This change improves memory usage slightly, at virtually no CPU cost.
|
When it happens, this might slow down the fast-path operations. However, such
cases are very rare.
|
Proposed fix for #1444: ensure that `tls_callback` in the `#pragma comment(linker)` directive gets the same prefix added as it does in the C declaration.
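For context, here is a minimal sketch of the pattern involved (illustrative only; the symbol names are hypothetical and this is not jemalloc's actual tsd code):

```c
#if defined(_MSC_VER)
#include <windows.h>

/* Register a TLS callback via the CRT's callback section and force the
 * linker to keep it with /INCLUDE. */
static void NTAPI example_tls_callback(PVOID handle, DWORD reason, PVOID reserved) {
	(void)handle; (void)reason; (void)reserved;
	/* per-thread setup / cleanup would go here */
}

#pragma const_seg(".CRT$XLB")
const PIMAGE_TLS_CALLBACK p_example_tls_callback = example_tls_callback;
#pragma const_seg()

/* The /INCLUDE name must carry the same prefix the compiler gives the C
 * declaration: a leading underscore on 32-bit (x86) builds, none on x64. */
#if defined(_M_IX86)
#pragma comment(linker, "/INCLUDE:_p_example_tls_callback")
#else
#pragma comment(linker, "/INCLUDE:p_example_tls_callback")
#endif
#endif /* _MSC_VER */
```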
|
Only triggers libgcc unwind init when prof is enabled. This helps work around
some bootstrapping issues.
|
When not using libdl, still allows background_thread to be enabled.
|
This adds some overhead to the tcache flush path (which is one of the
popular paths). Guard it behind a config option.
|
The keyword "huge" tends to remind people of huge pages, which are not relevant
to this feature.
|
This feature uses a dedicated arena to handle huge requests, which
significantly improves VM fragmentation. In the production workloads we tested,
it often reduces VM size by >30%.
|
The rate calculation for the total row was missing.
|
For low arena count settings, the huge threshold feature may trigger unwanted
background thread creation. Given that the huge arena does eager purging by
default, bypass background thread creation when initializing the huge arena.
|
When custom extent_hooks or transparent huge pages are in use, the purging
semantics may change, which means we may not get zeroed pages when repopulating.
Fix the issue by memsetting manually in such cases.
|
Add extent_arena_ind_get() to avoid loading the actual arena pointer when we
only need to check for an arena match.
|
With sharded bins, we may not flush all items from the same arena in one run.
Adjust the stats merging logic accordingly.
|
This avoids having to choose a bin shard on the fly, and will also allow
flexible bin binding for each thread.
|
The option uses the same format as "slab_sizes" to specify the number of shards
for each bin size.
|
This makes it possible to have multiple sets of bins in an arena, which improves
arena scalability, because the bins (especially the small ones) are always the
limiting factor in production workloads.
A bin shard is picked on allocation; each extent tracks the bin shard id for
deallocation. The number of shards will be determined using runtime options.
|
If there are 3 or more threads spin-waiting on the same mutex,
there will be excessive exclusive cacheline contention, because
pthread_mutex_trylock() immediately tries to CAS in a new value
instead of first checking whether the lock is already held.
This diff adds a 'locked' hint flag; we only spin-wait without
trylock()ing while it is set. I don't know of any other portable
way to get the same behavior as pthread_mutex_lock().
This is pretty easy to test via ttest, e.g.
./ttest1 500 3 10000 1 100
Throughput is nearly 3x as fast.
Git blame attributes this to the mutex profiling changes. We almost never
have 3 or more threads contending in properly configured production
workloads, but it is still worth fixing.
|
The setting has been tested in production for a while. No negative effects were
observed, and we were able to reduce the number of threads per process.
|
Also adds a configure.ac check for __builtin_popcount, which is used
in the new fastpath.
|
Refactor tcache_fill, introducing a new function arena_slab_reg_alloc_batch,
which fills multiple pointers from a slab.
There should be no functional changes here, but this allows future optimization
of reg_alloc_batch.
|
We may have a large number of pages with *zero set (since they are populated on
demand). Only check the first page to avoid paging in all of them.
|
Also catch invalid tcache id.
|
Add unsized and sized deallocation fastpaths. Similar to the malloc()
fastpath, this removes all frame manipulation for the majority of
free() calls. The performance advantages here are smaller than those
of the malloc() fastpath, but prod tests still show roughly half
a percent of improvement.
Stats and sampling are both supported (sdallocx needs a sampling check;
for rtree lookups, slab will only be set for unsampled objects).
We don't support flush; any flush requests go to the slowpath.
|
It was removed in 0771ff2cea6dc18fcd3f6bf452b4224a4e17ae38.
Add a comment explaining its purpose.
|
The regression was introduced in 3a1363b.
|
When destroying tcache, decay may not be triggered since tsd is non-nominal.
Explicitly decay to avoid pathological cases.
|
We eagerly coalesce large buffers when deallocating; however, the previous logic
around this introduced extra lock overhead: when coalescing, we always locked the
neighbors even if they were active, while for active extents nothing can be done.
This commit checks whether the neighbor extents are potentially active before
locking, and avoids locking if possible. This speeds up large_dalloc by ~20%.
It also fixes some undesired behavior: we could stop coalescing because a small
buffer was merged, while a large neighbor on the other side was ignored.
|
When retain is enabled, the default dalloc hook does nothing (since we avoid
munmap). But the overhead of preparing the call is high; specifically, the
extent de-registration and re-registration involve locking and extent / rtree
modifications. This diff bypasses the call when retain is enabled.
|
When overcommit is enabled, commit needs to be set when doing mmap(). The
regression was introduced in f80c97e.
|
This diff adds a fastpath that assumes size <= SC_LOOKUP_MAXCLASS and
that we hit the tcache. If either of these is false, we fall back to
the previous codepath (renamed 'malloc_default').
Crucially, we only tail call malloc_default, and with the same kind
and number of arguments, so that tail-call optimization kicks in for
both clang and gcc; malloc() therefore gets treated as a leaf function,
and there are *no* caller-saved registers. Previously malloc() contained
5 caller-saved registers on x64, resulting in at least 10 extra
memory-movement instructions.
In microbenchmarks this results in up to ~10% improvement in the malloc()
fastpath. In real programs, this is a ~1% CPU and latency improvement
overall.
|
This commit concatenates `JEMALLOC_VERSION_GID` onto the `smallocx` symbol
name, such that the symbol ends up exported as `smallocx_{git_hash}`.
|
The experimental `smallocx` API is not exposed via header files, requiring
users to peek at `jemalloc`'s source code to manually add the external
declarations to their own programs.
This should reinforce that `smallocx` is experimental, and that `jemalloc`
does not offer any kind of backwards compatibility or ABI guarantees for it.
|
---
Motivation:
This new experimental memory-allocation API returns a pointer to
the allocation as well as the usable size of the allocated memory
region.
The `s` in `smallocx` stands for `sized`-`mallocx`, attempting to
convey that this API returns the size of the allocated memory region.
It should allow C++ P0901r0 [0] and Rust `Alloc::alloc_excess` to make
use of it.
The main purpose of these APIs is to improve telemetry. It is more accurate
to register `smallocx(size, flags)` than `smallocx(nallocx(size), flags)`,
for example. The latter will always line up perfectly with the existing
size classes, causing a loss of telemetry information about the internal
fragmentation induced by potentially poor size-class choices.
Instrumenting `nallocx` does not help much, since user code can cache its
result and use it repeatedly.
---
Implementation:
The implementation adds a new `usize` option to `static_opts_s` and a `usize`
variable to `dynamic_opts_s`. These are then used to cache the result of
`sz_index2size` and similar functions in the code paths in which they are
unconditionally invoked. In the code paths in which they are not
unconditionally invoked, `smallocx`, unlike `mallocx`, calls these functions
explicitly.
---
[0]: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0901r0.html