Commit messages
This change improves memory usage slightly, at virtually no CPU cost.
When it happens, this might cause a slowdown on fast-path operations; however,
such cases are very rare.
In some rare cases (an older compiler, e.g. gcc 4.2 on MIPS), 8-bit atomics
might be unavailable. Detect such cases so that we can work around them.
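A build-time probe for this can be as simple as trying to compile a
translation unit that exercises 8-bit atomics; the sketch below uses C11
<stdatomic.h> purely for illustration and is not jemalloc's actual configure
test.

```c
/* Probe: can the toolchain compile and link 8-bit atomic operations?
 * A build system would compile this file and treat failure as
 * "8-bit atomics unavailable". */
#include <stdatomic.h>
#include <stdint.h>

int
main(void) {
    _Atomic uint8_t x = 0;
    atomic_store_explicit(&x, 1, memory_order_release);
    uint8_t y = atomic_load_explicit(&x, memory_order_acquire);
    atomic_fetch_add_explicit(&x, y, memory_order_relaxed);
    return (int)atomic_load_explicit(&x, memory_order_acquire);
}
```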
This regression was introduced by
3d29d11ac2c1583b9959f73c0548545018d31c8a (Clean compilation -Wextra).
These macros have been unused since
d4ac7582f32f506d5203bea2f0115076202add38 (Introduce a backport of C11
atomics).
This fixes a build failure when integrating with FreeBSD's libc. This
regression was introduced by d1e11d48d4c706e17ef3508e2ddb910f109b779f
(Move tsd link and in_hook after tcache.).
This adds some overhead to the tcache flush path (one of the hot paths), so
guard it behind a config option.
The keyword "huge" tends to remind people of huge pages, which are not
relevant to this feature.
This feature uses a dedicated arena to handle huge requests, which
significantly reduces VM fragmentation. In the production workloads we tested,
it often reduces VM size by more than 30%.
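For reference, this kind of routing would typically be enabled through a
threshold option in malloc_conf; the sketch below assumes the post-rename
option spelling (oversize_threshold, see the entry above), and the 8 MiB value
is purely illustrative.

```c
/* Illustrative only: route allocations larger than 8 MiB to the dedicated
 * arena. The option name "oversize_threshold" is an assumption here. */
const char *malloc_conf = "oversize_threshold:8388608";
/* The same string could instead be supplied via the MALLOC_CONF
 * environment variable. */
```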
For low arena count settings, the huge threshold feature may trigger unwanted
background thread creation. Given that the huge arena does eager purging by
default, bypass background thread creation when initializing the huge arena.
When custom extent_hooks or transparent huge pages are in use, the purging
semantics may change, which means we may not get zeroed pages on
repopulation. Fix the issue by manually memset()ing the pages in such cases.
Add extent_arena_ind_get() to avoid loading the actual arena pointer when we
only need to check whether the arena matches.
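A minimal sketch of the idea, with stand-in types rather than jemalloc's real
definitions: arena matching becomes an integer compare on the index stored in
the extent, with no arena pointer load.

```c
#include <stdbool.h>

/* Stand-in for the real extent metadata. */
typedef struct {
    unsigned arena_ind; /* index of the owning arena */
} extent_t;

static inline unsigned
extent_arena_ind_get(const extent_t *extent) {
    return extent->arena_ind;
}

/* Matching check: no arena pointer dereference, just an integer compare. */
static inline bool
extent_in_arena(const extent_t *extent, unsigned arena_ind) {
    return extent_arena_ind_get(extent) == arena_ind;
}
```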
This avoids having to choose a bin shard on the fly, and will also allow
flexible bin binding for each thread.
The option uses the same format as "slab_sizes" to specify the number of
shards for each bin size.
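A hypothetical setting is shown below; the option name (bin_shards) and the
exact range syntax are assumptions for illustration, not a documented
interface.

```c
/* Hypothetical: 8 shards for bins up to 160 bytes, 4 shards for bins from
 * 161 to 4096 bytes, using slab_sizes-style size ranges. */
const char *malloc_conf = "bin_shards:1-160:8|161-4096:4";
```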
This makes it possible to have multiple sets of bins in an arena, which
improves arena scalability, because the bins (especially the small ones) are
always the limiting factor in production workloads.
A bin shard is picked on allocation; each extent tracks the bin shard id for
deallocation. The number of shards is determined via runtime options.
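One simple way to realize "pick a shard on allocation" is sketched below; it
is purely illustrative and not jemalloc's actual policy. Each extent would
then record the chosen shard index so that deallocation returns memory to the
same shard.

```c
#include <stdint.h>

/* Map a per-thread seed to a shard index with a cheap multiplicative hash,
 * so a given thread consistently uses the same shard of each bin.
 * Assumes n_shards > 0. */
static inline unsigned
bin_shard_pick(uint64_t thread_seed, unsigned n_shards) {
    uint64_t h = thread_seed * 0x9E3779B97F4A7C15ULL; /* golden-ratio hash */
    return (unsigned)(h >> 33) % n_shards;
}
```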
If there are 3 or more threads spin-waiting on the same mutex, there will be
excessive exclusive cacheline contention, because pthread_mutex_trylock()
immediately tries to CAS in a new value instead of first checking whether the
lock is already held.
This diff adds a 'locked' hint flag; we only spin-wait, without calling
trylock(), while it is set. I don't know of any other portable way to get the
same behavior as pthread_mutex_lock().
This is pretty easy to test via ttest, e.g.
./ttest1 500 3 10000 1 100
Throughput is nearly 3x as fast.
This blames to the mutex profiling changes; however, we almost never have 3 or
more threads contending in properly configured production workloads. It is
still worth fixing.
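A minimal sketch of the approach, using C11 atomics instead of jemalloc's
malloc_mutex internals: waiters spin on a plain load of the 'locked' hint and
only attempt the atomic acquire (the trylock analogue) once the lock looks
free.

```c
#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    atomic_bool locked; /* hint: true while some thread holds the lock */
} hint_lock_t;

static bool
hint_lock_try(hint_lock_t *l) {
    bool expected = false;
    return atomic_compare_exchange_strong_explicit(&l->locked, &expected,
        true, memory_order_acquire, memory_order_relaxed);
}

static void
hint_lock_acquire(hint_lock_t *l) {
    while (!hint_lock_try(l)) {
        /* Read-only spin: waiters share the cacheline instead of bouncing
         * it with repeated CAS attempts. A real implementation would bound
         * this spin and fall back to blocking on the underlying mutex. */
        while (atomic_load_explicit(&l->locked, memory_order_relaxed)) {
            /* spin */
        }
    }
}

static void
hint_lock_release(hint_lock_t *l) {
    atomic_store_explicit(&l->locked, false, memory_order_release);
}
```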
The setting has been tested in production for a while. No negative effects
were observed, and we were able to reduce the number of threads per process.
Also adds a configure.ac check for __builtin_popcount, which is used
in the new fastpath.
Also catch invalid tcache id.
Add a cache_bin_dalloc_easy (to match the alloc_easy function),
and use it in tcache_dalloc_small. It will also be used in the
new free fastpath.
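An illustrative shape for such a helper is sketched below with stand-in
types; jemalloc's real cache_bin layout differs.

```c
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    void **avail;       /* array of cached pointers */
    size_t ncached;     /* current number of cached pointers */
    size_t ncached_max; /* capacity */
} toy_cache_bin_t;

/* Returns true on success; false means the bin is full and the caller must
 * flush / fall back to the slower deallocation path. */
static inline bool
toy_cache_bin_dalloc_easy(toy_cache_bin_t *bin, void *ptr) {
    if (bin->ncached == bin->ncached_max) {
        return false;
    }
    bin->avail[bin->ncached++] = ptr;
    return true;
}
```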
For a free fastpath, we want something that will not make additional
calls. Assume most free() calls will hit the L1 cache, and use
a custom rtree function for this.
Additionally, roll the ptr == NULL check into the rtree cache check.
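A toy version of that shape (not jemalloc's rtree code): the cached key is a
page-aligned address and empty slots hold a key that no valid pointer can
produce, so a cache miss and ptr == NULL fail the same single comparison.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uintptr_t key;   /* page-aligned address of the cached mapping */
    unsigned  szind; /* cached size-class index */
} toy_rtree_cache_slot_t;

/* Empty slots are initialized to a key no page-aligned pointer can equal. */
#define TOY_CACHE_KEY_INVALID ((uintptr_t)1)

static inline bool
toy_cache_lookup_fast(const toy_rtree_cache_slot_t *slot, const void *ptr,
    unsigned *szind) {
    /* Assume 4 KiB pages for this toy. ptr == NULL yields key == 0, which
     * is never cached, so the NULL check and the miss check share this one
     * branch. */
    uintptr_t key = (uintptr_t)ptr & ~((uintptr_t)0xfff);
    if (slot->key != key) {
        return false; /* slow path: cache miss, or freeing NULL */
    }
    *szind = slot->szind;
    return true;
}
```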
Nearly all 32-bit powerpc hardware treats lwsync as sync, and some cores
(Freescale e500) trap lwsync as an illegal instruction, which then gets
emulated in the kernel. To avoid unnecessary traps on the e500, use
sync on all 32-bit powerpc. This pessimizes 32-bit software running on
64-bit hardware, but such configurations should be rare.
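A sketch of the resulting choice; the macro name is illustrative, not
jemalloc's.

```c
/* Memory barrier on PowerPC: 64-bit targets can use lwsync where
 * acquire/release ordering suffices, while 32-bit targets always emit sync
 * because some 32-bit cores (e.g. Freescale e500) trap lwsync. */
#if defined(__powerpc64__)
#  define PPC_MEMBAR() __asm__ __volatile__ ("lwsync" ::: "memory")
#elif defined(__powerpc__)
#  define PPC_MEMBAR() __asm__ __volatile__ ("sync" ::: "memory")
#endif
```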
The diff 'refactor prof accum...' moved the bytes_until_sample
subtraction before the load of tdata. If tdata is null,
tdata_get(true) will overwrite bytes_until_sample, but we
still sample the current allocation. Instead, do the subtraction
and check logic again, to keep the previous behavior.
blame-rev: 0ac524308d3f636d1a4b5149fa7adf24cf426d9c
For the fastpath, we want to tick, but undo the tick and jump to the slowpath
if the ticker would fire.
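A minimal sketch of the "tick, but back out if it would fire" idea, with a
plain struct standing in for jemalloc's ticker.

```c
#include <stdbool.h>

typedef struct {
    int tick;   /* counts down to the next event */
    int nticks; /* reload value, applied by the slow path when it fires */
} toy_ticker_t;

/* Fast-path variant: returns true if the ticker would fire; in that case the
 * tick is undone and the caller jumps to the slow path, which performs the
 * real tick-and-fire. */
static inline bool
toy_ticker_trytick_fast(toy_ticker_t *t) {
    t->tick--;
    if (t->tick < 0) {
        t->tick++; /* undo */
        return true;
    }
    return false;
}
```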
This commit concatenates `JEMALLOC_VERSION_GID` onto the `smallocx` symbol
name, such that the symbol ends up exported as `smallocx_{git_hash}`.
The experimental `smallocx` API is not exposed via header files, requiring
users to peek at `jemalloc`'s source code to manually add the external
declarations to their own programs.
This should reinforce that `smallocx` is experimental, and that `jemalloc`
does not offer any kind of backwards compatibility or ABI guarantees for it.
---
Motivation:
This new experimental memory-allocation API returns a pointer to
the allocation as well as the usable size of the allocated memory
region.
The `s` in `smallocx` stands for `sized`-`mallocx`, attempting to
convey that this API returns the size of the allocated memory region.
It should allow C++ P0901r0 [0] and Rust Alloc::alloc_excess to make
use of it.
The main purpose of these APIs is to improve telemetry. It is more accurate
to register `smallocx(size, flags)` than `smallocx(nallocx(size), flags)`,
for example. The latter will always line up perfectly with the existing
size classes, causing a loss of telemetry information about the internal
fragmentation induced by potentially poor size-class choices.
Instrumenting `nallocx` does not help much since user code can cache its
result and use it repeatedly.
---
Implementation:
The implementation adds a new `usize` option to `static_opts_s` and a `usize`
variable to `dynamic_opts_s`. These are then used to cache the result of
`sz_index2size` and similar functions in the code paths in which they are
unconditionally invoked. In the code paths in which these functions are not
unconditionally invoked, `smallocx`, as opposed to `mallocx`, calls them
explicitly.
---
[0]: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0901r0.html
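A hypothetical caller-side sketch: since no header provides a declaration,
the return type below is written out as an assumption based on the
description above, and the `_{git_hash}` suffix on the exported name is
omitted.

```c
#include <stddef.h>
#include <stdlib.h>

/* Assumed shape of the experimental API; the caller supplies the
 * declaration by hand. */
typedef struct {
    void  *ptr;  /* the allocation */
    size_t size; /* its usable size (>= the requested size) */
} smallocx_return_t;

extern smallocx_return_t smallocx(size_t size, int flags);

static size_t
example(void) {
    smallocx_return_t r = smallocx(100, 0);
    /* Telemetry can record the requested 100 bytes while the caller still
     * learns the real usable size without a separate nallocx() call. */
    size_t usable = r.size;
    free(r.ptr);
    return usable;
}
```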
Enables generation of `sub bytes_until_sample, usize; je` for the x86 arch:
the subtraction is unconditional, and only the flags are checked for the jump,
so no extra compare is necessary. This also reduces register pressure.
to load tdata now, avoiding several branches.
Combine the branch that checks for an empty cache_bin with the branch that
checks the low watermark.
There's an optimizer bug upstream that results in test failures; reported at
https://bugzilla.redhat.com/show_bug.cgi?id=1619354. This works around the
failure reported at https://github.com/jemalloc/jemalloc/issues/1307.
This can be useful in situations where readlink is disallowed.
- Show the number and bytes of extents of each size that are dirty, muzzy, or retained.
- prof_opt_log flag starts logging automatically at runtime
- prof_log_{start,stop} mallctl for manual control (a usage sketch follows)
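A hypothetical usage sketch, assuming the controls are exposed to mallctl as
"prof.log_start" and "prof.log_stop"; the exact names and argument handling
are assumptions.

```c
#include <jemalloc/jemalloc.h>

/* Bracket a workload of interest with manual log start/stop. Whether
 * "prof.log_start" accepts an output filename via newp is not shown; both
 * calls are issued with no arguments here. */
static void
profile_log_region(void (*workload)(void)) {
    mallctl("prof.log_start", NULL, NULL, NULL, 0);
    workload();
    mallctl("prof.log_stop", NULL, NULL, NULL, 0);
}
```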