Cache the extents on the stack to avoid redundant lookup overhead.
Compact extent_t to 128 bytes on 64-bit systems by moving
arena_slab_data_t's nfree into extent_t's e_bits.
Cacheline-align extent_t structures so that they always cross the
minimum number of cacheline boundaries.
Re-order extent_t fields such that all fields except the slab bitmap
(and overlaid heap profiling context pointer) are in the first
cacheline.
This resolves #461.
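As an illustration of the packing and alignment techniques described above, here is a minimal sketch (GCC/Clang attribute syntax); the field names, bit positions, and widths are assumptions for illustration, not jemalloc's actual e_bits layout.

    #include <stdint.h>
    #include <stddef.h>

    #define CACHELINE 64

    /* Hypothetical packed extent; only the packing/alignment idea matters. */
    typedef struct extent_s {
        /* e_bits packs several small fields (arena index, size class index,
         * slab flag, and the slab's nfree count) into one 8-byte word. */
        uint64_t e_bits;
        void *e_addr;   /* Base address. */
        size_t e_size;  /* Extent size. */
        /* ... remaining fields; slab bitmap / prof ctx pointer last ... */
    } __attribute__((aligned(CACHELINE))) extent_t;

    /* Assumed bit position and width for the nfree field. */
    #define EXTENT_NFREE_SHIFT 32
    #define EXTENT_NFREE_MASK (((uint64_t)0x3ff) << EXTENT_NFREE_SHIFT)

    static inline unsigned
    extent_nfree_get(const extent_t *extent) {
        return (unsigned)((extent->e_bits & EXTENT_NFREE_MASK) >>
            EXTENT_NFREE_SHIFT);
    }

    static inline void
    extent_nfree_set(extent_t *extent, unsigned nfree) {
        extent->e_bits = (extent->e_bits & ~EXTENT_NFREE_MASK) |
            ((uint64_t)nfree << EXTENT_NFREE_SHIFT);
    }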
Remove tree-structured bitmap support, in order to reduce complexity and
ease maintenance. No bitmaps larger than 512 bits have been necessary
since before 4.0.0, and there is no current plan that would increase
maximum bitmap size. Although tree-structured bitmaps were used on
32-bit platforms prior to this change, the overall benefits were
questionable (higher metadata overhead, higher bitmap modification cost,
marginally lower search cost).
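For context, a rough sketch of the flat (non-tree) bitmap search that remains after such a change: a linear scan over 64-bit groups using a GCC/Clang count-trailing-zeros builtin. This is illustrative only, not jemalloc's bitmap_t code.

    #include <stddef.h>
    #include <stdint.h>

    /* Return the index of the first unset bit in a flat bitmap of nbits bits,
     * or nbits if all bits are set. */
    static size_t
    flat_bitmap_ffu(const uint64_t *groups, size_t nbits) {
        size_t ngroups = (nbits + 63) / 64;
        for (size_t i = 0; i < ngroups; i++) {
            uint64_t unset = ~groups[i];
            if (unset != 0) {
                size_t index = i * 64 + (size_t)__builtin_ctzll(unset);
                return index < nbits ? index : nbits;
            }
        }
        return nbits;
    }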
Without ALWAYS_INLINE, sometimes ifree() gets compiled into its own function,
which adds overhead on the fast path.
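A sketch of the kind of annotation involved, assuming a GCC/Clang always_inline attribute behind the ALWAYS_INLINE name used above; the ifree() body here is just a placeholder.

    #ifdef __GNUC__
    #  define ALWAYS_INLINE static inline __attribute__((always_inline))
    #else
    #  define ALWAYS_INLINE static inline
    #endif

    /* Forcing inlining keeps ifree() from being emitted as an out-of-line
     * call on the deallocation fast path. */
    ALWAYS_INLINE void
    ifree(void *ptr) {
        (void)ptr;  /* ... fast-path deallocation ... */
    }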
Rather than iteratively checking all sufficiently large heaps during
search, maintain and use a bitmap in order to skip empty heaps.
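A rough sketch of the idea, assuming one bit per size-class heap with a set bit meaning "nonempty" (the names, the 64-heap limit, and the GCC/Clang builtin are assumptions; min_ind is assumed to be less than NPSIZES).

    #include <stdint.h>

    #define NPSIZES 64  /* Assumed number of size-class heaps (fits one word). */

    typedef struct extents_s {
        uint64_t bitmap;        /* Bit i set => heaps[i] is nonempty. */
        void *heaps[NPSIZES];   /* Per-size-class heaps (opaque here). */
    } extents_t;

    /* Return the index of the first nonempty heap >= min_ind, or NPSIZES if
     * none; the bitmap lets the search skip empty heaps entirely. */
    static unsigned
    extents_first_nonempty(const extents_t *extents, unsigned min_ind) {
        uint64_t candidates = extents->bitmap & (~UINT64_C(0) << min_ind);
        if (candidates == 0) {
            return NPSIZES;
        }
        return (unsigned)__builtin_ctzll(candidates);
    }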
For extents which do not delay coalescing, use first fit layout policy
rather than first-best fit layout policy. This packs extents toward
older virtual memory mappings, but at the cost of higher search overhead
in the common case.
This resolves #711.
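A simplified sketch of one plausible reading of the two policies, assuming address-ordered per-size-class heaps where "first" is the lowest-addressed extent in each heap (stand-in types, not jemalloc's pairing heaps):

    #include <stddef.h>
    #include <stdint.h>

    #define NPSIZES 64

    typedef struct extent_s {
        void *addr;
        size_t size;
    } extent_t;

    typedef struct heap_s {
        extent_t *first;    /* Lowest-addressed extent, or NULL if empty. */
    } heap_t;

    /* First-best fit: take the lowest extent from the best-fitting
     * (smallest adequate) nonempty heap. */
    static extent_t *
    first_best_fit(heap_t heaps[NPSIZES], unsigned min_ind) {
        for (unsigned i = min_ind; i < NPSIZES; i++) {
            if (heaps[i].first != NULL) {
                return heaps[i].first;
            }
        }
        return NULL;
    }

    /* First fit: examine every sufficiently large heap and take the extent
     * with the lowest address overall, packing reuse toward older mappings
     * at the cost of scanning more heaps. */
    static extent_t *
    first_fit(heap_t heaps[NPSIZES], unsigned min_ind) {
        extent_t *ret = NULL;
        for (unsigned i = min_ind; i < NPSIZES; i++) {
            extent_t *e = heaps[i].first;
            if (e != NULL && (ret == NULL ||
                (uintptr_t)e->addr < (uintptr_t)ret->addr)) {
                ret = e;
            }
        }
        return ret;
    }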
A fixed max spin count is used, with benchmark results showing that it solves
almost all problems. As the benchmark used was rather intense, the upper bound
could be a little high; however, it should offer a good tradeoff between
spinning and blocking.
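A hedged sketch of the bounded spin-then-block pattern; MAX_SPIN here is an arbitrary placeholder rather than the benchmark-derived bound, and CPU_SPINWAIT stands in for a CPU relax hint (e.g. the x86 PAUSE instruction).

    #include <pthread.h>

    #define MAX_SPIN 1000   /* Placeholder; the real bound came from benchmarks. */
    #ifndef CPU_SPINWAIT
    #  define CPU_SPINWAIT  /* CPU relax hint would go here. */
    #endif

    static void
    malloc_mutex_lock_slow(pthread_mutex_t *m) {
        /* Spin a bounded number of times in the hope that the holder
         * releases the mutex soon... */
        for (int i = 0; i < MAX_SPIN; i++) {
            if (pthread_mutex_trylock(m) == 0) {
                return;
            }
            CPU_SPINWAIT;
        }
        /* ...then fall back to blocking in the kernel. */
        pthread_mutex_lock(m);
    }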
Also switched from the term "lock" to "mutex".
Also added option 'x' to malloc_stats() to bypass the lock section.
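Assuming the option character is passed the same way as the other malloc_stats_print() option characters, usage would look roughly like this (hedged sketch; consult the manual for the exact supported characters):

    #include <stddef.h>
    #include <jemalloc/jemalloc.h>

    int
    main(void) {
        /* Print statistics but skip the lock/mutex section ("x"). */
        malloc_stats_print(NULL, NULL, "x");
        return 0;
    }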
Two counters are included for the small bins: lock contention rate, and
max lock waiting time.
Switched to trylock, updating counters based on the lock state.
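A sketch of the trylock-based scheme, with hypothetical counter and helper names; the contention rate and max wait time mentioned in the message above would be derived from counters like these.

    #include <pthread.h>
    #include <stdint.h>
    #include <time.h>

    typedef struct mutex_prof_data_s {
        uint64_t n_lock_ops;    /* Total acquisitions. */
        uint64_t n_wait_ops;    /* Acquisitions that had to wait. */
        uint64_t max_wait_ns;   /* Longest observed wait. */
    } mutex_prof_data_t;

    static uint64_t
    now_ns(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (uint64_t)ts.tv_sec * 1000000000u + (uint64_t)ts.tv_nsec;
    }

    /* Try the lock first; only when that fails do we time the blocking
     * acquire and update the contention counters (all updates happen while
     * holding the mutex). */
    static void
    mutex_lock_prof(pthread_mutex_t *m, mutex_prof_data_t *data) {
        if (pthread_mutex_trylock(m) == 0) {
            data->n_lock_ops++;
            return;
        }
        uint64_t begin = now_ns();
        pthread_mutex_lock(m);
        uint64_t wait = now_ns() - begin;
        data->n_lock_ops++;
        data->n_wait_ops++;
        if (wait > data->max_wait_ns) {
            data->max_wait_ns = wait;
        }
    }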
Call iealloc() as deep into call chains as possible without causing
redundant calls.
This avoids one atomic operation per tree access.
Expand and restructure the rtree API such that all common operations can
be achieved with minimal work, regardless of whether the rtree leaf
fields are independent versus packed into a single atomic pointer.
This allows leaf elements to differ in size from internal node elements.
In principle it would be more correct to use a different type for each
level of the tree, but due to implementation details related to atomic
operations, we use casts anyway, thus counteracting the value of
additional type correctness. Furthermore, such a scheme would require
function code generation (via cpp macros), as well as either unwieldy
type names for leaves or type aliases, e.g.
typedef struct rtree_elm_d2_s rtree_leaf_elm_t;
This alternate strategy would be more correct, and with less code
duplication, but probably not worth the complexity.
binind is now redundant; the containing extent_t's szind field always
provides the same value.
Rather than storing usize only for large (and prof-promoted)
allocations, store the size class index for allocations that reside
within the extent, such that the size class index is valid for all
extents that contain extant allocations, and invalid otherwise (mainly
to make debugging simpler).
Drop tcaches_mtx before calling tcache_destroy().
Split decay-based purging into two phases, the first of which uses lazy
purging to convert dirty pages to "muzzy", and the second of which uses
forced purging, decommit, or unmapping to convert pages to clean or
destroy them altogether. Not all operating systems support lazy
purging, yet the application may provide extent hooks that implement
lazy purging, so care must be taken to dynamically omit the first phase
when necessary.
The mallctl interfaces change as follows:
- opt.decay_time --> opt.{dirty,muzzy}_decay_time
- arena.<i>.decay_time --> arena.<i>.{dirty,muzzy}_decay_time
- arenas.decay_time --> arenas.{dirty,muzzy}_decay_time
- stats.arenas.<i>.pdirty --> stats.arenas.<i>.p{dirty,muzzy}
- stats.arenas.<i>.{npurge,nmadvise,purged} -->
stats.arenas.<i>.{dirty,muzzy}_{npurge,nmadvise,purged}
This resolves #521.
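A hedged usage sketch of the renamed controls through mallctl(), assuming the values remain ssize_t decay times in seconds as with the old opt.decay_time (error handling kept minimal):

    #include <stdio.h>
    #include <sys/types.h>
    #include <jemalloc/jemalloc.h>

    int
    main(void) {
        ssize_t dirty_decay;
        ssize_t muzzy_decay = 10;   /* Seconds; -1 would disable decay. */
        size_t sz = sizeof(dirty_decay);

        /* Read the current dirty decay time for arena 0. */
        if (mallctl("arena.0.dirty_decay_time", &dirty_decay, &sz, NULL, 0) == 0) {
            printf("dirty_decay_time: %zd\n", dirty_decay);
        }
        /* Set the muzzy decay time for arena 0. */
        mallctl("arena.0.muzzy_decay_time", NULL, NULL, &muzzy_decay,
            sizeof(muzzy_decay));
        return 0;
    }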
Refactor most of the decay-related functions to take as parameters the
decay_t and associated extents_t structures to operate on. This
prepares for supporting both lazy and forced purging on different decay
schedules.
These were all size_ts, so we have atomics support for them on all platforms,
and the conversion is straightforward.
curlextents is left non-atomic; AFAICT it is not used atomically anywhere.
I expect this to be the trickiest conversion we will see, since we want atomics
on 64-bit platforms, but are also always able to piggyback on some sort of
external synchronization on non-64-bit platforms.
This has the dual advantages of allowing for sparsely used large
allocations, and relying on the kernel to supply zeroed pages, which
tends to be very fast on modern systems.
These sanity checks prevent what otherwise might result in failed system
calls and unintended fallback execution paths.
madvise(..., MADV_DONTNEED) only causes demand-zeroing on Linux, so fall
back to overlaying a new mapping.
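A sketch of the fallback, using POSIX mmap() with MAP_FIXED to overlay fresh anonymous (zero-fill) pages where MADV_DONTNEED does not guarantee demand-zeroing; the function name and the bool-means-error convention are assumptions.

    #include <stdbool.h>
    #include <stddef.h>
    #include <sys/mman.h>

    /* Discard the contents of [addr, addr+size) so that the pages read back
     * as zeros on next touch.  Returns true on error. */
    static bool
    pages_discard(void *addr, size_t size) {
    #if defined(__linux__) && defined(MADV_DONTNEED)
        /* On Linux, MADV_DONTNEED demand-zeroes the range. */
        return madvise(addr, size, MADV_DONTNEED) != 0;
    #else
        /* Elsewhere, overlay a new anonymous mapping at the same address. */
        void *p = mmap(addr, size, PROT_READ | PROT_WRITE,
            MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
        return p == MAP_FAILED;
    #endif
    }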
In the process, I changed the implementation of rtree_elm_acquire so that it
won't even try to CAS if its initial read (getting the extent + lock bit)
indicates that the CAS is doomed to fail. This can significantly improve
performance under contention.
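A sketch of the check, assuming the element packs an extent pointer whose low bit doubles as a lock flag (C11 atomics; the names approximate the rtree API rather than reproduce it):

    #include <stdatomic.h>
    #include <stdint.h>

    typedef struct rtree_elm_s {
        _Atomic(uintptr_t) bits;    /* Extent pointer; low bit is the lock flag. */
    } rtree_elm_t;

    /* Spin until the element is locked by this thread; returns the pointer
     * bits that were observed while unlocked. */
    static uintptr_t
    rtree_elm_acquire(rtree_elm_t *elm) {
        for (;;) {
            uintptr_t cur = atomic_load_explicit(&elm->bits,
                memory_order_relaxed);
            if (cur & 1) {
                /* Already locked: a CAS now is doomed to fail, so skip it
                 * and simply re-read until the lock bit clears. */
                continue;
            }
            if (atomic_compare_exchange_weak_explicit(&elm->bits, &cur,
                cur | 1, memory_order_acquire, memory_order_relaxed)) {
                return cur;
            }
        }
    }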
The decay mutex already protects all accesses.
The new feature, opt.percpu_arena, determines thread-arena association
dynamically based on CPU id. Three modes are supported: "percpu", "phycpu"
and disabled.
"percpu" uses the current core id (with help from sched_getcpu()) directly
as the arena index, while "phycpu" will assign threads on the same physical
CPU to the same arena. In other words, "percpu" means # of arenas == # of
CPUs, while "phycpu" has # of arenas == 1/2 * (# of CPUs). Note that no
runtime check on whether hyperthreading is enabled has been added yet.
When enabled, threads will be migrated between arenas when a CPU change is
detected. In the current design, to reduce the overhead of reading the CPU
id, each arena tracks the thread that accessed it most recently. When a new
thread comes in, we read the CPU id and update the arena association if
necessary.
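A hedged sketch of the index mapping the two modes imply, assuming sched_getcpu() is available and that hyperthread siblings are enumerated as cpu and cpu + ncpus/2 (a common Linux layout); migration handling and the per-arena "last thread" shortcut described above are omitted.

    #define _GNU_SOURCE
    #include <sched.h>

    typedef enum {
        PERCPU_ARENA_DISABLED,
        PERCPU_ARENA_PERCPU,    /* One arena per CPU. */
        PERCPU_ARENA_PHYCPU     /* One arena per physical CPU. */
    } percpu_arena_mode_t;

    /* Map the calling thread's current CPU to an arena index, or -1 when the
     * feature is disabled or the CPU id cannot be read. */
    static int
    percpu_arena_ind(percpu_arena_mode_t mode, unsigned ncpus) {
        if (mode == PERCPU_ARENA_DISABLED) {
            return -1;
        }
        int cpu = sched_getcpu();
        if (cpu < 0) {
            return -1;
        }
        if (mode == PERCPU_ARENA_PERCPU) {
            return cpu;                 /* # arenas == # CPUs. */
        }
        unsigned narenas = ncpus / 2;   /* # arenas == # CPUs / 2. */
        if (narenas == 0) {
            narenas = 1;
        }
        return cpu % (int)narenas;
    }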
When witness is enabled, lock rank order needs to be preserved during
prefork, not only for each arena, but also across arenas. This change
breaks arena_prefork into further stages to ensure valid rank order
across arenas. Also changed test/unit/fork to use a manual arena to
catch this case.
In the process, we can do some strength reduction, changing the fetch-adds and
fetch-subs to be simple loads followed by stores, since the modifications all
occur while holding the mutex.
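A sketch of the strength reduction, using C11 atomics and a hypothetical counter: because all modifications happen with the mutex held, a fetch-add can become a relaxed load followed by a store, while concurrent readers still see well-defined values.

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdint.h>

    typedef struct arena_stats_s {
        pthread_mutex_t mtx;
        _Atomic(uint64_t) nmalloc;  /* Written only under mtx; read anywhere. */
    } arena_stats_t;

    static void
    arena_stats_add_nmalloc(arena_stats_t *stats, uint64_t n) {
        pthread_mutex_lock(&stats->mtx);
        /* The mutex serializes writers, so no read-modify-write is needed. */
        uint64_t cur = atomic_load_explicit(&stats->nmalloc,
            memory_order_relaxed);
        atomic_store_explicit(&stats->nmalloc, cur + n, memory_order_relaxed);
        pthread_mutex_unlock(&stats->mtx);
    }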
This fixes tcache_flush for manual tcaches, which were not able to find the
correct arena they were associated with. Also changed the decay test to
cover this case (by using manually created arenas).
This simplifies what would be pairing heap operations to the equivalent
of LIFO queue operations. This is a complementary optimization in the
context of delayed coalescing for cached extents.
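A sketch of the simplification, with stand-in types: when cached extents are kept in arrival order rather than address order, pairing-heap insert and remove-first collapse to O(1) LIFO push and pop.

    #include <stddef.h>

    typedef struct cached_extent_s cached_extent_t;
    struct cached_extent_s {
        cached_extent_t *lifo_link; /* Link for the LIFO free list. */
        /* ... extent fields ... */
    };

    typedef struct extent_lifo_s {
        cached_extent_t *head;
    } extent_lifo_t;

    static void
    extent_lifo_push(extent_lifo_t *lifo, cached_extent_t *extent) {
        extent->lifo_link = lifo->head;
        lifo->head = extent;
    }

    static cached_extent_t *
    extent_lifo_pop(extent_lifo_t *lifo) {
        cached_extent_t *extent = lifo->head;
        if (extent != NULL) {
            lifo->head = extent->lifo_link;
        }
        return extent;
    }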
Rather than purging uncoalesced extents, perform just enough incremental
coalescing to purge only fully coalesced extents. In the absence of
cached extent reuse, the immediate versus delayed incremental purging
algorithms result in the same purge order.
This resolves #655.
strategy
This is the first header refactoring diff, #533. It splits the assert and util
components into separate, hermetic, header files. In the process, it splits out
two of the large sub-components of util (the stdio.h replacement, and bit
manipulation routines) into their own components (malloc_io.h and bit_util.h).
This is mostly to break up cyclic dependencies, but it also breaks off a good
chunk of the catch-all-ness of util, which is nice.
Convert the nrequests field to be partially derived, and the curlextents
to be fully derived, in order to reduce the number of stats updates
needed during common operations.
This change affects ndalloc stats during arena reset, because it is no
longer possible to cancel out ndalloc effects (curlextents would become
negative).
This introduces a backport of C11 atomics. It has four implementations; ranked
in order of preference, they are:
- GCC/Clang __atomic builtins
- GCC/Clang __sync builtins
- MSVC _Interlocked builtins
- C11 atomics, from <stdatomic.h>
The primary advantages are:
- Close adherence to the standard API gives us a defined memory model.
- Type safety: atomic objects are now separate types from non-atomic ones, so
that it's impossible to mix up atomic and non-atomic updates (which is
undefined behavior that compilers are starting to take advantage of).
- Efficiency: we can specify ordering for operations, avoiding fences and
atomic operations on strongly ordered architectures (example:
`atomic_write_u32(ptr, val);` involves a CAS loop, whereas
`atomic_store(ptr, val, ATOMIC_RELEASE);` is a plain store).
This diff leaves in the current atomics API (implementing them in terms of the
backport). This lets us transition uses over piecemeal.
Testing:
This is by nature hard to test. I've manually tested the first three options on
Linux on gcc by futzing with the #defines manually, on freebsd with gcc and
clang, on MSVC, and on OS X with clang. All of these were x86 machines though,
and we don't have any test infrastructure set up for non-x86 platforms.
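A minimal sketch of what the highest-preference backend (the GCC/Clang __atomic builtins) can look like behind a C11-style wrapper API; only a u32 type is shown, the other three backends and the macro-driven type generation are omitted, and the names only approximate the real API.

    #include <stdint.h>

    typedef enum {
        ATOMIC_RELAXED = __ATOMIC_RELAXED,
        ATOMIC_ACQUIRE = __ATOMIC_ACQUIRE,
        ATOMIC_RELEASE = __ATOMIC_RELEASE,
        ATOMIC_ACQ_REL = __ATOMIC_ACQ_REL,
        ATOMIC_SEQ_CST = __ATOMIC_SEQ_CST
    } atomic_memory_order_t;

    /* A distinct wrapper type makes it impossible to mix atomic and plain
     * accesses to the same object. */
    typedef struct {
        uint32_t repr;
    } atomic_u32_t;

    static inline uint32_t
    atomic_load_u32(atomic_u32_t *a, atomic_memory_order_t mo) {
        return __atomic_load_n(&a->repr, mo);
    }

    static inline void
    atomic_store_u32(atomic_u32_t *a, uint32_t val, atomic_memory_order_t mo) {
        /* On strongly ordered hardware a release store compiles to a plain
         * store, unlike the CAS loop a write-only legacy API may need. */
        __atomic_store_n(&a->repr, val, mo);
    }

    static inline uint32_t
    atomic_fetch_add_u32(atomic_u32_t *a, uint32_t val,
        atomic_memory_order_t mo) {
        return __atomic_fetch_add(&a->repr, val, mo);
    }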
This fixes a regression caused by
54269dc0ed3e4d04b2539016431de3cfe8330719 (Remove obsolete
arena_maybe_purge() call.), as well as providing a general fix.
This resolves #665.