Commit messages
|
Split decay-based purging into two phases, the first of which uses lazy
purging to convert dirty pages to "muzzy", and the second of which uses
forced purging, decommit, or unmapping to convert pages to clean or
destroy them altogether. Not all operating systems support lazy
purging, yet the application may provide extent hooks that implement
lazy purging, so care must be taken to dynamically omit the first phase
when necessary.
The mallctl interfaces change as follows:
- opt.decay_time --> opt.{dirty,muzzy}_decay_time
- arena.<i>.decay_time --> arena.<i>.{dirty,muzzy}_decay_time
- arenas.decay_time --> arenas.{dirty,muzzy}_decay_time
- stats.arenas.<i>.pdirty --> stats.arenas.<i>.p{dirty,muzzy}
- stats.arenas.<i>.{npurge,nmadvise,purged} -->
stats.arenas.<i>.{dirty,muzzy}_{npurge,nmadvise,purged}
This resolves #521.
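For illustration, a minimal sketch of exercising the renamed controls through mallctl. It assumes an unprefixed build where mallctl is exported under its plain name, and that the split decay times keep the old opt.decay_time representation (ssize_t seconds, with -1 disabling purging); neither assumption is stated above.

    #include <stdio.h>
    #include <sys/types.h>
    #include <jemalloc/jemalloc.h>

    int
    main(void) {
        ssize_t dirty, muzzy;
        size_t sz = sizeof(ssize_t);

        /* Read the split decay times for the two purging phases. */
        mallctl("opt.dirty_decay_time", (void *)&dirty, &sz, NULL, 0);
        mallctl("opt.muzzy_decay_time", (void *)&muzzy, &sz, NULL, 0);
        printf("dirty: %zd s, muzzy: %zd s\n", dirty, muzzy);

        /* Shorten the muzzy decay time for arenas created from here on. */
        ssize_t new_muzzy = 1;
        mallctl("arenas.muzzy_decay_time", NULL, NULL, (void *)&new_muzzy,
            sizeof(new_muzzy));
        return 0;
    }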
|
Refactor most of the decay-related functions to take as parameters the
decay_t and associated extents_t structures to operate on. This
prepares for supporting both lazy and forced purging on different decay
schedules.
|
These were all size_ts, for which we have atomics support on all
platforms, so the conversion is straightforward.
curlextents is left non-atomic; as far as I can tell it is not used
atomically anywhere.
|
I expect this to be the trickiest conversion we will see, since we want
atomics on 64-bit platforms, but can always piggyback on some sort of
external synchronization on non-64-bit platforms.
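A minimal sketch of that pattern, with illustrative names: a true atomic on 64-bit platforms, and a plain integer that relies on the caller's external synchronization elsewhere.

    #include <stdint.h>

    #if UINTPTR_MAX == UINT64_MAX
    #include <stdatomic.h>
    typedef _Atomic uint64_t counter_u64_t;

    static inline void
    counter_add_u64(counter_u64_t *c, uint64_t n) {
        /* 64-bit platform: a real atomic read-modify-write. */
        atomic_fetch_add_explicit(c, n, memory_order_relaxed);
    }
    #else
    typedef uint64_t counter_u64_t;

    static inline void
    counter_add_u64(counter_u64_t *c, uint64_t n) {
        /* 32-bit platform: caller must hold the protecting mutex. */
        *c += n;
    }
    #endif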
|
This has the dual advantages of allowing for sparsely used large
allocations, and relying on the kernel to supply zeroed pages, which
tends to be very fast on modern systems.
|
These sanity checks prevent what would otherwise be failed system calls
and unintended fallback execution paths.
|
madvise(..., MADV_DONTNEED) only causes demand-zeroing on Linux, so fall
back to overlaying a new mapping.
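A sketch of that fallback, assuming POSIX mmap/madvise; the function name here is illustrative, not necessarily the one used in the tree.

    #include <stdbool.h>
    #include <stddef.h>
    #include <sys/mman.h>

    /* Returns true on error.  On Linux, MADV_DONTNEED makes the next access
     * demand-zero the pages; elsewhere, overlay a fresh zeroed anonymous
     * mapping at the same address. */
    static bool
    purge_forced(void *addr, size_t size) {
    #if defined(__linux__) && defined(MADV_DONTNEED)
        return (madvise(addr, size, MADV_DONTNEED) != 0);
    #else
        void *p = mmap(addr, size, PROT_READ | PROT_WRITE,
            MAP_PRIVATE | MAP_ANON | MAP_FIXED, -1, 0);
        return (p == MAP_FAILED);
    #endif
    }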
|
In the process, I changed the implementation of rtree_elm_acquire so that it
won't even try to CAS if its initial read (getting the extent + lock bit)
indicates that the CAS is doomed to fail. This can significantly improve
performance under contention.
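The no-doomed-CAS idea, sketched with C11 atomics and illustrative names, assuming the lock bit lives in bit 0 of the element's pointer word (not jemalloc's actual rtree_elm_acquire):

    #include <stdatomic.h>
    #include <stdint.h>

    /* Spin until the element is acquired, but only attempt the CAS when a
     * plain load shows the lock bit clear; a CAS against a locked element
     * is doomed to fail and only generates coherence traffic. */
    static void *
    elm_acquire(_Atomic uintptr_t *elm) {
        for (;;) {
            uintptr_t cur = atomic_load_explicit(elm, memory_order_relaxed);
            if ((cur & (uintptr_t)1) != 0) {
                continue;   /* Already locked; skip the doomed CAS. */
            }
            if (atomic_compare_exchange_weak_explicit(elm, &cur,
                cur | (uintptr_t)1, memory_order_acquire,
                memory_order_relaxed)) {
                return (void *)cur;     /* Extent pointer, lock bit clear. */
            }
        }
    }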
|
The decay mutex already protects all accesses.
|
The new feature, opt.percpu_arena, determines thread-arena association
dynamically based on CPU id. Three modes are supported: "percpu",
"phycpu" and disabled.
"percpu" uses the current core id (with help from sched_getcpu())
directly as the arena index, while "phycpu" assigns threads on the same
physical CPU to the same arena. In other words, "percpu" means # of
arenas == # of CPUs, while "phycpu" has # of arenas == 1/2 * (# of
CPUs). Note that there is no runtime check yet for whether
hyperthreading is enabled.
When enabled, threads are migrated between arenas when a CPU change is
detected. In the current design, to reduce the overhead of reading the
CPU id, each arena tracks the thread that accessed it most recently;
when a new thread comes in, we read the CPU id and update its arena
association if necessary.
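One plausible CPU-to-arena mapping, sketched under the labeled assumption that hyperthread siblings are numbered cpu and cpu + ncpus/2; sched_getcpu() is a glibc/Linux extension, and the function name is illustrative.

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdbool.h>

    /* Illustrative only: map the calling thread's current CPU to an arena
     * index for the "percpu" and "phycpu" modes described above. */
    static unsigned
    arena_ind_for_cpu(unsigned ncpus, bool phycpu) {
        int cpu = sched_getcpu();
        if (cpu < 0) {
            return 0;               /* CPU id unavailable; fall back. */
        }
        if (phycpu && ncpus >= 2) {
            /* "phycpu": siblings cpu and cpu + ncpus/2 share one arena. */
            return (unsigned)cpu % (ncpus / 2);
        }
        /* "percpu": one arena per CPU. */
        return (unsigned)cpu;
    }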
|
When witness is enabled, lock rank order needs to be preserved during
prefork, not only for each arena, but also across arenas. This change
breaks arena_prefork into further stages to ensure valid rank order
across arenas. Also changed test/unit/fork to use a manual arena to
catch this case.
|
In the process, we can do some strength reduction, changing the fetch-adds and
fetch-subs to be simple loads followed by stores, since the modifications all
occur while holding the mutex.
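A sketch of that strength reduction using C11 atomics and an illustrative name; it is only valid because, as noted above, every modification happens with the mutex held.

    #include <stdatomic.h>
    #include <stddef.h>

    /* With the stats mutex held there are no concurrent writers, so the
     * fetch-add can become a relaxed load followed by a relaxed store. */
    static inline void
    stats_add_locked(_Atomic size_t *counter, size_t n) {
        size_t cur = atomic_load_explicit(counter, memory_order_relaxed);
        atomic_store_explicit(counter, cur + n, memory_order_relaxed);
    }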
|
This fixes tcache_flush for manual tcaches, which weren't able to find
the correct arena they were associated with. Also changed the decay test
to cover this case (by using manually created arenas).
|
This simplifies what would be pairing heap operations to the equivalent
of LIFO queue operations. This is a complementary optimization in the
context of delayed coalescing for cached extents.
|
Rather than purging uncoalesced extents, perform just enough incremental
coalescing to purge only fully coalesced extents. In the absence of
cached extent reuse, the immediate versus delayed incremental purging
algorithms result in the same purge order.
This resolves #655.
|
strategy
|
This is the first header refactoring diff, #533. It splits the assert and util
components into separate, hermetic, header files. In the process, it splits out
two of the large sub-components of util (the stdio.h replacement, and bit
manipulation routines) into their own components (malloc_io.h and bit_util.h).
This is mostly to break up cyclic dependencies, but it also breaks off a good
chunk of the catch-all-ness of util, which is nice.
|
Convert the nrequests field to be partially derived, and the curlextents
to be fully derived, in order to reduce the number of stats updates
needed during common operations.
This change affects ndalloc stats during arena reset, because it is no
longer possible to cancel out ndalloc effects (curlextents would become
negative).
|
This introduces a backport of C11 atomics. It has four implementations; ranked
in order of preference, they are:
- GCC/Clang __atomic builtins
- GCC/Clang __sync builtins
- MSVC _Interlocked builtins
- C11 atomics, from <stdatomic.h>
The primary advantages are:
- Close adherence to the standard API gives us a defined memory model.
- Type safety: atomic objects are now separate types from non-atomic ones, so
that it's impossible to mix up atomic and non-atomic updates (which is
undefined behavior that compilers are starting to take advantage of).
- Efficiency: we can specify ordering for operations, avoiding fences and
atomic operations on strongly ordered architectures (example:
`atomic_write_u32(ptr, val);` involves a CAS loop, whereas
`atomic_store(ptr, val, ATOMIC_RELEASE);` is a plain store).
This diff leaves in the current atomics API (implementing them in terms of the
backport). This lets us transition uses over piecemeal.
Testing:
This is by nature hard to test. I've manually tested the first three
options on Linux with gcc by futzing with the #defines, on FreeBSD with
gcc and clang, on MSVC, and on OS X with clang. All of these were x86
machines, though, and we don't have any test infrastructure set up for
non-x86 platforms.
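A sketch of what the highest-preference implementation (the GCC/Clang __atomic builtins) might look like; the type and function names below are illustrative, not the actual backport API.

    #include <stdint.h>

    typedef struct { uint32_t repr; } my_atomic_u32_t;

    /* Memory orderings map straight onto the builtin constants. */
    #define MY_ATOMIC_RELAXED __ATOMIC_RELAXED
    #define MY_ATOMIC_RELEASE __ATOMIC_RELEASE

    static inline uint32_t
    my_atomic_load_u32(my_atomic_u32_t *a, int mo) {
        return __atomic_load_n(&a->repr, mo);
    }

    static inline void
    my_atomic_store_u32(my_atomic_u32_t *a, uint32_t val, int mo) {
        /* With MY_ATOMIC_RELEASE this is a plain store plus ordering on
         * strongly ordered architectures; no CAS loop is needed. */
        __atomic_store_n(&a->repr, val, mo);
    }

    static inline uint32_t
    my_atomic_fetch_add_u32(my_atomic_u32_t *a, uint32_t val, int mo) {
        return __atomic_fetch_add(&a->repr, val, mo);
    }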
|
This fixes a regression caused by
54269dc0ed3e4d04b2539016431de3cfe8330719 (Remove obsolete
arena_maybe_purge() call.), as well as providing a general fix.
This resolves #665.
|
This avoids signed/unsigned comparison warnings when specifying integer
constants as inputs.
|
This fixes a regression introduced by
d433471f581ca50583c7a99f9802f7388f81aa36 (Derive
{allocated,nmalloc,ndalloc,nrequests}_large stats.).
|
Remove obsolete unit test scaffolding for extent quantization. Remove
redundant assertions. Add an assertion to
extents_first_best_fit_locked() that should help prevent aligned
allocation regressions.
|
This complements 94c5d22a4da7844d0bdc5b370e47b1ba14268af2 (Remove mb.h,
which is unused).
|
Remove a call to arena_maybe_purge() that was necessary for ratio-based
purging, but is obsolete in the context of decay-based purging.
|
Extent splitting and coalescing is a major component of large allocation
overhead, and disabling coalescing of cached extents provides a simple
and effective hysteresis mechanism. Once two-phase purging is
implemented, it will probably make sense to leave coalescing disabled
for the first phase, but coalesce during the second phase.
|
Refactor extent_can_coalesce(), extent_coalesce(), and extent_record()
to avoid needlessly repeating extent [de]activation operations.
|
Mapped memory increases when extent_alloc_wrapper() succeeds, and
decreases when extent_dalloc_wrapper() is called (during purging).
|
This removes the last use of arena->lock.
|
This mildly reduces stats update overhead during normal operation.
|
This replaces arena->lock synchronization.
|
This avoids a gcc diagnostic note:
note: The ABI for passing parameters with 64-byte alignment has
changed in GCC 4.6
The note relates to the cacheline alignment of rtree_ctx_t, which was
introduced by 4a346f55939af4f200121cc4454089592d952f18 (Replace rtree
path cache with LRU cache.).
|
Fix extent_alloc_dss() to account for bytes that are not a multiple of
the page size. This regression was introduced by
577d4572b0821a15e5370f9bf566d884b7cf707c (Make dss operations
lockless.), which was first released in 4.3.0.
|
NULL can never actually be inserted in practice, and removing support
allows a branch to be removed from the fast path.
|
Rather than dynamically building a table to aid per-level computations,
define a constant table at compile time. Omit both high and low
insignificant bits. Use one to three tree levels, depending on the
number of significant bits.
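An illustrative compile-time level table, under the assumption of a 48-bit virtual address space with 4 KiB pages (so the low 12 and high 16 bits are insignificant, and the 36 significant bits split into three 12-bit levels); these constants and names are examples, not the actual configuration.

    #include <stdint.h>

    typedef struct {
        unsigned bits;      /* significant key bits consumed at this level */
        unsigned cumbits;   /* cumulative significant bits consumed so far */
    } rtree_level_sketch_t;

    static const rtree_level_sketch_t rtree_levels_sketch[] = {
        {12, 12},   /* root */
        {12, 24},
        {12, 36}    /* leaf */
    };

    /* Subkey for a level: drop the low insignificant bits plus the bits
     * belonging to deeper levels, then mask off this level's bits. */
    static inline uintptr_t
    rtree_subkey_sketch(uintptr_t key, unsigned level) {
        const unsigned lg_page = 12, sig_bits = 36;
        unsigned shift = lg_page + sig_bits
            - rtree_levels_sketch[level].cumbits;
        uintptr_t mask =
            ((uintptr_t)1 << rtree_levels_sketch[level].bits) - 1;
        return (key >> shift) & mask;
    }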
|
A subsequent change instead ignores insignificant high bits.
|
Anything but a hit in the first element of the lookup cache is
expensive enough to negate the benefits of inlining.
|
Read adjacent rtree elements while holding element locks, since the
extents mutex only protects against relevant like-state extent mutation.
Fix management of the 'coalesced' loop state variable to merge
forward/backward results, rather than overwriting the result of forward
coalescing if attempting to coalesce backward. In practice this caused
no correctness issues, but could cause extra iterations in rare cases.
These regressions were introduced by
d27f29b468ae3e9d2b1da4a9880351d76e5a1662 (Disentangle arena and extent
locking.).
|
Set extent as active prior to registration so that other threads can't
modify it in the absence of locking.
This regression was introduced by
d27f29b468ae3e9d2b1da4a9880351d76e5a1662 (Disentangle arena and extent
locking.), via non-obvious means. Removal of extents_mtx protection
during extent_grow_retained() execution opened up the race, but in the
presence of that locking, the code was safe.
This resolves #599.
|
Do not check for overflow unless it is actually a possibility.
|
Fix compute_size_with_overflow() to use a high_bits mask that has the
high bits set, rather than the low bits. This regression was introduced
by 5154ff32ee8c37bacb6afd8a07b923eb33228357 (Unify the allocation
paths).
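A sketch of the corrected check, with illustrative names: if neither operand has a bit set in the high half of size_t the product cannot overflow, and only otherwise is the division-based check needed.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Writes nmemb * size to *result and returns true iff it overflowed. */
    static bool
    size_mul_overflows(size_t nmemb, size_t size, size_t *result) {
        /* High bits set (e.g. the top 32 bits of a 64-bit size_t). */
        static const size_t high_bits =
            SIZE_MAX << (sizeof(size_t) * 8 / 2);

        *result = nmemb * size;
        if (((nmemb | size) & high_bits) == 0) {
            return false;   /* Both operands fit in the low half. */
        }
        return (size != 0 && *result / size != nmemb);
    }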
|
Refactor arena and extent locking protocols such that arena and
extent locks are never held when calling into the extent_*_wrapper()
API. This requires extra care during purging since the arena lock no
longer protects the inner purging logic. It also requires extra care to
protect extents from being merged with adjacent extents.
Convert extent_t's 'active' flag to an enumerated 'state', so that
retained extents are explicitly marked as such, rather than depending on
ring linkage state.
Refactor the extent collections (and their synchronization) for cached
and retained extents into extents_t. Incorporate LRU functionality to
support purging. Incorporate page count accounting, which replaces
arena->ndirty and arena->stats.retained.
Assert that no core locks are held when entering any internal
[de]allocation functions. This is in addition to existing assertions
that no locks are held when entering external [de]allocation functions.
Audit and document synchronization protocols for all arena_t fields.
This fixes a potential deadlock due to recursive allocation during
gdump, in a similar fashion to b49c649bc18fff4bd10a1c8adbaf1f25f6453cb6
(Fix lock order reversal during gdump.), but with a necessarily much
broader code impact.
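An illustrative, self-contained sketch of two of the ideas above (the enumerated extent state replacing the boolean 'active' flag, and a per-state extents collection with its own lock, LRU ordering, and page accounting); every name and field here is an assumption, not jemalloc's actual definition.

    #include <pthread.h>
    #include <stddef.h>

    typedef enum {
        extent_state_active,    /* in use */
        extent_state_dirty,     /* cached; contents undefined */
        extent_state_retained   /* unbacked virtual memory kept for reuse */
    } extent_state_t;

    typedef struct extent_sketch_s extent_sketch_t;
    struct extent_sketch_s {
        extent_state_t  state;
        size_t          npages;
        extent_sketch_t *lru_prev, *lru_next;
    };

    /* One collection per non-active state, with its own synchronization. */
    typedef struct {
        pthread_mutex_t mtx;        /* protects this collection only */
        extent_sketch_t *lru_head;  /* purge candidates in LRU order */
        size_t          npages;     /* replaces arena-wide dirty counters */
        extent_state_t  state;      /* state shared by all member extents */
    } extents_sketch_t;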
|