Commit messages
|
Split decay-based purging into two phases, the first of which uses lazy
purging to convert dirty pages to "muzzy", and the second of which uses
forced purging, decommit, or unmapping to convert pages to clean or
destroy them altogether. Not all operating systems support lazy
purging, yet the application may provide extent hooks that implement
lazy purging, so care must be taken to dynamically omit the first phase
when necessary.
The mallctl interfaces change as follows:
- opt.decay_time --> opt.{dirty,muzzy}_decay_time
- arena.<i>.decay_time --> arena.<i>.{dirty,muzzy}_decay_time
- arenas.decay_time --> arenas.{dirty,muzzy}_decay_time
- stats.arenas.<i>.pdirty --> stats.arenas.<i>.p{dirty,muzzy}
- stats.arenas.<i>.{npurge,nmadvise,purged} -->
stats.arenas.<i>.{dirty,muzzy}_{npurge,nmadvise,purged}
This resolves #521.
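For illustration, a minimal sketch of exercising the renamed controls through mallctl. It assumes an unprefixed build where mallctl is exported under its plain name, and that the split decay times keep the old opt.decay_time representation (ssize_t seconds, with -1 disabling purging); neither assumption is stated above.

    #include <stdio.h>
    #include <sys/types.h>
    #include <jemalloc/jemalloc.h>

    int
    main(void) {
        ssize_t dirty, muzzy;
        size_t sz = sizeof(ssize_t);

        /* Read the split decay times for the two purging phases. */
        mallctl("opt.dirty_decay_time", (void *)&dirty, &sz, NULL, 0);
        mallctl("opt.muzzy_decay_time", (void *)&muzzy, &sz, NULL, 0);
        printf("dirty: %zd s, muzzy: %zd s\n", dirty, muzzy);

        /* Shorten the muzzy decay time for arenas created from here on. */
        ssize_t new_muzzy = 1;
        mallctl("arenas.muzzy_decay_time", NULL, NULL, (void *)&new_muzzy,
            sizeof(new_muzzy));
        return 0;
    }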
|
Refactor most of the decay-related functions to take as parameters the
decay_t and associated extents_t structures to operate on. This
prepares for supporting both lazy and forced purging on different decay
schedules.
|
These were all size_ts, for which we have atomics support on all
platforms, so the conversion is straightforward.
curlextents is left non-atomic; as far as I can tell it is not used
atomically anywhere.
|
I expect this to be the trickiest conversion we will see, since we want
atomics on 64-bit platforms, but can always piggyback on some sort of
external synchronization on non-64-bit platforms.
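A minimal sketch of that pattern, with illustrative names: a true atomic on 64-bit platforms, and a plain integer that relies on the caller's external synchronization elsewhere.

    #include <stdint.h>

    #if UINTPTR_MAX == UINT64_MAX
    #include <stdatomic.h>
    typedef _Atomic uint64_t counter_u64_t;

    static inline void
    counter_add_u64(counter_u64_t *c, uint64_t n) {
        /* 64-bit platform: a real atomic read-modify-write. */
        atomic_fetch_add_explicit(c, n, memory_order_relaxed);
    }
    #else
    typedef uint64_t counter_u64_t;

    static inline void
    counter_add_u64(counter_u64_t *c, uint64_t n) {
        /* 32-bit platform: caller must hold the protecting mutex. */
        *c += n;
    }
    #endif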
|
This has the dual advantages of allowing for sparsely used large
allocations, and relying on the kernel to supply zeroed pages, which
tends to be very fast on modern systems.
|
These sanity checks prevent what would otherwise be failed system calls
and unintended fallback execution paths.
|
madvise(..., MADV_DONTNEED) only causes demand-zeroing on Linux, so fall
back to overlaying a new mapping.
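A sketch of that fallback, assuming POSIX mmap/madvise; the function name here is illustrative, not necessarily the one used in the tree.

    #include <stdbool.h>
    #include <stddef.h>
    #include <sys/mman.h>

    /* Returns true on error.  On Linux, MADV_DONTNEED makes the next access
     * demand-zero the pages; elsewhere, overlay a fresh zeroed anonymous
     * mapping at the same address. */
    static bool
    purge_forced(void *addr, size_t size) {
    #if defined(__linux__) && defined(MADV_DONTNEED)
        return (madvise(addr, size, MADV_DONTNEED) != 0);
    #else
        void *p = mmap(addr, size, PROT_READ | PROT_WRITE,
            MAP_PRIVATE | MAP_ANON | MAP_FIXED, -1, 0);
        return (p == MAP_FAILED);
    #endif
    }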
|
In the process, I changed the implementation of rtree_elm_acquire so that it
won't even try to CAS if its initial read (getting the extent + lock bit)
indicates that the CAS is doomed to fail. This can significantly improve
performance under contention.
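The no-doomed-CAS idea, sketched with C11 atomics and illustrative names, assuming the lock bit lives in bit 0 of the element's pointer word (not jemalloc's actual rtree_elm_acquire):

    #include <stdatomic.h>
    #include <stdint.h>

    /* Spin until the element is acquired, but only attempt the CAS when a
     * plain load shows the lock bit clear; a CAS against a locked element
     * is doomed to fail and only generates coherence traffic. */
    static void *
    elm_acquire(_Atomic uintptr_t *elm) {
        for (;;) {
            uintptr_t cur = atomic_load_explicit(elm, memory_order_relaxed);
            if ((cur & (uintptr_t)1) != 0) {
                continue;   /* Already locked; skip the doomed CAS. */
            }
            if (atomic_compare_exchange_weak_explicit(elm, &cur,
                cur | (uintptr_t)1, memory_order_acquire,
                memory_order_relaxed)) {
                return (void *)cur;     /* Extent pointer, lock bit clear. */
            }
        }
    }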
|
The decay mutex already protects all accesses.
|
The new feature, opt.percpu_arena, determines thread-arena association
dynamically based on CPU id. Three modes are supported: "percpu",
"phycpu" and disabled.
"percpu" uses the current core id (with help from sched_getcpu())
directly as the arena index, while "phycpu" assigns threads on the same
physical CPU to the same arena. In other words, "percpu" means # of
arenas == # of CPUs, while "phycpu" has # of arenas == 1/2 * (# of
CPUs). Note that there is no runtime check yet for whether
hyperthreading is enabled.
When enabled, threads are migrated between arenas when a CPU change is
detected. In the current design, to reduce the overhead of reading the
CPU id, each arena tracks the thread that accessed it most recently;
when a new thread comes in, we read the CPU id and update its arena
association if necessary.
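One plausible CPU-to-arena mapping, sketched under the labeled assumption that hyperthread siblings are numbered cpu and cpu + ncpus/2; sched_getcpu() is a glibc/Linux extension, and the function name is illustrative.

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdbool.h>

    /* Illustrative only: map the calling thread's current CPU to an arena
     * index for the "percpu" and "phycpu" modes described above. */
    static unsigned
    arena_ind_for_cpu(unsigned ncpus, bool phycpu) {
        int cpu = sched_getcpu();
        if (cpu < 0) {
            return 0;               /* CPU id unavailable; fall back. */
        }
        if (phycpu && ncpus >= 2) {
            /* "phycpu": siblings cpu and cpu + ncpus/2 share one arena. */
            return (unsigned)cpu % (ncpus / 2);
        }
        /* "percpu": one arena per CPU. */
        return (unsigned)cpu;
    }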
|
When witness is enabled, lock rank order needs to be preserved during
prefork, not only for each arena, but also across arenas. This change
breaks arena_prefork into further stages to ensure valid rank order
across arenas. Also changed test/unit/fork to use a manual arena to
catch this case.
|
In the process, we can do some strength reduction, changing the fetch-adds and
fetch-subs to be simple loads followed by stores, since the modifications all
occur while holding the mutex.
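A sketch of that strength reduction using C11 atomics and an illustrative name; it is only valid because, as noted above, every modification happens with the mutex held.

    #include <stdatomic.h>
    #include <stddef.h>

    /* With the stats mutex held there are no concurrent writers, so the
     * fetch-add can become a relaxed load followed by a relaxed store. */
    static inline void
    stats_add_locked(_Atomic size_t *counter, size_t n) {
        size_t cur = atomic_load_explicit(counter, memory_order_relaxed);
        atomic_store_explicit(counter, cur + n, memory_order_relaxed);
    }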
|
This fixes tcache_flush for manual tcaches, which weren't able to find
the correct arena they were associated with. Also changed the decay test
to cover this case (by using manually created arenas).
|
This simplifies what would be pairing heap operations to the equivalent
of LIFO queue operations. This is a complementary optimization in the
context of delayed coalescing for cached extents.
|
Rather than purging uncoalesced extents, perform just enough incremental
coalescing to purge only fully coalesced extents. In the absence of
cached extent reuse, the immediate versus delayed incremental purging
algorithms result in the same purge order.
This resolves #655.
|
strategy
|
This is the first header refactoring diff, #533. It splits the assert and util
components into separate, hermetic, header files. In the process, it splits out
two of the large sub-components of util (the stdio.h replacement, and bit
manipulation routines) into their own components (malloc_io.h and bit_util.h).
This is mostly to break up cyclic dependencies, but it also breaks off a good
chunk of the catch-all-ness of util, which is nice.
|
Convert the nrequests field to be partially derived, and the curlextents
to be fully derived, in order to reduce the number of stats updates
needed during common operations.
This change affects ndalloc stats during arena reset, because it is no
longer possible to cancel out ndalloc effects (curlextents would become
negative).
|
This introduces a backport of C11 atomics. It has four implementations; ranked
in order of preference, they are:
- GCC/Clang __atomic builtins
- GCC/Clang __sync builtins
- MSVC _Interlocked builtins
- C11 atomics, from <stdatomic.h>
The primary advantages are:
- Close adherence to the standard API gives us a defined memory model.
- Type safety: atomic objects are now separate types from non-atomic ones, so
that it's impossible to mix up atomic and non-atomic updates (which is
undefined behavior that compilers are starting to take advantage of).
- Efficiency: we can specify ordering for operations, avoiding fences and
atomic operations on strongly ordered architectures (example:
`atomic_write_u32(ptr, val);` involves a CAS loop, whereas
`atomic_store(ptr, val, ATOMIC_RELEASE);` is a plain store).
This diff leaves in the current atomics API (implementing them in terms of the
backport). This lets us transition uses over piecemeal.
Testing:
This is by nature hard to test. I've manually tested the first three
options on Linux with gcc by futzing with the #defines, on FreeBSD with
gcc and clang, on MSVC, and on OS X with clang. All of these were x86
machines, though, and we don't have any test infrastructure set up for
non-x86 platforms.
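A sketch of what the highest-preference implementation (the GCC/Clang __atomic builtins) might look like; the type and function names below are illustrative, not the actual backport API.

    #include <stdint.h>

    typedef struct { uint32_t repr; } my_atomic_u32_t;

    /* Memory orderings map straight onto the builtin constants. */
    #define MY_ATOMIC_RELAXED __ATOMIC_RELAXED
    #define MY_ATOMIC_RELEASE __ATOMIC_RELEASE

    static inline uint32_t
    my_atomic_load_u32(my_atomic_u32_t *a, int mo) {
        return __atomic_load_n(&a->repr, mo);
    }

    static inline void
    my_atomic_store_u32(my_atomic_u32_t *a, uint32_t val, int mo) {
        /* With MY_ATOMIC_RELEASE this is a plain store plus ordering on
         * strongly ordered architectures; no CAS loop is needed. */
        __atomic_store_n(&a->repr, val, mo);
    }

    static inline uint32_t
    my_atomic_fetch_add_u32(my_atomic_u32_t *a, uint32_t val, int mo) {
        return __atomic_fetch_add(&a->repr, val, mo);
    }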
|
This fixes a regression caused by
54269dc0ed3e4d04b2539016431de3cfe8330719 (Remove obsolete
arena_maybe_purge() call.), as well as providing a general fix.
This resolves #665.
|
This avoids signed/unsigned comparison warnings when specifying integer
constants as inputs.
|
This fixes a regression introduced by
d433471f581ca50583c7a99f9802f7388f81aa36 (Derive
{allocated,nmalloc,ndalloc,nrequests}_large stats.).
|
Remove obsolete unit test scaffolding for extent quantization. Remove
redundant assertions. Add an assertion to
extents_first_best_fit_locked() that should help prevent aligned
allocation regressions.
|
This complements 94c5d22a4da7844d0bdc5b370e47b1ba14268af2 (Remove mb.h,
which is unused).
|
Remove a call to arena_maybe_purge() that was necessary for ratio-based
purging, but is obsolete in the context of decay-based purging.
|
Extent splitting and coalescing is a major component of large allocation
overhead, and disabling coalescing of cached extents provides a simple
and effective hysteresis mechanism. Once two-phase purging is
implemented, it will probably make sense to leave coalescing disabled
for the first phase, but coalesce during the second phase.
|
Refactor extent_can_coalesce(), extent_coalesce(), and extent_record()
to avoid needlessly repeating extent [de]activation operations.
|
Mapped memory increases when extent_alloc_wrapper() succeeds, and
decreases when extent_dalloc_wrapper() is called (during purging).
|
This removes the last use of arena->lock.
|
This mildly reduces stats update overhead during normal operation.
|
This replaces arena->lock synchronization.
|
This avoids a gcc diagnostic note:
note: The ABI for passing parameters with 64-byte alignment has
changed in GCC 4.6
The note relates to the cacheline alignment of rtree_ctx_t, which was
introduced by 4a346f55939af4f200121cc4454089592d952f18 (Replace rtree
path cache with LRU cache.).
|
Fix extent_alloc_dss() to account for bytes that are not a multiple of
the page size. This regression was introduced by
577d4572b0821a15e5370f9bf566d884b7cf707c (Make dss operations
lockless.), which was first released in 4.3.0.
|
NULL can never actually be inserted in practice, and removing support
allows a branch to be removed from the fast path.
|
Rather than dynamically building a table to aid per-level computations,
define a constant table at compile time. Omit both high and low
insignificant bits. Use one to three tree levels, depending on the
number of significant bits.
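An illustrative compile-time level table, under the assumption of a 48-bit virtual address space with 4 KiB pages (so the low 12 and high 16 bits are insignificant, and the 36 significant bits split into three 12-bit levels); these constants and names are examples, not the actual configuration.

    #include <stdint.h>

    typedef struct {
        unsigned bits;      /* significant key bits consumed at this level */
        unsigned cumbits;   /* cumulative significant bits consumed so far */
    } rtree_level_sketch_t;

    static const rtree_level_sketch_t rtree_levels_sketch[] = {
        {12, 12},   /* root */
        {12, 24},
        {12, 36}    /* leaf */
    };

    /* Subkey for a level: drop the low insignificant bits plus the bits
     * belonging to deeper levels, then mask off this level's bits. */
    static inline uintptr_t
    rtree_subkey_sketch(uintptr_t key, unsigned level) {
        const unsigned lg_page = 12, sig_bits = 36;
        unsigned shift = lg_page + sig_bits
            - rtree_levels_sketch[level].cumbits;
        uintptr_t mask =
            ((uintptr_t)1 << rtree_levels_sketch[level].bits) - 1;
        return (key >> shift) & mask;
    }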
|
A subsequent change instead ignores insignificant high bits.
|
Anything but a hit in the first element of the lookup cache is
expensive enough to negate the benefits of inlining.
|
Read adjacent rtree elements while holding element locks, since the
extents mutex only protects against relevant like-state extent mutation.
Fix management of the 'coalesced' loop state variable to merge
forward/backward results, rather than overwriting the result of forward
coalescing if attempting to coalesce backward. In practice this caused
no correctness issues, but could cause extra iterations in rare cases.
These regressions were introduced by
d27f29b468ae3e9d2b1da4a9880351d76e5a1662 (Disentangle arena and extent
locking.).
|
Set extent as active prior to registration so that other threads can't
modify it in the absence of locking.
This regression was introduced by
d27f29b468ae3e9d2b1da4a9880351d76e5a1662 (Disentangle arena and extent
locking.), via non-obvious means. Removal of extents_mtx protection
during extent_grow_retained() execution opened up the race, but in the
presence of that locking, the code was safe.
This resolves #599.
|
Do not check for overflow unless it is actually a possibility.
|
Fix compute_size_with_overflow() to use a high_bits mask that has the
high bits set, rather than the low bits. This regression was introduced
by 5154ff32ee8c37bacb6afd8a07b923eb33228357 (Unify the allocation
paths).
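A sketch of the corrected check, with illustrative names: if neither operand has a bit set in the high half of size_t the product cannot overflow, and only otherwise is the division-based check needed.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Writes nmemb * size to *result and returns true iff it overflowed. */
    static bool
    size_mul_overflows(size_t nmemb, size_t size, size_t *result) {
        /* High bits set (e.g. the top 32 bits of a 64-bit size_t). */
        static const size_t high_bits =
            SIZE_MAX << (sizeof(size_t) * 8 / 2);

        *result = nmemb * size;
        if (((nmemb | size) & high_bits) == 0) {
            return false;   /* Both operands fit in the low half. */
        }
        return (size != 0 && *result / size != nmemb);
    }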
|
Refactor arena and extent locking protocols such that arena and
extent locks are never held when calling into the extent_*_wrapper()
API. This requires extra care during purging since the arena lock no
longer protects the inner purging logic. It also requires extra care to
protect extents from being merged with adjacent extents.
Convert extent_t's 'active' flag to an enumerated 'state', so that
retained extents are explicitly marked as such, rather than depending on
ring linkage state.
Refactor the extent collections (and their synchronization) for cached
and retained extents into extents_t. Incorporate LRU functionality to
support purging. Incorporate page count accounting, which replaces
arena->ndirty and arena->stats.retained.
Assert that no core locks are held when entering any internal
[de]allocation functions. This is in addition to existing assertions
that no locks are held when entering external [de]allocation functions.
Audit and document synchronization protocols for all arena_t fields.
This fixes a potential deadlock due to recursive allocation during
gdump, in a similar fashion to b49c649bc18fff4bd10a1c8adbaf1f25f6453cb6
(Fix lock order reversal during gdump.), but with a necessarily much
broader code impact.
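An illustrative, self-contained sketch of two of the ideas above (the enumerated extent state replacing the boolean 'active' flag, and a per-state extents collection with its own lock, LRU ordering, and page accounting); every name and field here is an assumption, not jemalloc's actual definition.

    #include <pthread.h>
    #include <stddef.h>

    typedef enum {
        extent_state_active,    /* in use */
        extent_state_dirty,     /* cached; contents undefined */
        extent_state_retained   /* unbacked virtual memory kept for reuse */
    } extent_state_t;

    typedef struct extent_sketch_s extent_sketch_t;
    struct extent_sketch_s {
        extent_state_t  state;
        size_t          npages;
        extent_sketch_t *lru_prev, *lru_next;
    };

    /* One collection per non-active state, with its own synchronization. */
    typedef struct {
        pthread_mutex_t mtx;        /* protects this collection only */
        extent_sketch_t *lru_head;  /* purge candidates in LRU order */
        size_t          npages;     /* replaces arena-wide dirty counters */
        extent_state_t  state;      /* state shared by all member extents */
    } extents_sketch_t;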
|