Commit message
|
Explicitly use iallocztm for internal allocations. ialloc could trigger arena
creation, which may cause lock order reversal (narenas_mtx and log_mtx).
|
This fixes a compiler warning.
|
We should allow a way to easily disable the feature (e.g. not reserving the
arena id at all).
|
This change improves memory usage slightly, at virtually no CPU cost.
|
When it happens, this might slow down the fast-path operations. However, such
cases are very rare.
|
Proposed fix for #1444: ensure that `tls_callback` in the `#pragma comment(linker)` directive gets the same prefix added as it does in the C declaration.
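For context, here is a minimal sketch of the pattern involved (illustrative only; the symbol names are hypothetical and this is not jemalloc's actual tsd code):

```c
#if defined(_MSC_VER)
#include <windows.h>

/* Register a TLS callback via the CRT's callback section and force the
 * linker to keep it with /INCLUDE. */
static void NTAPI example_tls_callback(PVOID handle, DWORD reason, PVOID reserved) {
	(void)handle; (void)reason; (void)reserved;
	/* per-thread setup / cleanup would go here */
}

#pragma const_seg(".CRT$XLB")
const PIMAGE_TLS_CALLBACK p_example_tls_callback = example_tls_callback;
#pragma const_seg()

/* The /INCLUDE name must carry the same prefix the compiler gives the C
 * declaration: a leading underscore on 32-bit (x86) builds, none on x64. */
#if defined(_M_IX86)
#pragma comment(linker, "/INCLUDE:_p_example_tls_callback")
#else
#pragma comment(linker, "/INCLUDE:p_example_tls_callback")
#endif
#endif /* _MSC_VER */
```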
|
Only triggers libgcc unwind init when prof is enabled. This helps work around
some bootstrapping issues.
|
When not using libdl, still allows background_thread to be enabled.
|
This adds some overhead to the tcache flush path (which is one of the
popular paths). Guard it behind a config option.
|
The keyword "huge" tends to remind people of huge pages, which are not relevant
to this feature.
|
This feature uses a dedicated arena to handle huge requests, which
significantly improves VM fragmentation. In the production workloads we tested,
it often reduces VM size by >30%.
|
The rate calculation for the total row was missing.
|
For low arena count settings, the huge threshold feature may trigger unwanted
background thread creation. Given that the huge arena does eager purging by
default, bypass background thread creation when initializing the huge arena.
|
When custom extent_hooks or transparent huge pages are in use, the purging
semantics may change, which means we may not get zeroed pages when repopulating.
Fix the issue by memsetting manually in such cases.
|
Add extent_arena_ind_get() to avoid loading the actual arena pointer when we
only need to check for an arena match.
|
With sharded bins, we may not flush all items from the same arena in one run.
Adjust the stats merging logic accordingly.
|
This avoids having to choose a bin shard on the fly, and will also allow
flexible bin binding for each thread.
|
The option uses the same format as "slab_sizes" to specify the number of shards
for each bin size.
|
This makes it possible to have multiple sets of bins in an arena, which improves
arena scalability, because the bins (especially the small ones) are always the
limiting factor in production workloads.
A bin shard is picked on allocation; each extent tracks the bin shard id for
deallocation. The number of shards will be determined using runtime options.
|
If there are 3 or more threads spin-waiting on the same mutex,
there will be excessive exclusive cacheline contention, because
pthread_mutex_trylock() immediately tries to CAS in a new value
instead of first checking whether the lock is already held.
This diff adds a 'locked' hint flag; we only spin-wait without
trylock()ing while it is set. I don't know of any other portable
way to get the same behavior as pthread_mutex_lock().
This is pretty easy to test via ttest, e.g.
./ttest1 500 3 10000 1 100
Throughput is nearly 3x as fast.
Git blame attributes this to the mutex profiling changes. We almost never
have 3 or more threads contending in properly configured production
workloads, but it is still worth fixing.
|
The setting has been tested in production for a while. No negative effects were
observed, and we were able to reduce the number of threads per process.
|
Also adds a configure.ac check for __builtin_popcount, which is used
in the new fastpath.
|
Refactor tcache_fill, introducing a new function arena_slab_reg_alloc_batch,
which fills multiple pointers from a slab.
There should be no functional changes here, but this allows future optimization
of reg_alloc_batch.
|
We may have a large number of pages with *zero set (since they are populated on
demand). Only check the first page to avoid paging in all of them.
|
Also catch invalid tcache id.
|
Add unsized and sized deallocation fastpaths. Similar to the malloc()
fastpath, this removes all frame manipulation for the majority of
free() calls. The performance advantages here are smaller than those
of the malloc() fastpath, but prod tests still show roughly half
a percent of improvement.
Stats and sampling are both supported (sdallocx needs a sampling check;
for rtree lookups, slab will only be set for unsampled objects).
We don't support flush; any flush requests go to the slowpath.
|
It was removed in 0771ff2cea6dc18fcd3f6bf452b4224a4e17ae38.
Add a comment explaining its purpose.
|
The regression was introduced in 3a1363b.
|
When destroying tcache, decay may not be triggered since tsd is non-nominal.
Explicitly decay to avoid pathological cases.
|
We eagerly coalesce large buffers when deallocating; however, the previous logic
around this introduced extra lock overhead: when coalescing, we always locked the
neighbors even if they were active, while for active extents nothing can be done.
This commit checks whether the neighbor extents are potentially active before
locking, and avoids locking if possible. This speeds up large_dalloc by ~20%.
It also fixes some undesired behavior: we could stop coalescing because a small
buffer was merged, while a large neighbor on the other side was ignored.
|
When retain is enabled, the default dalloc hook does nothing (since we avoid
munmap). But the overhead of preparing the call is high; specifically, the
extent de-registration and re-registration involve locking and extent / rtree
modifications. This diff bypasses the call when retain is enabled.
|
When overcommit is enabled, commit needs to be set when doing mmap(). The
regression was introduced in f80c97e.
|
This diff adds a fastpath that assumes size <= SC_LOOKUP_MAXCLASS and
that we hit the tcache. If either of these is false, we fall back to
the previous codepath (renamed 'malloc_default').
Crucially, we only tail call malloc_default, and with the same kind
and number of arguments, so that tail-call optimization kicks in for
both clang and gcc; malloc() therefore gets treated as a leaf function,
and there are *no* caller-saved registers. Previously malloc() contained
5 caller-saved registers on x64, resulting in at least 10 extra
memory-movement instructions.
In microbenchmarks this results in up to ~10% improvement in the malloc()
fastpath. In real programs, this is a ~1% CPU and latency improvement
overall.
|
This commit concatenates `JEMALLOC_VERSION_GID` onto the `smallocx` symbol
name, such that the symbol ends up exported as `smallocx_{git_hash}`.
|
The experimental `smallocx` API is not exposed via header files, requiring
users to peek at `jemalloc`'s source code to manually add the external
declarations to their own programs.
This should reinforce that `smallocx` is experimental, and that `jemalloc`
does not offer any kind of backwards compatibility or ABI guarantees for it.
|
---
Motivation:
This new experimental memory-allocation API returns a pointer to
the allocation as well as the usable size of the allocated memory
region.
The `s` in `smallocx` stands for `sized`-`mallocx`, attempting to
convey that this API returns the size of the allocated memory region.
It should allow C++ P0901r0 [0] and Rust `Alloc::alloc_excess` to make
use of it.
The main purpose of these APIs is to improve telemetry. It is more accurate
to register `smallocx(size, flags)` than `smallocx(nallocx(size), flags)`,
for example. The latter will always line up perfectly with the existing
size classes, causing a loss of telemetry information about the internal
fragmentation induced by potentially poor size-class choices.
Instrumenting `nallocx` does not help much, since user code can cache its
result and use it repeatedly.
---
Implementation:
The implementation adds a new `usize` option to `static_opts_s` and a `usize`
variable to `dynamic_opts_s`. These are then used to cache the result of
`sz_index2size` and similar functions in the code paths in which they are
unconditionally invoked. In the code paths in which they are not
unconditionally invoked, `smallocx`, unlike `mallocx`, calls these functions
explicitly.
---
[0]: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0901r0.html