Commit messages
This change improves memory usage slightly, at virtually no CPU cost.
When it happens, this might cause a slowdown on fast-path operations; however,
such cases are very rare.
In some rare cases (an older compiler, e.g. gcc 4.2 on MIPS), 8-bit atomics
might be unavailable. Detect such cases so that we can work around them.
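A build-time probe for this can be as simple as trying to compile a
translation unit that exercises 8-bit atomics; the sketch below uses C11
<stdatomic.h> purely for illustration and is not jemalloc's actual configure
test.

```c
/* Probe: can the toolchain compile and link 8-bit atomic operations?
 * A build system would compile this file and treat failure as
 * "8-bit atomics unavailable". */
#include <stdatomic.h>
#include <stdint.h>

int
main(void) {
    _Atomic uint8_t x = 0;
    atomic_store_explicit(&x, 1, memory_order_release);
    uint8_t y = atomic_load_explicit(&x, memory_order_acquire);
    atomic_fetch_add_explicit(&x, y, memory_order_relaxed);
    return (int)atomic_load_explicit(&x, memory_order_acquire);
}
```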
This regression was introduced by
3d29d11ac2c1583b9959f73c0548545018d31c8a (Clean compilation -Wextra).
These macros have been unused since
d4ac7582f32f506d5203bea2f0115076202add38 (Introduce a backport of C11
atomics).
This fixes a build failure when integrating with FreeBSD's libc. This
regression was introduced by d1e11d48d4c706e17ef3508e2ddb910f109b779f
(Move tsd link and in_hook after tcache.).
This adds some overhead to the tcache flush path (one of the hot paths), so
guard it behind a config option.
The keyword "huge" tends to remind people of huge pages, which are not
relevant to this feature.
This feature uses a dedicated arena to handle huge requests, which
significantly reduces VM fragmentation. In the production workloads we tested,
it often reduces VM size by more than 30%.
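For reference, this kind of routing would typically be enabled through a
threshold option in malloc_conf; the sketch below assumes the post-rename
option spelling (oversize_threshold, see the entry above), and the 8 MiB value
is purely illustrative.

```c
/* Illustrative only: route allocations larger than 8 MiB to the dedicated
 * arena. The option name "oversize_threshold" is an assumption here. */
const char *malloc_conf = "oversize_threshold:8388608";
/* The same string could instead be supplied via the MALLOC_CONF
 * environment variable. */
```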
For low arena count settings, the huge threshold feature may trigger unwanted
background thread creation. Given that the huge arena does eager purging by
default, bypass background thread creation when initializing the huge arena.
When custom extent_hooks or transparent huge pages are in use, the purging
semantics may change, which means we may not get zeroed pages on
repopulation. Fix the issue by manually memset()ing the pages in such cases.
Add extent_arena_ind_get() to avoid loading the actual arena pointer when we
only need to check whether the arena matches.
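A minimal sketch of the idea, with stand-in types rather than jemalloc's real
definitions: arena matching becomes an integer compare on the index stored in
the extent, with no arena pointer load.

```c
#include <stdbool.h>

/* Stand-in for the real extent metadata. */
typedef struct {
    unsigned arena_ind; /* index of the owning arena */
} extent_t;

static inline unsigned
extent_arena_ind_get(const extent_t *extent) {
    return extent->arena_ind;
}

/* Matching check: no arena pointer dereference, just an integer compare. */
static inline bool
extent_in_arena(const extent_t *extent, unsigned arena_ind) {
    return extent_arena_ind_get(extent) == arena_ind;
}
```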
This avoids having to choose a bin shard on the fly, and will also allow
flexible bin binding for each thread.
The option uses the same format as "slab_sizes" to specify the number of
shards for each bin size.
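A hypothetical setting is shown below; the option name (bin_shards) and the
exact range syntax are assumptions for illustration, not a documented
interface.

```c
/* Hypothetical: 8 shards for bins up to 160 bytes, 4 shards for bins from
 * 161 to 4096 bytes, using slab_sizes-style size ranges. */
const char *malloc_conf = "bin_shards:1-160:8|161-4096:4";
```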
This makes it possible to have multiple sets of bins in an arena, which
improves arena scalability, because the bins (especially the small ones) are
always the limiting factor in production workloads.
A bin shard is picked on allocation; each extent tracks the bin shard id for
deallocation. The number of shards is determined via runtime options.
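One simple way to realize "pick a shard on allocation" is sketched below; it
is purely illustrative and not jemalloc's actual policy. Each extent would
then record the chosen shard index so that deallocation returns memory to the
same shard.

```c
#include <stdint.h>

/* Map a per-thread seed to a shard index with a cheap multiplicative hash,
 * so a given thread consistently uses the same shard of each bin.
 * Assumes n_shards > 0. */
static inline unsigned
bin_shard_pick(uint64_t thread_seed, unsigned n_shards) {
    uint64_t h = thread_seed * 0x9E3779B97F4A7C15ULL; /* golden-ratio hash */
    return (unsigned)(h >> 33) % n_shards;
}
```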
If there are 3 or more threads spin-waiting on the same mutex, there will be
excessive exclusive cacheline contention, because pthread_mutex_trylock()
immediately tries to CAS in a new value instead of first checking whether the
lock is already held.
This diff adds a 'locked' hint flag; we only spin-wait, without calling
trylock(), while it is set. I don't know of any other portable way to get the
same behavior as pthread_mutex_lock().
This is pretty easy to test via ttest, e.g.
./ttest1 500 3 10000 1 100
Throughput is nearly 3x as fast.
This blames to the mutex profiling changes; however, we almost never have 3 or
more threads contending in properly configured production workloads. It is
still worth fixing.
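A minimal sketch of the approach, using C11 atomics instead of jemalloc's
malloc_mutex internals: waiters spin on a plain load of the 'locked' hint and
only attempt the atomic acquire (the trylock analogue) once the lock looks
free.

```c
#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    atomic_bool locked; /* hint: true while some thread holds the lock */
} hint_lock_t;

static bool
hint_lock_try(hint_lock_t *l) {
    bool expected = false;
    return atomic_compare_exchange_strong_explicit(&l->locked, &expected,
        true, memory_order_acquire, memory_order_relaxed);
}

static void
hint_lock_acquire(hint_lock_t *l) {
    while (!hint_lock_try(l)) {
        /* Read-only spin: waiters share the cacheline instead of bouncing
         * it with repeated CAS attempts. A real implementation would bound
         * this spin and fall back to blocking on the underlying mutex. */
        while (atomic_load_explicit(&l->locked, memory_order_relaxed)) {
            /* spin */
        }
    }
}

static void
hint_lock_release(hint_lock_t *l) {
    atomic_store_explicit(&l->locked, false, memory_order_release);
}
```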
The setting has been tested in production for a while. No negative effects
were observed, and we were able to reduce the number of threads per process.
Also adds a configure.ac check for __builtin_popcount, which is used
in the new fastpath.
Also catch invalid tcache id.
Add a cache_bin_dalloc_easy (to match the alloc_easy function),
and use it in tcache_dalloc_small. It will also be used in the
new free fastpath.
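An illustrative shape for such a helper is sketched below with stand-in
types; jemalloc's real cache_bin layout differs.

```c
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    void **avail;       /* array of cached pointers */
    size_t ncached;     /* current number of cached pointers */
    size_t ncached_max; /* capacity */
} toy_cache_bin_t;

/* Returns true on success; false means the bin is full and the caller must
 * flush / fall back to the slower deallocation path. */
static inline bool
toy_cache_bin_dalloc_easy(toy_cache_bin_t *bin, void *ptr) {
    if (bin->ncached == bin->ncached_max) {
        return false;
    }
    bin->avail[bin->ncached++] = ptr;
    return true;
}
```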
For a free fastpath, we want something that will not make additional
calls. Assume most free() calls will hit the L1 cache, and use
a custom rtree function for this.
Additionally, roll the ptr == NULL check into the rtree cache check.
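A toy version of that shape (not jemalloc's rtree code): the cached key is a
page-aligned address and empty slots hold a key that no valid pointer can
produce, so a cache miss and ptr == NULL fail the same single comparison.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uintptr_t key;   /* page-aligned address of the cached mapping */
    unsigned  szind; /* cached size-class index */
} toy_rtree_cache_slot_t;

/* Empty slots are initialized to a key no page-aligned pointer can equal. */
#define TOY_CACHE_KEY_INVALID ((uintptr_t)1)

static inline bool
toy_cache_lookup_fast(const toy_rtree_cache_slot_t *slot, const void *ptr,
    unsigned *szind) {
    /* Assume 4 KiB pages for this toy. ptr == NULL yields key == 0, which
     * is never cached, so the NULL check and the miss check share this one
     * branch. */
    uintptr_t key = (uintptr_t)ptr & ~((uintptr_t)0xfff);
    if (slot->key != key) {
        return false; /* slow path: cache miss, or freeing NULL */
    }
    *szind = slot->szind;
    return true;
}
```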
Nearly all 32-bit powerpc hardware treats lwsync as sync, and some cores
(Freescale e500) trap lwsync as an illegal instruction, which then gets
emulated in the kernel. To avoid unnecessary traps on the e500, use
sync on all 32-bit powerpc. This pessimizes 32-bit software running on
64-bit hardware, but such configurations should be rare.
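A sketch of the resulting choice; the macro name is illustrative, not
jemalloc's.

```c
/* Memory barrier on PowerPC: 64-bit targets can use lwsync where
 * acquire/release ordering suffices, while 32-bit targets always emit sync
 * because some 32-bit cores (e.g. Freescale e500) trap lwsync. */
#if defined(__powerpc64__)
#  define PPC_MEMBAR() __asm__ __volatile__ ("lwsync" ::: "memory")
#elif defined(__powerpc__)
#  define PPC_MEMBAR() __asm__ __volatile__ ("sync" ::: "memory")
#endif
```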
The diff 'refactor prof accum...' moved the bytes_until_sample
subtraction before the load of tdata. If tdata is null,
tdata_get(true) will overwrite bytes_until_sample, but we
still sample the current allocation. Instead, do the subtraction
and check logic again, to keep the previous behavior.
blame-rev: 0ac524308d3f636d1a4b5149fa7adf24cf426d9c
For the fastpath, we want to tick, but undo the tick and jump to the slowpath
if the ticker would fire.
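A minimal sketch of the "tick, but back out if it would fire" idea, with a
plain struct standing in for jemalloc's ticker.

```c
#include <stdbool.h>

typedef struct {
    int tick;   /* counts down to the next event */
    int nticks; /* reload value, applied by the slow path when it fires */
} toy_ticker_t;

/* Fast-path variant: returns true if the ticker would fire; in that case the
 * tick is undone and the caller jumps to the slow path, which performs the
 * real tick-and-fire. */
static inline bool
toy_ticker_trytick_fast(toy_ticker_t *t) {
    t->tick--;
    if (t->tick < 0) {
        t->tick++; /* undo */
        return true;
    }
    return false;
}
```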
This commit concatenates `JEMALLOC_VERSION_GID` onto the `smallocx` symbol
name, such that the symbol ends up exported as `smallocx_{git_hash}`.
The experimental `smallocx` API is not exposed via header files, requiring
users to peek at `jemalloc`'s source code to manually add the external
declarations to their own programs.
This should reinforce that `smallocx` is experimental, and that `jemalloc`
does not offer any kind of backwards compatibility or ABI guarantees for it.
---
Motivation:
This new experimental memory-allocation API returns a pointer to
the allocation as well as the usable size of the allocated memory
region.
The `s` in `smallocx` stands for `sized`-`mallocx`, attempting to
convey that this API returns the size of the allocated memory region.
It should allow C++ P0901r0 [0] and Rust Alloc::alloc_excess to make
use of it.
The main purpose of these APIs is to improve telemetry. It is more accurate
to register `smallocx(size, flags)` than `smallocx(nallocx(size), flags)`,
for example. The latter will always line up perfectly with the existing
size classes, causing a loss of telemetry information about the internal
fragmentation induced by potentially poor size-class choices.
Instrumenting `nallocx` does not help much since user code can cache its
result and use it repeatedly.
---
Implementation:
The implementation adds a new `usize` option to `static_opts_s` and a `usize`
variable to `dynamic_opts_s`. These are then used to cache the result of
`sz_index2size` and similar functions in the code paths in which they are
unconditionally invoked. In the code paths in which these functions are not
unconditionally invoked, `smallocx`, as opposed to `mallocx`, calls them
explicitly.
---
[0]: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0901r0.html
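A hypothetical caller-side sketch: since no header provides a declaration,
the return type below is written out as an assumption based on the
description above, and the `_{git_hash}` suffix on the exported name is
omitted.

```c
#include <stddef.h>
#include <stdlib.h>

/* Assumed shape of the experimental API; the caller supplies the
 * declaration by hand. */
typedef struct {
    void  *ptr;  /* the allocation */
    size_t size; /* its usable size (>= the requested size) */
} smallocx_return_t;

extern smallocx_return_t smallocx(size_t size, int flags);

static size_t
example(void) {
    smallocx_return_t r = smallocx(100, 0);
    /* Telemetry can record the requested 100 bytes while the caller still
     * learns the real usable size without a separate nallocx() call. */
    size_t usable = r.size;
    free(r.ptr);
    return usable;
}
```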
Enables generation of `sub bytes_until_sample, usize; je` for the x86 arch:
the subtraction is unconditional, and only the flags are checked for the jump,
so no extra compare is necessary. This also reduces register pressure.
to load tdata now, avoiding several branches.
Combine the branch that checks for an empty cache_bin with the branch that
checks the low watermark.
There's an optimizer bug upstream that results in test failures; reported at
https://bugzilla.redhat.com/show_bug.cgi?id=1619354. This works around the
failure reported at https://github.com/jemalloc/jemalloc/issues/1307.
This can be useful in situations where readlink is disallowed.
- Show the number and bytes of extents of each size that are dirty, muzzy, or retained.
- prof_opt_log flag starts logging automatically at runtime
- prof_log_{start,stop} mallctl for manual control (a usage sketch follows)
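A hypothetical usage sketch, assuming the controls are exposed to mallctl as
"prof.log_start" and "prof.log_stop"; the exact names and argument handling
are assumptions.

```c
#include <jemalloc/jemalloc.h>

/* Bracket a workload of interest with manual log start/stop. Whether
 * "prof.log_start" accepts an output filename via newp is not shown; both
 * calls are issued with no arguments here. */
static void
profile_log_region(void (*workload)(void)) {
    mallctl("prof.log_start", NULL, NULL, NULL, 0);
    workload();
    mallctl("prof.log_stop", NULL, NULL, NULL, 0);
}
```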