aboutsummaryrefslogtreecommitdiffstats
path: root/arch/ia64/kernel/syscalls
Commit message (Collapse)AuthorAgeFilesLines
* ia64: fix clock_getres(CLOCK_MONOTONIC) to report ITC frequencySergei Trofimovich2022-09-111-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | clock_gettime(CLOCK_MONOTONIC, &tp) is very precise on ia64 as it uses ITC (similar to rdtsc on x86). It's not quite a hrtimer as it is a few times slower than 1ns. Usually 2-3ns. clock_getres(CLOCK_MONOTONIC, &res) never reflected that fact and reported 0.04s precision (1/HZ value). In https://bugs.gentoo.org/596382 gstreamer's test suite failed loudly when it noticed precision discrepancy. Before the change: clock_getres(CLOCK_MONOTONIC, &res) reported 250Hz precision. After the change: clock_getres(CLOCK_MONOTONIC, &res) reports ITC (400Mhz) precision. The patch is based on matoro's fix. I added a bit of explanation why we need to special-case arch-specific clock_getres(). [akpm@linux-foundation.org: coding-style cleanups] Link: https://lkml.kernel.org/r/20220820181813.2275195-1-slyich@gmail.com Signed-off-by: Sergei Trofimovich <slyich@gmail.com> Cc: matoro <matoro_mailinglist_kernel@matoro.tk> Cc: Émeric Maschino <emeric.maschino@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* arch: syscalls: simplify uapi/kapi directory creationMasahiro Yamada2022-03-311-2/+1
| | | | | | | $(shell ...) expands to empty. There is no need to assign it to _dummy. Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
* mm/mempolicy: wire up syscall set_mempolicy_home_nodeAneesh Kumar K.V2022-01-151-0/+1
| | | | | | | | | | | | | | | | | | | | Link: https://lkml.kernel.org/r/20211202123810.267175-4-aneesh.kumar@linux.ibm.com Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Ben Widawsky <ben.widawsky@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Feng Tang <feng.tang@intel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Andi Kleen <ak@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Huang Ying <ying.huang@intel.com> Cc: <linux-api@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* futex: Wireup futex_waitv syscallAndré Almeida2021-11-251-0/+1
| | | | | | | | | | Wireup futex_waitv syscall for all remaining archs. Signed-off-by: André Almeida <andrealmeid@collabora.com> Acked-by: Max Filippov <jcmvbkbc@gmail.com> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Tested-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc) Signed-off-by: Arnd Bergmann <arnd@arndb.de>
* Merge branch 'akpm' (patches from Andrew)Linus Torvalds2021-09-031-0/+2
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Merge misc updates from Andrew Morton: "173 patches. Subsystems affected by this series: ia64, ocfs2, block, and mm (debug, pagecache, gup, swap, shmem, memcg, selftests, pagemap, mremap, bootmem, sparsemem, vmalloc, kasan, pagealloc, memory-failure, hugetlb, userfaultfd, vmscan, compaction, mempolicy, memblock, oom-kill, migration, ksm, percpu, vmstat, and madvise)" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (173 commits) mm/madvise: add MADV_WILLNEED to process_madvise() mm/vmstat: remove unneeded return value mm/vmstat: simplify the array size calculation mm/vmstat: correct some wrong comments mm/percpu,c: remove obsolete comments of pcpu_chunk_populated() selftests: vm: add COW time test for KSM pages selftests: vm: add KSM merging time test mm: KSM: fix data type selftests: vm: add KSM merging across nodes test selftests: vm: add KSM zero page merging test selftests: vm: add KSM unmerge test selftests: vm: add KSM merge test mm/migrate: correct kernel-doc notation mm: wire up syscall process_mrelease mm: introduce process_mrelease system call memblock: make memblock_find_in_range method private mm/mempolicy.c: use in_task() in mempolicy_slab_node() mm/mempolicy: unify the create() func for bind/interleave/prefer-many policies mm/mempolicy: advertise new MPOL_PREFERRED_MANY mm/hugetlb: add support for mempolicy MPOL_PREFERRED_MANY ...
| * mm: wire up syscall process_mreleaseSuren Baghdasaryan2021-09-031-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Split off from prev patch in the series that implements the syscall. Link: https://lkml.kernel.org/r/20210809185259.405936-2-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Christian Brauner <christian.brauner@ubuntu.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Florian Weimer <fweimer@redhat.com> Cc: Jan Engelhardt <jengelh@inai.de> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Rik van Riel <riel@surriel.com> Cc: Roman Gushchin <guro@fb.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tim Murray <timmurray@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | exit/bdflush: Remove the deprecated bdflush system callEric W. Biederman2021-07-121-1/+1
|/ | | | | | | | | | | | | | | | The bdflush system call has been deprecated for a very long time. Recently Michael Schmitz tested[1] and found that the last known caller of of the bdflush system call is unaffected by it's removal. Since the code is not needed delete it. [1] https://lkml.kernel.org/r/36123b5d-daa0-6c2b-f2d4-a942f069fd54@gmail.com Link: https://lkml.kernel.org/r/87sg10quue.fsf_-_@disp2133 Tested-by: Michael Schmitz <schmitzmic@gmail.com> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Cyril Hrubis <chrubis@suse.cz> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
* quota: Wire up quotactl_fd syscallJan Kara2021-06-071-1/+1
| | | | | | | Wire up the quotactl_fd syscall. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz>
* quota: Disable quotactl_path syscallJan Kara2021-05-171-1/+1
| | | | | | | | | | | | | | | | In commit fa8b90070a80 ("quota: wire up quotactl_path") we have wired up new quotactl_path syscall. However some people in LWN discussion have objected that the path based syscall is missing dirfd and flags argument which is mostly standard for contemporary path based syscalls. Indeed they have a point and after a discussion with Christian Brauner and Sascha Hauer I've decided to disable the syscall for now and update its API. Since there is no userspace currently using that syscall and it hasn't been released in any major release, we should be fine. CC: Christian Brauner <christian.brauner@ubuntu.com> CC: Sascha Hauer <s.hauer@pengutronix.de> Link: https://lore.kernel.org/lkml/20210512153621.n5u43jsytbik4yze@wittgenstein Signed-off-by: Jan Kara <jack@suse.cz>
* Merge tag 'landlock_v34' of ↵Linus Torvalds2021-05-011-0/+3
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security Pull Landlock LSM from James Morris: "Add Landlock, a new LSM from Mickaël Salaün. Briefly, Landlock provides for unprivileged application sandboxing. From Mickaël's cover letter: "The goal of Landlock is to enable to restrict ambient rights (e.g. global filesystem access) for a set of processes. Because Landlock is a stackable LSM [1], it makes possible to create safe security sandboxes as new security layers in addition to the existing system-wide access-controls. This kind of sandbox is expected to help mitigate the security impact of bugs or unexpected/malicious behaviors in user-space applications. Landlock empowers any process, including unprivileged ones, to securely restrict themselves. Landlock is inspired by seccomp-bpf but instead of filtering syscalls and their raw arguments, a Landlock rule can restrict the use of kernel objects like file hierarchies, according to the kernel semantic. Landlock also takes inspiration from other OS sandbox mechanisms: XNU Sandbox, FreeBSD Capsicum or OpenBSD Pledge/Unveil. In this current form, Landlock misses some access-control features. This enables to minimize this patch series and ease review. This series still addresses multiple use cases, especially with the combined use of seccomp-bpf: applications with built-in sandboxing, init systems, security sandbox tools and security-oriented APIs [2]" The cover letter and v34 posting is here: https://lore.kernel.org/linux-security-module/20210422154123.13086-1-mic@digikod.net/ See also: https://landlock.io/ This code has had extensive design discussion and review over several years" Link: https://lore.kernel.org/lkml/50db058a-7dde-441b-a7f9-f6837fe8b69f@schaufler-ca.com/ [1] Link: https://lore.kernel.org/lkml/f646e1c7-33cf-333f-070c-0a40ad0468cd@digikod.net/ [2] * tag 'landlock_v34' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: landlock: Enable user space to infer supported features landlock: Add user and kernel documentation samples/landlock: Add a sandbox manager example selftests/landlock: Add user space tests landlock: Add syscall implementations arch: Wire up Landlock syscalls fs,security: Add sb_delete hook landlock: Support filesystem access-control LSM: Infrastructure management of the superblock landlock: Add ptrace restrictions landlock: Set up the security framework and manage credentials landlock: Add ruleset and domain management landlock: Add object management
| * arch: Wire up Landlock syscallsMickaël Salaün2021-04-221-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Wire up the following system calls for all architectures: * landlock_create_ruleset(2) * landlock_add_rule(2) * landlock_restrict_self(2) Cc: Arnd Bergmann <arnd@arndb.de> Cc: James Morris <jmorris@namei.org> Cc: Jann Horn <jannh@google.com> Cc: Kees Cook <keescook@chromium.org> Cc: Serge E. Hallyn <serge@hallyn.com> Signed-off-by: Mickaël Salaün <mic@linux.microsoft.com> Link: https://lore.kernel.org/r/20210422154123.13086-10-mic@digikod.net Signed-off-by: James Morris <jamorris@linux.microsoft.com>
* | Merge tag 'kbuild-v5.13' of ↵Linus Torvalds2021-04-293-80/+4
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild Pull Kbuild updates from Masahiro Yamada: - Evaluate $(call cc-option,...) etc. only for build targets - Add CONFIG_VMLINUX_MAP to generate .map file when linking vmlinux - Remove unnecessary --gcc-toolchains Clang flag because the --prefix flag finds the toolchains - Do not pass Clang's --prefix flag when using the integrated as - Check the assembler version in Kconfig time - Add new CONFIG options, AS_VERSION, AS_IS_GNU, AS_IS_LLVM to clean up some dependencies in Kconfig - Fix invalid Module.symvers creation when building only modules without vmlinux - Fix false-positive modpost warnings when CONFIG_TRIM_UNUSED_KSYMS is set, but there is no module to build - Refactor module installation Makefile - Support zstd for module compression - Convert alpha and ia64 to use generic shell scripts to generate the syscall headers - Add a new elfnote to indicate if the kernel was built with LTO, which will be used by pahole - Flatten the directory structure under include/config/ so CONFIG options and filenames match - Change the deb source package name from linux-$(KERNELRELEASE) to linux-upstream * tag 'kbuild-v5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (42 commits) kbuild: Add $(KBUILD_HOSTLDFLAGS) to 'has_libelf' test kbuild: deb-pkg: change the source package name to linux-upstream tools: do not include scripts/Kbuild.include kbuild: redo fake deps at include/config/*.h kbuild: remove TMPO from try-run MAINTAINERS: add pattern for dummy-tools kbuild: add an elfnote for whether vmlinux is built with lto ia64: syscalls: switch to generic syscallhdr.sh ia64: syscalls: switch to generic syscalltbl.sh alpha: syscalls: switch to generic syscallhdr.sh alpha: syscalls: switch to generic syscalltbl.sh sysctl: use min() helper for namecmp() kbuild: add support for zstd compressed modules kbuild: remove CONFIG_MODULE_COMPRESS kbuild: merge scripts/Makefile.modsign to scripts/Makefile.modinst kbuild: move module strip/compression code into scripts/Makefile.modinst kbuild: refactor scripts/Makefile.modinst kbuild: rename extmod-prefix to extmod_prefix kbuild: check module name conflict for external modules as well kbuild: show the target directory for depmod log ...
| * | ia64: syscalls: switch to generic syscallhdr.shMasahiro Yamada2021-04-252-42/+2
| | | | | | | | | | | | | | | | | | | | | | | | Many architectures duplicate similar shell scripts. This commit converts ia64 to use scripts/syscallhdr.sh. Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
| * | ia64: syscalls: switch to generic syscalltbl.shMasahiro Yamada2021-04-252-38/+2
| |/ | | | | | | | | | | | | | | Many architectures duplicate similar shell scripts. This commit converts ia64 to use scripts/syscalltbl.sh. Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
* / quota: wire up quotactl_pathSascha Hauer2021-03-171-0/+1
|/ | | | | | | | | Wire up the quotactl_path syscall added in the previous patch. Link: https://lore.kernel.org/r/20210304123541.30749-3-s.hauer@pengutronix.de Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz>
* Merge tag 'kbuild-v5.12' of ↵Linus Torvalds2021-02-251-6/+7
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild Pull Kbuild updates from Masahiro Yamada: - Fix false-positive build warnings for ARCH=ia64 builds - Optimize dictionary size for module compression with xz - Check the compiler and linker versions in Kconfig - Fix misuse of extra-y - Support DWARF v5 debug info - Clamp SUBLEVEL to 255 because stable releases 4.4.x and 4.9.x exceeded the limit - Add generic syscall{tbl,hdr}.sh for cleanups across arches - Minor cleanups of genksyms - Minor cleanups of Kconfig * tag 'kbuild-v5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (38 commits) initramfs: Remove redundant dependency of RD_ZSTD on BLK_DEV_INITRD kbuild: remove deprecated 'always' and 'hostprogs-y/m' kbuild: parse C= and M= before changing the working directory kbuild: reuse this-makefile to define abs_srctree kconfig: unify rule of config, menuconfig, nconfig, gconfig, xconfig kconfig: omit --oldaskconfig option for 'make config' kconfig: fix 'invalid option' for help option kconfig: remove dead code in conf_askvalue() kconfig: clean up nested if-conditionals in check_conf() kconfig: Remove duplicate call to sym_get_string_value() Makefile: Remove # characters from compiler string Makefile: reuse CC_VERSION_TEXT kbuild: check the minimum linker version in Kconfig kbuild: remove ld-version macro scripts: add generic syscallhdr.sh scripts: add generic syscalltbl.sh arch: syscalls: remove $(srctree)/ prefix from syscall tables arch: syscalls: add missing FORCE and fix 'targets' to make if_changed work gen_compile_commands: prune some directories kbuild: simplify access to the kernel's version ...
| * arch: syscalls: remove $(srctree)/ prefix from syscall tablesMasahiro Yamada2021-02-221-1/+1
| | | | | | | | | | | | | | The 'syscall' variables are not directly used in the commands. Remove the $(srctree)/ prefix because we can rely on VPATH. Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
| * arch: syscalls: add missing FORCE and fix 'targets' to make if_changed workMasahiro Yamada2021-02-221-5/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | The rules in these Makefiles cannot detect the command line change because the prerequisite 'FORCE' is missing. Adding 'FORCE' will result in the headers being rebuilt every time because the 'targets' additions are also wrong; the file paths in 'targets' must be relative to the current Makefile. Fix all of them so the if_changed rules work correctly. Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
* | fs: add mount_setattr()Christian Brauner2021-01-241-0/+1
|/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This implements the missing mount_setattr() syscall. While the new mount api allows to change the properties of a superblock there is currently no way to change the properties of a mount or a mount tree using file descriptors which the new mount api is based on. In addition the old mount api has the restriction that mount options cannot be applied recursively. This hasn't changed since changing mount options on a per-mount basis was implemented in [1] and has been a frequent request not just for convenience but also for security reasons. The legacy mount syscall is unable to accommodate this behavior without introducing a whole new set of flags because MS_REC | MS_REMOUNT | MS_BIND | MS_RDONLY | MS_NOEXEC | [...] only apply the mount option to the topmost mount. Changing MS_REC to apply to the whole mount tree would mean introducing a significant uapi change and would likely cause significant regressions. The new mount_setattr() syscall allows to recursively clear and set mount options in one shot. Multiple calls to change mount options requesting the same changes are idempotent: int mount_setattr(int dfd, const char *path, unsigned flags, struct mount_attr *uattr, size_t usize); Flags to modify path resolution behavior are specified in the @flags argument. Currently, AT_EMPTY_PATH, AT_RECURSIVE, AT_SYMLINK_NOFOLLOW, and AT_NO_AUTOMOUNT are supported. If useful, additional lookup flags to restrict path resolution as introduced with openat2() might be supported in the future. The mount_setattr() syscall can be expected to grow over time and is designed with extensibility in mind. It follows the extensible syscall pattern we have used with other syscalls such as openat2(), clone3(), sched_{set,get}attr(), and others. The set of mount options is passed in the uapi struct mount_attr which currently has the following layout: struct mount_attr { __u64 attr_set; __u64 attr_clr; __u64 propagation; __u64 userns_fd; }; The @attr_set and @attr_clr members are used to clear and set mount options. This way a user can e.g. request that a set of flags is to be raised such as turning mounts readonly by raising MOUNT_ATTR_RDONLY in @attr_set while at the same time requesting that another set of flags is to be lowered such as removing noexec from a mount tree by specifying MOUNT_ATTR_NOEXEC in @attr_clr. Note, since the MOUNT_ATTR_<atime> values are an enum starting from 0, not a bitmap, users wanting to transition to a different atime setting cannot simply specify the atime setting in @attr_set, but must also specify MOUNT_ATTR__ATIME in the @attr_clr field. So we ensure that MOUNT_ATTR__ATIME can't be partially set in @attr_clr and that @attr_set can't have any atime bits set if MOUNT_ATTR__ATIME isn't set in @attr_clr. The @propagation field lets callers specify the propagation type of a mount tree. Propagation is a single property that has four different settings and as such is not really a flag argument but an enum. Specifically, it would be unclear what setting and clearing propagation settings in combination would amount to. The legacy mount() syscall thus forbids the combination of multiple propagation settings too. The goal is to keep the semantics of mount propagation somewhat simple as they are overly complex as it is. The @userns_fd field lets user specify a user namespace whose idmapping becomes the idmapping of the mount. This is implemented and explained in detail in the next patch. [1]: commit 2e4b7fcd9260 ("[PATCH] r/o bind mounts: honor mount writer counts at remount") Link: https://lore.kernel.org/r/20210121131959.646623-35-christian.brauner@ubuntu.com Cc: David Howells <dhowells@redhat.com> Cc: Aleksa Sarai <cyphar@cyphar.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Cc: linux-api@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
* epoll: wire up syscall epoll_pwait2Willem de Bruijn2020-12-191-0/+1
| | | | | | | | | | | | Split off from prev patch in the series that implements the syscall. Link: https://lkml.kernel.org/r/20201121144401.3727659-4-willemdebruijn.kernel@gmail.com Signed-off-by: Willem de Bruijn <willemb@google.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/madvise: introduce process_madvise() syscall: an external memory hinting APIMinchan Kim2020-10-181-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There is usecase that System Management Software(SMS) want to give a memory hint like MADV_[COLD|PAGEEOUT] to other processes and in the case of Android, it is the ActivityManagerService. The information required to make the reclaim decision is not known to the app. Instead, it is known to the centralized userspace daemon(ActivityManagerService), and that daemon must be able to initiate reclaim on its own without any app involvement. To solve the issue, this patch introduces a new syscall process_madvise(2). It uses pidfd of an external process to give the hint. It also supports vector address range because Android app has thousands of vmas due to zygote so it's totally waste of CPU and power if we should call the syscall one by one for each vma.(With testing 2000-vma syscall vs 1-vector syscall, it showed 15% performance improvement. I think it would be bigger in real practice because the testing ran very cache friendly environment). Another potential use case for the vector range is to amortize the cost ofTLB shootdowns for multiple ranges when using MADV_DONTNEED; this could benefit users like TCP receive zerocopy and malloc implementations. In future, we could find more usecases for other advises so let's make it happens as API since we introduce a new syscall at this moment. With that, existing madvise(2) user could replace it with process_madvise(2) with their own pid if they want to have batch address ranges support feature. ince it could affect other process's address range, only privileged process(PTRACE_MODE_ATTACH_FSCREDS) or something else(e.g., being the same UID) gives it the right to ptrace the process could use it successfully. The flag argument is reserved for future use if we need to extend the API. I think supporting all hints madvise has/will supported/support to process_madvise is rather risky. Because we are not sure all hints make sense from external process and implementation for the hint may rely on the caller being in the current context so it could be error-prone. Thus, I just limited hints as MADV_[COLD|PAGEOUT] in this patch. If someone want to add other hints, we could hear the usecase and review it for each hint. It's safer for maintenance rather than introducing a buggy syscall but hard to fix it later. So finally, the API is as follows, ssize_t process_madvise(int pidfd, const struct iovec *iovec, unsigned long vlen, int advice, unsigned int flags); DESCRIPTION The process_madvise() system call is used to give advice or directions to the kernel about the address ranges from external process as well as local process. It provides the advice to address ranges of process described by iovec and vlen. The goal of such advice is to improve system or application performance. The pidfd selects the process referred to by the PID file descriptor specified in pidfd. (See pidofd_open(2) for further information) The pointer iovec points to an array of iovec structures, defined in <sys/uio.h> as: struct iovec { void *iov_base; /* starting address */ size_t iov_len; /* number of bytes to be advised */ }; The iovec describes address ranges beginning at address(iov_base) and with size length of bytes(iov_len). The vlen represents the number of elements in iovec. The advice is indicated in the advice argument, which is one of the following at this moment if the target process specified by pidfd is external. MADV_COLD MADV_PAGEOUT Permission to provide a hint to external process is governed by a ptrace access mode PTRACE_MODE_ATTACH_FSCREDS check; see ptrace(2). The process_madvise supports every advice madvise(2) has if target process is in same thread group with calling process so user could use process_madvise(2) to extend existing madvise(2) to support vector address ranges. RETURN VALUE On success, process_madvise() returns the number of bytes advised. This return value may be less than the total number of requested bytes, if an error occurred. The caller should check return value to determine whether a partial advice occurred. FAQ: Q.1 - Why does any external entity have better knowledge? Quote from Sandeep "For Android, every application (including the special SystemServer) are forked from Zygote. The reason of course is to share as many libraries and classes between the two as possible to benefit from the preloading during boot. After applications start, (almost) all of the APIs end up calling into this SystemServer process over IPC (binder) and back to the application. In a fully running system, the SystemServer monitors every single process periodically to calculate their PSS / RSS and also decides which process is "important" to the user for interactivity. So, because of how these processes start _and_ the fact that the SystemServer is looping to monitor each process, it does tend to *know* which address range of the application is not used / useful. Besides, we can never rely on applications to clean things up themselves. We've had the "hey app1, the system is low on memory, please trim your memory usage down" notifications for a long time[1]. They rely on applications honoring the broadcasts and very few do. So, if we want to avoid the inevitable killing of the application and restarting it, some way to be able to tell the OS about unimportant memory in these applications will be useful. - ssp Q.2 - How to guarantee the race(i.e., object validation) between when giving a hint from an external process and get the hint from the target process? process_madvise operates on the target process's address space as it exists at the instant that process_madvise is called. If the space target process can run between the time the process_madvise process inspects the target process address space and the time that process_madvise is actually called, process_madvise may operate on memory regions that the calling process does not expect. It's the responsibility of the process calling process_madvise to close this race condition. For example, the calling process can suspend the target process with ptrace, SIGSTOP, or the freezer cgroup so that it doesn't have an opportunity to change its own address space before process_madvise is called. Another option is to operate on memory regions that the caller knows a priori will be unchanged in the target process. Yet another option is to accept the race for certain process_madvise calls after reasoning that mistargeting will do no harm. The suggested API itself does not provide synchronization. It also apply other APIs like move_pages, process_vm_write. The race isn't really a problem though. Why is it so wrong to require that callers do their own synchronization in some manner? Nobody objects to write(2) merely because it's possible for two processes to open the same file and clobber each other's writes --- instead, we tell people to use flock or something. Think about mmap. It never guarantees newly allocated address space is still valid when the user tries to access it because other threads could unmap the memory right before. That's where we need synchronization by using other API or design from userside. It shouldn't be part of API itself. If someone needs more fine-grained synchronization rather than process level, there were two ideas suggested - cookie[2] and anon-fd[3]. Both are applicable via using last reserved argument of the API but I don't think it's necessary right now since we have already ways to prevent the race so don't want to add additional complexity with more fine-grained optimization model. To make the API extend, it reserved an unsigned long as last argument so we could support it in future if someone really needs it. Q.3 - Why doesn't ptrace work? Injecting an madvise in the target process using ptrace would not work for us because such injected madvise would have to be executed by the target process, which means that process would have to be runnable and that creates the risk of the abovementioned race and hinting a wrong VMA. Furthermore, we want to act the hint in caller's context, not the callee's, because the callee is usually limited in cpuset/cgroups or even freezed state so they can't act by themselves quick enough, which causes more thrashing/kill. It doesn't work if the target process are ptraced(e.g., strace, debugger, minidump) because a process can have at most one ptracer. [1] https://developer.android.com/topic/performance/memory" [2] process_getinfo for getting the cookie which is updated whenever vma of process address layout are changed - Daniel Colascione - https://lore.kernel.org/lkml/20190520035254.57579-1-minchan@kernel.org/T/#m7694416fd179b2066a2c62b5b139b14e3894e224 [3] anonymous fd which is used for the object(i.e., address range) validation - Michal Hocko - https://lore.kernel.org/lkml/20200120112722.GY18451@dhcp22.suse.cz/ [minchan@kernel.org: fix process_madvise build break for arm64] Link: http://lkml.kernel.org/r/20200303145756.GA219683@google.com [minchan@kernel.org: fix build error for mips of process_madvise] Link: http://lkml.kernel.org/r/20200508052517.GA197378@google.com [akpm@linux-foundation.org: fix patch ordering issue] [akpm@linux-foundation.org: fix arm64 whoops] [minchan@kernel.org: make process_madvise() vlen arg have type size_t, per Florian] [akpm@linux-foundation.org: fix i386 build] [sfr@canb.auug.org.au: fix syscall numbering] Link: https://lkml.kernel.org/r/20200905142639.49fc3f1a@canb.auug.org.au [sfr@canb.auug.org.au: madvise.c needs compat.h] Link: https://lkml.kernel.org/r/20200908204547.285646b4@canb.auug.org.au [minchan@kernel.org: fix mips build] Link: https://lkml.kernel.org/r/20200909173655.GC2435453@google.com [yuehaibing@huawei.com: remove duplicate header which is included twice] Link: https://lkml.kernel.org/r/20200915121550.30584-1-yuehaibing@huawei.com [minchan@kernel.org: do not use helper functions for process_madvise] Link: https://lkml.kernel.org/r/20200921175539.GB387368@google.com [akpm@linux-foundation.org: pidfd_get_pid() gained an argument] [sfr@canb.auug.org.au: fix up for "iov_iter: transparently handle compat iovecs in import_iovec"] Link: https://lkml.kernel.org/r/20200928212542.468e1fef@canb.auug.org.au Signed-off-by: Minchan Kim <minchan@kernel.org> Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Rientjes <rientjes@google.com> Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com> Cc: Brian Geffon <bgeffon@google.com> Cc: Christian Brauner <christian@brauner.io> Cc: Daniel Colascione <dancol@google.com> Cc: Jann Horn <jannh@google.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: John Dias <joaodias@google.com> Cc: Kirill Tkhai <ktkhai@virtuozzo.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Oleksandr Natalenko <oleksandr@redhat.com> Cc: Sandeep Patil <sspatil@google.com> Cc: SeongJae Park <sj38.park@gmail.com> Cc: SeongJae Park <sjpark@amazon.de> Cc: Shakeel Butt <shakeelb@google.com> Cc: Sonny Rao <sonnyrao@google.com> Cc: Tim Murray <timmurray@google.com> Cc: Christian Brauner <christian.brauner@ubuntu.com> Cc: Florian Weimer <fw@deneb.enyo.de> Cc: <linux-man@vger.kernel.org> Link: http://lkml.kernel.org/r/20200302193630.68771-3-minchan@kernel.org Link: http://lkml.kernel.org/r/20200508183320.GA125527@google.com Link: http://lkml.kernel.org/r/20200622192900.22757-4-minchan@kernel.org Link: https://lkml.kernel.org/r/20200901000633.1920247-4-minchan@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* ia64: Remove perfmonChristoph Hellwig2020-09-111-1/+1
| | | | | | | | | | | perfmon has been marked broken and thus been disabled for all builds for more than two years. Remove it entirely. Cc: Anant Thazhemadam <anant.thazhemadam@gmail.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Enthusiastically-ACKed-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/r/20200911094920.1173631-1-hch@lst.de
* all arch: remove system call sys_sysctlXiaoming Ni2020-08-141-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since commit 61a47c1ad3a4dc ("sysctl: Remove the sysctl system call"), sys_sysctl is actually unavailable: any input can only return an error. We have been warning about people using the sysctl system call for years and believe there are no more users. Even if there are users of this interface if they have not complained or fixed their code by now they probably are not going to, so there is no point in warning them any longer. So completely remove sys_sysctl on all architectures. [nixiaoming@huawei.com: s390: fix build error for sys_call_table_emu] Link: http://lkml.kernel.org/r/20200618141426.16884-1-nixiaoming@huawei.com Signed-off-by: Xiaoming Ni <nixiaoming@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Will Deacon <will@kernel.org> [arm/arm64] Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Aleksa Sarai <cyphar@cyphar.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andi Kleen <ak@linux.intel.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Bin Meng <bin.meng@windriver.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: chenzefeng <chenzefeng2@huawei.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Christian Brauner <christian@brauner.io> Cc: Chris Zankel <chris@zankel.net> Cc: David Howells <dhowells@redhat.com> Cc: David S. Miller <davem@davemloft.net> Cc: Diego Elio Pettenò <flameeyes@flameeyes.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Dominik Brodowski <linux@dominikbrodowski.net> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Helge Deller <deller@gmx.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Iurii Zaikin <yzaikin@google.com> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Kars de Jong <jongk@linux-m68k.org> Cc: Kees Cook <keescook@chromium.org> Cc: Krzysztof Kozlowski <krzk@kernel.org> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Marco Elver <elver@google.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Martin K. Petersen <martin.petersen@oracle.com> Cc: Masahiro Yamada <yamada.masahiro@socionext.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Miklos Szeredi <mszeredi@redhat.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Cc: Nick Piggin <npiggin@gmail.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Olof Johansson <olof@lixom.net> Cc: Paul Burton <paulburton@kernel.org> Cc: "Paul E. McKenney" <paulmck@kernel.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Ravi Bangoria <ravi.bangoria@linux.ibm.com> Cc: Richard Henderson <rth@twiddle.net> Cc: Rich Felker <dalias@libc.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Sami Tolvanen <samitolvanen@google.com> Cc: Sargun Dhillon <sargun@sargun.me> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Sudeep Holla <sudeep.holla@arm.com> Cc: Sven Schnelle <svens@stackframe.org> Cc: Thiago Jung Bauermann <bauerman@linux.ibm.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Zhou Yanjie <zhouyanjie@wanyeetech.com> Link: http://lkml.kernel.org/r/20200616030734.87257-1-nixiaoming@huawei.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* arch: wire-up close_range()Christian Brauner2020-06-171-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This wires up the close_range() syscall into all arches at once. Suggested-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc) Cc: Jann Horn <jannh@google.com> Cc: David Howells <dhowells@redhat.com> Cc: Dmitry V. Levin <ldv@altlinux.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Florian Weimer <fweimer@redhat.com> Cc: linux-api@vger.kernel.org Cc: linux-alpha@vger.kernel.org Cc: linux-arm-kernel@lists.infradead.org Cc: linux-ia64@vger.kernel.org Cc: linux-m68k@lists.linux-m68k.org Cc: linux-mips@vger.kernel.org Cc: linux-parisc@vger.kernel.org Cc: linuxppc-dev@lists.ozlabs.org Cc: linux-s390@vger.kernel.org Cc: linux-sh@vger.kernel.org Cc: sparclinux@vger.kernel.org Cc: linux-xtensa@linux-xtensa.org Cc: linux-arch@vger.kernel.org Cc: x86@kernel.org
* vfs: add faccessat2 syscallMiklos Szeredi2020-05-141-0/+1
| | | | | | | | | | | | | | | | | | POSIX defines faccessat() as having a fourth "flags" argument, while the linux syscall doesn't have it. Glibc tries to emulate AT_EACCESS and AT_SYMLINK_NOFOLLOW, but AT_EACCESS emulation is broken. Add a new faccessat(2) syscall with the added flags argument and implement both flags. The value of AT_EACCESS is defined in glibc headers to be the same as AT_REMOVEDIR. Use this value for the kernel interface as well, together with the explanatory comment. Also add AT_EMPTY_PATH support, which is not documented by POSIX, but can be useful and is trivial to implement. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
* asm-generic: fix unistd_32.h generation formatMichal Simek2020-04-071-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Generated files are also checked by sparse that's why add newline to remove sparse (C=1) warning. The issue was found on Microblaze and reported like this: ./arch/microblaze/include/generated/uapi/asm/unistd_32.h:438:45: warning: no newline at end of file Mips and PowerPC have it already but let's align with style used by m68k. Signed-off-by: Michal Simek <michal.simek@xilinx.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Stefan Asserhall <stefan.asserhall@xilinx.com> Acked-by: Max Filippov <jcmvbkbc@gmail.com> (xtensa) Cc: Arnd Bergmann <arnd@arndb.de> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Chris Zankel <chris@zankel.net> Cc: David S. Miller <davem@davemloft.net> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Helge Deller <deller@gmx.de> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Burton <paulburton@kernel.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Rich Felker <dalias@libc.org> Cc: Richard Henderson <rth@twiddle.net> Cc: Tony Luck <tony.luck@intel.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Link: http://lkml.kernel.org/r/4d32ab4e1fb2edb691d2e1687e8fb303c09fd023.1581504803.git.michal.simek@xilinx.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Merge tag 'threads-v5.6' of ↵Linus Torvalds2020-01-291-0/+1
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux Pull thread management updates from Christian Brauner: "Sargun Dhillon over the last cycle has worked on the pidfd_getfd() syscall. This syscall allows for the retrieval of file descriptors of a process based on its pidfd. A task needs to have ptrace_may_access() permissions with PTRACE_MODE_ATTACH_REALCREDS (suggested by Oleg and Andy) on the target. One of the main use-cases is in combination with seccomp's user notification feature. As a reminder, seccomp's user notification feature was made available in v5.0. It allows a task to retrieve a file descriptor for its seccomp filter. The file descriptor is usually handed of to a more privileged supervising process. The supervisor can then listen for syscall events caught by the seccomp filter of the supervisee and perform actions in lieu of the supervisee, usually emulating syscalls. pidfd_getfd() is needed to expand its uses. There are currently two major users that wait on pidfd_getfd() and one future user: - Netflix, Sargun said, is working on a service mesh where users should be able to connect to a dns-based VIP. When a user connects to e.g. 1.2.3.4:80 that runs e.g. service "foo" they will be redirected to an envoy process. This service mesh uses seccomp user notifications and pidfd to intercept all connect calls and instead of connecting them to 1.2.3.4:80 connects them to e.g. 127.0.0.1:8080. - LXD uses the seccomp notifier heavily to intercept and emulate mknod() and mount() syscalls for unprivileged containers/processes. With pidfd_getfd() more uses-cases e.g. bridging socket connections will be possible. - The patchset has also seen some interest from the browser corner. Right now, Firefox is using a SECCOMP_RET_TRAP sandbox managed by a broker process. In the future glibc will start blocking all signals during dlopen() rendering this type of sandbox impossible. Hence, in the future Firefox will switch to a seccomp-user-nofication based sandbox which also makes use of file descriptor retrieval. The thread for this can be found at https://sourceware.org/ml/libc-alpha/2019-12/msg00079.html With pidfd_getfd() it is e.g. possible to bridge socket connections for the supervisee (binding to a privileged port) and taking actions on file descriptors on behalf of the supervisee in general. Sargun's first version was using an ioctl on pidfds but various people pushed for it to be a proper syscall which he duely implemented as well over various review cycles. Selftests are of course included. I've also added instructions how to deal with merge conflicts below. There's also a small fix coming from the kernel mentee project to correctly annotate struct sighand_struct with __rcu to fix various sparse warnings. We've received a few more such fixes and even though they are mostly trivial I've decided to postpone them until after -rc1 since they came in rather late and I don't want to risk introducing build warnings. Finally, there's a new prctl() command PR_{G,S}ET_IO_FLUSHER which is needed to avoid allocation recursions triggerable by storage drivers that have userspace parts that run in the IO path (e.g. dm-multipath, iscsi, etc). These allocation recursions deadlock the device. The new prctl() allows such privileged userspace components to avoid allocation recursions by setting the PF_MEMALLOC_NOIO and PF_LESS_THROTTLE flags. The patch carries the necessary acks from the relevant maintainers and is routed here as part of prctl() thread-management." * tag 'threads-v5.6' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux: prctl: PR_{G,S}ET_IO_FLUSHER to support controlling memory reclaim sched.h: Annotate sighand_struct with __rcu test: Add test for pidfd getfd arch: wire up pidfd_getfd syscall pid: Implement pidfd_getfd syscall vfs, fdtable: Add fget_task helper
| * arch: wire up pidfd_getfd syscallSargun Dhillon2020-01-131-0/+1
| | | | | | | | | | | | | | | | | | | | This wires up the pidfd_getfd syscall for all architectures. Signed-off-by: Sargun Dhillon <sargun@sargun.me> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Link: https://lore.kernel.org/r/20200107175927.4558-4-sargun@sargun.me Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
* | open: introduce openat2(2) syscallAleksa Sarai2020-01-181-0/+1
|/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | /* Background. */ For a very long time, extending openat(2) with new features has been incredibly frustrating. This stems from the fact that openat(2) is possibly the most famous counter-example to the mantra "don't silently accept garbage from userspace" -- it doesn't check whether unknown flags are present[1]. This means that (generally) the addition of new flags to openat(2) has been fraught with backwards-compatibility issues (O_TMPFILE has to be defined as __O_TMPFILE|O_DIRECTORY|[O_RDWR or O_WRONLY] to ensure old kernels gave errors, since it's insecure to silently ignore the flag[2]). All new security-related flags therefore have a tough road to being added to openat(2). Userspace also has a hard time figuring out whether a particular flag is supported on a particular kernel. While it is now possible with contemporary kernels (thanks to [3]), older kernels will expose unknown flag bits through fcntl(F_GETFL). Giving a clear -EINVAL during openat(2) time matches modern syscall designs and is far more fool-proof. In addition, the newly-added path resolution restriction LOOKUP flags (which we would like to expose to user-space) don't feel related to the pre-existing O_* flag set -- they affect all components of path lookup. We'd therefore like to add a new flag argument. Adding a new syscall allows us to finally fix the flag-ignoring problem, and we can make it extensible enough so that we will hopefully never need an openat3(2). /* Syscall Prototype. */ /* * open_how is an extensible structure (similar in interface to * clone3(2) or sched_setattr(2)). The size parameter must be set to * sizeof(struct open_how), to allow for future extensions. All future * extensions will be appended to open_how, with their zero value * acting as a no-op default. */ struct open_how { /* ... */ }; int openat2(int dfd, const char *pathname, struct open_how *how, size_t size); /* Description. */ The initial version of 'struct open_how' contains the following fields: flags Used to specify openat(2)-style flags. However, any unknown flag bits or otherwise incorrect flag combinations (like O_PATH|O_RDWR) will result in -EINVAL. In addition, this field is 64-bits wide to allow for more O_ flags than currently permitted with openat(2). mode The file mode for O_CREAT or O_TMPFILE. Must be set to zero if flags does not contain O_CREAT or O_TMPFILE. resolve Restrict path resolution (in contrast to O_* flags they affect all path components). The current set of flags are as follows (at the moment, all of the RESOLVE_ flags are implemented as just passing the corresponding LOOKUP_ flag). RESOLVE_NO_XDEV => LOOKUP_NO_XDEV RESOLVE_NO_SYMLINKS => LOOKUP_NO_SYMLINKS RESOLVE_NO_MAGICLINKS => LOOKUP_NO_MAGICLINKS RESOLVE_BENEATH => LOOKUP_BENEATH RESOLVE_IN_ROOT => LOOKUP_IN_ROOT open_how does not contain an embedded size field, because it is of little benefit (userspace can figure out the kernel open_how size at runtime fairly easily without it). It also only contains u64s (even though ->mode arguably should be a u16) to avoid having padding fields which are never used in the future. Note that as a result of the new how->flags handling, O_PATH|O_TMPFILE is no longer permitted for openat(2). As far as I can tell, this has always been a bug and appears to not be used by userspace (and I've not seen any problems on my machines by disallowing it). If it turns out this breaks something, we can special-case it and only permit it for openat(2) but not openat2(2). After input from Florian Weimer, the new open_how and flag definitions are inside a separate header from uapi/linux/fcntl.h, to avoid problems that glibc has with importing that header. /* Testing. */ In a follow-up patch there are over 200 selftests which ensure that this syscall has the correct semantics and will correctly handle several attack scenarios. In addition, I've written a userspace library[4] which provides convenient wrappers around openat2(RESOLVE_IN_ROOT) (this is necessary because no other syscalls support RESOLVE_IN_ROOT, and thus lots of care must be taken when using RESOLVE_IN_ROOT'd file descriptors with other syscalls). During the development of this patch, I've run numerous verification tests using libpathrs (showing that the API is reasonably usable by userspace). /* Future Work. */ Additional RESOLVE_ flags have been suggested during the review period. These can be easily implemented separately (such as blocking auto-mount during resolution). Furthermore, there are some other proposed changes to the openat(2) interface (the most obvious example is magic-link hardening[5]) which would be a good opportunity to add a way for userspace to restrict how O_PATH file descriptors can be re-opened. Another possible avenue of future work would be some kind of CHECK_FIELDS[6] flag which causes the kernel to indicate to userspace which openat2(2) flags and fields are supported by the current kernel (to avoid userspace having to go through several guesses to figure it out). [1]: https://lwn.net/Articles/588444/ [2]: https://lore.kernel.org/lkml/CA+55aFyyxJL1LyXZeBsf2ypriraj5ut1XkNDsunRBqgVjZU_6Q@mail.gmail.com [3]: commit 629e014bb834 ("fs: completely ignore unknown open flags") [4]: https://sourceware.org/bugzilla/show_bug.cgi?id=17523 [5]: https://lore.kernel.org/lkml/20190930183316.10190-2-cyphar@cyphar.com/ [6]: https://youtu.be/ggD-eb3yPVs Suggested-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: Aleksa Sarai <cyphar@cyphar.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* arch: mark syscall number 435 reserved for clone3Christian Brauner2019-07-151-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | A while ago Arnd made it possible to give new system calls the same syscall number on all architectures (except alpha). To not break this nice new feature let's mark 435 for clone3 as reserved on all architectures that do not yet implement it. Even if an architecture does not plan to implement it this ensures that new system calls coming after clone3 will have the same number on all architectures. Signed-off-by: Christian Brauner <christian@brauner.io> Cc: linux-arch@vger.kernel.org Cc: linux-alpha@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: linux-ia64@vger.kernel.org Cc: linux-m68k@lists.linux-m68k.org Cc: linux-mips@vger.kernel.org Cc: linux-parisc@vger.kernel.org Cc: linuxppc-dev@lists.ozlabs.org Cc: linux-s390@vger.kernel.org Cc: linux-sh@vger.kernel.org Cc: sparclinux@vger.kernel.org Link: https://lore.kernel.org/r/20190714192205.27190-2-christian@brauner.io Reviewed-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Christian Brauner <christian@brauner.io>
* arch: wire-up pidfd_open()Christian Brauner2019-06-281-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This wires up the pidfd_open() syscall into all arches at once. Signed-off-by: Christian Brauner <christian@brauner.io> Reviewed-by: David Howells <dhowells@redhat.com> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Kees Cook <keescook@chromium.org> Cc: Joel Fernandes (Google) <joel@joelfernandes.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Jann Horn <jannh@google.com> Cc: Andy Lutomirsky <luto@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Aleksa Sarai <cyphar@cyphar.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-api@vger.kernel.org Cc: linux-alpha@vger.kernel.org Cc: linux-arm-kernel@lists.infradead.org Cc: linux-ia64@vger.kernel.org Cc: linux-m68k@lists.linux-m68k.org Cc: linux-mips@vger.kernel.org Cc: linux-parisc@vger.kernel.org Cc: linuxppc-dev@lists.ozlabs.org Cc: linux-s390@vger.kernel.org Cc: linux-sh@vger.kernel.org Cc: sparclinux@vger.kernel.org Cc: linux-xtensa@linux-xtensa.org Cc: linux-arch@vger.kernel.org Cc: x86@kernel.org
* uapi: Wire up the mount API syscalls on non-x86 arches [ver #2]David Howells2019-05-161-0/+6
| | | | | | | | | Wire up the mount API syscalls on non-x86 arches. Reported-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* arch: add pidfd and io_uring syscalls everywhereArnd Bergmann2019-04-151-0/+4
| | | | | | | | | | | | | Add the io_uring and pidfd_send_signal system calls to all architectures. These system calls are designed to handle both native and compat tasks, so all entries are the same across architectures, only arm-compat and the generic tale still use an old format. Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc) Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com> (s390) Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
* y2038: add 64-bit time_t syscalls to all 32-bit architecturesArnd Bergmann2019-02-071-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | This adds 21 new system calls on each ABI that has 32-bit time_t today. All of these have the exact same semantics as their existing counterparts, and the new ones all have macro names that end in 'time64' for clarification. This gets us to the point of being able to safely use a C library that has 64-bit time_t in user space. There are still a couple of loose ends to tie up in various areas of the code, but this is the big one, and should be entirely uncontroversial at this point. In particular, there are four system calls (getitimer, setitimer, waitid, and getrusage) that don't have a 64-bit counterpart yet, but these can all be safely implemented in the C library by wrapping around the existing system calls because the 32-bit time_t they pass only counts elapsed time, not time since the epoch. They will be dealt with later. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Acked-by: Catalin Marinas <catalin.marinas@arm.com>
* arch: add pkey and rseq syscall numbers everywhereArnd Bergmann2019-01-251-0/+4
| | | | | | | | | | | | | Most architectures define system call numbers for the rseq and pkey system calls, even when they don't support the features, and perhaps never will. Only a few architectures are missing these, so just define them anyway for consistency. If we decide to add them later to one of these, the system call numbers won't get out of sync then. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
* ia64: assign syscall numbers for perf and seccompArnd Bergmann2019-01-251-0/+2
| | | | | | | | | | | Most architectures have assigned numbers for both seccomp and perf_event_open, even when they do not implement either. ia64 is an exception here, so for consistency lets add numbers for both of them. Unless CONFIG_PERF_EVENTS and CONFIG_SECCOMP are implemented, the system calls just return -ENOSYS. Signed-off-by: Arnd Bergmann <arnd@arndb.de>
* ia64: add statx and io_pgetevents syscallsArnd Bergmann2019-01-251-0/+2
| | | | | | | All architectures should implement these two, so assign numbers and hook them up on ia64. Signed-off-by: Arnd Bergmann <arnd@arndb.de>
* ia64: add __NR_umount2 definitionArnd Bergmann2019-01-251-1/+1
| | | | | | | | | | | | Other architectures commonly use __NR_umount2 for sys_umount, only ia64 and alpha use __NR_umount here. In order to synchronize the generated tables, use umount2 like everyone else, and add back the old name from asm/unistd.h for compatibility. The __IGNORE_* lines are now all obsolete and can be removed as a side-effect. Signed-off-by: Arnd Bergmann <arnd@arndb.de>
* ia64: add system call table generation supportFiroz Khan2018-11-134-0/+445
The system call tables are in different format in all architecture and it will be difficult to manually add, modify or delete the syscall table entries in the res- pective files. To make it easy by keeping a script and which will generate the uapi header and syscall table file. This change will also help to unify the implemen- tation across all architectures. The system call table generation script is added in kernel/syscalls directory which contain the scripts to generate both uapi header file and system call table files. The syscall.tbl will be input for the scripts. syscall.tbl contains the list of available system calls along with system call number and corresponding entry point. Add a new system call in this architecture will be possible by adding new entry in the syscall.tbl file. Adding a new table entry consisting of: - System call number. - ABI. - System call name. - Entry point name. syscallhdr.sh and syscalltbl.sh will generate uapi header unistd_64.h and syscall_table.h files respectively. Both .sh files will parse the content syscall.tbl to generate the header and table files. unistd_64.h will be included by uapi/asm/unistd.h and syscall_table.h is included by kernel/entry.S - the real system call table. ARM, s390 and x86 architecuture does have similar support. I leverage their implementation to come up with a generic solution. Signed-off-by: Firoz Khan <firoz.khan@linaro.org> Signed-off-by: Tony Luck <tony.luck@intel.com>