aboutsummaryrefslogtreecommitdiffstats
Commit message (Collapse)AuthorAgeFilesLines
* Merge branch 'WIP.irq'HEADmasterIngo Molnar2019-06-3013-45/+211
|\
| * x86/irq: Seperate unused system vectors from spurious entry againWIP.irqThomas Gleixner2019-06-285-16/+76
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Quite some time ago the interrupt entry stubs for unused vectors in the system vector range got removed and directly mapped to the spurious interrupt vector entry point. Sounds reasonable, but it's subtly broken. The spurious interrupt vector entry point pushes vector number 0xFF on the stack which makes the whole logic in __smp_spurious_interrupt() pointless. As a consequence any spurious interrupt which comes from a vector != 0xFF is treated as a real spurious interrupt (vector 0xFF) and not acknowledged. That subsequently stalls all interrupt vectors of equal and lower priority, which brings the system to a grinding halt. This can happen because even on 64-bit the system vector space is not guaranteed to be fully populated. A full compile time handling of the unused vectors is not possible because quite some of them are conditonally populated at runtime. Bring the entry stubs back, which wastes 160 bytes if all stubs are unused, but gains the proper handling back. There is no point to selectively spare some of the stubs which are known at compile time as the required code in the IDT management would be way larger and convoluted. Do not route the spurious entries through common_interrupt and do_IRQ() as the original code did. Route it to smp_spurious_interrupt() which evaluates the vector number and acts accordingly now that the real vector numbers are handed in. Fixup the pr_warn so the actual spurious vector (0xff) is clearly distiguished from the other vectors and also note for the vectored case whether it was pending in the ISR or not. "Spurious APIC interrupt (vector 0xFF) on CPU#0, should never happen." "Spurious interrupt vector 0xed on CPU#1. Acked." "Spurious interrupt vector 0xee on CPU#1. Not pending!." Fixes: 2414e021ac8d ("x86: Avoid building unused IRQ entry stubs") Reported-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Jan Beulich <jbeulich@suse.com>
| * x86/irq: Handle spurious interrupt after shutdown gracefullyThomas Gleixner2019-06-283-4/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since the rework of the vector management, warnings about spurious interrupts have been reported. Robert provided some more information and did an initial analysis. The following situation leads to these warnings: CPU 0 CPU 1 IO_APIC interrupt is raised sent to CPU1 Unable to handle immediately (interrupts off, deep idle delay) mask() ... free() shutdown() synchronize_irq() clear_vector() do_IRQ() -> vector is clear Before the rework the vector entries of legacy interrupts were statically assigned and occupied precious vector space while most of them were unused. Due to that the above situation was handled silently because the vector was handled and the core handler of the assigned interrupt descriptor noticed that it is shut down and returned. While this has been usually observed with legacy interrupts, this situation is not limited to them. Any other interrupt source, e.g. MSI, can cause the same issue. After adding proper synchronization for level triggered interrupts, this can only happen for edge triggered interrupts where the IO-APIC obviously cannot provide information about interrupts in flight. While the spurious warning is actually harmless in this case it worries users and driver developers. Handle it gracefully by marking the vector entry as VECTOR_SHUTDOWN instead of VECTOR_UNUSED when the vector is freed up. If that above late handling happens the spurious detector will not complain and switch the entry to VECTOR_UNUSED. Any subsequent spurious interrupt on that line will trigger the spurious warning as before. Fixes: 464d12309e1b ("x86/vector: Switch IOAPIC to global reservation mode") Reported-by: Robert Hodaszi <Robert.Hodaszi@digi.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>- Tested-by: Robert Hodaszi <Robert.Hodaszi@digi.com>
| * x86/ioapic: Implement irq_get_irqchip_state() callbackThomas Gleixner2019-06-281-0/+46
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When an interrupt is shut down in free_irq() there might be an inflight interrupt pending in the IO-APIC remote IRR which is not yet serviced. That means the interrupt has been sent to the target CPUs local APIC, but the target CPU is in a state which delays the servicing. So free_irq() would proceed to free resources and to clear the vector because synchronize_hardirq() does not see an interrupt handler in progress. That can trigger a spurious interrupt warning, which is harmless and just confuses users, but it also can leave the remote IRR in a stale state because once the handler is invoked the interrupt resources might be freed already and therefore acknowledgement is not possible anymore. Implement the irq_get_irqchip_state() callback for the IO-APIC irq chip. The callback is invoked from free_irq() via __synchronize_hardirq(). Check the remote IRR bit of the interrupt and return 'in flight' if it is set and the interrupt is configured in level mode. For edge mode the remote IRR has no meaning. As this is only meaningful for level triggered interrupts this won't cure the potential spurious interrupt warning for edge triggered interrupts, but the edge trigger case does not result in stale hardware state. This has to be addressed at the vector/interrupt entry level seperately. Fixes: 464d12309e1b ("x86/vector: Switch IOAPIC to global reservation mode") Reported-by: Robert Hodaszi <Robert.Hodaszi@digi.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
| * genirq: Add optional hardware synchronization for shutdownThomas Gleixner2019-06-282-19/+60
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | free_irq() ensures that no hardware interrupt handler is executing on a different CPU before actually releasing resources and deactivating the interrupt completely in a domain hierarchy. But that does not catch the case where the interrupt is on flight at the hardware level but not yet serviced by the target CPU. That creates an interesing race condition: CPU 0 CPU 1 IRQ CHIP interrupt is raised sent to CPU1 Unable to handle immediately (interrupts off, deep idle delay) mask() ... free() shutdown() synchronize_irq() release_resources() do_IRQ() -> resources are not available That might be harmless and just trigger a spurious interrupt warning, but some interrupt chips might get into a wedged state. Utilize the existing irq_get_irqchip_state() callback for the synchronization in free_irq(). synchronize_hardirq() is not using this mechanism as it might actually deadlock unter certain conditions, e.g. when called with interrupts disabled and the target CPU is the one on which the synchronization is invoked. synchronize_irq() uses it because that function cannot be called from non preemtible contexts as it might sleep. Fixes: 464d12309e1b ("x86/vector: Switch IOAPIC to global reservation mode") Reported-by: Robert Hodaszi <Robert.Hodaszi@digi.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
| * genirq: Fix misleading synchronize_irq() documentationThomas Gleixner2019-06-281-1/+2
| | | | | | | | | | | | | | The function might sleep, so it cannot be called from interrupt context. Not even with care. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
| * genirq: Delay deactivation in free_irq()Thomas Gleixner2019-06-285-5/+22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When interrupts are shutdown, they are immediately deactivated in the irqdomain hierarchy. While this looks obviously correct there is a subtle issue: There might be an interrupt in flight when free_irq() is invoking the shutdown. This is properly handled at the irq descriptor / primary handler level, but the deactivation might completely disable resources which are required to acknowledge the interrupt. Split the shutdown code and deactivate the interrupt after synchronization in free_irq(). Fixup all other usage sites where this is not an issue to invoke the combined shutdown_and_deactivate() function instead. This still might be an issue if the interrupt in flight servicing is delayed on a remote CPU beyond the invocation of synchronize_irq(), but that cannot be handled at that level and needs to be handled in the synchronize_irq() context. Fixes: f8264e34965a ("irqdomain: Introduce new interfaces to support hierarchy irqdomains") Reported-by: Robert Hodaszi <Robert.Hodaszi@digi.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
* | Merge branch 'linus'Ingo Molnar2019-06-3047-161/+486
|\ \
| * | Linux 5.2-rc7v5.2-rc7Linus Torvalds2019-06-301-2/+2
| | |
| * | Merge tag 'powerpc-5.2-7' of ↵Linus Torvalds2019-06-301-1/+1
| |\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc fix from Michael Ellerman: "One fix for a regression in my commit adding KUAP (Kernel User Access Prevention) on Radix, which incorrectly touched the AMR in the early machine check handler. Thanks to Nicholas Piggin" * tag 'powerpc-5.2-7' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: powerpc/64s/exception: Fix machine check early corrupting AMR
| | * | powerpc/64s/exception: Fix machine check early corrupting AMRNicholas Piggin2019-06-251-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The early machine check runs in real mode, so locking is unnecessary. Worse, the windup does not restore AMR, so this can result in a false KUAP fault after a recoverable machine check hits inside a user copy operation. Fix this similarly to HMI by just avoiding the kuap lock in the early machine check handler (it will be set by the late handler that runs in virtual mode if that runs). If the virtual mode handler is reached, it will lock and restore the AMR. Fixes: 890274c2dc4c0 ("powerpc/64s: Implement KUAP for Radix MMU") Cc: Russell Currey <ruscur@russell.cc> Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
| * | | Merge branch 'smp-urgent-for-linus' of ↵Linus Torvalds2019-06-301-0/+6
| |\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull SMP fixes from Thomas Gleixner: "Two small changes for the cpu hotplug code: - Prevent out of bounds access which actually might crash the machine caused by a missing bounds check in the fail injection code - Warn about unsupported migitation mode command line arguments to make people aware that they typoed the paramater. Not necessarily a fix but quite some people tripped over that" * 'smp-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: cpu/hotplug: Fix out-of-bounds read when setting fail state cpu/speculation: Warn on unsupported mitigations= parameter
| * \ \ \ Merge branch 'x86-urgent-for-linus' of ↵Linus Torvalds2019-06-298-55/+89
| |\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Ingo Molnar: "Misc fixes all over the place: - might_sleep() atomicity fix in the microcode loader - resctrl boundary condition fix - APIC arithmethics bug fix for frequencies >= 4.2 GHz - three 5-level paging crash fixes - two speculation fixes - a perf/stacktrace fix" * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/unwind/orc: Fall back to using frame pointers for generated code perf/x86: Always store regs->ip in perf_callchain_kernel() x86/speculation: Allow guests to use SSBD even if host does not x86/mm: Handle physical-virtual alignment mismatch in phys_p4d_init() x86/boot/64: Add missing fixup_pointer() for next_early_pgt access x86/boot/64: Fix crash if kernel image crosses page table boundary x86/apic: Fix integer overflow on 10 bit left shift of cpu_khz x86/resctrl: Prevent possible overrun during bitmap operations x86/microcode: Fix the microcode load on CPU hotplug for real
| * \ \ \ \ Merge branch 'perf-urgent-for-linus' of ↵Linus Torvalds2019-06-2911-38/+48
| |\ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fixes from Ingo Molnar: "Various fixes, most of them related to bugs perf fuzzing found in the x86 code" * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/x86/regs: Use PERF_REG_EXTENDED_MASK perf/x86: Remove pmu->pebs_no_xmm_regs perf/x86: Clean up PEBS_XMM_REGS perf/x86/regs: Check reserved bits perf/x86: Disable extended registers for non-supported PMUs perf/ioctl: Add check for the sample_period value perf/core: Fix perf_sample_regs_user() mm check
| * \ \ \ \ \ Merge branch 'irq-urgent-for-linus' of ↵Linus Torvalds2019-06-295-17/+71
| |\ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq fixes from Ingo Molnar: "Diverse irqchip driver fixes" * 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: irqchip/gic-v3-its: Fix command queue pointer comparison bug irqchip/mips-gic: Use the correct local interrupt map registers irqchip/ti-sci-inta: Fix kernel crash if irq_create_fwspec_mapping fail irqchip/irq-csky-mpintc: Support auto irq deliver to all cpus
| * \ \ \ \ \ \ Merge branch 'efi-urgent-for-linus' of ↵Linus Torvalds2019-06-294-13/+18
| |\ \ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull EFI fixes from Ingo Molnar: "Four fixes: - fix a kexec crash on arm64 - fix a reboot crash on some Android platforms - future-proof the code for upcoming ACPI 6.2 changes - fix a build warning on x86" * 'efi-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: efibc: Replace variable set function in notifier call x86/efi: fix a -Wtype-limits compilation warning efi/bgrt: Drop BGRT status field reserved bits check efi/memreserve: deal with memreserve entries in unmapped memory
| * \ \ \ \ \ \ \ Merge tag 'pm-5.2-rc7' of ↵Linus Torvalds2019-06-293-6/+31
| |\ \ \ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull power management fix from Rafael Wysocki: "Avoid skipping bus-level PCI power management during system resume for PCIe ports left in D0 during the preceding suspend transition on platforms where the power states of those ports can change out of the PCI layer's control" * tag 'pm-5.2-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: PCI: PM: Avoid skipping bus-level PM on platforms without ACPI
| | * | | | | | | | PCI: PM: Avoid skipping bus-level PM on platforms without ACPIRafael J. Wysocki2019-06-263-6/+31
| | | |_|_|_|_|_|/ | | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There are platforms that do not call pm_set_suspend_via_firmware(), so pm_suspend_via_firmware() returns 'false' on them, but the power states of PCI devices (PCIe ports in particular) are changed as a result of powering down core platform components during system-wide suspend. Thus the pm_suspend_via_firmware() checks in pci_pm_suspend_noirq() and pci_pm_resume_noirq() introduced by commit 3e26c5feed2a ("PCI: PM: Skip devices in D0 for suspend-to- idle") are not sufficient to determine that devices left in D0 during suspend will remain in D0 during resume and so the bus-level power management can be skipped for them. For this reason, introduce a new global suspend flag, PM_SUSPEND_FLAG_NO_PLATFORM, set it for suspend-to-idle only and replace the pm_suspend_via_firmware() checks mentioned above with checks against this flag. Fixes: 3e26c5feed2a ("PCI: PM: Skip devices in D0 for suspend-to-idle") Reported-by: Jon Hunter <jonathanh@nvidia.com> Tested-by: Jon Hunter <jonathanh@nvidia.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Tested-by: Mika Westerberg <mika.westerberg@linux.intel.com> Reviewed-by: Mika Westerberg <mika.westerberg@linux.intel.com>
| * | | | | | | | Merge tag 'xarray-5.2-rc6' of git://git.infradead.org/users/willy/linux-daxLinus Torvalds2019-06-296-5/+108
| |\ \ \ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull XArray fixes from Matthew Wilcox: - Account XArray nodes for the page cache to the appropriate cgroup (Johannes Weiner) - Fix idr_get_next() when called under the RCU lock (Matthew Wilcox) - Add a test for xa_insert() (Matthew Wilcox) * tag 'xarray-5.2-rc6' of git://git.infradead.org/users/willy/linux-dax: XArray tests: Add check_insert idr: Fix idr_get_next race with idr_remove mm: fix page cache convergence regression
| | * | | | | | | | XArray tests: Add check_insertMatthew Wilcox2019-06-021-0/+38
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A simple test which just checks that inserting an entry into an empty array succeeds. Try various different interesting indices. Signed-off-by: Matthew Wilcox <willy@infradead.org>
| | * | | | | | | | idr: Fix idr_get_next race with idr_removeMatthew Wilcox (Oracle)2019-06-022-2/+58
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If the entry is deleted from the IDR between the call to radix_tree_iter_find() and rcu_dereference_raw(), idr_get_next() will return NULL, which will end the iteration prematurely. We should instead continue to the next entry in the IDR. This only happens if the iteration is protected by the RCU lock. Most IDR users use a spinlock or semaphore to exclude simultaneous modifications. It was noticed once the PID allocator was converted to use the IDR, as it uses the RCU lock, but there may be other users elsewhere in the kernel. We can't use the normal pattern of calling radix_tree_deref_retry() (which catches both a retry entry in a leaf node and a node entry in the root) as the IDR supports storing entries which are unaligned, which will trigger an infinite loop if they are encountered. Instead, we have to explicitly check whether the entry is a retry entry. Fixes: 0a835c4f090a ("Reimplement IDR and IDA using the radix tree") Reported-by: Brendan Gregg <bgregg@netflix.com> Tested-by: Brendan Gregg <bgregg@netflix.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
| | * | | | | | | | mm: fix page cache convergence regressionJohannes Weiner2019-05-313-3/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since a28334862993 ("page cache: Finish XArray conversion"), on most major Linux distributions, the page cache doesn't correctly transition when the hot data set is changing, and leaves the new pages thrashing indefinitely instead of kicking out the cold ones. On a freshly booted, freshly ssh'd into virtual machine with 1G RAM running stock Arch Linux: [root@ham ~]# ./reclaimtest.sh + dd of=workingset-a bs=1M count=0 seek=600 + cat workingset-a + cat workingset-a + cat workingset-a + cat workingset-a + cat workingset-a + cat workingset-a + cat workingset-a + cat workingset-a + ./mincore workingset-a 153600/153600 workingset-a + dd of=workingset-b bs=1M count=0 seek=600 + cat workingset-b + cat workingset-b + cat workingset-b + cat workingset-b + ./mincore workingset-a workingset-b 104029/153600 workingset-a 120086/153600 workingset-b + cat workingset-b + cat workingset-b + cat workingset-b + cat workingset-b + ./mincore workingset-a workingset-b 104029/153600 workingset-a 120268/153600 workingset-b workingset-b is a 600M file on a 1G host that is otherwise entirely idle. No matter how often it's being accessed, it won't get cached. While investigating, I noticed that the non-resident information gets aggressively reclaimed - /proc/vmstat::workingset_nodereclaim. This is a problem because a workingset transition like this relies on the non-resident information tracked in the page cache tree of evicted file ranges: when the cache faults are refaults of recently evicted cache, we challenge the existing active set, and that allows a new workingset to establish itself. Tracing the shrinker that maintains this memory revealed that all page cache tree nodes were allocated to the root cgroup. This is a problem, because 1) the shrinker sizes the amount of non-resident information it keeps to the size of the cgroup's other memory and 2) on most major Linux distributions, only kernel threads live in the root cgroup and everything else gets put into services or session groups: [root@ham ~]# cat /proc/self/cgroup 0::/user.slice/user-0.slice/session-c1.scope As a result, we basically maintain no non-resident information for the workloads running on the system, thus breaking the caching algorithm. Looking through the code, I found the culprit in the above-mentioned patch: when switching from the radix tree to xarray, it dropped the __GFP_ACCOUNT flag from the tree node allocations - the flag that makes sure the allocated memory gets charged to and tracked by the cgroup of the calling process - in this case, the one doing the fault. To fix this, allow xarray users to specify per-tree flag that makes xarray allocate nodes using __GFP_ACCOUNT. Then restore the page cache tree annotation to request such cgroup tracking for the cache nodes. With this patch applied, the page cache correctly converges on new workingsets again after just a few iterations: [root@ham ~]# ./reclaimtest.sh + dd of=workingset-a bs=1M count=0 seek=600 + cat workingset-a + cat workingset-a + cat workingset-a + cat workingset-a + cat workingset-a + cat workingset-a + cat workingset-a + cat workingset-a + ./mincore workingset-a 153600/153600 workingset-a + dd of=workingset-b bs=1M count=0 seek=600 + cat workingset-b + ./mincore workingset-a workingset-b 124607/153600 workingset-a 87876/153600 workingset-b + cat workingset-b + ./mincore workingset-a workingset-b 81313/153600 workingset-a 133321/153600 workingset-b + cat workingset-b + ./mincore workingset-a workingset-b 63036/153600 workingset-a 153600/153600 workingset-b Cc: stable@vger.kernel.org # 4.20+ Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
| * | | | | | | | | Merge branch 'akpm' (patches from Andrew)Linus Torvalds2019-06-2920-80/+97
| |\ \ \ \ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Merge misc fixes from Andrew Morton: "15 fixes" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: linux/kernel.h: fix overflow for DIV_ROUND_UP_ULL mm, swap: fix THP swap out fork,memcg: alloc_thread_stack_node needs to set tsk->stack MAINTAINERS: add CLANG/LLVM BUILD SUPPORT info mm/vmalloc.c: avoid bogus -Wmaybe-uninitialized warning mm/page_idle.c: fix oops because end_pfn is larger than max_pfn initramfs: fix populate_initrd_image() section mismatch mm/oom_kill.c: fix uninitialized oc->constraint mm: hugetlb: soft-offline: dissolve_free_huge_page() return zero on !PageHuge mm: soft-offline: return -EBUSY if set_hwpoison_free_buddy_page() fails signal: remove the wrong signal_pending() check in restore_user_sigmask() fs/binfmt_flat.c: make load_flat_shared_library() work mm/mempolicy.c: fix an incorrect rebind node in mpol_rebind_nodemask fs/proc/array.c: allow reporting eip/esp for all coredumping threads mm/dev_pfn: exclude MEMORY_DEVICE_PRIVATE while computing virtual address
| | * | | | | | | | | linux/kernel.h: fix overflow for DIV_ROUND_UP_ULLVinod Koul2019-06-291-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | DIV_ROUND_UP_ULL adds the two arguments and then invokes DIV_ROUND_DOWN_ULL. But on a 32bit system the addition of two 32 bit values can overflow. DIV_ROUND_DOWN_ULL does it correctly and stashes the addition into a unsigned long long so cast the result to unsigned long long here to avoid the overflow condition. [akpm@linux-foundation.org: DIV_ROUND_UP_ULL must be an rval] Link: http://lkml.kernel.org/r/20190625100518.30753-1-vkoul@kernel.org Signed-off-by: Vinod Koul <vkoul@kernel.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Bjorn Andersson <bjorn.andersson@linaro.org> Cc: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| | * | | | | | | | | mm, swap: fix THP swap outHuang Ying2019-06-291-5/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 0-Day test system reported some OOM regressions for several THP (Transparent Huge Page) swap test cases. These regressions are bisected to 6861428921b5 ("block: always define BIO_MAX_PAGES as 256"). In the commit, BIO_MAX_PAGES is set to 256 even when THP swap is enabled. So the bio_alloc(gfp_flags, 512) in get_swap_bio() may fail when swapping out THP. That causes the OOM. As in the patch description of 6861428921b5 ("block: always define BIO_MAX_PAGES as 256"), THP swap should use multi-page bvec to write THP to swap space. So the issue is fixed via doing that in get_swap_bio(). BTW: I remember I have checked the THP swap code when 6861428921b5 ("block: always define BIO_MAX_PAGES as 256") was merged, and thought the THP swap code needn't to be changed. But apparently, I was wrong. I should have done this at that time. Link: http://lkml.kernel.org/r/20190624075515.31040-1-ying.huang@intel.com Fixes: 6861428921b5 ("block: always define BIO_MAX_PAGES as 256") Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Hugh Dickins <hughd@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@redhat.com> Cc: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| | * | | | | | | | | fork,memcg: alloc_thread_stack_node needs to set tsk->stackAndrea Arcangeli2019-06-291-1/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 5eed6f1dff87 ("fork,memcg: fix crash in free_thread_stack on memcg charge fail") corrected two instances, but there was a third instance of this bug. Without setting tsk->stack, if memcg_charge_kernel_stack fails, it'll execute free_thread_stack() on a dangling pointer. Enterprise kernels are compiled with VMAP_STACK=y so this isn't critical, but custom VMAP_STACK=n builds should have some performance advantage, with the drawback of risking to fail fork because compaction didn't succeed. So as long as VMAP_STACK=n is a supported option it's worth fixing it upstream. Link: http://lkml.kernel.org/r/20190619011450.28048-1-aarcange@redhat.com Fixes: 9b6f7e163cd0 ("mm: rework memcg kernel stack accounting") Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Reviewed-by: Rik van Riel <riel@surriel.com> Acked-by: Roman Gushchin <guro@fb.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| | * | | | | | | | | MAINTAINERS: add CLANG/LLVM BUILD SUPPORT infoNick Desaulniers2019-06-291-0/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add keyword support so that our mailing list gets cc'ed for clang/llvm patches. We're pretty active on our mailing list so far as code review. There are numerous Googlers like myself that are paid to support building the Linux kernel with Clang and LLVM. Link: http://lkml.kernel.org/r/20190620001907.255803-1-ndesaulniers@google.com Signed-off-by: Nick Desaulniers <ndesaulniers@google.com> Reviewed-by: Nathan Chancellor <natechancellor@gmail.com> Cc: Joe Perches <joe@perches.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| | * | | | | | | | | mm/vmalloc.c: avoid bogus -Wmaybe-uninitialized warningArnd Bergmann2019-06-291-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | gcc gets confused in pcpu_get_vm_areas() because there are too many branches that affect whether 'lva' was initialized before it gets used: mm/vmalloc.c: In function 'pcpu_get_vm_areas': mm/vmalloc.c:991:4: error: 'lva' may be used uninitialized in this function [-Werror=maybe-uninitialized] insert_vmap_area_augment(lva, &va->rb_node, ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ &free_vmap_area_root, &free_vmap_area_list); ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ mm/vmalloc.c:916:20: note: 'lva' was declared here struct vmap_area *lva; ^~~ Add an intialization to NULL, and check whether this has changed before the first use. [akpm@linux-foundation.org: tweak comments] Link: http://lkml.kernel.org/r/20190618092650.2943749-1-arnd@arndb.de Fixes: 68ad4a330433 ("mm/vmalloc.c: keep track of free blocks for vmap allocation") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Joel Fernandes <joelaf@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| | * | | | | | | | | mm/page_idle.c: fix oops because end_pfn is larger than max_pfnColin Ian King2019-06-291-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently the calcuation of end_pfn can round up the pfn number to more than the actual maximum number of pfns, causing an Oops. Fix this by ensuring end_pfn is never more than max_pfn. This can be easily triggered when on systems where the end_pfn gets rounded up to more than max_pfn using the idle-page stress-ng stress test: sudo stress-ng --idle-page 0 BUG: unable to handle kernel paging request at 00000000000020d8 #PF error: [normal kernel read fault] PGD 0 P4D 0 Oops: 0000 [#1] SMP PTI CPU: 1 PID: 11039 Comm: stress-ng-idle- Not tainted 5.0.0-5-generic #6-Ubuntu Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014 RIP: 0010:page_idle_get_page+0xc8/0x1a0 Code: 0f b1 0a 75 7d 48 8b 03 48 89 c2 48 c1 e8 33 83 e0 07 48 c1 ea 36 48 8d 0c 40 4c 8d 24 88 49 c1 e4 07 4c 03 24 d5 00 89 c3 be <49> 8b 44 24 58 48 8d b8 80 a1 02 00 e8 07 d5 77 00 48 8b 53 08 48 RSP: 0018:ffffafd7c672fde8 EFLAGS: 00010202 RAX: 0000000000000005 RBX: ffffe36341fff700 RCX: 000000000000000f RDX: 0000000000000284 RSI: 0000000000000275 RDI: 0000000001fff700 RBP: ffffafd7c672fe00 R08: ffffa0bc34056410 R09: 0000000000000276 R10: ffffa0bc754e9b40 R11: ffffa0bc330f6400 R12: 0000000000002080 R13: ffffe36341fff700 R14: 0000000000080000 R15: ffffa0bc330f6400 FS: 00007f0ec1ea5740(0000) GS:ffffa0bc7db00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000000020d8 CR3: 0000000077d68000 CR4: 00000000000006e0 Call Trace: page_idle_bitmap_write+0x8c/0x140 sysfs_kf_bin_write+0x5c/0x70 kernfs_fop_write+0x12e/0x1b0 __vfs_write+0x1b/0x40 vfs_write+0xab/0x1b0 ksys_write+0x55/0xc0 __x64_sys_write+0x1a/0x20 do_syscall_64+0x5a/0x110 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Link: http://lkml.kernel.org/r/20190618124352.28307-1-colin.king@canonical.com Fixes: 33c3fc71c8cf ("mm: introduce idle page tracking") Signed-off-by: Colin Ian King <colin.king@canonical.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| | * | | | | | | | | initramfs: fix populate_initrd_image() section mismatchGeert Uytterhoeven2019-06-291-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | With gcc-4.6.3: WARNING: vmlinux.o(.text.unlikely+0x140): Section mismatch in reference from the function populate_initrd_image() to the variable .init.ramfs.info:__initramfs_size The function populate_initrd_image() references the variable __init __initramfs_size. This is often because populate_initrd_image lacks a __init annotation or the annotation of __initramfs_size is wrong. WARNING: vmlinux.o(.text.unlikely+0x14c): Section mismatch in reference from the function populate_initrd_image() to the function .init.text:unpack_to_rootfs() The function populate_initrd_image() references the function __init unpack_to_rootfs(). This is often because populate_initrd_image lacks a __init annotation or the annotation of unpack_to_rootfs is wrong. WARNING: vmlinux.o(.text.unlikely+0x198): Section mismatch in reference from the function populate_initrd_image() to the function .init.text:xwrite() The function populate_initrd_image() references the function __init xwrite(). This is often because populate_initrd_image lacks a __init annotation or the annotation of xwrite is wrong. Indeed, if the compiler decides not to inline populate_initrd_image(), a warning is generated. Fix this by adding the missing __init annotations. Link: http://lkml.kernel.org/r/20190617074340.12779-1-geert@linux-m68k.org Fixes: 7c184ecd262fe64f ("initramfs: factor out a helper to populate the initrd image") Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| | * | | | | | | | | mm/oom_kill.c: fix uninitialized oc->constraintYafang Shao2019-06-291-7/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In dump_oom_summary() oc->constraint is used to show oom_constraint_text, but it hasn't been set before. So the value of it is always the default value 0. We should inititialize it before. Bellow is the output when memcg oom occurs, before this patch: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null), cpuset=/,mems_allowed=0,oom_memcg=/foo,task_memcg=/foo,task=bash,pid=7997,uid=0 after this patch: oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null), cpuset=/,mems_allowed=0,oom_memcg=/foo,task_memcg=/foo,task=bash,pid=13681,uid=0 Link: http://lkml.kernel.org/r/1560522038-15879-1-git-send-email-laoar.shao@gmail.com Fixes: ef8444ea01d7 ("mm, oom: reorganize the oom report in dump_header") Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Wind Yu <yuzhoujian@didichuxing.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| | * | | | | | | | | mm: hugetlb: soft-offline: dissolve_free_huge_page() return zero on !PageHugeNaoya Horiguchi2019-06-292-13/+21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | madvise(MADV_SOFT_OFFLINE) often returns -EBUSY when calling soft offline for hugepages with overcommitting enabled. That was caused by the suboptimal code in current soft-offline code. See the following part: ret = migrate_pages(&pagelist, new_page, NULL, MPOL_MF_MOVE_ALL, MIGRATE_SYNC, MR_MEMORY_FAILURE); if (ret) { ... } else { /* * We set PG_hwpoison only when the migration source hugepage * was successfully dissolved, because otherwise hwpoisoned * hugepage remains on free hugepage list, then userspace will * find it as SIGBUS by allocation failure. That's not expected * in soft-offlining. */ ret = dissolve_free_huge_page(page); if (!ret) { if (set_hwpoison_free_buddy_page(page)) num_poisoned_pages_inc(); } } return ret; Here dissolve_free_huge_page() returns -EBUSY if the migration source page was freed into buddy in migrate_pages(), but even in that case we actually has a chance that set_hwpoison_free_buddy_page() succeeds. So that means current code gives up offlining too early now. dissolve_free_huge_page() checks that a given hugepage is suitable for dissolving, where we should return success for !PageHuge() case because the given hugepage is considered as already dissolved. This change also affects other callers of dissolve_free_huge_page(), which are cleaned up together. [n-horiguchi@ah.jp.nec.com: v3] Link: http://lkml.kernel.org/r/1560761476-4651-3-git-send-email-n-horiguchi@ah.jp.nec.comLink: http://lkml.kernel.org/r/1560154686-18497-3-git-send-email-n-horiguchi@ah.jp.nec.com Fixes: 6bc9b56433b76 ("mm: fix race on soft-offlining") Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Reported-by: Chen, Jerry T <jerry.t.chen@intel.com> Tested-by: Chen, Jerry T <jerry.t.chen@intel.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Michal Hocko <mhocko@kernel.org> Cc: Xishi Qiu <xishi.qiuxishi@alibaba-inc.com> Cc: "Chen, Jerry T" <jerry.t.chen@intel.com> Cc: "Zhuo, Qiuxu" <qiuxu.zhuo@intel.com> Cc: <stable@vger.kernel.org> [4.19+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| | * | | | | | | | | mm: soft-offline: return -EBUSY if set_hwpoison_free_buddy_page() failsNaoya Horiguchi2019-06-291-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The pass/fail of soft offline should be judged by checking whether the raw error page was finally contained or not (i.e. the result of set_hwpoison_free_buddy_page()), but current code do not work like that. It might lead us to misjudge the test result when set_hwpoison_free_buddy_page() fails. Without this fix, there are cases where madvise(MADV_SOFT_OFFLINE) may not offline the original page and will not return an error. Link: http://lkml.kernel.org/r/1560154686-18497-2-git-send-email-n-horiguchi@ah.jp.nec.com Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Fixes: 6bc9b56433b76 ("mm: fix race on soft-offlining") Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Michal Hocko <mhocko@kernel.org> Cc: Xishi Qiu <xishi.qiuxishi@alibaba-inc.com> Cc: "Chen, Jerry T" <jerry.t.chen@intel.com> Cc: "Zhuo, Qiuxu" <qiuxu.zhuo@intel.com> Cc: <stable@vger.kernel.org> [4.19+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| | * | | | | | | | | signal: remove the wrong signal_pending() check in restore_user_sigmask()Oleg Nesterov2019-06-296-28/+36
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This is the minimal fix for stable, I'll send cleanups later. Commit 854a6ed56839 ("signal: Add restore_user_sigmask()") introduced the visible change which breaks user-space: a signal temporary unblocked by set_user_sigmask() can be delivered even if the caller returns success or timeout. Change restore_user_sigmask() to accept the additional "interrupted" argument which should be used instead of signal_pending() check, and update the callers. Eric said: : For clarity. I don't think this is required by posix, or fundamentally to : remove the races in select. It is what linux has always done and we have : applications who care so I agree this fix is needed. : : Further in any case where the semantic change that this patch rolls back : (aka where allowing a signal to be delivered and the select like call to : complete) would be advantage we can do as well if not better by using : signalfd. : : Michael is there any chance we can get this guarantee of the linux : implementation of pselect and friends clearly documented. The guarantee : that if the system call completes successfully we are guaranteed that no : signal that is unblocked by using sigmask will be delivered? Link: http://lkml.kernel.org/r/20190604134117.GA29963@redhat.com Fixes: 854a6ed56839a40f6b5d02a2962f48841482eec4 ("signal: Add restore_user_sigmask()") Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reported-by: Eric Wong <e@80x24.org> Tested-by: Eric Wong <e@80x24.org> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Deepa Dinamani <deepa.kernel@gmail.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Jason Baron <jbaron@akamai.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: David Laight <David.Laight@ACULAB.COM> Cc: <stable@vger.kernel.org> [5.0+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| | * | | | | | | | | fs/binfmt_flat.c: make load_flat_shared_library() workJann Horn2019-06-291-16/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | load_flat_shared_library() is broken: It only calls load_flat_file() if prepare_binprm() returns zero, but prepare_binprm() returns the number of bytes read - so this only happens if the file is empty. Instead, call into load_flat_file() if the number of bytes read is non-negative. (Even if the number of bytes is zero - in that case, load_flat_file() will see nullbytes and return a nice -ENOEXEC.) In addition, remove the code related to bprm creds and stop using prepare_binprm() - this code is loading a library, not a main executable, and it only actually uses the members "buf", "file" and "filename" of the linux_binprm struct. Instead, call kernel_read() directly. Link: http://lkml.kernel.org/r/20190524201817.16509-1-jannh@google.com Fixes: 287980e49ffc ("remove lots of IS_ERR_VALUE abuses") Signed-off-by: Jann Horn <jannh@google.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Kees Cook <keescook@chromium.org> Cc: Nicolas Pitre <nicolas.pitre@linaro.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Greg Ungerer <gerg@linux-m68k.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| | * | | | | | | | | mm/mempolicy.c: fix an incorrect rebind node in mpol_rebind_nodemaskzhong jiang2019-06-291-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | mpol_rebind_nodemask() is called for MPOL_BIND and MPOL_INTERLEAVE mempoclicies when the tasks's cpuset's mems_allowed changes. For policies created without MPOL_F_STATIC_NODES or MPOL_F_RELATIVE_NODES, it works by remapping the policy's allowed nodes (stored in v.nodes) using the previous value of mems_allowed (stored in w.cpuset_mems_allowed) as the domain of map and the new mems_allowed (passed as nodes) as the range of the map (see the comment of bitmap_remap() for details). The result of remapping is stored back as policy's nodemask in v.nodes, and the new value of mems_allowed should be stored in w.cpuset_mems_allowed to facilitate the next rebind, if it happens. However, 213980c0f23b ("mm, mempolicy: simplify rebinding mempolicies when updating cpusets") introduced a bug where the result of remapping is stored in w.cpuset_mems_allowed instead. Thus, a mempolicy's allowed nodes can evolve in an unexpected way after a series of rebinding due to cpuset mems_allowed changes, possibly binding to a wrong node or a smaller number of nodes which may e.g. overload them. This patch fixes the bug so rebinding again works as intended. [vbabka@suse.cz: new changlog] Link: http://lkml.kernel.org/r/ef6a69c6-c052-b067-8f2c-9d615c619bb9@suse.cz Link: http://lkml.kernel.org/r/1558768043-23184-1-git-send-email-zhongjiang@huawei.com Fixes: 213980c0f23b ("mm, mempolicy: simplify rebinding mempolicies when updating cpusets") Signed-off-by: zhong jiang <zhongjiang@huawei.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Oscar Salvador <osalvador@suse.de> Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| | * | | | | | | | | fs/proc/array.c: allow reporting eip/esp for all coredumping threadsJohn Ogness2019-06-291-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 0a1eb2d474ed ("fs/proc: Stop reporting eip and esp in /proc/PID/stat") stopped reporting eip/esp and fd7d56270b52 ("fs/proc: Report eip/esp in /prod/PID/stat for coredumping") reintroduced the feature to fix a regression with userspace core dump handlers (such as minicoredumper). Because PF_DUMPCORE is only set for the primary thread, this didn't fix the original problem for secondary threads. Allow reporting the eip/esp for all threads by checking for PF_EXITING as well. This is set for all the other threads when they are killed. coredump_wait() waits for all the tasks to become inactive before proceeding to invoke a core dumper. Link: http://lkml.kernel.org/r/87y32p7i7a.fsf@linutronix.de Link: http://lkml.kernel.org/r/20190522161614.628-1-jlu@pengutronix.de Fixes: fd7d56270b526ca3 ("fs/proc: Report eip/esp in /prod/PID/stat for coredumping") Signed-off-by: John Ogness <john.ogness@linutronix.de> Reported-by: Jan Luebbe <jlu@pengutronix.de> Tested-by: Jan Luebbe <jlu@pengutronix.de> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| | * | | | | | | | | mm/dev_pfn: exclude MEMORY_DEVICE_PRIVATE while computing virtual addressAnshuman Khandual2019-06-291-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The presence of struct page does not guarantee linear mapping for the pfn physical range. Device private memory which is non-coherent is excluded from linear mapping during devm_memremap_pages() though they will still have struct page coverage. Change pfn_t_to_virt() to just check for device private memory before giving out virtual address for a given pfn. pfn_t_to_virt() actually has no callers. Let's fix it for the 5.2 kernel and remove it in 5.3. Link: http://lkml.kernel.org/r/1558089514-25067-1-git-send-email-anshuman.khandual@arm.com Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: Laurent Dufour <ldufour@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | | | | | | | | | Merge tag 'arc-5.2-rc7' of ↵Linus Torvalds2019-06-292-8/+157
| |\ \ \ \ \ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc Pull ARC fixes from Vineet Gupta: - hsdk platform unifying apertures - build system CROSS_COMPILE prefix * tag 'arc-5.2-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc: ARC: [plat-hsdk]: unify memory apertures configuration ARC: build: Try to guess CROSS_COMPILE with cc-cross-prefix
| | * | | | | | | | | | ARC: [plat-hsdk]: unify memory apertures configurationEugeniy Paltsev2019-06-111-8/+153
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | HSDK SoC has memory bridge which allows to configure memory map for different AXI masters in runtime. As of today we adjust memory apertures configuration in U-boot so we have different configuration in case of loading kernel via U-boot and JTAG. It isn't really critical in case of existing platform configuration as configuration differs for <currently> unused address space regions or unused AXI masters. However we may face with this issue when we'll bringup new peripherals or touch their address space. Fix that by perform full configuration of memory bridge in HSDK platform code. Basically we simply copy memory bridge configuration code from U-boot. Acked-by: Alexey Brodkin <abrodkin@synopsys.com> Signed-off-by: Eugeniy Paltsev <Eugeniy.Paltsev@synopsys.com> Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
| | * | | | | | | | | | ARC: build: Try to guess CROSS_COMPILE with cc-cross-prefixAlexey Brodkin2019-06-111-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For a long time we used to hard-code CROSS_COMPILE prefix for ARC until it started to cause problems, so we decided to solely rely on CROSS_COMPILE externally set by a user: commit 40660f1fcee8 ("ARC: build: Don't set CROSS_COMPILE in arch's Makefile"). While it works perfectly fine for build-systems where the prefix gets defined anyways for us human beings it's quite an annoying requirement especially given most of time the same one prefix "arc-linux-" is all what we need. It looks like finally we're getting the best of both worlds: 1. W/o cross-toolchain we still may install headers, build .dtb etc 2. W/ cross-toolchain get the kerne built with only ARCH=arc Inspired by [1] & [2]. [1] http://lists.infradead.org/pipermail/linux-snps-arc/2019-May/005788.html [2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=fc2b47b55f17 A side note: even though "cc-cross-prefix" does its job it pollutes console with output of "which" for all the prefixes it didn't manage to find a matching cross-compiler for like that: | # ARCH=arc make defconfig | which: no arceb-linux-gcc in (~/.local/bin:~/bin:/usr/bin:/usr/sbin) | *** Default configuration is based on 'nsim_hs_defconfig' Suggested-by: Vineet Gupta <vgupta@synopsys.com> Reviewed-by: Masahiro Yamada <yamada.masahiro@socionext.com> Signed-off-by: Alexey Brodkin <abrodkin@synopsys.com> Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
| * | | | | | | | | | | Merge tag 'riscv-for-v5.2/fixes-rc7' of ↵Linus Torvalds2019-06-296-16/+39
| |\ \ \ \ \ \ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux Pull RISC-V fixes from Paul Walmsley: "Minor RISC-V fixes and one defconfig update. The fixes have no functional impact: - Fix some comment text in the memory management vmalloc_fault path. - Fix some warnings from the DT compiler in our newly-added DT files. - Change the newly-added DT bindings such that SoC IP blocks with external I/O are marked as "disabled" by default, then enable them explicitly in board DT files when the devices are used on the board. This aligns the bindings with existing upstream practice. - Add the MIT license as an option for a minor header file, at the request of one of the U-Boot maintainers. The RISC-V defconfig update builds the SiFive SPI driver and the MMC-SPI driver by default. The intention here is to make v5.2 more usable for testers and users with RISC-V hardware" * tag 'riscv-for-v5.2/fixes-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux: riscv: mm: Fix code comment dt-bindings: clock: sifive: add MIT license as an option for the header file dt-bindings: riscv: resolve 'make dt_binding_check' warnings riscv: dts: Re-organize the DT nodes RISC-V: defconfig: enable MMC & SPI for RISC-V
| | * | | | | | | | | | | riscv: mm: Fix code commentShihPo Hung2019-06-261-3/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix the comment since vmalloc_fault doesn't reach flush_tlb_fix_spurious_fault. Signed-off-by: ShihPo Hung <shihpo.hung@sifive.com> Cc: Palmer Dabbelt <palmer@sifive.com> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: linux-riscv@lists.infradead.org Reviewed-by: Palmer Dabbelt <palmer@sifive.com> Signed-off-by: Paul Walmsley <paul.walmsley@sifive.com>
| | * | | | | | | | | | | dt-bindings: clock: sifive: add MIT license as an option for the header filePaul Walmsley2019-06-261-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | At Bin Meng's request, add the MIT license as an option for the SiFive FU540 PRCI header file. Signed-off-by: Paul Walmsley <paul.walmsley@sifive.com> Cc: Bin Meng <bmeng.cn@gmail.com>
| | * | | | | | | | | | | dt-bindings: riscv: resolve 'make dt_binding_check' warningsPaul Walmsley2019-06-261-12/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Rob pointed out that one of the examples in the RISC-V 'cpus' YAML schema results in warnings from 'make dt_binding_check'. Fix these. While here, make the whitespace in the second example consistent with the first example. Signed-off-by: Paul Walmsley <paul.walmsley@sifive.com> Cc: Rob Herring <robh@kernel.org> Reviewed-by: Rob Herring <robh@kernel.org> # for fixing the dtc warnings
| | * | | | | | | | | | | riscv: dts: Re-organize the DT nodesYash Shah2019-06-262-0/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | As per the convention for any SOC device with external connection, define only device DT node in SOC DTSi file with status = "disabled" and enable device in Board DTS file with status = "okay" Reported-by: Anup Patel <anup@brainfault.org> Signed-off-by: Yash Shah <yash.shah@sifive.com> Signed-off-by: Paul Walmsley <paul.walmsley@sifive.com>
| | * | | | | | | | | | | RISC-V: defconfig: enable MMC & SPI for RISC-VAtish Patra2019-06-261-0/+5
| | | |_|_|/ / / / / / / | | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, riscv upstream defconfig doesn't let you boot through userspace if rootfs is on the SD card. Let's enable MMC & SPI drivers as well so that one can boot to the user space using default config in upstream kernel. While here, enable automatic mounting of devtmpfs to simplify kernel testing with minimal root filesystems. (pjw) Signed-off-by: Atish Patra <atish.patra@wdc.com> Reviewed-by: Palmer Dabbelt <palmer@sifive.com> [paul.walmsley@sifive.com: mention the DEVTMPFS_MOUNT change in the patch description] Signed-off-by: Paul Walmsley <paul.walmsley@sifive.com>
| * | | | | | | | | | | Merge tag 'nfs-for-5.2-4' of git://git.linux-nfs.org/projects/anna/linux-nfsLinus Torvalds2019-06-292-9/+9
| |\ \ \ \ \ \ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull two more NFS client fixes from Anna Schumaker: "These are both stable fixes. One to calculate the correct client message length in the case of partial transmissions. And the other to set the proper TCP timeout for flexfiles" * tag 'nfs-for-5.2-4' of git://git.linux-nfs.org/projects/anna/linux-nfs: NFS/flexfiles: Use the correct TCP timeout for flexfiles I/O SUNRPC: Fix up calculation of client message length
| | * | | | | | | | | | | NFS/flexfiles: Use the correct TCP timeout for flexfiles I/OTrond Myklebust2019-06-281-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix a typo where we're confusing the default TCP retrans value (NFS_DEF_TCP_RETRANS) for the default TCP timeout value. Fixes: 15d03055cf39f ("pNFS/flexfiles: Set reasonable default ...") Cc: stable@vger.kernel.org # 4.8+ Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
| | * | | | | | | | | | | SUNRPC: Fix up calculation of client message lengthTrond Myklebust2019-06-281-8/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In the case where a record marker was used, xs_sendpages() needs to return the length of the payload + record marker so that we operate correctly in the case of a partial transmission. When the callers check return value, they therefore need to take into account the record marker length. Fixes: 06b5fc3ad94e ("Merge tag 'nfs-rdma-for-5.1-1'...") Cc: stable@vger.kernel.org # 5.1+ Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>