aboutsummaryrefslogtreecommitdiffstats
Commit message (Collapse)AuthorAgeFilesLines
* x86/irq: Seperate unused system vectors from spurious entry againWIP.irqThomas Gleixner2019-06-285-16/+76
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Quite some time ago the interrupt entry stubs for unused vectors in the system vector range got removed and directly mapped to the spurious interrupt vector entry point. Sounds reasonable, but it's subtly broken. The spurious interrupt vector entry point pushes vector number 0xFF on the stack which makes the whole logic in __smp_spurious_interrupt() pointless. As a consequence any spurious interrupt which comes from a vector != 0xFF is treated as a real spurious interrupt (vector 0xFF) and not acknowledged. That subsequently stalls all interrupt vectors of equal and lower priority, which brings the system to a grinding halt. This can happen because even on 64-bit the system vector space is not guaranteed to be fully populated. A full compile time handling of the unused vectors is not possible because quite some of them are conditonally populated at runtime. Bring the entry stubs back, which wastes 160 bytes if all stubs are unused, but gains the proper handling back. There is no point to selectively spare some of the stubs which are known at compile time as the required code in the IDT management would be way larger and convoluted. Do not route the spurious entries through common_interrupt and do_IRQ() as the original code did. Route it to smp_spurious_interrupt() which evaluates the vector number and acts accordingly now that the real vector numbers are handed in. Fixup the pr_warn so the actual spurious vector (0xff) is clearly distiguished from the other vectors and also note for the vectored case whether it was pending in the ISR or not. "Spurious APIC interrupt (vector 0xFF) on CPU#0, should never happen." "Spurious interrupt vector 0xed on CPU#1. Acked." "Spurious interrupt vector 0xee on CPU#1. Not pending!." Fixes: 2414e021ac8d ("x86: Avoid building unused IRQ entry stubs") Reported-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Jan Beulich <jbeulich@suse.com>
* x86/irq: Handle spurious interrupt after shutdown gracefullyThomas Gleixner2019-06-283-4/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since the rework of the vector management, warnings about spurious interrupts have been reported. Robert provided some more information and did an initial analysis. The following situation leads to these warnings: CPU 0 CPU 1 IO_APIC interrupt is raised sent to CPU1 Unable to handle immediately (interrupts off, deep idle delay) mask() ... free() shutdown() synchronize_irq() clear_vector() do_IRQ() -> vector is clear Before the rework the vector entries of legacy interrupts were statically assigned and occupied precious vector space while most of them were unused. Due to that the above situation was handled silently because the vector was handled and the core handler of the assigned interrupt descriptor noticed that it is shut down and returned. While this has been usually observed with legacy interrupts, this situation is not limited to them. Any other interrupt source, e.g. MSI, can cause the same issue. After adding proper synchronization for level triggered interrupts, this can only happen for edge triggered interrupts where the IO-APIC obviously cannot provide information about interrupts in flight. While the spurious warning is actually harmless in this case it worries users and driver developers. Handle it gracefully by marking the vector entry as VECTOR_SHUTDOWN instead of VECTOR_UNUSED when the vector is freed up. If that above late handling happens the spurious detector will not complain and switch the entry to VECTOR_UNUSED. Any subsequent spurious interrupt on that line will trigger the spurious warning as before. Fixes: 464d12309e1b ("x86/vector: Switch IOAPIC to global reservation mode") Reported-by: Robert Hodaszi <Robert.Hodaszi@digi.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>- Tested-by: Robert Hodaszi <Robert.Hodaszi@digi.com>
* x86/ioapic: Implement irq_get_irqchip_state() callbackThomas Gleixner2019-06-281-0/+46
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When an interrupt is shut down in free_irq() there might be an inflight interrupt pending in the IO-APIC remote IRR which is not yet serviced. That means the interrupt has been sent to the target CPUs local APIC, but the target CPU is in a state which delays the servicing. So free_irq() would proceed to free resources and to clear the vector because synchronize_hardirq() does not see an interrupt handler in progress. That can trigger a spurious interrupt warning, which is harmless and just confuses users, but it also can leave the remote IRR in a stale state because once the handler is invoked the interrupt resources might be freed already and therefore acknowledgement is not possible anymore. Implement the irq_get_irqchip_state() callback for the IO-APIC irq chip. The callback is invoked from free_irq() via __synchronize_hardirq(). Check the remote IRR bit of the interrupt and return 'in flight' if it is set and the interrupt is configured in level mode. For edge mode the remote IRR has no meaning. As this is only meaningful for level triggered interrupts this won't cure the potential spurious interrupt warning for edge triggered interrupts, but the edge trigger case does not result in stale hardware state. This has to be addressed at the vector/interrupt entry level seperately. Fixes: 464d12309e1b ("x86/vector: Switch IOAPIC to global reservation mode") Reported-by: Robert Hodaszi <Robert.Hodaszi@digi.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* genirq: Add optional hardware synchronization for shutdownThomas Gleixner2019-06-282-19/+60
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | free_irq() ensures that no hardware interrupt handler is executing on a different CPU before actually releasing resources and deactivating the interrupt completely in a domain hierarchy. But that does not catch the case where the interrupt is on flight at the hardware level but not yet serviced by the target CPU. That creates an interesing race condition: CPU 0 CPU 1 IRQ CHIP interrupt is raised sent to CPU1 Unable to handle immediately (interrupts off, deep idle delay) mask() ... free() shutdown() synchronize_irq() release_resources() do_IRQ() -> resources are not available That might be harmless and just trigger a spurious interrupt warning, but some interrupt chips might get into a wedged state. Utilize the existing irq_get_irqchip_state() callback for the synchronization in free_irq(). synchronize_hardirq() is not using this mechanism as it might actually deadlock unter certain conditions, e.g. when called with interrupts disabled and the target CPU is the one on which the synchronization is invoked. synchronize_irq() uses it because that function cannot be called from non preemtible contexts as it might sleep. Fixes: 464d12309e1b ("x86/vector: Switch IOAPIC to global reservation mode") Reported-by: Robert Hodaszi <Robert.Hodaszi@digi.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* genirq: Fix misleading synchronize_irq() documentationThomas Gleixner2019-06-281-1/+2
| | | | | | | The function might sleep, so it cannot be called from interrupt context. Not even with care. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* genirq: Delay deactivation in free_irq()Thomas Gleixner2019-06-285-5/+22
| | | | | | | | | | | | | | | | | | | | | | | | | When interrupts are shutdown, they are immediately deactivated in the irqdomain hierarchy. While this looks obviously correct there is a subtle issue: There might be an interrupt in flight when free_irq() is invoking the shutdown. This is properly handled at the irq descriptor / primary handler level, but the deactivation might completely disable resources which are required to acknowledge the interrupt. Split the shutdown code and deactivate the interrupt after synchronization in free_irq(). Fixup all other usage sites where this is not an issue to invoke the combined shutdown_and_deactivate() function instead. This still might be an issue if the interrupt in flight servicing is delayed on a remote CPU beyond the invocation of synchronize_irq(), but that cannot be handled at that level and needs to be handled in the synchronize_irq() context. Fixes: f8264e34965a ("irqdomain: Introduce new interfaces to support hierarchy irqdomains") Reported-by: Robert Hodaszi <Robert.Hodaszi@digi.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
* Linux 5.2-rc6v5.2-rc6WIP.x86/ipiLinus Torvalds2019-06-221-1/+1
|
* Merge tag 'iommu-fix-v5.2-rc5' of ↵Linus Torvalds2019-06-221-4/+3
|\ | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu Pull iommu fix from Joerg Roedel: "Revert a commit from the previous pile of fixes which causes new lockdep splats. It is better to revert it for now and work on a better and more well tested fix" * tag 'iommu-fix-v5.2-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: Revert "iommu/vt-d: Fix lock inversion between iommu->lock and device_domain_lock"
| * Revert "iommu/vt-d: Fix lock inversion between iommu->lock and ↵Peter Xu2019-06-221-4/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | device_domain_lock" This reverts commit 7560cc3ca7d9d11555f80c830544e463fcdb28b8. With 5.2.0-rc5 I can easily trigger this with lockdep and iommu=pt: ====================================================== WARNING: possible circular locking dependency detected 5.2.0-rc5 #78 Not tainted ------------------------------------------------------ swapper/0/1 is trying to acquire lock: 00000000ea2b3beb (&(&iommu->lock)->rlock){+.+.}, at: domain_context_mapping_one+0xa5/0x4e0 but task is already holding lock: 00000000a681907b (device_domain_lock){....}, at: domain_context_mapping_one+0x8d/0x4e0 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (device_domain_lock){....}: _raw_spin_lock_irqsave+0x3c/0x50 dmar_insert_one_dev_info+0xbb/0x510 domain_add_dev_info+0x50/0x90 dev_prepare_static_identity_mapping+0x30/0x68 intel_iommu_init+0xddd/0x1422 pci_iommu_init+0x16/0x3f do_one_initcall+0x5d/0x2b4 kernel_init_freeable+0x218/0x2c1 kernel_init+0xa/0x100 ret_from_fork+0x3a/0x50 -> #0 (&(&iommu->lock)->rlock){+.+.}: lock_acquire+0x9e/0x170 _raw_spin_lock+0x25/0x30 domain_context_mapping_one+0xa5/0x4e0 pci_for_each_dma_alias+0x30/0x140 dmar_insert_one_dev_info+0x3b2/0x510 domain_add_dev_info+0x50/0x90 dev_prepare_static_identity_mapping+0x30/0x68 intel_iommu_init+0xddd/0x1422 pci_iommu_init+0x16/0x3f do_one_initcall+0x5d/0x2b4 kernel_init_freeable+0x218/0x2c1 kernel_init+0xa/0x100 ret_from_fork+0x3a/0x50 other info that might help us debug this: Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(device_domain_lock); lock(&(&iommu->lock)->rlock); lock(device_domain_lock); lock(&(&iommu->lock)->rlock); *** DEADLOCK *** 2 locks held by swapper/0/1: #0: 00000000033eb13d (dmar_global_lock){++++}, at: intel_iommu_init+0x1e0/0x1422 #1: 00000000a681907b (device_domain_lock){....}, at: domain_context_mapping_one+0x8d/0x4e0 stack backtrace: CPU: 2 PID: 1 Comm: swapper/0 Not tainted 5.2.0-rc5 #78 Hardware name: LENOVO 20KGS35G01/20KGS35G01, BIOS N23ET50W (1.25 ) 06/25/2018 Call Trace: dump_stack+0x85/0xc0 print_circular_bug.cold.57+0x15c/0x195 __lock_acquire+0x152a/0x1710 lock_acquire+0x9e/0x170 ? domain_context_mapping_one+0xa5/0x4e0 _raw_spin_lock+0x25/0x30 ? domain_context_mapping_one+0xa5/0x4e0 domain_context_mapping_one+0xa5/0x4e0 ? domain_context_mapping_one+0x4e0/0x4e0 pci_for_each_dma_alias+0x30/0x140 dmar_insert_one_dev_info+0x3b2/0x510 domain_add_dev_info+0x50/0x90 dev_prepare_static_identity_mapping+0x30/0x68 intel_iommu_init+0xddd/0x1422 ? printk+0x58/0x6f ? lockdep_hardirqs_on+0xf0/0x180 ? do_early_param+0x8e/0x8e ? e820__memblock_setup+0x63/0x63 pci_iommu_init+0x16/0x3f do_one_initcall+0x5d/0x2b4 ? do_early_param+0x8e/0x8e ? rcu_read_lock_sched_held+0x55/0x60 ? do_early_param+0x8e/0x8e kernel_init_freeable+0x218/0x2c1 ? rest_init+0x230/0x230 kernel_init+0xa/0x100 ret_from_fork+0x3a/0x50 domain_context_mapping_one() is taking device_domain_lock first then iommu lock, while dmar_insert_one_dev_info() is doing the reverse. That should be introduced by commit: 7560cc3ca7d9 ("iommu/vt-d: Fix lock inversion between iommu->lock and device_domain_lock", 2019-05-27) So far I still cannot figure out how the previous deadlock was triggered (I cannot find iommu lock taken before calling of iommu_flush_dev_iotlb()), however I'm pretty sure that that change should be incomplete at least because it does not fix all the places so we're still taking the locks in different orders, while reverting that commit is very clean to me so far that we should always take device_domain_lock first then the iommu lock. We can continue to try to find the real culprit mentioned in 7560cc3ca7d9, but for now I think we should revert it to fix current breakage. CC: Joerg Roedel <joro@8bytes.org> CC: Lu Baolu <baolu.lu@linux.intel.com> CC: dave.jiang@intel.com Signed-off-by: Peter Xu <peterx@redhat.com> Tested-by: Chris Wilson <chris@chris-wilson.co.uk> Signed-off-by: Joerg Roedel <jroedel@suse.de>
* | Merge tag 'pci-v5.2-fixes-1' of ↵Linus Torvalds2019-06-221-0/+4
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci Pull PCI fix from Bjorn Helgaas: "If an IOMMU is present, ignore the P2PDMA whitelist we added for v5.2 because we don't yet know how to support P2PDMA in that case (Logan Gunthorpe)" * tag 'pci-v5.2-fixes-1' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: PCI/P2PDMA: Ignore root complex whitelist when an IOMMU is present
| * | PCI/P2PDMA: Ignore root complex whitelist when an IOMMU is presentLogan Gunthorpe2019-06-191-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Presently, there is no path to DMA map P2PDMA memory, so if a TLP targeting this memory hits the root complex and an IOMMU is present, the IOMMU will reject the transaction, even if the RC would support P2PDMA. So until the kernel knows to map these DMA addresses in the IOMMU, we should not enable the whitelist when an IOMMU is present. Link: https://lore.kernel.org/linux-pci/20190522201252.2997-1-logang@deltatee.com/ Fixes: 0f97da831026 ("PCI/P2PDMA: Allow P2P DMA between any devices under AMD ZEN Root Complex") Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Cc: Christoph Hellwig <hch@lst.de>
* | | Merge tag 'scsi-fixes' of ↵Linus Torvalds2019-06-224-11/+11
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi Pull SCSI fixes from James Bottomley: "Three driver fixes (and one version number update): a suspend hang in ufs, a qla hard lock on module removal and a qedi panic during discovery" * tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: scsi: qla2xxx: Fix hardlockup in abort command during driver remove scsi: ufs: Avoid runtime suspend possibly being blocked forever scsi: qedi: update driver version to 8.37.0.20 scsi: qedi: Check targetname while finding boot target information
| * | | scsi: qla2xxx: Fix hardlockup in abort command during driver removeArun Easi2019-06-181-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [436194.555537] NMI watchdog: Watchdog detected hard LOCKUP on cpu 5 [436194.555558] RIP: 0010:native_queued_spin_lock_slowpath+0x63/0x1e0 [436194.555563] Call Trace: [436194.555564] _raw_spin_lock_irqsave+0x30/0x40 [436194.555564] qla24xx_async_abort_command+0x29/0xd0 [qla2xxx] [436194.555565] qla24xx_abort_command+0x208/0x2d0 [qla2xxx] [436194.555565] __qla2x00_abort_all_cmds+0x16b/0x290 [qla2xxx] [436194.555565] qla2x00_abort_all_cmds+0x42/0x60 [qla2xxx] [436194.555566] qla2x00_abort_isp_cleanup+0x2bd/0x3a0 [qla2xxx] [436194.555566] qla2x00_remove_one+0x1ad/0x360 [qla2xxx] [436194.555566] pci_device_remove+0x3b/0xb0 Fixes: 219d27d7147e (scsi: qla2xxx: Fix race conditions in the code for aborting SCSI commands) Cc: stable@vger.kernel.org # 5.2 Signed-off-by: Arun Easi <aeasi@marvell.com> Signed-off-by: Himanshu Madhani <hmadhani@marvell.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
| * | | scsi: ufs: Avoid runtime suspend possibly being blocked foreverStanley Chu2019-06-181-7/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | UFS runtime suspend can be triggered after pm_runtime_enable() is invoked in ufshcd_pltfrm_init(). However if the first runtime suspend is triggered before binding ufs_hba structure to ufs device structure via platform_set_drvdata(), then UFS runtime suspend will be no longer triggered in the future because its dev->power.runtime_error was set in the first triggering and does not have any chance to be cleared. To be more clear, dev->power.runtime_error is set if hba is NULL in ufshcd_runtime_suspend() which returns -EINVAL to rpm_callback() where dev->power.runtime_error is set as -EINVAL. In this case, any future rpm_suspend() for UFS device fails because rpm_check_suspend_allowed() fails due to non-zero dev->power.runtime_error. To resolve this issue, make sure the first UFS runtime suspend get valid "hba" in ufshcd_runtime_suspend(): Enable UFS runtime PM only after hba is successfully bound to UFS device structure. Fixes: 62694735ca95 ([SCSI] ufs: Add runtime PM support for UFS host controller driver) Cc: stable@vger.kernel.org Signed-off-by: Stanley Chu <stanley.chu@mediatek.com> Reviewed-by: Avri Altman <avri.altman@wdc.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
| * | | scsi: qedi: update driver version to 8.37.0.20Nilesh Javali2019-06-181-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Update qedi driver version to 8.37.0.20 Signed-off-by: Nilesh Javali <njavali@marvell.com> Reviewed-by: Lee Duncan <lduncan@suse.com> Reviewed-by: Chris Leech <cleech@redhat.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
| * | | scsi: qedi: Check targetname while finding boot target informationNilesh Javali2019-06-181-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The kernel panic was observed during iSCSI discovery via offload with below call trace, [ 2115.646901] BUG: unable to handle kernel NULL pointer dereference at (null) [ 2115.646909] IP: [<ffffffffacf7f0cc>] strncmp+0xc/0x60 [ 2115.646927] PGD 0 [ 2115.646932] Oops: 0000 [#1] SMP [ 2115.647107] CPU: 24 PID: 264 Comm: kworker/24:1 Kdump: loaded Tainted: G OE ------------ 3.10.0-957.el7.x86_64 #1 [ 2115.647133] Workqueue: slowpath-13:00. qed_slowpath_task [qed] [ 2115.647135] task: ffff8d66af80b0c0 ti: ffff8d66afb80000 task.ti: ffff8d66afb80000 [ 2115.647136] RIP: 0010:[<ffffffffacf7f0cc>] [<ffffffffacf7f0cc>] strncmp+0xc/0x60 [ 2115.647141] RSP: 0018:ffff8d66afb83c68 EFLAGS: 00010206 [ 2115.647143] RAX: 0000000000000001 RBX: 0000000000000007 RCX: 000000000000000a [ 2115.647144] RDX: 0000000000000100 RSI: 0000000000000000 RDI: ffff8d632b3ba040 [ 2115.647145] RBP: ffff8d66afb83c68 R08: 0000000000000000 R09: 000000000000ffff [ 2115.647147] R10: 0000000000000007 R11: 0000000000000800 R12: ffff8d66a30007a0 [ 2115.647148] R13: ffff8d66747a3c10 R14: ffff8d632b3ba000 R15: ffff8d66747a32f8 [ 2115.647149] FS: 0000000000000000(0000) GS:ffff8d66aff00000(0000) knlGS:0000000000000000 [ 2115.647151] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 2115.647152] CR2: 0000000000000000 CR3: 0000000509610000 CR4: 00000000007607e0 [ 2115.647153] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 2115.647154] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 2115.647155] PKRU: 00000000 [ 2115.647157] Call Trace: [ 2115.647165] [<ffffffffc0634cc5>] qedi_get_protocol_tlv_data+0x2c5/0x510 [qedi] [ 2115.647184] [<ffffffffc05968f5>] ? qed_mfw_process_tlv_req+0x245/0xbe0 [qed] [ 2115.647195] [<ffffffffc05496cb>] qed_mfw_fill_tlv_data+0x4b/0xb0 [qed] [ 2115.647206] [<ffffffffc0596911>] qed_mfw_process_tlv_req+0x261/0xbe0 [qed] [ 2115.647215] [<ffffffffacce0e8e>] ? dequeue_task_fair+0x41e/0x660 [ 2115.647221] [<ffffffffacc2a59e>] ? __switch_to+0xce/0x580 [ 2115.647230] [<ffffffffc0546013>] qed_slowpath_task+0xa3/0x160 [qed] [ 2115.647278] RIP [<ffffffffacf7f0cc>] strncmp+0xc/0x60 Fix kernel panic by validating the session targetname before providing TLV data and confirming the presence of boot targets. Signed-off-by: Nilesh Javali <njavali@marvell.com> Reviewed-by: Lee Duncan <lduncan@suse.com> Reviewed-by: Chris Leech <cleech@redhat.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
* | | | Merge tag 'powerpc-5.2-5' of ↵Linus Torvalds2019-06-228-9/+31
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc fixes from Michael Ellerman: "This is a frustratingly large batch at rc5. Some of these were sent earlier but were missed by me due to being distracted by other things, and some took a while to track down due to needing manual bisection on old hardware. But still we clearly need to improve our testing of KVM, and of 32-bit, so that we catch these earlier. Summary: seven fixes, all for bugs introduced this cycle. - The commit to add KASAN support broke booting on 32-bit SMP machines, due to a refactoring that moved some setup out of the secondary CPU path. - A fix for another 32-bit SMP bug introduced by the fast syscall entry implementation for 32-bit BOOKE. And a build fix for the same commit. - Our change to allow the DAWR to be force enabled on Power9 introduced a bug in KVM, where we clobber r3 leading to a host crash. - The same commit also exposed a previously unreachable bug in the nested KVM handling of DAWR, which could lead to an oops in a nested host. - One of the DMA reworks broke the b43legacy WiFi driver on some people's powermacs, fix it by enabling a 30-bit ZONE_DMA on 32-bit. - A fix for TLB flushing in KVM introduced a new bug, as it neglected to also flush the ERAT, this could lead to memory corruption in the guest. Thanks to: Aaro Koskinen, Christoph Hellwig, Christophe Leroy, Larry Finger, Michael Neuling, Suraj Jitindar Singh" * tag 'powerpc-5.2-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: KVM: PPC: Book3S HV: Invalidate ERAT when flushing guest TLB entries powerpc: enable a 30-bit ZONE_DMA for 32-bit pmac KVM: PPC: Book3S HV: Only write DAWR[X] when handling h_set_dawr in real mode KVM: PPC: Book3S HV: Fix r3 corruption in h_set_dabr() powerpc/32: fix build failure on book3e with KVM powerpc/booke: fix fast syscall entry on SMP powerpc/32s: fix initial setup of segment registers on secondary CPU
| * | | | KVM: PPC: Book3S HV: Invalidate ERAT when flushing guest TLB entriesSuraj Jitindar Singh2019-06-201-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When a guest vcpu moves from one physical thread to another it is necessary for the host to perform a tlb flush on the previous core if another vcpu from the same guest is going to run there. This is because the guest may use the local form of the tlb invalidation instruction meaning stale tlb entries would persist where it previously ran. This is handled on guest entry in kvmppc_check_need_tlb_flush() which calls flush_guest_tlb() to perform the tlb flush. Previously the generic radix__local_flush_tlb_lpid_guest() function was used, however the functionality was reimplemented in flush_guest_tlb() to avoid the trace_tlbie() call as the flushing may be done in real mode. The reimplementation in flush_guest_tlb() was missing an erat invalidation after flushing the tlb. This lead to observable memory corruption in the guest due to the caching of stale translations. Fix this by adding the erat invalidation. Fixes: 70ea13f6e609 ("KVM: PPC: Book3S HV: Flush TLB on secondary radix threads") Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
| * | | | powerpc: enable a 30-bit ZONE_DMA for 32-bit pmacChristoph Hellwig2019-06-193-1/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | With the strict dma mask checking introduced with the switch to the generic DMA direct code common wifi chips on 32-bit powerbooks stopped working. Add a 30-bit ZONE_DMA to the 32-bit pmac builds to allow them to reliably allocate dma coherent memory. Fixes: 65a21b71f948 ("powerpc/dma: remove dma_nommu_dma_supported") Reported-by: Aaro Koskinen <aaro.koskinen@iki.fi> Signed-off-by: Christoph Hellwig <hch@lst.de> Tested-by: Larry Finger <Larry.Finger@lwfinger.net> Acked-by: Larry Finger <Larry.Finger@lwfinger.net> Tested-by: Aaro Koskinen <aaro.koskinen@iki.fi> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
| * | | | KVM: PPC: Book3S HV: Only write DAWR[X] when handling h_set_dawr in real modeSuraj Jitindar Singh2019-06-181-1/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The hcall H_SET_DAWR is used by a guest to set the data address watchpoint register (DAWR). This hcall is handled in the host in kvmppc_h_set_dawr() which can be called in either real mode on the guest exit path from hcall_try_real_mode() in book3s_hv_rmhandlers.S, or in virtual mode when called from kvmppc_pseries_do_hcall() in book3s_hv.c. The function kvmppc_h_set_dawr() updates the dawr and dawrx fields in the vcpu struct accordingly and then also writes the respective values into the DAWR and DAWRX registers directly. It is necessary to write the registers directly here when calling the function in real mode since the path to re-enter the guest won't do this. However when in virtual mode the host DAWR and DAWRX values have already been restored, and so writing the registers would overwrite these. Additionally there is no reason to write the guest values here as these will be read from the vcpu struct and written to the registers appropriately the next time the vcpu is run. This also avoids the case when handling h_set_dawr for a nested guest where the guest hypervisor isn't able to write the DAWR and DAWRX registers directly and must rely on the real hypervisor to do this for it when it calls H_ENTER_NESTED. Fixes: c1fe190c0672 ("powerpc: Add force enable of DAWR on P9 option") Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
| * | | | KVM: PPC: Book3S HV: Fix r3 corruption in h_set_dabr()Michael Neuling2019-06-181-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit c1fe190c0672 ("powerpc: Add force enable of DAWR on P9 option") screwed up some assembler and corrupted a pointer in r3. This resulted in crashes like the below: BUG: Kernel NULL pointer dereference at 0x000013bf Faulting instruction address: 0xc00000000010b044 Oops: Kernel access of bad area, sig: 11 [#1] LE PAGE_SIZE=64K MMU=Radix MMU=Hash SMP NR_CPUS=2048 NUMA pSeries CPU: 8 PID: 1771 Comm: qemu-system-ppc Kdump: loaded Not tainted 5.2.0-rc4+ #3 NIP: c00000000010b044 LR: c0080000089dacf4 CTR: c00000000010aff4 REGS: c00000179b397710 TRAP: 0300 Not tainted (5.2.0-rc4+) MSR: 800000000280b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE> CR: 42244842 XER: 00000000 CFAR: c00000000010aff8 DAR: 00000000000013bf DSISR: 42000000 IRQMASK: 0 GPR00: c0080000089dd6bc c00000179b3979a0 c008000008a04300 ffffffffffffffff GPR04: 0000000000000000 0000000000000003 000000002444b05d c0000017f11c45d0 ... NIP kvmppc_h_set_dabr+0x50/0x68 LR kvmppc_pseries_do_hcall+0xa3c/0xeb0 [kvm_hv] Call Trace: 0xc0000017f11c0000 (unreliable) kvmppc_vcpu_run_hv+0x694/0xec0 [kvm_hv] kvmppc_vcpu_run+0x34/0x48 [kvm] kvm_arch_vcpu_ioctl_run+0x2f4/0x400 [kvm] kvm_vcpu_ioctl+0x460/0x850 [kvm] do_vfs_ioctl+0xe4/0xb40 ksys_ioctl+0xc4/0x110 sys_ioctl+0x28/0x80 system_call+0x5c/0x70 Instruction dump: 4082fff4 4c00012c 38600000 4e800020 e96280c0 896b0000 2c2b0000 3860ffff 4d820020 50852e74 508516f6 78840724 <f88313c0> f8a313c8 7c942ba6 7cbc2ba6 Fix the bug by only changing r3 when we are returning immediately. Fixes: c1fe190c0672 ("powerpc: Add force enable of DAWR on P9 option") Signed-off-by: Michael Neuling <mikey@neuling.org> Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com> Reported-by: Cédric Le Goater <clg@kaod.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
| * | | | powerpc/32: fix build failure on book3e with KVMChristophe Leroy2019-06-162-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Build failure was introduced by the commit identified below, due to missed macro expension leading to wrong called function's name. arch/powerpc/kernel/head_fsl_booke.o: In function `SystemCall': arch/powerpc/kernel/head_fsl_booke.S:416: undefined reference to `kvmppc_handler_BOOKE_INTERRUPT_SYSCALL_SPRN_SRR1' Makefile:1052: recipe for target 'vmlinux' failed The called function should be kvmppc_handler_8_0x01B(). This patch fixes it. Reported-by: Paul Mackerras <paulus@ozlabs.org> Fixes: 1a4b739bbb4f ("powerpc/32: implement fast entry for syscalls on BOOKE") Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
| * | | | powerpc/booke: fix fast syscall entry on SMPChristophe Leroy2019-06-151-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Use r10 instead of r9 to calculate CPU offset as r9 contains the value from SRR1 which is used later. Fixes: 1a4b739bbb4f ("powerpc/32: implement fast entry for syscalls on BOOKE") Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
| * | | | powerpc/32s: fix initial setup of segment registers on secondary CPUChristophe Leroy2019-06-151-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The patch referenced below moved the loading of segment registers out of load_up_mmu() in order to do it earlier in the boot sequence. However, the secondary CPU still needs it to be done when loading up the MMU. Reported-by: Erhard F. <erhard_f@mailbox.org> Fixes: 215b823707ce ("powerpc/32s: set up an early static hash table for KASAN") Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
* | | | | Bluetooth: Fix regression with minimum encryption key size alignmentMarcel Holtmann2019-06-222-14/+37
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When trying to align the minimum encryption key size requirement for Bluetooth connections, it turns out doing this in a central location in the HCI connection handling code is not possible. Original Bluetooth version up to 2.0 used a security model where the L2CAP service would enforce authentication and encryption. Starting with Bluetooth 2.1 and Secure Simple Pairing that model has changed into that the connection initiator is responsible for providing an encrypted ACL link before any L2CAP communication can happen. Now connecting Bluetooth 2.1 or later devices with Bluetooth 2.0 and before devices are causing a regression. The encryption key size check needs to be moved out of the HCI connection handling into the L2CAP channel setup. To achieve this, the current check inside hci_conn_security() has been moved into l2cap_check_enc_key_size() helper function and then called from four decisions point inside L2CAP to cover all combinations of Secure Simple Pairing enabled devices and device using legacy pairing and legacy service security model. Fixes: d5bb334a8e17 ("Bluetooth: Align minimum encryption key size for LE and BR/EDR connections") Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=203643 Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Cc: stable@vger.kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds2019-06-2125-112/+123
|\ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull networking fixes from David Miller: 1) Fix leak of unqueued fragments in ipv6 nf_defrag, from Guillaume Nault. 2) Don't access the DDM interface unless the transceiver implements it in bnx2x, from Mauro S. M. Rodrigues. 3) Don't double fetch 'len' from userspace in sock_getsockopt(), from JingYi Hou. 4) Sign extension overflow in lio_core, from Colin Ian King. 5) Various netem bug fixes wrt. corrupted packets from Jakub Kicinski. 6) Fix epollout hang in hvsock, from Sunil Muthuswamy. 7) Fix regression in default fib6_type, from David Ahern. 8) Handle memory limits in tcp_fragment more appropriately, from Eric Dumazet. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (24 commits) tcp: refine memory limit test in tcp_fragment() inet: clear num_timeout reqsk_alloc() net: mvpp2: debugfs: Add pmap to fs dump ipv6: Default fib6_type to RTN_UNICAST when not set net: hns3: Fix inconsistent indenting net/af_iucv: always register net_device notifier net/af_iucv: build proper skbs for HiperTransport net/af_iucv: remove GFP_DMA restriction for HiperTransport net: dsa: mv88e6xxx: fix shift of FID bits in mv88e6185_g1_vtu_loadpurge() hvsock: fix epollout hang from race condition net/udp_gso: Allow TX timestamp with UDP GSO net: netem: fix use after free and double free with packet corruption net: netem: fix backlog accounting for corrupted GSO frames net: lio_core: fix potential sign-extension overflow on large shift tipc: pass tunnel dev as NULL to udp_tunnel(6)_xmit_skb ip6_tunnel: allow not to count pkts on tstats by passing dev as NULL ip_tunnel: allow not to count pkts on tstats by setting skb's dev to NULL tun: wake up waitqueues after IFF_UP is set net: remove duplicate fetch in sock_getsockopt tipc: fix issues with early FAILOVER_MSG from peer ...
| * | | | | tcp: refine memory limit test in tcp_fragment()Eric Dumazet2019-06-211-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | tcp_fragment() might be called for skbs in the write queue. Memory limits might have been exceeded because tcp_sendmsg() only checks limits at full skb (64KB) boundaries. Therefore, we need to make sure tcp_fragment() wont punish applications that might have setup very low SO_SNDBUF values. Fixes: f070ef2ac667 ("tcp: tcp_fragment() should apply sane memory limits") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Christoph Paasch <cpaasch@apple.com> Tested-by: Christoph Paasch <cpaasch@apple.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | | inet: clear num_timeout reqsk_alloc()Eric Dumazet2019-06-193-8/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | KMSAN caught uninit-value in tcp_create_openreq_child() [1] This is caused by a recent change, combined by the fact that TCP cleared num_timeout, num_retrans and sk fields only when a request socket was about to be queued. Under syncookie mode, a temporary request socket is used, and req->num_timeout could contain garbage. Lets clear these three fields sooner, there is really no point trying to defer this and risk other bugs. [1] BUG: KMSAN: uninit-value in tcp_create_openreq_child+0x157f/0x1cc0 net/ipv4/tcp_minisocks.c:526 CPU: 1 PID: 13357 Comm: syz-executor591 Not tainted 5.2.0-rc4+ #3 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: <IRQ> __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x191/0x1f0 lib/dump_stack.c:113 kmsan_report+0x162/0x2d0 mm/kmsan/kmsan.c:611 __msan_warning+0x75/0xe0 mm/kmsan/kmsan_instr.c:304 tcp_create_openreq_child+0x157f/0x1cc0 net/ipv4/tcp_minisocks.c:526 tcp_v6_syn_recv_sock+0x761/0x2d80 net/ipv6/tcp_ipv6.c:1152 tcp_get_cookie_sock+0x16e/0x6b0 net/ipv4/syncookies.c:209 cookie_v6_check+0x27e0/0x29a0 net/ipv6/syncookies.c:252 tcp_v6_cookie_check net/ipv6/tcp_ipv6.c:1039 [inline] tcp_v6_do_rcv+0xf1c/0x1ce0 net/ipv6/tcp_ipv6.c:1344 tcp_v6_rcv+0x60b7/0x6a30 net/ipv6/tcp_ipv6.c:1554 ip6_protocol_deliver_rcu+0x1433/0x22f0 net/ipv6/ip6_input.c:397 ip6_input_finish net/ipv6/ip6_input.c:438 [inline] NF_HOOK include/linux/netfilter.h:305 [inline] ip6_input+0x2af/0x340 net/ipv6/ip6_input.c:447 dst_input include/net/dst.h:439 [inline] ip6_rcv_finish net/ipv6/ip6_input.c:76 [inline] NF_HOOK include/linux/netfilter.h:305 [inline] ipv6_rcv+0x683/0x710 net/ipv6/ip6_input.c:272 __netif_receive_skb_one_core net/core/dev.c:4981 [inline] __netif_receive_skb net/core/dev.c:5095 [inline] process_backlog+0x721/0x1410 net/core/dev.c:5906 napi_poll net/core/dev.c:6329 [inline] net_rx_action+0x738/0x1940 net/core/dev.c:6395 __do_softirq+0x4ad/0x858 kernel/softirq.c:293 do_softirq_own_stack+0x49/0x80 arch/x86/entry/entry_64.S:1052 </IRQ> do_softirq kernel/softirq.c:338 [inline] __local_bh_enable_ip+0x199/0x1e0 kernel/softirq.c:190 local_bh_enable+0x36/0x40 include/linux/bottom_half.h:32 rcu_read_unlock_bh include/linux/rcupdate.h:682 [inline] ip6_finish_output2+0x213f/0x2670 net/ipv6/ip6_output.c:117 ip6_finish_output+0xae4/0xbc0 net/ipv6/ip6_output.c:150 NF_HOOK_COND include/linux/netfilter.h:294 [inline] ip6_output+0x5d3/0x720 net/ipv6/ip6_output.c:167 dst_output include/net/dst.h:433 [inline] NF_HOOK include/linux/netfilter.h:305 [inline] ip6_xmit+0x1f53/0x2650 net/ipv6/ip6_output.c:271 inet6_csk_xmit+0x3df/0x4f0 net/ipv6/inet6_connection_sock.c:135 __tcp_transmit_skb+0x4076/0x5b40 net/ipv4/tcp_output.c:1156 tcp_transmit_skb net/ipv4/tcp_output.c:1172 [inline] tcp_write_xmit+0x39a9/0xa730 net/ipv4/tcp_output.c:2397 __tcp_push_pending_frames+0x124/0x4e0 net/ipv4/tcp_output.c:2573 tcp_send_fin+0xd43/0x1540 net/ipv4/tcp_output.c:3118 tcp_close+0x16ba/0x1860 net/ipv4/tcp.c:2403 inet_release+0x1f7/0x270 net/ipv4/af_inet.c:427 inet6_release+0xaf/0x100 net/ipv6/af_inet6.c:470 __sock_release net/socket.c:601 [inline] sock_close+0x156/0x490 net/socket.c:1273 __fput+0x4c9/0xba0 fs/file_table.c:280 ____fput+0x37/0x40 fs/file_table.c:313 task_work_run+0x22e/0x2a0 kernel/task_work.c:113 tracehook_notify_resume include/linux/tracehook.h:185 [inline] exit_to_usermode_loop arch/x86/entry/common.c:168 [inline] prepare_exit_to_usermode+0x39d/0x4d0 arch/x86/entry/common.c:199 syscall_return_slowpath+0x90/0x5c0 arch/x86/entry/common.c:279 do_syscall_64+0xe2/0xf0 arch/x86/entry/common.c:305 entry_SYSCALL_64_after_hwframe+0x63/0xe7 RIP: 0033:0x401d50 Code: 01 f0 ff ff 0f 83 40 0d 00 00 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 83 3d dd 8d 2d 00 00 75 14 b8 03 00 00 00 0f 05 <48> 3d 01 f0 ff ff 0f 83 14 0d 00 00 c3 48 83 ec 08 e8 7a 02 00 00 RSP: 002b:00007fff1cf58cf8 EFLAGS: 00000246 ORIG_RAX: 0000000000000003 RAX: 0000000000000000 RBX: 0000000000000004 RCX: 0000000000401d50 RDX: 000000000000001c RSI: 0000000000000000 RDI: 0000000000000003 RBP: 00000000004a9050 R08: 0000000020000040 R09: 000000000000001c R10: 0000000020004004 R11: 0000000000000246 R12: 0000000000402ef0 R13: 0000000000402f80 R14: 0000000000000000 R15: 0000000000000000 Uninit was created at: kmsan_save_stack_with_flags mm/kmsan/kmsan.c:201 [inline] kmsan_internal_poison_shadow+0x53/0xa0 mm/kmsan/kmsan.c:160 kmsan_kmalloc+0xa4/0x130 mm/kmsan/kmsan_hooks.c:177 kmem_cache_alloc+0x534/0xb00 mm/slub.c:2781 reqsk_alloc include/net/request_sock.h:84 [inline] inet_reqsk_alloc+0xa8/0x600 net/ipv4/tcp_input.c:6384 cookie_v6_check+0xadb/0x29a0 net/ipv6/syncookies.c:173 tcp_v6_cookie_check net/ipv6/tcp_ipv6.c:1039 [inline] tcp_v6_do_rcv+0xf1c/0x1ce0 net/ipv6/tcp_ipv6.c:1344 tcp_v6_rcv+0x60b7/0x6a30 net/ipv6/tcp_ipv6.c:1554 ip6_protocol_deliver_rcu+0x1433/0x22f0 net/ipv6/ip6_input.c:397 ip6_input_finish net/ipv6/ip6_input.c:438 [inline] NF_HOOK include/linux/netfilter.h:305 [inline] ip6_input+0x2af/0x340 net/ipv6/ip6_input.c:447 dst_input include/net/dst.h:439 [inline] ip6_rcv_finish net/ipv6/ip6_input.c:76 [inline] NF_HOOK include/linux/netfilter.h:305 [inline] ipv6_rcv+0x683/0x710 net/ipv6/ip6_input.c:272 __netif_receive_skb_one_core net/core/dev.c:4981 [inline] __netif_receive_skb net/core/dev.c:5095 [inline] process_backlog+0x721/0x1410 net/core/dev.c:5906 napi_poll net/core/dev.c:6329 [inline] net_rx_action+0x738/0x1940 net/core/dev.c:6395 __do_softirq+0x4ad/0x858 kernel/softirq.c:293 do_softirq_own_stack+0x49/0x80 arch/x86/entry/entry_64.S:1052 do_softirq kernel/softirq.c:338 [inline] __local_bh_enable_ip+0x199/0x1e0 kernel/softirq.c:190 local_bh_enable+0x36/0x40 include/linux/bottom_half.h:32 rcu_read_unlock_bh include/linux/rcupdate.h:682 [inline] ip6_finish_output2+0x213f/0x2670 net/ipv6/ip6_output.c:117 ip6_finish_output+0xae4/0xbc0 net/ipv6/ip6_output.c:150 NF_HOOK_COND include/linux/netfilter.h:294 [inline] ip6_output+0x5d3/0x720 net/ipv6/ip6_output.c:167 dst_output include/net/dst.h:433 [inline] NF_HOOK include/linux/netfilter.h:305 [inline] ip6_xmit+0x1f53/0x2650 net/ipv6/ip6_output.c:271 inet6_csk_xmit+0x3df/0x4f0 net/ipv6/inet6_connection_sock.c:135 __tcp_transmit_skb+0x4076/0x5b40 net/ipv4/tcp_output.c:1156 tcp_transmit_skb net/ipv4/tcp_output.c:1172 [inline] tcp_write_xmit+0x39a9/0xa730 net/ipv4/tcp_output.c:2397 __tcp_push_pending_frames+0x124/0x4e0 net/ipv4/tcp_output.c:2573 tcp_send_fin+0xd43/0x1540 net/ipv4/tcp_output.c:3118 tcp_close+0x16ba/0x1860 net/ipv4/tcp.c:2403 inet_release+0x1f7/0x270 net/ipv4/af_inet.c:427 inet6_release+0xaf/0x100 net/ipv6/af_inet6.c:470 __sock_release net/socket.c:601 [inline] sock_close+0x156/0x490 net/socket.c:1273 __fput+0x4c9/0xba0 fs/file_table.c:280 ____fput+0x37/0x40 fs/file_table.c:313 task_work_run+0x22e/0x2a0 kernel/task_work.c:113 tracehook_notify_resume include/linux/tracehook.h:185 [inline] exit_to_usermode_loop arch/x86/entry/common.c:168 [inline] prepare_exit_to_usermode+0x39d/0x4d0 arch/x86/entry/common.c:199 syscall_return_slowpath+0x90/0x5c0 arch/x86/entry/common.c:279 do_syscall_64+0xe2/0xf0 arch/x86/entry/common.c:305 entry_SYSCALL_64_after_hwframe+0x63/0xe7 Fixes: 336c39a03151 ("tcp: undo init congestion window on false SYNACK timeout") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Soheil Hassas Yeganeh <soheil@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | | net: mvpp2: debugfs: Add pmap to fs dumpNathan Huckleberry2019-06-191-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There was an unused variable 'mvpp2_dbgfs_prs_pmap_fops' Added a usage consistent with other fops to dump pmap to userspace. Cc: clang-built-linux@googlegroups.com Link: https://github.com/ClangBuiltLinux/linux/issues/529 Signed-off-by: Nathan Huckleberry <nhuck@google.com> Tested-by: Nick Desaulniers <ndesaulniers@google.com> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | | ipv6: Default fib6_type to RTN_UNICAST when not setDavid Ahern2019-06-191-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A user reported that routes are getting installed with type 0 (RTN_UNSPEC) where before the routes were RTN_UNICAST. One example is from accel-ppp which apparently still uses the ioctl interface and does not set rtmsg_type. Another is the netlink interface where ipv6 does not require rtm_type to be set (v4 does). Prior to the commit in the Fixes tag the ipv6 stack converted type 0 to RTN_UNICAST, so restore that behavior. Fixes: e8478e80e5a7 ("net/ipv6: Save route type in rt6_info") Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | | net: hns3: Fix inconsistent indentingKrzysztof Kozlowski2019-06-191-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix wrong indentation of goto return. Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | | Merge branch 'af_iucv-fixes'David S. Miller2019-06-191-13/+36
| |\ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Julian Wiedmann says: ==================== net/af_iucv: fixes 2019-06-18 I spent a few cycles on transmit problems for af_iucv over regular netdevices - please apply the following fixes to -net. The first patch allows for skb allocations outside of GFP_DMA, while the second patch respects that drivers might use skb_cow_head() and/or want additional dev->needed_headroom. Patch 3 is for a separate issue, where we didn't setup some of the netdevice-specific infrastructure when running as a z/VM guest. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| | * | | | | net/af_iucv: always register net_device notifierJulian Wiedmann2019-06-191-7/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Even when running as VM guest (ie pr_iucv != NULL), af_iucv can still open HiperTransport-based connections. For robust operation these connections require the af_iucv_netdev_notifier, so register it unconditionally. Also handle any error that register_netdevice_notifier() returns. Fixes: 9fbd87d41392 ("af_iucv: handle netdev events") Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com> Reviewed-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| | * | | | | net/af_iucv: build proper skbs for HiperTransportJulian Wiedmann2019-06-191-5/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The HiperSockets-based transport path in af_iucv is still too closely entangled with qeth. With commit a647a02512ca ("s390/qeth: speed-up L3 IQD xmit"), the relevant xmit code in qeth has begun to use skb_cow_head(). So to avoid unnecessary skb head expansions, af_iucv must learn to 1) respect dev->needed_headroom when allocating skbs, and 2) drop the header reference before cloning the skb. While at it, also stop hard-coding the LL-header creation stage and just use the appropriate helper. Fixes: a647a02512ca ("s390/qeth: speed-up L3 IQD xmit") Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| | * | | | | net/af_iucv: remove GFP_DMA restriction for HiperTransportJulian Wiedmann2019-06-191-1/+5
| |/ / / / / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | af_iucv sockets over z/VM IUCV require that their skbs are allocated in DMA memory. This restriction doesn't apply to connections over HiperSockets. So only set this limit for z/VM IUCV sockets, thereby increasing the likelihood that the large (and linear!) allocations for HiperTransport messages succeed. Fixes: 3881ac441f64 ("af_iucv: add HiperSockets transport") Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com> Reviewed-by: Ursula Braun <ubraun@linux.ibm.com> Reviewed-by: Hendrik Brueckner <brueckner@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | | net: dsa: mv88e6xxx: fix shift of FID bits in mv88e6185_g1_vtu_loadpurge()Rasmus Villemoes2019-06-191-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The comment is correct, but the code ends up moving the bits four places too far, into the VTUOp field. Fixes: 11ea809f1a74 (net: dsa: mv88e6xxx: support 256 databases) Signed-off-by: Rasmus Villemoes <rasmus.villemoes@prevas.dk> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | | Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfDavid S. Miller2019-06-183-14/+14
| |\ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pablo Neira Ayuso says: ==================== Netfilter fixes for net 1) Module autoload for masquerade and redirection does not work. 2) Leak in unqueued packets in nf_ct_frag6_queue(). Ignore duplicated fragments, pretend they are placed into the queue. Patches from Guillaume Nault. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| | * | | | | netfilter: ipv6: nf_defrag: accept duplicate fragments againGuillaume Nault2019-06-071-3/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When fixing the skb leak introduced by the conversion to rbtree, I forgot about the special case of duplicate fragments. The condition under the 'insert_error' label isn't effective anymore as nf_ct_frg6_gather() doesn't override the returned value anymore. So duplicate fragments now get NF_DROP verdict. To accept duplicate fragments again, handle them specially as soon as inet_frag_queue_insert() reports them. Return -EINPROGRESS which will translate to NF_STOLEN verdict, like any accepted fragment. However, such packets don't carry any new information and aren't queued, so we just drop them immediately. Fixes: a0d56cb911ca ("netfilter: ipv6: nf_defrag: fix leakage of unqueued fragments") Signed-off-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| | * | | | | netfilter: ipv6: nf_defrag: fix leakage of unqueued fragmentsGuillaume Nault2019-06-041-7/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | With commit 997dd9647164 ("net: IP6 defrag: use rbtrees in nf_conntrack_reasm.c"), nf_ct_frag6_reasm() is now called from nf_ct_frag6_queue(). With this change, nf_ct_frag6_queue() can fail after the skb has been added to the fragment queue and nf_ct_frag6_gather() was adapted to handle this case. But nf_ct_frag6_queue() can still fail before the fragment has been queued. nf_ct_frag6_gather() can't handle this case anymore, because it has no way to know if nf_ct_frag6_queue() queued the fragment before failing. If it didn't, the skb is lost as the error code is overwritten with -EINPROGRESS. Fix this by setting -EINPROGRESS directly in nf_ct_frag6_queue(), so that nf_ct_frag6_gather() can propagate the error as is. Fixes: 997dd9647164 ("net: IP6 defrag: use rbtrees in nf_conntrack_reasm.c") Signed-off-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| | * | | | | netfilter: nf_tables: fix module autoload with inet familyPablo Neira Ayuso2019-05-312-4/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Use MODULE_ALIAS_NFT_EXPR() to make happy the inet family with nat. Fixes: 63ce3940f3ab ("netfilter: nft_redir: add inet support") Fixes: 071657d2c38c ("netfilter: nft_masq: add inet support") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| * | | | | | hvsock: fix epollout hang from race conditionSunil Muthuswamy2019-06-181-31/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, hvsock can enter into a state where epoll_wait on EPOLLOUT will not return even when the hvsock socket is writable, under some race condition. This can happen under the following sequence: - fd = socket(hvsocket) - fd_out = dup(fd) - fd_in = dup(fd) - start a writer thread that writes data to fd_out with a combination of epoll_wait(fd_out, EPOLLOUT) and - start a reader thread that reads data from fd_in with a combination of epoll_wait(fd_in, EPOLLIN) - On the host, there are two threads that are reading/writing data to the hvsocket stack: hvs_stream_has_space hvs_notify_poll_out vsock_poll sock_poll ep_poll Race condition: check for epollout from ep_poll(): assume no writable space in the socket hvs_stream_has_space() returns 0 check for epollin from ep_poll(): assume socket has some free space < HVS_PKT_LEN(HVS_SEND_BUF_SIZE) hvs_stream_has_space() will clear the channel pending send size host will not notify the guest because the pending send size has been cleared and so the hvsocket will never mark the socket writable Now, the EPOLLOUT will never return even if the socket write buffer is empty. The fix is to set the pending size to the default size and never change it. This way the host will always notify the guest whenever the writable space is bigger than the pending size. The host is already optimized to *only* notify the guest when the pending size threshold boundary is crossed and not everytime. This change also reduces the cpu usage somewhat since hv_stream_has_space() is in the hotpath of send: vsock_stream_sendmsg()->hv_stream_has_space() Earlier hv_stream_has_space was setting/clearing the pending size on every call. Signed-off-by: Sunil Muthuswamy <sunilmut@microsoft.com> Reviewed-by: Dexuan Cui <decui@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | | | net/udp_gso: Allow TX timestamp with UDP GSOFred Klassen2019-06-181-0/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fixes an issue where TX Timestamps are not arriving on the error queue when UDP_SEGMENT CMSG type is combined with CMSG type SO_TIMESTAMPING. This can be illustrated with an updated updgso_bench_tx program which includes the '-T' option to test for this condition. It also introduces the '-P' option which will call poll() before reading the error queue. ./udpgso_bench_tx -4ucTPv -S 1472 -l2 -D 172.16.120.18 poll timeout udp tx: 0 MB/s 1 calls/s 1 msg/s The "poll timeout" message above indicates that TX timestamp never arrived. This patch preserves tx_flags for the first UDP GSO segment. Only the first segment is timestamped, even though in some cases there may be benefital in timestamping both the first and last segment. Factors in deciding on first segment timestamp only: - Timestamping both first and last segmented is not feasible. Hardware can only have one outstanding TS request at a time. - Timestamping last segment may under report network latency of the previous segments. Even though the doorbell is suppressed, the ring producer counter has been incremented. - Timestamping the first segment has the upside in that it reports timestamps from the application's view, e.g. RTT. - Timestamping the first segment has the downside that it may underreport tx host network latency. It appears that we have to pick one or the other. And possibly follow-up with a config flag to choose behavior. v2: Remove tests as noted by Willem de Bruijn <willemb@google.com> Moving tests from net to net-next v3: Update only relevant tx_flag bits as per Willem de Bruijn <willemb@google.com> v4: Update comments and commit message as per Willem de Bruijn <willemb@google.com> Fixes: ee80d1ebe5ba ("udp: add udp gso") Signed-off-by: Fred Klassen <fklassen@appneta.com> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | | | Merge branch 'net-netem-fix-issues-with-corrupting-GSO-frames'David S. Miller2019-06-181-12/+14
| |\ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Jakub Kicinski says: ==================== net: netem: fix issues with corrupting GSO frames Corrupting GSO frames currently leads to crashes, due to skb use after free. These stem from the skb list handling - the segmented skbs come back on a list, and this list is not properly unlinked before enqueuing the segments. Turns out this condition is made very likely to occur because of another bug - in backlog accounting. Segments are counted twice, which means qdisc's limit gets reached leading to drops and making the use after free very likely to happen. The bugs are fixed in order in which they were added to the tree. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| | * | | | | | net: netem: fix use after free and double free with packet corruptionJakub Kicinski2019-06-181-8/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Brendan reports that the use of netem's packet corruption capability leads to strange crashes. This seems to be caused by commit d66280b12bd7 ("net: netem: use a list in addition to rbtree") which uses skb->next pointer to construct a fast-path queue of in-order skbs. Packet corruption code has to invoke skb_gso_segment() in case of skbs in need of GSO. skb_gso_segment() returns a list of skbs. If next pointers of the skbs on that list do not get cleared fast path list may point to freed skbs or skbs which are also on the RB tree. Let's say skb gets segmented into 3 frames: A -> B -> C A gets hooked to the t_head t_tail list by tfifo_enqueue(), but it's next pointer didn't get cleared so we have: h t |/ A -> B -> C Now if B and C get also get enqueued successfully all is fine, because tfifo_enqueue() will overwrite the list in order. IOW: Enqueue B: h t | | A -> B C Enqueue C: h t | | A -> B -> C But if B and C get reordered we may end up with: h t RB tree |/ | A -> B -> C B \ C Or if they get dropped just: h t |/ A -> B -> C where A and B are already freed. To reproduce either limit has to be set low to cause freeing of segs or reorders have to happen (due to delay jitter). Note that we only have to mark the first segment as not on the list, "finish_segs" handling of other frags already does that. Another caveat is that qdisc_drop_all() still has to free all segments correctly in case of drop of first segment, therefore we re-link segs before calling it. v2: - re-link before drop, v1 was leaking non-first segs if limit was hit at the first seg - better commit message which lead to discovering the above :) Reported-by: Brendan Galloway <brendan.galloway@netronome.com> Fixes: d66280b12bd7 ("net: netem: use a list in addition to rbtree") Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com> Acked-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| | * | | | | | net: netem: fix backlog accounting for corrupted GSO framesJakub Kicinski2019-06-181-5/+8
| |/ / / / / / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When GSO frame has to be corrupted netem uses skb_gso_segment() to produce the list of frames, and re-enqueues the segments one by one. The backlog length has to be adjusted to account for new frames. The current calculation is incorrect, leading to wrong backlog lengths in the parent qdisc (both bytes and packets), and incorrect packet backlog count in netem itself. Parent backlog goes negative, netem's packet backlog counts all non-first segments twice (thus remaining non-zero even after qdisc is emptied). Move the variables used to count the adjustment into local scope to make 100% sure they aren't used at any stage in backports. Fixes: 6071bd1aa13e ("netem: Segment GSO packets on enqueue") Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com> Acked-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | | | net: lio_core: fix potential sign-extension overflow on large shiftColin Ian King2019-06-181-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Left shifting the signed int value 1 by 31 bits has undefined behaviour and the shift amount oq_no can be as much as 63. Fix this by using BIT_ULL(oq_no) instead. Addresses-Coverity: ("Bad shift operation") Fixes: f21fb3ed364b ("Add support of Cavium Liquidio ethernet adapters") Signed-off-by: Colin Ian King <colin.king@canonical.com> Reviewed-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | | | Merge branch 'net-fix-quite-a-few-dst_cache-crashes-reported-by-syzbot'David S. Miller2019-06-183-11/+15
| |\ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Xin Long says: ==================== net: fix quite a few dst_cache crashes reported by syzbot There are two kinds of crashes reported many times by syzbot with no reproducer. Call Traces are like: BUG: KASAN: slab-out-of-bounds in rt_cache_valid+0x158/0x190 net/ipv4/route.c:1556 rt_cache_valid+0x158/0x190 net/ipv4/route.c:1556 __mkroute_output net/ipv4/route.c:2332 [inline] ip_route_output_key_hash_rcu+0x819/0x2d50 net/ipv4/route.c:2564 ip_route_output_key_hash+0x1ef/0x360 net/ipv4/route.c:2393 __ip_route_output_key include/net/route.h:125 [inline] ip_route_output_flow+0x28/0xc0 net/ipv4/route.c:2651 ip_route_output_key include/net/route.h:135 [inline] ... or: kasan: GPF could be caused by NULL-ptr deref or user memory access RIP: 0010:dst_dev_put+0x24/0x290 net/core/dst.c:168 <IRQ> rt_fibinfo_free_cpus net/ipv4/fib_semantics.c:200 [inline] free_fib_info_rcu+0x2e1/0x490 net/ipv4/fib_semantics.c:217 __rcu_reclaim kernel/rcu/rcu.h:240 [inline] rcu_do_batch kernel/rcu/tree.c:2437 [inline] invoke_rcu_callbacks kernel/rcu/tree.c:2716 [inline] rcu_process_callbacks+0x100a/0x1ac0 kernel/rcu/tree.c:2697 ... They were caused by the fib_nh_common percpu member 'nhc_pcpu_rth_output' overwritten by another percpu variable 'dev->tstats' access overflow in tipc udp media xmit path when counting packets on a non tunnel device. The fix is to make udp tunnel work with no tunnel device by allowing not to count packets on the tstats when the tunnel dev is NULL in Patches 1/3 and 2/3, then pass a NULL tunnel dev in tipc_udp_tunnel() in Patch 3/3. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| | * | | | | | tipc: pass tunnel dev as NULL to udp_tunnel(6)_xmit_skbXin Long2019-06-181-5/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | udp_tunnel(6)_xmit_skb() called by tipc_udp_xmit() expects a tunnel device to count packets on dev->tstats, a perpcu variable. However, TIPC is using udp tunnel with no tunnel device, and pass the lower dev, like veth device that only initializes dev->lstats(a perpcu variable) when creating it. Later iptunnel_xmit_stats() called by ip(6)tunnel_xmit() thinks the dev as a tunnel device, and uses dev->tstats instead of dev->lstats. tstats' each pointer points to a bigger struct than lstats, so when tstats->tx_bytes is increased, other percpu variable's members could be overwritten. syzbot has reported quite a few crashes due to fib_nh_common percpu member 'nhc_pcpu_rth_output' overwritten, call traces are like: BUG: KASAN: slab-out-of-bounds in rt_cache_valid+0x158/0x190 net/ipv4/route.c:1556 rt_cache_valid+0x158/0x190 net/ipv4/route.c:1556 __mkroute_output net/ipv4/route.c:2332 [inline] ip_route_output_key_hash_rcu+0x819/0x2d50 net/ipv4/route.c:2564 ip_route_output_key_hash+0x1ef/0x360 net/ipv4/route.c:2393 __ip_route_output_key include/net/route.h:125 [inline] ip_route_output_flow+0x28/0xc0 net/ipv4/route.c:2651 ip_route_output_key include/net/route.h:135 [inline] ... or: kasan: GPF could be caused by NULL-ptr deref or user memory access RIP: 0010:dst_dev_put+0x24/0x290 net/core/dst.c:168 <IRQ> rt_fibinfo_free_cpus net/ipv4/fib_semantics.c:200 [inline] free_fib_info_rcu+0x2e1/0x490 net/ipv4/fib_semantics.c:217 __rcu_reclaim kernel/rcu/rcu.h:240 [inline] rcu_do_batch kernel/rcu/tree.c:2437 [inline] invoke_rcu_callbacks kernel/rcu/tree.c:2716 [inline] rcu_process_callbacks+0x100a/0x1ac0 kernel/rcu/tree.c:2697 ... The issue exists since tunnel stats update is moved to iptunnel_xmit by Commit 039f50629b7f ("ip_tunnel: Move stats update to iptunnel_xmit()"), and here to fix it by passing a NULL tunnel dev to udp_tunnel(6)_xmit_skb so that the packets counting won't happen on dev->tstats. Reported-by: syzbot+9d4c12bfd45a58738d0a@syzkaller.appspotmail.com Reported-by: syzbot+a9e23ea2aa21044c2798@syzkaller.appspotmail.com Reported-by: syzbot+c4c4b2bb358bb936ad7e@syzkaller.appspotmail.com Reported-by: syzbot+0290d2290a607e035ba1@syzkaller.appspotmail.com Reported-by: syzbot+a43d8d4e7e8a7a9e149e@syzkaller.appspotmail.com Reported-by: syzbot+a47c5f4c6c00fc1ed16e@syzkaller.appspotmail.com Fixes: 039f50629b7f ("ip_tunnel: Move stats update to iptunnel_xmit()") Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| | * | | | | | ip6_tunnel: allow not to count pkts on tstats by passing dev as NULLXin Long2019-06-181-3/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A similar fix to Patch "ip_tunnel: allow not to count pkts on tstats by setting skb's dev to NULL" is also needed by ip6_tunnel. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| | * | | | | | ip_tunnel: allow not to count pkts on tstats by setting skb's dev to NULLXin Long2019-06-181-3/+6
| |/ / / / / / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | iptunnel_xmit() works as a common function, also used by a udp tunnel which doesn't have to have a tunnel device, like how TIPC works with udp media. In these cases, we should allow not to count pkts on dev's tstats, so that udp tunnel can work with no tunnel device safely. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>