aboutsummaryrefslogtreecommitdiffstats
path: root/kernel/sched.c
Commit message (Collapse)AuthorAgeFilesLines
* sched: Fix cross-sched-class wakeup preemptionPeter Zijlstra2010-11-111-11/+26
| | | | | | | | | | | | | | | | Instead of dealing with sched classes inside each check_preempt_curr() implementation, pull out this logic into the generic wakeup preemption path. This fixes a hang in KVM (and others) where we are waiting for the stop machine thread to run ... Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de> Tested-by: Marcelo Tosatti <mtosatti@redhat.com> Tested-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1288891946.2039.31.camel@laptop> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* sched: Use group weight, idle cpu metrics to fix imbalances during idleSuresh Siddha2010-11-101-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently we consider a sched domain to be well balanced when the imbalance is less than the domain's imablance_pct. As the number of cores and threads are increasing, current values of imbalance_pct (for example 25% for a NUMA domain) are not enough to detect imbalances like: a) On a WSM-EP system (two sockets, each having 6 cores and 12 logical threads), 24 cpu-hogging tasks get scheduled as 13 on one socket and 11 on another socket. Leading to an idle HT cpu. b) On a hypothetial 2 socket NHM-EX system (each socket having 8 cores and 16 logical threads), 16 cpu-hogging tasks can get scheduled as 9 on one socket and 7 on another socket. Leaving one core in a socket idle whereas in another socket we have a core having both its HT siblings busy. While this issue can be fixed by decreasing the domain's imbalance_pct (by making it a function of number of logical cpus in the domain), it can potentially cause more task migrations across sched groups in an overloaded case. Fix this by using imbalance_pct only during newly_idle and busy load balancing. And during idle load balancing, check if there is an imbalance in number of idle cpu's across the busiest and this sched_group or if the busiest group has more tasks than its weight that the idle cpu in this_group can pull. Reported-by: Nikhil Rao <ncrao@google.com> Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1284760952.2676.11.camel@sbsiddha-MOBL3.sc.intel.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* Merge branch 'sched-fixes-for-linus' of ↵Linus Torvalds2010-10-291-4/+4
|\ | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: sched_stat: Update sched_info_queue/dequeue() code comments sched, cgroup: Fixup broken cgroup movement
| * sched, cgroup: Fixup broken cgroup movementPeter Zijlstra2010-10-221-4/+4
| | | | | | | | | | | | | | | | | | | | | | Dima noticed that we fail to correct the ->vruntime of sleeping tasks when we move them between cgroups. Reported-by: Dima Zavin <dima@android.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Tested-by: Mike Galbraith <efault@gmx.de> LKML-Reference: <1287150604.29097.1513.camel@twins> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* | Merge branch 'sched-core-for-linus' of ↵Linus Torvalds2010-10-211-45/+246
|\| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (29 commits) sched: Export account_system_vtime() sched: Call tick_check_idle before __irq_enter sched: Remove irq time from available CPU power sched: Do not account irq time to current task x86: Add IRQ_TIME_ACCOUNTING sched: Add IRQ_TIME_ACCOUNTING, finer accounting of irq time sched: Add a PF flag for ksoftirqd identification sched: Consolidate account_system_vtime extern declaration sched: Fix softirq time accounting sched: Drop group_capacity to 1 only if local group has extra capacity sched: Force balancing on newidle balance if local group has capacity sched: Set group_imb only a task can be pulled from the busiest cpu sched: Do not consider SCHED_IDLE tasks to be cache hot sched: Drop all load weight manipulation for RT tasks sched: Create special class for stop/migrate work sched: Unindent labels sched: Comment updates: fix default latency and granularity numbers tracing/sched: Add sched_pi_setprio tracepoint sched: Give CPU bound RT tasks preference sched: Try not to migrate higher priority RT tasks ...
| * sched: Export account_system_vtime()Ingo Molnar2010-10-181-0/+1
| | | | | | | | | | | | | | | | | | | | | | KVM uses it for example: ERROR: "account_system_vtime" [arch/x86/kvm/kvm.ko] undefined! Cc: Venkatesh Pallipadi <venki@google.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1286237003-12406-3-git-send-email-venki@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * sched: Call tick_check_idle before __irq_enterVenkatesh Pallipadi2010-10-181-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When CPU is idle and on first interrupt, irq_enter calls tick_check_idle() to notify interruption from idle. But, there is a problem if this call is done after __irq_enter, as all routines in __irq_enter may find stale time due to yet to be done tick_check_idle. Specifically, trace calls in __irq_enter when they use global clock and also account_system_vtime change in this patch as it wants to use sched_clock_cpu() to do proper irq timing. But, tick_check_idle was moved after __irq_enter intentionally to prevent problem of unneeded ksoftirqd wakeups by the commit ee5f80a: irq: call __irq_enter() before calling the tick_idle_check Impact: avoid spurious ksoftirqd wakeups Moving tick_check_idle() before __irq_enter and wrapping it with local_bh_enable/disable would solve both the problems. Fixed-by: Yong Zhang <yong.zhang0@gmail.com> Signed-off-by: Venkatesh Pallipadi <venki@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1286237003-12406-9-git-send-email-venki@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * sched: Remove irq time from available CPU powerVenkatesh Pallipadi2010-10-181-0/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The idea was suggested by Peter Zijlstra here: http://marc.info/?l=linux-kernel&m=127476934517534&w=2 irq time is technically not available to the tasks running on the CPU. This patch removes irq time from CPU power piggybacking on sched_rt_avg_update(). Tested this by keeping CPU X busy with a network intensive task having 75% oa a single CPU irq processing (hard+soft) on a 4-way system. And start seven cycle soakers on the system. Without this change, there will be two tasks on each CPU. With this change, there is a single task on irq busy CPU X and remaining 7 tasks are spread around among other 3 CPUs. Signed-off-by: Venkatesh Pallipadi <venki@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1286237003-12406-8-git-send-email-venki@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * sched: Do not account irq time to current taskVenkatesh Pallipadi2010-10-181-3/+40
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Scheduler accounts both softirq and interrupt processing times to the currently running task. This means, if the interrupt processing was for some other task in the system, then the current task ends up being penalized as it gets shorter runtime than otherwise. Change sched task accounting to acoount only actual task time from currently running task. Now update_curr(), modifies the delta_exec to depend on rq->clock_task. Note that this change only handles CONFIG_IRQ_TIME_ACCOUNTING case. We can extend this to CONFIG_VIRT_CPU_ACCOUNTING with minimal effort. But, thats for later. This change will impact scheduling behavior in interrupt heavy conditions. Tested on a 4-way system with eth0 handled by CPU 2 and a network heavy task (nc) running on CPU 3 (and no RSS/RFS). With that I have CPU 2 spending 75%+ of its time in irq processing. CPU 3 spending around 35% time running nc task. Now, if I run another CPU intensive task on CPU 2, without this change /proc/<pid>/schedstat shows 100% of time accounted to this task. With this change, it rightly shows less than 25% accounted to this task as remaining time is actually spent on irq processing. Signed-off-by: Venkatesh Pallipadi <venki@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1286237003-12406-7-git-send-email-venki@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * sched: Add IRQ_TIME_ACCOUNTING, finer accounting of irq timeVenkatesh Pallipadi2010-10-181-0/+49
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | s390/powerpc/ia64 have support for CONFIG_VIRT_CPU_ACCOUNTING which does the fine granularity accounting of user, system, hardirq, softirq times. Adding that option on archs like x86 will be challenging however, given the state of TSC reliability on various platforms and also the overhead it will add in syscall entry exit. Instead, add a lighter variant that only does finer accounting of hardirq and softirq times, providing precise irq times (instead of timer tick based samples). This accounting is added with a new config option CONFIG_IRQ_TIME_ACCOUNTING so that there won't be any overhead for users not interested in paying the perf penalty. This accounting is based on sched_clock, with the code being generic. So, other archs may find it useful as well. This patch just adds the core logic and does not enable this logic yet. Signed-off-by: Venkatesh Pallipadi <venki@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1286237003-12406-5-git-send-email-venki@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * sched: Fix softirq time accountingVenkatesh Pallipadi2010-10-181-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Peter Zijlstra found a bug in the way softirq time is accounted in VIRT_CPU_ACCOUNTING on this thread: http://lkml.indiana.edu/hypermail//linux/kernel/1009.2/01366.html The problem is, softirq processing uses local_bh_disable internally. There is no way, later in the flow, to differentiate between whether softirq is being processed or is it just that bh has been disabled. So, a hardirq when bh is disabled results in time being wrongly accounted as softirq. Looking at the code a bit more, the problem exists in !VIRT_CPU_ACCOUNTING as well. As account_system_time() in normal tick based accouting also uses softirq_count, which will be set even when not in softirq with bh disabled. Peter also suggested solution of using 2*SOFTIRQ_OFFSET as irq count for local_bh_{disable,enable} and using just SOFTIRQ_OFFSET while softirq processing. The patch below does that and adds API in_serving_softirq() which returns whether we are currently processing softirq or not. Also changes one of the usages of softirq_count in net/sched/cls_cgroup.c to in_serving_softirq. Looks like many usages of in_softirq really want in_serving_softirq. Those changes can be made individually on a case by case basis. Signed-off-by: Venkatesh Pallipadi <venki@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1286237003-12406-2-git-send-email-venki@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * sched: Do not consider SCHED_IDLE tasks to be cache hotNikhil Rao2010-10-181-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | This patch adds a check in task_hot to return if the task has SCHED_IDLE policy. SCHED_IDLE tasks have very low weight, and when run with regular workloads, are typically scheduled many milliseconds apart. There is no need to consider these tasks hot for load balancing. Signed-off-by: Nikhil Rao <ncrao@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1287173550-30365-2-git-send-email-ncrao@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * sched: Drop all load weight manipulation for RT tasksLinus Walleij2010-10-181-6/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Load weights are for the CFS, they do not belong in the RT task. This makes all RT scheduling classes leave the CFS weights alone. This fixes a real bug as well: I noticed the following phonomena: a process elevated to SCHED_RR forks with SCHED_RESET_ON_FORK set, and the child is indeed SCHED_OTHER, and the niceval is indeed reset to 0. However the weight inserted by set_load_weight() remains at 0, giving the task insignificat priority. With this fix, the weight is reset to what the task had before being elevated to SCHED_RR/SCHED_FIFO. Cc: Lennart Poettering <lennart@poettering.net> Cc: stable@kernel.org Signed-off-by: Linus Walleij <linus.walleij@stericsson.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1286807811-10568-1-git-send-email-linus.walleij@stericsson.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * sched: Create special class for stop/migrate workPeter Zijlstra2010-10-181-9/+45
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In order to separate the stop/migrate work thread from the SCHED_FIFO implementation, create a special class for it that is of higher priority than SCHED_FIFO itself. This currently solves a problem where cpu-hotplug consumes so much cpu-time that the SCHED_FIFO class gets throttled, but has the bandwidth replenishment timer pending on the now dead cpu. It is also required for when we add the planned deadline scheduling class above SCHED_FIFO, as the stop/migrate thread still needs to transcent those tasks. Tested-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1285165776.2275.1022.camel@laptop> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * sched: Unindent labelsPeter Zijlstra2010-10-181-6/+6
| | | | | | | | | | | | | | | | Labels should be on column 0. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * Merge branch 'linus' into sched/coreIngo Molnar2010-10-141-4/+4
| |\ | | | | | | | | | | | | | | | Merge reason: update from -rc5 to -almost-final Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * | tracing/sched: Add sched_pi_setprio tracepointSteven Rostedt2010-09-211-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | Add a tracepoint that shows the priority of a task being boosted via priority inheritance. Cc: Gregory Haskins <ghaskins@novell.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
| * | Merge commit 'v2.6.36-rc5' into sched/coreIngo Molnar2010-09-211-0/+6
| |\ \ | | | | | | | | | | | | | | | | | | | | Merge reason: Pick up the latest fixes in -rc5. Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * | | sched: Remove branch hints within context_switch()Heiko Carstens2010-09-161-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | With 710390d9 "sched: Optimize branch hint in context_switch()" the branch hint logic within context_switch() got inversed. In fact the hints "if (likely(!mm))" and "if (likely(!prev->mm))" mean that it is likely that the previous and next task are kernel threads. That assumption is certainly counter intuitive, but Tim has shown that at least with his workload this is true. Nevertheless the truth is: it depends on the current workload. So just remove the annotations which also improves readability. Reported-by: Tim Blechmann <tim@klingt.org> Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> LKML-Reference: <20100916124225.GA2209@osiris.boeblingen.de.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * | | sched: Fix string comparison in /proc/sched_featuresMathieu Desnoyers2010-09-141-4/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix incorrect handling of the following case: INTERACTIVE INTERACTIVE_SOMETHING_ELSE The comparison only checks up to each element's length. Changelog since v1: - Embellish using some Rostedtisms. [ mingo: ^^ == smaller and cleaner ] Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Reviewed-by: Steven Rostedt <rostedt@goodmis.org> Cc: <stable@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tony Lindgren <tony@atomide.com> LKML-Reference: <20100913214700.GB16118@Krystal> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * | | sched: Add book scheduling domainHeiko Carstens2010-09-091-2/+75
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | On top of the SMT and MC scheduling domains this adds the BOOK scheduling domain. This is useful for NUMA like machines which do not have an interface which tells which piece of memory is attached to which node or where the hardware performs striping. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <20100831082844.253053798@de.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * | | sched: Merge cpu_to_core_group functionsHeiko Carstens2010-09-091-13/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Merge and simplify the two cpu_to_core_group variants so that the resulting function follows the same pattern like cpu_to_phys_group. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <20100831082843.953617555@de.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * | | sched: Remove unnecessary #ifdef CONFIG_SMPChristian Dietrich2010-09-081-3/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The CONFIG_SMP ifdef isn't necessary at this point, because it is checked in an outer ifdef level already and has no effect here. Cleanup only, no functional effect. Signed-off-by: Christian Dietrich <qy03fugy@stud.informatik.uni-erlangen.de> Cc: vamos-dev@i4.informatik.uni-erlangen.de Cc: Peter Zijlstra <peterz@infradead.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Tejun Heo <tj@kernel.org> LKML-Reference: <7a3a39ef3f765a4473cb026b1f204059568a7098.1283782701.git.qy03fugy@stud.informatik.uni-erlangen.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* | | | Merge branch 'perf-core-for-linus' of ↵Linus Torvalds2010-10-211-1/+1
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (163 commits) tracing: Fix compile issue for trace_sched_wakeup.c [S390] hardirq: remove pointless header file includes [IA64] Move local_softirq_pending() definition perf, powerpc: Fix power_pmu_event_init to not use event->ctx ftrace: Remove recursion between recordmcount and scripts/mod/empty jump_label: Add COND_STMT(), reducer wrappery perf: Optimize sw events perf: Use jump_labels to optimize the scheduler hooks jump_label: Add atomic_t interface jump_label: Use more consistent naming perf, hw_breakpoint: Fix crash in hw_breakpoint creation perf: Find task before event alloc perf: Fix task refcount bugs perf: Fix group moving irq_work: Add generic hardirq context callbacks perf_events: Fix transaction recovery in group_sched_in() perf_events: Fix bogus AMD64 generic TLB events perf_events: Fix bogus context time tracking tracing: Remove parent recording in latency tracer graph options tracing: Use one prologue for the preempt irqs off tracer function tracers ...
| * \ \ \ Merge branch 'linus' into perf/coreIngo Molnar2010-09-221-4/+4
| |\ \ \ \ | | | |_|/ | | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Conflicts: kernel/hw_breakpoint.c Merge reason: resolve the conflict. Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * | | | perf: Undo the per cpu-context timer stuffPeter Zijlstra2010-09-171-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Revert the timer per cpu-context timers because of unfortunate nohz interaction. Fixing that would have been somewhat ugly, so go back to driving things from the regular tick. Provide a jiffies interval feature for people who want slower rotations. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Stephane Eranian <eranian@google.com> Cc: Robert Richter <robert.richter@amd.com> Cc: Yinghai Lu <yinghai@kernel.org> LKML-Reference: <20100917093009.519845633@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * | | | Merge branch 'tip/perf/core' of ↵Ingo Molnar2010-09-151-0/+6
| |\ \ \ \ | | | |_|/ | | |/| | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace into perf/core
| * | | | perf: Per cpu-context rotation timerPeter Zijlstra2010-09-091-2/+0
| | |_|/ | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Give each cpu-context its own timer so that it is a self contained entity, this eases the way for per-pmu-per-cpu contexts as well as provides the basic infrastructure to allow different rotation times per pmu. Things to look at: - folding the tick and these TICK_NSEC timers - separate task context rotation Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: paulus <paulus@samba.org> Cc: stephane eranian <eranian@googlemail.com> Cc: Robert Richter <robert.richter@amd.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Lin Ming <ming.m.lin@intel.com> Cc: Yanmin <yanmin_zhang@linux.intel.com> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* | | | Merge branch 'core-rcu-for-linus' of ↵Linus Torvalds2010-10-211-0/+12
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (52 commits) sched: fix RCU lockdep splat from task_group() rcu: using ACCESS_ONCE() to observe the jiffies_stall/rnp->qsmask value sched: suppress RCU lockdep splat in task_fork_fair net: suppress RCU lockdep false positive in sock_update_classid rcu: move check from rcu_dereference_bh to rcu_read_lock_bh_held rcu: Add advice to PROVE_RCU_REPEATEDLY kernel config parameter rcu: Add tracing data to support queueing models rcu: fix sparse errors in rcutorture.c rcu: only one evaluation of arg in rcu_dereference_check() unless sparse kernel: Remove undead ifdef CONFIG_DEBUG_LOCK_ALLOC rcu: fix _oddness handling of verbose stall warnings rcu: performance fixes to TINY_PREEMPT_RCU callback checking rcu: upgrade stallwarn.txt documentation for CPU-bound RT processes vhost: add __rcu annotations rcu: add comment stating that list_empty() applies to RCU-protected lists rcu: apply TINY_PREEMPT_RCU read-side speedup to TREE_PREEMPT_RCU rcu: combine duplicate code, courtesy of CONFIG_PREEMPT_RCU rcu: Upgrade srcu_read_lock() docbook about SRCU grace periods rcu: document ways of stalling updates in low-memory situations rcu: repair code-duplication FIXMEs ...
| * | | | sched: fix RCU lockdep splat from task_group()Peter Zijlstra2010-10-071-0/+12
| | |_|/ | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This addresses the following RCU lockdep splat: [0.051203] CPU0: AMD QEMU Virtual CPU version 0.12.4 stepping 03 [0.052999] lockdep: fixing up alternatives. [0.054105] [0.054106] =================================================== [0.054999] [ INFO: suspicious rcu_dereference_check() usage. ] [0.054999] --------------------------------------------------- [0.054999] kernel/sched.c:616 invoked rcu_dereference_check() without protection! [0.054999] [0.054999] other info that might help us debug this: [0.054999] [0.054999] [0.054999] rcu_scheduler_active = 1, debug_locks = 1 [0.054999] 3 locks held by swapper/1: [0.054999] #0: (cpu_add_remove_lock){+.+.+.}, at: [<ffffffff814be933>] cpu_up+0x42/0x6a [0.054999] #1: (cpu_hotplug.lock){+.+.+.}, at: [<ffffffff810400d8>] cpu_hotplug_begin+0x2a/0x51 [0.054999] #2: (&rq->lock){-.-...}, at: [<ffffffff814be2f7>] init_idle+0x2f/0x113 [0.054999] [0.054999] stack backtrace: [0.054999] Pid: 1, comm: swapper Not tainted 2.6.35 #1 [0.054999] Call Trace: [0.054999] [<ffffffff81068054>] lockdep_rcu_dereference+0x9b/0xa3 [0.054999] [<ffffffff810325c3>] task_group+0x7b/0x8a [0.054999] [<ffffffff810325e5>] set_task_rq+0x13/0x40 [0.054999] [<ffffffff814be39a>] init_idle+0xd2/0x113 [0.054999] [<ffffffff814be78a>] fork_idle+0xb8/0xc7 [0.054999] [<ffffffff81068717>] ? mark_held_locks+0x4d/0x6b [0.054999] [<ffffffff814bcebd>] do_fork_idle+0x17/0x2b [0.054999] [<ffffffff814bc89b>] native_cpu_up+0x1c1/0x724 [0.054999] [<ffffffff814bcea6>] ? do_fork_idle+0x0/0x2b [0.054999] [<ffffffff814be876>] _cpu_up+0xac/0x127 [0.054999] [<ffffffff814be946>] cpu_up+0x55/0x6a [0.054999] [<ffffffff81ab562a>] kernel_init+0xe1/0x1ff [0.054999] [<ffffffff81003854>] kernel_thread_helper+0x4/0x10 [0.054999] [<ffffffff814c353c>] ? restore_args+0x0/0x30 [0.054999] [<ffffffff81ab5549>] ? kernel_init+0x0/0x1ff [0.054999] [<ffffffff81003850>] ? kernel_thread_helper+0x0/0x10 [0.056074] Booting Node 0, Processors #1lockdep: fixing up alternatives. [0.130045] #2lockdep: fixing up alternatives. [0.203089] #3 Ok. [0.275286] Brought up 4 CPUs [0.276005] Total of 4 processors activated (16017.17 BogoMIPS). The cgroup_subsys_state structures referenced by idle tasks are never freed, because the idle tasks should be part of the root cgroup, which is not removable. The problem is that while we do in-fact hold rq->lock, the newly spawned idle thread's cpu is not yet set to the correct cpu so the lockdep check in task_group(): lockdep_is_held(&task_rq(p)->lock) will fail. But this is a chicken and egg problem. Setting the CPU's runqueue requires that the CPU's runqueue already be set. ;-) So insert an RCU read-side critical section to avoid the complaint. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
* / | | security: remove unused parameter from security_task_setscheduler()KOSAKI Motohiro2010-10-211-2/+2
|/ / / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | All security modules shouldn't change sched_param parameter of security_task_setscheduler(). This is not only meaningless, but also make a harmful result if caller pass a static variable. This patch remove policy and sched_param parameter from security_task_setscheduler() becuase none of security module is using it. Cc: James Morris <jmorris@namei.org> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: James Morris <jmorris@namei.org>
* | / sched: Fix user time incorrectly accounted as system time on 32-bitStanislaw Gruszka2010-09-151-4/+4
| |/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We have 32-bit variable overflow possibility when multiply in task_times() and thread_group_times() functions. When the overflow happens then the scaled utime value becomes erroneously small and the scaled stime becomes i erroneously big. Reported here: https://bugzilla.redhat.com/show_bug.cgi?id=633037 https://bugzilla.kernel.org/show_bug.cgi?id=16559 Reported-by: Michael Chapman <redhat-bugzilla@very.puzzling.org> Reported-by: Ciriaco Garcia de Celis <sysman@etherpilot.com> Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Cc: <stable@kernel.org> # 2.6.32.19+ (partially) and 2.6.33+ LKML-Reference: <20100914143513.GB8415@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* | sched: Move sched_avg_update() to update_cpu_load()Suresh Siddha2010-09-091-0/+6
|/ | | | | | | | | | | | | | Currently sched_avg_update() (which updates rt_avg stats in the rq) is getting called from scale_rt_power() (in the load balance context) which doesn't take rq->lock. Fix it by moving the sched_avg_update() to more appropriate update_cpu_load() where the CFS load gets updated as well. Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1282596171.2694.3.camel@sbsiddha-MOBL3> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* mutex: Improve the scalability of optimistic spinningTim Chen2010-08-231-1/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There is a scalability issue for current implementation of optimistic mutex spin in the kernel. It is found on a 8 node 64 core Nehalem-EX system (HT mode). The intention of the optimistic mutex spin is to busy wait and spin on a mutex if the owner of the mutex is running, in the hope that the mutex will be released soon and be acquired, without the thread trying to acquire mutex going to sleep. However, when we have a large number of threads, contending for the mutex, we could have the mutex grabbed by other thread, and then another ……, and we will keep spinning, wasting cpu cycles and adding to the contention. One possible fix is to quit spinning and put the current thread on wait-list if mutex lock switch to a new owner while we spin, indicating heavy contention (see the patch included). I did some testing on a 8 socket Nehalem-EX system with a total of 64 cores. Using Ingo's test-mutex program that creates/delete files with 256 threads (http://lkml.org/lkml/2006/1/8/50) , I see the following speed up after putting in the mutex spin fix: ./mutex-test V 256 10 Ops/sec 2.6.34 62864 With fix 197200 Repeating the test with Aim7 fserver workload, again there is a speed up with the fix: Jobs/min 2.6.34 91657 With fix 149325 To look at the impact on the distribution of mutex acquisition time, I collected the mutex acquisition time on Aim7 fserver workload with some instrumentation. The average acquisition time is reduced by 48% and number of contentions reduced by 32%. #contentions Time to acquire mutex (cycles) 2.6.34 72973 44765791 With fix 49210 23067129 The histogram of mutex acquisition time is listed below. The acquisition time is in 2^bin cycles. We see that without the fix, the acquisition time is mostly around 2^26 cycles. With the fix, we the distribution get spread out a lot more towards the lower cycles, starting from 2^13. However, there is an increase of the tail distribution with the fix at 2^28 and 2^29 cycles. It seems a small price to pay for the reduced average acquisition time and also getting the cpu to do useful work. Mutex acquisition time distribution (acq time = 2^bin cycles): 2.6.34 With Fix bin #occurrence % #occurrence % 11 2 0.00% 120 0.24% 12 10 0.01% 790 1.61% 13 14 0.02% 2058 4.18% 14 86 0.12% 3378 6.86% 15 393 0.54% 4831 9.82% 16 710 0.97% 4893 9.94% 17 815 1.12% 4667 9.48% 18 790 1.08% 5147 10.46% 19 580 0.80% 6250 12.70% 20 429 0.59% 6870 13.96% 21 311 0.43% 1809 3.68% 22 255 0.35% 2305 4.68% 23 317 0.44% 916 1.86% 24 610 0.84% 233 0.47% 25 3128 4.29% 95 0.19% 26 63902 87.69% 122 0.25% 27 619 0.85% 286 0.58% 28 0 0.00% 3536 7.19% 29 0 0.00% 903 1.83% 30 0 0.00% 0 0.00% I've done similar experiments with 2.6.35 kernel on smaller boxes as well. One is on a dual-socket Westmere box (12 cores total, with HT). Another experiment is on an old dual-socket Core 2 box (4 cores total, no HT) On the 12-core Westmere box, I see a 250% increase for Ingo's mutex-test program with my mutex patch but no significant difference in aim7's fserver workload. On the 4-core Core 2 box, I see the difference with the patch for both mutex-test and aim7 fserver are negligible. So far, it seems like the patch has not caused regression on smaller systems. Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: <stable@kernel.org> # .35.x LKML-Reference: <1282168827.9542.72.camel@schen9-DESK> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* Merge branch 'sched-core-for-linus' of ↵Linus Torvalds2010-08-061-98/+293
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (27 commits) sched: Use correct macro to display sched_child_runs_first in /proc/sched_debug sched: No need for bootmem special cases sched: Revert nohz_ratelimit() for now sched: Reduce update_group_power() calls sched: Update rq->clock for nohz balanced cpus sched: Fix spelling of sibling sched, cpuset: Drop __cpuexit from cpu hotplug callbacks sched: Fix the racy usage of thread_group_cputimer() in fastpath_timer_check() sched: run_posix_cpu_timers: Don't check ->exit_state, use lock_task_sighand() sched: thread_group_cputime: Simplify, document the "alive" check sched: Remove the obsolete exit_state/signal hacks sched: task_tick_rt: Remove the obsolete ->signal != NULL check sched: __sched_setscheduler: Read the RLIMIT_RTPRIO value lockless sched: Fix comments to make them DocBook happy sched: Fix fix_small_capacity powerpc: Exclude arch_sd_sibiling_asym_packing() on UP powerpc: Enable asymmetric SMT scheduling on POWER7 sched: Add asymmetric group packing option for sibling domain sched: Fix capacity calculations for SMT4 sched: Change nohz idle load balancing logic to push model ...
| * Merge branch 'sched/urgent' into sched/coreIngo Molnar2010-08-051-10/+0
| |\ | | | | | | | | | | | | | | | | | | | | | | | | Conflicts: include/linux/sched.h Merge reason: Add the leftover .35 urgent bits, fix the conflict. Signed-off-by: Ingo Molnar <mingo@elte.hu>
| | * sched: Revert nohz_ratelimit() for nowPeter Zijlstra2010-07-171-10/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Norbert reported that nohz_ratelimit() causes his laptop to burn about 4W (40%) extra. For now back out the change and see if we can adjust the power management code to make better decisions. Reported-by: Norbert Preining <preining@logic.at> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Mike Galbraith <efault@gmx.de> Cc: Arjan van de Ven <arjan@infradead.org> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * | Merge branch 'linus' into sched/coreIngo Molnar2010-07-211-5/+17
| |\| | | | | | | | | | | | | | | | Merge reason: Move from the -rc3 to the almost-rc6 base. Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * | sched: No need for bootmem special casesPekka Enberg2010-07-171-12/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | As of commit dcce284 ("mm: Extend gfp masking to the page allocator") and commit 7e85ee0 ("slab,slub: don't enable interrupts during early boot"), the slab allocator makes sure we don't attempt to sleep during boot. Therefore, remove bootmem special cases from the scheduler and use plain GFP_KERNEL instead. Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1279225102-2572-1-git-send-email-penberg@cs.helsinki.fi> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * | sched, cpuset: Drop __cpuexit from cpu hotplug callbacksTejun Heo2010-06-221-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 3a101d05 (sched: adjust when cpu_active and cpuset configurations are updated during cpu on/offlining) added hotplug notifiers marked with __cpuexit; however, ia64 drops text in __cpuexit during link unlike x86. This means that functions which are referenced during init but used only for cpu hot unplugging afterwards shouldn't be marked with __cpuexit. Drop __cpuexit from those functions. Reported-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Tony Luck <tony.luck@intel.com> Cc: Peter Zijlstra <peterz@infradead.org> LKML-Reference: <4C1FDF5B.1040301@kernel.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * | sched: __sched_setscheduler: Read the RLIMIT_RTPRIO value locklessOleg Nesterov2010-06-181-6/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | __sched_setscheduler() takes lock_task_sighand() to access task->signal. This is not needed since ea6d290c, ->signal can't go away. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <20100610230944.GA25903@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * | sched: Change nohz idle load balancing logic to push modelVenkatesh Pallipadi2010-06-091-3/+31
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In the new push model, all idle CPUs indeed go into nohz mode. There is still the concept of idle load balancer (performing the load balancing on behalf of all the idle cpu's in the system). Busy CPU kicks the nohz balancer when any of the nohz CPUs need idle load balancing. The kickee CPU does the idle load balancing on behalf of all idle CPUs instead of the normal idle balance. This addresses the below two problems with the current nohz ilb logic: * the idle load balancer continued to have periodic ticks during idle and wokeup frequently, even though it did not have any rebalancing to do on behalf of any of the idle CPUs. * On x86 and CPUs that have APIC timer stoppage on idle CPUs, this periodic wakeup can result in a periodic additional interrupt on a CPU doing the timer broadcast. Also currently we are migrating the unpinned timers from an idle to the cpu doing idle load balancing (when all the cpus in the system are idle, there is no idle load balancing cpu and timers get added to the same idle cpu where the request was made. So the existing optimization works only on semi idle system). And In semi idle system, we no longer have periodic ticks on the idle load balancer CPU. Using that cpu will add more delays to the timers than intended (as that cpu's timer base may not be uptodate wrt jiffies etc). This was causing mysterious slowdowns during boot etc. For now, in the semi idle case, use the nearest busy cpu for migrating timers from an idle cpu. This is good for power-savings anyway. Signed-off-by: Venkatesh Pallipadi <venki@google.com> Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Thomas Gleixner <tglx@linutronix.de> LKML-Reference: <1274486981.2840.46.camel@sbs-t61.sc.intel.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * | sched: Avoid side-effect of tickless idle on update_cpu_loadVenkatesh Pallipadi2010-06-091-5/+95
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | tickless idle has a negative side effect on update_cpu_load(), which in turn can affect load balancing behavior. update_cpu_load() is supposed to be called every tick, to keep track of various load indicies. With tickless idle, there are no scheduler ticks called on the idle CPUs. Idle CPUs may still do load balancing (with idle_load_balance CPU) using the stale cpu_load. It will also cause problems when all CPUs go idle for a while and become active again. In this case loads would not degrade as expected. This is how rq->nr_load_updates change looks like under different conditions: <cpu_num> <nr_load_updates change> All CPUS idle for 10 seconds (HZ=1000) 0 1621 10 496 11 139 12 875 13 1672 14 12 15 21 1 1472 2 2426 3 1161 4 2108 5 1525 6 701 7 249 8 766 9 1967 One CPU busy rest idle for 10 seconds 0 10003 10 601 11 95 12 966 13 1597 14 114 15 98 1 3457 2 93 3 6679 4 1425 5 1479 6 595 7 193 8 633 9 1687 All CPUs busy for 10 seconds 0 10026 10 10026 11 10026 12 10026 13 10025 14 10025 15 10025 1 10026 2 10026 3 10026 4 10026 5 10026 6 10026 7 10026 8 10026 9 10026 That is update_cpu_load works properly only when all CPUs are busy. If all are idle, all the CPUs get way lower updates. And when few CPUs are busy and rest are idle, only busy and ilb CPU does proper updates and rest of the idle CPUs will do lower updates. The patch keeps track of when a last update was done and fixes up the load avg based on current time. On one of my test system SPECjbb with warehouse 1..numcpus, patch improves throughput numbers by ~1% (average of 6 runs). On another test system (with different domain hierarchy) there is no noticable change in perf. Signed-off-by: Venkatesh Pallipadi <venki@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Thomas Gleixner <tglx@linutronix.de> LKML-Reference: <AANLkTilLtDWQsAUrIxJ6s04WTgmw9GuOODc5AOrYsaR5@mail.gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * | sched: Simplify the reacquire_kernel_lock() logicOleg Nesterov2010-06-091-7/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | - Contrary to what 6d558c3a says, there is no need to reload prev = rq->curr after the context switch. You always schedule back to where you came from, prev must be equal to current even if cpu/rq was changed. - This also means reacquire_kernel_lock() can use prev instead of current. - No need to reassign switch_count if reacquire_kernel_lock() reports need_resched(), we can just move the initial assignment down, under the "need_resched_nonpreemptible:" label. - Try to update the comment after context_switch(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <20100519125711.GA30199@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * | sched_clock: Add local_clock() API and improve documentationPeter Zijlstra2010-06-091-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For people who otherwise get to write: cpu_clock(smp_processor_id()), there is now: local_clock(). Also, as per suggestion from Andrew, provide some documentation on the various clock interfaces, and minimize the unsigned long long vs u64 mess. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Jens Axboe <jaxboe@fusionio.com> LKML-Reference: <1275052414.1645.52.camel@laptop> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * | Merge branch 'sched-wq' of ↵Ingo Molnar2010-06-081-54/+151
| |\ \ | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq into sched/core
| | * | sched: add hooks for workqueueTejun Heo2010-06-081-2/+51
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Concurrency managed workqueue needs to know when workers are going to sleep and waking up. Using these two hooks, cmwq keeps track of the current concurrency level and throttles execution of new works if it's too high and wakes up another worker from the sleep hook if it becomes too low. This patch introduces PF_WQ_WORKER to identify workqueue workers and adds the following two hooks. * wq_worker_waking_up(): called when a worker is woken up. * wq_worker_sleeping(): called when a worker is going to sleep and may return a pointer to a local task which should be woken up. The returned task is woken up using try_to_wake_up_local() which is simplified ttwu which is called under rq lock and can only wake up local tasks. Both hooks are currently defined as noop in kernel/workqueue_sched.h. Later cmwq implementation will replace them with proper implementation. These hooks are hard coded as they'll always be enabled. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Ingo Molnar <mingo@elte.hu>
| | * | sched: refactor try_to_wake_up()Tejun Heo2010-06-081-34/+49
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Factor ttwu_activate() and ttwu_woken_up() out of try_to_wake_up(). The factoring out doesn't affect try_to_wake_up() much code-generation-wise. Depending on configuration options, it ends up generating the same object code as before or slightly different one due to different register assignment. This is to help future implementation of try_to_wake_up_local(). Mike Galbraith suggested rename to ttwu_post_activation() from ttwu_woken_up() and comment update in try_to_wake_up(). Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Ingo Molnar <mingo@elte.hu>
| | * | sched: adjust when cpu_active and cpuset configurations are updated during ↵Tejun Heo2010-06-081-17/+50
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | cpu on/offlining Currently, when a cpu goes down, cpu_active is cleared before CPU_DOWN_PREPARE starts and cpuset configuration is updated from a default priority cpu notifier. When a cpu is coming up, it's set before CPU_ONLINE but cpuset configuration again is updated from the same cpu notifier. For cpu notifiers, this presents an inconsistent state. Threads which a CPU_DOWN_PREPARE notifier expects to be bound to the CPU can be migrated to other cpus because the cpu is no more inactive. Fix it by updating cpu_active in the highest priority cpu notifier and cpuset configuration in the second highest when a cpu is coming up. Down path is updated similarly. This guarantees that all other cpu notifiers see consistent cpu_active and cpuset configuration. cpuset_track_online_cpus() notifier is converted to cpuset_update_active_cpus() which just updates the configuration and now called from cpuset_cpu_[in]active() notifiers registered from sched_init_smp(). If cpuset is disabled, cpuset_update_active_cpus() degenerates into partition_sched_domains() making separate notifier for !CONFIG_CPUSETS unnecessary. This problem is triggered by cmwq. During CPU_DOWN_PREPARE, hotplug callback creates a kthread and kthread_bind()s it to the target cpu, and the thread is expected to run on that cpu. * Ingo's test discovered __cpuinit/exit markups were incorrect. Fixed. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Ingo Molnar <mingo@elte.hu> Cc: Paul Menage <menage@google.com>
| | * | sched: define and use CPU_PRI_* enums for cpu notifier prioritiesTejun Heo2010-06-081-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Instead of hardcoding priority 10 and 20 in sched and perf, collect them into CPU_PRI_* enums. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Paul Mackerras <paulus@samba.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Arnaldo Carvalho de Melo <acme@redhat.com>