List: qemu-devel
Subject: Re: [Qemu-devel] [PATCH 01/15] qemu coroutine: support bypass mode
From: Ming Lei <ming.lei@canonical.com>
Date: 2014-07-31 8:59:47
Message-ID: CACVXFVMniMoquw-BQ86VZKPT-1n6p6gp7m01MtioZf=+BugidQ@mail.gmail.com
On Thu, Jul 31, 2014 at 7:37 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> On 30/07/2014 19:15, Ming Lei wrote:
> > On Wed, Jul 30, 2014 at 9:45 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> > > On 30/07/2014 13:39, Ming Lei wrote:
> > > > This patch introduces several APIs for bypassing the qemu coroutine
> > > > when it is not necessary, for performance's sake.
> > >
> > > No, this is wrong. Dataplane *must* use the same code as non-dataplane,
> > > anything else is a step backwards.
> >
> > As we saw, coroutines have introduced a performance regression
> > on dataplane, and it isn't necessary to use coroutines in some cases, is it?
>
> Yes, and it's not necessary on non-dataplane either. It's not necessary
> on virtio-scsi, and it will not be necessary on virtio-scsi dataplane
> either.
>
> > > If you want to bypass coroutines, bdrv_aio_readv/writev must detect the
> > > conditions that allow doing that and call the bdrv_aio_readv/writev
> > > directly.
> >
> > That is easy to detect; please see the 5th patch.
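Roughly, the idea is something like the sketch below (a hypothetical,
self-contained illustration; the type, field and function names are
invented here, the real check is the one in patch 5/15):

    /* Hypothetical sketch of the bypass decision; not the actual patch. */
    #include <stdbool.h>

    struct blk_state {
        bool raw_format;        /* raw image, no format driver on top */
        bool has_native_aio;    /* driver provides the AIO callbacks */
        bool io_throttled;      /* throttling has to yield, so no bypass */
        bool block_job_active;  /* block jobs may need the coroutine path */
    };

    /* Bypass is only safe when nothing in the request path needs to
     * yield; otherwise fall back to the normal coroutine submission. */
    bool can_bypass_coroutine(const struct blk_state *s)
    {
        return s->raw_format && s->has_native_aio &&
               !s->io_throttled && !s->block_job_active;
    }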
>
> No, that's not enough. Dataplane right now prevents block jobs, but
> that's going to change and it could require coroutines even for raw devices.
>
> > > To begin with, have you benchmarked QEMU and can you provide a trace of
> > > *where* the coroutine overhead lies?
> >
> > I guess it may be caused by the stack switch; at least on one of
> > my boxes, bypassing coroutines improves throughput by ~7%, and by
> > ~15% on another box.
>
> No guesses please. Actually that's also my guess, but since you are
> submitting the patch you must do better and show profiles where stack
> switching disappears after the patches.
Below are the hardware events reported by 'perf stat' when running the
fio randread benchmark for 2 minutes in the VM (single vq, 2 jobs):

    sudo ~/bin/perf stat -e \
        L1-dcache-loads,L1-dcache-load-misses,cpu-cycles,instructions,branch-instructions,branch-misses,branch-loads,branch-load-misses,dTLB-loads,dTLB-load-misses \
        ./nqemu-start-mq 4 1
1) Without bypassing the coroutine (bypass disabled by forcing
   's->raw_format' to false, see patch 5/15):

   - throughput: 95K
   Performance counter stats for './nqemu-start-mq 4 1':

        69,231,035,842  L1-dcache-loads                                        [40.10%]
         1,909,978,930  L1-dcache-load-misses  #  2.76% of all L1-dcache hits  [39.98%]
       263,731,501,086  cpu-cycles                                             [40.03%]
       232,564,905,115  instructions           #  0.88 insns per cycle         [50.23%]
        46,157,868,745  branch-instructions                                    [49.82%]
           785,618,591  branch-misses          #  1.70% of all branches        [49.99%]
        46,280,342,654  branch-loads                                           [49.95%]
        34,934,790,140  branch-load-misses                                     [50.02%]
        69,447,857,237  dTLB-loads                                             [40.13%]
           169,617,374  dTLB-load-misses       #  0.24% of all dTLB cache hits [40.04%]

       161.991075781 seconds time elapsed
2) With bypassing the coroutine:

   - throughput: 115K
   Performance counter stats for './nqemu-start-mq 4 1':

        76,784,224,509  L1-dcache-loads                                        [39.93%]
         1,334,036,447  L1-dcache-load-misses  #  1.74% of all L1-dcache hits  [39.91%]
       262,697,428,470  cpu-cycles                                             [40.03%]
       255,526,629,881  instructions           #  0.97 insns per cycle         [50.01%]
        50,160,082,611  branch-instructions                                    [49.97%]
           564,407,788  branch-misses          #  1.13% of all branches        [50.08%]
        50,331,510,702  branch-loads                                           [50.08%]
        35,760,766,459  branch-load-misses                                     [50.03%]
        76,706,000,951  dTLB-loads                                             [40.00%]
           123,291,001  dTLB-load-misses       #  0.16% of all dTLB cache hits [40.02%]

       162.333465490 seconds time elapsed
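To put a rough number on the stack-switch guess above, the standalone
toy program below times bare ucontext switches, which is the mechanism
the ucontext coroutine backend uses on every enter/yield. This is only
an illustration, not QEMU code; build with "cc -O2 stackswitch.c"
(add -lrt on older glibc):

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    #include <ucontext.h>

    #define ITERATIONS 1000000
    #define STACK_SIZE (64 * 1024)

    static ucontext_t main_ctx, co_ctx;

    /* Coroutine body: bounce straight back to the caller on every entry. */
    static void co_fn(void)
    {
        for (;;) {
            swapcontext(&co_ctx, &main_ctx);
        }
    }

    int main(void)
    {
        struct timespec t0, t1;
        char *stack = malloc(STACK_SIZE);
        int i;

        getcontext(&co_ctx);
        co_ctx.uc_stack.ss_sp = stack;
        co_ctx.uc_stack.ss_size = STACK_SIZE;
        co_ctx.uc_link = &main_ctx;
        makecontext(&co_ctx, co_fn, 0);

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (i = 0; i < ITERATIONS; i++) {
            /* one enter plus one yield, i.e. two stack switches */
            swapcontext(&main_ctx, &co_ctx);
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);

        printf("%.1f ns per enter/yield pair\n",
               ((t1.tv_sec - t0.tv_sec) * 1e9 +
                (t1.tv_nsec - t0.tv_nsec)) / ITERATIONS);
        free(stack);
        return 0;
    }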