A developer's performance test of Linux IPC


One. Overview


Over decades of Linux/UNIX development, many kinds of IPC have accumulated. POSIX and SUS later put considerable effort into standardization, so the interfaces are now much clearer and more stable, yet every system still has plenty of small pitfalls. Reading books and checking documentation is not enough; only hands-on practice gradually builds familiarity, and writing performance tests is one way to become familiar with IPC.

The test code and overall approach are based on UNIX Network Programming, Volume 2 [Stevens, 1999], hereafter UNPv2, with the following adjustments:
1. Removed IPC mechanisms that Linux does not support or that are rarely used, such as Doors and Sun RPC.
2. Rewrote the read-write lock with the pthread API; when UNPv2 was written, Solaris and Digital UNIX did not yet have a pthread version of the read-write lock.
3. To reduce external dependencies, rewrote the TCP/UDP/UNIX domain socket bandwidth and latency tests; for this part UNPv2 quotes data from the open-source lmbench.
4. Added the GCC atomic builtins as a synchronization primitive; atomics have since been standardized in C11 and C++11 (a short sketch follows this list).
5. System V semaphores are not included: in tests on my machine their performance dropped sharply, more than 100 times slower than the other synchronization primitives, and I could not determine whether the virtual machine, the kernel version, or my usage was the cause. They will be added once that is settled.
6. All system parameters are kept at their defaults unless otherwise noted.
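As an illustration of adjustment 4, the GCC __atomic builtin used in the tests and its standardized C11 counterpart are interchangeable for a simple shared counter. This is only a minimal sketch (note that <stdatomic.h> needs GCC >= 4.9, newer than the test machine's 4.8.2):

    /* Sketch: the GCC atomic builtin and its C11 <stdatomic.h> equivalent. */
    #include <stdatomic.h>
    #include <stdio.h>

    static long        gcc_counter;   /* incremented via the GCC builtin   */
    static atomic_long c11_counter;   /* incremented via the C11 interface */

    int main(void)
    {
        /* GCC builtin, as used by the test program (adjustment 4). */
        __atomic_fetch_add(&gcc_counter, 1, __ATOMIC_SEQ_CST);

        /* Equivalent operation through the C11 standard interface. */
        atomic_fetch_add(&c11_counter, 1);

        printf("%ld %ld\n", gcc_counter, (long)c11_counter);
        return 0;
    }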

Two. Test environment


Hardware: dual-core CPU (2.3 GHz, 6 MiB L3 cache), 2 GiB RAM
Software: Fedora 20 (kernel 3.12.10, gcc 4.8.2, glibc 2.18)

Three. IPC bandwidth test


Method: the program uses 1 KiB | 2 KiB | 4 KiB | 8 KiB | 16 KiB | 32 KiB | 64 KiB as the message size, transfers 500 MiB of data between two processes at each size, and runs 5 times; the figures below are the average bandwidth in MB/sec.
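The measurement itself follows the usual UNPv2 pattern: one process writes the chosen message size in a loop until 500 MiB have gone through, the other drains the channel, and the elapsed wall-clock time gives MB/sec. Below is a minimal sketch of that idea for a pipe; the constants mirror the description above, everything else (names, simplified error handling) is illustrative rather than the actual test code:

    /* Sketch: push 500 MiB through a pipe in fixed-size messages
     * and report the resulting bandwidth in MB/sec. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/time.h>
    #include <sys/wait.h>

    #define MSG_SIZE   (8 * 1024)               /* one of 1 KiB .. 64 KiB */
    #define TOTAL_SIZE (500LL * 1024 * 1024)    /* 500 MiB per run        */

    int main(void)
    {
        int fd[2];
        static char buf[MSG_SIZE];
        long long i, nmsg = TOTAL_SIZE / MSG_SIZE;
        struct timeval t0, t1;

        memset(buf, 'x', sizeof(buf));
        if (pipe(fd) < 0) { perror("pipe"); exit(1); }

        if (fork() == 0) {                      /* child: drain the pipe until EOF */
            close(fd[1]);
            while (read(fd[0], buf, sizeof(buf)) > 0)
                ;
            _exit(0);
        }

        close(fd[0]);                           /* parent: timed writer */
        gettimeofday(&t0, NULL);
        for (i = 0; i < nmsg; i++)
            if (write(fd[1], buf, MSG_SIZE) != MSG_SIZE) { perror("write"); exit(1); }
        close(fd[1]);                           /* EOF lets the reader finish */
        wait(NULL);
        gettimeofday(&t1, NULL);

        double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
        printf("%.1f MB/sec\n", TOTAL_SIZE / sec / 1e6);
        return 0;
    }

Swapping the pipe for a socketpair, message queue, or TCP/UDP connection changes only the setup and the read/write calls; the timing loop stays the same.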

Table (bandwidth in MB/sec):

Message size (bytes) | Pipe  | POSIX message queue | System V message queue | TCP socket | UNIX domain socket | UDP socket
1024                 | 1,233 | 405                 | 354                    | 1,756      | 1,603              | 15
2048                 | 2,048 | 869                 | 655                    | 2,625      | 2,132              | 30
4096                 | 2,944 | 1,653               | 1,075                  | 3,483      | 3,013              | 60
8192                 | 3,211 | 4,250               | 1,599                  | 4,175      | 4,779              | 119
16384                | 3,300 | 5,982               | 2,510                  | 4,552      | 6,414              | 231
32768                | 2,876 | 6,929               | 2,888                  | 4,450      | 7,508              | 435
65536                | 2,830 | 7,483               | 3,830                  | 4,692      | 3,953              | -




Notes:
1. UDP bandwidth is very low because UDP is the only unreliable transport among these IPC mechanisms and has no flow control, so the test implements a trivial PUSH-ACK application-layer protocol on top of it; the side effect is a severe drop in performance (a sketch follows these notes). A bulk-transfer protocol with asynchronous retransmission would perform much better.
2. TCP and UDP both run over the lo loopback interface, whose MTU is 64 KiB. TCP is a byte stream, so the application never notices the MTU limit. UDP preserves datagram boundaries, so the path MTU has to be considered: IP can in theory fragment a large UDP datagram, but that brings further performance and reliability problems, so most systems limit the UDP packet size. With a 64 KiB MTU minus the IP and UDP headers and variable-length options, the largest message that can actually be sent falls a little short of 64 KiB, so for a uniform chart the UDP message size only goes up to 32 KiB.
3. UNIX domain sockets support both byte-stream and datagram modes; only the byte-stream mode, which behaves like TCP, is tested here.
4. The message-queue limits need to be raised: /proc/sys/fs/mqueue/msgsize_max for POSIX queues and /proc/sys/kernel/msgmax and /proc/sys/kernel/msgmnb for System V queues should all be increased to at least 64 KiB.
5. The curves fall into two clear camps. The boundary-less byte streams (Pipe / TCP socket / UNIX domain socket) typically keep an internal buffer and hand data to user space in chunks no larger than some fixed size; once the message size exceeds that value, bandwidth stops growing noticeably and may even fall, the pipe's PIPE_BUF limit (4 KiB on my machine) being one example. The datagram-style mechanisms (POSIX message queue / System V message queue / UDP socket) issue one call per message, so bandwidth keeps climbing as the message size grows.
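The PUSH-ACK scheme mentioned in note 1 is plain stop-and-wait: send one datagram, then block until the receiver returns a 1-byte acknowledgement before sending the next. A minimal sketch of the sender side is below; the function name and the pre-connected socket are assumptions for illustration, not the actual test code:

    /* Sketch: stop-and-wait PUSH-ACK over UDP. 'sock' is assumed to be a
     * connected UDP socket; timeout and retransmission handling omitted. */
    #include <sys/socket.h>

    static int push_ack_send(int sock, const char *buf, size_t len)
    {
        char ack;

        if (send(sock, buf, len, 0) != (ssize_t)len)
            return -1;                 /* PUSH: one message per datagram        */
        if (recv(sock, &ack, 1, 0) != 1)
            return -1;                 /* ACK: wait for the peer's confirmation */
        return 0;
    }

Because every message costs a full round trip on top of the datagram itself, UDP throughput stays far below the other mechanisms in the table above.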

Four. IPC latency test


Method: two processes exchange 1 byte of data 10,000 times and the per-exchange time is averaged; the run is repeated 5 times and the mean taken.
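A round-trip ping-pong is the usual way to measure this; the sketch below does it over a UNIX domain socketpair and prints the average per exchange. It is an illustration of the method under those assumptions, not the test program itself:

    /* Sketch: 1-byte ping-pong latency over a UNIX domain socketpair,
     * averaged over 10,000 exchanges. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/socket.h>
    #include <sys/time.h>

    #define NROUNDS 10000

    int main(void)
    {
        int sv[2], i;
        char c = 'p';
        struct timeval t0, t1;

        if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0) { perror("socketpair"); exit(1); }

        if (fork() == 0) {                     /* child: echo every byte back */
            for (i = 0; i < NROUNDS; i++) {
                if (read(sv[1], &c, 1) != 1) _exit(1);
                if (write(sv[1], &c, 1) != 1) _exit(1);
            }
            _exit(0);
        }

        gettimeofday(&t0, NULL);
        for (i = 0; i < NROUNDS; i++) {        /* parent: send, then wait for the echo */
            if (write(sv[0], &c, 1) != 1) { perror("write"); exit(1); }
            if (read(sv[0], &c, 1) != 1) { perror("read"); exit(1); }
        }
        gettimeofday(&t1, NULL);

        double us = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_usec - t0.tv_usec);
        printf("avg: %.1f usec per exchange\n", us / NROUNDS);
        return 0;
    }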

Table (latency in microseconds):

Pipe | POSIX message queue | System V message queue | TCP socket | UDP socket | UNIX domain socket
53   | 53                  | 57                     | 67         | 63         | 54


Notes:
Latency holds no surprises: all of the mechanisms come out roughly the same.


Five. Multi-process synchronization time


Method: the program starts 1 | 2 | 3 | 4 | 5 child processes; a long integer sits in shared memory and each child increments it 1,000,000 times. The total time is measured and averaged over 5 runs.
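The structure of one run can be sketched as follows, here using a process-shared pthread mutex in an anonymous shared mapping as the synchronization primitive; the other primitives drop into the same increment loop. Constants and names are assumptions based on the description above, not the actual test code:

    /* Sketch: child processes increment a shared counter 1,000,000 times
     * each, serialized by a process-shared mutex. Compile with -pthread. */
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <sys/wait.h>
    #include <unistd.h>

    #define NPROC  2
    #define NLOOPS 1000000

    struct shared {
        pthread_mutex_t lock;
        long            counter;
    };

    int main(void)
    {
        struct shared *sh;
        pthread_mutexattr_t attr;
        int i, j;

        /* Anonymous shared mapping, inherited by every forked child. */
        sh = mmap(NULL, sizeof(*sh), PROT_READ | PROT_WRITE,
                  MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        if (sh == MAP_FAILED) { perror("mmap"); exit(1); }

        pthread_mutexattr_init(&attr);
        pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
        pthread_mutex_init(&sh->lock, &attr);
        sh->counter = 0;

        for (i = 0; i < NPROC; i++) {
            if (fork() == 0) {
                for (j = 0; j < NLOOPS; j++) {
                    pthread_mutex_lock(&sh->lock);
                    sh->counter++;
                    pthread_mutex_unlock(&sh->lock);
                }
                _exit(0);
            }
        }
        for (i = 0; i < NPROC; i++)
            wait(NULL);

        printf("counter = %ld (expected %ld)\n", sh->counter, (long)NPROC * NLOOPS);
        return 0;
    }

Timing would wrap the fork/wait section with gettimeofday(), exactly as in the bandwidth sketch.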

Table (time to count 1M times, in microseconds):

# processes | atomic  | mutex   | read-write lock | memory semaphore | named semaphore | fcntl record locking
1           | 6,082   | 20,478  | 34,125          | 20,736           | 23,340          | 544,179
2           | 40,948  | 120,419 | 411,038         | 192,671          | 222,371         | 1,317,033
3           | 72,817  | 177,074 | 726,129         | 648,390          | 630,069         | 2,191,806
4           | 101,213 | 287,997 | 1,012,311       | 855,711          | 891,484         | 6,125,641
5           | 128,190 | 371,302 | 1,129,752       | 1,122,309        | 1,198,571       | 3,757,362


Notes:
1. Incrementing an integer is a very simple operation, and for such concurrent access an atomic operation (__atomic_fetch_add) is the best fit: no system call is involved, only a few machine instructions, so it is the fastest by far. But atomics can only express very simple operations; more complex scenarios still need a mutex or similar. The comparison here is only about the synchronization primitives themselves.
2. fcntl record locking is the slowest, which is expected since it goes through the file-system layer. On the other hand, record locks are very easy to use and supported on every platform, so they remain a reasonable choice for multi-process synchronization (see the sketch after these notes).
3. Apart from fcntl, the other methods are all memory operations, and the pattern is that the more functionality a primitive offers, the lower its efficiency. Atomics can only perform atomic operations on integer types and are the fastest. A mutex can protect arbitrary code but has only the two states locked/unlocked, and comes next. A read-write lock must distinguish read locks from write locks, so it is more complex than a mutex; a semaphore maintains a counter and can be viewed as a mutex combined with a condition variable, also more complex than a mutex. It therefore makes sense that both are slower than the mutex.
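Note 2's point about simplicity is easy to see in code: treating an fcntl record lock as a cross-process mutex takes only a small helper around two lock states. A sketch, assuming an already-open descriptor fd on some agreed-upon lock file:

    /* Sketch: fcntl() record locking used as a cross-process mutex. */
    #include <fcntl.h>
    #include <unistd.h>

    static int lock_file(int fd, short type)      /* F_WRLCK or F_UNLCK */
    {
        struct flock fl;

        fl.l_type   = type;
        fl.l_whence = SEEK_SET;
        fl.l_start  = 0;
        fl.l_len    = 0;                          /* length 0 = the whole file    */
        return fcntl(fd, F_SETLKW, &fl);          /* F_SETLKW blocks until granted */
    }

    /* Usage around the shared-memory increment:
     *     lock_file(fd, F_WRLCK);
     *     (*counter)++;
     *     lock_file(fd, F_UNLCK);
     */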

Six. Multi-thread synchronization time


Method: the program starts 1 | 2 | 3 | 4 | 5 threads; a long integer sits in shared memory and each thread increments it 1,000,000 times. The total time is measured and averaged over 5 runs.
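The threaded runs keep the same loop but start threads with pthread_create() instead of forking. The sketch below uses the pthread read-write lock mentioned in adjustment 2 of the overview, taking the write side around every increment; it is an illustration under those assumptions, not the test program:

    /* Sketch: threads increment a shared counter 1,000,000 times each,
     * holding the write side of a pthread rwlock for every increment.
     * Compile with -pthread. */
    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 2
    #define NLOOPS   1000000

    static pthread_rwlock_t rwlock = PTHREAD_RWLOCK_INITIALIZER;
    static long counter;

    static void *worker(void *arg)
    {
        int i;

        (void)arg;
        for (i = 0; i < NLOOPS; i++) {
            pthread_rwlock_wrlock(&rwlock);   /* exclusive access for the write */
            counter++;
            pthread_rwlock_unlock(&rwlock);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NTHREADS];
        int i;

        for (i = 0; i < NTHREADS; i++)
            pthread_create(&tid[i], NULL, worker, NULL);
        for (i = 0; i < NTHREADS; i++)
            pthread_join(tid[i], NULL);

        printf("counter = %ld (expected %ld)\n", counter, (long)NTHREADS * NLOOPS);
        return 0;
    }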

Table (time to count 1M times, in microseconds):

# threads | atomic  | mutex   | read-write lock | memory semaphore | named semaphore | fcntl record locking
1         | 6,011   | 20,161  | 36,107          | 20,472           | 21,136          | 581,972
2         | 40,732  | 197,068 | 322,856         | 177,390          | 196,268         | -
3         | 60,447  | 266,436 | 364,316         | 281,590          | 536,500         | -
4         | 81,705  | 383,399 | 468,483         | 459,140          | 742,603         | -
5         | 102,755 | 476,966 | 565,017         | 517,602          | 1,116,406       | -


Notes:
1. Comparing these numbers with the multi-process results above, Linux process switching is not much slower than thread switching.
2. fcntl record locking is process-based, so only the single-thread run is meaningful.

Seven. Summary

The results above cannot be applied directly to any real project; many factors affect IPC performance, such as system load, process context switches, and various parameter settings. Base your decisions on measurements taken on your own machines, but the approach above can serve as a reference.

Eight. References

1. W. Richard Stevens, UNIX Network Programming, Volume 2, Second Edition: Interprocess Communications, Prentice Hall, 1999, ISBN 0-13-081081-9.
2. Michael Kerrisk, Linux/UNIX System Programming Manual (Chinese edition of The Linux Programming Interface, translated by Sun Jian et al.), Posts and Telecom Press.
3. UNPv2 source code.
4. Matplotlib (used to generate the charts).



Posted by Wanda at May 12, 2014 - 7:33 AM