Memory layout of common string wrappers (0): what do you hold in your hand?

(Disclaimer: Please do not reprint without permission. To reprint, please contact me first.
Author: RednaxelaFX -> rednaxelafx.iteye.com)

Series index, memory layout of common string wrappers:
(0): What do you hold in your hand?
(1): Are the metadata and the string content stored as one whole, or separately?

I originally wrote this as part of a reply in a discussion about the memory layout of JavaScript strings, but the more I wrote, the more it drifted off topic, so I pulled it out into a separate series of notes. The background for what follows is these two discussions:
* Are JavaScript strings allocated on the stack or on the heap?
* Are Lua strings copy-on-write?
Please keep that background in mind.

Quote
Isn't a string just a piece of memory? And isn't that piece of memory either on the stack or on the heap?

Well... no, not really. And it is not just that strings can also live in the global data area; a string is not necessarily "one piece of memory" at all. Read on.

Memory layout of string wrappers

Leaving aside bare strings such as char*, this note only covers object-oriented string wrappers.
To narrow the scope further, only "flat" strings are discussed here, that is, data structures that store the string content packed contiguously and in order in memory. Chained structures for strings, such as ropes, are not discussed.

As mentioned in the earlier discussion, one of the conditions that forces fully dynamic memory allocation is that the size of the data cannot be determined in advance. The content of a string type generally has no fixed length, so the size of the whole string cannot be determined in advance, regardless of whether the string is mutable. This means that, at least in some cases, part of a string (the variable-length part) has to use dynamic memory allocation, that is, be allocated "on the heap".

String wrappers can differ along several dimensions:
0. "What do you hold in your hand?"
1. Whether the string metadata and the string content are stored as one whole or separately;
2. Whether different string instances share string content;
3. Whether the string records its length explicitly;
4. Whether the string ends with '\0' (null-terminated), and whether the string content may contain '\0' (embedded null);
5. Which position in the string external pointers or references point to;
6. Whether the storage capacity of the string (capacity) may be greater than the length of its content (length);
7. Whether there are alignment requirements, and whether there is padding at the end.

0. What do you hold in your hand?

Suppose
mystringtype s = "foobar";
mystringtype s1 = s;

Then the "s" and "s1" you hold each have some storage space. What is it filled with?
Ordered from nearest to farthest from the "real data", the possibilities are:
a) The string content itself?
b) A pointer to the string entity?
c) A pointer to a pointer to the string entity?
d) A token that represents the string?


a) The string content itself

This is relatively rare, but it does exist. Some C++ standard library implementations of std::basic_string use SSO (short string optimization) to put a short string (7 wchar_t or 15 char, for instance) directly inside the std::string structure; only when the string is longer than that threshold is its content allocated on the heap. In such an implementation, given
std::string s("foobar");
std::string s1 = s;

s directly holds the content "foobar" inside itself, rather than holding a pointer to a string entity.

The VS2012/VC11 implementation, for example, works this way. Drastically simplified, the data part of VC11's std::string looks like this:
class string {
  enum { _BUF_SIZE = 16 };
  union _Bxty {
    // storage for small buffer or pointer to larger one
    char  _Buf[_BUF_SIZE];
    char* _Ptr;
  } _Bx;
  size_t _Mysize; // current length of string
  size_t _Myres;  // current storage reserved for string
};

You can see that its first instance member, _Bx, is a 16-byte union. It can either hold string content whose length is less than _BUF_SIZE, or hold a pointer (when the string length is not less than _BUF_SIZE). This technique, called SSO, lets a small string be embedded directly in the std::string instance, so no separate buffer has to be allocated on the heap. That reduces the space overhead on the heap and improves data locality. There is a cost, of course: every access to the string content must first compare _Myres against _BUF_SIZE to determine whether the string is currently in "short string" or "long string" mode, which adds a little code complexity. Overall, though, thanks to the better data locality, it generally does not increase the time overhead.
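The mode check described above can be sketched with a toy type. This is only an illustration under stated assumptions: TinyString and all of its member names are hypothetical, it is loosely modeled on the simplified layout above rather than on the real VC11 source, and it leaves out copying, growth, and allocation-failure handling.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <cstring>

// Hypothetical toy type, loosely modeled on the simplified layout above.
// Not the real std::string: no copying, no growth, no allocation-failure handling.
class TinyString {
    static constexpr std::size_t kBufSize = 16;  // plays the role of _BUF_SIZE
    union {
        char  buf[kBufSize];  // short mode: characters stored inline
        char* ptr;            // long mode: pointer to a heap buffer
    } bx_;
    std::size_t size_;  // current length, like _Mysize
    std::size_t res_;   // reserved capacity excluding '\0', like _Myres

public:
    explicit TinyString(const char* text) : size_(std::strlen(text)) {
        if (size_ < kBufSize) {  // fits inline, terminator included
            res_ = kBufSize - 1;
            std::memcpy(bx_.buf, text, size_ + 1);
        } else {                 // too long: allocate on the heap
            res_ = size_;
            bx_.ptr = static_cast<char*>(std::malloc(size_ + 1));
            std::memcpy(bx_.ptr, text, size_ + 1);
        }
    }
    ~TinyString() { if (is_long()) std::free(bx_.ptr); }
    TinyString(const TinyString&) = delete;
    TinyString& operator=(const TinyString&) = delete;

    // The check every access pays for: compare capacity with the buffer size.
    bool is_long() const { return res_ >= kBufSize; }
    const char* data() const { return is_long() ? bx_.ptr : bx_.buf; }
    std::size_t size() const { return size_; }

    // True when the characters live inside this object's own bytes.
    bool stores_inline() const {
        auto self = reinterpret_cast<std::uintptr_t>(this);
        auto chars = reinterpret_cast<std::uintptr_t>(data());
        return chars >= self && chars < self + sizeof(*this);
    }
};

inline void sso_demo() {
    TinyString s("foobar");
    assert(!s.is_long() && s.stores_inline());
    TinyString big("a string of more than fifteen characters");
    assert(big.is_long() && !big.stores_inline());
}
```

Calling sso_demo() exercises both modes: "foobar" stays inside the object, while the longer string ends up behind the pointer.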

For example, a VC11 std::string holding "foobar" on 32-bit x86 might look like this in memory:
0x0042FE54  66 6f 6f 62 61 72 00 00 b9 21 a2 00 68 f7 0c 95
0x0042FE64  06 00 00 00 0f 00 00 00 

s: 0x0042FE54 (24 bytes)
 (+0) [ _Bx._Buf = 0x66 ('f') 0x6F ('o') 0x6F ('o') 0x62 ('b') 0x61 ('a') 0x72 ('r') 0x00 ('\0') ... ]
(+16) [ _Mysize  = 0x00000006 ]
(+20) [ _Myres   = 0x0000000F ]

On 64-bit x86 it might look like this:
0x000000000024F8E8  66 6f 6f 62 61 72 00 00 69 2f d5 a1 1d d9 ce 01
0x000000000024F8F8  06 00 00 00 00 00 00 00 0f 00 00 00 00 00 00 00

s: 0x000000000024F8E8 (32 bytes)
 (+0) [ _Bx._Buf = 0x66 ('f') 0x6F ('o') 0x6F ('o') 0x62 ('b') 0x61 ('a') 0x72 ('r') 0x00 ('\0') ... ]
(+16) [ _Mysize  = 0x0000000000000006 ]
(+24) [ _Myres   = 0x000000000000000F ]

The first 16 bytes are the extent of the _Bx member. In this example the first 6 bytes are the content "foobar", followed by '\0' (the null terminator); the rest is unused data (not guaranteed to be zeroed). Then come _Mysize = 6 and _Myres = 15.

When s1 = s is executed, s1 copies the content of s, so s1 also has its own embedded "foobar"; no data is shared.

b) A pointer to the string entity

Many high-level language virtual machines use this scheme. They restrict access to objects: the content of an object cannot be accessed directly, only indirectly through a reference, so there is at least one level of indirection. When that level of indirection is implemented as a "direct pointer", this style of memory management is called pointer-based memory management.

In the example, "s" and "s1" are both references, and the value of a reference is itself a pointer; "s" and "s1" refer to the same String instance. .NET as implemented by the CLR and Java as implemented by HotSpot VM both work this way. Examples come later in the series, so I will not expand on them here.
s:              string object:
[ pointer ] --> [ "foobar" ]
             /
s1:         /
[ pointer ]
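
The picture above can be approximated in C++ as a rough sketch. The alias StringRef and the helper names are hypothetical, and std::shared_ptr (reference counting) only stands in for the garbage-collected references of a real VM; the point is just that assigning s to s1 copies a pointer, not the characters.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical alias: a "reference" whose value is a pointer to a string object.
// std::shared_ptr uses reference counting; a real VM would rely on its own GC.
using StringRef = std::shared_ptr<const std::string>;

inline StringRef make_string(const char* text) {
    return std::make_shared<std::string>(text);
}

// Two references are "the same" when they point at the same instance.
inline bool same_instance(const StringRef& a, const StringRef& b) {
    return a.get() == b.get();
}

inline void reference_demo() {
    StringRef s = make_string("foobar");
    StringRef s1 = s;                  // copies the pointer, not the characters
    assert(same_instance(s, s1));
    assert(s1->data() == s->data());   // one shared character buffer
    StringRef other = make_string("foobar");
    assert(!same_instance(s, other));  // equal content, different instance
}
```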


c) A pointer to a pointer to the string entity

This is the case with one or more extra levels of indirection. The extra level of indirection is usually called a handle, and the corresponding style of memory management is called handle-based.

A common implementation of a handle is a pointer to a pointer, which is one level more indirect than a direct pointer:
s:             handle table:   string object:
[ handle ] --> [ pointer ] --> [ "foobar" ]
            /
s1:        /
[ handle ]

The JVM in Sun JDK 1.0.2, for example, worked this way.

The benefit of handles is that they let the implementer be lazy. If the memory manager needs to move objects (as in mark-compact or copying GC), it must fix up every pointer to a moved object. Finding all of those pointers takes effort, so a lazy implementer adds a level of indirection: outside code is not allowed to hold a pointer directly to an object and holds a handle instead. The handle can be a pointer into a "handle table", and the elements of the handle table in turn hold the actual pointers to objects. When objects move, it is enough to walk the handle table and fix up the pointers there.
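
As a minimal sketch of that idea (all names here are hypothetical, and the "collector" is just a loop that reallocates every object), a handle can be a pointer to a slot in a table, and moving an object touches only the slot:

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <string>

// Hypothetical toy model, not the design of any real VM.
using Handle = std::string**;  // pointer to a slot in the handle table

class HandleTable {
    // std::deque keeps element addresses stable across push_back,
    // so handles (pointers to slots) stay valid as the table grows.
    std::deque<std::string*> slots_;

public:
    HandleTable() = default;
    HandleTable(const HandleTable&) = delete;
    HandleTable& operator=(const HandleTable&) = delete;
    ~HandleTable() { for (std::string* object : slots_) delete object; }

    Handle allocate(const std::string& text) {
        slots_.push_back(new std::string(text));
        return &slots_.back();
    }

    // Simulates a moving collector: every object gets a new address,
    // and only the table slots are fixed up. Handles are never touched.
    void relocate_all() {
        for (std::string*& slot : slots_) {
            std::string* moved = new std::string(*slot);
            delete slot;
            slot = moved;
        }
    }
};

inline void handle_demo() {
    HandleTable table;
    Handle s = table.allocate("foobar");
    Handle s1 = s;  // copies the handle
    auto before = reinterpret_cast<std::uintptr_t>(*s);
    table.relocate_all();
    assert(s == s1);                                         // handles unchanged
    assert(reinterpret_cast<std::uintptr_t>(*s) != before);  // object moved
    assert(**s1 == "foobar");                                // content still reachable
}
```

The price is visible in the last line of handle_demo(): reaching the characters takes two dereferences instead of one.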

The downside of handles is their relatively high cost in both time and space. There are two kinds of suitable use cases: 1. you want to be lazy; 2. you want to hide information.

d) A token that represents the string

This takes the previous case one step further. A "handle" does not have to be a pointer to a pointer; it can be something even more indirect. For example, if the "handle table" is an array, the "handle" can be just an array index instead of a pointer; if the "handle table" is a sparse array (possibly implemented with a hash table), the "handle" can be just an index into that sparse array (possibly implemented as a hash key). A handle of this kind is sometimes called a token, or an ID.
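
A rough sketch of the array-index variant, with hypothetical names throughout: the token is nothing more than an index, so it stays valid even when the table's own storage is reallocated.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical: a handle that is just an index into the table.
using Token = std::size_t;

class TokenTable {
    std::vector<std::string> entries_;

public:
    // Stores the text and returns its index as the token.
    Token add(std::string text) {
        entries_.push_back(std::move(text));
        return entries_.size() - 1;
    }

    // Resolving a token is a bounds-checked array lookup.
    const std::string& resolve(Token token) const { return entries_.at(token); }
};

inline void token_demo() {
    TokenTable table;
    Token s = table.add("foobar");
    Token s1 = s;  // copies an integer
    for (int i = 0; i < 1000; ++i) {
        table.add("filler");  // lets the vector grow and reallocate
    }
    assert(s == s1);
    assert(table.resolve(s1) == "foobar");
}
```

Unlike a pointer, the token carries no address at all, which is also why it hides the most information from the code that holds it.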

Symbol in Ruby 1.8.7 is an application of this special kind of handle.
In Ruby, both Symbol and String can be used to represent string information. The differences:
* Symbols are interned; Strings are not. That is, there can be only one Symbol "object" with a given content;
* Symbols are immutable; Strings can be mutable (there are also frozen Strings, which are immutable).

Symbol is so special in Ruby that VALUE, the type that represents values in Ruby, has special handling for it.
The following example assigns 3 Symbols in turn to the local variable s:
s = :rednaxelafx
s = :rednaxelapx
s = :rednaxelagx

Assume that none of these 3 Symbols has appeared before; they will then be interned one after another in this order.

From C's point of view, the local variable s has type VALUE. The three values (VALUE values) that s takes might be:
(example from Ruby 1.8.7 running on Mac OS X 10.7.5/x86-64)
0x00000000005F390E
0x00000000005F410E
0x00000000005F490E

Can't see the connection? Look at them in binary:
ID                                                    | ID_LOCAL | SYMBOL_FLAG
00000000000000000000000000000000000000000101111100111 | 001      | 00001110
00000000000000000000000000000000000000000101111101000 | 001      | 00001110
00000000000000000000000000000000000000000101111101001 | 001      | 00001110

VALUE in Ruby 1.8.7 is a tagged pointer type: its lowest 8 bits are a tag used to identify values of special types, and the tag value that marks a Symbol, SYMBOL_FLAG, is 0x0e.
When the tag of a VALUE is SYMBOL_FLAG, the 3 bits just above the tag indicate the scope of the Symbol; the scope value that marks a local identifier, ID_LOCAL, is 0x01.
The bits above those hold a unique value that corresponds one-to-one to the Symbol: an integer ID.

Pulling the ID part out on its own, the IDs behind the values of s in the example are:
3047
3048
3049

They are consecutive, increasing integers. This ID together with the scope tag forms the token Ruby uses for a Symbol, which can be regarded as a special form of handle.
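
The bit layout described above can be checked with a few shifts and masks. This is only a sketch of the layout as described in this note (8 tag bits, 3 scope bits, then the ID); the constant and function names are hypothetical and are not taken from the Ruby source.

```cpp
#include <cassert>
#include <cstdint>

// Field widths and tag values as described in the text; names are hypothetical.
constexpr std::uint64_t kSymbolFlag = 0x0e;  // SYMBOL_FLAG
constexpr std::uint64_t kIdLocal    = 0x01;  // ID_LOCAL
constexpr unsigned      kTagBits    = 8;     // width of the type tag
constexpr unsigned      kScopeBits  = 3;     // width of the scope field
constexpr std::uint64_t kTagMask    = (1u << kTagBits) - 1;
constexpr std::uint64_t kScopeMask  = (1u << kScopeBits) - 1;

struct SymbolParts {
    bool          is_symbol;  // tag equals SYMBOL_FLAG
    std::uint64_t scope;      // the 3 bits above the tag
    std::uint64_t id;         // everything above the scope bits
};

// Splits a VALUE-like 64-bit word into tag check, scope, and ID.
inline SymbolParts decode_value(std::uint64_t value) {
    SymbolParts parts;
    parts.is_symbol = (value & kTagMask) == kSymbolFlag;
    parts.scope     = (value >> kTagBits) & kScopeMask;
    parts.id        = value >> (kTagBits + kScopeBits);
    return parts;
}

// The inverse: packs an ID and a scope into a Symbol-tagged word.
inline std::uint64_t encode_symbol(std::uint64_t id, std::uint64_t scope) {
    return (id << (kTagBits + kScopeBits)) | (scope << kTagBits) | kSymbolFlag;
}

inline void value_demo() {
    SymbolParts first = decode_value(0x00000000005F390EULL);
    assert(first.is_symbol && first.scope == kIdLocal && first.id == 3047);
    assert(encode_symbol(3048, kIdLocal) == 0x00000000005F410EULL);
}
```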

So a Symbol is not really an "object", or at least not an object instance that lives entirely in the heap. The Symbol system consists of 3 parts:
* An ID value that corresponds one-to-one to the Symbol, usually embedded in a VALUE tagged with SYMBOL_FLAG. Apart from the scope tag, this ID is generated by a global counter. The value returned by Symbol#object_id is computed from this ID. See the implementation of rb_intern();
* A global symbol table, which is a hash table recording the mapping from IDs to the actual string content;
* The char arrays that hold the actual string content.
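
The three parts listed above can be modeled in a few lines. This is a simplified sketch and not how rb_intern() is actually written: SymbolTable and its members are hypothetical names, scope bits are ignored, and the starting counter value is chosen arbitrarily to match the example.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>

// Hypothetical model of an intern table: a counter plus two hash maps.
class SymbolTable {
    std::uint64_t next_id_;                                      // global counter
    std::unordered_map<std::string, std::uint64_t> id_by_name_;  // content -> ID
    std::unordered_map<std::uint64_t, std::string> name_by_id_;  // ID -> content

public:
    explicit SymbolTable(std::uint64_t first_id) : next_id_(first_id) {}

    // Returns the existing ID for known content, or assigns the next one.
    std::uint64_t intern(const std::string& name) {
        auto found = id_by_name_.find(name);
        if (found != id_by_name_.end()) return found->second;
        std::uint64_t id = next_id_++;
        id_by_name_.emplace(name, id);
        name_by_id_.emplace(id, name);
        return id;
    }

    // Looks up the string content behind an ID; throws if the ID is unknown.
    const std::string& name_of(std::uint64_t id) const { return name_by_id_.at(id); }
};

inline void intern_demo() {
    SymbolTable table(3047);  // arbitrary start, matching the example IDs
    std::uint64_t a = table.intern("rednaxelafx");
    std::uint64_t b = table.intern("rednaxelapx");
    assert(b == a + 1);                          // IDs come from a counter
    assert(table.intern("rednaxelafx") == a);    // same content, same ID
    assert(table.name_of(b) == "rednaxelapx");
}
```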

Knowing the mapping between Symbol#object_id and the underlying ID, one can write a small program like this:
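# Maps Symbol#object_id back to the underlying ID (scope bits included).
def id_with_scope(object_id)
  # sizeof(RVALUE) = 40 on 64-bit platform for Ruby 1.8.7
  ((object_id << 1) - (4 << 2)) / 40
end

ID_LOCAL = 1  # scope tag for local identifiers (not used below)
ID_SHIFT = 3  # number of scope bits below the counter part of the ID

# Returns the counter part of a Symbol's ID, or nil for non-Symbols.
def to_id(sym)
  return nil unless sym.is_a? Symbol
  id_with_scope(sym.object_id) >> ID_SHIFT
end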

(This is only correct for the Ruby 1.8 series on 64-bit platforms. The details differ slightly on other versions and platforms, but the principle is the same.)
Then compute the ID value corresponding to a Symbol:
>> to_id :rednaxelafx
=> 3047
>> to_id :rednaxelapx
=> 3048
>> to_id :rednaxelagx
=> 3049


Rubinius implements Symbol in a similar way.

From the interning perspective, Ruby's Symbol and Lua's string are similar: all instances of both are interned. But the reference value (VALUE) of the former has a special representation, an integer ID, while the reference value (Value) of the latter is implemented as an ordinary pointer. Whether instances are interned and whether references are represented as pointers are two orthogonal questions.

Posted by Angelia at February 12, 2014 - 11:40 AM