When Arrays aren't Arrays

Dec. 10, 2021

Preface

This topic is aimed at programmers with knowledge of arrays or JavaScript (ideally both). If you aren't one of those, I promise you won't be interested. Even if you are one of those, you probably won't be interested, but maybe you can get something from the rabbit hole I went down when messing around with array performance one day...

Arrays in JS

JavaScript is a very high-level language that runs on different engines depending on your environment. How it behaves is heavily dependent on optimizations made by those engines. For the purposes of this post, I'm going to be using Node v16.13.0, which is built on top of Chrome's V8 engine. Every version of Node and every version of every browser is subject to behaving completely differently than the results I'm about to show in this post. It's nice not have to think about these too hard when working with a language like JavaScript, but it's also good to get a deeper understanding when you're dealing with large data sets and understand some traps you can fall into if you aren't careful.

Arrays are kind of a complicated topic in JavaScript. They're unlike some lower-level languages where you have to manage array size and deal with explicitly resizing it when the time comes. Additionally, you can access and set elements far outside the array size without incident. It's just another type of object with .length property and various helper methods attached to it for our convenience. How it's represented under-the-hood (in the various engines) changes depending on how it's used, which is what this post is about.

Most people familiar with arrays in JavaScript know that it's a convenient way to store a list of related items:


      const shoppingList = [
        'eggs',
        'milk',
        'coffee',
        'bread'
      ];

If you're more familiar with a staticly-typed language, you might be a little surprised by the ability to store different types in the same array:


      const shoppingList = [
        5,           // integer
        1.2,         // float
        'hello',     // string
        { a: 'b' }   // object
      ];

The hand-wavy answer to why this is possible is that "arrays are represented by hash maps", but is that really true? I think people often cite that answer because of the ability to add values at arbitrary positions and dynamically size the array.


      const arr = [];
      arr[100] = 5;
      arr.length;
      // 101
      arr;
      // [ <100 empty items>, 5 ]

A peak under-the-hood

Node lets us get some low-level information by running with the --allow-natives-syntax flag and calling %DebugPrint(). With that, let's look at some low-level information about arrays. I'll be reorganizing the log output a bit for readability and excluding lines that I don't think are relevant to each point.


      const arr = [];

      %DebugPrint(arr);
      000003F5042B80F9: [JSArray]
       - map: 0x016ee2b420e9 <Map(PACKED_SMI_ELEMENTS)> [FastProperties]
       - prototype: 0x023cc07db881 <JSArray[0]>
       - elements: 0x010b17a80b29 <FixedArray[0]> [PACKED_SMI_ELEMENTS]
       - length: 0

Inspecting the debug output we can see that we have a 0 (zero) length array that is represented by a FixedArray[0] of PACKED_SMI_ELEMENTS, which just means that the elements are contiguous in memory. The "SMI" part means "small integer" and is a flag to alert V8 about some assumptions for optimization purposes. Additionally, it does include a map, which is used to track information about the type stored in each index.

Let's see what happens when we add one number:


      arr.push(5);

      %DebugPrint(arr);
      000000CA196DAB51: [JSArray]
       - elements: 0x0016421960d9 <FixedArray[17]> [PACKED_SMI_ELEMENTS]
           0: 5
           1-16: 0x010b17a80541 <the_hole>
       - length: 1

A couple of interesting things happened. First, the address (the 000003F5042B80F9 in the previous code block) changed, meaning a new array was created. This may have happened because there wasn't sufficient space adjacent to the previous array location to create the new array. Second, the size of the underlying array is now 17, despite the JS array having length 1, which implies V8 is optimizing for additional .push() calls.

Let's add another number:


      arr.push(7);

      %DebugPrint(arr);
      000000CA196DAB51: [JSArray]
       - elements: 0x00a5fe028eb1 <FixedArray[17]> [PACKED_SMI_ELEMENTS]
           0: 5
           1: 7
           2-16: 0x010b17a80541 <the_hole>
       - length: 2

Now that the underlying array is set up to accept up to 17 elements, it was inserted without any changes.

Now what happens if we push a different type?


      arr.push('foo');

      %DebugPrint(arr);
      000000CA196DAB51: [JSArray]
       - elements: 0x00a5fe028eb1 <FixedArray[17]> [PACKED_ELEMENTS]
           0: 5
           1: 7
           2: 0x019a886ba739 <String[#3]: foo>
           3-16: 0x010b17a80541 <the_hole>
       - length: 3

At first it might be surprising that nothing significant changed, but it actually makes sense because the first two numbers are represented by 64-bits each and the string is a 64-bit address to the location in memory. So all we'd need is a map (see where that original map came in handy?) and we can easily track these elements in contiguous memory and just reference their location with the help of the map for determining the type of each element in the array.

Additionally, the array is now marked as PACKED_ELEMENTS rather than PACKED_SMI_ELEMENTS (recall the SMI stands for small integers), which means that we've lost some underlying optimizations, but still it's a contiguous array, so it's not so bad.

What happens if we try to add a new value outside the underlying array's size, say element 17?


      arr[17] = 25;

      %DebugPrint(arr);
      000000CA196DAB51: [JSArray]
       - elements: 0x0016421aa2d1 <FixedArray[43]> [HOLEY_ELEMENTS]
           0: 5
           1: 7
           2: 0x019a886ba739 <String[#3]: foo>
           3-16: 0x010b17a80541 <the_hole>
           17: 25
           18-42: 0x010b17a80541 <the_hole>
       - length: 18

This time the address didn't need to change, probably because I'm doing this in an isolated environment and not adding other variables, so this could easily change in your programs. We do also see the underlying array once again expanded. Additionally, we see the array marked as HOLEY_ELEMENTS which represents another degradation in optimization.

I want to show what happens when you add an element way outside the range of the underlying array:


      arr[10000] = 0;

      %DebugPrint(arr);
      000000CA196DAB51: [JSArray]
      - map: 0x016ee2b7ce89 <Map(DICTIONARY_ELEMENTS)>
      - elements: 0x0229ca5ceca9 <NumberDictionary[28]> [DICTIONARY_ELEMENTS]
          max_number_key: 10000
          0: 5 (data, dict_index: 0, attrs: [WEC])
          2: 0x019a886ba739 <String[#3]: foo> (data, dict_index: 0, attrs: [WEC])
          1: 7 (data, dict_index: 0, attrs: [WEC])
          17: 25 (data, dict_index: 0, attrs: [WEC])
          10000: 0 (data, dict_index: 0, attrs: [WEC])
      - length: 10001

Now we get to the title of this post: The array is no longer an array!

The elements are now represented by a NumberDictionary rather than a FixedArray. The memory size has actually shrunk, since the dictionary doesn't have all of those "holes" being tracked, and we can see only the keys that have values are set (again, no holes).

The V8 engine is predicting how we use these arrays, and it is helping to optimize for that use-case. Dictionaries, or hash maps, are optimized for arbitrarily storing and looking up values whereas arrays are optimized for iterating over the collection and random-access by index.

Takeaways

How you use an array may be internally optimized for that usage. Be careful when arbitrarily resizing arrays, favor using Typed Arrays when applicable, or at least consider pre-sizing your array with new Array(someSize). If you want to arbitrarily access values, just use an object or a map so that the intent is clear and you aren't surprised by performance degradation.

These are good things, I think, but they mean there are traps in the language. Very high-level languages like JavaScript let us write expressive code with ease at the cost of distancing ourselves from the memory and processors in our computer. With a little knowledge about the underlying engines, compilers, interpreters, virtual machines, or environments that our code is running on, we can have the best of high-level languages and low-level optimizations.

There's so much more to dig into here, but I hope this at least gets you thinking about how these very high-level JavaScript constructs are represented in a lower level, and this can allow you to question what might be happening with a particularly performance-critical piece of code. You can also test for yourself with the %DebugPrint() statement (or Mozilla/SpiderMonkey equivalent).

I didn't bore you to death?

Here's some additional resources to make sure that happens:

Array data structure from Wikipedia
Memory, Cache Locality, and why Arrays are Fast (Data Structures and Optimization) by Simondev
[V8 Deep Dives] Understanding Array Internals by Andrey Pechkurov
Elements kinds in V8 by Mathias Bynens