2008-08-31 00:36:11

Y'know, I've never SEEN a 3.4 GIG coredump file before...

Right, so there I was tonight, tinkering with my program to find duplicate files. I let it run on my system, 2.4TB of storage, 396,000+ files, 46,660+ directories... (or so find and wc tell me)

And I remember seeing a symlink loop somewhere in the /var/lib area in the past, so I rather expected it to get caught in that and whirr away for a while... probably abort with a malloc error or something similar.

Well, it *did* abort... but at exactly 16,777,216 items. Doesn't that number ring a bell? Yeah... exactly 16 Meg (16*1024*1024).

Kinda eerie. I've got 4GB of RAM, 6GB of swap... it ate up nearly 7GB of memory (real + virtual) before it coughed and died. So of course I had to enable core dumps and see....

<run it again!>

Yup... 16,777,216 entries, then:

terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
Aborted (core dumped)

Strange... does 'vector' have an upper limit on how many items it can hold (beyond memory/RAM limits of course)?

So whip up a little test jig, using 'vector <int> tst' just to see how many it can create before it dies...

This time... 536,870,912 items.... gee, that number's familiar too... That's *exactly* 512 Meg (512*1024*1024). Very weird.

The test jig just stored ints, 4-byte values. The original program (dedupe) stores 'md5file' classes...

Hmmm:

512 Meg items @ 4 bytes = 2 GB of storage....
16 Meg items @ ? bytes = ... 2 GB?? (this is a theory...)

Implies the Class is 128 bytes... Let's go add up things and see...

Well... drat. sizeof(md5file) says it's 88 bytes, I'm 40 bytes shy. About a third too low. Darn. And just for good measure, I verified that the test jig is using 4-byte ints too. Trying 8-byte doubles now... testing the theory again. I predict... 256 Meg items. Now to see if my theory holds true.

{jeopardy music plays softly in the background}

HEY! 268,435,456 items... 256*1024*1024... 256Meg! The theory seems to hold true...

So what's using the other 40 bytes? {Sigh} Use the Source Luke! Dang-it! I *WROTE* the source...

Not that my machine has anywhere NEAR 16 million files to tinker with... but still! Actually, considering that I can predict what it's going to do... maybe it's a feature now. {shrug} dunno.

Had an epiphany while recovering from the last coredump... Efficient memory usage / allocation is usually done by powers of 2... and since 88 > 64, then the next size up is 128. Which would explain the 128 byte per allocation usage predicted by my theory above.

So yeah, it all works. There's not a 16 Meg limit on the number of items, there's a 2GB limit on the amount of data stored.

Taa Daa!

(Not that this really explains anything, but at least now I know why it's dying at exact 'power-of-two' boundaries.)

Loni

(Edit: I figured out the 2G limit too... see next post)


Posted by lornix