Right, so there I was tonight, tinkering with my program to find duplicate files. I let it run on my system, 2.4TB of storage, 396,000+ files, 46,660+ directories... (or so find and wc tell me)
And, I remember seeing a sym-link loop in there somewhere in the /var/lib area in the past, so I did rather expect it to get caught and whirr away for a while on that... probably abort with a memory malloc error or something similar.
Well, it *did* abort... but at exactly 16,777,216 items. Doesn't that number ring a bell? Yeah... exactly 16 Meg (16*1024*1024).
Kinda eerie. I've got 4GB of RAM and 6GB of swap... it ate up nearly 7GB of memory (real plus virtual) before it coughed and died. So of course I had to enable core dumps and see....
<run it again!>
Yup... 16,777,216 entries, then:
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
Aborted (core dumped)
Strange... does 'vector' have an upper limit on how many it can index (beyond memory/ram limits of course)?
So whip up a little test jig, using 'vector <int> tst' just to see how many it can create before it dies...
This time... 536,870,912 items.... gee, that number's familiar too... That's *exactly* 512 Meg (512*1024*1024). Very weird.
The test jig just stored 'int's, 4 byte values. The original program (dedupe), stores 'md5file' classes...
Hmmm:
512 Meg items @ 4 bytes = 2 GB of storage....
16 Meg items @ ? bytes = ... 2 GB?? (this is a theory...)
Implies the Class is 128 bytes... Let's go add up things and see...
Well... drat. sizeof(md5file) says it's 88 bytes, so I'm 40 bytes shy. Nearly a third too low. Darn. And just for good measure, I verified that the test jig is using 4-byte ints too. Trying 8-byte doubles now... testing the theory again. I predict... 256 Meg items. Now to see if my theory holds true.
{jeopardy music plays softly in the background}
HEY! 268,435,456 items... 256*1024*1024... 256Meg! The theory seems to hold true...
So what's using the other 40 bytes? {Sigh} Use the Source Luke! Dang-it! I *WROTE* the source...
Not that my machine has anywhere NEAR 16 million files to tinker with... but still! Actually, considering that I can predict what it's going to do... maybe it's a feature now. {shrug} dunno.
Had an epiphany while recovering from the last coredump... Efficient memory usage / allocation is usually done by powers of 2... and since 88 > 64, then the next size up is 128. Which would explain the 128 byte per allocation usage predicted by my theory above.
So yeah, it all works. There's not a 16Meg limit to number of items, there's a 2GB limit on the amount of data stored.
Taa Daa!
(Not that this really explains anything, but at least now I know why it's dying at exact 'power-of-two' boundaries.)
Loni
(Edit: I figured out the 2G limit too... see next post)