You saved 5 cents, and your code is not readable, congrats!
I found myself reading this post, and at some point, I really wanted to cry:
We had relatively long, descriptive names in MySQL such as timeAdded or valueCached. For a small number of rows, this extra storage only amounts to a few bytes per row, but when you have 10 million rows, each with maybe 100 bytes of field names, then you quickly eat up disk space unnecessarily. 100 * 10,000,000 = ~900MB just for field names!
We cut down the names to 2-3 characters. This is a little more confusing in the code but the disk storage savings are worth it. And if you use sensible names then it isn’t that bad e.g. timeAdded -> tA. A reduction to about 15 bytes per row at 10,000,000 rows means ~140MB for field names – a massive saving.
Let me do the math for a second, okay?
A two terabyte hard drive now costs 120 USD. By my math, that makes:
- 1 TB = 60 USD
- 1 GB = 0.058 USD
In other words, that massive saving that they are talking about? 5 cents!
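As a rough check of the arithmetic above, here is a minimal Python sketch. It assumes the figures quoted in the post (10 million rows, about 100 bytes of field names per row cut to about 15) and treats 1 TB as 1024 GB and 1 GB as 1024^3 bytes. The variable names are made up for illustration.

```python
# Figures quoted in the post; the unit conventions are an assumption.
ROWS = 10_000_000
LONG_NAME_BYTES_PER_ROW = 100   # "maybe 100 bytes of field names"
SHORT_NAME_BYTES_PER_ROW = 15   # "about 15 bytes per row"
DRIVE_PRICE_USD = 120.0         # two terabyte consumer drive
DRIVE_SIZE_TB = 2

BYTES_PER_GB = 1024 ** 3
usd_per_gb = DRIVE_PRICE_USD / (DRIVE_SIZE_TB * 1024)

# Bytes no longer spent on field names after the rename.
saved_bytes = ROWS * (LONG_NAME_BYTES_PER_ROW - SHORT_NAME_BYTES_PER_ROW)
saved_gb = saved_bytes / BYTES_PER_GB
saved_usd = saved_gb * usd_per_gb

print(f"price per GB: ${usd_per_gb:.4f}")
print(f"space saved:  {saved_gb:.2f} GB")
print(f"money saved:  {saved_usd * 100:.1f} cents")
```

Under these assumptions the saving lands a little under 5 cents, which is where the headline number comes from.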
Let me do another math problem, okay?
A developer costs about 75,000 USD per year.
- (52 weeks – 2 vacation weeks) x 40 work hours = 2,000 work hours per year.
- 75,000 / 2,000 = 37.5 $ / hr
- 37.5 / 60 minutes = 62.5 cents per minute.
In other words, assuming that this change cost a single minute of developer time, the entire saving is worse than moot.
And it is going to take a lot more than one minute.
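The developer-cost side can be sketched the same way. This is only an illustration using the salary and working-hours figures above; the 5 cent storage saving is taken from the earlier estimate, and the names are hypothetical.

```python
# Figures from the post: salary, vacation weeks and hours per week.
ANNUAL_SALARY_USD = 75_000
WORK_WEEKS = 52 - 2
HOURS_PER_WEEK = 40

hours_per_year = WORK_WEEKS * HOURS_PER_WEEK       # work hours per year
usd_per_hour = ANNUAL_SALARY_USD / hours_per_year  # hourly cost
cents_per_minute = usd_per_hour / 60 * 100         # cost of one minute

# How many seconds of developer time the storage saving pays for.
STORAGE_SAVING_CENTS = 5
break_even_seconds = STORAGE_SAVING_CENTS / cents_per_minute * 60

print(f"{usd_per_hour:.2f} USD per hour, {cents_per_minute:.1f} cents per minute")
print(f"the saving pays for about {break_even_seconds:.1f} seconds of developer time")
```

In other words, the rename has to cost less than a handful of seconds of someone's attention before it breaks even.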
Update: Fixed decimal placement error in the cost per minute. Fixed mute/moot issue.
To those of you pointing out that the cost of real server storage is much higher: you are correct, of course. I am trying to make a point. Even assuming that it costs two orders of magnitude more than what I said, that is still only $5. Are you going to tell me that saving the price of a single cup of coffee is actually meaningful?
To those of you pointing out that MongoDB effectively stores the entire DB in memory. The post talked about disk size, not about memory, but even so, that is still not relevant. Mostly because MongoDB only requires indexes to fit in memory, and (presumably) indexes don't really need to store the field name per each indexed entry. If they do, then there is something very wrong with the impl.
Comments
Awesome.
Come on... be fair! That post was dated over a year ago! Hard drives back then were a lot more expensive! It would be a minute and a half of developer time, easy.
I thought that mysql didn't store the column name in the data/rows.
@Tuna - they probably renamed their fields/properties AFTER migrating to MongoDB.
@Tuna
If you read the original article, they migrated from mysql to mongodb, and in mysql they had long names, and they used shorter ones for mongodb.
Well, the idea is still stupid, but your math is wrong:
active/passive in clustered environment : assuming you are log shipping, it doubles everything
failover environment double that again ...
then you need a reproduction and integration environment
of course everything is raid10 with 2 spare
So the guy is saving at least 1.20 cents ...
If the values were small the db size might have doubled. That means half the cache size.
Update: "With our database being around 60GB..." => It was a ridiculous decision.
Oh my. This has to be the funniest tech post I've ever seen.
Factor in the hours spent trying to decipher those ridiculous names......bwa ha ha ha ha.
In my opinion such decisions should be evaluated exactly like Ayende did in this post. I remember a project where we have a big database and we spent at least 1 month x 2 developers time to optimize a set of queries to give result under a certain amount of time.
After 1 month x 2 dev time, the database performance is still a little bit slow but acceptable.
The whole problem was due to a 1 GB RAM production server with a 20 GB database. If the company had spent money on 4 GB of RAM and a decent SCSI RAID system (the server had two 7200 RPM IDE disks in SATA RAID), we could have saved:
1) money: because a SCSI raid and 4 gb of RAM actually would be less expensive than 2 months developers cost
2) time, the 2 developers could have spent time on other projects
3) maintenance nightmare, because maintaining some sort of superoptimized set of stored procedures is a nightmare
What did I learn from this story? Never try to save money by optimizing code to the bone; think first about how much it would cost you to buy new metal to run the same software.
alk.
"the entire saving is mute." - you meant moot.
good point, well made, however.
Factor in how many people are going to use the software, if there are 1000 users then you have to multiply that by 1000.
I didn't bother reading the original post :)
Also , remember , if you now buy a TB , you have to back up a TB, also u need redundancy for a TB. you then need to archive a TB and perform maintenance on a TB so while you save in development costs at the beginning you still need to pay a monkey more money to spend longer at a pc because you are using more resources ??
@Gian Maria
I've seen pretty bad things in SPs or queries, and I have the experience to say that when you have something really bad in your queries, for instance, you are not going to get an improvement by enhancing your hardware.
But, for sure, you are talking about fine-tuning.
While I agree with the general sentiment, 5c is wildly inaccurate. The drive linked to is consumer grade hardware - 15k fibre channel drives in a SAN, plus additional backup capacity, plus redundancy, plus hardware maintenance, etc, are significantly more than $60/TB.
For comparison, Amazon S3 and Azure Storage charge $1800/TB/year. At that price, this is roughly equivalent to 7 hours of developer productivity per year based on your figures. I think this will still easily be blown by a single bug caused by a typo on one of those idiotic field names.
My only thought on this was that the memory they were talking about was something to do with the indexes being stored in memory - but they didn't say anything about that so I still think the whole thing is a bit... mind boggling
$75k per year is 62.5c per minute, not 6c per minute.
1 GB of disk space therefore costs approximately 6 seconds of developer time. A full minute of developer time would buy approximately 10GB of space.
It's very clear that you know almost nothing about MongoDB and economics.
BTW You guys seem to be improving yourself at trolling in these days. Very clever. But sorry, I can't waste my time by talking with you genius Microsoft fanboys
sid3k - feel free to enlighten, you'll find the majority of us are open to being corrected =)
I'd probably do it too, but not to save disk. I'd do it to save RAM. You want to keep as much of your data as possible in memory. Once it spills over to disk, it's much more expensive (in terms of performance) to retrieve.
Or we can pretend that there are fixed budgets, and that people don't write on 2TB consumer drives, but rather have space constraints put on them by a business unit who not only does not care about code readability, but demands they fit 5lbs of crap into a 2lb bag.
Many companies use 72 or 146GB enterprise drives. They cost hundreds of dollars.
They should still fight the fight and try to make a case for usable and readable code, but don't play the "Newegg says $X so hahaha" game.
I don't get it- since when does a database store the row name in every record of the row?
There's a lot of costs you're hiding.
First of all, the HD you want to use should be a 15000 RPM enterprise level HD - which costs about 6x more.
The other costs have to do with the process of backup and storage and maintaining those logs. Do the logs need long term backup? If so, for how many years? Do the logs contain sensitive data? Add security precautions onto that storage per GB too.
Now you need to calculate redundant data - often times about 3x the cost.
Let's also consider multiple data centers if your data is going to need fast response times around the world.
All I'm pointing out is that the more requirements you need for your data, the cost tends to stack up quickly. You might disagree, but being from a background that holds sensitive financial data together with the need for nodes around the world, the costs you give are better suited for coding an at-home video game than a real world application.
However, it's still cheap ;) But don't think that just because crap sites like foursquare who can deal with an 8 hour outage and use cheap disks that other companies think the same way.
Hah! Did you mean moot?
@sid3k and @Me Again
Even if there were an economic reason to do this "optimization" (and there is not), you aren't accounting for Moore's law.
Disk space and memory price per byte will halve every 18 months, but your code will remain unreadable, and for that reason it will keep generating additional cost for years, or until you rename the fields to be readable.
Costs like introduced bugs, or longer time for developer to understand what code does.
Hence, it doesn't make ANY sense.
Related story:
I used to program in Clipper (ca. 1992/3) and while developing a particularly large DB (15 DBF files), I actually ran into stack limitations within Clipper due to fieldnames. Some DBF files had over 120 columns, and we started out with descriptive fieldnames such as BackOrderedQty and so on. As we were implementing the database, Clipper ran out of stack space that was reserved for field names - while it could hold 255 columns, it couldn't store the names of that many unless they fit into some space limit (I'll assume they reserved 1 or 2k for this).
We were forced to create a data dictionary DBF and use two-letter column names for the actual databases, with lookups back to the data dictionary when we needed to describe the field for the user (e.g. during report building). So, the databases each had field names of AA, AB, AC, etc...it was UGLY. It turned out well in the end - when we went to the next generation of the software, the data dictionary actually turned out to be useful as it had formatting data, descriptions and import/export specifics for each field.
Thankfully, the database fields didn't drive the business rules for the app we were building (custom report generator) and we didn't have to include them in the code, so this impact wasn't there - but I can definitely relate to this.
Guys - I'm sorry to cool down on the comedy here, but this is considered a good approach by MongoDB devs. The reasoning for this is that MongoDB tries to keep things in memory, and minimizing memory footprint is important here. The more records you can fit within the same memory space, the faster your queries will be, period.
and
On top of the hardware costs, there is also an ongoing payment for network load. Since the data is BSON, which sends field names along with the data like JSON does, the overhead (although small) might not be negligible at all for some scenarios (think about data with very small records, but lots of entries going back and forth).
As for the developer experience with crappy field names - mature MongoDB drivers can do the mapping for you, so you deal with proper field names in your software, and minimize network traffic and ram utilization.
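To make the mapping idea concrete, here is a minimal sketch in plain Python. It is not the API of any real MongoDB driver; the helper names are hypothetical, `timeAdded -> tA` comes from the quoted post, and `valueCached -> vC` is an assumed abbreviation.

```python
# Hypothetical table: readable names used in code -> short keys that get stored.
FIELD_MAP = {"timeAdded": "tA", "valueCached": "vC"}
REVERSE_MAP = {short: long for long, short in FIELD_MAP.items()}


def to_storage(doc: dict) -> dict:
    """Rename readable keys to their short stored form; unknown keys pass through."""
    return {FIELD_MAP.get(key, key): value for key, value in doc.items()}


def from_storage(doc: dict) -> dict:
    """Rename short stored keys back to the readable names used in code."""
    return {REVERSE_MAP.get(key, key): value for key, value in doc.items()}


record = {"timeAdded": 1288000000, "valueCached": 42}
stored = to_storage(record)
print(stored)                # short keys, as they would be written
print(from_storage(stored))  # readable keys, as the application sees them
```

The application code only ever sees the readable names; the abbreviations live in one table instead of being scattered through the codebase.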
While I agree with the general principle - the costs are a bit off.
For a db server of a decent size, you should be using fast, protected storage, something like sas/scsi 15k drives.
1 300gb 15k drive = $270. 3 disks + parity = (4*270 = 1080) and you'll probably want to double this number if you're going to be replicating the data. If you're using an enterprise level storage system, you're probably going to have even more overhead, and end up looking at two systems, so double it.
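For a sense of scale, the enterprise numbers above can be run through the same kind of estimate. This is a sketch under stated assumptions: usable capacity is taken as the three data disks only, and the 0.79 GB saving is an assumed figure derived from the quoted post (85 bytes per row over 10 million rows).

```python
# Figures from the comment above.
DRIVE_PRICE_USD = 270
DRIVE_SIZE_GB = 300
DATA_DISKS = 3
PARITY_DISKS = 1
REPLICATION_FACTOR = 2   # "double this number" for replication
SYSTEMS = 2              # "two systems, so double it"

array_usd = (DATA_DISKS + PARITY_DISKS) * DRIVE_PRICE_USD
total_usd = array_usd * REPLICATION_FACTOR * SYSTEMS

# Assumption: only the data disks count as usable space.
usable_gb = DATA_DISKS * DRIVE_SIZE_GB
usd_per_gb = total_usd / usable_gb

# Assumption: space saved by the rename, from the post's own figures.
SAVED_GB = 0.79
saving_usd = SAVED_GB * usd_per_gb

print(f"total hardware: ${total_usd}, about ${usd_per_gb:.2f} per usable GB")
print(f"value of the rename: about ${saving_usd:.2f}")
```

Even with every doubling applied, the saving stays in the range of a few dollars.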
Ken :: That is what I suspected, and suddenly that people actually have "mapping" tools for Mongo makes a bit more sense.
I have been wondering about that for a while, half of the joy for me in working with document databases is the simplicity of just dumping things into the store and not worrying about mapping and such.
I thought mapping was just adding needless complexity on top of something that was very simple, but it turns out it's to facilitate the covering up of complexity required to dance around Mongo storing as much as it can in RAM (I know indexes are in-memory, but are other aspects too?).
I've been thinking about scaling lately, isn't this a case where we're using a sledgehammer to crack a nut? The vast majority of documents/indexes surely don't need the performance of being kept in memory, and for those that are we could use some specific technology for that purpose?
FURTHER NOTE: The blog entry itself mentioned it was about DISK space, not memory so that's a bit misleading if what you are saying is the case.
Thanks for the explanation anyway though.
37.5 / 60 minutes = 6 cents per minutes? Odd math.
The thing that really really shocks me is that some developers only get 2 WEEKS VACATION? WTF!
You guys get vacations?
@Rob.. What about adding the mapping stuff underneath, so that the serialised JSON has the shorter names? Like an add-on perhaps.
I'm not talking about saving the 5c on storage, because perhaps when used in an embedded scenario, the 5c is not 5c any more. And perhaps the 5c is useless, since the storage is not expandable.
I haven't thought about it or have any plans to use a document DB in an embedded scenario, but I can see that there might just be a use for the smaller storage space, but that the names would be handled under the hood and the server would manage that internally somehow.
just my 5c..
As with everyone else, I saw your math and wanted to cry. It's 62.5 cents per minute.
Uhm, we're talking about database _field names_. MySQL does not store the field names in every record it stores, They are only stored in the table header. So the actual savings here were, what, 200 bytes?
um....check yer math dood.
if you make 6 cents per minute, times 60 minutes, you make 3.60 / hour.
that's three dollars and sixty cents per hour. that's illegal in America. Are you talking about offshore developers?
I'd like to echo Mark Richards' concern: What in the heck DB are they using that stores the name of each column in the data for each row? That doesn't even begin to make any sense!
massive save is possible in terms of time
if you are dealing with millions of rows.. retrieving "tA" instead of "t....A..." would be faster ... but surely you can normalize the tables..
When you factor in opportunity cost it's even worse...
priceless!
Thanks for your valuable contribution to rid mankind of stupidity.
Also, look at the record counts he is basing his savings on -- 10 million? How many times do you have that many records in a database? If you use a reasonable figure of 10 thousand records, then your savings drops to 0.005 cents.
OP:
Dude, the savings in doing something like that isn't in disk space, it's in time complexity. By making the computer load and process long strings, it's crunching A LOT more CPU time.
It consumes more RAM. A LOT more RAM, in fact, when it loads the whole DB into RAM so it can run fast like most DB applications need it to. A GB of RAM = $50, not 50 cents.
Further, I don't know who you are or where you work, but the average software developer only earns about $30-45k a year in Japan or Canada, and much less in China or India.
If someone decided to make that choice, it's certainly nothing to blog about.
about "mapping" tools for Mongo...
1) you lose the capability to take your database somewhere and understand the data in it if you forget your "mapping" tool. Or not ?
2) if we introduce a mapping tool in the equation aren't we making a trade-off of CPU vs. disk space (and network, ....) ?
If so why "mapping" tool doesn't just compress documents on the way to db ? Db operations would be slower, but hey we would save disk space :)
On the other hand 6GB of memory cost about 100€ and on cheap m.board I can have 24GB of memory. And isn't MongoDB distributed database - meaning I can easily have 5 servers ?
Still not convinced. But trying to learn something here....
To those talking about "mapping tools" and how using production-grade hardware would justify the re-write to cryptic tX field names...
The moment the original developer leaves the company and a new junior guy comes in to do some code maintenance and screws up his first few efforts, the business will lose confidence in the product. They'll stop requesting new features and enhancements- the product won't be able to keep up with new market demands and ideas, lost sales, etc...all because someone decided to re-write the system in some cryptic way...
@Dan
Ayende was talking about GB of disk space not RAM.
"It consumes more RAM" - why ?
If I have Person class, why would I have in memory duplicated info like "FirstName", "LastName", "BirthDate"....
I understand that I must have stored somewhere "Petar", "Dan", "Oren", ..., but FirstName is property / metadata of the class, not of every object instance of that class.
??
To the "WTF" about column names being stored against data, that is how SCHEMA-LESS databases work,
{
}
Ayende was wrong.
They will be talking about shortening field names to save RAM when using MongoDB. RAM is still horribly expensive compared to storage.
For great performance, MongoDB needs the whole dataset in RAM. Saving those few bytes per document adds up over a large dataset (the field names are repeated in every document, as they are not necessarily consistent between documents).
Your conclusion is sound despite the failure of your argument. The cost of 900MB of disk space in a HA server environment is a lot more than ~$0.06 -- you have RAID, the portion of the SAN server/etc. needed to serve it, backup, network traffic capacity, etc. Beyond that, you also have the extra RAM and CPU time it takes to handle the larger strings.
The point is, though, that all those resources are almost always cheaper than developer time -- not only the time to make the change in the code, but the additional time it will take you for future developers to understand and maintain it, as well as the additional errors that are likely to creep in (even if you catch them in your unit tests, it takes time to fix).
So I agree with you, but your argument is just a flustercluck.
Premature optimization is the root of all evil -- it's usually cheaper to add iron than development time. But not always (embedded systems, real-time systems, certain functions that get called thousands of times a second, etc. often need optimization rather than more iron).
The cost of a developer is much more than $75k / yr. Take the salary and add 30~50% related to benefits and fringe, and that's the cost to the company.
Russell: The original article talked about disk space, therefore Oren talked about disk space.
Let's be honest: no one with any traffic to speak of is hosting their databases on single $120 drives. Most of the time, you want your database to fit into RAM, if you can help it (RAM being well over 5 cents a gig). And if you are hitting disk, you probably want a good SSD -- also well over 5 cents a gig. And it's probably mirrored, since you don't want to lose your data.
And hey, there's no guarantee that you aren't paying some company monthly for hosting -- which ends up being several times more expensive than buying the hardware outright.
Really, the "5 cents" number is an exaggeration. Which is a shame, because it distracts from what was otherwise an excellent point.
Looks like whoever wrote this had no understanding of the DB being discussed before making the comment. Now they just come off looking stupid and arrogant.
Most businesses that have data on this scale are not buying consumer SATA 2 TB drives. Most likely the data is on a SAN that has mirroring and multiple redundancies. So the analysis of the cost savings here is probably quite a bit off.
Using long names that are repeated in a row violates some normal form doesn't it? They could have/should have used a 4 byte long to reference a table of longer names. Nearly the same data savings and you get to keep the long descriptive names at the price of a join.
"... the entire saving is mute."
MOOT!
I see this sort of thinking all the time - I'm a front-end web developer and some people squish their CSS files into one giant line. Yeah they save spaces, tabs, and line breaks - but it takes so long to track that down.
I mean sure, if you're Google or Facebook go right ahead and make your stuff small because it adds up - but this practice of 'overoptimizing' code for a site that gets >300 hits a week just renders it a nightmare to work on :-/
Good databases like DB2 can notice that certain values in the database repeat over and over and store the table more efficiently behind the scenes.
I love their followup (I only read a few random paragraphs). They're making 40x 2GB files for Mongo's log so it doesn't have to allocate them by itself when restarting (meaning up to 30 seconds of offline time). So they copy /dev/null into them, which makes sense. Then they say:
Note that creating the files will hog the filesystem I/O and slow everything down, but it won’t take your database offline. If you prefer, make the files on another system then transfer them over the network.
I immediately disregarded anything you said when you equated $37.50 an hour to 6 cents a minute.
You can not do math. It's 6 cents per second.
Oh sorry, I can't do math either. It's 62.5 cents a minute.
I'm glad at least someone pointed out that field names don't get duplicated in a database for every row. Thank you Pim. I was starting to worry.
The problem is NOT diskspace, it's memory. Memory is still ridiculously expensive.
It's an annoying limitation of the mongodb storage model. I'd expect them to fix it at some point, but the way it's implemented is what enables such awesome performance.
Double fail:
MySQL doesn't store the column names with every row.
"Memory is still ridiculously expensive" - especially if we are not using it wisely. I mean what kind of application needs for example 24GB of data in memory at once ? What time is typically needed for app to access let's say 80% of that bytes in RAM ?
And another point - that app must make very little income to call 24GB expensive.
Didn't the IT industry accept a long time ago that the most economical way (for most applications) is to have multiple layers of memory: from very fast and expensive CPU registers, to CPU cache, to RAM, to disk ?
Is everyone missing the fact that column names don't take up ANY extra disk space per row except me? They didn't save a damn thing.
Some facts people:
MongoDB stores data in BSON format, which means that for every record, the field names are stored with the record, just as any other schema-less document DB (or KVP store) does.
MongoDB stores data in memory-mapped files, meaning that for optimal usage you'd need to have most (if not all) of the data in RAM. This is what makes it so damn fast.
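A small illustration of why field names cost space in a schema-less store. This sketch uses compact JSON from the Python standard library as a stand-in for BSON (an assumption: the exact byte counts differ in BSON, but the names are likewise repeated in every document). The field names and values are hypothetical, loosely following the `timeAdded -> tA` example from the quoted post.

```python
import json


def encoded_size(doc: dict) -> int:
    """Size in bytes of the document serialized as compact JSON."""
    return len(json.dumps(doc, separators=(",", ":")).encode("utf-8"))


# Same values, long versus short field names (hypothetical sample data).
long_doc = {"timeAdded": 1288000000, "valueCached": 42}
short_doc = {"tA": 1288000000, "vC": 42}

# The difference is paid again for every single document.
per_doc_overhead = encoded_size(long_doc) - encoded_size(short_doc)
total_overhead_bytes = per_doc_overhead * 10_000_000

print(f"extra bytes per document: {per_doc_overhead}")
print(f"extra bytes over 10 million documents: {total_overhead_bytes}")
```

A relational table pays for the column names once, in the table definition, which is why several commenters here were confused about where the overhead comes from.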
Software Engineer :: Document databases.... read the comments
"This means things are much more flexible for future structure changes but it also means that every row records the field names."
There is the real problem. What kind of pathetic database would make such a chronically stupid design decision?
Ken is right.
The rest of you are a bunch of Ayende fan boys. My god you are like a bunch of blind deaf sheep. If Ayende proclaimed the earth was flat you would happily defend that statement.
Lance, stfu please :)
Your computation is incomplete. Data center server hard disks cost more than 120 $. Shorter variable names mean shorter build times, because they compile faster. And you forgot the saved wattage for both developer work stations and servers. Keep in mind that green computing is all the rage now. Shorter variable names = saving earth!
Just use column constants in your code and then everyone is happy?
I think that talking about price is pretty meaningless. If you have enough records that it is a problem, then you'd rather fit more in memory by making them smaller.
Once upon a time, shorter variable names did mean faster compilation or faster execution at runtime. Back in the days when a 4MHz Z80 was state of the art, anyway.
Nowadays you'll pay more (in time) for a processor cache miss than for the processing time for a longer identifier.
Even if there was a benefit (which I doubt), human time is the controlling factor, not cpu time. If you need to work half an hour late just once fixing a bug caused by misunderstanding code with cryptic names then you've undone all the potential savings.
To all those trying to argue that Oren's argument was bad because MongoDB does store the full column name in memory for each row - all you're doing is explaining why schema-less DBs are generally poorly conceived and poorly implemented.
Also, it's irrelevant to the point of his argument, which was really about misguided optimizations.
Why not just turn on page or row compression and save the space without the work?
Oh. Right. MySQL...
You can save much more money/space by not letting a developer EVER touch a production machine.
"I see this sort of thinking all the time - I'm a front-end web developer and some people squish their CSS files into one giant line. Yeah they save spaces, tabs, and line breaks - but it takes so long to track that down.
I mean sure, if you're Google or Facebook go right ahead and make your stuff small because it adds up - but this practice of 'overoptimizing' code for a site that gets >300 hits a week just renders it a nightmare to work on :-/" - innovati
The practice you are referring to is called "minifying" and there is no excuse not to do it with your .js and .css files. You simply have your development files that aren't minified for easy editing and you minify those files each time a change needs to go to your production server. It takes 2 seconds to minify .js and .css files using any number of free tools out there. It's just plain lazy and disrespectful to your clients not to.
You should also consolidate your .js and .css files whenever possible. Chrome has a wonderful auditor built into its developer tools that makes that last bit of optimization easy.
1 TB = 60 USD
1 GB = .058 USD
cost savings: 59.95 per annum,
you're confusing the new cost with the savings delta.
I don't understand why MongoDB storing everything in RAM is bad? If it makes it faster, it's a good thing right? If a SQL db would store everything in RAM and it would make it faster, I'm a happy developer.
I wanted to stab a developer in the head when she suggested heavily abbreviated field names. She came from a mainframe background when they counted bytes, I guess she used core memory or punched cards or something back then.
Thank god she retired, and I got my way.
I do like the very first row, which appeared after your publication:
'A discussion about the value of shortened field names has generated a lot of traffic to this post over the last 24 hours' :D
Ahhh, I can't resist, such nonsense. This brings back memories, all too fresh unfortunately, of silly things I've found in a legacy SQL database: somebody had stuffed an HTML document in an XML document, compressed it with Deflate64 and then stored that in the database. Seriously, what were they thinking? Well, perhaps they'd read a blog post somewhere along the same lines as the post you mention. This is right up there with one of my leads requesting all variable names use no vowels a few years back. Really?
A well designed/scalable application will run without issues on commodity hardware and disk space is really that cheap. I've seen many more availability issues caused by poor design/code than hardware failure, and let's face it hardware fails. Yeah, that's what it does. Your application better handle it gracefully, and you're definitely not going to be accomplishing that with shortened field names.
Ok, MongoDB stores data in BSON format in memory. But such old-school optimization does not feel right.
Memory is expensive in the cloud. It's even expensive when you run your own servers.
99% of the people commenting here (including the author), need to familiarize themselves with a schemaless database (yes, it's not so much like your mysql) and also MongoDB.
Say what you want about MongoDB, this isn't really something serverdensity can do anything about. There's plenty of overhead with each document in MongoDB and they utilize it at a rate most of you guys are never gonna see, ever.
I seriously hope those figures are wrong. Two weeks vacation?!?! Thank god I don't draw my salary in USD.
@foobert,
You may wish to note that Ayende is the principal committer of RavenDB, which is a schema-less database. Also, the article in question specifically mentioned that the abbreviated names were for disk space savings, not memory.
@José F. Romaniello Yes, I mean fine tuning. Buying hardware because each query is a Select * (select star) or because some queries are really bad is WRONG. What I mean is: supposing the queries are good, it is not worth trying to do fine tuning (such as forcing execution plans, etc.) only to save hardware costs.
Crazy crazy crazy. There are still some people who argue that hardware cost > software cost, especially for software optimization?
mysql and most data storage systems do not store the name in each row; the data is stored in order, with some sort of system to figure out where one set of data ends and another starts, and some sort of system to tell when one row ends and another starts...
in mysql names are only stored in the .frm file, in mysql MEMORY tables names are still stored in a .frm file, and if this new system they switched to doesn't do the same thing as a mysql MEMORY table, then it can't be any faster than mysql, normal, or MEMORY
I have seen few programmers aware of this. The old way of thinking, where economy (of space, processor time, network capacity, etc.) is the priority, is still alive. I've seen many costly hours go completely down the drain trying to debug or add new features to existing so-called optimized programs. Once, a colleague told me that he was reluctant to give his variables long explicit names because his code lines would exceed 80 characters! He had a 26 inch screen ...
More harm than good, I will not do.
Yeah, we get it, ayende can't do math. Get over it, it doesn't matter, because instead of discussing some math and moot problems would you please listen to the POINT of it?
Not only are you getting ONE developer to set it up, but the next guy to change the system. And the BA who is asking the developer how long it will take to get this and that customer values stored. And then the BI guys trying to figure out what customer did what. And then the next developer trying to refactor some code and he can't figure out what it does. That second developer who inherited the code leaves. The next guy they hire doesn't do anything for the first half year other than trying to figure out what the system is storing.
No matter how much money each of those people makes, they will spend a lot of time deciphering the intent and asking other people about what the system is doing instead of being presented with it.
Not only disk space has a value but also the information stored on it. That cannot be measured because information means different things to different people. Somebody put a money value on INTENT please and do the maths again.
Last thought: The currency is a developer's sanity, not the money a company spends on him.