So I’ve been meaning to post this for a couple months to see what you haters have to say.
It’s apparent that everyone hates the thought of storing files in a database, but what would you say if there were a cache layer between the database and public access?
Figure all of our images are stored in a well-optimized MySQL table. We’re able to maintain data integrity via InnoDB’s cascade, so that if any content (blog post, user, etc.) that has an image associated with it is deleted, all related images are removed as well. We store the image once and output the sizes we need, and if our image requirements change we don’t have to re-size everything. Add a cache layer (ie: Squid) and only hit the DB occasionally, depending on how long your cache is set.
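The cascade setup described above can be sketched like this (a minimal sketch using Python’s sqlite3 in place of MySQL/InnoDB; the table and column names are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # sqlite needs this enabled explicitly
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT)")
conn.execute("""
    CREATE TABLE images (
        id      INTEGER PRIMARY KEY,
        post_id INTEGER NOT NULL REFERENCES posts(id) ON DELETE CASCADE,
        data    BLOB NOT NULL
    )
""")
conn.execute("INSERT INTO posts (id, title) VALUES (1, 'hello')")
conn.execute("INSERT INTO images (post_id, data) VALUES (1, X'FFD8FFE0')")

# Deleting the post removes its images automatically -- no orphan rows.
conn.execute("DELETE FROM posts WHERE id = 1")
count = conn.execute("SELECT COUNT(*) FROM images").fetchone()[0]
print(count)  # 0
```

The same `ON DELETE CASCADE` clause works in InnoDB; the point is that deleting the parent row is enough to clean up every associated image row.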
So what are the cons?
The only issue I can see is the amount of space you’ll use, but FWIW you’re going to need the space somewhere (and figure you’re storing at least 2 image versions on the filesystem).
This is actually something we have in place currently at the company I’m at and we’ve had very good luck with it so far. I’ve been doing some research on scaling it up to apply it to another project and was surprised by the overwhelming agreement against DB storage for files.
store images on flat file
store data about image in database, including relative paths to the images
you can’t leverage CDNs if you store all your images in a database. There’s a reason big companies don’t do it; it’s a bad idea.
|
store images on flat file
store data about image in database, including relative paths to the images
you can’t leverage CDNs if you store all your images in a database. There’s a reason big companies don’t do it; it’s a bad idea. |
What this guy said.
Are you sure? For all intents and purposes, the Squid cache layer is returning an image. Why would a CDN not be able to cache that?
Looks like we did 32 million page views, and a large majority of them served at least one image through this system.
Storing images in the DB is a terrible idea. Enjoy your multi-TB database if you have a lot of files.
As far as I know, you can’t have a table larger than 4TB either.
store filename in DB… put files on another server.
MySQL is not scalable enough to do this for any large number of files.
Perhaps something like HBase might be worth looking into, or some other database engine that is designed for scalability over being relational. The cache layer is all well and good (although I’d recommend using Nginx or Varnish as reverse proxies over Squid, personally), but you will run into scaling issues when you have x million+ images in the database.
Just to add: the worst thing from a database performance perspective is that your row lengths won’t be equal.
|
MySQL is not scalable enough to do this for any large number of files.
Perhaps something like HBase might be worth looking into, or some other database engine that is designed for scalability over being relational. The cache layer is all well and good (although I’d recommend using Nginx or Varnish as reverse proxies over Squid, personally), but you will run into scaling issues when you have x million+ images in the database. |
Did some checking now that I’m back in the office…
Currently we’re maintaining 7.5GB of full sized images on our file system. I would expect we’ll stay right around that number unless our business expands (and at most I’d expect our numbers to double).
Current file count is somewhere between 300k and 500k (yes, the files are all pretty small).
This is all revolving content that we retrieve from a 3rd party, process, and display on our websites.
The biggest thing is maintaining the link between that row of data and the images associated with it. Storing a relative path in the DB and storing the file on the filesystem is all well and good, assuming scripts always work 100%. Deleting that relative path from the DB doesn’t always mean the file is gone from the filesystem. I’m most worried about having to loop through 400k images to match them against the data (even if it’s only once every x months as part of a cleanup).
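That cleanup pass doesn’t have to be a row-by-row comparison; a set difference between the paths the DB knows about and a single filesystem walk finds orphans in one shot. A minimal sketch (the schema, directory layout, and sqlite3 stand-in for MySQL are all illustrative assumptions):

```python
import os
import sqlite3
import tempfile

def find_orphans(conn, image_root):
    # Files on disk that no row points at: one set difference,
    # not a nested 400k-iteration loop.
    db_paths = {row[0] for row in conn.execute("SELECT rel_path FROM images")}
    disk_paths = set()
    for dirpath, _, filenames in os.walk(image_root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            disk_paths.add(os.path.relpath(full, image_root))
    return disk_paths - db_paths

# Tiny demo: two files on disk, only one tracked in the DB.
root = tempfile.mkdtemp()
for name in ("a.jpg", "b.jpg"):
    open(os.path.join(root, name), "w").close()
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE images (rel_path TEXT)")
conn.execute("INSERT INTO images VALUES ('a.jpg')")
orphans = find_orphans(conn, root)
print(orphans)  # {'b.jpg'}
```

Run as a cron job every x months, anything the function returns is safe to delete or quarantine.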
To an extent I’m playing devil’s advocate here. Eventually I’ll probably put together a proof of concept either way and we’ll beat the hell out of it with our test system and see how it handles and that’ll make our decision for us. …unless of course my sys admin tells me no.
As for using squid, that’s a sys admin decision. I know one of the new devs kept saying we should use Varnish but I know there was a reason that he preferred Squid (but I’ll be damned if I remember what it was).
It sounds like you’re already convinced that your way of storing files is the way to go, and if that’s the case, go nuts. Nobody is going to change your mind otherwise.
|
MySQL is not scalable enough to do this for any large number of files.
Perhaps something like HBase might be worth looking into, or some other database engine that is designed for scalability over being relational. The cache layer is all well and good (although I’d recommend using Nginx or Varnish as reverse proxies over Squid, personally), but you will run into scaling issues when you have x million+ images in the database. |
Came here to post this.
It’s entirely feasible, but it would be better to go with a NoSQL option.
But like everyone else is saying, a file system is significantly less complicated.
What is the actual issue you are trying to solve? Are you looking for speed? Reliability?
Data integrity.
Database = storage of information and their relationships.
Images are not information; they’re binary data that can’t be searched. Well, “can’t” is kind of a strong word: you’re not going to search for data inside an image with a SELECT statement and have it work in most databases.
Think about data dependency. You store ALL your images in your database, and tomorrow your database goes down. You have to restore ALL of that data. How long will that take? What if you stored the file path info instead? My bet is it will be a much quicker restore with just the file paths.
Another thought: you are running out of storage for your images. If they are stored in the database, you have to figure out how to split the database. If they are stored on a filesystem, you just copy the images to new storage and update the paths in the database.
Data integrity: a good programmer should be able to solve this. Some things to think about: leverage your database to keep track of whether the image exists, and set up code to handle discrepancies during off-peak hours. Set up the images on disk so they are not easily removed by users (aka: lock the images down). Set up triggers on the database to remove the file from disk when a row is deleted.
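That last trigger idea can be sketched with sqlite3, whose Python binding lets a trigger call back into application code; stock MySQL triggers can’t touch the filesystem, so there you’d need a UDF or an application-level delete hook instead. All names here are hypothetical:

```python
import os
import sqlite3
import tempfile

root = tempfile.mkdtemp()
path = os.path.join(root, "photo.jpg")
open(path, "w").close()

conn = sqlite3.connect(":memory:")
# Expose os.remove to SQL so the trigger body can delete the file.
conn.create_function("unlink_file", 1, os.remove)
conn.execute("CREATE TABLE images (id INTEGER PRIMARY KEY, abs_path TEXT)")
conn.execute("INSERT INTO images VALUES (1, ?)", (path,))
conn.execute("""
    CREATE TRIGGER images_cleanup AFTER DELETE ON images
    BEGIN
        SELECT unlink_file(OLD.abs_path);
    END
""")

# Deleting the row also removes the file from disk via the trigger.
conn.execute("DELETE FROM images WHERE id = 1")
print(os.path.exists(path))  # False
```

In practice you’d probably wrap the row delete and the unlink in one application function rather than rely on a DB-side callback, but the sketch shows the row can stay the single source of truth.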
There’s not much you can do if the image is deleted without letting the database know about it. But if you let your users do this, it’s more of a business problem than a database problem.
Sticking the images in the database just isn’t good. You have backup problems. You have bandwidth problems (caching with Squid means you had a problem and Squid is the fix; the filesystem could have done this for you). You have database overhead problems. You have query performance problems. You have DB storage problems. You have image update problems. What if someone stored something that wasn’t an image? What if it was something evil? How will you virus scan it or perform forensics on it?
Store images in the DB if you want. I wouldn’t want to support it though.
To be honest, since I know a little bit about where you work, I’d look into storing image paths in the database and then using something like memcached to really speed that lookup up.
A process could easily be written to verify database integrity and expire any incorrect cache entries. There’s no need to use something like Squid to serve the images, because the filesystem will natively cache the files in memory.
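The memcached-backed path lookup plus the expiry hook for that integrity process might look roughly like this (a sketch only: a plain dict stands in for memcached, and the schema is made up):

```python
import sqlite3

class PathCache:
    # Cache-aside lookup of image paths; swap the dict for a real
    # memcached client in production.
    def __init__(self, conn):
        self.conn = conn
        self.cache = {}

    def get_path(self, image_id):
        if image_id in self.cache:
            return self.cache[image_id]
        row = self.conn.execute(
            "SELECT rel_path FROM images WHERE id = ?", (image_id,)
        ).fetchone()
        path = row[0] if row else None
        self.cache[image_id] = path  # cache misses too, to avoid re-querying
        return path

    def expire(self, image_id):
        # The integrity-check process calls this when a row changes or vanishes.
        self.cache.pop(image_id, None)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE images (id INTEGER PRIMARY KEY, rel_path TEXT)")
conn.execute("INSERT INTO images VALUES (1, 'cats/1.jpg')")
cache = PathCache(conn)
print(cache.get_path(1))  # first call hits the DB and fills the cache
print(cache.get_path(1))  # second call is served without touching the DB
```

The web tier only ever resolves an id to a path this way; the actual bytes come straight off the filesystem, which the OS page cache keeps hot.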