How to store images and other large objects in MongoDB using Python, pymongo and GridFS
MongoDB is a NoSQL database; that is, it does not have an SQL interface.
MongoDB can only store documents up to a certain size (16 MB in current versions). In a way the exact limit is irrelevant, because no matter what it is, someone will eventually want to store a larger document. When that happens you have the sort of bug we all love to track down. Worse, it could affect mission-critical applications.
One solution is to split a large object into chunks and keep track of the chunks so you can reassemble the object as needed, or perhaps only fetch the part of the object you need.
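The idea can be sketched in plain Python, with no database involved. The helper names and the tiny chunk size used here are illustrative assumptions, not part of any MongoDB API; GridFS does the equivalent bookkeeping for you.

```python
from typing import List


def split_into_chunks(data: bytes, chunk_size: int) -> List[bytes]:
    """Split data into consecutive pieces of at most chunk_size bytes."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def reassemble(chunks: List[bytes]) -> bytes:
    """Join the chunks, in order, back into the original object."""
    return b"".join(chunks)


def read_range(chunks: List[bytes], chunk_size: int, offset: int, size: int) -> bytes:
    """Return size bytes starting at offset, touching only the chunks needed."""
    if size <= 0:
        return b""
    first = offset // chunk_size               # first chunk containing the range
    last = (offset + size - 1) // chunk_size   # last chunk containing the range
    window = b"".join(chunks[first:last + 1])
    start = offset - first * chunk_size        # position of offset inside the window
    return window[start:start + size]


if __name__ == "__main__":
    payload = bytes(range(10))
    pieces = split_into_chunks(payload, 4)
    print(len(pieces))                     # number of chunks produced
    print(reassemble(pieces) == payload)   # round trip check
    print(read_range(pieces, 4, 3, 4))     # only part of the object
```

Keeping the chunk size fixed is what makes the partial read cheap: the chunks that cover a byte range can be worked out by integer division, without scanning the whole object.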
MongoDB provides GridFS, which is both an API and a storage convention. It is implemented by the drivers for all the languages MongoDB supports rather than natively by the MongoDB server. It offers the usual create, read and delete operations; updating a stored object means deleting it and creating it again. It handles the chunking described in the previous paragraph.
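Since there is no update operation, a replacement has to be built from put() and delete(). The following is a minimal sketch, assuming fs is a gridfs.GridFS instance; the function name replace_file is hypothetical and not part of the driver.

```python
def replace_file(fs, old_id, new_data, **metadata):
    """Emulate an update: store the new version, then delete the old one.

    fs is assumed to be a gridfs.GridFS instance (anything exposing
    put() and delete() will do). Returns the id of the new file.
    """
    new_id = fs.put(new_data, **metadata)  # write the replacement first
    fs.delete(old_id)                      # only then remove the original
    return new_id
```

The new object gets a new id, so anything that remembered the old id has to be updated as well. Writing the replacement before deleting the original is a design choice in this sketch: if the put() fails, the old object is still there.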
The name GridFS is misleading: GridFS is not a sophisticated file system and, unlike Hadoop's HDFS, does not allow easy browsing.
As an example, code is provided to show how to store an image in a database with an associated GridFS instance, how to get it back, and how to save it in a new location. The code is for example purposes only and would need changes for a production system. It has, however, been run and tested on a MacBook Pro using PyDev and the Mongo Python driver.
This article is a basis for research: for further information check the sources and the MongoDB site.
The steps involved, assuming you are working in Python, are:
Get a database connection
connection = pymongo.MongoClient("localhost", 27017)
Create a database
database = connection['example']
Create a gridFS instance associated with the database
fs = gridfs.GridFS(database)
Store the document in the gridFS instance
stored = fs.put(thedata)
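Retrieve the document
outputdata = fs.get(stored).read()
Store the document
outfilename = "where you want to save it"
with open(outfilename, "wb") as output:
    output.write(outputdata)
Since this is experimental (toy) code, delete the document and remove the database
fs.delete(stored)
connection.drop_database('example')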
The process is rather more verbose in Java.
First you need to connect to a running MongoDB server. To do this you need to know the host name (“localhost” here) and the port number (27017 here).
Then you create a database. Here it is called “example”.
The next step is to create a GridFS instance. This is associated with the database. Normally one filesystem per database will be enough. If you want more than one filesystem you need to create each one with its own prefix, for example
photos = gridfs.GridFS(database, "images")
Internally GridFS uses two collections, files and chunks. The files collection holds metadata for stored objects and the chunks collection holds the data itself. For the default filesystem the prefix is fs and the collections are fs.chunks and fs.files, while for the filesystem photos the collections are images.chunks and images.files. The chances are you will not need to use this information, but it is useful to know.
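To make the naming convention concrete, here is a small sketch. Both helper names are hypothetical. The 255 KB chunk size is the default used by current drivers as far as I know, but treat it as an assumption: the value actually used is recorded per object in the chunkSize field of the files collection.

```python
# Assumed default chunk size; the real value is stored per file as "chunkSize".
ASSUMED_CHUNK_SIZE = 255 * 1024


def gridfs_collection_names(prefix="fs"):
    """Return the (files, chunks) collection names for a GridFS prefix."""
    return prefix + ".files", prefix + ".chunks"


def expected_chunk_count(length, chunk_size=ASSUMED_CHUNK_SIZE):
    """Number of chunks needed for an object of `length` bytes (ceiling division)."""
    return (length + chunk_size - 1) // chunk_size


if __name__ == "__main__":
    print(gridfs_collection_names("images"))
    print(expected_chunk_count(10 * 1024 * 1024))  # chunks for a 10 MB image
```

With a live database, a lookup such as database["images.files"].find_one({"_id": theID}) should show the metadata for one stored object, including its filename, length, chunkSize and uploadDate.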
Now store the data. The put() method above returns the id of the stored document. The id is the equivalent of a primary key in an SQL database. If you need to disconnect and reconnect later, you need to store the id together with a human-friendly tag. put() can take extra keyword parameters to hold metadata, and the metadata can hold a human-friendly tag, for example
theID = photos.put(anImage, filename="beach with friends")
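Once a filename has been attached, the object can be found again without knowing its id. The following is a minimal sketch, assuming fs is a gridfs.GridFS instance; get_last_version() is part of the pymongo gridfs API, while the wrapper name fetch_by_tag is hypothetical.

```python
def fetch_by_tag(fs, tag):
    """Return the bytes of the most recently stored file whose filename is `tag`."""
    grid_out = fs.get_last_version(filename=tag)  # newest file with this filename
    return grid_out.read()
```

For the example above this would be called as fetch_by_tag(photos, "beach with friends"). Filenames are not required to be unique, which is why the sketch asks for the last version; if no file carries the tag, the driver raises gridfs.errors.NoFile.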
Storing an arbitrarily large object in a single MongoDB document is not possible, but the GridFS API, available for all languages supported by MongoDB, allows arbitrarily large objects to be stored in chunks. The details are hidden from the user. This note has given an introduction to the process using Python. For a production system, a separate collection will be needed to hold the data required to recover objects from a GridFS instance.
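One possible shape for such a collection is sketched below. It assumes catalog is an ordinary pymongo collection (for instance database["catalog"], a name made up for this example); the function names and the tag and file_id field names are likewise hypothetical. Only insert_one() and find_one() from the driver are used.

```python
def remember(catalog, tag, file_id):
    """Record which GridFS id belongs to a human-friendly tag."""
    catalog.insert_one({"tag": tag, "file_id": file_id})


def lookup(catalog, tag):
    """Return the GridFS id stored for `tag`, or None if the tag is unknown."""
    entry = catalog.find_one({"tag": tag})
    return entry["file_id"] if entry else None
```

A typical use would be remember(catalog, "beach with friends", theID) right after the put(), and later fs.get(lookup(catalog, "beach with friends")).read() from a fresh connection. A real system would probably also want a unique index on the tag, which this sketch leaves out.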
Python sample code
import pymongo
import gridfs

if __name__ == '__main__':
    # read in the image (binary mode, since image data is not text)
    filename = "path to input file"
    with open(filename, "rb") as datafile:
        thedata = datafile.read()

    # connect to database (MongoClient replaces the older pymongo.Connection class)
    connection = pymongo.MongoClient("localhost", 27017)
    database = connection['example']

    # create a new gridfs object.
    fs = gridfs.GridFS(database)

    # store the data in the database. Returns the id of the file in gridFS
    stored = fs.put(thedata, filename="testimage")

    # retrieve what was just stored.
    outputdata = fs.get(stored).read()

    # create an output file and store the image in it;
    # the with block closes the output file
    outfilename = "path to output file"
    with open(outfilename, "wb") as output:
        output.write(outputdata)

    # for experimental code restore to known state and close connection
    fs.delete(stored)
    connection.drop_database('example')
    # print(connection.list_database_names())
    connection.close()