
How to store images and other large objects in MongoDB using Python, pymongo and GridFS

Updated on January 5, 2013

MongoDB is a NoSQL database; that is, it does not have an SQL interface.

MongoDB can only store documents up to a certain size (the BSON document limit is 16 MB). In a way the exact size is irrelevant, because it is a certainty that, no matter what the limit is, someone will want to store a larger document. Suddenly you will have the sort of bug we all love to track down. Worse, it could impact mission-critical applications.

One solution is to split a large object into chunks and keep track of the chunks so you can reassemble the object as needed, or perhaps only fetch the part of the object you need.

MongoDB provides GridFS, which is an API and a convention implemented in the drivers for all the languages MongoDB supports; it is not provided natively by the MongoDB server. It offers the normal create, read and delete operations, though updating requires deleting and recreating a document. It handles the chunking mentioned in the last paragraph.
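To make the chunking idea concrete, here is a minimal sketch, independent of MongoDB, of what the driver does on your behalf. The helper names are hypothetical; the 255 KB chunk size matches recent pymongo drivers, while older drivers used 256 KB.

```python
CHUNK_SIZE = 255 * 1024  # default GridFS chunk size in recent pymongo drivers

def split_into_chunks(data, chunk_size=CHUNK_SIZE):
    """Split a byte string into consecutive chunks of at most chunk_size bytes."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

def reassemble(chunks):
    """Join the chunks back into the original byte string."""
    return b"".join(chunks)

data = b"x" * (600 * 1024)           # 600 KB of dummy data
chunks = split_into_chunks(data)
print(len(chunks))                   # 3 chunks: 255 KB + 255 KB + 90 KB
print(reassemble(chunks) == data)    # True
```

GridFS stores each chunk as a separate document together with a file id and a sequence number, which is how it reassembles the object on read.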

The name GridFS is misleading, because GridFS is not a sophisticated file system and does not allow easy browsing, unlike Hadoop's HDFS.

As an example, code is provided showing how to store an image in a database with an associated GridFS instance, and how to get it back and store it in a new location. The code is for example purposes only and would need changes in a production system. The code has, however, been run and tested on a MacBook Pro using PyDev and the MongoDB Python driver.

This article is a basis for research: for further information check the sources and the MongoDB site.

Basic gridFS

The steps involved, assuming you are working in Python, are:

  1. Get a database connection

    connection = pymongo.Connection("localhost", 27017)  # use pymongo.MongoClient in pymongo 3.x and later

  2. Get a database (MongoDB creates it lazily when data is first written)

    database = connection['example']

  3. Create a gridFS instance associated with the database

    fs = gridfs.GridFS(database)

  4. Store the document in the gridFS instance

    stored = fs.put(thedata)

  5. Retrieve the document

    outputdata = fs.get(stored).read()

  6. Store the document

    outfilename = "where you want to save it"

    output = open(outfilename, "wb")

    output.write(outputdata)

    output.close()

  7. Since this is experimental (toy) code, remove the database. For reference, first delete the document:

    fs.delete(stored)

    connection.drop_database('example')

The process is rather more verbose in Java.

Details

First you need to connect to a running MongoDB server. To do this you need to know the host ("localhost" here) and the port number.

Then you create a database. Here it is called "example".

The next step is to create a GridFS instance associated with the database. Normally one filesystem per database will be enough. If you want more than one filesystem, you create each additional filesystem with its own prefix, for example:

photos = gridfs.GridFS(database, "images")

Internally GridFS uses two collections per filesystem: files and chunks. The files collection holds metadata for stored objects, and the chunks collection holds the data itself. For the default filesystem the prefix is fs, so the collections are fs.files and fs.chunks, while for the photos filesystem above they are images.files and images.chunks. The chances are you will not need this information, but it is useful to know.
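As a small illustration of this layout, the collection names and the number of chunk documents a stored object occupies can be derived as follows. Both helpers are hypothetical, and the 255 KB default chunk size matches recent pymongo drivers.

```python
def gridfs_collections(prefix="fs"):
    """Return the names of the two collections a GridFS prefix uses."""
    return prefix + ".files", prefix + ".chunks"

def chunk_count(length, chunk_size=255 * 1024):
    """Number of chunk documents needed for an object of `length` bytes."""
    return -(-length // chunk_size)  # ceiling division

print(gridfs_collections())           # ('fs.files', 'fs.chunks')
print(gridfs_collections("images"))   # ('images.files', 'images.chunks')
print(chunk_count(600 * 1024))        # 3
```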

Now store the data. The put() method above returns the id of the stored document; the id is the equivalent of a primary key in an SQL database. If you will need to disconnect and reconnect, you should store the id with a human-friendly tag. put() can take extra keyword parameters to hold metadata, and the metadata can hold such a tag, for example:

theID = photos.put(anImage, filename="beach with friends")

The wrap

Storing a large object in a single MongoDB document is not possible, but the GridFS API, available for all languages supported by MongoDB, allows arbitrarily large objects to be stored in chunks. The details of this are hidden from the user. This note has given an introduction to the process using Python. For a production system, a separate collection to store the data needed to recover objects from a GridFS instance will be needed.
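A minimal sketch of such a lookup collection, assuming `fs` is a gridfs.GridFS instance and `registry` is an ordinary collection; both parameter names and both helper functions are hypothetical:

```python
def store_with_tag(fs, registry, tag, data):
    """Store raw bytes in GridFS and record the returned id under a tag."""
    file_id = fs.put(data, filename=tag)
    registry.insert_one({"tag": tag, "gridfs_id": file_id})
    return file_id

def fetch_by_tag(fs, registry, tag):
    """Look up the GridFS id recorded for a tag and read the bytes back."""
    entry = registry.find_one({"tag": tag})
    if entry is None:
        raise KeyError(tag)
    return fs.get(entry["gridfs_id"]).read()
```

In a real system registry might be something like database['images.tags'], with a unique index on the tag field so two objects cannot share one tag.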

Python sample code

import pymongo
import gridfs


if __name__ == '__main__':
   
# read in the image.   
    filename = "path to input file"
    datafile = open(filename, "rb")   # binary mode: image files are binary data
    thedata = datafile.read()
    datafile.close()

# connect to database
    
    connection = pymongo.Connection("localhost", 27017)  # use pymongo.MongoClient in pymongo 3.x and later
    database = connection['example']

# create a new gridfs object.
    fs = gridfs.GridFS(database)
    
# store the data in the database. Returns the id of the file in gridFS
    stored = fs.put(thedata, filename="testimage")

# retrieve what was just stored. 
    outputdata = fs.get(stored).read()
    


# create an output file and store the image in the output file
    outfilename = "path to output file"
    output = open(outfilename, "wb")  # binary mode again
    
    output.write(outputdata)
# close the output file    
    output.close()


# for experimental code restore to known state and close connection
    fs.delete(stored)
    connection.drop_database('example')
#    print(connection.database_names())
    connection.close()

Comments

    • devang

      3 years ago

      thanks for helpful code

    • AlexK2009 (author)

      5 years ago from Edinburgh, Scotland

      Thanks Bernd

      Surely an image is best treated as a binary file, and restricting to binary means other types of blob such as serialised objects, can be stored also. I will look into the code again at some point.

    • Bernd

      5 years ago

      Thanks for the code. Tested it with PNG and JPEG images. It worked only when reading and writing the images as binary files:

      datafile = open(filename,"rb");

      ...

      output = open(outfilename,"wb")

      Best wishes,

      Bernd
