Luna Part 3: Populate the Database

Empty databases are boring. In part 2 we made an empty database.

Now we fill it.

We are going to use a couple of Celery task functions outlined in tasks.py.

Here are the benefits of building this as Celery tasks:

  • The jobs can easily be distributed among several processes, even on one machine
  • The jobs can be triggered by changes on the filesystem (a future watchdog daemon)
  • The jobs can be triggered from the CLI or from web requests

Here's the basic function layout:

  • Find all the files in a given path and index them.
    • Process each individual file in its own task
      • Process specific file types (right now, only images)

An important part of Celery tasks, and really of distributed systems in general, is to pass them all the context they need as simple objects (send object IDs, not full objects; send filesystem paths, not file handles). You'll see that pattern repeated here.
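
As a quick illustration of that rule (the specific path here is made up, not from tasks.py):

# Good: a plain string that the broker can serialize and any worker can act on.
process_file_by_path.delay("/home/issac/Pictures/example.jpg")

# Bad: a live file handle is tied to this process and won't survive serialization.
# process_file_by_path.delay(open("/home/issac/Pictures/example.jpg"))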

Process a Folder

@app.task
def find_and_process_by_path(path):
    # Make sure we end up with a recursive glob pattern
    if not path.endswith("**"):
        path = os.path.join(path, "**")

    for filename in glob.iglob(path, recursive=True):
        print("Checking", filename)
        will_ignore = False
        for ignore in IGNORE:
            if re.match(ignore, filename):
                print("IGNORING", filename)
                will_ignore = True
                break

        if not will_ignore:
            print("PROCESSING", filename)
            process_file_by_path.delay(filename)

    return path

I start by globbing the given path, checking each filename against the IGNORE patterns, and then sending every file that survives into the next task.
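
The IGNORE list isn't shown in this post; it's just a list of regular expressions matched against each path with re.match. A hypothetical sketch of what it could look like (the patterns here are examples, not the real list):

# Regex patterns for paths we never want to index
IGNORE = [
    r".*/\.thumbnails/.*",   # thumbnail caches
    r".*\.DS_Store$",        # macOS folder metadata
    r".*~$",                 # editor backup files
]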

Process a File

@app.task
def process_file_by_path(path):
    print("PROCESSING ONE FILE", path)
    if not os.path.isfile(path):
        print("NOT A FILE", path)
        return

    try:
        sf = StoredFile.objects.get(content=path)
    except StoredFile.DoesNotExist:
        # If it does NOT have an entry, create one
        sf = StoredFile(
            content=path
        )
        sf.save()

    # Associate the file's owner on disk with a Django User
    uid = find_owner(path)
    user, _ = User.objects.get_or_create(username=uid)
    sf.user = user

    sf.save()

    if path.lower().endswith(('.jpg', '.jpeg')):
        process_jpeg_metadata.delay(path)

    return sf.id

This function processes a single file by path name.

First it checks that the path points to a file (and not a folder).

Next it gets, or creates, a database entry for the file.

After that it associates the file's filesystem owner, as returned by find_owner, with an existing or new Django User object (see the sketch below).

If it's a JPEG file, it queues up the next Celery task, process_jpeg_metadata.
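
find_owner isn't shown in this post. On Unix it can be as small as a stat call plus a passwd lookup; a minimal sketch of what it might look like:

import os
import pwd

def find_owner(path):
    # Look up the username of the file's owner from the uid on disk
    return pwd.getpwuid(os.stat(path).st_uid).pw_name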

Process a Photo

@app.task
def process_jpeg_metadata(path):
    # Determine if this file already has an entry in the StoredFile table
    try:
        sf = StoredFile.objects.get(content=path)
    except StoredFile.DoesNotExist:
        # If it does NOT have an entry, create one
        sf = StoredFile(
            content=path
        )
        sf.save()

    if sf.processor_metadata is None:
        sf.processor_metadata = {}

    sf.processor_metadata['jpeg_metadata_started'] = timezone.now().isoformat()

    img = Image.open(sf.content)
    exif = get_exif_data(img)
    lat, lng = get_lat_lng(img)
    if lat and lng:
        sf.location = Point(lng, lat)

    # Set "start" datetime based on EXIF formatted date
    if exif.get('DateTimeOriginal'):
        sf.start = datetime.datetime.strptime(exif['DateTimeOriginal'], "%Y:%m:%d %H:%M:%S")
    elif exif.get('DateTime'):
        sf.start = datetime.datetime.strptime(exif['DateTime'], "%Y:%m:%d %H:%M:%S")

    # Some programs use IPTC data for keywords and tags
    iptc = get_iptc_data(img)
    if iptc.get('Keywords', []):
        for tagname in iptc["Keywords"]:
            tag, _ = Tag.objects.get_or_create(name=tagname)
            sf.tags.add(tag)

    # Not all exif fields are json serializable
    # sf.metadata = exif

    sf.metadata = {
        "width": img.width,
        "height": img.height,
    }
    sf.kind = "Image"
    sf.mime_type = "image/jpeg"
    sf.save()

This task pulls some interesting things out of the EXIF and IPTC data: the location, the original capture date, and tags.
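
The helpers get_exif_data and get_lat_lng aren't shown in this post either. A minimal sketch of what they might look like, using Pillow's (semi-private) _getexif() accessor and assuming a recent Pillow where EXIF rationals convert cleanly with float():

from PIL.ExifTags import TAGS, GPSTAGS

def get_exif_data(img):
    # Map numeric EXIF tag IDs to their human-readable names
    raw = img._getexif() or {}
    return {TAGS.get(tag_id, tag_id): value for tag_id, value in raw.items()}

def _to_degrees(value):
    # Convert an EXIF (degrees, minutes, seconds) triple to a decimal float
    d, m, s = (float(x) for x in value)
    return d + m / 60.0 + s / 3600.0

def get_lat_lng(img):
    # Pull decimal latitude/longitude out of the GPSInfo block, if present
    gps = get_exif_data(img).get('GPSInfo')
    if not gps:
        return None, None
    gps = {GPSTAGS.get(k, k): v for k, v in gps.items()}
    try:
        lat = _to_degrees(gps['GPSLatitude'])
        lng = _to_degrees(gps['GPSLongitude'])
    except KeyError:
        return None, None
    # Southern and western hemispheres are stored as positive values plus a ref flag
    if gps.get('GPSLatitudeRef') == 'S':
        lat = -lat
    if gps.get('GPSLongitudeRef') == 'W':
        lng = -lng
    return lat, lng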

Put It All Together

To use these tasks to populate the database, I open up a Python shell with `python manage.py shell` and run:

from backend.tasks import *

find_and_process_by_path.delay("/home/issac/Pictures/")

In another tab I run `celery -A backend -l info worker` to start eight Celery worker processes that chew through the queue and fill up my database. This took a while, but at the end I had over five thousand photos in the database. Not a bad dataset at all.
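
If you want to watch the table fill up while the workers run, you can poll it from another Django shell (assuming the models live in backend.models):

from backend.models import StoredFile

# Total files indexed so far; re-run this as the workers churn through the queue
StoredFile.objects.count()

# How many have been through the JPEG metadata pass
StoredFile.objects.filter(kind="Image").count()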

Issac Kelly