Luna Part 2: Repository and Django Models

Luna is what I'm calling my project to take back my photos.

You can follow along with the code here: https://github.com/issackelly/luna

The repository and the data models are the first step in any code base.

I'm not yet going to cover many of the choices that I made in the project (pipenv, dj-database-url, Celery, etc.). In this installment, I'll describe the folder structure and the basic model setup in backend/backend/models.py.

The project is a pair of small applications: a "backend" application built on Python 3 and Django 2 that does background processing and provides an API, and a "frontend" application in JavaScript that provides a UI layer.

The top level of the repository looks like this:

issac@galadon:~/Projects/personal/luna$ tree -L 3 -I "node_modules"
.
├── backend
│   ├── backend
│   │   ├── admin.py
│   │   ├── celery.py
│   │   ├── __init__.py
│   │   ├── migrations
│   │   ├── models.py
│   │   ├── settings.py
│   │   ├── tasks.py
│   │   ├── urls.py
│   │   ├── views
│   │   └── wsgi.py
│   └── manage.py
├── config
│   └── nginx.conf
├── frontend
│   ├── package.json
│   ├── package-lock.json
│   ├── public
│   │   ├── favicon.ico
│   │   ├── index.html
│   │   └── manifest.json
│   ├── README.md
│   ├── src
│   │   ├── App.css
│   │   ├── App.js
│   │   ├── App.test.js
│   │   ├── index.css
│   │   ├── index.js
│   │   ├── logo.svg
│   │   └── registerServiceWorker.js
│   └── yarn.lock
├── LICENSE
├── Pipfile
└── Pipfile.lock

8 directories, 27 files

Models

There is one model that everything else sort of latches onto. That's the StoredFile model.

It's called StoredFile instead of "File" so it doesn't collide with other things named file.

I don't like model inheritance. I think it's generally clunky and difficult to think through, and it makes querying a pain. This is why I have one StoredFile model instead of separate Video and Photo models that inherit from a StoredFile model.

Let's start with just the attributes:

StoredFile

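# Imports these fields rely on. The paths are assumed for Django 2 with
# GeoDjango and PostgreSQL; Tag and _get_upload_path are defined elsewhere
# in the project.
import uuid

from django.contrib.auth.models import User
from django.contrib.gis.db import models
from django.contrib.postgres.fields import JSONField


class StoredFile(models.Model):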
    id = models.UUIDField(primary_key=True, default=uuid.uuid4, editable=False)
    filename = models.CharField(max_length=128, db_index=True, blank=True)
    metadata = JSONField(blank=True, null=True)
    kind = models.CharField(max_length=128, blank=True, default="")
    mime_type = models.CharField(max_length=128, blank=True, default="")
    user = models.ForeignKey(User, on_delete=models.SET_NULL, blank=True, null=True)
    location = models.PointField(blank=True, null=True)
    size_bytes = models.IntegerField(default=0, db_index=True)
    related_files = models.ManyToManyField("self", blank=True)
    content = models.FileField(upload_to=_get_upload_path, max_length=1024, unique=True, db_index=True)
    content_sha = models.CharField(max_length=64, editable=False, blank=True, default="", db_index=True)
    start = models.DateTimeField(blank=True, null=True, db_index=True, help_text="For Time-Series Files")
    end = models.DateTimeField(blank=True, null=True, db_index=True, help_text="For Time-Series Files")
    created = models.DateTimeField(auto_now_add=True, editable=False, help_text="DB Insertion Time")
    modified = models.DateTimeField(auto_now=True, editable=False, help_text="DB Modification Time")
    tags = models.ManyToManyField(Tag, blank=True)
    processor_metadata = JSONField(blank=True, null=True)

I've overridden the default "id" attribute with a UUID. UUIDs are good because they are effectively unique and can be generated on either the client or the server without knowledge of the other, and without relying on the database to hand out the next ID in a sequence. They're also hard to guess. UUIDs are cool.
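As a small illustration of that property, here is a sketch using the standard library's uuid.uuid4, the same callable used as the model's default. The names client_id and server_id are only for illustration.

```python
import uuid

# Two parties generate identifiers independently, with no shared counter
# and no round trip to the database.
client_id = uuid.uuid4()
server_id = uuid.uuid4()

# Version 4 UUIDs are random, so a collision is vanishingly unlikely.
assert client_id != server_id
assert client_id.version == 4

# The canonical string form is 36 characters and round-trips cleanly.
text = str(client_id)
assert len(text) == 36
assert uuid.UUID(text) == client_id
```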

I've used a PointField from GeoDjango to store the latitude and longitude. This will let me use all of the neat GIS tools to query and display that later.

I'm storing the file itself as a Django "FileField". This is stored in the database as a path to the file, but the Django model represents it as a richer type that lets me treat it like a file handle I can read from and write to.

I'm also storing the SHA-1 hash of the file's contents. This is so I can detect duplicates or see if the contents of the file have changed.
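To make the duplicate-detection idea concrete, here is a standalone sketch that works on plain files rather than on the model. The helper names file_sha1 and find_duplicates are hypothetical and are not part of the project.

```python
import hashlib
import os
import tempfile


def file_sha1(path, chunk_size=64 * 1024):
    """Return the hex SHA-1 of a file, reading it in fixed-size chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(paths):
    """Group paths by content hash and keep only groups with 2+ members."""
    by_sha = {}
    for path in paths:
        by_sha.setdefault(file_sha1(path), []).append(path)
    return [group for group in by_sha.values() if len(group) > 1]


# Tiny demonstration with throwaway files.
with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for name, data in [("a.jpg", b"same"), ("b.jpg", b"same"), ("c.jpg", b"other")]:
        path = os.path.join(tmp, name)
        with open(path, "wb") as handle:
            handle.write(data)
        paths.append(path)
    duplicates = find_duplicates(paths)
```

Here a.jpg and b.jpg end up in the same group because their bytes are identical, while c.jpg is left out.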

I have four timestamps: two about the contents of the file (start and end) and two about the database row (created and modified).

Start and end are for files that cover a point or a span of time. A photo has a point (when it was taken).

A video has a span, with both a start and an end.

"Created" is when the record was inserted into the database.

"Modified" is when the database record for this file was last edited.
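One way to use start and end together is to ask whether a file falls inside a time window. The sketch below assumes a photo stores only start and leaves end empty, which is my reading and not something the model enforces; overlaps is a hypothetical helper.

```python
from datetime import datetime, timezone


def overlaps(start, end, window_start, window_end):
    """True if a file's time range touches the query window.

    A file with only `start` (a photo) is treated as a zero-length span.
    Files with no `start` at all never match.
    """
    if start is None:
        return False
    effective_end = end if end is not None else start
    return start <= window_end and effective_end >= window_start


noon = datetime(2018, 6, 15, 12, 0, tzinfo=timezone.utc)
one_pm = datetime(2018, 6, 15, 13, 0, tzinfo=timezone.utc)
two_pm = datetime(2018, 6, 15, 14, 0, tzinfo=timezone.utc)

photo_matches = overlaps(noon, None, noon, one_pm)      # point inside window
video_matches = overlaps(noon, two_pm, one_pm, one_pm)  # span covers window
late_matches = overlaps(two_pm, None, noon, one_pm)     # point after window
```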

There are also a couple of JSONFields to store miscellaneous records related to the file, or related to the processing of the file. Reasons for these fields will become clear later.

StoredFile Methods

There are four methods on a StoredFile right now:

    def __str__(self):
        return "{}".format(self.content.name)

The __str__ method could have been written as just "return self.content.name", but using format is a habit.

This method gives things like the Python REPL or the Django admin a single string that describes the model instance.

    def check_sha(self, *args, **kwargs):
        if self.content:
            content_sha = hashlib.sha1()
            self.content.open('rb')
            # Hash the raw bytes in chunks. Wrapping read() in str() would
            # hash the "b'...'" repr instead of the file's contents.
            for chunk in self.content.chunks():
                content_sha.update(chunk)

            if content_sha.hexdigest() != self.content_sha:
                self.content_sha = content_sha.hexdigest()

The check_sha method is a little under-developed right now, but it computes the SHA of the file's contents and stores it on the instance. In the future I might want to do something else if the hashes don't match, like store a version number, or save the old version of the file somewhere else.

    def save(self, *args, **kwargs):
        self.check_sha()
        self.size_bytes = self.content.size

        if not self.mime_type:
            self.mime_type, _ = mimetypes.guess_type(self.content.name)
            if not self.mime_type:
                self.mime_type = ""

        super(StoredFile, self).save(*args, **kwargs)

The save method does the first few bits of metadata collection: the SHA, the file size, and the MIME type.

It's likely that these will all get moved to a Celery task later, instead of living in the model's save method.
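The MIME-type fallback in save can be tried on its own. This sketch wraps the same mimetypes.guess_type call in a hypothetical helper, guess_mime_type, so the empty-string default is easy to see.

```python
import mimetypes


def guess_mime_type(name):
    """Return a MIME type string for a file name, or "" when unknown."""
    # guess_type returns a (type, encoding) pair; either may be None.
    mime_type, _encoding = mimetypes.guess_type(name)
    return mime_type or ""


jpeg = guess_mime_type("holiday/IMG_0001.jpg")
unknown = guess_mime_type("notes.unknownext")
```

The guess is based only on the file name's extension, so a mislabeled file gets the wrong type; inspecting the contents would need a different tool.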

    def serialize(self):
        return {
            "id": self.id,
            "name": self.content.name,
            "filename": self.filename,
            "metadata": self.metadata,
            "size_bytes": self.size_bytes,
            "kind": self.kind,
            "mime_type": self.mime_type,
            "related_files": [r.id for r in self.related_files.all()],
            "content_path": self.content.path,
            "content_url": "/api/v1/get_file{}".format(self.content.path),
            "content_sha": self.content_sha,
            "start": self.start.isoformat() if self.start else None,
            "end": self.end.isoformat() if self.end else None,
            "created": self.created.isoformat() if self.created else None,
            "modified": self.modified.isoformat() if self.modified else None,
            "location": [self.location[0], self.location[1]] if self.location else None,
            "tags": [t.name for t in self.tags.all()],
            "events": [e.id for e in self.events.all()],
            "processor_metadata": self.processor_metadata,
        }

The serialize method returns a plain dictionary describing a StoredFile instance, which can be turned into a simple JSON object for the API.
I find methods like this much easier to get started with than API layers like Django REST framework.
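Here is a rough sketch of the last step, turning such a dictionary into JSON with only the standard library. The record is a hand-built stand-in with made-up values, and json_default is a hypothetical fallback for the UUID, which json.dumps cannot encode by itself. Django's own JsonResponse already handles UUIDs, so this is only needed outside of it.

```python
import json
import uuid
from datetime import datetime, timezone


def json_default(value):
    """Fallback for types the json module cannot encode by itself."""
    if isinstance(value, uuid.UUID):
        return str(value)
    if isinstance(value, datetime):
        return value.isoformat()
    raise TypeError("Cannot serialize {!r}".format(value))


# A hand-built stand-in for part of what serialize() returns.
record = {
    "id": uuid.UUID("12345678-1234-5678-1234-567812345678"),
    "filename": "IMG_0001.jpg",
    "start": datetime(2018, 6, 15, 12, 0, tzinfo=timezone.utc).isoformat(),
    "end": None,
    "location": [-86.16, 39.77],
    "tags": ["vacation"],
}

payload = json.dumps(record, default=json_default, sort_keys=True)
decoded = json.loads(payload)
```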

Album

An album is a collection of StoredFile objects with some idea of permissions about who can view and edit it.

Tag

A tag is a short word or phrase that can be associated with any StoredFile. It's for categorization.

Event

An event is either a point or span of time, also used for categorization.

15th June 2018
