
LB-167: Create schema for storing statistics in PostgreSQL #192

Merged
6 commits merged into metabrainz:master on Aug 1, 2017

Conversation

@paramsingh (Collaborator):

Create new tables with JSONB columns to store statistics.
One table each for user, artist, release and recordings. All
these tables are inside a new schema `statistics`.

Also minor changes in `db/testing.py` to bring the SQL there
out of the code and into the `admin/sql` directory.
@paramsingh (Collaborator, Author):

@alastair @mayhem Requesting review :)

One thing that I was thinking of: is it okay to use msids as primary keys, or would using a SERIAL like in the "user" table be better for performance?

@paramsingh changed the title from "Create schema for storing statistics in PostgreSQL" to "LB-167: Create schema for storing statistics in PostgreSQL" on Jun 6, 2017
@mayhem (Member) commented on Jun 6, 2017:

SERIAL is always best for such things.
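A minimal sketch of what that change might look like, assuming the table and column names used elsewhere in this PR (the constraint names here are illustrative, not the merged DDL):

```sql
-- Illustrative only: a SERIAL surrogate key instead of msid as the PK.
CREATE TABLE statistics.artist (
    id           SERIAL,        -- surrogate primary key
    msid         UUID NOT NULL, -- still unique, but no longer the PK
    name         VARCHAR,
    last_updated TIMESTAMP WITH TIME ZONE
);
ALTER TABLE statistics.artist ADD CONSTRAINT artist_stats_pkey PRIMARY KEY (id);
ALTER TABLE statistics.artist ADD CONSTRAINT artist_stats_msid_uniq UNIQUE (msid);
```

An integer key keeps the primary-key index small and makes foreign-key joins cheap, while the UNIQUE constraint on `msid` preserves the lookup the old key provided.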

@paramsingh (Collaborator, Author):

Okay, noted. I'll change it to use SERIAL then.

@alastair (Collaborator) left a comment:

a few questions to answer ❓ ❗

connection.execute('DROP TABLE IF EXISTS listen CASCADE')
connection.execute('DROP TABLE IF EXISTS listen_json CASCADE')
connection.execute('DROP TABLE IF EXISTS api_compat.token CASCADE')
connection.execute('DROP TABLE IF EXISTS api_compat.session CASCADE')
@alastair (Collaborator):

I learned recently that you can use `DROP SCHEMA ... CASCADE` to drop all tables in a schema. As we mostly have schemas here, perhaps we could just drop the "user" table and the `statistics` and `api_compat` schemas.
Are we still using the `listen` and `listen_json` tables?

@paramsingh (Collaborator, Author):

`listen` and `listen_json` aren't being used anymore, but the `PostgresListenStore` tests depend upon them.

I'll update the PR with DROP SCHEMA CASCADE for the schemas and drop the remaining tables normally.
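A sketch of what the simplified teardown could look like, assuming the schema and table names mentioned above (hypothetical, not the exact merged SQL):

```sql
-- Dropping a schema with CASCADE removes every table (and any other objects)
-- inside it, so per-table DROP TABLE statements for those schemas are unnecessary.
DROP SCHEMA IF EXISTS statistics CASCADE;
DROP SCHEMA IF EXISTS api_compat CASCADE;

-- Tables living in the default schema still need individual drops.
DROP TABLE IF EXISTS "user" CASCADE;
DROP TABLE IF EXISTS listen CASCADE;
DROP TABLE IF EXISTS listen_json CASCADE;
```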


CREATE TABLE statistics.user (
user_id INTEGER NOT NULL, -- PK and FK to "user".id
artists JSONB,
@alastair (Collaborator):

Can you give an example of what these json fields will look like? How often will you read from them or write to them? Will you want to get a specific value from in the field (and therefore require an index)?

@paramsingh (Collaborator, Author):

The plan is to use these fields to store the statistics we calculated from BigQuery. For example, in the `artists` field of `statistics.user`, we store the top artists listened to by that user over different time intervals. The JSON would be something like this, I guess:

{
    "all_time": [
        {
            "name": "example artist 1",
            "msid": "<uuid>",
            "listen_count": 87
        },
        {
            "name": "example artist 2",
            "msid": "<uuid>",
            "listen_count": 80
        }
    ],
    "last_month": [
        {
            "name": "example artist 2",
            "msid": "<uuid>",
            "listen_count": 12
        },
        {
            "name": "example artist 1",
            "msid": "<uuid>",
            "listen_count": 3
        }
    ]
}

These stats would then be used to draw the graphs on the site. I don't think we will need to query specific values inside them ever, and we'll only write to them during our regular stat calculation timings, I guess.

@alastair (Collaborator):

OK, this seems fine. I wasn't sure if it would be a better idea to split this data into separate fields.

It sounds like you're only ever going to write this field in bulk - that is, compute it all at once and then write it to the field, replacing whatever was there. Same for reading - you're always going to want to get all of the data at once in order to display it. If this is the case, then I'm fine with this way of storing the data.
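The bulk replace-on-write pattern described above could look something like this (a hypothetical sketch; the payload and literal values are illustrative, and the `ON CONFLICT` target assumes the unique constraint on `user_id` from this diff):

```sql
-- Replace a user's precomputed stats wholesale; the old blob is overwritten.
INSERT INTO statistics."user" (user_id, artists, last_updated)
VALUES (42, '{"all_time": [{"name": "example artist 1", "listen_count": 87}]}', NOW())
ON CONFLICT (user_id) DO UPDATE
    SET artists      = EXCLUDED.artists,
        last_updated = EXCLUDED.last_updated;

-- Reads fetch the whole blob at once, so no per-key JSONB (GIN) index is needed.
SELECT artists FROM statistics."user" WHERE user_id = 42;
```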

recordings JSONB,
last_updated TIMESTAMP WITH TIME ZONE
);
ALTER TABLE statistics.user ADD CONSTRAINT user_stats_user_id_uniq UNIQUE (user_id);
@alastair (Collaborator):

We don't need a UNIQUE constraint if we also have a PRIMARY KEY (a primary key is essentially a unique constraint plus an index).
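Concretely, the redundant pair could collapse to a single primary-key declaration (a sketch using the names from this diff; the constraint name is illustrative):

```sql
-- A PRIMARY KEY already creates a unique index on user_id, so a separate
-- UNIQUE constraint would only add a second, redundant index.
ALTER TABLE statistics."user" ADD CONSTRAINT user_stats_pkey PRIMARY KEY (user_id);
```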

@paramsingh (Collaborator, Author):

Ah, sorry about that, will fix.

@paramsingh (Collaborator, Author):

@alastair Used CASCADE in `drop_schema.sql` and removed the `DROP TABLE`s. Also used SERIAL for the primary keys of the entity tables, so the unique constraints are still needed and I haven't removed them (except for the `user` table).

First drop schemas with cascade and then drop the rest of the tables
@mayhem (Member) left a comment:

Ok, I'm not 100% certain that these tables are going to be flexible enough for our needs. But until we see what our needs are, it is pointless to try and hold this up. Let's proceed and fix as needed.

🍆 🍆 🍆 !!

(who says the eggplant emoji is underused?)

CREATE TABLE statistics.artist (
id SERIAL, -- PK
msid UUID NOT NULL,
name VARCHAR,
@mayhem (Member):

We have VARCHAR and TEXT fields in our schema definition. Under the hood they appear to be the same, but we should pick one and go with it. It seems that we have lots of VARCHAR and only one TEXT in api_compat. So, no action to take on this. \o/

@mayhem mayhem merged commit d218cd7 into metabrainz:master Aug 1, 2017
@paramsingh paramsingh deleted the stats-schema branch August 1, 2017 19:53