
LB-167: Create schema for storing statistics in PostgreSQL #192

Merged
6 commits merged into metabrainz:master on Aug 1, 2017

Conversation

@paramsingh (Collaborator):

Create new tables with JSONB columns to store statistics.
One table each for user, artist, release and recordings. All
these tables are inside a new schema `statistics`.

Also minor changes in `db/testing.py` to bring the SQL there
out of the code and into the `admin/sql` directory.
@paramsingh (Collaborator, Author):

@alastair @mayhem Requesting review :)

One thing that I was thinking of: is it okay to use msids as primary keys, or would using a SERIAL like in the "user" table be better for performance?

@paramsingh changed the title from "Create schema for storing statistics in PostgreSQL" to "LB-167: Create schema for storing statistics in PostgreSQL" on Jun 6, 2017
@mayhem (Member) commented on Jun 6, 2017:

SERIAL is always best for such things.
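A minimal sketch of what that change might look like, assuming the table and column names used elsewhere in this PR (the constraint names here are illustrative, not the merged DDL):

```sql
-- Illustrative only: a SERIAL surrogate key instead of msid as the PK.
CREATE TABLE statistics.artist (
    id           SERIAL,        -- surrogate primary key
    msid         UUID NOT NULL, -- still unique, but no longer the PK
    name         VARCHAR,
    last_updated TIMESTAMP WITH TIME ZONE
);
ALTER TABLE statistics.artist ADD CONSTRAINT artist_stats_pkey PRIMARY KEY (id);
ALTER TABLE statistics.artist ADD CONSTRAINT artist_stats_msid_uniq UNIQUE (msid);
```

An integer key keeps the primary-key index small and makes foreign-key joins cheap, while the UNIQUE constraint on `msid` preserves the lookup the old key provided.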

@paramsingh (Collaborator, Author):

Okay, noted. I'll change it to use SERIAL then.

@alastair (Collaborator) left a comment:

a few questions to answer ❓ ❗

connection.execute('DROP TABLE IF EXISTS listen CASCADE')
connection.execute('DROP TABLE IF EXISTS listen_json CASCADE')
connection.execute('DROP TABLE IF EXISTS api_compat.token CASCADE')
connection.execute('DROP TABLE IF EXISTS api_compat.session CASCADE')
@alastair (Collaborator):

I learned recently that you can use `DROP SCHEMA ... CASCADE` to drop all tables in a schema. As we mostly have schemas here, perhaps we could just drop the "user" table and the `statistics` and `api_compat` schemas.
Are we still using the `listen` and `listen_json` tables?

@paramsingh (Collaborator, Author):

`listen` and `listen_json` aren't being used anymore, but the `PostgresListenStore` tests depend upon them.

I'll update the PR with DROP SCHEMA CASCADE for the schemas and drop the remaining tables normally.
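A sketch of what the simplified teardown could look like, assuming the schema and table names mentioned above (hypothetical, not the exact merged SQL):

```sql
-- Dropping a schema with CASCADE removes every table (and any other objects)
-- inside it, so per-table DROP TABLE statements for those schemas are unnecessary.
DROP SCHEMA IF EXISTS statistics CASCADE;
DROP SCHEMA IF EXISTS api_compat CASCADE;

-- Tables living in the default schema still need individual drops.
DROP TABLE IF EXISTS "user" CASCADE;
DROP TABLE IF EXISTS listen CASCADE;
DROP TABLE IF EXISTS listen_json CASCADE;
```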


CREATE TABLE statistics.user (
user_id INTEGER NOT NULL, -- PK and FK to "user".id
artists JSONB,
@alastair (Collaborator):

Can you give an example of what these json fields will look like? How often will you read from them or write to them? Will you want to get a specific value from in the field (and therefore require an index)?

@paramsingh (Collaborator, Author):

The plan is to use these fields to store the statistics we calculated from BigQuery. For example, in the `artists` field of `statistics.user`, we store the top artists listened to by that user over different time intervals. The JSON would be something like this, I guess:

{
    "all_time": [
        {
            "name": "example artist 1",
            "msid": "<uuid>",
            "listen_count": 87
        },
        {
            "name": "example artist 2",
            "msid": "<uuid>",
            "listen_count": 80
        }
    ],
    "last_month": [
        {
            "name": "example artist 2",
            "msid": "<uuid>",
            "listen_count": 12
        },
        {
            "name": "example artist 1",
            "msid": "<uuid>",
            "listen_count": 3
        }
    ]
}

These stats would then be used to draw the graphs on the site. I don't think we will need to query specific values inside them ever, and we'll only write to them during our regular stat calculation timings, I guess.

@alastair (Collaborator):

OK, this seems fine. I wasn't sure if it would be a better idea to split this data into separate fields.

It sounds like you're only ever going to write this field in bulk - that is, compute it all at once and then write it to the field, replacing whatever was there. Same for reading - you're always going to want to get all of the data at once in order to display it. If this is the case, then I'm fine with this way of storing the data.
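The bulk replace-on-write pattern described above could look something like this (a hypothetical sketch; the payload and literal values are illustrative, and the `ON CONFLICT` target assumes the unique constraint on `user_id` from this diff):

```sql
-- Replace a user's precomputed stats wholesale; the old blob is overwritten.
INSERT INTO statistics."user" (user_id, artists, last_updated)
VALUES (42, '{"all_time": [{"name": "example artist 1", "listen_count": 87}]}', NOW())
ON CONFLICT (user_id) DO UPDATE
    SET artists      = EXCLUDED.artists,
        last_updated = EXCLUDED.last_updated;

-- Reads fetch the whole blob at once, so no per-key JSONB (GIN) index is needed.
SELECT artists FROM statistics."user" WHERE user_id = 42;
```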

recordings JSONB,
last_updated TIMESTAMP WITH TIME ZONE
);
ALTER TABLE statistics.user ADD CONSTRAINT user_stats_user_id_uniq UNIQUE (user_id);
@alastair (Collaborator):

We don't need a UNIQUE constraint if we also have a PRIMARY KEY (a primary key is essentially a unique constraint plus an index).
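Concretely, the redundant pair could collapse to a single primary-key declaration (a sketch using the names from this diff; the constraint name is illustrative):

```sql
-- A PRIMARY KEY already creates a unique index on user_id, so a separate
-- UNIQUE constraint would only add a second, redundant index.
ALTER TABLE statistics."user" ADD CONSTRAINT user_stats_pkey PRIMARY KEY (user_id);
```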

@paramsingh (Collaborator, Author):

Ah, sorry about that, will fix.

@paramsingh (Collaborator, Author):

@alastair Used CASCADE in `drop_schema.sql` and removed the `DROP TABLE`s. Also used SERIAL for the primary keys of the entity tables, so the unique constraints are still needed and I haven't removed them (except for the `user` table).

First drop schemas with cascade and then drop the rest of the tables
@mayhem (Member) left a comment:

Ok, I'm not 100% certain that these tables are going to be flexible enough for our needs. But until we see what our needs are, it is pointless to try and hold this up. Let's proceed and fix as needed.

🍆 🍆 🍆 !!

(who says the eggplant emoji is underused?)

CREATE TABLE statistics.artist (
id SERIAL, -- PK
msid UUID NOT NULL,
name VARCHAR,
@mayhem (Member):

We have VARCHAR and TEXT fields in our schema definition. Under the hood they appear to be the same, but we should pick one and go with it. It seems that we have lots of VARCHAR and only one TEXT in api_compat. So, no action to take on this. \o/

@mayhem mayhem merged commit d218cd7 into metabrainz:master Aug 1, 2017
@paramsingh paramsingh deleted the stats-schema branch August 1, 2017 19:53