Millions record query - with and without index

postgres performance database

Dec 31 2023 14:51

Just found out the movie page imdb database on imdb It's a large dataset that contains Movie information use PosgresSQL

Quickly import the name_basics table, curious about what it looks like

image

A simple table with People's names and some other information

Count the table return 12.382.531 records, but what makes me surprise is table has no primary key or index

Wow, it will take a long time to search...

Right away try to search for my favorite actor Jean Reno, think about taking a coffee in the waiting time (find one in 12 million)

SELECT *
    FROM name_basics
WHERE "primaryName" = 'Jean Reno'
  AND "primaryProfession" = 'actor';

The result is another surprise

1 row retrieved starting from 1 in 923 ms (execution: 908 ms, fetching: 15 ms)

Is it because the table is simple? Or postgres has some cache? Query a non-index text field in 12 million just take 1 second

What do we need index for? This is a really good performance

But I also want to see how performance go with index, so I add a simple index on primaryName field

create index "primaryName_index" on name_basics ("primaryName");

And re-try the select query

1 row retrieved starting from 1 in 34 ms (execution: 6 ms, fetching: 28 ms)

image

It's a huge different 900ms vs 6ms (for SQL execute), 150 times faster without index

Conclusion

The Index is magic! but it is not free, when query for index size

SELECT i.relname "Table Name",indexrelname "Index Name",
 pg_size_pretty(pg_total_relation_size(relid)) As "Total Size",
 pg_size_pretty(pg_relation_size(relid)) as "Table Size",
 pg_size_pretty(pg_indexes_size(relid)) as "Index Size",
 reltuples::bigint "Estimated table row count"
 FROM pg_stat_all_indexes i JOIN pg_class c ON i.relid=c.oid
 WHERE i.relname='name_basics'

image

It takes ~ 400MB storage (a single-column index) and also memory when processing.

Anyway, the performance it brings is no doubt.

[UPDATE] After some research, I found some tips about performance

What is a good latency for API? Generally, APIs that are considered high-performing have an average response time between 0.1 and one second. At this speed, end users will likely not experience any interruption. At around one to two seconds, users begin to notice some delay.

So take one second only for query database is a need to improve

My machine: Apple M1, Postgres 11 on docker

me

Pham Duc Minh

Da Nang, Vietnam