Millions record query - with and without index
Dec 31 2023 14:51
Just found out the movie page imdb database on imdb It's a large dataset that contains Movie information use PosgresSQL
Quickly import the name_basics table, curious about what it looks like
A simple table with People's names and some other information
Count the table return 12.382.531 records, but what makes me surprise is table has no primary key or index
Wow, it will take a long time to search...
Right away try to search for my favorite actor Jean Reno, think about taking a coffee in the waiting time (find one in 12 million)
SELECT *
FROM name_basics
WHERE "primaryName" = 'Jean Reno'
AND "primaryProfession" = 'actor';
The result is another surprise
1 row retrieved starting from 1 in 923 ms (execution: 908 ms, fetching: 15 ms)
Is it because the table is simple? Or postgres has some cache? Query a non-index text field in 12 million just take 1 second
What do we need index for? This is a really good performance
But I also want to see how performance go with index, so I add a simple index on primaryName field
create index "primaryName_index" on name_basics ("primaryName");
And re-try the select query
1 row retrieved starting from 1 in 34 ms (execution: 6 ms, fetching: 28 ms)
It's a huge different 900ms vs 6ms (for SQL execute), 150 times faster without index
Conclusion
The Index is magic! but it is not free, when query for index size
SELECT i.relname "Table Name",indexrelname "Index Name",
pg_size_pretty(pg_total_relation_size(relid)) As "Total Size",
pg_size_pretty(pg_relation_size(relid)) as "Table Size",
pg_size_pretty(pg_indexes_size(relid)) as "Index Size",
reltuples::bigint "Estimated table row count"
FROM pg_stat_all_indexes i JOIN pg_class c ON i.relid=c.oid
WHERE i.relname='name_basics'
It takes ~ 400MB storage (a single-column index) and also memory when processing.
Anyway, the performance it brings is no doubt.
[UPDATE] After some research, I found some tips about performance
What is a good latency for API? Generally, APIs that are considered high-performing have an average response time between 0.1 and one second. At this speed, end users will likely not experience any interruption. At around one to two seconds, users begin to notice some delay.
So take one second only for query database is a need to improve
My machine: Apple M1, Postgres 11 on docker
Delete comment
Confirm delete comment
Pham Duc Minh
Da Nang, Vietnam