Fetching rows from a million-row table (Optimization)

Question :

I have millions of rows in a table of Greenplum database out of which I have to fetch around 45k rows and store them in a python List.

It’s taking more than 2hrs to fetch the data. How can I optimize the time taken to fetch data?

resultList = []
for item in id_list:                 # one query per id -- this loop is the slow part
    cursor.execute(
        "SELECT column_1, ... column_n FROM table WHERE column = %s",
        (item,))
    resultList.extend(cursor.fetchall())

Answer :

To do anything takes a finite amount of time. 
To do the same thing 45K times, takes [at least] 45K times as long. 
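To put rough numbers on that: two hours spread over ~45K single-row lookups works out to a fixed per-query cost, which is round-trip overhead, not data volume (a back-of-the-envelope calculation, assuming the times stated in the question):

```python
# Back-of-the-envelope: where do the 2 hours go?
total_ms = 2 * 60 * 60 * 1000     # 2 hours in milliseconds
queries = 45_000                  # one query per id
per_query_ms = total_ms // queries
print(per_query_ms)               # ~160 ms of overhead per round trip
```

Each individual lookup is fast; it is paying that fixed cost 45,000 times that kills you.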

This is a classic N+1 query:
retrieve a list of ids, then run another query for each id in turn.

  • It runs like a slug.
  • It’s completely unscalable.
  • It’s completely untunable. Indeed, your database will be thoroughly enjoying itself, being asked to retrieve a single record based on a single value that it can find in a unique index!

Rewrite your code to retrieve the whole set of data in a single query, then loop through the rows within it to process them. In this case, simply select the data you want, for all the rows you want, at once. I assume you used another query to retrieve the values in your “list”: use the WHERE clause from that query to retrieve [all] the data you need:


cursor.execute("""
    SELECT column_1, ... column_n
    FROM table
    WHERE <condition-you-originally-used-to-get-the-list-of-ids>
    ORDER BY column_1""")
resultList = cursor.fetchall()
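If the list of ids really does originate in Python rather than in a prior query, the same set-based idea still applies: send the whole list in one statement. A minimal, self-contained sketch, using the stdlib sqlite3 module purely for illustration (against Greenplum you would use a PostgreSQL driver such as psycopg2, where the idiomatic form is `WHERE column = ANY(%s)`); the table and column names here are invented:

```python
import sqlite3

# Toy stand-in for the big table: 100,000 rows in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?)",
                 [(i, f"row-{i}") for i in range(100_000)])

wanted_ids = list(range(0, 100_000, 1000))   # stand-in for the list of ids

# One round trip instead of len(wanted_ids) separate queries.
placeholders = ",".join("?" * len(wanted_ids))
cur = conn.execute(
    f"SELECT id, payload FROM t WHERE id IN ({placeholders}) ORDER BY id",
    wanted_ids,
)
resultList = cur.fetchall()
print(len(resultList))   # one list, fetched in a single statement
```

The database gets to plan one set-based retrieval instead of being hit with thousands of single-row lookups, which is exactly the rewrite the answer describes.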

No; it’s not very Object Oriented.

It’s Set Based and, for Relational Databases, there’s nothing better.
