Question :
I have a SQL Server 2012 database that I am using to save data from processed files. We read data from a folder, process it with python and save the results into the database.
One of the first things we do in our ETL process is to check if the file has already been processed. We simply do a:
SELECT id FROM table1 WHERE basename = <basename>
If there is a result we skip the file; if there is no result we process the file. Right now this query takes ~250 ms with ~5M records. We already have a nonclustered index on the basename column.
We will see about 100-200k records added per month. We get the files in batches. So we might see 2k files, and then 2 hours later another 2k files. Some days we will get 10k files, other days we might only get 4k files.
Keeping all other variables the same, is there a rule of thumb for projecting when we might run into performance issues (queries taking longer than 1 s) with this query, other than inserting 15-20 million records into the table and seeing what happens?
Table DDL:
CREATE TABLE [dbo].[raw_records](
[id] [int] IDENTITY(1,1) NOT NULL,
[basename] [varchar](512) NULL,
[filename] [varchar](1024) NULL,
[file_size] [int] NULL,
[machine] [varchar](10) NULL,
[insert_timestamp] [datetime] NULL,
[raw_xml] [xml] NULL,
[process_status] [varchar](2048) NULL,
PRIMARY KEY CLUSTERED
(
[id] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]
Index:
CREATE NONCLUSTERED INDEX [basename_index] ON [dbo].[raw_records]
(
[basename] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
GO
This table was created long before I started working on it, so I’m assuming someone just made filename a 1024-character maximum to hold “enough”. Certainly changeable.
The files that are produced have a timestamp and unique information within the filename itself (e.g. system1_metadata_timestamp.xml); one “system” couldn’t (or at least should never) produce two files with the same timestamp.
select max(len(basename)), max(len(filename)) from dbo.raw_records;
returns basename – 143 and filename – 168. Probably a good thing to change both down to a max of 260.
process_status probably doesn’t need to be that long either, but I feel that one is reasonable to guess at, because the column is there to hold error messages from the processing phase. I ran a query on it and the max was 600 characters. We don’t typically query on that column though; it is just informational for debugging.
I’m going through the application cleaning up stuff like that. In certain places I can tighten things up, but in others I can’t do much about it unfortunately (e.g. I need to actually retrieve the XML column to extract data from it). This question simply stemmed from seeing the performance of the query in question and not wanting it to get away from me. It is the first thing that gets done for every file, so if that doesn’t perform, nothing else will either.
Answer :
With a good index present, the time taken to locate a matching row should scale roughly logarithmically, as long as you have room for the index in memory.
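A quick way to see where that logarithm currently stands is the depth of the B-tree itself. A minimal sketch, assuming the table and index names shown in the question:

-- B-tree depth per index: ~5M rows over a short key is typically only 3-4 levels deep.
SELECT i.name AS index_name, s.index_depth, s.page_count
FROM sys.dm_db_index_physical_stats(
         DB_ID(), OBJECT_ID(N'dbo.raw_records'), NULL, NULL, 'LIMITED') AS s
JOIN sys.indexes AS i
    ON i.object_id = s.object_id
   AND i.index_id  = s.index_id;

Each seek reads one page per level, so growing from 5 million to 20 million rows will typically add at most one level to the index.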
I’d make the index UNIQUE, since the basename must be unique (otherwise your workflow is invalid), and it makes the index more efficient.
CREATE UNIQUE INDEX IX_raw_records_basename
ON dbo.raw_records (basename);
Check the execution plan for the query to ensure the index is being used.
Make sure you have enough room in memory for the index, and assuming concurrency won’t be a massive problem, you should be good for a very large number of rows.
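To get a feel for how much memory that actually means, you can size the indexes from the partition stats. A sketch, again assuming the names above:

-- Approximate size of each index on the table, in MB (8 KB pages).
SELECT i.name AS index_name,
       SUM(ps.used_page_count) * 8 / 1024 AS size_mb
FROM sys.dm_db_partition_stats AS ps
JOIN sys.indexes AS i
    ON i.object_id = ps.object_id
   AND i.index_id  = ps.index_id
WHERE ps.object_id = OBJECT_ID(N'dbo.raw_records')
GROUP BY i.name;

The nonclustered index on basename doesn’t carry the XML column, so it should stay a small fraction of the table’s size and fit comfortably in the buffer pool.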
I’d reconsider the length of the basename and filename columns, since the query optimizer uses the declared length when calculating how much memory to grant the query. If, for instance, the basename column will never hold more than 20 characters but you have it defined as 512 characters, the memory grant for SELECT basename FROM dbo.raw_records; would be 25.6 times larger than actually required. Column lengths are much more important than most people realize.
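If you do shrink the columns (you measured a maximum of 143/168 and mentioned 260 as a target), something along these lines would do it. This is only a sketch; note that the basename column can’t be altered while an index still references it:

-- 260 is the target length mentioned in the question, not a requirement.
DROP INDEX basename_index ON dbo.raw_records;   -- every index referencing basename must be dropped before the column can be shrunk

ALTER TABLE dbo.raw_records ALTER COLUMN [basename] varchar(260) NULL;
ALTER TABLE dbo.raw_records ALTER COLUMN [filename] varchar(260) NULL;

-- Recreate it as the unique index suggested above.
CREATE UNIQUE INDEX IX_raw_records_basename
    ON dbo.raw_records (basename);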
You could also change the query to SELECT 1 FROM table1 WHERE basename = <basename>; that way you wouldn’t even need the id, since all you’re trying to do is verify that the row exists. Only do what you really need. It looks like the index you show in your question would work fine for that.
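In the ETL step itself the check might look like this (a sketch; the variable and its value are only placeholders for whatever your Python code passes in):

DECLARE @basename varchar(512) = 'system1_metadata_timestamp.xml';  -- placeholder value

-- Only proves existence, so it can be answered entirely from the
-- nonclustered index on basename, with no lookup into the base table.
SELECT 1
FROM dbo.raw_records
WHERE basename = @basename;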
In addition, if space is an issue you may want to consider data compression of the indexes and table. This should allow the index to fit in a smaller memory footprint. Evaluate DATA_COMPRESSION = ROW versus DATA_COMPRESSION = PAGE to see which is the best compression method for your requirements.
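You can estimate the benefit before committing to either setting with the built-in estimator (a sketch; note that on SQL Server 2012 data compression is an Enterprise Edition feature):

-- Projected size of every index on the table under PAGE compression;
-- run again with 'ROW' and compare.
EXEC sp_estimate_data_compression_savings
     @schema_name      = N'dbo',
     @object_name      = N'raw_records',
     @index_id         = NULL,
     @partition_number = NULL,
     @data_compression = N'PAGE';

-- If PAGE wins, apply it with a rebuild:
-- ALTER INDEX IX_raw_records_basename ON dbo.raw_records REBUILD WITH (DATA_COMPRESSION = PAGE);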
Treat it like what it is… a staging table. Land the data files there just long enough to complete the ETL process and then truncate the table. Just keep the file name and date in a “FileProcessLog” table so that you can maintain a record of what files were processed, when they were processed, and how long it took to complete. Keeping the actual raw files in your database long term is going to be nothing but grief… Do you really want to include terabytes of archived files in all of your nightly backups?
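A minimal sketch of what such a log table might look like; only the “FileProcessLog” name comes from the comment above, the columns are an assumption:

CREATE TABLE dbo.FileProcessLog (
    id             int IDENTITY(1,1) NOT NULL PRIMARY KEY,
    basename       varchar(260)      NOT NULL,
    processed_at   datetime          NOT NULL DEFAULT (GETDATE()),
    duration_ms    int               NULL,
    process_status varchar(2048)     NULL
);

-- The same existence check then runs against a table that stays small.
CREATE UNIQUE INDEX IX_FileProcessLog_basename
    ON dbo.FileProcessLog (basename);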
How do you get the parameter that you are querying with? It would be more effective if you could base your queries on int columns rather than varchar.
Also, if you are interested only in the existence of such a row in your table, you could change your query to
SELECT 1 FROM table1 WHERE id = <id>
In this case you don’t need to include more columns in the index, so it will use less memory and will perform better.
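If the file name is all you have to look up by, one common way to turn it into an int is a checksum column; this is a sketch of that idea, not something spelled out in the comment above:

-- CHECKSUM returns a 32-bit int; collisions are possible, so basename is still
-- compared in the WHERE clause and the int column only narrows the seek.
ALTER TABLE dbo.raw_records
    ADD basename_checksum AS CHECKSUM(basename);

CREATE INDEX IX_raw_records_basename_checksum
    ON dbo.raw_records (basename_checksum);

SELECT 1
FROM dbo.raw_records
WHERE basename_checksum = CHECKSUM('system1_metadata_timestamp.xml')
  AND basename = 'system1_metadata_timestamp.xml';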