I have a database in 3 different locations with the same exact static data (nothing updates it, it was a one-time import from another system). In 2 of the 3 locations, a query similar to the following works fine (table and field names changed):
SELECT cr.E_ID , cr.R_ID , GROUP_CONCAT(csp.SP_ID) AS sp_ids , GROUP_CONCAT(sp.SP_Type) AS sp_types FROM db.TableA cr LEFT JOIN db.TableB csp ON cr.R_ID = csp.R_ID LEFT JOIN db.TableC sp ON csp.SP_ID = sp.E_ID WHERE cr.E_Id IN (12345) GROUP BY cr.E_ID, cr.R_ID;
The result for a few lines should look something like this
E_ID R_ID sp_ids sp_types 12345 7777 NULL NULL 12345 7778 333,444,555 AC,CB,LW 12345 7779 666,777,888 CB,CB,LW 12345 7780 111,222,223 AC,AC,CB 12345 7781 NULL NULL
So what’s happening in the one environment is that the result looks something like this (most of the GROUP_CONCAT() information is NULL’d out, with only a few lines having maybe 1 match
E_ID R_ID sp_ids sp_types 12345 7777 NULL NULL 12345 7778 NULL NULL 12345 7779 666 CB 12345 7780 NULL NULL 12345 7781 NULL NULL
I confirmed the table row counts match and, for matching rows, the IDs all exist. I also did things like ANALYZE TABLE to no avail.
The explain plan is showing a full index scan on the intermediary table TableB as well as a GROUP operator just before the result is returned with a “tmp table/filesort”. The other two environments show the same exact explain plan output yet they are working fine.
select_type table type possible_keys key key_len ref rows extra SIMPLE cr ref ix__E_ID ix__E_ID 5 const 12 Using index;Using temporary; Using filesort SIMPLE csp index ix__SP_ID 4 77489 Using where;Using index;Using join buffer (block nested loop) SIMPLE sp eq_ref PRIMARY,ix__E_ID PRIMARY 4 csp.SP_ID 1
I found the following global variables to be different between the “broken” and “working” environments
PARAM BROKEN ENV VAL WORKING ENV VAL aurora_lab_mode ON OFF back_log 450 900 binlog_checksum CRC32 NONE binlog_format STATEMENT ROW event_scheduler OFF ON innodb_buffer_pool_instances 4 8 innodb_buffer_pool_size 20090716160 46017806336 innodb_print_all_deadlocks OFF ON innodb_purge_batch_size 300 900 innodb_purge_threads 1 3 innodb_read_io_threads 8 16 log_bin OFF ON max_connections 2000 6000 query_cache_size 1116163072 2556593152 table_open_cache_instances 8 16 thread_cache_size 26 61
I found a work around. I added an index on cr.R_ID and csp.R_ID and after that the query results return as expected. If I drop the indexes, the query results break again.
After indexes are applied the plan looks like this
select_type table type possible_keys key key_len ref rows extra SIMPLE cr ref ix__E_ID ix__E_ID 5 const 12 Using where;Using index SIMPLE csp ref ix__R_ID ix__R_ID 4 cr.R_ID 1. Using index SIMPLE sp eq_ref PRIMARY,ix__E_ID PRIMARY 4 csp.SP_ID 1
I also noticed that switching from a LEFT JOIN to an INNER JOIN displays the correct results for rows that have data in the joined tables but since we need every row for the provided ID in the resultset I had to switch it back to LEFT JOIN.
Although I’ve found a work around, why this is happening?
It’s very unnerving that a missing index can alter the resultset of a query, not just affect the query performance. If it’s happening here I’m betting this is happening elsewhere and I just don’t know it yet.
MySQL version is Aurora 5.6.10. Please let me know what other information I can add that would be useful in answering this question.
All tables are created using ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci
I found the solution to my own problem and wanted to share here so anyone else that encounters this scenario will know the reason for the problem.
I tracked it down to the parameter group option of aurora_lab_mode. When this option was turned off in that environment the unexpected query result behavior stopped. I was able to verify this by turning it on and off again to verify the behavior.
I’ve since turned it off indefinitely.
This appears to have been fixed in Aurora 2.02 (MySQL v5.7.12) according to this fix list page