I have been searching for an answer to this, but can’t seem to find anything. So my problem is this – we have an environment with MS SQL Server 2008, MySQL, and RedShift, and have some complex dataflows between the databases. Right now, the scheduling is done through independent systems, but I want to have one scheduler that controls the dataflows from beginning-to-end, and is able to script flows from MS SQL to RedShift, etc. Is there a system that can accomplish this already? I’m not a DBA, so I am guessing someone has had this problem before…
Thanks in advance!
EDIT: So one of our dataflows might look like this – file posted on SFTP –> run normal ETL routines –>
compile final complete file –> send to customer/push to S3 –> Run SQL commands on Redshift to load* –> Nightly batch processing on RedShift* –> Unload to S3* –> Load into MySQL*
*These are manually run using a tool that just connects via jdbc (can’t remember the program)
My DB-related experience is very light, so I was about to write some python scripts and schedule them in CRON, but that is custom and hard to expand – surely someone has had this problem before. We would like to be able to see a status of the job in one place, create new dataflows/ETL’s between all three systems (like an SSIS job).
What you are looking for is an Enterprise Job Scheduler. Computer Associates has a good one called AutoSys. There’s also JAMS, OpCon, Tivoli as well as others. There’s a decent list available on wikipedia. I’ve only had experience with AutoSys, and it’s expensive, but VERY powerful, but it takes some work to get setup. I don’t know the cost or complexity of the other options.
These sort of schedulers allow you to schedule commands to run on just about any OS or application that the scheduler supports without using the native schedulers (so you don’t need SQL Agent or anything).
SQL Server Agent Jobs are a good way to do this. The scheduling, logging, and error behavior are effective, and when something goes wrong I can usually figure out reasonably quickly what happened. You can assemble multiple steps in a job, and make later steps dependent on the completion of earlier steps.
You can execute stored procedures or SSIS packages in SQL Server, or execute batch files and command lines to steer MySQL and RedShift. If you don’t want this inside SQL Server, you can retreat to using Windows Scheduled Tasks, but the logging and error handling are inferior.
An SSIS package should be able to connect to MySQL and execute SQL commands there. I do not know about RedShift, but if it’ll talk ODBC you should be able to do that as well. SSIS has a built-in widget for FTP, but apparently not SFTP – can you do that off a command line?
It sounds like you should be able to do most of this within SSIS. There’s a bit of a learning curve (read: headache), and it’s very particular about data typing, but it runs much faster than most custom coding once you get it going.