Also, circumventing the EMR infrastructure to add extra jobs does NOT
mean you lose the ability to monitor them. Just tunnel to the jobtracker
and scheduler web pages and all the jobs show up individually --- or did
you mean something else?
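For example, a sketch of the tunnel (the key path and port are placeholders --- the jobtracker web UI listens on port 9100 on older EMR Hadoop AMIs, 50030 on vanilla Hadoop; adjust for your cluster):

```shell
# Forward the jobtracker web UI from the EMR master node to localhost.
# ~/mykey.pem and <master node IP> are placeholders; get the master's
# address from `emr --describe <job flow ID>`.
ssh -i ~/mykey.pem -N -L 9100:localhost:9100 hadoop@<master node IP>
# then browse to http://localhost:9100/ to see every job individually
```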
emr --describe j-BLAH
Additionally, by circumventing EMR's infrastructure to add extra
jobs, you lose the ability to monitor those jobs, which is one of EMR's
main benefits.
--hadoop-binary='ssh hadoop@<master node IP> hadoop'
some sort of --ssh option that tells mrjob to run hadoop over ssh, and to scp local files up to the master node.
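As a rough illustration of what such an option would have to do (none of these names exist in mrjob --- this is a hypothetical sketch of the command rewriting, not an implementation):

```python
def hadoop_over_ssh_cmd(hadoop_args, user="hadoop", host="<master node IP>"):
    """Hypothetical helper: rewrite a local `hadoop ...` invocation so it
    runs on the EMR master node over ssh instead.

    A real --ssh option would also need to scp any local input files up
    to the master before running this, which is the hard part.
    """
    return ["ssh", "%s@%s" % (user, host), "hadoop"] + list(hadoop_args)

# e.g. hadoop_over_ssh_cmd(["fs", "-ls", "/"]) produces the argv list
# ["ssh", "hadoop@<master node IP>", "hadoop", "fs", "-ls", "/"]
```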
an informative message from mrjob.tools.emr.create_job_flow that
tells us the option(s) to pass to the hadoop runner if we'd rather do
Hadoop-over-ssh than use the emr runner.
Given the time I spent getting mrjob to suck down logs over SSH, I'm
convinced that this would be a huge pain and an inappropriately large
amount of code to implement.
I think the only option that would actually need to be passed to the
job invocation would be the EMR job flow ID, which is already output
by the mrjob.tools.emr.create_job_flow script.
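Roughly, the workflow would look like this (a sketch: `my_job.py` and `input.txt` are placeholders, and the exact flag spelling varies by mrjob version --- older releases call it --emr-job-flow-id):

```shell
# Create a persistent job flow once; create_job_flow prints the new
# job flow ID on stdout.
JOB_FLOW_ID=$(python -m mrjob.tools.emr.create_job_flow)

# Reuse that job flow for subsequent runs instead of spinning up a
# new one each time.
python my_job.py -r emr --emr-job-flow-id "$JOB_FLOW_ID" input.txt
```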
Yeah, if you have a generally useful patch, it's just common courtesy to submit it back to the main branch.