I'm worried that I'm not knowledgeable enough about the technologies involved to formulate a sufficiently precise question. As a result, this is going to be a little open-ended, and my questions may suffer from misunderstandings.
The situation: I have a data set living in a nicely-normalized MySQL database. It is not 'big' yet (maybe half a million records), but it is expected to grow quite rapidly in the coming months. The anticipated use of this data set is to categorize user preferences by activity and some mild demographics. I.e., discover that users who live in Ohio and use the application daily prefer french fries to onion rings by a factor of two to one, so that we can, in future, display a 'want fries with that?' message to other Ohioan (Ohioite?) daily users.
The problem: my initial plan had been simply to use SQL to get the desired results; SQL is a strong skill across my team. However, I was recently pitched aggressively on using Apache Spark instead.
Now, after a scant hour or so of reading up on Spark, I find myself confused as to what exactly the advantage of going this route would be. In particular:
- Is it merely for scalability and speed, or does writing your queries in, say, Scala offer other advantages?
- Is a normalized MySQL database a reasonable data source for Spark? I ask because I previously experimented with Cassandra, which required dramatic denormalization.
- If this turns out to be worth pursuing, is it realistic that a relative noob could leverage Spark to be genuinely useful in a short period of time? SQL, CQL, and Scala skills are all strong on this team; Python skills are basically zero.
- Are there notable downsides over and above the learning overhead and the cost of some extra EC2 instances?
- Is there an accepted or suggested route for learning this stuff?
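To make the second question concrete: from my scant reading, I'd guess the Ohio example would look something like the sketch below in Spark's Scala DataFrame API, reading straight from MySQL over JDBC. All table names, column names, and connection details here are invented, and it would need the MySQL JDBC driver on the Spark classpath; please correct me if this is the wrong shape entirely.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("prefs-sketch").getOrCreate()
import spark.implicits._ // enables the $"col" column syntax

// Hypothetical helper: read one MySQL table over JDBC.
// URL, credentials, and table names are placeholders, not our real schema.
def mysqlTable(name: String) =
  spark.read.format("jdbc")
    .option("url", "jdbc:mysql://dbhost:3306/mydb")
    .option("dbtable", name)
    .option("user", "report_user")
    .option("password", "...")
    .load()

val users = mysqlTable("users")       // imagined columns: user_id, state, usage
val prefs = mysqlTable("preferences") // imagined columns: user_id, item

// Count fries vs. onion rings among Ohioan daily users.
users
  .filter($"state" === "OH" && $"usage" === "daily")
  .join(prefs, "user_id")
  .groupBy($"item")
  .count()
  .show()
```

If that's roughly right, it at least suggests the normalized schema survives intact, joins and all; part of what I'm asking is whether pushing this through Spark buys anything over running the equivalent JOIN/GROUP BY directly in MySQL.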
I realize that's a tremendous number of questions and, again, apologies for being so open-ended.