Learning Machine Learning

Piloting a new practice at the height of ML mania

Michael Mastrangelo

Machine learning (ML) is everywhere these days. Although the field has been around for decades, ML mania has reached a fever pitch now that some of the largest companies (Google, Amazon, Microsoft, IBM) and several smaller and open-source players (Elastic, Apache, H2O.ai, Databricks, Splunk) have commoditized ML’s potential into hundreds of competing analytical platforms and solutions. The buzz gives companies like Brillient, a fast-growing intelligent-solutions company, the opportunity to work alongside these larger players: Machine Learning as a Service (MLaaS) products help government solve some of its biggest challenges with data, predictive analytics, and modernization. According to a 2018 Market Research Engine report, the ML market is expected to grow to nearly $40 billion by 2025, with MLaaS predicted to reach $3.75 billion by 2024. That MLaaS segment is where companies like Brillient, who know their clients’ business closely and have strong working relationships with government, can pitch innovative solutions to problems that neither the government nor its consultants previously thought were solvable.

Piloting ML at Brillient

Over the 2018–2019 government shutdown, my team of database engineers and metadata specialists ran an internally funded research-and-development pilot with three goals: learn ML, test ML hypotheses against open government data, and survey the market to see who is doing interesting things in ML and how those products are being marketed. Over those five weeks we learned, experimented, engineered, re-engineered, and evaluated our ML approaches and understanding, each day deepening our knowledge while also being amazed at how large the field is. Here are some of our biggest takeaways and how they affect companies like Brillient that want to market MLaaS to their clients.

Don’t pick a winner, learn the science

Some companies may be tempted to buy a lot of seats on a particular ML platform and get to know it very well. I’d caution against getting too invested in a single platform. There are so many players in the space right now that it’s hard to say who will end up on top, and although the current trend is specialization, whichever platform does dominate will likely take over the whole space. Making a heavy investment in a platform now is as risky as building a marketing strategy around MySpace back in 2005: it would have made sense in the short term, but left you high and dry by 2010.

The big companies in the space are all making huge plays to sell large software contracts to the government, so be ready to work with any client-chosen platform, or even several at once. Python and R will be a part of any solution, but so will GUIs, data dashboards, and of course data-cleanup techniques, which may mean legacy Java, Perl, or even Bash utilities like awk and sed (a minimal Python equivalent is sketched below). When it comes to data, especially government data, nothing is too legacy to consider.
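
To make that concrete, here is a minimal sketch of the kind of awk/sed-style cleanup pass redone in Python. The pipe-delimited layout, file names, and missing-value markers are hypothetical; the pattern of normalizing fields before any modeling happens is the point.

    import csv
    import re

    def clean_row(row):
        """Trim whitespace, strip non-printable bytes, and normalize
        common missing-value markers in each field."""
        cleaned = []
        for field in row:
            field = re.sub(r"[^\x20-\x7E]", "", field).strip()  # drop stray control bytes
            if field in ("", "NULL", "N/A"):
                field = None  # normalize missing values
            cleaned.append(field)
        return cleaned

    # Hypothetical legacy pipe-delimited dump in, clean CSV out.
    with open("legacy_extract.txt", newline="") as src, \
         open("cleaned.csv", "w", newline="") as dst:
        reader = csv.reader(src, delimiter="|")
        writer = csv.writer(dst)
        for row in reader:
            writer.writerow(clean_row(row))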

For our pilot we did learn Microsoft’s Azure Machine Learning Studio, but we also trained in TensorFlow and used Python and R modules in our models. My goal was a heterogeneous skill set: acquire a high-level understanding and be able to port those skills to solving problems with ML. Obviously, if a client had Azure we’d do great, but if they were invested in Amazon SageMaker we could jump into that too, because we understand the overall practice of good ML, which is the practice of good science. Try to build your skills and solutions to be platform agnostic. The team learned Azure in under a week, and the Python community of practice is so big that there is likely no issue you’ll face that isn’t documented somewhere with a full solution, or a Python library to solve it.
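
As a sketch of what platform agnostic means in practice, here is a tiny classifier written in plain TensorFlow/Keras against scikit-learn’s bundled Iris data. Nothing in it depends on Azure or SageMaker; the same script runs on a laptop or inside either platform’s training environment. The dataset and layer sizes are illustrative, not taken from our pilot.

    import tensorflow as tf
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split

    # A small, portable classifier: no platform-specific calls anywhere.
    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)

    model = tf.keras.Sequential([
        tf.keras.Input(shape=(4,)),
        tf.keras.layers.Dense(16, activation="relu"),
        tf.keras.layers.Dense(3, activation="softmax"),  # one score per species
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    model.fit(X_train, y_train, epochs=50, verbose=0)
    print("test accuracy:", model.evaluate(X_test, y_test, verbose=0)[1])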

Don’t move that data

A huge drain on data scientists and enterprise IT is data movement and replication. Whatever solution you are part of, try to assure the client that you’ll work with the data where it is, rather than migrating, templating, or transforming it into another datastore. If you looked at the IRS’s roster of datastores, you’d be baffled by the number of copies, technologies, and dataflows across the enterprise. Don’t add another line to that list. Tap into where the data is and you’ll deliver a solution far faster than if you had to do a migration. If you’re thinking cloud, consider cloudish (a cloud platform installed on on-premises servers) to get around strict government security protocols. If you’re thinking that the client’s datastore is too legacy to work with your solution, then go with another solution. Data is far more valuable than your ML solution, so you must meet data where it lies and on data’s terms.
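
As an illustration of tapping data in place, the sketch below reads features straight out of an existing relational store with pandas and SQLAlchemy instead of standing up a new datastore. The connection string, table, and column names are entirely hypothetical.

    import pandas as pd
    from sqlalchemy import create_engine

    # Hypothetical read-only connection to the client's existing database --
    # no copies, no migration, no new datastore.
    engine = create_engine(
        "postgresql+psycopg2://readonly_user:example@legacy-host:5432/case_data")

    # Pull only the columns the model needs, where the data already lives.
    query = """
        SELECT case_id, filing_year, region_code, outcome
        FROM case_history
        WHERE filing_year >= 2015
    """
    df = pd.read_sql(query, engine)  # features arrive as a DataFrame, in place
    print(df.describe())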

That’s not to say that you’ll have a storage-free solution. Your feature-engineering stage will need near-line storage, and you’ll of course need plenty of model storage if you’ll be versioning the weights and biases of neural networks. Also, be prepared to store your output and logging in legacy systems.
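
For the model-storage point, here is one way that weight versioning might look, using Keras’s ModelCheckpoint callback on a throwaway model. The paths and toy data are purely illustrative; the takeaway is that every epoch can leave a versioned weights file behind, and that storage adds up.

    import os
    import numpy as np
    import tensorflow as tf

    # Toy model and random data purely to exercise the checkpoint callback.
    X = np.random.rand(200, 8).astype("float32")
    y = np.random.randint(0, 2, size=(200,))

    model = tf.keras.Sequential([
        tf.keras.Input(shape=(8,)),
        tf.keras.layers.Dense(8, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy")

    # One versioned weights file per epoch; only weights and biases are saved.
    os.makedirs("checkpoints", exist_ok=True)
    checkpoint = tf.keras.callbacks.ModelCheckpoint(
        filepath="checkpoints/model_epoch{epoch:02d}.weights.h5",
        save_weights_only=True,
        save_freq="epoch")

    model.fit(X, y, epochs=5, callbacks=[checkpoint], verbose=0)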

Machines learn, but you may not

Training an ML model is not the same as investigating a problem. Once the model is trained and generalizes well, it can make predictions that minimize prediction error, but it doesn’t explain how the underlying process works in the real world.

For example, we know that CNNs (convolutional neural networks) are very good at assigning categorical labels to images: they can label a photo of a cat ‘cat’, and even find the cat and the dog in the same image and draw a box around each. This is awesome, but it doesn’t explain how we humans do it. Our consciousness, the relationship between word and image, and the ways we identify images are still questions for neurology, linguistics, and psychology. Some people mistakenly believe that when you train an ML model to make accurate predictions, you’ve ‘solved’ a problem. You have offered a very helpful solution, but you haven’t learned much purely from the training.
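
To ground that example, the sketch below is a toy version of that categorical labeling: a very small CNN trained on MNIST digits, which stand in here for cats and dogs. It learns to attach the right label to an image, yet nothing in its learned weights tells us how human recognition works.

    import tensorflow as tf

    # MNIST: 28x28 grayscale digit images with labels 0-9.
    (x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
    x_train = x_train[..., None] / 255.0  # add a channel axis, scale to [0, 1]
    x_test = x_test[..., None] / 255.0

    model = tf.keras.Sequential([
        tf.keras.Input(shape=(28, 28, 1)),
        tf.keras.layers.Conv2D(16, 3, activation="relu"),  # learn local image features
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(10, activation="softmax"),   # one score per label
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    model.fit(x_train, y_train, epochs=1, verbose=0)
    print("predicted label:", int(model.predict(x_test[:1]).argmax()))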

The pre-training analysis, where you dig through metadata searching for the right features, label data, and then evaluate multiple models with different inputs and hyperparameters, is where the actual human understanding happens. That work is still highly valued and will never be replaced by ML. Clients still want answers to questions like ‘Why does this input data explain these outcomes so well?’, and they will of course want to use those answers to apply their own biases and values to your model’s predictive outcomes.
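
That evaluation loop is straightforward to sketch. Here scikit-learn’s GridSearchCV, one common tool for it (not necessarily what any given client engagement would use), does the bookkeeping of comparing hyperparameter settings, while the analyst decides which inputs and settings are worth trying at all.

    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV

    X, y = load_iris(return_X_y=True)

    # Candidate hyperparameters chosen by the analyst, not by the machine.
    param_grid = {
        "n_estimators": [50, 100],
        "max_depth": [3, 5, None],
    }
    search = GridSearchCV(RandomForestClassifier(random_state=0),
                          param_grid, cv=5)
    search.fit(X, y)

    print("best settings:", search.best_params_)
    print("cross-validated accuracy:", round(search.best_score_, 3))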

Conclusions

It’s an exciting time to be in the ML services industry, and we at Brillient are looking forward to solving complex problems with ML to help our clients achieve their missions and better serve the American people. We see ML not as a job destroyer but as an enhancer, one that will lead to more jobs, more opportunity, and better employee satisfaction as ML assists the workforce.