I have experience developing “classical” machine learning solutions (not much in deep learning). This post gives some thoughts on machine learning tools; my thoughts on machine learning in general would be the topic of another post.
- Use Python; don’t bother with Scala and its Spark framework. I haven’t really tried Julia: it seems nice for scientists, but I can’t judge. In any case, Python’s large ecosystem of libraries is a major advantage.
- Scikit-learn is the king of machine learning in Python. When an algorithm or processing method is added to Scikit-learn, I consider it mature and ready to use, from exploration to production. The documentation is an excellent learning resource too.
- Gradient boosting libraries are the exception to that rule: XGBoost and LightGBM are great tools that provide powerful learning algorithms and efficient implementations. LightGBM in particular has been very convenient in my experience: in many situations it is an easy-to-plug algorithm that quickly gives a good model and insight into the predictive power of the data (see the first sketch after this list).
- Grid search is an important part of most machine learning projects, but the open source tooling is still lacking. Scikit-learn’s solution is simple but inefficient for big workloads, since it repeats the same computations many times (sketched below). Dask solves that, but its implementation is less versatile (I failed to use a custom dataset splitting method, for example). And neither is designed for interrupting and resuming a long, intensive grid search. The best grid search tool I have used is still the one I wrote for my own needs at work. Other solutions like Tune did not seem usable and mature enough. I haven’t taken the time to look at hyperparameter optimization libraries.
- In the same vein, automated machine learning tooling is lacking too. The most practical solutions in Python, like auto-sklearn or tpot, simply mix a bunch of Scikit-learn models and hope to get something out of it (see the last sketch below). I admit the AutoML space is still young, and it seems dominated by proprietary solutions.
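To illustrate the “easy-to-plug” point about LightGBM, here is a minimal sketch using its scikit-learn API. The synthetic dataset is a stand-in for whatever tabular data the project provides; the parameter values are illustrative, not a recommendation.

```python
# A quick LightGBM baseline through its scikit-learn API.
from lightgbm import LGBMClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score

# Placeholder for a real tabular dataset.
X, y = make_classification(n_samples=5_000, n_features=30, random_state=0)

model = LGBMClassifier(n_estimators=200, random_state=0)

# Cross-validated AUC gives a quick read on the predictive power of the data.
scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
print(f"AUC: {scores.mean():.3f} +/- {scores.std():.3f}")

# Feature importances provide a first insight into what drives the predictions.
model.fit(X, y)
print(model.feature_importances_)
```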
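On the repeated-computation issue with Scikit-learn’s grid search: when the estimator is a pipeline, every candidate refits the whole pipeline, including preprocessing steps whose parameters never change. A small sketch with a deliberately simple pipeline:

```python
# GridSearchCV over a pipeline: the preprocessing step is refit for
# every (candidate, fold) pair even though its parameters never change.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=2_000, n_features=50, random_state=0)

pipe = Pipeline([
    ("pca", PCA(n_components=20)),
    ("clf", LogisticRegression(max_iter=1000)),
])
param_grid = {"clf__C": [0.01, 0.1, 1.0, 10.0]}

# Here PCA is recomputed 4 candidates x 5 folds = 20 times for nothing.
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)

# Dask offers a drop-in replacement that avoids refitting unchanged
# pipeline steps:
# from dask_ml.model_selection import GridSearchCV
```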
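And for a concrete idea of what those AutoML libraries do, a sketch of tpot’s API: it runs a genetic search over populations of Scikit-learn pipelines, which is exactly the “mixing Scikit-learn models” approach described above. The dataset and search budget here are arbitrary placeholders.

```python
# tpot: genetic search over Scikit-learn pipelines.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from tpot import TPOTClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each individual in the population is a Scikit-learn pipeline; generations
# of mutation and crossover search for the best-scoring one.
tpot = TPOTClassifier(generations=5, population_size=20,
                      random_state=0, verbosity=2)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))

# The winning pipeline is exported as plain Scikit-learn code.
tpot.export("best_pipeline.py")
```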