For more information on the topic, please see the following article.
When it comes to equipment failure, how you transform your variables is usually the most critical step. In data science, there are almost always multiple ways to approach a problem. Feature engineering is no different. My background is in econometrics and classical statistics, so I tend to examine these problems from that perspective.
In this article, I walk through the types of transformations that I use when facing a panel data problem. Clearly, I do not present an exhaustive list of feature engineering techniques. That would be impossible. …
Pricing is a complex topic, and every business in the world has to deal with it. Whether you are a nine-year-old with a lemonade stand or a transnational corporation, price is the mechanism that you use to operate your business. You provide a good or service. A customer pays the price for that good or service.
Over the last twenty-five years or so, I have had the privilege of working on several pricing-related projects. Each one was unique, and each required considerable thought and, ultimately, a specific solution. That said, some general concepts are helpful when you tackle…
Data Science, at its core, is about understanding entities. What’s an entity? Well, an entity is the “thing” that you are trying to understand. For example, an entity can be a store, a customer, a machine, or an employee. Data science allows us to gain insight into what an entity does or thinks.
Most entities are complex. They have multiple inclinations and are typically moving targets. Imagine you are a data scientist working at a telco, and consumers are your entity. Telco consumers can do many things. They can cancel, increase, decrease, or maintain their spending. They can buy product…
Make sure your groups are independent and avoid deployment disasters
Cross-sectional data includes individual entities measured in one time period. For example, 10,000 people measured once is cross-sectional data.
Time series data includes one entity measured over multiple time periods. For example, a single machine measured every day for ten years is a time series.
Panel data includes multiple entities measured over multiple time periods. For example, 1,000 consumers measured monthly over ten months is panel data. Likewise, 100 machines measured daily for 100 days is panel data.
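The three definitions above come down to counting distinct entities and distinct time periods. Here is a minimal sketch of that rule, assuming each observation is an `(entity_id, period)` pair. The function name `classify_structure` and the "single observation" fallback are hypothetical, for illustration only.

```python
def classify_structure(observations):
    """Classify a list of (entity_id, period) pairs using the definitions above."""
    entities = {entity for entity, _ in observations}
    periods = {period for _, period in observations}

    if len(entities) > 1 and len(periods) > 1:
        return "panel"             # many entities, many periods
    if len(entities) > 1:
        return "cross-sectional"   # many entities, one period
    if len(periods) > 1:
        return "time series"       # one entity, many periods
    return "single observation"    # assumption: not covered by the definitions


# 100 machines measured daily for 100 days
machines = [(f"machine_{m}", day) for m in range(100) for day in range(100)]
print(classify_structure(machines))
```

In practice you would run the same check on the entity and date columns of your data set before deciding which modeling approach applies.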
Panel data is quite common in data science. Sometimes, it is called cross-sectional time-series…
The zip (postal) code you live in says a tremendous amount about you, at least in North America. Where you live suggests your annual income, whether you have children, the TV shows you watch, and your political leanings.
In the United States, there are over 41,000 unique zip codes. Zip codes are largely categorical. There is some broad geographic meaning in the leading digits of the zip code. For example, Hawaii zip codes start with 9 and Maine zip codes start with 0. Beyond very general geographic information, the codes themselves provide little value.
What if I said that…
With predictive maintenance problems, there are two common metrics that represent the health of your asset.
The first is the probability of failure. That is, at a given moment in time, what is the probability that your machine will fail? Sometimes, this is represented by a health score. Typically, the health score is one minus the probability of failure, multiplied by 100.
The second metric is the time until failure. That is, how many days, weeks, months, hours, minutes or seconds do you have until the asset in question stops working.
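Both metrics are simple to express once you have the inputs. The sketch below assumes you already have a predicted failure probability and, for historical records, a known failure date. The function names `health_score` and `days_until_failure` are hypothetical, and a real project would produce these values from a fitted model.

```python
from datetime import date


def health_score(prob_fail):
    """Health score = (1 - probability of failure) * 100."""
    if not 0.0 <= prob_fail <= 1.0:
        raise ValueError("prob_fail must be between 0 and 1")
    return (1.0 - prob_fail) * 100.0


def days_until_failure(observed_on, failed_on):
    """Time until failure in days, assuming the failure date is known."""
    return (failed_on - observed_on).days


print(health_score(0.25))
print(days_until_failure(date(2021, 2, 1), date(2021, 2, 15)))
```

The same subtraction works in any unit of time. Days are used here only because `datetime.date` makes them convenient.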
There are many different ways to calculate these metrics…
Where is the best place to draw the line?
When predicting a binary dependent variable, the output of your model is usually a probability or is easily converted to a probability. Often, it is desirable to convert this probability to a binary prediction that matches the dependent variable.
For example, if you are predicting whether a customer will buy a product, you may want to convert the probability to buy into a binary prediction of buy/not buy for each consumer in your scoring data set. The default threshold for most algorithms is 0.50. That is, if a consumer has a…
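Converting probabilities to binary predictions is a single comparison against a threshold. This is a minimal sketch, assuming a plain list of predicted probabilities. The function name `to_binary` is hypothetical, and treating a probability exactly equal to the threshold as a positive is an assumption, since libraries differ on this.

```python
def to_binary(probabilities, threshold=0.50):
    """Map each probability to 1 (buy) or 0 (not buy).

    Assumption: a probability equal to the threshold counts as 1.
    """
    return [1 if p >= threshold else 0 for p in probabilities]


scores = [0.20, 0.50, 0.90]
print(to_binary(scores))                  # default threshold of 0.50
print(to_binary(scores, threshold=0.60))  # a stricter threshold
```

Moving the threshold trades one kind of error for the other: a higher value produces fewer predicted buyers, and a lower value produces more.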
I spent roughly four years of my life studying equipment failure problems as a Data Scientist. This article includes the better part of what I learned along the way.
Originally published in July 2020. Significantly revised February 2021.
In this notebook, I walk through a predictive maintenance problem in great detail. These types of problems can be tricky for several reasons. The first six sections deal with building a model. The last sections deal with evaluating model effectiveness and ensuring it will be effective when deployed in production.
When it comes to dealing with machines that require periodic maintenance…
I’ve spent the majority of my career dealing with problems related to sales and marketing. This means I have spent most of my time focused on humans. There’s one thing I can say with certainty. Sometimes people are completely rational and logical. Sometimes people are totally, completely and utterly incomprehensible.
Over the past five years or so, I’ve had the opportunity to branch into the industrial sector. In the industrial sector, the primary focus is on optimizing machines, not humans. It’s been interesting for sure.
One focus area that continues to rear its head in the industrial sector is machine…
Economist, Data Scientist and Data Wrangler. Opinions expressed and funny jokes are exclusively mine.