Posted by Drew Farris on February 20, 2015
Recently, Mike Kim wrote an excellent post on overfitting, a common problem data scientists face when applying machine learning. Earlier, Paul Yacci and Aaron Sander both tackled the topics of feature selection and feature creation. These are all key problems that data scientists encounter when building models that seek to describe and predict real-world outcomes based on observed data or hidden patterns in datasets.
This week I’ve identified a few other common problems data scientists face when working with data. These problems go beyond technology and machine learning and are broadly encountered regardless of the task at hand: interpreting the problem, sourcing the data, and describing the outcomes.
Interpreting the Problem
One of the most significant challenges a data scientist will encounter in examining a real-world problem is identifying the aspects of that problem that can be addressed using data science. A recent article about a University of Chicago Data Science for Social Good Fellow project describes how data science was used for health care reform in Illinois. Facing low enrollment in Affordable Care coverage among Illinois' uninsured residents, data scientists cast the problem in terms of data, developing mechanisms to predict which individuals are least likely to be insured. Using this model, the group developed targeted efforts to get people signed up. Translating a problem in this way requires both an understanding of the capabilities, tools, and techniques behind data science and the ability to get out from behind the keyboard and ask questions to inform the data process. Interpreting the problem is as much an art as it is a science.
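To make the idea of "casting a problem in terms of data" concrete, here is a toy sketch (not the Chicago team's actual model): frame outreach as a scoring problem, where each individual gets a predicted probability of being uninsured and the highest-risk individuals are targeted first. All feature names, weights, and people below are invented for illustration.

```python
import math

def uninsured_score(features, weights, bias):
    """Hypothetical logistic score: probability an individual is uninsured."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1 / (1 + math.exp(-z))

# Invented feature vectors: (age_normalized, income_normalized, had_prior_coverage)
people = {
    "A": (0.2, 0.1, 0),
    "B": (0.5, 0.9, 1),
    "C": (0.8, 0.3, 0),
}
# Invented weights; in practice these would be learned from labeled data.
weights, bias = (0.5, -2.0, -1.5), 0.3

# Rank individuals by predicted risk so outreach targets the least-likely-insured first.
ranked = sorted(people, key=lambda p: uninsured_score(people[p], weights, bias),
                reverse=True)
```

The modeling machinery here is trivial; the data-science work is deciding that "who should we contact?" can be answered by a ranked prediction over individuals at all.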
Sourcing the Data
A number of struggling data science projects suffer from a lack of data. The scientists working the problem have excellent ideas of what can be done, what tools and algorithms can be used, what features will be the most important, and even how to validate their assumptions and outcomes. What's missing is the data required to support all of this. Availability issues range from not having a sufficient volume or variety of data, to having extremely inconsistent or "dirty" data, where the effort to clean, filter, or repair it is so monumental that it increases the risk of the effort beyond what is tolerable to the organization. As with the Illinois insurance case, policy hurdles can prevent the analysis of raw, individual-level data. Related problems arise from using representative data: a dataset used as a stand-in while the team waits for the real data to be obtained. More often than not, representative data, especially synthetic or generated data, does not accurately capture the nuances found in real data. Finally, scientists often underestimate the amount of work required to acquire, clean, and understand data. The time or money budget is exhausted and a great data set is in hand, but little or no real analytic activity has occurred.
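The cleaning effort mentioned above is easy to underestimate because even one field can arrive in several inconsistent shapes. A minimal sketch, using an invented survey record with mixed date formats and currency noise in an income field:

```python
import re
from datetime import datetime

def clean_record(raw):
    """Normalize one hypothetical survey record; return None if unrepairable."""
    name = raw.get("name", "").strip().title()
    if not name:
        return None
    # Dates arrive in several inconsistent formats; try each in turn.
    date = None
    for fmt in ("%Y-%m-%d", "%m/%d/%Y", "%d %b %Y"):
        try:
            date = datetime.strptime(raw.get("date", "").strip(), fmt).date()
            break
        except ValueError:
            pass
    if date is None:
        return None
    # Strip non-numeric noise from income strings like "$42,000 ".
    income_text = re.sub(r"[^\d.]", "", raw.get("income", ""))
    income = float(income_text) if income_text else None
    return {"name": name, "date": date.isoformat(), "income": income}

raw_rows = [
    {"name": "  ada lovelace ", "date": "12/10/1815", "income": "$42,000"},
    {"name": "", "date": "2015-02-20", "income": "1000"},          # dropped: no name
    {"name": "Alan Turing", "date": "23 Jun 1912", "income": ""},  # kept, income unknown
]
cleaned = [r for r in (clean_record(row) for row in raw_rows) if r]
```

Every branch in this toy function represents a judgment call about what to repair versus what to drop, and each such call carries some of the risk the paragraph above describes.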
Exploring the Outcomes
Making predictions isn't always easy. Even given a dataset with expected outcomes, a supervised learning project becomes an exercise in feature exploration, appropriate algorithm selection, and rigorous model selection. Given sufficient time and processing power, you can crank through hyper-parameters and develop models of reasonable predictive strength requiring little interpretation. Lacking a pre-existing labeled set to use as a gold standard, there are options such as recruiting subject matter experts (friends and relatives, or complete strangers via Mechanical Turk) to manually tag or code datasets. When going beyond the realm of supervised learning problems, data scientists must fully leverage their ability to interpret what the data and algorithms tell them in order to shape, understand, represent, and convey the underlying story locked within their data. Statistical analysis and unsupervised machine learning approaches, such as clustering and topic modeling, power these outcomes, but a practicing data scientist finds that most cases require interpretation and explanation to convey subtle meaning. As with the interpretation of the problem, this kind of narration of results is also an art.
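Unsupervised approaches like the clustering mentioned above hand you structure, not answers: the algorithm produces groups, and the interpretation is left to the analyst. A minimal, stdlib-only k-means sketch on invented 2-D points makes the point; the output is just integer labels, and deciding what cluster 0 and cluster 1 *mean* is the narration step.

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=20, seed=0):
    """Minimal k-means: returns final centroids and a cluster label per point."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        labels = [min(range(k), key=lambda c: dist2(p, centroids[c]))
                  for p in points]
        # Move each centroid to the mean of its assigned points.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members)
                                     for x in zip(*members))
    return centroids, labels

# Two well-separated invented groups; the algorithm should recover them.
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2),
          (5.0, 5.1), (5.2, 5.0), (5.1, 5.2)]
centroids, labels = kmeans(points, k=2)
```

The code recovers the two groups, but nothing in its output says why the groups differ or whether the split matters; that explanation is the data scientist's job.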
These are just some of the problems all data scientists have faced at one time or another. While machines can help us with a major portion of the work associated with data science, a significant portion depends on the human ability to theorize, interpret, analyze, and associate in the problem exploration or solution space. However, even the best approaches can suffer if the data just isn't available, or is too expensive or risky to obtain. Keep these in mind as you embark upon your next data adventure.