From Assisting Crime-solvers to Diagnosing Disease, Text Analytics Finds Clues in Content
First developed to support law enforcement, Booz Allen’s text analysis capability can boost productivity in almost every domain in which the firm is involved.
All criminal investigative analysts share at least two important skills: an eye for detail and patience. It’s the criminal analyst’s job to find relationships or patterns across different crimes that could help solve cases. To find such relationships, the analyst often studies the details written in crime reports.
Details such as a suspect’s penchant for wearing a red baseball cap.
There may not be anything unusual about the baseball cap… until Booz Allen Hamilton’s investigative technology team identifies the words “red baseball cap” in several other recent crime reports. Suddenly, those words provide the analyst with a connection between cases—and a possible clue to the suspect’s identity.
There are two ways to find this type of crucial information. One involves tedious, time-consuming manual research of sentences and phrasing, searching for the few words that might be significant, buried in massive amounts of insignificant text. The second, much faster way is called text analytics.
Text analytics is a process that uses automated algorithmic techniques and leading-edge technology to identify and extract important information from text, such as the location of a crime, the description of a tattoo on a perpetrator’s arm, or the red baseball cap a suspect wore, as well as the relationships that link this information.
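The idea of linking reports through a shared phrase can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the firm's software: the report IDs, the narratives, and the function name `shared_phrases` are all hypothetical, and the "extraction" is a plain index of three-word phrases.

```python
import re
from collections import defaultdict

def shared_phrases(reports, n=3, min_reports=2):
    """Index every n-word phrase by the reports that contain it,
    then keep only phrases found in at least min_reports reports."""
    index = defaultdict(set)
    for report_id, text in reports.items():
        words = re.findall(r"[a-z0-9']+", text.lower())
        for i in range(len(words) - n + 1):
            index[" ".join(words[i:i + n])].add(report_id)
    return {p: sorted(ids) for p, ids in index.items() if len(ids) >= min_reports}

# Hypothetical narratives, invented for illustration only.
reports = {
    "R-101": "Witness said the suspect wore a red baseball cap and fled on foot.",
    "R-102": "Clerk described a man in a red baseball cap near the register.",
    "R-103": "Victim reported a stolen bicycle from the garage.",
}

links = shared_phrases(reports)
print(links.get("red baseball cap"))
```

A practical system would also filter common words and recognize variants of the same description; the sketch only shows how a repeated phrase becomes a link between otherwise unrelated reports.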
Since the 1990s, Booz Allen’s team of certified forensic analysts and specialized software engineers has been developing special-purpose software that integrates text analytics and natural language processing techniques. The software serves the law enforcement, health, defense, and intelligence communities and demonstrates the power and applicability of text analytics.
The firm’s methodology, which has been developed into a service offering, combines domain and functional expertise to develop client-specific applications. These applications identify and extract quantitative, precise information from qualitative, imprecise data. The knowledge acquired can lead to important discoveries or help solve crimes.
Although a red baseball cap was not involved, Booz Allen’s team did use its text analytics service offering to review dozens of law enforcement case files for a criminal investigative analyst. The tool identified patterns and high-priority data that warranted further investigation by automatically extracting complex but potentially useful information from text.
Senior associate Adam Feldman is the project leader. “We extracted key factors that identified linkages and helped discover new leads,” he says. “Our automated software completed the labor-intensive work for the analyst in a short timeframe, and provided her with valuable data she needed to move the investigations forward and make more efficient use of her time.
“This capability can also be applied to surveillance, monitoring, and automated analyses of many situations relevant to force protection, intelligence, health, and homeland security,” he adds.
Uncovering the Critical Information Buried in Complex Data Sets
Intelligence analysts, criminal investigators, and health researchers face a tremendous challenge: They must decipher the content of large, complex digital data sets to find the basic information they need to know. Relevant information may be available from open sources such as Web pages and documents, but the volume and diversity of such sources are prohibitive, making manual reviews costly and resource intensive. These professionals also face logistical and training challenges, because they seldom have a robust understanding of the technology that could raise their productivity and efficiency, and they rarely have the time or resources for training.
Booz Allen addresses these challenges and more. For example, one of the firm’s many prototype text analytics applications may help federal syndromic surveillance efforts detect potential disease outbreaks early by analyzing electronic records of patient “encounters” (i.e., interactions) at hospital emergency rooms (ERs).
Here’s how it would work: A patient arrives at a hospital ER with serious respiratory symptoms. As he undergoes a medical exam, a nearby computer extracts crucial data from his ER encounter records. These data provide clues that help doctors determine whether the patient is suffering from the flu or from inhalation anthrax.
Booz Allen’s prototype application uses text analytics to scan electronic ER patient encounter records for occurrences of specific respiratory syndromes such as inhalation anthrax. Beta test results have been encouraging: Run against a database of 500 ER records taken from a local hospital weeks after the 2001 Brentwood post office anthrax cases in Washington, D.C., the application achieved 98% specificity in distinguishing anthrax cases from other respiratory syndromes.
Says Feldman, “This prototype automated the process of identifying key symptoms, signs, and lab results that collectively indicate a specific diagnosis.”
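For readers unfamiliar with the term, specificity is the share of records without the condition that a test correctly leaves unflagged. The short Python sketch below shows only the arithmetic. The counts are hypothetical and chosen to produce a round figure; the article does not report the underlying breakdown of the 500 records.

```python
def specificity(true_negatives, false_positives):
    """Fraction of non-anthrax records that were correctly not flagged as anthrax."""
    return true_negatives / (true_negatives + false_positives)

# Hypothetical counts, not taken from the beta tests described above.
tn, fp = 196, 4
print(f"{specificity(tn, fp):.0%}")
```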
Feldman and his team also support defense and intelligence projects. “Analysis of e-mail can be overwhelming,” he says. “They’re easy to find and can be a treasure trove of useful information, but it’s not enough to just pull them off the computer. To add real value, you need to analyze the content of every message and extract the useful information in an efficient way. This requires automation and advanced analytic techniques.”
Booz Allen works with a community of natural language processing product vendors to integrate its functional capabilities with customized text analytics-enabled solutions. The firm’s forensic technicians and engineers contribute expertise and practical experience that help satisfy niche client requirements, such as analyzing Internet activity that could indicate an emerging trend in political, social, or economic environments. This capability is of special interest to the intelligence community (IC): Detection of such signs assists the IC in developing new requirements for intelligence collection.
Building a Text Analytics Solution
To understand text analytics, it’s helpful to understand some of the language used in the process, such as structured text. Structured text is text that is stored and retrieved in a consistent, organized, and finite manner. It has a specific set of values and a predictable format. For example, an officer will enter only one of two values of structured text in the category “Gender” in a crime report: “female” or “male.”
Conversely, unstructured text is variable, unconstrained, and unpredictable. The narrative sentences and phrases an officer writes in her notebook while interviewing a crime victim are an example of unstructured text.
As an intermediate step between analyzing unstructured text and presenting results to the user, Booz Allen’s approach often stores extracted results in a structured format, such as a database, so that they can be readily processed and queried. “Once we extract useful tidbits, they must ultimately be queried, manipulated, and further analyzed to produce a useful product,” Feldman says. “When we put our extracted information in a structured form, tools such as commercial databases, statistical analysis software, and spreadsheets can perform the next stage of analysis.”
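The intermediate step can be pictured with a small Python sketch that uses the standard library's `sqlite3` module. It is an illustration under assumptions, not the firm's design: the table name, its columns, and the extracted rows are all hypothetical.

```python
import sqlite3

# Hypothetical extraction output: (report_id, entity_type, value).
extracted = [
    ("R-101", "clothing", "red baseball cap"),
    ("R-101", "location", "Main Street"),
    ("R-102", "clothing", "red baseball cap"),
    ("R-103", "tattoo", "anchor on left arm"),
]

# Store the extracted items in a structured table (in memory for this sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE extraction (report_id TEXT, entity_type TEXT, value TEXT)")
conn.executemany("INSERT INTO extraction VALUES (?, ?, ?)", extracted)

# Ask a structured question: which extracted values appear in more than one report?
rows = conn.execute(
    """
    SELECT entity_type, value, COUNT(DISTINCT report_id) AS n_reports
    FROM extraction
    GROUP BY entity_type, value
    HAVING COUNT(DISTINCT report_id) > 1
    """
).fetchall()
print(rows)
```

Once the items sit in a table like this, the same data could just as easily be exported to a spreadsheet or a statistics package for the next stage of analysis.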
The firm’s text analytics approach focuses on client-specific requirements to make domain-specific applications. Intelligence that’s useful to law enforcement, for example, may not be useful to the CDC. So when the team develops a text analytics-integrated application, it must first understand the domain-specific information being sought. From that understanding, the team implements rules related to grammar, semantics, and source data that describe client-specific types, categories, and relationships of information. This enhances the specificity, accuracy, and applicability of the information and terms extracted.
In the example of the syndromic surveillance-related application, the team implemented rules that identify specific medical symptoms, as well as rules that differentiate whether a certain symptom is represented positively or negatively: “Shortness of breath” versus “no shortness of breath,” or “denies shortness of breath” versus “does not deny shortness of breath.”
In this case, rules that distinguish between positive and negative language are critically important. Each diagnosis is based on the presence (positive) or absence (negative) of specific symptoms described in a medical record; properly analyzed, that pattern of symptoms will indicate inhalation anthrax or something more benign.
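A rule of this kind can be sketched in Python with a few regular expressions. This is a simplified illustration, not the rules the team implemented: the cue lists and the function name `symptom_status` are assumptions, and the sketch only inspects the words immediately before the symptom.

```python
import re

# Hypothetical cue lists; a real rule set would need far broader coverage.
NEGATION = r"(?:no|denies|without)"
DOUBLE_NEGATION = r"(?:does not deny|doesn't deny)"

def symptom_status(text, symptom):
    """Return 'present', 'absent', or None if the symptom is not mentioned."""
    t = text.lower()
    s = re.escape(symptom.lower())
    if not re.search(rf"\b{s}\b", t):
        return None
    # Check the double negative first: "does not deny X" means X is present.
    if re.search(rf"\b{DOUBLE_NEGATION}\s+{s}\b", t):
        return "present"
    # A negation cue directly before the symptom means X is absent.
    if re.search(rf"\b{NEGATION}\s+{s}\b", t):
        return "absent"
    return "present"

for phrase in [
    "Shortness of breath",
    "No shortness of breath",
    "Denies shortness of breath",
    "Does not deny shortness of breath",
]:
    print(phrase, "->", symptom_status(phrase, "shortness of breath"))
```

The order of the checks matters: testing for the double negative before the simple negation keeps "does not deny" from being misread as a denial.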
“All of our clients struggle with the labor-intensive process of reviewing vast amounts of data to yield high-value information or intelligence,” Feldman says. “The challenge of identifying and extracting key bits of information is common to nearly every domain team with which we’ve collaborated.” For example, health care has nearly as much information in unstructured text form (e.g., journal articles, drug trials, claims data) as the IC does in its various open-source collections and knowledge databases.
Feldman continues, “The combination of our text analytic tools and expertise, our law enforcement-based investigative technologies expertise, and our specialized software engineering backgrounds enables us to understand these complex information extraction challenges and develop very specialized software solutions that automate relevant business processes and raise productivity.”
story posted September 3, 2008
