                Date: 1999-12-16
                 
                 
                NSA's Semantic Forests: Schneier Analyzes
                
                 
-.-. --.- -.-. --.- -.-. --.- -.-. --.- -.-. --.- -.-. --.- 
                 
                
Bruce Schneier on the technical capabilities of the NSA's
"Semantic Forests" patent. His closing sentence: "I'm
surprised the NSA hasn't classified this work."
-.-. --.-  -.-. --.-  -.-. --.-  -.-. --.-  -.-. --.-  -.-. --.-   
The NSA has been patenting, and publishing, technology that  
is relevant to ECHELON. 
 
ECHELON is a code word for an automated global  
interception system operated by the intelligence agencies of  
the U.S., the UK, Canada, Australia and New Zealand.  (The  
NSA takes the lead.) According to reports, it is capable of  
intercepting and processing many types of transmissions,  
throughout the globe.  
 
Over the past few months, the U.S. House of Representatives  
has been investigating ECHELON. As part of these  
investigations, the House Select Committee on Intelligence  
requested documents from the NSA regarding its operating  
standards for intelligence systems like ECHELON that may  
intercept communications of Americans.  To everyone's  
surprise, NSA officials invoked attorney-client privilege and  
refused to disclose the documents.  EPIC has taken the  
NSA to court. 
 
I've seen estimates that ECHELON intercepts as many as 3
billion communications every day, including phone calls,
e-mail messages, Internet downloads, satellite transmissions,
and so on.  The system gathers all of these transmissions  
indiscriminately, then sorts and distills the information  
through artificial intelligence programs.  Some sources have  
claimed that ECHELON sifts through 90% of the Internet's  
traffic.   
 
How does it do it? Read U.S. Patent 5,937,422,  
"Automatically generating a topic description for text and  
searching and sorting text by topic using the same,"  
assigned to the NSA.  Read two papers titled "Text Retrieval  
via Semantic Forests," written by NSA employees. 
 
Semantic Forests, patented by the NSA (the patent does not
use the name), was developed to retrieve information "on the
output of automatic speech-to-text (speech recognition)
systems" and for topic labeling.  It is described as a
functional software program.
 
The researchers tested this program on numerous pools of  
data, and improved the test results from one year to the next.  
 All this occurred in the window between when the NSA  
applied for the patent, more than two years ago, and when  
the patent was granted this year. 
 
One of the major technological barriers to implementing
ECHELON is building automatic search tools for voice
communications.  Computers need to "think" like humans
when analyzing the often imperfect computer transcriptions of
voice conversations.
 
The patent claims that the NSA has solved this problem.   
First, a computer automatically assigns a label, or topic  
description, to raw data.  This system is far more  
sophisticated than previous systems because it labels data  
based on meaning, not on keywords.
 
Second, the patent includes an optional pre-processing step  
which cleans up text, much of which the agency appears to  
expect will come from human conversations.  This
pre-processing will remove what the patent calls "stutter
phrases." These phrases "frequently occurs [sic] in text  
based on speech." The pre-processing step will also remove  
"obvious stop words" such as the article "the."  
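
A minimal sketch of what such a pre-processing step could look like,
in Python.  This is not the patent's procedure: it assumes a "stutter
phrase" is a word or short phrase repeated immediately, and the
stop-word list, the function names, and the max_len parameter are all
hypothetical choices made for illustration.

```python
import re

# Hypothetical stop-word list; the article only names "the" as an example.
STOP_WORDS = {"the", "a", "an", "of", "to", "and"}


def remove_stutter(tokens, max_len=3):
    """Collapse immediately repeated runs of up to max_len tokens.

    Assumption: a "stutter phrase" is a word or short phrase that is
    repeated back to back, as in "to to" or "you know you know".
    """
    out = []
    for tok in tokens:
        out.append(tok)
        # If the last n tokens repeat the n tokens before them, drop the repeat.
        for n in range(1, max_len + 1):
            if len(out) >= 2 * n and out[-n:] == out[-2 * n:-n]:
                del out[-n:]
                break
    return out


def preprocess(text):
    """Lowercase, tokenize, remove stutters, then remove stop words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    tokens = remove_stutter(tokens)
    return [t for t in tokens if t not in STOP_WORDS]


print(preprocess("I I I want to to go to the the store"))
```

Stutter removal runs before stop-word removal here so that a repeated
stop word ("the the") is collapsed first; the order is a design choice
of this sketch, not something the article specifies.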
 
The invention is designed to sift through foreign language  
documents, either in text, or "where the text may be derived  
from speech and where the text may be in any language," in  
the words of the patent. 
 
The papers go into more detail on the implementation of this  
technology. The NSA team ran the software over several  
pools of documents, some of which were text from spoken  
words (called SDR), and some regular documents. They ran  
the tests over each pool separately.  Some of the text  
documents analyzed appear to include data from "Internet  
discussion groups," though I can't quite determine if these
were used to train the software program or to illustrate results.
 
The "30-document average precision" (whatever that is) on  
one test pool rose significantly in one year, from 19% in 1997  
to 27% in 1998.  This shows that they're getting better. 
 
It appears that the tests on the pool of speech-to-text
documents came in at between 20% and 23% accuracy (see
Tables 5 and 6 of the "Semantic Forests TREC7" paper) at
the 30-document average.  (A "document" in this definition
can mean a topic query.  In other words, 30 documents can
actually mean 30 questions to the database.)
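
The article does not define the "30-document average precision."  One
plausible reading, common in retrieval evaluation, is precision at a
cutoff of 30 retrieved documents, averaged over all queries.  The
Python sketch below computes that figure under this assumption; the
function names and the toy data are hypothetical and not taken from
the papers.

```python
def precision_at_k(ranked_ids, relevant_ids, k=30):
    """Fraction of the top-k retrieved documents that are relevant."""
    top = ranked_ids[:k]
    hits = sum(1 for doc_id in top if doc_id in relevant_ids)
    return hits / k


def mean_precision_at_k(runs, k=30):
    """Average precision-at-k over (ranked_ids, relevant_ids) pairs, one per query."""
    return sum(precision_at_k(ranked, rel, k) for ranked, rel in runs) / len(runs)


# Toy data: one query, 30 retrieved documents, 6 of them relevant.
ranked = [f"d{i}" for i in range(30)]
relevant = {"d0", "d1", "d2", "d3", "d4", "d5"}
print(precision_at_k(ranked, relevant))
```

Under this reading, a score of 20% means that on average about 6 of
the first 30 documents returned for a query were relevant.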
 
It's pretty clear to me that this technology can be used to  
support an ECHELON-like system.  I'm surprised the NSA  
hasn't classified this work. 
 
The Semantic Forest papers: 
 
http://trec.nist.gov/pubs/trec6/papers/nsa-rev.ps  
 
http://trec.nist.gov/pubs/trec7/papers/nsa-rev.pdf
                   
 
Source 
 
http://www.counterpane.com
                   
-.-. --.-  -.-. --.-  -.-. --.-  -.-. --.-  -.-. --.-  -.-. --.-
    
                 
edited by Harkank 
published on: 1999-12-16 
comments to office@quintessenz.at
                   
                  