The document discusses information extraction from web-scale n-gram data, motivated by users' preference for structured data that answers questions directly over unstructured documents. It outlines the challenges of extraction, such as the rarity of pattern matches and the resulting need for larger corpora, and introduces n-gram statistics derived from very large text collections as a potential solution. It also describes an extraction algorithm that uses seed tuples and n-gram patterns to identify and rank candidate tuples for structured data.
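To make the seed-and-pattern idea concrete, here is a minimal Python sketch. It is not the document's actual algorithm: the toy n-gram counts, the function names `induce_patterns` and `rank_candidates`, and the scoring rule (number of distinct matching patterns, then total n-gram count) are all assumptions chosen for illustration. It treats each n-gram as a token tuple, takes the tokens between a seed pair as a pattern, and then looks for other pairs joined by the same patterns.

```python
from collections import defaultdict

# Hypothetical toy n-gram table: token tuple -> frequency count.
ngrams = {
    ("Paris", "is", "in", "France"): 120,
    ("Paris", ",", "France"): 300,
    ("Berlin", "is", "in", "Germany"): 30,
    ("Berlin", ",", "Germany"): 50,
    ("Rome", "is", "in", "Italy"): 40,
    ("cat", "is", "in", "trouble"): 15,
}

# Seed tuples known to satisfy the target relation (assumed input).
seeds = {("Paris", "France")}


def induce_patterns(ngrams, seeds):
    """Collect the middle tokens of every n-gram whose ends form a seed pair."""
    patterns = set()
    for tokens in ngrams:
        if len(tokens) >= 3 and (tokens[0], tokens[-1]) in seeds:
            patterns.add(tokens[1:-1])
    return patterns


def rank_candidates(ngrams, seeds, patterns):
    """Rank non-seed pairs by distinct patterns matched, then by total count."""
    hits = defaultdict(set)   # pair -> set of patterns that connect it
    total = defaultdict(int)  # pair -> summed count of matching n-grams
    for tokens, count in ngrams.items():
        middle = tokens[1:-1]
        if len(tokens) >= 3 and middle in patterns:
            pair = (tokens[0], tokens[-1])
            if pair not in seeds:
                hits[pair].add(middle)
                total[pair] += count
    return sorted(hits, key=lambda p: (-len(hits[p]), -total[p], p))


patterns = induce_patterns(ngrams, seeds)
ranked = rank_candidates(ngrams, seeds, patterns)
```

On this toy data, the pair supported by two patterns ranks above pairs supported by only one, which is also how the sketch pushes down an accidental match such as `("cat", "trouble")`. A real system would need a more careful scoring function and far more data.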