814

http://twext.com/dev/chunkster.pdf
The chunkster Perl code once solved this problem.

Updating and editing a snippet from the patent:

now that the computer program has determined how source text files are to be titled (found), the computer program can process the text with either automatic or manual chunking and translation, and then automatically produce the bitextual presentation.

prep

First, the source text must be prepared, as indicated by a step 814 shown in FIG. 8. If the user sends a URL, the computer program obtains the text contained at the URL; if the user sends a text directly, the computer program uses that text. In either case, the computer program opens a new document, copies the text into the new document, and removes irrelevant format elements, such as repeated spaces after line breaks.
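A minimal Python sketch of the cleanup part of this step, assuming the text has already been obtained (fetching from a URL is left out). The function name `prep_text` is hypothetical, and treating any run of spaces directly after a line break as "irrelevant format" is this sketch's reading of the patent wording, not the original chunkster code.

```python
import re

def prep_text(raw: str) -> str:
    """Copy the source text and strip irrelevant formatting (assumed rules)."""
    # Normalize Windows and old Mac line endings to a single "\n" return.
    text = raw.replace("\r\n", "\n").replace("\r", "\n")
    # Remove spaces that directly follow a line break. Tabs are kept,
    # because a line break followed by a tab can signal a new paragraph.
    text = re.sub(r"\n +", "\n", text)
    return text

if __name__ == "__main__":
    print(repr(prep_text("One\r\n   two\n\tThree")))
```

Tabs are deliberately left alone so that the paragraph detection described in the next section still has its "line break followed by a tab" signal.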

find paragraphs

The computer program then identifies the paragraphs, as indicated by a step 816 shown in FIG. 8. Paragraphs within a text are typically located after one line break followed by a tab. Alternatively, paragraphs are located after one empty line, or after two line breaks. With HTML coding, paragraphs are identified by the tag “<p>”. Other paragraph identifiers can be added to such paragraph criteria.

The computer program refers to such criteria, finding all instances of paragraphs within a text, and then preferably replaces those criteria with three returns, as indicated by a step 818 shown in FIG. 8. It should be noted that a computer program can read a line of text of unlimited width, while humans typically read text from media of limited horizontal width, and thus employ “line breaks” or “returns” to access an entire text. Consequently, within a single line of text, “returns” can be inserted to instruct the computer program to format, display, or print upon multiple lines, and thus to be legible to a user.

Next, the computer program locates all single returns, or returns which are not in groups of three returns, as indicated by a step 820 shown in FIG. 8, and then converts them to single spaces, as indicated by a step 822 shown in FIG. 8. Before any new returns can be inserted to identify chunks and sentences, the computer program preferably inserts a mark after the second and third returns of three-return strings, as indicated by a step 824 shown in FIG. 8. These marks protect the structure of the paragraphs in case any accidental triple returns are created by the ensuing steps.
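Steps 816 through 824 can be sketched in Python as follows. The patent does not say which character serves as the mark, so `MARK` (an ASCII unit separator) is an assumption, as are the function name `mark_paragraphs` and the exact regular expression used for the paragraph criteria.

```python
import re

MARK = "\x1f"          # hypothetical protective mark; the patent names none
PARA = "\n\n\n"        # three returns identify a paragraph break

# Assumed paragraph criteria: an HTML <p> tag, an empty line
# (two or more line breaks), or a line break followed by a tab.
PARA_CRITERIA = re.compile(r"(?:\s*<p>\s*|\n{2,}\t?|\n\t)+")

def mark_paragraphs(text: str) -> str:
    # Steps 816-818: replace every paragraph criterion with three returns.
    text = PARA_CRITERIA.sub(PARA, text)
    # Steps 820-822: returns outside three-return groups become single spaces.
    paragraphs = [p.replace("\n", " ") for p in text.split(PARA)]
    # Step 824: put a mark after the second and the third return.
    return ("\n\n" + MARK + "\n" + MARK).join(paragraphs)

if __name__ == "__main__":
    sample = "First line\nwraps here.\n\tSecond para.\n\nThird."
    print(repr(mark_paragraphs(sample)))
```

Splitting on the three-return group before converting the leftover returns keeps the two operations from interfering with each other.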

sentences and chunks

In order to format sentences and chunks, the computer program refers to specific criteria which the computer program applies to insert returns before and/or after certain strings of text. For example, the computer program typically adds two returns after every period followed by a space, that is, after every “. ” text string. But since periods also identify abbreviations such as “St.” or “Mrs.”, the computer program must first identify such exceptions, before applying the general rule. The computer program handles the exceptions as follows.

exceptions first

Before formatting individual sentences and chunks, the computer program first refers to a modifiable list of exceptions to the criteria by which such sentences and chunks are identified, as indicated by a step 826 shown in FIG. 8. Each exception has a corresponding identifier, such as “<x1>” or “<x2>”, as indicated by a step 828 shown in FIG. 8. For example, if the exceptions include an abbreviation, such as “St.”, and this exception corresponds to the identifier “<x3>”, then the computer program searches the entire text for the text string “St.”, temporarily replacing all such strings with “<x3>”. Thus, in accordance with the method of the present invention, during the process of automatically chunking a text, every individual example found within an exceptions file is searched for, and when found, is replaced with its corresponding identifier, thus protecting the exception from the subsequent application of rules for inserting returns after sentences and chunks.
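The exception handling of steps 826 and 828 might look like this in Python. The function name `protect_exceptions` and the numbering of identifiers by list position are assumptions made for illustration; the patent only says that each exception has a corresponding identifier.

```python
def protect_exceptions(text, exceptions):
    """Replace each exception string with an identifier such as <x3>.

    Returns the protected text plus a table mapping every identifier
    back to its original string, for restoring the text later.
    """
    table = {}
    for number, exception in enumerate(exceptions, start=1):
        identifier = f"<x{number}>"        # assumed identifier format
        table[identifier] = exception
        text = text.replace(exception, identifier)
    return text, table

if __name__ == "__main__":
    protected, table = protect_exceptions(
        "Mrs. Hill lives on Main St. now.", ["Mrs.", "Dr.", "St."]
    )
    print(protected)   # identifiers stand in for the abbreviations
    print(table)
```

Because the identifiers contain no period, the later sentence rule cannot mistake a protected abbreviation for the end of a sentence.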


then sentences

The method in accordance with the present invention then identifies sentences and chunks. The computer program returns to the beginning of the text and looks for elements of punctuation that identify the end of a sentence, including, in the English language, such punctuation as periods, exclamation points, question marks, and others, each followed by at least one space, as indicated by a step 830 shown in FIG. 8. Any space selected which is followed by a triple return, for example, is then deselected, as indicated by a step 832 shown in FIG. 8. Next, all selected spaces are replaced with two returns, the second return being followed by the previously used mark, thus protecting the sentence structure from accidental double returns which may be created in adding returns before and/or after specific chunk text strings, as indicated by a step 834 shown in FIG. 8. Again, it should be noted that while two returns are inserted after a sentence, the computer program continues to read the entire text as one line, but presents each new sentence to the human reader after two returns, in accordance with the method of the present invention.
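A rough Python sketch of steps 830 through 834. It assumes the same hypothetical `MARK` character as the paragraph step and limits the sentence punctuation to periods, exclamation points, and question marks; the "and others" from the text would be added to the character class.

```python
import re

MARK = "\x1f"   # hypothetical protective mark, same as in the paragraph step

# Step 830: sentence-ending punctuation followed by at least one space.
# Step 832: the lookahead deselects spaces that run into a return,
# for example the three returns that end a paragraph.
SENTENCE_END = re.compile(r"([.!?])(?! *\n) +")

def mark_sentences(text: str) -> str:
    # Step 834: replace the selected spaces with two returns,
    # the second return being followed by the mark.
    return SENTENCE_END.sub(lambda m: m.group(1) + "\n\n" + MARK, text)

if __name__ == "__main__":
    print(repr(mark_sentences("It rains. We stay in! Why? Because.")))
```

The lookahead is placed before the spaces are consumed so that a run of several spaces before a return is skipped as a whole rather than partly replaced.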

then chunks

Once sentences have been identified, the computer program proceeds to identify likely chunks of meaning within the sentences. To do so, the computer program refers to the user-selected chunk depth, which, as described above, allows a user to choose the frequency of new chunks. Typically, this frequency is defined by three criteria, namely, certain sequences of characters or text strings before which a single return is added, text strings after which a single return is added, and previously discussed exceptions to these rules.

BEFO, AFTE, BOTH

Text strings before which a single return is added are included in “prechunk” criteria. Text strings after which a single return is added are included in “postchunk” criteria. A single text string can be included in both the prechunk and postchunk criteria.

  • BEFO = prechunk
  • AFTE = postchunk
  • BOTH = prechunk AND postchunk
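One way to hold these three labels in code is a small container with one set per criteria list. This is only an illustrative Python sketch; the class name `ChunkCriteria` and its `add` method are hypothetical and do not come from the patent or from the chunkster code.

```python
from dataclasses import dataclass, field

@dataclass
class ChunkCriteria:
    """Prechunk (BEFO) and postchunk (AFTE) text strings for one chunk depth."""
    befo: set = field(default_factory=set)   # a return is added before these
    afte: set = field(default_factory=set)   # a return is added after these

    def add(self, text_string: str, kind: str) -> None:
        if kind not in ("BEFO", "AFTE", "BOTH"):
            raise ValueError(f"unknown kind: {kind}")
        if kind in ("BEFO", "BOTH"):
            self.befo.add(text_string)
        if kind in ("AFTE", "BOTH"):
            self.afte.add(text_string)

if __name__ == "__main__":
    criteria = ChunkCriteria()
    criteria.add(" and ", "BOTH")   # listed in both criteria
    criteria.add(", ", "AFTE")
    criteria.add(" to ", "BEFO")
    print(sorted(criteria.befo), sorted(criteria.afte))
```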

afte (postchunk)

Proceeding to identify likely chunks according to a selected chunk depth, the computer program returns to the beginning of the line and refers to postchunk criteria, as indicated by a step 836 shown in FIG. 8, which identify specific text strings after which the computer program must add a single return, as indicated by a step 838 shown in FIG. 8, thereby ending a chunk of meaning. In the English language, punctuation marks, such as commas, colons, semi-colons, and a closing parenthesis or bracket, can typically identify the end of a chunk, as can commonly used words such as “and,” “or,” “with,” and “that.”

In the case of punctuation marks, the text string typically includes an empty space after the mark, while with words, the text string includes the empty space before the word and the empty space after the word. Such punctuation marks and text strings can be included within postchunk criteria of a selected chunk depth. Thus, the computer program refers to a modifiable list of postchunk criteria, as indicated by the step 836 shown in FIG. 8, selects all text strings which meet the postchunk criteria, as indicated by the step 838 shown in FIG. 8, and adds a single return after each example that is found, as indicated by a step 840 shown in FIG. 8.
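The postchunk pass of steps 836 through 840 could be sketched as below. The criteria list is a small illustrative sample, and the function name `apply_postchunks` is hypothetical. One further assumption: the sketch swaps the trailing space of each text string for the return instead of keeping both, so that lines do not end in a stray space.

```python
# Illustrative postchunk criteria: punctuation plus its trailing space.
POSTCHUNK = [", ", ": ", "; ", ") ", "] "]

def apply_postchunks(text: str, criteria=POSTCHUNK) -> str:
    """Add a single return after every postchunk text string."""
    for text_string in criteria:
        # Assumption: the trailing space is replaced by the return.
        text = text.replace(text_string, text_string.rstrip(" ") + "\n")
    return text

if __name__ == "__main__":
    line = "First, mix the flour (sifted) and water; then rest: wait."
    print(apply_postchunks(line))
```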

befo (prechunk)

Once the computer program has applied the postchunk criteria, the computer program returns to reread the line and add new returns according to prechunk criteria, as indicated by a step 842 shown in FIG. 8. To do so, the computer program refers to the prechunk criteria to identify the text strings before which a return should be inserted. Such text strings in the English language may typically include common words, such as “and,” “or,” “with,” “that,” and “to;” common verbs, such as “do” and “have;” various conjugations of verbs, such as “done,” “had,” “will,” and “would;” and other words, such as “when,” “why,” “how,” “where,” “who,” and “what.” A variety of these and other words can be included within a modifiable list of prechunk criteria.

As with the postchunks, these prechunk text strings include the space before and the space after the identified words. Thus, the computer program inserts a return before the word “to,” but not the word “toe.” Punctuation, such as an opening parenthesis or opening bracket, can also be included in the prechunk criteria. Thus, the computer program refers to a modifiable list of prechunk criteria, as indicated by the step 842 shown in FIG. 8, selects all text strings which meet the prechunk criteria, as indicated by the step 844 shown in FIG. 8, and adds a single return before each example that is found, as indicated by a step 846 shown in FIG. 8.
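A Python sketch of the prechunk pass, steps 842 through 846. The criteria list is a short sample of the words named above, and `apply_prechunks` is a hypothetical name. As an assumption of this sketch, every criterion starts with a space, and that leading space is what gets replaced by the return.

```python
import re

# Illustrative prechunk criteria; words carry a space on both sides,
# so " to " matches "to" but never "toe".
PRECHUNK = [" and ", " or ", " with ", " that ", " to ", " ("]

def apply_prechunks(text: str, criteria=PRECHUNK) -> str:
    """Add a single return before every prechunk text string."""
    # Everything after the leading space goes into a lookahead, so the
    # trailing space of one match can still begin the next match.
    tails = "|".join(re.escape(text_string[1:]) for text_string in criteria)
    return re.sub(rf" (?=(?:{tails}))", "\n", text)

if __name__ == "__main__":
    print(apply_prechunks("I want to go and see the toe that hurts (badly)."))
```

Using a lookahead rather than plain string replacement means two prechunk words in a row, each sharing a space with its neighbor, both receive their return.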

chunked

Now the computer program has

  1. identified all paragraphs, sentences, and individual chunks, and
  2. inserted returns between them in accordance with the method of the present invention.

Any accidental double, triple, or quadruple returns are eliminated by finding all second, third, and fourth returns that are not followed by the previously inserted mark and then removing those unmarked returns, as indicated by a step 848 shown in FIG. 8.
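The removal of accidental returns in step 848 can be sketched with one regular expression. It assumes the hypothetical `MARK` character used in the earlier steps and treats a return as a "second, third, or fourth" return when it directly follows another return or a mark.

```python
import re

MARK = "\x1f"   # hypothetical protective mark from the earlier steps

_m = re.escape(MARK)
# A return that directly follows a return or a mark, and that is not
# itself followed by a mark, is an unmarked extra return.
EXTRA_RETURN = re.compile("(?<=[\n" + _m + "])\n(?!" + _m + ")")

def remove_accidental_returns(text: str) -> str:
    return EXTRA_RETURN.sub("", text)

if __name__ == "__main__":
    sample = "One,\n\nTwo.\n\n" + MARK + "\nThree.\n\n" + MARK + "\n" + MARK + "Four"
    print(repr(remove_accidental_returns(sample)))
```

Marked sentence breaks (two returns) and paragraph breaks (three returns) pass through unchanged, while the unmarked extras collapse to a single return.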

cleanup marks

Then, the marks are removed, as indicated by a step 850 shown in FIG. 8, leaving the text formatted with single, double, and triple returns.

After the rules for chunking the text have been applied, the computer program returns to replace all of the previously inserted coded exceptions, such as “<x3>”, with their corresponding original text strings. The computer program selects all of the previously marked exceptions, as indicated by a step 858 shown in FIG. 8, and substitutes the actual text string, such as “St.”, for the mark, as indicated by a step 860 shown in FIG. 8. Thus, the text retains its original sequence of words, but is reformatted by the method of the present invention. While the computer program reads just one line of text with precisely inserted returns, it displays the text to the human reader with one return after each new chunk, two returns after each new sentence, and three returns after each new paragraph.
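The final cleanup (steps 850, 858, and 860) might be sketched as follows. `MARK` is the same hypothetical mark character as before, and the `table` argument is assumed to map identifiers such as `<x3>` back to their original strings; the function name `finish` is illustrative.

```python
MARK = "\x1f"   # hypothetical protective mark from the earlier steps

def finish(text: str, table: dict) -> str:
    """Remove the protective marks, then restore the coded exceptions."""
    # Step 850: drop every mark, leaving single, double, and triple returns.
    text = text.replace(MARK, "")
    # Steps 858-860: substitute the original string for each identifier.
    for identifier, original in table.items():
        text = text.replace(identifier, original)
    return text

if __name__ == "__main__":
    chunked = ("We met on Main <x3> today.\n\n" + MARK + "Then we left.\n\n"
               + MARK + "\n" + MARK + "New paragraph.")
    print(finish(chunked, {"<x3>": "St."}))
```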

image:ChunkstaFlowchart.png
