Diff for "GetFromCelex" - MRC CBU Language Wiki
Differences between revisions 5 and 6
Revision 5 as of 2013-08-01 18:15:10
Size: 7929
Comment:
Revision 6 as of 2013-08-01 18:17:36
Size: 8006
Comment:

The GetFromCelex utility

GetFromCelex is a simple tool to produce files from Celex that combine fields from several Celex files. You can filter the output to select the lemmas or wordforms you need.

For more information on Celex, see the Celex manual in the language group directory or download it from here: EUG_A4.PS. You will need a PostScript viewer, such as GSView, to open and read this manual. The manual lists all the field names in Celex and the data they contain. A table of Celex-DISC-IPA transcriptions is here: DISC.pdf

Not all fields that are mentioned in the Celex manual are actually in our version of the database. If you want to know which fields you can use, there is a file named FieldListEnglish.txt that lists all fields, with some additional information attached.

GetFromCelex will only run on a Windows system (95, 98, NT or 2000), and has to be called from a (DOS) command prompt like this:

L:\> GetFromCelex myscript.txt

Here 'myscript.txt' is a GetFromCelex script file that you create yourself, containing a few simple commands. You can create a script file with any text editor, such as Notepad.

Alternatively, you can just double-click on the GetFromCelex program in the Windows explorer. It will launch a file-open dialog, asking you for the scriptfile. This enables people without DOS skills to still run the program.

An example of a GetFromCelex script file is:

 . SetCelexDir L:\Celex2\English
 . OutputFile C:\Compounds\output.txt
 . Filter FlatSA == "SS*"    // Select only compounds
 . Filter W:Cob > 500          // Cobuild frequency
 . Output Word W:Cob MorphStatus FlatSA W:PhonStrsDISC

The first thing to know is that the first word on each line should be a valid GetFromCelex command. All commands are case sensitive. Everything following "//" is a comment, and will be ignored by the program.

The first line, with the SetCelexDir command, tells GetFromCelex where the Celex files are to be found. The normal location is on Home, in the language group directory. If you cannot access this, you can install them on your own computer: just ask me for the Celex CD. The directory that has to be provided is the one containing all the subdirectories for the different Celex files. This will normally be the 'ENGLISH' directory, or the 'DUTCH' or 'GERMAN' directories for the other languages.

The next command is "OutputFile" and should specify the name of the file where the output is written. An existing file with the same name will be overwritten!

The next two commands are "Filter" commands. Only entries that satisfy these are written to the outputfile. If you supply more than one filter, only entries that satisfy all filters are written.

Wildcard expressions are accepted: '*' means any character or no character at all, and '?' means precisely one character. All other characters have to be matched literally.

Wildcards and string literals must be in double quotes; anything else will be interpreted as a numeric value or a CELEX fieldname.
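The wildcard rules can be illustrated outside GetFromCelex with a short Python sketch. This is not the tool's own code: the function name is made up for illustration, it assumes that '*' stands for zero or more characters of any kind, and it assumes matching is case sensitive.

```python
import re

def wildcard_match(pattern: str, value: str) -> bool:
    """Return True if value matches a pattern containing '*' and '?'."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")           # assumed: zero or more characters
        elif ch == "?":
            parts.append(".")            # exactly one character
        else:
            parts.append(re.escape(ch))  # everything else matches literally
    return re.fullmatch("".join(parts), value, flags=re.DOTALL) is not None
```

Under these assumptions, wildcard_match("SS*", "SSA") is true because the value starts with "SS", while wildcard_match("S?", "S") is false because '?' needs exactly one character.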

The operators you can use are:

  ==      Equal, can be used with wildcards 
  !=      Not equal, can also be used with wildcards
  <=      Smaller than or equal to 
  >=      Greater than or equal to
  <       Smaller than
  >       Greater than

Filters can be combined using an OR operator like this:

 . Filter FlectType == "S" OR FlectType == "P"
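The way several Filter lines and OR interact can be pictured with a small Python sketch. It is only a model of the behaviour described on this page, not GetFromCelex itself: each inner list stands for one Filter line whose conditions are joined by OR, and an entry is kept only if every Filter line is satisfied. The entry values in the usage note are invented.

```python
from typing import Callable, Dict, List

Entry = Dict[str, str]
Condition = Callable[[Entry], bool]

def passes(entry: Entry, filter_lines: List[List[Condition]]) -> bool:
    """True if the entry satisfies every Filter line (AND),
    where a line is satisfied by any one of its conditions (OR)."""
    return all(any(cond(entry) for cond in line) for line in filter_lines)

# Model of two script lines:
#   Filter FlectType == "S" OR FlectType == "P"
#   Filter W:Cob > 500
filter_lines = [
    [lambda e: e["FlectType"] == "S", lambda e: e["FlectType"] == "P"],
    [lambda e: int(e["W:Cob"]) > 500],
]
```

With this model, an entry with FlectType "P" and W:Cob 600 passes, while an entry with FlectType "S" and W:Cob 10 is rejected by the second line.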

The last command, "Output", specifies which fields should be written to the outputfile. Fields will be written in the order you specify, with '\' characters as field separators, as in the original Celex files.

The names of the fields, like 'FlatSA', are identical to the names used in the CELEX manual. The only difference is that you will have to prefix some with "L:" or "W:" to select fields from either the Lemma or the Wordform lexicon. In the example script "Cob" could refer to the Cobuild frequency of the Lemma or that of the Wordform. Using "W:Cob" disambiguates this for the program.
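Because the output uses '\' as the field separator, a file written by the example script can be read back with a few lines of Python. The sketch below assumes one entry per line, no header line, and no escaping of '\' inside fields; none of this is confirmed by the page. The function name and the sample line are invented for illustration.

```python
from typing import Dict, List

def parse_output_line(line: str, fields: List[str]) -> Dict[str, str]:
    """Split one output line on '\\' and pair the values with field names."""
    values = line.rstrip("\r\n").split("\\")
    if len(values) != len(fields):
        raise ValueError(f"expected {len(fields)} fields, got {len(values)}")
    return dict(zip(fields, values))

# Field order taken from the Output command of the example script.
FIELDS = ["Word", "W:Cob", "MorphStatus", "FlatSA", "W:PhonStrsDISC"]

# Invented sample line, not real Celex data.
sample = "exampleword\\612\\C\\SS\\DISC\n"
record = parse_output_line(sample, FIELDS)
```

The field list passed to the function has to match the order given in the Output command, since the file itself carries no field names under these assumptions.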

Advanced features

Sometimes you only want information from Celex for a limited number of words. You can do this by using a 'master' file, like this:

 . MasterFile C:\Experiment1\Condition2 mstr

The last field ('mstr') is a nickname that you give the file so you can refer to it.

This masterfile can be used in filters so you can limit your output to the words in the masterfile like this (assuming the words to filter on are in the first field of your masterfile):

 . Filter mstr[1] == Word

As you can see, you use the nickname and a field position to refer to fields in your masterfile. Master files must have spaces or tabs as field separators, and cannot have empty fields.
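Since a master file with missing fields will not work, it can help to check the file before running a long job. The Python sketch below is a hypothetical helper, not part of GetFromCelex: it splits each line on spaces or tabs and reports the lines whose field count differs from what you expect. Treating blank lines as errors is an assumption made here to be on the safe side.

```python
from typing import List

def check_master_lines(lines: List[str], expected_fields: int) -> List[int]:
    """Return the 1-based numbers of lines with the wrong number of fields."""
    bad = []
    for number, line in enumerate(lines, start=1):
        # str.split() with no argument splits on any run of spaces or tabs,
        # so a blank line yields zero fields and is reported as well.
        if len(line.split()) != expected_fields:
            bad.append(number)
    return bad
```

For example, check_master_lines(["walk 1", "run\t2", "jump", ""], 2) reports lines 3 and 4.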

When using a masterfile, GetFromCelex will need much more time to produce the output: every line in the masterfile will take several seconds (or more if your computer is slow). So, a masterfile of 500 lines could easily take more than 15 minutes!

This option is most useful when you already have a list of words and you want to extract particular variables for them (e.g. a list of verbs that you want to get the frequency for). To do this, a short script would be written as follows:

   SetCelexDir C:\celex2\ENGLISH

   OutputFile C:\celex2\Utilities\Output.txt    //specifies outputfile

   Filter ClassNum == 4   //limits search to verbs

   MasterFile -Celex -EF C:\celex2\Utilities\InputMasterFile.txt mstr   //specifies masterfile with relevant words

   Filter mstr[1] == Word      //uses masterfile as filter so only relevant words are returned

   Output Word W:CobSLog    //specifies log spoken frequency as output variable after word

There is a filter with which you can select fields on the basis of their length:

 . Filter Length Word > 5
 . Filter Length Word <= 15

The example will select only 'Word' fields that contain strings longer than 5 characters and shorter than 16.

You can tell GetFromCelex to ignore certain characters in the length count:

 . Filter Length L:PhonSylBCLX < 3 Ignore "[]" // number of phonemes

In this case the [ and ] brackets indicating syllables are ignored, resulting in a count of the number of phonemes in a word.
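The effect of Ignore on the length count can be mimicked in Python. This is a sketch of the behaviour described above, with a made-up function name and a made-up bracketed string rather than a real Celex transcription.

```python
def length_ignoring(value: str, ignore: str = "") -> int:
    """Count the characters in value, skipping any that appear in ignore."""
    return sum(1 for ch in value if ch not in ignore)

# Invented string with two bracketed syllables of two characters each.
n = length_ignoring("[ab][cd]", "[]")
```

Here n is 4, because the four brackets are left out of the count; without the ignore set the same string has length 8.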

There is also a special filter available that can count the number of occurrences of a substring in a field. It is used like this:

 Filter Count FlatSA "S" == 2
 Filter Count FlatSA "SA" < 3

In the first case only entries that contain exactly 2 occurrences of the character "S" will be output. The second filter makes GetFromCelex only output entries that have no more than two occurrences of "SA" embedded in the given field.
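The counting can be reproduced in Python with str.count, which counts non-overlapping occurrences. Whether GetFromCelex counts overlapping occurrences differently is not stated on this page, so treat this as a sketch; the function name and the strings are made up for illustration.

```python
def count_occurrences(value: str, substring: str) -> int:
    """Count non-overlapping occurrences of substring in value."""
    return value.count(substring)

# Models 'Filter Count FlatSA "S" == 2' and 'Filter Count FlatSA "SA" < 3'
# for one invented field value.
flat_sa = "SAS"
keeps_first = count_occurrences(flat_sa, "S") == 2
keeps_second = count_occurrences(flat_sa, "SA") < 3
```

For the value "SAS" both conditions hold: it contains "S" twice and "SA" once.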

If you only want to include fields that contain certain characters, or that do not contain certain characters, you can use these commands:

 . Filter CharSet Word "abcdefghijklmnopqrstuvwxyz"
 . Filter NotCharSet Word "_&$"

The first line will select 'Word' values that consist only of lowercase letters, and the second one will exclude 'Word' values that contain a '_', '&' or '$' character.
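The two character-set tests correspond to simple membership checks, sketched here in Python. The function names are hypothetical, and the treatment of empty values (which pass both checks in this sketch) is an assumption.

```python
def only_chars(value: str, allowed: str) -> bool:
    """Model of CharSet: every character of value must be in allowed."""
    return all(ch in allowed for ch in value)

def no_chars(value: str, forbidden: str) -> bool:
    """Model of NotCharSet: no character of value may be in forbidden."""
    return not any(ch in forbidden for ch in value)

LOWERCASE = "abcdefghijklmnopqrstuvwxyz"
```

So only_chars("walk", LOWERCASE) is true, only_chars("Walk", LOWERCASE) is false because of the capital letter, and no_chars("rock_and_roll", "_&$") is false because of the underscores.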

Normally GetFromCelex will only output wordforms for which all filters (if any) are true. But sometimes you want to see the other wordforms too. If you want to look for words that have a plural form that is identical to the singular, you would probably be interested to see if there is another, not identical, plural. You can make GetFromCelex output all wordforms for each lemma that has at least one wordform that gets through all filters by using this command:

 . OutputAllWordforms

By default, only the wordforms that match the criteria for all filters are copied to the output.

If you have problems, or want to ask a question, please contact me.

Maarten

Maarten.van-Casteren@mrc-cbu.cam.ac.uk

-- Main.MaartenVanCasteren - 07 Apr 2006

CbuLanguage: GetFromCelex (last edited 2013-08-01 18:17:36 by RussellThompson)