Question on arrays and text processing

Svetoslav.Marinov at isp.his.se
Fri Jan 9 09:38:33 CET 2004


Hi everyone,

I have a question about arrays, and I would appreciate advice on data
structures suited to the task below.

The scenario is as follows: 

The task: extract subcategorization frames of verbs.

I have a preprocessed treebank file that contains only the tags relevant to
me, one sentence per line. I read the file one line at a time, put the line
into an array, then find the verb and its possible arguments. This
information is stored in a dictionary.
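
To make the data flow concrete, here is a minimal sketch of that pipeline. It
is written in Python purely for illustration and is not the attached Oz code.
The names find_verbs, candidate_args, extract_frames and the ARG_LABELS set
are hypothetical, and "arguments" are approximated very crudely as any
PP/NPA/CLDA label that appears to the right of the verb. It also only
recognises the plain pattern "<V word <ta TAG", so verbs preceded by extra
markup (e.g. <pro-ss idref=...) would be missed.

```python
from collections import defaultdict

# Hypothetical set of constituent labels treated as argument candidates.
# The real criteria live in the attached Oz code and are not reproduced here.
ARG_LABELS = {"<PP", "<NPA", "<CLDA"}


def find_verbs(tokens):
    """Return (index, word) for each plain '<V word <ta TAG' pattern."""
    hits = []
    for i in range(len(tokens) - 3):
        if (tokens[i] == "<V"
                and not tokens[i + 1].startswith("<")
                and tokens[i + 2] == "<ta"):
            hits.append((i, tokens[i + 1]))
    return hits


def candidate_args(tokens, start):
    """Collect argument-like labels found at or after position `start`."""
    return tuple(t[1:] for t in tokens[start:] if t in ARG_LABELS)


def extract_frames(lines):
    """Map each verb form to the list of candidate frames seen for it."""
    frames = defaultdict(list)
    for line in lines:
        tokens = line.split()  # every tag and every word is one entry
        for idx, verb in find_verbs(tokens):
            frames[verb].append(candidate_args(tokens, idx))
    return dict(frames)


sample = ("<S <VPC <V <T Ne <ta T </T <V <Pron mi <ta Pp-d1s-t </Pron "
          "<V e <ta Vx---f-r3s </V </V </V <PP <Prep do <ta R </Prep "
          "<N smyah <ta Ncmsi </N </PP </VPC . </S")
print(extract_frames([sample]))  # {'e': [('PP',)]}
```

The sample is the first sentence of the test file; the only plain
"<V word <ta TAG" match in it is the verb "e", followed by one PP.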


The problem: after a certain amount of time the execution slows down and
eventually halts on certain inputs (i.e. sentences). The debugger shows an
infinite loop (or at least one with more than 1200 iterations) while
checking an array. How can this be, given that my arrays are no more than
300 elements long (every tag and every word is a separate entry in the
array)?
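
For comparison, a scan whose index advances on every step cannot run more
often than the array is long, so a count above 1200 on a 300-element array
suggests that some loop either does not advance its index or is restarted.
The sketch below (again Python for illustration only, with the hypothetical
name find_matching_close, and not taken from the attached code) looks for
the close tag that matches a given open tag and reports unbalanced input
instead of continuing to search.

```python
def find_matching_close(tokens, open_idx):
    """Return the index of the close tag matching tokens[open_idx], or -1."""
    label = tokens[open_idx][1:]              # "<V" -> "V"
    open_tag, close_tag = "<" + label, "</" + label
    depth = 0
    # i advances on every iteration, so this runs at most len(tokens) times.
    for i in range(open_idx, len(tokens)):
        if tokens[i] == open_tag:
            depth += 1
        elif tokens[i] == close_tag:
            depth -= 1
            if depth == 0:
                return i
    return -1  # unbalanced input: report it rather than keep looping


tokens = "<V <V e <ta X </V </V".split()
print(find_matching_close(tokens, 0))  # 6
print(find_matching_close(tokens, 1))  # 5
```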

So why does this happen? Are arrays not a good idea for this task? Is there
a better option in Oz? Where can I read about it? And if the original XML
file is not valid, can one still use an already available XML parser?

For those willing to look at my chunky code, it is in the attachment,
together with a test file. It processes only the first 6 sentences. It may
be a problem with the data file, or it may not be :)

Thank you very much in advance!

Svetoslav

-------------- next part --------------
A non-text attachment was scrubbed...
Name: new3.oz
Type: application/octet-stream
Size: 3302 bytes
Desc: new3.oz
Url : http://lists.gforge.info.ucl.ac.be/pipermail/mozart-users/attachments/20040109/b8741eba/new3.obj
-------------- next part --------------
<S <VPC <V <T Ne <ta T </T <V <Pron mi <ta Pp-d1s-t </Pron <V e <ta Vx---f-r3s </V </V </V <PP <Prep do <ta R </Prep <N smyah <ta Ncmsi </N </PP </VPC . </S
<S <VPS <NPA <N TSigarata <ta Ncfsd </N <APA , <Participle padnala <ta Vppi+cao-sfi </Participle <PP <Prep na <ta R </Prep <N poda <ta Ncmsh </N </PP , </APA </NPA <V dimeshe <ta Vpii+f-m3s </V </VPS . </S 
<S <VPA <VPS <Pron Tya <ta Pp-o3sf </Pron <V govori <ta Vpit+f-o3s </V </VPS <AdvPA <Pron tolkova <ta Pdq </Pron <CLR , <VPA <Pron kolkoto <ta Prq </Pron <V tryabvashe <ta Vni-+f-m3s </V </VPA </CLR </AdvPA </VPA . </S 
<S <VPF <DiscE idref="001660" <N sort="NE-Pers" Ivan <ta Npmsi </N </DiscE <VPS <V izglezhda <ta Vni-+f-r3s </V <CLDA <VPS <nid idref="001660" <V <C da <ta C </C <V e <ta Vx---f-r3s </V <Participle pokanen <ta Vppt+cv--smi </Participle </V </VPS </CLDA </VPS </VPF . </S 
<S <VPF <DiscE idref="00168" <N sort="NE-Pers" Ivan <ta Npmsi </N </DiscE <VPS <VPC <V e <ta Vx---f-r3s </V <Adv trudno <ta D </Adv </VPC <CLDA <VPS <nid idref="00168" <V <C da <ta C </C <V razbere <ta Vppt+f-r3s </V </V </VPS </CLDA </VPS </VPF . </S 
<S <VPF <DiscE idref="00169" <N sort="NE-Pers" Ivan <ta Npmsi </N </DiscE <VPS <V <V beshe <ta Vx---f-t3s </V <Participle resheno <ta Vpit+cv--sni </Participle </V <CLDA <VPS <nid idref="00169" <V <C da <ta C </C <V zamine <ta Vppt+f-r3s </V </V </VPS </CLDA </VPS </VPF . </S
<S <Pragmatic <V <V Razbira <ta Vpit+f-r3s </V <Pron se <ta Ppxa---t </Pron </V , </Pragmatic <CoordP <ConjArg <VPS <VPC <A choveshko <ta Ansi </A <V e <ta Vx---f-r3s </V </VPC <CLDA <V <C da <ta C </C <V sb'rka <ta Vppt+f-r3s </V </V </CLDA </VPS </ConjArg <Conj , <C no <ta C </C </Conj <ConjArg <VPA <Pron tuk <ta Pdl </Pron <VPA <nid idref="001475" <VPC <V znam <ta Vpii+f-r1s </V <DiscA idref="001475" <AdvPA <Adv mnogo <ta D </Adv <Adv dobre <ta D </Adv </AdvPA </DiscA <CLR <VPF <DiscE idref="001476" <PP <Prep za <ta R </Prep <Pron kakvo <ta Pia--sn </Pron </PP </DiscE <VPC <V stava <ta Vpii+f-r3s </V <NPA <nid idref="001476" <N duma <ta Ncfsi </N </NPA </VPC </VPF </CLR </VPC </VPA </VPA </ConjArg </CoordP . </S
<S <CoordP <ConjArg <V <V <pro-ss idref="0014161" S'glasi <ta Vppi+f-o3s </V <Pron se <ta Ppxa---t </Pron </V </ConjArg <Pragmatic , <V znaesh <ta Vpii+f-r2s </V , </Pragmatic <Conj <C i <ta C </C </Conj <ConjArg <VPA <V <pro-ss idref="0014161" tr'gnahme <ta Vppi+f-o1p </V <PP <Prep k'm <ta R </Prep <N doma <ta Ncmsh </N </PP </VPA </ConjArg </CoordP . </S
<S <CoordP <ConjArg <CoordP <ConjArg <PP <Prep Bez <ta R </Prep <N s'kraschenie <ta Ncnsi </N </PP </ConjArg <Pragmatic , <V znachi <ta Vpii+f-r3s </V , </Pragmatic <Conj <C i <ta C </C </Conj <ConjArg <PP <Prep bez <ta R </Prep <NPA <N mushkane <ta Ncnsi </N <PP <Prep v <ta R </Prep <N rebrata <ta Ncnpd </N </PP </NPA </PP </ConjArg </CoordP </ConjArg <Conj , <C a <ta C </C </Conj <ConjArg <AdvPA <CoordP <ConjArg <Adv krotko <ta D </Adv </ConjArg <Conj <C i <ta C </C </Conj <ConjArg <Adv tolerantno <ta D </Adv </ConjArg </CoordP <CLR , <VPA <Pron kakto <ta Prm </Pron <VPA <V pishe <ta Vpit+f-r3s </V <PP <Prep v <ta R </Prep <N zakona <ta Ncmsh </N </PP </VPA </VPA </CLR </AdvPA </ConjArg </CoordP . </S
<S <Pragmatic <V <T Ne <ta T </T <V schesh <ta Vt---f-r2s </V <T li <ta T </T </V , </Pragmatic <VPA <PP <Prep po <ta R </Prep <NPA <Pron tova <ta Pdno-sn </Pron <N vreme <ta Ncnsi </N </NPA </PP <VPA <nid idref="001322" <VPS <Participle minaval <ta Vpit+cao-smi </Participle <DiscA idref="001322" <Adv nablizo <ta D </Adv </DiscA <N kmet't <ta Ncmsf </N </VPS </VPA </VPA . </S
<S <CoordP <ConjArg <V <V <pro-ss idref="0013122" Otvorihme <ta Vppt+f-o1p </V <Pron idref="0013123" mu <ta Pp-d3smt </Pron </V </ConjArg <Conj <C i <ta C </C </Conj <ConjArg <VPC <V <Pron idref="0013123" mu <ta Pp-d3smt </Pron <V <pro-ss idref="0013122" razpravihme <ta Vppt+f-o1p </V </V <CLQ <VPS <VPC <PP <Prep za <ta R </Prep <Pron kakvo <ta Pia--sn </Pron </PP <V <Pron idref="0013122" ni <ta Psz-1--t </Pron <V e <ta Vx---f-r3s </V </V </VPC <N spor't <ta Ncmsf </N </VPS </CLQ </VPC </ConjArg </CoordP . </S
<S <CoordP <ConjArg <V <V <pro-ss idref="0009129" Izm'knahme <ta Vppt+f-o1p </V <Pron se <ta Ppxa---t </Pron </V </ConjArg <Conj <C i <ta C </C </Conj <ConjArg <VPC <V <pro-ss idref="0009129" poehme <ta Vppt+f-o1p </V <PP <Prep za <ta R </Prep <NPA <N selo <ta Ncnsi </N </NPA </PP </VPC </ConjArg </CoordP . </S
<S <CoordP <ConjArg <CoordP <ConjArg <VPA <PP <Prep Iz <ta R </Prep <N p'tya <ta Ncmsh </N </PP <VPC <V <pro-ss idref="000930" nakarah <ta Vppt+f-o1s </V <N idref="0009130" kmeta <ta Ncmsh </N <CLDA <VPA <VPC <V <C da <ta C </C <V <pro-ss idref="0009130" zastane <ta Vppi+f-r3s </V </V <PP <Prep na <ta R </Prep <NPA <Pron edin <ta Pfeo-smi </Pron <N kam'k <ta Ncmsi </N </NPA </PP </VPC <PP <Prep v'v <ta R </Prep <NPA <N vid <ta Ncmsi </N <PP <Prep na <ta R </Prep <N pametnik <ta Ncmsi </N </PP </NPA </PP </VPA </CLDA </VPC </VPA </ConjArg <Conj <C i <ta C </C </Conj <ConjArg <VPC <V <Pron idref="000930" si <ta Ppxd---t </Pron <V <pro-ss idref="000930" kazah <ta Vppt+f-o1s </V </V <N rechta <ta Ncfsd </N </VPC </ConjArg </CoordP </ConjArg <Conj , </Conj <ConjArg <CoordP <ConjArg <VPA <Adv posle <ta D </Adv <VPS <Pron idref="000930" az <ta Pp-o1s </Pron <V zastanah <ta Vppi+f-o1s </V </VPS </VPA </ConjArg <Conj , </Conj <ConjArg <VPS <Pron idref="0009130" toj <ta Pp-o3sm </Pron <VPC <V <Pron idref="0009130" si <ta Ppxd---t </Pron <V kaza <ta Vppt+f-o3s </V </V <N slovoto <ta Ncnsd </N </VPC </VPS </ConjArg </CoordP </ConjArg <Conj <C i <ta C </C </Conj <ConjArg <V <Pron ni <ta Pp-d1p-t </Pron <V olekna <ta Vppi+f-o3s </V </V </ConjArg </CoordP . </S

