Archive for the ‘PanLex’ Category

Announcing PanLinx

Friday, June 1st, 2012

PanLinx is the latest experimental interface for PanLex. You can try it now at http://panlex.org/try/plxl.shtml.

In case you’re asking “What is PanLex?”, here’s a quick answer: It’s a database that aims to include every known translation of every word (or dictionary-type phrase) into every language in the world. We’re talking about potentially hundreds of millions of words and trillions of translations.

The PanLex project is sponsored by The Long Now Foundation in San Francisco. The project’s main activity is building the database, but incidentally we have created some interfaces to give people (and machines) access to it. Before PanLinx, the interfaces relied on forms to be completed by users (fill in a text field, click on a button, etc.). This meant that most of the data would be invisible to most search engines, since search engines generally follow links and don’t fill out forms. We decided to create a different, link-only interface that would allow search engines to navigate across the database and reach data about millions of words and their translations. In principle, then, if you entered some obscure word in a search engine, you might be taken to the PanLinx page about that word.

For example, if you entered “bangunan” in a search engine, the hits would include http://panlex.org/cgi-bin/plxl.cgi?lv=2&ex=63964, a page showing all of PanLex’s translations of that (Malay) word, because the search engine would have crawled the links from the main PanLinx page to its millions of subsidiary pages and indexed them all.

Millions of pages? Yes, roughly 18 million at present. But PanLinx isn’t really a collection of 18 million pages sitting on a disk drive. As systems go, it’s a very small system, with a home page containing about 260 links, plus a program (about 100 lines of code) that regenerates that home page periodically to incorporate additions to the database, plus another program (fewer than 200 lines of code) that creates a new momentary page (also containing about 260 links) whenever anybody clicks on any of those links, and so forth.
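
To make the idea of a small program standing in for millions of pages concrete, here is a minimal sketch in Python. It is not the actual PanLinx code: the function names `child_ranges` and `render_page`, the `?from=`, `?to=` and `?word=` query parameters, and the assumption that each page covers a contiguous range of a sorted word list are all hypothetical, chosen only for illustration.

```python
FANOUT = 260  # assumed links per page, echoing the "about 260 links" figure


def child_ranges(start, end, fanout=FANOUT):
    """Split the half-open index range [start, end) into at most `fanout`
    contiguous sub-ranges of nearly equal size."""
    size = end - start
    if size <= 0:
        return []
    parts = min(fanout, size)
    base, extra = divmod(size, parts)
    ranges = []
    lo = start
    for i in range(parts):
        # The first `extra` sub-ranges get one additional item each.
        hi = lo + base + (1 if i < extra else 0)
        ranges.append((lo, hi))
        lo = hi
    return ranges


def render_page(words, start, end):
    """Return the link lines of a momentary page covering words[start:end].
    A sub-range holding a single word links to that word's own page;
    any larger sub-range links to another page of links."""
    lines = []
    for lo, hi in child_ranges(start, end):
        if hi - lo == 1:
            # Hypothetical leaf URL: one page per word.
            lines.append(f'<a href="?word={lo}">{words[lo]}</a>')
        else:
            # Hypothetical branch URL: a page generated on demand.
            lines.append(
                f'<a href="?from={lo}&to={hi}">{words[lo]} ... {words[hi - 1]}</a>'
            )
    return lines
```

Under these assumptions nothing is stored except the word list itself; every page is computed from its range when requested. With a fan-out of 260, three levels of such pages would reach 260 × 260 × 260 = 17,576,000 leaves, which is in the neighborhood of the 18 million figure, though the real tree may be laid out differently.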

Will search engines actually fall for this trick? Well, from our perspective, it isn’t a trick. PanLinx delivers real information about translations among millions of words in thousands of languages. The mission of search engines is to get people to the information that they want. We don’t know which search engines will crawl how far from the root to the leaves of the PanLinx tree, but 3 days after PanLinx went live, Google was already showing some hits 2 hops into the tree. Search engines are somewhat secretive about their rules. PanLinx gives us a platform to experiment with methods of making PanLex data findable through search engines. And, even though we built PanLinx primarily with search engines in mind, you are free to explore it yourself. If you have anything to report (such as “I converted PanLinx into a parlor game”), please comment below. Thanks.

 

PanLex joins Long Now Foundation

Monday, February 27th, 2012

Today’s announcement by The Long Now Foundation, headquartered in San Francisco, makes public the transfer of sponsorship of the PanLex project from Utilika Foundation to Long Now. There, PanLex will be working in partnership with The Rosetta Project, which curates a massive collection of documentation on the languages of the world. PanLex is creating a database that aims to document every known translation of every word in every language in the world. There are about half a billion translations in it so far.

Google Translate hits 64 with Esperanto

Saturday, February 25th, 2012

Google announced two days ago the addition of Esperanto as the 64th language served by Google Translate.

A quick test suggests that Esperanto is in some cases working a bit better than French, German, or Russian. Here’s a sentence from the home page of the PanLex project: “They dread a world in which only English, only Mandarin, or only Hindi has survived.” Here are the translations:

French: “Ils redoutent un monde dans lequel Hindi seulement l’anglais, seulement le mandarin, ou seulement a survécu.”

German: “Sie fürchten eine Welt, in der nur Englisch, nur Mandarin, Hindi oder nur überlebt hat.”

Russian: “Они боятся мира, в котором только на английском языке, только китайском, хинди или только выжила.”

Esperanto: “Ili timis la mondon en kiu nur angla, nur mandarena, aŭ nur hinda postvivis.”

Not perfect, but Esperanto seems to have escaped a weird parsing error that corrupts the others.

PanLex, Copyright, and Licensing

Friday, October 7th, 2011

PanLex is a compilation of lexical data and a set of procedures facilitating the interrogation and modification of the data.

In other locations I have commented on the issues of intellectual property that can arise from a project such as PanLex. These other comments include “PanLex as Intellectual Property”, “Source Citation in PanLex”, and a paragraph in my report to the 1 June 2011 meeting of the Utilika Foundation board of directors, where I wrote:

Intellectual-property claims impose some limits on the expansion of PanLex. The creators of some resources assert rights that, taken literally, would prohibit a person reading a resource from later even making use of what he or she had learned from it. Other resources are in the public domain. Between these extremes, many resources have been published subject to explicit or implicit copyright and various claims and restrictions, including various copyleft-type licenses and prohibitions of commercial use. The above-mentioned metadata that we record for resources used, or to be used, for PanLex include data on intellectual-property claims and permissions. In directing the PanLex project I take such claims into account, insofar as they appear to be understandable and enforceable, but, in most cases, I believe the owners of lexical resources could not prohibit the foundation from recording in PanLex information contained in those resources. This belief is based on the understanding that what we do with a resource is to record some of the facts asserted in it, in a novel (recoded, normalized, structured, interoperable) form. (In other words, PanLex doesn’t copy source X, but instead tells the world that some user of PanLex who has consulted source X claims that source X either states or implies that word Y is a translation of word Z.) In addition, I believe that PanLex typically advances the purposes of a contributing resource’s creator by making the facts contained in the resource more accessible and usable and referring users of those facts to the original resource for more detailed information. Until now, no claimant has asked us to remove facts based on a resource from PanLex. Some (e.g., LINCOM GmbH and SIL International) have expressly approved our use in PanLex of some or all of their data. 
However, some possessors of resources have demanded payment for providing easily processable versions for use in PanLex, and others have refused to provide such versions at all. The inclusion of funds for legal services in the 2012 budget reflects an assumption that intellectual-property issues, as well as contractual issues more generally, will likely become more complex as the PanLex project progresses.

One of the related issues is the protection of databases as compilations. A discussion of this issue by Daniel Tysver describes the competing originality and effort criteria for making a database copyrightable. Some compilers of databases in some jurisdictions have found copyright claims unenforceable because their databases were unoriginal. Telephone directories are the classic example.

Now there is litigation on another type of arguably unoriginal database: a collection of data on time zones that much of the world has come to depend on. There is much discussion on the merits of the claim. The outcome of this lawsuit may further clarify the limits of copyright protection on data like those in PanLex.

Native Speech Status

Saturday, August 6th, 2011

The judgments and the speech of native speakers of a language are the most commonly valued kinds of evidence in linguistics about the grammar of that language. The assumption that native speakers of a language have a status superior to that of its other users also appears in various domains of applied linguistics. Examples are language documentation and language standardization.

The recent Record-a-thon sponsored by the Long Now Foundation’s Rosetta Project invited people around the world to record themselves speaking their native languages.

The Unicode Technical Committee recently invited native speakers of Danish to tell it whether they expect the character U+214D (⅍) to be sorted as if equivalent to the sequence “A/S”.

Such actions prompt a question about the appropriate status of non-native language use in these domains. Should non-native uses of languages be documented along with native uses? Should the expectations and wishes of non-native users of a language be respected, too, by committees that define language standards? These questions are significant in a world in which non-native speech and writing are common, particularly in internationally used languages, pidgins, creoles, and artificial languages.

The PanLex project on which I work accepts lexical translation evidence from native and non-native speakers alike. In fact, its typical source is a bilingual dictionary, which is most commonly compiled by a person who is a native speaker of one but not of the other of the two languages.

Today’s PanLex chuckle: Garbage almost in

Thursday, September 23rd, 2010

Today I added translations to PanLex attested by a French-Finnish dictionary distributed to the public over the Web. It translates French “Connecticut” into Finnish as “kerroskuvaus”.

That seemed strange, so I investigated. Other sources say that Finnish “kerroskuvaus” means “tomography”. What do tomography and Connecticut have in common? Susie noticed that they share “CT”. So presumably somebody mistakenly unabbreviated “CT” when compiling this dictionary. Or compiled it automatically by inference from sources that report “Connecticut” = “CT”, “CT” = “(computerized) tomography”, and “tomography” = “kerroskuvaus”.
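
The second possibility, automatic inference across chained equivalences, can be illustrated with a small Python sketch. This is only a guess at the kind of process that could produce such an error, not a reconstruction of how that dictionary was built; the names `CLAIMS`, `build_graph`, and `inferred` are hypothetical, and “(computerized) tomography” is simplified to “tomography”.

```python
from collections import defaultdict

# Toy equivalence claims modeled on the example above.
CLAIMS = [
    ("Connecticut", "CT"),
    ("CT", "tomography"),
    ("tomography", "kerroskuvaus"),
]


def build_graph(claims):
    """Treat each claim as a symmetric link between two expressions."""
    graph = defaultdict(set)
    for a, b in claims:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def inferred(graph, source, max_hops):
    """Breadth-first search: map each expression reachable from `source`
    within `max_hops` links to the number of hops needed to reach it."""
    seen = {source: 0}
    frontier = [source]
    for hop in range(1, max_hops + 1):
        next_frontier = []
        for expr in frontier:
            for neighbor in graph[expr]:
                if neighbor not in seen:
                    seen[neighbor] = hop
                    next_frontier.append(neighbor)
        frontier = next_frontier
    del seen[source]  # the source is not a translation of itself
    return seen


graph = build_graph(CLAIMS)
print(inferred(graph, "Connecticut", 1))  # {'CT': 1}
print(inferred(graph, "Connecticut", 3))
# {'CT': 1, 'tomography': 2, 'kerroskuvaus': 3}
```

Each link is plausible on its own, but the ambiguous abbreviation in the middle lets the chain connect two unrelated meanings. The more hops an inference takes, the more chances it has to pass through an ambiguous expression like this one.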

Fair linguistic communication

Friday, August 6th, 2010

How can a multilingual supranational government make the language situation in its territory fair for all?

This issue has attracted a bit of scholarly attention, including by me. The latest contribution coming to my attention is Sabine Fiedler’s 2010 article “Approaches to fair linguistic communication” in the European Journal of Language Policy.

Unlike me, Fiedler deals with the practical politics of a concrete case: the European Union. She describes resistance to the growing near-monopoly of English in EU government and commerce and five ideas that have been offered as antidotes to the inequalities that a uniquely privileged language creates between its native speakers and everybody else. Two of these involve individual multilingualism: in one case active and in the other case passive. The other three ideas are to elevate a single language without native speakers to a special universal status; the languages proposed are, respectively, Latin, reduced international English, and Esperanto.

Fiedler considers the last two most practical and meritorious. Recognizing reduced international English as the de jure lingua franca would entail dispossessing the native speakers of English. The non-native-speaker majority would assume majority control over the language, which would diverge from standard domestic English. The more radical alternative, the replacement of English by Esperanto, would be more economical and less discriminatory.

In Fiedler’s judgment, the chief obstacle to the adoption of the Esperanto option is prejudice against human-designed languages. My hunch is that this kind of bias is not as powerful a force as the preference for gradualism. The rise of English to dominance in the EU has been gradual, and the liberation of international English from native-speaker ownership, gradual so far, could gradually continue and accelerate, eventually reaching the point at which all native English speakers would study the reduced international variety of English as a course in school. But could the replacement of English with Esperanto be accomplished in small increments? What would the EU look like when the process reached the 50% point? Fiedler suggests that the advocates of the Esperanto option need to explain why its implementation would be beneficial. Perhaps they could do themselves good also by explaining something else: whether the EU could move gradually from the status quo to the use of Esperanto as a lingua franca, or whether this change (like changing from driving on the left side of the road to the right) would be practical only as an abrupt switch.

PanLex Translation Interface

Tuesday, May 11th, 2010

I’m developing a Web application that you can use to translate words into hundreds of languages.

This application demonstrates the use of PanLex, an emerging panlingual lexical database sponsored by Utilika Foundation. The bare-bones interface lets the user enter a word or phrase (in the user’s language), and offers all the translations of it that PanLex currently contains.

So far, I have implemented the tool in two languages: Esperanto and Turkish. The Esperanto interface is InterVorto and the Turkish interface is TümSöz.

After some more work on portability, it should take about 5 minutes per language to extend the tool to other languages. PanLex currently contains data in about 1300 languages.

The only translations offered by this tool are attested ones: translations approved by contributors. These vary greatly in number. For example, InterVorto can translate the Esperanto word “balotilo” into only 8 languages, but can translate the word “akvo” into 668 languages. If you want translations into more languages, you can follow a link to PanLem, a more comprehensive (and expert-level) PanLex interface, which allows you to see two-step translations (translations of translations). PanLem is bleeding-edge work, in which I am limiting the UI to purely lemmatic labels so PanLex can localize itself into any language that it covers. So, if you follow that link, be prepared to explore.

Usability comments on the tool are, of course, very welcome.