Semantic tagging at large scale

AnnoCultor was applied to Europeana metadata records to semantically link 60% of the records to Who-What-Where-When vocabularies. Many records are linked to several vocabularies each. We expect that at least two-thirds of the multilingual words in the Europeana index now come from semantic tags.

Technically, tagging is performed by the free, open-source AnnoCultor semantic tagger, which is also available as a free web service at semium.org.

Europeana portal and metadata

Europeana is a pan-European directory of cultural heritage objects. More than 1500 European institutions (museums, libraries, and archives) donate their metadata (catalogue records) to be published via the Europeana web portal. This metadata represents catalogue entries for the objects, described with the typical Dublin Core fields: title, creator, publisher, coverage, etc. Europeana does not hold the objects themselves, only their URLs. Thus, a catalogue record for a painting would be stored and served by Europeana, and the JPEG with the painting would be served by the provider of this record.

Metadata is expensive to create and is often quite concise, typically consisting of just a few fields. Europeana aims to make the best use of it and to allow multilingual access to it.

Semantic tagging

The idea of semantic tagging is to look at the value of a metadata tag and try to find a corresponding vocabulary entry. For example, fetch the original metadata tag 'Parijs' and find a description of Paris in a vocabulary of places.

Then, we pull alternative labels for that tag (often multilingual, such as 'Paris' or 'Париж') and additional information (geographical coordinates, population, etc.) and use them to enrich the original record.
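The lookup described above can be sketched in a few lines of Python. This is only an illustration, not AnnoCultor's actual code: the `VOCABULARY` dictionary, the `find_term` and `enrich` helpers, the `enrichment_*` field names, the sample record, and the rounded coordinates are all assumptions made up for the example.

```python
# Hypothetical miniature vocabulary of places; a real one (e.g. Geonames) is far larger.
VOCABULARY = {
    "place/paris": {
        "labels": {"nl": "Parijs", "en": "Paris", "fr": "Paris", "ru": "Париж"},
        "lat": 48.85,
        "lon": 2.35,
    },
}


def find_term(value):
    """Return the id of the first term with a label equal to value (case-insensitive)."""
    wanted = value.strip().casefold()
    for term_id, term in VOCABULARY.items():
        if any(label.casefold() == wanted for label in term["labels"].values()):
            return term_id
    return None


def enrich(record, field):
    """Copy the record and add labels and coordinates of the matched term, if any."""
    enriched = dict(record)
    term_id = find_term(record.get(field, ""))
    if term_id is not None:
        term = VOCABULARY[term_id]
        enriched["enrichment_term"] = term_id
        enriched["enrichment_labels"] = sorted(set(term["labels"].values()))
        enriched["enrichment_coordinates"] = (term["lat"], term["lon"])
    return enriched


record = {"dc_title": "Gezicht op de Seine", "dc_coverage": "Parijs"}
enriched = enrich(record, "dc_coverage")
print(enriched["enrichment_labels"])
```

The original record is left untouched; the enrichment lives in separate fields, so it can be indexed or dropped independently of the provider's data.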

Who-What-Where-When

We look at the Who-What-Where-When tags and run the AnnoCultor semantic tagger out-of-the-box to discover millions of links:

  • 'Who' comes from the dc_creator field, and we discover 0,01 mln links (the low number is explained below),
  • 'What' comes from the dc_subject field, and we discover 2,4 mln links,
  • 'Where' comes from the dc_coverage field, and we discover 5,8 mln links,
  • 'When' comes from both the dc_date and dc_coverage fields, and we discover 7,9 mln links.

In total, 11,2 mln out of 18,7 mln records get at least one semantic link as a result of this process. Many get two or more, and some get all four.

These numbers are on the low side for two reasons:

  • We only use the tags that are specifically designated by the providers as Who-What-Where-When metadata fields. There are many more WWWW-s mentioned in fields dc_title and dc_description, but we are not using them at the moment to be on the safe side in terms of quality.
  • We use AnnoCultor with virtually no adaptations, and it misses many values formatted in provider-specific ways.

From preliminary experiments we can conclude that the number of links may be increased tenfold with limited effort.

Using tags for multilingual access

Semantic tagging creates links between records and multilingual term labels. Now, a search for any of these labels allows finding the original record. For example, a query for a horse (лошадь in Russian) would return a toy horse from Germany originally tagged with 'Pferd' that was semantically linked to GEMET term http://www.eionet.europa.eu/gemet/concept/3995 with the following alternative labels: kôň ; cavallo ; hest ; horse ; cheval ; ganado equino ; кон ; hevonen ; pferd ; häst ; konj ; kůň ; hobune ; paard ; arklys ; ίππος/άλογο ; cavalos ; ló ; cal ; лошадь ; koń.

Moreover, the tagger pulls in the fact that 'horse' has the broader term 'animal', so the record can also be found by a search for an animal in 30 languages.

This shows the tremendous improvement in multilinguality brought by semantic tagging/enrichment: in practice, it adds 2,4 mln subject translations into 30 languages, with many more translations of broader terms.
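The effect on search can be illustrated with a toy inverted index. The sketch below is an assumption-laden illustration rather than Europeana's indexing code: the `TERMS` table holds only a handful of labels, and the record id `toy-horse-de` and the helper names are invented for the example.

```python
from collections import defaultdict

# Hypothetical excerpt of a vocabulary: labels per term plus a broader link.
TERMS = {
    "horse": {"labels": ["horse", "pferd", "cheval", "paard", "лошадь"], "broader": "animal"},
    "animal": {"labels": ["animal", "tier", "dier", "животное"], "broader": None},
}


def index_terms(term_id):
    """Collect labels of a term and of all its broader terms."""
    labels = []
    while term_id is not None:
        labels.extend(TERMS[term_id]["labels"])
        term_id = TERMS[term_id]["broader"]
    return labels


def build_index(tagged_records):
    """Map every label (case-folded) to the ids of records reachable through it."""
    index = defaultdict(set)
    for record_id, term_id in tagged_records.items():
        for label in index_terms(term_id):
            index[label.casefold()].add(record_id)
    return index


# A toy horse from Germany, originally tagged 'Pferd' and linked to the 'horse' term.
index = build_index({"toy-horse-de": "horse"})
print(index["лошадь"])    # found by the Russian label of 'horse'
print(index["животное"])  # found by the Russian label of the broader term 'animal'
```

A single link from the record to the 'horse' term is enough to make it findable under every label of that term and of its broader terms.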

These terms are all included both in the main portal search and in the web search done by major search engines.

It is also possible to search multilingual labels separately. Let us search for France in Russian, Greek, and Chinese. Yep, the results are all the same and they are all pulled from the semantic tags.

Using tags for range queries: region or time period

Catalogue records tend to be precise, while human queries are often generic. There are millions of records where a town is specified, but the country is not. For example, half a million Europeana records are tagged with 'Paris' but do not mention 'France'. A user who queries for France would have no chance of finding them.

With semantic tagging we find the corresponding town and its broader regions, such as a province and a country. Then we add them to the index and make them searchable: Europeana records about France, including those that don't mention 'France'. The same result may be achieved by using multilingual labels combined with the broader-narrower search: Europeana records about Франция.

The same thing happens to periods: many records are tagged with an exact year and cannot be found on requests for centuries or historical periods. In a typical example a record would be tagged with 1902, and nothing about its century, or associated cultural periods, would be mentioned. Records from the 20-th century, most of which never state that.
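Both kinds of expansion, from a town to its regions and from a year to its century, can be sketched as follows. The `BROADER` table, the helper names, and the sample record are hypothetical, and `century_of` assumes the strict convention that the years 1901 to 2000 make up the 20th century.

```python
# Hypothetical broader-term links for places; a real vocabulary holds many more levels.
BROADER = {"Paris": "Île-de-France", "Île-de-France": "France", "France": "Europe"}


def broader_places(place):
    """Return all regions that contain the place, nearest first."""
    chain = []
    while place in BROADER:
        place = BROADER[place]
        chain.append(place)
    return chain


def century_of(year):
    """Return the century number of a year, e.g. 1902 falls in century 20."""
    return (year - 1) // 100 + 1


record = {"place": "Paris", "year": 1902}
index_entry = {
    "places": [record["place"]] + broader_places(record["place"]),
    "century": century_of(record["year"]),
}
print(index_entry)
```

With such an entry in the index, a query for 'France' or for the 20th century reaches a record that only mentions 'Paris' and 1902.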

Using tags for precise queries

Place

Europeana records about Venice. These records write 'Venice' in a dozen different languages. AnnoCultor was able to look at these multilingual labels and link them to a single term.

As we know the coordinates of Venice (located at 45.4, 12.3), we can draw a large rectangle around it, bounded by the coordinates (42,10) and (48,15). We can now retrieve records located in the neighbourhood of Venice and put them on a map.
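A rectangle query of this kind boils down to two range checks. The sketch below assumes that the pairs in the text are (latitude, longitude) and that (42,10) is the south-west corner; the sample records and their ids are invented for the example.

```python
# Rectangle around Venice, using the corner coordinates given in the text,
# read here as (latitude, longitude) pairs (an assumption).
SOUTH, WEST = 42.0, 10.0
NORTH, EAST = 48.0, 15.0


def in_box(lat, lon):
    """True if the point lies inside the rectangle (borders included)."""
    return SOUTH <= lat <= NORTH and WEST <= lon <= EAST


# Hypothetical tagged records with the coordinates of their place terms.
records = [
    {"id": "r1", "place": "Venice", "lat": 45.4, "lon": 12.3},
    {"id": "r2", "place": "Padua", "lat": 45.4, "lon": 11.9},
    {"id": "r3", "place": "Paris", "lat": 48.85, "lon": 2.35},
]

nearby = [r["id"] for r in records if in_box(r["lat"], r["lon"])]
print(nearby)
```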

Some records mention 'Paris' and never mention 'France'. The fact that Paris is in France is described in the vocabulary. In the (near) future we would be able to automatically pull records about 'Paris' on a request for 'France'.

Time

Europeana records tagged as 'medieval'. These records use two different languages to refer to this period: English and French. In French they also use several synonyms, such as 'Période Médiévale' and 'Moyen Âge'. These labels are all summarised in the definition of Medieval in the vocabulary. And AnnoCultor was able to look at these multilingual labels and link them to a single term.

Europeana records timed with World War I. Here we are not gaining much from tagging, as most of the records are described with an explicit year.

A request for approximately 12-th century records returns records where the exact year is represented in different ways, such as (1100), 1100-01-01 00:00:00, etc. AnnoCultor can understand them and link them to exact years.

What is more interesting, it also includes records where the date is not given as an explicit year, but as a phrase such as '12e siècle'. Here they are retrieved separately: records that do not have a numeric year, but a phrase describing the corresponding period. They would never be found without tagging.
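Recognising such values is largely a matter of pattern matching. The sketch below handles only the three formats quoted above; the two regular expressions and the `normalise_date` helper are assumptions written for this illustration and do not reproduce AnnoCultor's own rules.

```python
import re

# Assumed patterns, for illustration only: a four-digit year at the start of the value
# (optionally in parentheses) and a French century phrase such as '12e siècle'.
YEAR = re.compile(r"^\(?(\d{4})\)?(?:\D|$)")
CENTURY_FR = re.compile(r"^(\d{1,2})e\s+siècle$", re.IGNORECASE)


def normalise_date(value):
    """Return ('year', n), ('century', n) or None when the value is not understood."""
    value = value.strip()
    match = YEAR.match(value)
    if match:
        return ("year", int(match.group(1)))
    match = CENTURY_FR.match(value)
    if match:
        return ("century", int(match.group(1)))
    return None


for raw in ["(1100)", "1100-01-01 00:00:00", "12e siècle", "unknown"]:
    print(raw, "->", normalise_date(raw))
```

Values that match neither pattern are returned as `None`, so they stay untagged rather than being linked to a wrong period.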

Some records are tagged with a numeric year, such as 1988, and some with 'late 20-th century', '1980-es', etc. The fact that 1970-es belong to late 20-th century is stated in the vocabulary. In the (near) future we would be able to automatically pull all these records on a request for a broader period, e.g. records dated with 1988 would be returned on a request for the 20-th century.

Isn't it too precise?

Paris is not just a point on a map, and we fully acknowledge that. When we assign it to a point, we just indicate where it is. And we make it possible to find Paris on a map, or to find places located around it.

In the same way, the Middle Ages are not strictly limited to 476 - 1453, and their definition varies per country and per textbook. When we assign them to an exact time interval, we again just indicate when they were. And we make it possible to place medieval records on a timeline.
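Treating a period as an interval makes the timeline placement a simple comparison. A minimal sketch, using the 476 - 1453 boundaries quoted above; the `PERIODS` table and the `periods_of` helper are hypothetical names.

```python
# Interval from the text; other sources draw the boundaries differently.
PERIODS = {"Middle Ages": (476, 1453)}


def periods_of(year):
    """Return names of all periods whose interval contains the year."""
    return [name for name, (start, end) in PERIODS.items() if start <= year <= end]


print(periods_of(1100))
print(periods_of(1748))
```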

How semantic tagging works

Let us look at a view on Dresden, one of these records with all four WWWW tags. Click on 'More' and 'Auto-Tags' to see semantic tags.

Specifically, it has the following four fields:

 
 dc:coverage = Dresden
 dc:date = 1748
 dc:creator = Canaletto 
 dc:subject = Fotografie

The AnnoCultor semantic tagger automatically interprets these values and tries to find the corresponding terms in specialised databases of places, periods, people, and subjects. It finds the following matches:

  • Place Term: http://sws.geonames.org/2935022/
  • Period Term: http://semium.org/time/1748
  • Concept Term: http://www.eionet.europa.eu/gemet/concept/13123 ; http://www.eionet.europa.eu/gemet/concept/6205
  • Agent Term: http://dbpedia.org/resource/Canaletto

Then, it adds links to these terms and pulls additional information about this record from the corresponding vocabulary entries, as shown on the object web page. These fields are called semantic tags, or enrichments, as they add something to (enrich) the original data. The tagging is done by the AnnoCultor semantic tagger.
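The Dresden example can be replayed with a small field-to-vocabulary mapping. This is a simplified sketch, not the tagger's real logic: the lookup tables contain only the matches listed above, the table and function names are hypothetical, both concept terms are assumed to come from the single dc:subject value, and the 'When' lookup in dc:coverage is left out.

```python
# Lookup tables holding only the matches listed above; a real tagger searches
# full vocabularies. Table and key names are hypothetical.
PLACES = {"Dresden": ["http://sws.geonames.org/2935022/"]}
PERIODS = {"1748": ["http://semium.org/time/1748"]}
AGENTS = {"Canaletto": ["http://dbpedia.org/resource/Canaletto"]}
CONCEPTS = {
    "Fotografie": [
        "http://www.eionet.europa.eu/gemet/concept/13123",
        "http://www.eionet.europa.eu/gemet/concept/6205",
    ]
}

# Which vocabulary is consulted for which metadata field.
FIELD_TO_VOCABULARY = {
    "dc:coverage": ("place", PLACES),
    "dc:date": ("period", PERIODS),
    "dc:creator": ("agent", AGENTS),
    "dc:subject": ("concept", CONCEPTS),
}


def tag(record):
    """Return enrichment fields: term type -> list of matched term URIs."""
    tags = {}
    for field, (term_type, vocabulary) in FIELD_TO_VOCABULARY.items():
        value = record.get(field, "").strip()
        if value in vocabulary:
            tags[term_type] = vocabulary[value]
    return tags


record = {
    "dc:coverage": "Dresden",
    "dc:date": "1748",
    "dc:creator": "Canaletto",
    "dc:subject": "Fotografie",
}
for term_type, uris in tag(record).items():
    print(term_type, uris)
```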

Vocabularies

Semantic tagging is all about linking records to vocabulary terms. We use the following vocabularies:

  • 'Who' - small vocabulary of ca. 10,000 painters pulled from Wikipedia (several languages)
  • 'What' - ca. 10,000 terms from GEMET (30 languages, reduced to EU languages)
  • 'Where' - ca. 140,000 places from Geonames (many languages, reduced to EU languages)
  • 'When' - ca. 2500 periods from AnnoCultor Time Ontology (several languages).

Technical note

All original metadata is stored in Apache Solr. It is then retrieved by AnnoCultor and tagged. The original records plus the enrichment fields are copied to another Solr instance.
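The flow of records can be outlined as below. The sketch deliberately avoids any Solr client API: plain Python lists stand in for the two Solr instances, and `copy_with_enrichments`, `toy_tagger`, and the `enrichment_place_term` field name are hypothetical.

```python
def copy_with_enrichments(source_records, tagger, sink):
    """Read each original record, tag it, and store original plus enrichment fields."""
    for record in source_records:
        enriched = dict(record)          # keep the original fields
        enriched.update(tagger(record))  # add the enrichment fields
        sink.append(enriched)


def toy_tagger(record):
    """Hypothetical tagger: link the value 'Dresden' to its Geonames term."""
    if record.get("dc_coverage") == "Dresden":
        return {"enrichment_place_term": "http://sws.geonames.org/2935022/"}
    return {}


# Lists standing in for the source and target Solr instances.
source_index = [{"id": "1", "dc_coverage": "Dresden"}, {"id": "2", "dc_coverage": "?"}]
target_index = []
copy_with_enrichments(source_index, toy_tagger, target_index)
print(target_index)
```

Writing to a second store keeps the providers' original metadata unchanged, so the tagging can be rerun at any time.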

(c) 2011, Borys Omelayenko,