Skip to content
Snippets Groups Projects
Commit 120bae68 authored by Marc Feger's avatar Marc Feger
Browse files

Change README; Fix wrong Path in mappings.csv

parent b02e20a3
No related branches found
No related tags found
No related merge requests found
# Data and Knowledge Engineering HP
## Task 1)
This is the elaboration of the home project.
The aim of this project was to preserve all comedy films from DBPedia and Wikidata.
The condition was that each film had at least one director born in 1970 or later.
In this section I am going to shown some statistics of data along with the development of the queries.
## Preparation
This project was developed for Python 3.7. It is based on a Flask server.
In order to start the server and to be able to execute all operations, some preparations have to be made in advance.
### Wikidata
For the data in Wikidata I managed to work out a structure of the data.
First all Enities are searched which are a film or a subform of a film.
The exact structure can be created from any of the green paths P, if P ends at a movie or a movie genre.
Especially important is that a movie genre is a top category for movies.
Therefore, these can be used to obtain more film entities that are not directly marked as film.
For example, `Mulan (wd:Q537407)` is a movie that is marked as `animated feature film (wd:Q29168811) `.
There are gradations of the terms film e.g. `animated feature film (wd:Q29168811)` separates into `feature film (wd:Q24869)`
and `animated film (wd:Q202866)`. These sub-concepts in turn result in the term `film (wd:Q11424)`.
![Structure](images/Structure_Wikidata.jpg)
At first all entities X are searched which were marked directly as film.
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT DISTINCT ?movie
WHERE{
# X instances of film
{?movie wdt:P31 wd:Q11424.}
}
GROUP BY ?movie
This led to a total of `229738` unique entities.
These collected results are extended with the entities marked with a subclass of film.`
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT DISTINCT ?movie
WHERE{
# node who instances of films.
{?movie wdt:P31 wd:Q11424.}
UNION
# X instances of Y; Y subclass of films.
{?movie wdt:P31 [wdt:P279 wd:Q11424].}
}
GROUP BY ?movie
This brings together a total of `260305` unique entities. This way a gain of `30567` entities is acquired.
Then all entities that have been marked with a subclass of a film genre will be added.
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT DISTINCT ?movie
WHERE{
# node who instances of films.
{?movie wdt:P31 wd:Q11424.}
UNION
# X instances of Y; Y subclass of films.
{?movie wdt:P31 [wdt:P279 wd:Q11424].}
UNION
# X instances of Y; Y instance of film genre.
{?movie wdt:P31 [wdt:P31 wd:Q201658].}
}
GROUP BY ?movie
This adds `261471` unique entities. Therefore ` 1166` new entities have been added.
Then all entities X marked with a term derived from a subclass of film can be added.
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT DISTINCT ?movie
WHERE{
# node who instances of films.
{?movie wdt:P31 wd:Q11424.}
UNION
# X instances of Y; Y subclass of films.
{?movie wdt:P31 [wdt:P279 wd:Q11424].}
UNION
# X instances of Y; Y instance of film genre.
{?movie wdt:P31 [wdt:P31 wd:Q201658].}
UNION
# X instances of Y; Y instance of or subclass of Z; Z is Y; Y subclass of film.
{?movie wdt:P31 [wdt:P31|wdt:P279 [wdt:P279 wd:Q11424]].}
}
GROUP BY ?movie
This again results in a gain of `19816` independent entities.
Thus up to this point a total of `281287` independent entities have been found.
Then all entities X marked with a term derived from a instance of a film genre can be added.
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT DISTINCT ?movie
WHERE{
# node who instances of films.
{?movie wdt:P31 wd:Q11424.}
UNION
# X instances of Y; Y subclass of films.
{?movie wdt:P31 [wdt:P279 wd:Q11424].}
# X instances of Y; Y instance of film genre.
UNION
{?movie wdt:P31 [wdt:P31 wd:Q201658].}
UNION
# X instances of Y; Y instance of or subclass of Z; Z is Y; Y subclass of film.
{?movie wdt:P31 [wdt:P31|wdt:P279 [wdt:P279 wd:Q11424]].}
UNION
# X instances of Y; Y instance of or subclass of Z; Z is Y; Y subclass of film genre.
{?movie wdt:P31 [wdt:P31|wdt:P279 [wdt:P31 wd:Q201658]].}
}
GROUP BY ?movie
This again results in a gain of `8723` independent entities.
Thus wikidata contains about `290010` many independent entities of a film.
There had been a [project](https://www.wikidata.org/wiki/Wikidata:WikiProject_Movies/Numbers)
In August 2018 a similar project was carried out on wikidata.
The researchers came to the conclusion that a total of `247075` independent film entities are contained in wikidata.
My results beat those of the researchers by `17%`.
This result is possible because more than 9 months have passed.
In addition, the result underlines the quality of the data I found.
From the set of all movies you can search for those that are directly marked with the category Comedy.
This part is indicated by the red lines.
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT DISTINCT ?movie
WHERE{
# node who instances of films.
{?movie wdt:P31 wd:Q11424.}
UNION
# X instances of Y; Y subclass of films.
{?movie wdt:P31 [wdt:P279 wd:Q11424].}
# X instances of Y; Y instance of film genre.
UNION
{?movie wdt:P31 [wdt:P31 wd:Q201658].}
UNION
# X instances of Y; Y instance of or subclass of Z; Z is Y; Y subclass of film.
{?movie wdt:P31 [wdt:P31|wdt:P279 [wdt:P279 wd:Q11424]].}
UNION
# X instances of Y; Y instance of or subclass of Z; Z is Y; Y subclass of film genre.
{?movie wdt:P31 [wdt:P31|wdt:P279 [wdt:P31 wd:Q201658]].}
# Imagine a inner join
# X genre comedy.
{?movie wdt:P136 wd:Q40831.}
}
GROUP BY ?movie
There are only `208` movies marked directly with comedy.
This can be improved by adding movies whose genre is a subtype of comedy.
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT DISTINCT ?movie
WHERE{
# node who instances of films.
{?movie wdt:P31 wd:Q11424.}
UNION
# X instances of Y; Y subclass of films.
{?movie wdt:P31 [wdt:P279 wd:Q11424].}
# X instances of Y; Y instance of film genre.
UNION
{?movie wdt:P31 [wdt:P31 wd:Q201658].}
UNION
# X instances of Y; Y instance of or subclass of Z; Z is Y; Y subclass of film.
{?movie wdt:P31 [wdt:P31|wdt:P279 [wdt:P279 wd:Q11424]].}
UNION
# X instances of Y; Y instance of or subclass of Z; Z is Y; Y subclass of film genre.
{?movie wdt:P31 [wdt:P31|wdt:P279 [wdt:P31 wd:Q201658]].}
# Imagine a inner join
# X genre comedy.
{?movie wdt:P136 wd:Q40831.}
UNION
# X genre G; G subclass of comedy.
{?movie wdt:P136 [wdt:P279 wd:Q40831].}
}
GROUP BY ?movie
By considering an intermediate stage for genres, a total of `33393` new
comedy films are added to those already identified.
Thus a total of `33601` comedy movies are included in Wikidata.
This results in a percentage of about `0.1158%` of all identified movies.
The query was structured as follows:
![formula wikidata](images/formula_wikidata.png)
### DBPedia
In this section all comedy movies are extracted from DBPedia. The following scheme is used.
![structure dbpedia](images/Structure_DBpedia.jpg)
This scheme is much smaller than that of Wikidata, as no general concept or consistency can be identified.
However, the films can be requested directly as such, as they are clearly described in
the [ontology](http://mappings.dbpedia.org/server/ontology/classes/) of DBPedia.
First, all entities of the type film are determined.
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dbc: <http://dbpedia.org/resource/Category:>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dct: <http://purl.org/dc/terms/>
SELECT DISTINCT ?movie
WHERE{
# X type Comedy
{?movie rdf:type dbo:Film.}
}GROUP BY ?movie
This determines a total of `10000` independent entities.
This is a problem because the public endpoint limits the maximum result size to 10000 results.
For details see [here](https://wiki.dbpedia.org/public-sparql-endpoint).
There are comparable works that come on `87000` film entities.
See [here](https://wiki.dbpedia.org/about/facts-figures) for more information.
Next, out of the `10000`, all those entities are extracted that were directly titled with comedy.
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dbc: <http://dbpedia.org/resource/Category:>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dct: <http://purl.org/dc/terms/>
SELECT DISTINCT ?movie
WHERE{
# X type Film
{?movie rdf:type dbo:Film.}
# Imagine a inner join
# X subject Comedy
{?movie dct:subject dbc:Comedy}
}GROUP BY ?movie
It detects `4` entities that are directly marked with Comedy.
For example one result is: `dbpedia.org/resource/The_Clowns_(film)`.
Then all films whose subject is derived from comedy will be determined.
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dbc: <http://dbpedia.org/resource/Category:>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dct: <http://purl.org/dc/terms/>
SELECT DISTINCT ?movie
WHERE{
# X type Film
{?movie rdf:type dbo:Film.}
# Imagine a inner join
# X subject Comedy
{?movie dct:subject dbc:Comedy}
UNION
# X subject Y; Y like "Comedy".
{?movie dct:subject ?y FILTER CONTAINS(lcase(str(?y)), "comedy").}
UNION
# X genre Y; Y like "Comedy".
{?movie dbo:genre ?y FILTER CONTAINS (lcase(str(?y)), "comedy").}
}GROUP BY ?movie
This results in `10000` comedy movies (oh wonder !).
The query was structured as follows:
![formula dbpedia](images/formula_dbpedia.png)
## Task 2
Install the enclosed requirements as well:
Next, all films from Wikidata and DBPedia are collected whose directors bore before or in 1970.
$ pip3.7 install -r requirements.py
### Wikidata
The search scheme of Wikidata will be extended by a director.
This is indicated by the blue arrows.
![structure wikidata with director](images/Structure_Wikidata_director.jpg)
The first thing to be checked is how many comedy movies ever have a director.
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT DISTINCT ?movie
WHERE{
# node who instances of films.
{?movie wdt:P31 wd:Q11424.}
UNION
# X instances of Y; Y subclass of films.
{?movie wdt:P31 [wdt:P279 wd:Q11424].}
# X instances of Y; Y instance of film genre.
UNION
{?movie wdt:P31 [wdt:P31 wd:Q201658].}
UNION
# X instances of Y; Y instance of or subclass of Z; Z is Y; Y subclass of film.
{?movie wdt:P31 [wdt:P31|wdt:P279 [wdt:P279 wd:Q11424]].}
UNION
# X instances of Y; Y instance of or subclass of Z; Z is Y; Y subclass of film genre.
{?movie wdt:P31 [wdt:P31|wdt:P279 [wdt:P31 wd:Q201658]].}
# Imagine a inner join
# X genre comedy.
{?movie wdt:P136 wd:Q40831.}
UNION
# X genre G; G subclass of comedy.
{?movie wdt:P136 [wdt:P279 wd:Q40831].}
# X director exists
{?movie wdt:P57 []}
}GROUP BY ?movie
There are a total of `28964` comedy films that also have a director.
This leads to a loss of a total of `4637` comedy films that do not have a director.
Then it is examined how many of the comedy films with director also have a director with a date of birth.
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT DISTINCT ?movie #?director ?dob
WHERE{
# node who instances of films.
{?movie wdt:P31 wd:Q11424.}
UNION
# X instances of Y; Y subclass of films.
{?movie wdt:P31 [wdt:P279 wd:Q11424].}
# X instances of Y; Y instance of film genre.
UNION
{?movie wdt:P31 [wdt:P31 wd:Q201658].}
UNION
# X instances of Y; Y instance of or subclass of Z; Z is Y; Y subclass of film.
{?movie wdt:P31 [wdt:P31|wdt:P279 [wdt:P279 wd:Q11424]].}
UNION
# X instances of Y; Y instance of or subclass of Z; Z is Y; Y subclass of film genre.
{?movie wdt:P31 [wdt:P31|wdt:P279 [wdt:P31 wd:Q201658]].}
# Imagine a inner join
# X genre comedy.
{?movie wdt:P136 wd:Q40831.}
UNION
# X genre G; G subclass of comedy.
{?movie wdt:P136 [wdt:P279 wd:Q40831].}
# X director exists and who has a birthday
{?movie wdt:P57 [wdt:P569 ?dob].}
} GROUP BY ?movie
This results in a loss of `1060` comedy movies whose director has no date of birth.
This leaves `27903` comedy movies with directors and a date of birth.
Next, all films are selected that have at least one director with ambiguous birth years.
As the task is limited to the year of birth only, ambiguous information about the month and
the day will be ignored. The following query considers how many directors exist at all.
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT DISTINCT ?director
WHERE{
# node who instances of films.
{?movie wdt:P31 wd:Q11424.}
UNION
# X instances of Y; Y subclass of films.
{?movie wdt:P31 [wdt:P279 wd:Q11424].}
# X instances of Y; Y instance of film genre.
UNION
{?movie wdt:P31 [wdt:P31 wd:Q201658].}
UNION
# X instances of Y; Y instance of or subclass of Z; Z is Y; Y subclass of film.
{?movie wdt:P31 [wdt:P31|wdt:P279 [wdt:P279 wd:Q11424]].}
UNION
# X instances of Y; Y instance of or subclass of Z; Z is Y; Y subclass of film genre.
{?movie wdt:P31 [wdt:P31|wdt:P279 [wdt:P31 wd:Q201658]].}
# Imagine a inner join
# X genre comedy.
{?movie wdt:P136 wd:Q40831.}
UNION
# X genre G; G subclass of comedy.
{?movie wdt:P136 [wdt:P279 wd:Q40831].}
# X director exists and who has a birthday
{?movie wdt:P57 ?director.
?director wdt:P569 ?dob.}
} GROUP BY ?director
There are a total of `10693` directors for the `27903` comedy films.
Out of these directors, those are selected who have more than one date of birth.
For this purpose it is checked whether the smallest and largest data of the year of
birth correspond with each other. If this is not the case, the information is ambiguous.
Such directors should be ignored.
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT ?director (MIN(YEAR(?dob)) AS ?min_dob) (MAX(YEAR(?dob)) AS ?max_dob)
WHERE{
# node who instances of films.
{?movie wdt:P31 wd:Q11424.}
UNION
# X instances of Y; Y subclass of films.
{?movie wdt:P31 [wdt:P279 wd:Q11424].}
# X instances of Y; Y instance of film genre.
UNION
{?movie wdt:P31 [wdt:P31 wd:Q201658].}
UNION
# X instances of Y; Y instance of or subclass of Z; Z is Y; Y subclass of film.
{?movie wdt:P31 [wdt:P31|wdt:P279 [wdt:P279 wd:Q11424]].}
UNION
# X instances of Y; Y instance of or subclass of Z; Z is Y; Y subclass of film genre.
{?movie wdt:P31 [wdt:P31|wdt:P279 [wdt:P31 wd:Q201658]].}
# Imagine a inner join
# X genre comedy.
{?movie wdt:P136 wd:Q40831.}
UNION
# X genre G; G subclass of comedy.
{?movie wdt:P136 [wdt:P279 wd:Q40831].}
# X director exists and who has a birthday
{?movie wdt:P57 ?director.
?director wdt:P569 ?dob.}
} GROUP BY ?director
HAVING (?min_dob != ?max_dob)
All directors are grouped according to their ID.
Then the individual directors are applied against the smallest and largest year of birth.
This has the advantage that every director who has more than one year of birth has unequal
entries for the minimum and maximum year of birth.
Altogether `51` directors with ambiguous birth year are recognized.
The `YEAR` operator can interpret dates like `20.century` of the director `Troy Miller (wdt:Q3541033)` with `2000`.
Then all directors with a unique year of birth are searched.
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT ?director (MIN(YEAR(?dob)) AS ?min_dob) (MAX(YEAR(?dob)) AS ?max_dob)
## Get started
WHERE{
# node who instances of films.
{?movie wdt:P31 wd:Q11424.}
UNION
# X instances of Y; Y subclass of films.
{?movie wdt:P31 [wdt:P279 wd:Q11424].}
# X instances of Y; Y instance of film genre.
UNION
{?movie wdt:P31 [wdt:P31 wd:Q201658].}
UNION
# X instances of Y; Y instance of or subclass of Z; Z is Y; Y subclass of film.
{?movie wdt:P31 [wdt:P31|wdt:P279 [wdt:P279 wd:Q11424]].}
UNION
# X instances of Y; Y instance of or subclass of Z; Z is Y; Y subclass of film genre.
{?movie wdt:P31 [wdt:P31|wdt:P279 [wdt:P31 wd:Q201658]].}
This project can be started with the following command:
# Imagine a inner join
$ python3.7 app.py
# X genre comedy.
{?movie wdt:P136 wd:Q40831.}
UNION
# X genre G; G subclass of comedy.
{?movie wdt:P136 [wdt:P279 wd:Q40831].}
This starts the service on `http://127.0.0.1:5000/`.
# X director exists and who has a birthday
{?movie wdt:P57 ?director.
?director wdt:P569 ?dob.}
} GROUP BY ?director
HAVING (?min_dob = ?max_dob)
## Swagger Documentation
The whole service was documented with Swagger.
This makes it possible to interactively explore the routes and generate new data.
The start page will look like this:
This determines `10627` directors with a unique year.
All directors born after or in 1970 can then be searched out.
![swagger](./images/swagger.png)
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
Here you will find the interactive elements.
How exactly the data is generated will be explained later.
SELECT ?director ?min_dob
WHERE {
SELECT ?movie ?director (MIN(YEAR(?dob)) AS ?min_dob) (MAX(YEAR(?dob)) AS ?max_dob)
WHERE{
# node who instances of films.
{?movie wdt:P31 wd:Q11424.}
UNION
# X instances of Y; Y subclass of films.
{?movie wdt:P31 [wdt:P279 wd:Q11424].}
# X instances of Y; Y instance of film genre.
UNION
{?movie wdt:P31 [wdt:P31 wd:Q201658].}
UNION
# X instances of Y; Y instance of or subclass of Z; Z is Y; Y subclass of film.
{?movie wdt:P31 [wdt:P31|wdt:P279 [wdt:P279 wd:Q11424]].}
UNION
# X instances of Y; Y instance of or subclass of Z; Z is Y; Y subclass of film genre.
{?movie wdt:P31 [wdt:P31|wdt:P279 [wdt:P31 wd:Q201658]].}
## Submission 1
# Imagine a inner join
# X genre comedy.
{?movie wdt:P136 wd:Q40831.}
UNION
# X genre G; G subclass of comedy.
{?movie wdt:P136 [wdt:P279 wd:Q40831].}
# X director exists and who has a birthday
{?movie wdt:P57 ?director.
?director wdt:P569 ?dob.}
} GROUP BY ?movie ?director
HAVING (?min_dob = ?max_dob)
} GROUP BY ?director ?min_dob
HAVING (?min_dob >= 1970)
This will identify `1950` directors born in or after 1970.
In total, the corresponding films can then be selected by searching and grouping `?movie`.
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT DISTINCT ?movie
WHERE {
SELECT ?movie ?director ?min_dob
WHERE {
SELECT ?movie ?director (MIN(YEAR(?dob)) AS ?min_dob) (MAX(YEAR(?dob)) AS ?max_dob)
WHERE{
# node who instances of films.
{?movie wdt:P31 wd:Q11424.}
UNION
# X instances of Y; Y subclass of films.
{?movie wdt:P31 [wdt:P279 wd:Q11424].}
# X instances of Y; Y instance of film genre.
UNION
{?movie wdt:P31 [wdt:P31 wd:Q201658].}
UNION
# X instances of Y; Y instance of or subclass of Z; Z is Y; Y subclass of film.
{?movie wdt:P31 [wdt:P31|wdt:P279 [wdt:P279 wd:Q11424]].}
UNION
# X instances of Y; Y instance of or subclass of Z; Z is Y; Y subclass of film genre.
{?movie wdt:P31 [wdt:P31|wdt:P279 [wdt:P31 wd:Q201658]].}
# Imagine a inner join
# X genre comedy.
{?movie wdt:P136 wd:Q40831.}
UNION
# X genre G; G subclass of comedy.
{?movie wdt:P136 [wdt:P279 wd:Q40831].}
# X director exists and who has a birthday
{?movie wdt:P57 ?director.
?director wdt:P569 ?dob.}
} GROUP BY ?movie ?director
HAVING (?min_dob = ?max_dob)
} GROUP BY ?movie ?director ?min_dob
HAVING (?min_dob >= 1970)
} GROUP BY ?movie
Thus a total of `2887` Comdey films are recognized which have at least one
author whose year of birth is after or in the year 1970.
An example of the quality of the results can be found in the computer animated comdey film `Brave (wdt:Q126796)`.
This has the three directors: `Steve Purcell (wdt:Q1390504)` with the birth dates `1959` and `1961`
and `Mark Andrews (wdt:Q1408804)` with the birth date `1970` and `Brenda Chapman (wdt:Q429715)`
who was born on `1 November 1962`.
This section shows how to create the data in wikidata.txt and dbpedia.txt.
The exact strategies are given in the presentation.
### DBPedia
To generate the data for DBPedia, the routes must be called in the following order:
The same is done for the comedy movies in DBPedia. First, all comedy movies with a director are searched.
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dbc: <http://dbpedia.org/resource/Category:>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dct: <http://purl.org/dc/terms/>
SELECT ?director
WHERE{
# X type Film
{?movie rdf:type dbo:Film.}
# Imagine a inner join
# X subject Comedy
{?movie dct:subject dbc:Comedy}
UNION
# X subject Y; Y like "Comedy".
{?movie dct:subject ?y FILTER CONTAINS(lcase(str(?y)), "comedy").}
UNION
# X genre Y; Y like "Comedy".
{?movie dbo:genre ?y FILTER CONTAINS (lcase(str(?y)), "comedy").}
{?movie dbo:director ?director}
}GROUP BY ?director
In total `8565` many comedy movies with one director are found.
Then again all valid directors with a valid year of birth before or in 1970 will be elected.
The following query will be edited it can be seen in the diagram.
![formula dbpedia](images/Structure_DBpedia_year.jpg)
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dbc: <http://dbpedia.org/resource/Category:>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
SELECT DISTINCT ?movie
WHERE {
SELECT DISTINCT ?movie ?min_year ?max_year
WHERE {
SELECT DISTINCT ?movie ?director (MIN(?year) AS ?min_year) (MAX(?year) AS ?max_year)
WHERE{
# X type Film
{?movie rdf:type dbo:Film.}
# Imagine a inner join
# X subject Comedy
{?movie dct:subject dbc:Comedy}
UNION
# X subject Y; Y like "Comedy".
{?movie dct:subject ?y FILTER CONTAINS(lcase(str(?y)), "comedy").}
UNION
# X genre Y; Y like "Comedy".
{?movie dbo:genre ?y FILTER CONTAINS (lcase(str(?y)), "comedy").}
# X director Person; Person birth date year
{?movie dbo:director ?director.
?director dbo:birthDate ?dob.
BIND(xsd:integer(substr(xsd:string(?dob), 0, 4)) as ?year) .
}
} GROUP BY ?movie ?director
}GROUP BY ?movie HAVING (?min_year = ?max_year)
}GROUP BY ?movie HAVING (?min_year >= 1970)
As a result, `1433` comedy films with a director born after or in 1970 were found.
## Task 3
In this step the additional information is queried by the `OPTIONAL`-operator and the basic queries of the previous step.
1. `/dbpedia/groundtruth` -> This will create a list of all result dicts in `/static/dbpedia/dbpedia_groundtruth.txt`
2. `/dbpedia/n3` -> This will create all Triples in `static/dbpedia/dbpedia.txt`
### Wikidata
To generate the data for Wikidata, the routes must be called in the following order:
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX schema: <http://schema.org/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT
DISTINCT (?movie AS ?Movie)
(GROUP_CONCAT(DISTINCT ?title;separator="|") AS ?Title)
(GROUP_CONCAT(DISTINCT ?director;separator="|") AS ?Director)
(GROUP_CONCAT(DISTINCT ?author;separator="|") AS ?Author)
(GROUP_CONCAT(DISTINCT ?cast_member;separator="|") AS ?Cast)
(GROUP_CONCAT(DISTINCT ?publish_date;separator="|") AS ?Published)
(GROUP_CONCAT(DISTINCT ?genre;separator="|") AS ?Genre)
(GROUP_CONCAT(DISTINCT ?duration;separator="|") AS ?Duration)
(GROUP_CONCAT(DISTINCT ?description;separator="|") AS ?Description)
(GROUP_CONCAT(DISTINCT ?production_company;separator="|") AS ?ProductionCompany)
WHERE {
{
SELECT DISTINCT ?movie
WHERE {
SELECT ?movie ?director ?min_dob ?max_dob
WHERE{
SELECT ?movie ?director (MIN(YEAR(?dob)) AS ?min_dob) (MAX(YEAR(?dob)) AS ?max_dob)
WHERE{
# node who instances of films.
{?movie wdt:P31 wd:Q11424.}
UNION
# X instances of Y; Y subclass of films.
{?movie wdt:P31 [wdt:P279 wd:Q11424].}
# X instances of Y; Y instance of film genre.
UNION
{?movie wdt:P31 [wdt:P31 wd:Q201658].}
UNION
# X instances of Y; Y instance of or subclass of Z; Z is Y; Y subclass of film.
{?movie wdt:P31 [wdt:P31|wdt:P279 [wdt:P279 wd:Q11424]].}
UNION
# X instances of Y; Y instance of or subclass of Z; Z is Y; Y subclass of film genre.
{?movie wdt:P31 [wdt:P31|wdt:P279 [wdt:P31 wd:Q201658]].}
# Imagine a inner join
1. `/wikidata/groundtruth/advanced` -> This will create a list of all result dicts in `/static/wikidata/wikidata_groundtruth.txt`
* Careful, that can take 2 to 3 minutes.
* Alternatively you can use `/wikidata/groundtruth`. However, the data is slightly smaller.
2. `/wikidata/n3` -> This will create all Triples in `static/wikidata/wikidata.txt`
# X genre comedy.
{?movie wdt:P136 wd:Q40831.}
UNION
# X genre G; G subclass of comedy.
{?movie wdt:P136 [wdt:P279 wd:Q40831].}
# X director exists and who has a birthday
{?movie wdt:P57 ?director.
?director wdt:P569 ?dob.}
} GROUP BY ?movie ?director
HAVING (?min_dob = ?max_dob)
} GROUP BY ?movie ?director ?min_dob ?max_dob
HAVING (?min_dob >= 1970)
}GROUP BY ?movie
}
#OPTIONAL {?movie wdt:P1476 ?title.} # titel oder label ?
OPTIONAL {?movie rdfs:label ?title FILTER (LANG(?title) = "en")} # titel oder label ?
OPTIONAL {?movie wdt:P57 ?director.}
OPTIONAL {?movie wdt:P58 ?author.}
OPTIONAL {?movie wdt:P161 ?cast_member.}
OPTIONAL {?movie wdt:P577 ?publish_date.}
OPTIONAL {?movie wdt:P136 ?genre.}
OPTIONAL {?movie wdt:P2047 ?duration.}
OPTIONAL {?movie schema:description ?description FILTER ( lang(?description) = "en" ).} # may take a while
OPTIONAL {?movie wdt:P272 ?production_company}
} GROUP BY ?movie
### DBPedia
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dbc: <http://dbpedia.org/resource/Category:>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX dbp: <http://dbpedia.org/property/>
SELECT DISTINCT (?movie AS ?Movie)
(GROUP_CONCAT(DISTINCT ?name;separator="|") AS ?Title)
(GROUP_CONCAT(DISTINCT ?director;separator="|") AS ?Director)
(GROUP_CONCAT(DISTINCT ?author;separator="|") AS ?Author)
(GROUP_CONCAT(DISTINCT ?cast_member;separator="|") AS ?Cast)
(GROUP_CONCAT(DISTINCT ?releaseDate;separator="|") AS ?Published)
(GROUP_CONCAT(DISTINCT ?subject;separator="|") AS ?Subject)
(GROUP_CONCAT(DISTINCT ?genre;separator="|") AS ?Genre)
(GROUP_CONCAT(DISTINCT ?duration/60;separator="|") AS ?Duration) # todo: change to original value
(GROUP_CONCAT(DISTINCT ?abstract;separator="|") AS ?Description)
(GROUP_CONCAT(DISTINCT ?distributor;separator="|") AS ?Distributor)
(GROUP_CONCAT(DISTINCT ?productionCompanies;separator="|") AS ?ProductionCompanies)
(GROUP_CONCAT(DISTINCT ?sameDirector;separator="|") AS ?DirectorLink)
(GROUP_CONCAT(DISTINCT ?sameMovie;separator="|") AS ?MovieLink)
WHERE {
{
SELECT DISTINCT ?movie
WHERE {
SELECT DISTINCT ?movie ?min_year ?max_year
WHERE {
SELECT DISTINCT ?movie ?director (MIN(?year) AS ?min_year) (MAX(?year) AS ?max_year)
WHERE{
# X type Film
{?movie rdf:type dbo:Film.}
# Imagine a inner join
# X subject Comedy
{?movie dct:subject dbc:Comedy}
UNION
# X subject Y; Y like "Comedy".
{?movie dct:subject ?y FILTER CONTAINS(lcase(str(?y)), "comedy").}
UNION
# X genre Y; Y like "Comedy".
{?movie dbo:genre ?y FILTER CONTAINS (lcase(str(?y)), "comedy").}
# X director Person; Person birth date year
{?movie dbo:director ?director.
?director dbo:birthDate ?dob.
BIND(xsd:integer(substr(xsd:string(?dob), 0, 4)) as ?year) .
}
} GROUP BY ?movie ?director
}GROUP BY ?movie HAVING (?min_year = ?max_year)
}GROUP BY ?movie HAVING (?min_year >= 1970)
}
OPTIONAL{?movie foaf:name ?name FILTER(LANG(?name)="en").}
OPTIONAL{?movie owl:sameAs ?sameMovie FILTER CONTAINS (LCASE(STR(?sameMovie)), 'wikidata.org').}
OPTIONAL{
?movie dbo:director ?director.
OPTIONAL{
?director owl:sameAs ?sameDirector FILTER CONTAINS (LCASE(STR(?sameDirector)), 'wikidata.org').
}
}
OPTIONAL{?movie dbo:author ?author.}
OPTIONAL{?movie dbo:starring ?cast_member.}
OPTIONAL{?movie dbo:releaseDate ?releaseDate.}
OPTIONAL{?movie dct:subject ?subject FILTER CONTAINS(lcase(str(?subject)), "comedy").}
OPTIONAL{?movie dbo:genre ?genre FILTER CONTAINS(lcase(str(?genre)), "comedy").}
OPTIONAL{?movie dbo:runtime ?duration.}
OPTIONAL{?movie dbo:abstract ?abstract FILTER(LANG(?abstract)="en").}
OPTIONAL{?movie dbo:distributor ?distributor.}
OPTIONAL{?movie dbp:productionCompanies ?productionCompanies FILTER CONTAINS (lcase(str(?productionCompanies)), "http").}
}GROUP BY ?movie
## Task 4
I decided to map the data from DBPedia to Wikidata.
I made this decision because I get some DBPedia properties from several sources.
## Submission 2
This submission specifies how the vocabulary is put together to create a common knowledge base.
The table can be taken from `static/mapping/mapping.csv`.
The mapping will look like this:
Variable |Wikidata |DBPedia
......@@ -829,7 +67,22 @@ Genre |<http://www.wikidata.org/prop/direct/P136> |<http://purl.org
Genre |<http://www.wikidata.org/prop/direct/P136> |<http://dbpedia.org/ontology/genre>
Duration |<http://www.wikidata.org/prop/direct/P2047> |<http://dbpedia.org/ontology/runtime>
Description |<http://schema.org/description> |<http://dbpedia.org/ontology/abstract>
Production company|<http://dbpedia.org/ontology/P272> |<http://dbpedia.org/ontology/distributor>
Production company|<http://dbpedia.org/ontology/P272> |<http://dbpedia.org/property/productionCompanies>
Production company|<http://www.wikidata.org/prop/direct/P272> |<http://dbpedia.org/ontology/distributor>
Production company|<http://www.wikidata.org/prop/direct/P272> |<http://dbpedia.org/property/productionCompanies>
As you can see there are double mappings for the `Genre` properties and `Production Company`.
The `*/groundtruth/info` routes can be used to decide whether one of the properties can be removed.
The one with the weaker properties should always be removed.
## Submission 3
In this part the matches of the records are searched.
This is done in 2 levels. The first is that `owl:sameAs` is used directly.
This allows to recognize linked data from DBPedia to Wikidata.
On the second level the `Titles` and `Directors` of the movies are compared.
Since the directors cannot be resolved directly, `owl:sameAs` is resolved and looked at again.
Thus the `URIs` of the directors in DBPedia can be resolved to Wikidata.
The overlaps can then be created under `/coreferences` and viewed under `/static/coreferences/coreferences.txt`.
## Presentation
It is important to note that the Genre and Production Company properties in DBPedia were created from multiple sources. For the Production Company it is important to note that the distributor is the same as a Production Company.
\ No newline at end of file
The remaining points will be discussed during the presentation.
\ No newline at end of file
......@@ -36,7 +36,7 @@
window.onload = function () {
// Begin Swagger UI call region
const ui = SwaggerUIBundle({
url: "../static/swagger/gui.yaml",
url: "../static/swagger/gui_elements.yaml",
dom_id: '#swagger-ui',
deepLinking: true,
presets: [
......
images/swagger.png

290 KiB

......@@ -8,5 +8,5 @@ Genre,<http://www.wikidata.org/prop/direct/P136>,<http://purl.org/dc/terms/subje
Genre,<http://www.wikidata.org/prop/direct/P136>,<http://dbpedia.org/ontology/genre>
Duration,<http://www.wikidata.org/prop/direct/P2047>,<http://dbpedia.org/ontology/runtime>
Description,<http://schema.org/description>,<http://dbpedia.org/ontology/abstract>
Production company,<http://dbpedia.org/ontology/P272>,<http://dbpedia.org/ontology/distributor>
Production company,<http://dbpedia.org/ontology/P272>,<http://dbpedia.org/property/productionCompanies>
\ No newline at end of file
Production company,<http://www.wikidata.org/prop/direct/P272>,<http://dbpedia.org/ontology/distributor>
Production company,<http://www.wikidata.org/prop/direct/P272>,<http://dbpedia.org/property/productionCompanies>
\ No newline at end of file
......@@ -9,7 +9,7 @@ info:
paths:
/dbpedia/groundtruth:
get:
summary: Assmble the DBPedia groundtruth with english labels.
summary: Assemble the DBPedia groundtruth with english labels.
description: >-
This function assembles the groundtruth of DBPedia.
It searches for all Comedy Films which have at least one director who is born after or on 1970.
......@@ -84,7 +84,7 @@ paths:
/wikidata/groundtruth:
get:
summary: Assmble the Wikidata groundtruth with english labels in one step.
summary: Assemble the Wikidata groundtruth with english labels in one step.
description: >-
This function assembles the groundtruth of Wikidata.
It searches for all Comedy Films which have at least one director who is born after or on 1970.
......@@ -107,7 +107,7 @@ paths:
/wikidata/groundtruth/advanced:
get:
summary: Assmble the Wikidata groundtruth with english labels in advanced. Take a long time 2-3 minutes.
summary: Assemble the Wikidata groundtruth with english labels in advanced. Take a long time 2-3 minutes.
description: >-
This function assembles the groundtruth of Wikidata.
It searches for all Comedy Films which have at least one director who is born after or on 1970.
......
0% Loading or .
You are about to add 0 people to the discussion. Proceed with caution.
Please register or to comment