Data and Knowledge Engineering HP
Task 1)
In this section I am going to shown some statistics of data along with the development of the queries.
Wikidata
For the data in Wikidata I managed to work out a structure of the data.
First all Enities are searched which are a film or a subform of a film.
The exact structure can be created from any of the green paths P, if P ends at a movie or a movie genre.
Especially important is that a movie genre is a top category for movies.
Therefore, these can be used to obtain more film entities that are not directly marked as film.
For example, Mulan (wd:Q537407)
is a movie that is marked as animated feature film (wd:Q29168811)
.
There are gradations of the terms film e.g. animated feature film (wd:Q29168811)
separates into feature film (wd:Q24869)
and animated film (wd:Q202866)
. These sub-concepts in turn result in the term film (wd:Q11424)
.
At first all entities X are searched which were marked directly as film.
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT DISTINCT ?movie
WHERE{
# X instances of film
{?movie wdt:P31 wd:Q11424.}
}
GROUP BY ?movie
This led to a total of 229738
unique entities.
These collected results are extended with the entities marked with a subclass of film.`
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT DISTINCT ?movie
WHERE{
# node who instances of films.
{?movie wdt:P31 wd:Q11424.}
UNION
# X instances of Y; Y subclass of films.
{?movie wdt:P31 [wdt:P279 wd:Q11424].}
}
GROUP BY ?movie
This brings together a total of 260305
unique entities. This way a gain of 30567
entities is acquired.
Then all entities that have been marked with a subclass of a film genre will be added.
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT DISTINCT ?movie
WHERE{
# node who instances of films.
{?movie wdt:P31 wd:Q11424.}
UNION
# X instances of Y; Y subclass of films.
{?movie wdt:P31 [wdt:P279 wd:Q11424].}
UNION
# X instances of Y; Y instance of film genre.
{?movie wdt:P31 [wdt:P31 wd:Q201658].}
}
GROUP BY ?movie
This adds 261471
unique entities. Therefore 1166
new entities have been added.
Then all entities X marked with a term derived from a subclass of film can be added.
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT DISTINCT ?movie
WHERE{
# node who instances of films.
{?movie wdt:P31 wd:Q11424.}
UNION
# X instances of Y; Y subclass of films.
{?movie wdt:P31 [wdt:P279 wd:Q11424].}
UNION
# X instances of Y; Y instance of film genre.
{?movie wdt:P31 [wdt:P31 wd:Q201658].}
UNION
# X instances of Y; Y instance of or subclass of Z; Z is Y; Y subclass of film.
{?movie wdt:P31 [wdt:P31|wdt:P279 [wdt:P279 wd:Q11424]].}
}
GROUP BY ?movie
This again results in a gain of 19816
independent entities.
Thus up to this point a total of 281287
independent entities have been found.
Then all entities X marked with a term derived from a instance of a film genre can be added.
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT DISTINCT ?movie
WHERE{
# node who instances of films.
{?movie wdt:P31 wd:Q11424.}
UNION
# X instances of Y; Y subclass of films.
{?movie wdt:P31 [wdt:P279 wd:Q11424].}
# X instances of Y; Y instance of film genre.
UNION
{?movie wdt:P31 [wdt:P31 wd:Q201658].}
UNION
# X instances of Y; Y instance of or subclass of Z; Z is Y; Y subclass of film.
{?movie wdt:P31 [wdt:P31|wdt:P279 [wdt:P279 wd:Q11424]].}
UNION
# X instances of Y; Y instance of or subclass of Z; Z is Y; Y subclass of film genre.
{?movie wdt:P31 [wdt:P31|wdt:P279 [wdt:P31 wd:Q201658]].}
}
GROUP BY ?movie
This again results in a gain of 8723
independent entities.
Thus wikidata contains about 290010
many independent entities of a film.
There had been a project
In August 2018 a similar project was carried out on wikidata.
The researchers came to the conclusion that a total of 247075
independent film entities are contained in wikidata.
My results beat those of the researchers by 17%
.
This result is possible because more than 9 months have passed.
In addition, the result underlines the quality of the data I found.
From the set of all movies you can search for those that are directly marked with the category Comedy. This part is indicated by the red lines.
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT DISTINCT ?movie
WHERE{
# node who instances of films.
{?movie wdt:P31 wd:Q11424.}
UNION
# X instances of Y; Y subclass of films.
{?movie wdt:P31 [wdt:P279 wd:Q11424].}
# X instances of Y; Y instance of film genre.
UNION
{?movie wdt:P31 [wdt:P31 wd:Q201658].}
UNION
# X instances of Y; Y instance of or subclass of Z; Z is Y; Y subclass of film.
{?movie wdt:P31 [wdt:P31|wdt:P279 [wdt:P279 wd:Q11424]].}
UNION
# X instances of Y; Y instance of or subclass of Z; Z is Y; Y subclass of film genre.
{?movie wdt:P31 [wdt:P31|wdt:P279 [wdt:P31 wd:Q201658]].}
# Imagine a inner join
# X genre comedy.
{?movie wdt:P136 wd:Q40831.}
}
GROUP BY ?movie
There are only 208
movies marked directly with comedy.
This can be improved by adding movies whose genre is a subtype of comedy.
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT DISTINCT ?movie
WHERE{
# node who instances of films.
{?movie wdt:P31 wd:Q11424.}
UNION
# X instances of Y; Y subclass of films.
{?movie wdt:P31 [wdt:P279 wd:Q11424].}
# X instances of Y; Y instance of film genre.
UNION
{?movie wdt:P31 [wdt:P31 wd:Q201658].}
UNION
# X instances of Y; Y instance of or subclass of Z; Z is Y; Y subclass of film.
{?movie wdt:P31 [wdt:P31|wdt:P279 [wdt:P279 wd:Q11424]].}
UNION
# X instances of Y; Y instance of or subclass of Z; Z is Y; Y subclass of film genre.
{?movie wdt:P31 [wdt:P31|wdt:P279 [wdt:P31 wd:Q201658]].}
# Imagine a inner join
# X genre comedy.
{?movie wdt:P136 wd:Q40831.}
UNION
# X genre G; G subclass of comedy.
{?movie wdt:P136 [wdt:P279 wd:Q40831].}
}
GROUP BY ?movie
By considering an intermediate stage for genres, a total of 33393
new
comedy films are added to those already identified.
Thus a total of 33601
comedy movies are included in Wikidata.
This results in a percentage of about 0.1158%
of all identified movies.
The query was structured as follows:
DBPedia
In this section all comedy movies are extracted from DBPedia. The following scheme is used.
This scheme is much smaller than that of Wikidata, as no general concept or consistency can be identified. However, the films can be requested directly as such, as they are clearly described in the ontology of DBPedia. First, all entities of the type film are determined.
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dbc: <http://dbpedia.org/resource/Category:>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dct: <http://purl.org/dc/terms/>
SELECT DISTINCT ?movie
WHERE{
# X type Comedy
{?movie rdf:type dbo:Film.}
}GROUP BY ?movie
This determines a total of 10000
independent entities.
This is a problem because the public endpoint limits the maximum result size to 10000 results.
For details see here.
There are comparable works that come on 87000
film entities.
See here for more information.
Next, out of the 10000
, all those entities are extracted that were directly titled with comedy.
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dbc: <http://dbpedia.org/resource/Category:>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dct: <http://purl.org/dc/terms/>
SELECT DISTINCT ?movie
WHERE{
# X type Film
{?movie rdf:type dbo:Film.}
# Imagine a inner join
# X subject Comedy
{?movie dct:subject dbc:Comedy}
}GROUP BY ?movie
It detects 4
entities that are directly marked with Comedy.
For example one result is: dbpedia.org/resource/The_Clowns_(film)
.
Then all films whose subject is derived from comedy will be determined.
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dbc: <http://dbpedia.org/resource/Category:>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dct: <http://purl.org/dc/terms/>
SELECT DISTINCT ?movie
WHERE{
# X type Film
{?movie rdf:type dbo:Film.}
# Imagine a inner join
# X subject Comedy
{?movie dct:subject dbc:Comedy}
UNION
# X subject Y; Y like "Comedy".
{?movie dct:subject ?y FILTER CONTAINS(lcase(str(?y)), "comedy").}
UNION
# X genre Y; Y like "Comedy".
{?movie dbo:genre ?y FILTER CONTAINS (lcase(str(?y)), "comedy").}
}GROUP BY ?movie
This results in 10000
comedy movies (oh wonder !).
The query was structured as follows:
Task 2
Next, all films from Wikidata and DBPedia are collected whose directors bore before or in 1970.
Wikidata
The search scheme of Wikidata will be extended by a director.
This is indicated by the blue arrows.
The first thing to be checked is how many comedy movies ever have a director.
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT DISTINCT ?movie
WHERE{
# node who instances of films.
{?movie wdt:P31 wd:Q11424.}
UNION
# X instances of Y; Y subclass of films.
{?movie wdt:P31 [wdt:P279 wd:Q11424].}
# X instances of Y; Y instance of film genre.
UNION
{?movie wdt:P31 [wdt:P31 wd:Q201658].}
UNION
# X instances of Y; Y instance of or subclass of Z; Z is Y; Y subclass of film.
{?movie wdt:P31 [wdt:P31|wdt:P279 [wdt:P279 wd:Q11424]].}
UNION
# X instances of Y; Y instance of or subclass of Z; Z is Y; Y subclass of film genre.
{?movie wdt:P31 [wdt:P31|wdt:P279 [wdt:P31 wd:Q201658]].}
# Imagine a inner join
# X genre comedy.
{?movie wdt:P136 wd:Q40831.}
UNION
# X genre G; G subclass of comedy.
{?movie wdt:P136 [wdt:P279 wd:Q40831].}
# X director exists
{?movie wdt:P57 []}
}GROUP BY ?movie
There are a total of 28964
comedy films that also have a director.
This leads to a loss of a total of 4637
comedy films that do not have a director.
Then it is examined how many of the comedy films with director also have a director with a date of birth.
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT DISTINCT ?movie #?director ?dob
WHERE{
# node who instances of films.
{?movie wdt:P31 wd:Q11424.}
UNION
# X instances of Y; Y subclass of films.
{?movie wdt:P31 [wdt:P279 wd:Q11424].}
# X instances of Y; Y instance of film genre.
UNION
{?movie wdt:P31 [wdt:P31 wd:Q201658].}
UNION
# X instances of Y; Y instance of or subclass of Z; Z is Y; Y subclass of film.
{?movie wdt:P31 [wdt:P31|wdt:P279 [wdt:P279 wd:Q11424]].}
UNION
# X instances of Y; Y instance of or subclass of Z; Z is Y; Y subclass of film genre.
{?movie wdt:P31 [wdt:P31|wdt:P279 [wdt:P31 wd:Q201658]].}
# Imagine a inner join
# X genre comedy.
{?movie wdt:P136 wd:Q40831.}
UNION
# X genre G; G subclass of comedy.
{?movie wdt:P136 [wdt:P279 wd:Q40831].}
# X director exists and who has a birthday
{?movie wdt:P57 [wdt:P569 ?dob].}
} GROUP BY ?movie
This results in a loss of 1060
comedy movies whose director has no date of birth.
This leaves 27903
comedy movies with directors and a date of birth.
Next, all films are selected that have at least one director with ambiguous birth years. As the task is limited to the year of birth only, ambiguous information about the month and the day will be ignored. The following query considers how many directors exist at all.
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT DISTINCT ?director
WHERE{
# node who instances of films.
{?movie wdt:P31 wd:Q11424.}
UNION
# X instances of Y; Y subclass of films.
{?movie wdt:P31 [wdt:P279 wd:Q11424].}
# X instances of Y; Y instance of film genre.
UNION
{?movie wdt:P31 [wdt:P31 wd:Q201658].}
UNION
# X instances of Y; Y instance of or subclass of Z; Z is Y; Y subclass of film.
{?movie wdt:P31 [wdt:P31|wdt:P279 [wdt:P279 wd:Q11424]].}
UNION
# X instances of Y; Y instance of or subclass of Z; Z is Y; Y subclass of film genre.
{?movie wdt:P31 [wdt:P31|wdt:P279 [wdt:P31 wd:Q201658]].}
# Imagine a inner join
# X genre comedy.
{?movie wdt:P136 wd:Q40831.}
UNION
# X genre G; G subclass of comedy.
{?movie wdt:P136 [wdt:P279 wd:Q40831].}
# X director exists and who has a birthday
{?movie wdt:P57 ?director.
?director wdt:P569 ?dob.}
} GROUP BY ?director
There are a total of 10693
directors for the 27903
comedy films.
Out of these directors, those are selected who have more than one date of birth.
For this purpose it is checked whether the smallest and largest data of the year of
birth correspond with each other. If this is not the case, the information is ambiguous.
Such directors should be ignored.
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT ?director (MIN(YEAR(?dob)) AS ?min_dob) (MAX(YEAR(?dob)) AS ?max_dob)
WHERE{
# node who instances of films.
{?movie wdt:P31 wd:Q11424.}
UNION
# X instances of Y; Y subclass of films.
{?movie wdt:P31 [wdt:P279 wd:Q11424].}
# X instances of Y; Y instance of film genre.
UNION
{?movie wdt:P31 [wdt:P31 wd:Q201658].}
UNION
# X instances of Y; Y instance of or subclass of Z; Z is Y; Y subclass of film.
{?movie wdt:P31 [wdt:P31|wdt:P279 [wdt:P279 wd:Q11424]].}
UNION
# X instances of Y; Y instance of or subclass of Z; Z is Y; Y subclass of film genre.
{?movie wdt:P31 [wdt:P31|wdt:P279 [wdt:P31 wd:Q201658]].}
# Imagine a inner join
# X genre comedy.
{?movie wdt:P136 wd:Q40831.}
UNION
# X genre G; G subclass of comedy.
{?movie wdt:P136 [wdt:P279 wd:Q40831].}
# X director exists and who has a birthday
{?movie wdt:P57 ?director.
?director wdt:P569 ?dob.}
} GROUP BY ?director
HAVING (?min_dob != ?max_dob)
All directors are grouped according to their ID.
Then the individual directors are applied against the smallest and largest year of birth.
This has the advantage that every director who has more than one year of birth has unequal
entries for the minimum and maximum year of birth.
Altogether 51
directors with ambiguous birth year are recognized.
The YEAR
operator can interpret dates like 20.century
of the director Troy Miller (wdt:Q3541033)
with 2000
.
Then all directors with a unique year of birth are searched.
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT ?director (MIN(YEAR(?dob)) AS ?min_dob) (MAX(YEAR(?dob)) AS ?max_dob)
WHERE{
# node who instances of films.
{?movie wdt:P31 wd:Q11424.}
UNION
# X instances of Y; Y subclass of films.
{?movie wdt:P31 [wdt:P279 wd:Q11424].}
# X instances of Y; Y instance of film genre.
UNION
{?movie wdt:P31 [wdt:P31 wd:Q201658].}
UNION
# X instances of Y; Y instance of or subclass of Z; Z is Y; Y subclass of film.
{?movie wdt:P31 [wdt:P31|wdt:P279 [wdt:P279 wd:Q11424]].}
UNION
# X instances of Y; Y instance of or subclass of Z; Z is Y; Y subclass of film genre.
{?movie wdt:P31 [wdt:P31|wdt:P279 [wdt:P31 wd:Q201658]].}
# Imagine a inner join
# X genre comedy.
{?movie wdt:P136 wd:Q40831.}
UNION
# X genre G; G subclass of comedy.
{?movie wdt:P136 [wdt:P279 wd:Q40831].}
# X director exists and who has a birthday
{?movie wdt:P57 ?director.
?director wdt:P569 ?dob.}
} GROUP BY ?director
HAVING (?min_dob = ?max_dob)
This determines 10627
directors with a unique year.
All directors born after or in 1970 can then be searched out.
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT ?director ?min_dob
WHERE {
SELECT ?movie ?director (MIN(YEAR(?dob)) AS ?min_dob) (MAX(YEAR(?dob)) AS ?max_dob)
WHERE{
# node who instances of films.
{?movie wdt:P31 wd:Q11424.}
UNION
# X instances of Y; Y subclass of films.
{?movie wdt:P31 [wdt:P279 wd:Q11424].}
# X instances of Y; Y instance of film genre.
UNION
{?movie wdt:P31 [wdt:P31 wd:Q201658].}
UNION
# X instances of Y; Y instance of or subclass of Z; Z is Y; Y subclass of film.
{?movie wdt:P31 [wdt:P31|wdt:P279 [wdt:P279 wd:Q11424]].}
UNION
# X instances of Y; Y instance of or subclass of Z; Z is Y; Y subclass of film genre.
{?movie wdt:P31 [wdt:P31|wdt:P279 [wdt:P31 wd:Q201658]].}
# Imagine a inner join
# X genre comedy.
{?movie wdt:P136 wd:Q40831.}
UNION
# X genre G; G subclass of comedy.
{?movie wdt:P136 [wdt:P279 wd:Q40831].}
# X director exists and who has a birthday
{?movie wdt:P57 ?director.
?director wdt:P569 ?dob.}
} GROUP BY ?movie ?director
HAVING (?min_dob = ?max_dob)
} GROUP BY ?director ?min_dob
HAVING (?min_dob >= 1970)
This will identify 1950
directors born in or after 1970.
In total, the corresponding films can then be selected by searching and grouping ?movie
.
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT DISTINCT ?movie
WHERE {
SELECT ?movie ?director ?min_dob
WHERE {
SELECT ?movie ?director (MIN(YEAR(?dob)) AS ?min_dob) (MAX(YEAR(?dob)) AS ?max_dob)
WHERE{
# node who instances of films.
{?movie wdt:P31 wd:Q11424.}
UNION
# X instances of Y; Y subclass of films.
{?movie wdt:P31 [wdt:P279 wd:Q11424].}
# X instances of Y; Y instance of film genre.
UNION
{?movie wdt:P31 [wdt:P31 wd:Q201658].}
UNION
# X instances of Y; Y instance of or subclass of Z; Z is Y; Y subclass of film.
{?movie wdt:P31 [wdt:P31|wdt:P279 [wdt:P279 wd:Q11424]].}
UNION
# X instances of Y; Y instance of or subclass of Z; Z is Y; Y subclass of film genre.
{?movie wdt:P31 [wdt:P31|wdt:P279 [wdt:P31 wd:Q201658]].}
# Imagine a inner join
# X genre comedy.
{?movie wdt:P136 wd:Q40831.}
UNION
# X genre G; G subclass of comedy.
{?movie wdt:P136 [wdt:P279 wd:Q40831].}
# X director exists and who has a birthday
{?movie wdt:P57 ?director.
?director wdt:P569 ?dob.}
} GROUP BY ?movie ?director
HAVING (?min_dob = ?max_dob)
} GROUP BY ?movie ?director ?min_dob
HAVING (?min_dob >= 1970)
} GROUP BY ?movie
Thus a total of 2887
Comdey films are recognized which have at least one
author whose year of birth is after or in the year 1970.
An example of the quality of the results can be found in the computer animated comdey film Brave (wdt:Q126796)
.
This has the three directors: Steve Purcell (wdt:Q1390504)
with the birth dates 1959
and 1961
and Mark Andrews (wdt:Q1408804)
with the birth date 1970
and Brenda Chapman (wdt:Q429715)
who was born on 1 November 1962
.
DBPedia
The same is done for the comedy movies in DBPedia. First, all comedy movies with a director are searched.
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dbc: <http://dbpedia.org/resource/Category:>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dct: <http://purl.org/dc/terms/>
SELECT ?director
WHERE{
# X type Film
{?movie rdf:type dbo:Film.}
# Imagine a inner join
# X subject Comedy
{?movie dct:subject dbc:Comedy}
UNION
# X subject Y; Y like "Comedy".
{?movie dct:subject ?y FILTER CONTAINS(lcase(str(?y)), "comedy").}
UNION
# X genre Y; Y like "Comedy".
{?movie dbo:genre ?y FILTER CONTAINS (lcase(str(?y)), "comedy").}
{?movie dbo:director ?director}
}GROUP BY ?director
In total 8565
many comedy movies with one director are found.
Then again all valid directors with a valid year of birth before or in 1970 will be elected.
The following query will be edited it can be seen in the diagram.
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dbc: <http://dbpedia.org/resource/Category:>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
SELECT DISTINCT ?movie
WHERE {
SELECT DISTINCT ?movie ?min_year ?max_year
WHERE {
SELECT DISTINCT ?movie ?director (MIN(?year) AS ?min_year) (MAX(?year) AS ?max_year)
WHERE{
# X type Film
{?movie rdf:type dbo:Film.}
# Imagine a inner join
# X subject Comedy
{?movie dct:subject dbc:Comedy}
UNION
# X subject Y; Y like "Comedy".
{?movie dct:subject ?y FILTER CONTAINS(lcase(str(?y)), "comedy").}
UNION
# X genre Y; Y like "Comedy".
{?movie dbo:genre ?y FILTER CONTAINS (lcase(str(?y)), "comedy").}
# X director Person; Person birth date year
{?movie dbo:director ?director.
?director dbo:birthDate ?dob.
BIND(xsd:integer(substr(xsd:string(?dob), 0, 4)) as ?year) .
}
} GROUP BY ?movie ?director
}GROUP BY ?movie HAVING (?min_year = ?max_year)
}GROUP BY ?movie HAVING (?min_year >= 1970)
As a result, 1433
comedy films with a director born after or in 1970 were found.