GSoC_2016_Progress_Simone
This project aims to build a module for the extraction framework capable of extracting useful RDF-formatted data from the tables of wiki pages. These structures are quite peculiar: they convey data in a semi-structured way. The first approach is to take a domain with interesting data and build a Python script to retrieve it.
Mentors:
- Marco Fossati
- Claudia Diamantini
- Domenico Potena
- Emanuele Storti
06/12/2016 - This week I have worked on the Analyzer module, on the statistics script and on the Selector. The statistics.py script has received a major update: it now keeps calling the JSONpedia service until it gets back a useful response. Even though this approach can considerably lengthen the execution time, it is important to have clear results.
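Just to give the idea, here is a minimal sketch of the retry logic; the endpoint pattern and the "message" error convention are assumptions for illustration, not the script's actual code:

```python
import time

import requests

# Hypothetical endpoint pattern for the public JSONpedia service.
JSONPEDIA = "http://jsonpedia.org/annotate/resource/json/{lang}:{title}"

def call_jsonpedia(lang, title, max_attempts=10, delay=2):
    """Keep calling JSONpedia until a usable JSON answer comes back."""
    for _ in range(max_attempts):
        try:
            answer = requests.get(JSONPEDIA.format(lang=lang, title=title),
                                  timeout=30).json()
            # Treat a top-level 'message' field as a service-side error
            # (an assumption of this sketch, not a documented contract).
            if "message" not in answer:
                return answer
        except (requests.RequestException, ValueError):
            pass  # network hiccup or truncated JSON: retry
        time.sleep(delay)  # wait a bit before the next attempt
    raise RuntimeError("no usable JSONpedia response for %s:%s" % (lang, title))
```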
06/05/2016 - Selector module completed. It takes 2 parameters:
- wiki_chapter, e.g. en/it/fr and so on. Default: "en"
- tag/where_clause to identify a collection of resources. Default: "all", which stands for all wiki pages. Once the parameters have been tested, the selector collects a list of resources and writes them into a file (.txt); a sketch of this flow follows below. I think it is useful to keep a trace of the resources found by the selector, and it also makes it possible to test the modules that come after this one (e.g. the Analyzer).
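Here is a minimal sketch of the Selector's flow, assuming the per-chapter DBpedia SPARQL endpoints and the SPARQLWrapper library; the query shape and file handling are illustrative, not the module's exact code:

```python
from SPARQLWrapper import SPARQLWrapper, JSON

def select_resources(chapter="en",
                     where_clause="?s a <http://dbpedia.org/ontology/SoccerPlayer>",
                     out_file="resources.txt"):
    """Collect resource URIs for a scope and keep a trace of them on disk."""
    # The English chapter lives at dbpedia.org, the others at <chapter>.dbpedia.org
    host = "dbpedia.org" if chapter == "en" else chapter + ".dbpedia.org"
    sparql = SPARQLWrapper("http://%s/sparql" % host)
    sparql.setQuery("SELECT DISTINCT ?s WHERE { %s }" % where_clause)
    sparql.setReturnFormat(JSON)
    bindings = sparql.query().convert()["results"]["bindings"]
    resources = [row["s"]["value"] for row in bindings]
    with open(out_file, "w") as trace:  # trace file for the modules downstream
        for uri in resources:
            trace.write(uri + "\n")
    return resources
```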
05/28/2016 - I found that it would be a better choice to test the parameters passed to the Python package, in order to be sure of their quality. NEW MODULE: param_test. It tests and sets the 2 parameters, which are the wiki chapter considered and the query used to target a scope. The query can be either a tag (such as "soccer" for soccer players, "dir" for directors, "act" for actors and so on) or an actual SPARQL WHERE clause. In the latter case the user has to make sure the query is correct and useful.
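As a rough illustration of the idea (the tag table and the ontology classes below are assumptions, not the module's real mapping):

```python
# Hypothetical tag table: each known tag expands to a SPARQL WHERE clause.
TAG_TO_WHERE = {
    "soccer": "?s a <http://dbpedia.org/ontology/SoccerPlayer>",
    "act": "?s a <http://dbpedia.org/ontology/Actor>",
    "all": "?s a <http://www.w3.org/2002/07/owl#Thing>",
}

def check_params(chapter, query):
    """Validate the wiki chapter and resolve the query into a WHERE clause."""
    if not (isinstance(chapter, str) and len(chapter) == 2):
        chapter = "en"  # fall back to the default chapter
    # Known tags are expanded; anything else is trusted as a raw WHERE clause.
    where_clause = TAG_TO_WHERE.get(query, query or TAG_TO_WHERE["all"])
    return chapter, where_clause
```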
05/23/2016 - Start of coding - I have started by engineering the algorithm, and I have found out there will be 3 principal modules: a Selector of resources, an Analyzer module and a utilities module (used by the other modules). The possibility of using JSONpedia as a module is deferred, due to some incompatibilities. If possible I will try again later during GSoC.
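Just to fix the idea on paper, here is a self-contained skeleton of how the three modules could be wired together; the function bodies are stubs, and the real interfaces are still to be designed:

```python
# Skeleton of the planned pipeline; each stub stands in for one module.

def select_resources(chapter, where_clause):
    """Selector: collect the wiki resources in scope (stub)."""
    return ["Albert_Einstein"]  # placeholder result

def analyze_tables(chapter, resource):
    """Analyzer: extract RDF data from the tables of one page (stub)."""
    print("analyzing tables of %s:%s" % (chapter, resource))

def run(chapter, where_clause):
    """Wire the modules together: select first, then analyze each resource."""
    for resource in select_resources(chapter, where_clause):
        analyze_tables(chapter, resource)

run("en", "?s a <http://dbpedia.org/ontology/Scientist>")
```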
05/22/2016 - Little report on the CB period - The community bonding period is already over, but it was really useful: I was able to contact my mentors successfully. They helped a lot in clarifying targets and strategies, and they have been available since the very beginning of the period. I also had to email some other community members in order to get acquainted with the tools they contributed to develop, and I found them very helpful. I really hope this kind of collaboration will last through the whole GSoC. My first impressions are very positive. Let the real work begin!
05/19/2016 - Second meeting - I showed the statistics results to my mentors. As I was running out of domain ideas, they helped me by discussing which kinds of wiki pages could have the largest number of tables (or tables with the most interesting info). So my first step is to evaluate these domains, either in it.wiki or in en.wiki:
- Soccer players
- Political elections
- Music artists' discographies
- Writers
- Motorsports
- Drivers of motorsports
- Basketball seasons and players
- Statistics on awards (e.g. Oscar, Nobel, Grammy and so on)
- Actors and filmographies
05/15/2016 - I have started a collaboration with Feddie to extend the capabilities of statistics.py: now it can also count lists. We are working together on building some of the extraction tools (the extraction framework itself, JSONpedia) in order to run them from our own machines. This can be useful for editing parts of the tools' docs, or maybe for writing our own guide to them.
05/08/2016 - Later this week I started a little Python script (statistics.py) to interact with the Wikipedia and DBpedia chapters in order to compute some statistics. This script, which you can find in the "Table Extractor" repo on GitHub, can show how many tables there are in a given domain of wiki pages.
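The core of the counting could look something like this sketch, which fetches a page's wikitext through the standard MediaWiki API and counts the "{|" markers that open a wikitable; the real script iterates over a whole domain of pages:

```python
import requests

def count_tables(title, lang="en"):
    """Approximate the number of wikitables on a single wiki page."""
    api = "https://%s.wikipedia.org/w/api.php" % lang
    params = {
        "action": "query",
        "prop": "revisions",
        "rvprop": "content",
        "format": "json",
        "titles": title,
    }
    pages = requests.get(api, params=params).json()["query"]["pages"]
    wikitext = next(iter(pages.values()))["revisions"][0]["*"]
    # Every wikitable starts with the "{|" marker (nested tables count too).
    return wikitext.count("{|")

print(count_tables("Albert Einstein"))
```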
05/03/2016 - First meeting - We started by pointing out some simple rules I have to follow during GSoC (e.g. how and when I will have to report on my work). My DBpedia mentor (marfox) showed me my repo page on GitHub and the possibility of maintaining a "progress page" here in the extraction framework's wiki. My mentors then gave me some project goals. Finally we discussed a strategy, and we all agreed, as a first approach, to start analyzing some particular domains of interest (and wiki chapters too), trying to extract relevant data immediately.
04/27/2016 - Contact with mentor and co-mentors. I'm really excited to know that my project is starting to attract interest from community members.
04/22/2016 - My GSoC 2016 proposal to DBpedia has been accepted! I can't wait to make first contact with the community.