I work on approaches for estimating semantic similarity between texts, machine learning approaches to text tagging, and large-scale multilingual entity recognition and disambiguation.
I am also part of the core development team of the GATE NLP framework and have developed or contributed to a number of GATE plugins, other GATE-related software, and a number of other open-source projects.
I am interested in a wide range of research topics, including machine learning approaches for text tagging, learning to rank and metric learning, approaches to natural language processing based on imitation learning and cost-sensitive learning and the synergies between knowledge representation and ontologies on one hand and natural language processing on the other hand.
- Room G30, Regent Court, 211 Portobello, S1 4DP, UK
- Email: johann.petrak (AT) gmail.com
- Email: johann.petrak (AT) sheffield.ac.uk
- Email: johann (AT) jpetrak.org
- Skype: joh_pet
- GitHub / GitLab / BitBucket
- Google Scholar
- Twitter @johann_p
See also Google Scholar
- A Deep Neural Network Sentence Level Classification Method with Context Information
X Song, J Petrak, A Roberts. Proceedings of EMNLP 2018.
ArXiv version: [PDF] [BIB]
- Adapted TextRank for Term Extraction: A Generic Method of Improving Automatic Term Extraction Algorithms.
Z Zhang, J Petrak, D Maynard. Proceedings Semantics 2018.
- An extensible multilingual open source lemmatizer.
A Aker, J Petrak, F Sabbah. Proceedings of RANLP 2017.
- Using @Twitter conventions to improve #lod-based named entity disambiguation
G Gorrell, J Petrak, K Bontcheva
Proceedings of ESWC 2015.
- Analysis of named entity recognition and linking for tweets
L Derczynski, D Maynard, G Rizzo, M van Erp, G Gorrell, R Troncy, J Petrak, K Bontcheva
Information Processing & Management, 2015
- Random Indexing for Finding Similar Nodes within Large RDF Graphs.
D Damljanovic, J Petrak, M Lupu, H Cunningham, M Carlsson, G Engstrom, B Andersson.
A list of selected projects I participated in:
- Knowmak (2017/18): methods for ontology enrichment and keyword extraction for ontology-based topic annotation of documents
- Comrades (2016): Collective Platform for Community Resilience and Social Innovation during Crises. Development and deployment of a multilingual entity disambiguation and linking system as a service, based on GATE.
- DeCarboNet (2016): A Decarbonisation Platform for Citizen Empowerment and Translating Collective Awareness into Behavioural Change
- Trendminer (2013/14): large-scale semantic annotation of news articles and tweets based on named entities from Wikipedia/DBpedia, and development of approaches for disambiguating between candidate entities.
- Khresmoi (2013/14): large-scale semantic annotation of health-related documents and web pages for semantic search; annotation of named entities and relations for anatomy and pathology terms in German radiology reports.
- OREX (2010-12): Ontology-based information extraction and search: Ontology-based recognition and extraction of relevant information from online job-ads and semantic search of the extracted information.
- LarKC (EU FP 7 Large-Scale Integrating Project) (2009): random indexing for the detection of semantically related nodes in large ontologies and linked biomedical data.
- INSPIRATION (K-Net COAST) (2009/10): structural analysis of medical dictations, automatic recognition of sentence boundaries in audio-transcripts.
Some selected software projects:
A GATE plugin for using various machine learning algorithms from within GATE. It supports classification, regression and tagging tasks and allows the use of algorithms from LibSVM, Mallet, Weka (as external program), Scikit-Learn (external, Python), CostCLA (external, Python), Pytorch (external, Python) and Keras (external, Python).
A fast implementation of sparse float vectors based on defaultdict(float), useful for speeding up various machine learning algorithms. This is based on Liang Huang’s hvector library but works with Python 3.
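As a rough illustration of the idea only (this is not the library's code, and the class and method names `SparseVector`, `dot` and `add_scaled` are made up for this sketch), a sparse float vector can be built on `defaultdict(float)` so that missing features read as 0.0:

```python
from collections import defaultdict


class SparseVector(defaultdict):
    """Hypothetical sparse float vector: absent keys behave like 0.0."""

    def __init__(self, items=()):
        super().__init__(float, items)

    def dot(self, other):
        # Iterate over the smaller vector; .get avoids creating zero entries.
        small, large = (self, other) if len(self) <= len(other) else (other, self)
        return sum(value * large.get(key, 0.0) for key, value in small.items())

    def add_scaled(self, other, scale=1.0):
        # In-place update: self += scale * other.
        for key, value in other.items():
            self[key] += scale * value
        return self


# Illustrative feature names, not taken from any real model.
weights = SparseVector()
features = SparseVector({"word=gate": 1.0, "suffix=te": 0.5})
weights.add_scaled(features, 2.0)
score = weights.dot(features)
```

Only non-zero features are stored, so updates and dot products cost time proportional to the number of active features rather than to the size of the feature space.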
A GATE plugin that brings two important properties to GATE pipelines: modularity and parametrizability. The plugin provides a new processing resource which makes it easy to include pipelines within pipelines while keeping each of the contained pipeline files separate. It also provides a new kind of controller which makes it possible to override or set any runtime parameter or init parameter for any of the processing resources in the pipeline, to set document features, or to enable or disable a PR within the pipeline.
A GATE plugin which makes it easy to write Java code that gets executed in a pipeline. The Java code gets compiled on the fly and there is no need to restart GATE or reload the pipeline when the Java program is modified.
A GATE plugin which provides processing resources for very flexible matching of text using nestable Java regular expressions, and for very fast and compact use of gazetteer lists for matching either document text or text extracted from annotation features (similar to what the FlexibleGazetteer does).
A GATE plugin which can connect to the TagMe web API to annotate documents.
A GATE plugin to create term frequency, document frequency and tf*idf stats for a corpus. This plugin can be run multi-threaded using GCP.
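For readers unfamiliar with these statistics, here is a minimal Python sketch. The plugin's exact weighting is not stated here, so the raw-count term frequency and the `log(N / df)` inverse document frequency are assumptions, as are the function name `corpus_stats` and the toy corpus:

```python
import math
from collections import Counter


def corpus_stats(docs):
    """Return (tf per document, df per term, tf*idf per document)."""
    tfs = [Counter(doc) for doc in docs]
    df = Counter()
    for tf in tfs:
        df.update(tf.keys())  # each document counts a term at most once
    n_docs = len(docs)
    tfidf = [
        {term: count * math.log(n_docs / df[term]) for term, count in tf.items()}
        for tf in tfs
    ]
    return tfs, df, tfidf


# Toy corpus of already tokenized documents (illustrative only).
docs = [["gate", "plugin", "gate"], ["plugin", "corpus"]]
tfs, df, tfidf = corpus_stats(docs)
```

With this weighting, a term that occurs in every document (here "plugin") gets a tf*idf of zero, while terms concentrated in few documents score higher.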
A library that simplifies the interaction between GATE processing resources and external software. The interaction works either by starting a separate process and communicating with it through pipes, or by communicating with a separate server. So far this is mainly used to enable the GATE machine learning plugin, gateplugin-LearningFramework, to use Weka, Scikit-Learn, Keras, Pytorch and other external tools.
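The pipe-based variant can be sketched in general terms as follows. This is not the library's API or wire format: the line-delimited JSON protocol, the inline `WORKER` program and the helper names `start_worker` and `ask` are all assumptions made for illustration.

```python
import json
import subprocess
import sys

# Hypothetical worker: reads one JSON request per line, writes one JSON reply per line.
WORKER = r"""
import json, sys
for line in sys.stdin:
    request = json.loads(line)
    reply = {"n_tokens": len(request["text"].split())}
    sys.stdout.write(json.dumps(reply) + "\n")
    sys.stdout.flush()
"""


def start_worker():
    # Start the external process with pipes attached to its stdin and stdout.
    return subprocess.Popen(
        [sys.executable, "-c", WORKER],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        text=True,
    )


def ask(proc, request):
    # Send one request line and block until the matching reply line arrives.
    proc.stdin.write(json.dumps(request) + "\n")
    proc.stdin.flush()
    return json.loads(proc.stdout.readline())


proc = start_worker()
reply = ask(proc, {"text": "GATE talks to an external process"})
proc.stdin.close()  # end of input lets the worker loop finish
proc.wait()
proc.stdout.close()
```

Keeping the process alive between requests avoids paying the start-up cost of the external tool for every document; flushing after each message matters because pipes are buffered.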
A GATE plugin that makes it easier to handle graphs of annotations, i.e. annotations representing trees, coreference chains, candidate lists or anything where one annotation needs to refer to one or more other annotations in some way.
A GATE plugin which provides the ability to carry out evaluations from within a pipeline.
A GATE plugin which makes it easy to add or update annotations based on looking up information in a JDBC table.
A GATE plugin for loading and saving documents in a number of additional formats: GZIP compressed GATE XML, GATE XML Snappy compressed, Java Object serialized, Java Object serialized with Snappy compression, Java Object serialized with GZIP compression.
A GATE plugin which makes it possible to write Scala code that gets executed in a pipeline from within GATE.
A GATE plugin which can connect to the Stanford CoreNLP server to annotate documents.
A GATE plugin which can connect to the Google NLP Service to annotate documents.
A very simple script for tracking issues. This is meant to be used from within a git repository and will simply manage issues by creating a new file for each issue in a subdirectory of the repository.
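The file-per-issue idea can be sketched as follows. This is not the script itself: the `issues` directory name, the numbered `.txt` file names, the file contents and the function name `new_issue` are assumptions chosen for this example.

```python
import tempfile
from pathlib import Path


def new_issue(repo_root, title, issues_dir="issues"):
    """Create the next numbered issue file and return its path."""
    folder = Path(repo_root) / issues_dir
    folder.mkdir(parents=True, exist_ok=True)
    # Next number = highest existing numeric file name + 1.
    numbers = [int(p.stem) for p in folder.glob("*.txt") if p.stem.isdigit()]
    number = max(numbers, default=0) + 1
    path = folder / f"{number:04d}.txt"
    path.write_text(f"title: {title}\nstatus: open\n", encoding="utf-8")
    return path


# Demo in a throwaway directory standing in for a git working tree.
with tempfile.TemporaryDirectory() as repo:
    first = new_issue(repo, "Plugin fails to load")
    second = new_issue(repo, "Add usage docs")
    names = [first.name, second.name]
    first_text = first.read_text(encoding="utf-8")
```

Because every issue is an ordinary file, issues are versioned, branched and merged together with the code by git itself.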
A simple Python script to add or replace license headers in all files in a directory tree of source files.
A useful and flexible command line script to run a GATE pipeline on documents in a directory.
A GATE plugin which adds lemmata to tokens based on their Universal Dependencies POS tags. This currently works for English, German, French, Italian, Dutch and Spanish, though for some languages only Wiktionary-based lookups are used while others use a morphological transducer. This is based on the code by Ahmet Aker: http://staffwww.dcs.shef.ac.uk/people/A.Aker/activityNLPProjects.html