Jorge Teixeira
Porto, Porto, Portugal
3K followers
500+ connections
About
I have a passionate customer-centric attitude and a solid vision for product innovation…
Activity
Experience
Education
Publications
-
POPSTAR at RepLab 2013: Name ambiguity resolution on Twitter
CLEF2013
Filtering tweets relevant to a given entity is an important task for online reputation management systems. This contributes to a reliable analysis of opinions and trends regarding a given entity. In this paper we describe our participation at the Filtering Task of RepLab 2013. The goal of the competition is to classify a tweet as relevant or not relevant to a given entity. To address this task we studied a large set of features that can be generated to describe the relationship between an entity and a tweet. We explored different learning algorithms as well as different types of features: text, keyword similarity scores between entity metadata and tweets, the Freebase entity graph and Wikipedia. The test set of the competition comprises more than 90000 tweets about 61 entities from four distinct categories: automotive, banking, universities and music. Results show that our approach is able to achieve a Reliability of 0.72 and a Sensitivity of 0.45 on the test set, corresponding to an F-measure of 0.48 and an Accuracy of 0.908.
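One of the feature families named in this abstract, keyword similarity between entity metadata and tweets, can be illustrated with a minimal sketch. The Jaccard measure, the simple tokenizer, the function names `tokenize` and `keyword_similarity`, and the example keywords are all assumptions made for illustration; the abstract does not specify how the paper computes its similarity scores.

```python
import re

def tokenize(text):
    """Lowercase a string and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def keyword_similarity(tweet, entity_keywords):
    """Jaccard overlap between tweet tokens and entity keyword tokens."""
    tweet_tokens = tokenize(tweet)
    entity_tokens = tokenize(" ".join(entity_keywords))
    union = tweet_tokens | entity_tokens
    if not union:
        return 0.0  # nothing to compare
    return len(tweet_tokens & entity_tokens) / len(union)

# Hypothetical metadata keywords for an automotive entity.
jaguar_keywords = ["jaguar", "car", "engine", "dealer"]
print(keyword_similarity("Test drove the new Jaguar car today", jaguar_keywords))
print(keyword_similarity("Saw a jaguar at the zoo", jaguar_keywords))
```

Under these assumptions the tweet about the car scores higher than the tweet about the animal, so a score of this kind could serve as one input among many to a relevance classifier.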
-
Tokenizing Micro-Blogging Messages using a Text Classification Approach
Proceedings of the 4th Workshop on Analytics for Noisy Unstructured Text Data AND'10
The automatic processing of microblogging messages may be problematic, even in the case of very elementary operations such as tokenization. The problems arise from the use of non-standard language, including media-specific words (e.g. “2day”, “gr8”, “tl;dr”, “loool”), emoticons (e.g. “(ò_ó)”, “(=ˆ-ˆ=)”), non-standard letter casing (e.g. “dr. Fred”) and unusual punctuation (e.g. “.... ..”, “!??!!!?”, “„,”). Additionally, spelling errors are abundant (e.g. “I;m”), and we can frequently find more than one language (with different tokenization requirements) in the same short message. To be efficient in such an environment, manually developed rule-based tokenizer systems have to deal with many conditions and exceptions, which makes them difficult to build and maintain. We present a text classification approach for tokenizing Twitter messages, which addresses complex cases successfully and which is relatively simple to set up and maintain. For that, we created a corpus consisting of 2500 manually tokenized Twitter messages, a task that is simple for human annotators, and we trained an SVM classifier for separating tokens at certain discontinuity characters. For comparison, we created a baseline rule-based system designed specifically for dealing with typical problematic situations. Results show that we can achieve F-measures of 96% with the classification-based approach, much above the performance obtained by the baseline rule-based tokenizer (85%). Also, subsequent analysis allowed us to identify typical tokenization errors, which we show can be partially solved by adding some additional descriptive examples to the training corpus and re-training the classifier.
-
Complete list of publications available at Google Scholar
https://scholar.google.com/citations?user=EF9Otn0AAAAJ&hl=en
Projects
-
LeanBigData (FP7)
-
LeanBigData will deliver a Big Data platform that is ultra-efficient, improving today’s best-effort systems by at least one order of magnitude in efficiency, reducing the amount of resources required to process a set of data or allowing us to process more data with the same amount of resources as today.
-
StreamLine (H2020)
-
STREAMLINE will address the competitive advantage needs of European online media businesses (EOMB) by delivering fast reactive analytics suitable for solving a wide array of problems, including customer retention, personalized recommendation, and more broadly targeted services. STREAMLINE will develop cross-sectorial analytics drawing on multi‐source data originating from online media consumption, online games, telecommunications services, and multilingual web content. STREAMLINE partners face big and fast data challenges. They serve over 100 million users, offer services that produce billions of events, yielding over 10TB of data daily, and possess over a PB of data at rest. Their business use cases are representative of EOMB and cannot be handled efficiently and effectively by state-of-the-art technologies, as a consequence of system and human latencies.
-
Máquina do Tempo
-
"Máquina do Tempo" (time machine) is a dynamic web tool that lets you interactively navigate through the last 25 years of Portuguese news up to today. Networks of co-occurrences of public personalities in the news are the starting point for this journey. Additional information such as jobs, roles and quotations is also available for more than 100 thousand personalities. All information is automatically extracted using Natural Language Processing and Machine Learning techniques.
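The co-occurrence networks mentioned in this description can be sketched in a few lines. This is an illustrative sketch only, not the project's implementation: it assumes the personality names have already been extracted from each article by an upstream step, and the function name `cooccurrence_counts` and the placeholder names are hypothetical.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(articles):
    """Count how often each pair of names appears in the same article."""
    counts = Counter()
    for names in articles:
        # Deduplicate and sort so each pair has one canonical order.
        for pair in combinations(sorted(set(names)), 2):
            counts[pair] += 1
    return counts

# Hypothetical input: names already extracted from three news articles.
articles = [
    ["Person A", "Person B", "Person C"],
    ["Person A", "Person B"],
    ["Person B", "Person C", "Person B"],
]
edges = cooccurrence_counts(articles)
for (left, right), weight in edges.most_common():
    print(left, "-", right, weight)
```

Each counted pair corresponds to a weighted edge between two personalities; restricting the input to articles from a given period would give the network for that slice of time.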
-
International Conference: New Job Opportunities in Translation and Interpreting - Challenges for University Programmes and Language Services Providers
-
Member of the Organizing Committee.
An analysis and debate regarding the advancements of linguistic technology and their impact on language service providers. How do machine translation, vast amounts of available data, and a tightly connected society affect the work of professional linguists, translators and other inter-language workers, and how should they prepare for the current transformation?
-
Grande Área
-
Global stats on the World Cup. Real-time comparison between teams on the field. Search for teams or players and compare their performance on three different levels: efficiency, discipline and experience.
-
International Conference: Language and the Law - Bridging the Gaps
-
Member of the Organizing Committee.
Language and the Law – Bridging the Gaps is the first International Conference to be jointly sponsored by ALIDI (the newly formed Association for Language and Law for Speakers of Portuguese) and the IAFL (the International Association of Forensic Linguists).
-
Um País Como Nós
-
Um país como nós ("a country like us") is an interactive tool that relates each of us to the statistical "numbers" of our municipality and of the country.
-
REACTION - Retrieval, Extraction and Aggregation Computing Technology for Integrating and Organizing News
-
REACTION (funded by FCT, UT Austin - Portugal Program) is an initiative for developing a computational journalism platform (mostly) for Portuguese.
The project is developing information extraction, social media mining and information visualisation technologies for assisting journalists in the production of news articles.
Role: "Web Community Sensing" work-package leader.
-
Twitteuro
-
A website that reflects international Twitter activity related to the Euro 2012 competition.
It shows which teams are buzzing with interest, which players are the most popular, which game generates the most comments, and how people react to the events during the games.
-
International Conference: 3rd European Conference of the International Association of Forensic Linguists
-
Member of the Organizing Committee.
3rd European Conference of the International Association of Forensic Linguists, on the theme of Forensic Linguistics: Bridging the Gap(s) between Language and the Law.
-
Twitómetro
-
A website depicting user interest in and opinion on the candidates in the upcoming elections.
The data comes from Twitter posts by Portuguese users.
Honors and awards
-
Best Teacher Award for PGBIA - Business Intelligence and Analytics Postgraduate Programme
Porto Business School
-
Time Machine: Entity-Centric Search and Visualization of News Archives
Best Demo Award ECIR 2016
"We present a dynamic web tool that allows interactive search and visualization of large news archives using an entity-centric approach. Users are able to search entities using keyword phrases expressing news stories or events and the system retrieves the most relevant entities to the user query based on automatically extracted and indexed entity profiles. From the computational journalism perspective, TimeMachine allows users to explore media content through time using automatic identification of entity names, jobs, quotations and relations between entities from co-occurrences networks extracted from the news articles. TimeMachine demo is available at http://maquinadotempo.sapo.pt/."
Reference: Pedro Saleiro, Jorge Teixeira, Carlos Soares, Eugénio Oliveira. “TimeMachine: Entity-Centric Search and Visualization of News Archives.” In Advances in Information Retrieval: 38th European Conference on IR Research, ECIR 2016, Padua, Italy, March 20-23, 2016, pp. 845-848. Springer International Publishing, 2016.
-
"A Bootstrapping Approach for Training a NER with Conditional Random Fields"
Nominee for Best Paper Award
Jorge Teixeira, Luís Sarmento, Eugénio Oliveira. (2011) “A Bootstrapping Approach for Training a NER with Conditional Random Fields” Progress in Artificial Intelligence (LNAI 7026), 15th Portuguese Conference on Artificial Intelligence, EPIA 2011, Lisbon, Portugal, October 10-13
Languages
-
Portuguese
Native or bilingual level
-
English
Advanced level
-
Spanish
Advanced level
-
Italian
Basic to intermediate level
Organizations
-
New Job Opportunities in Translation and Interpreting - Challenges for University Programmes and Language Services Providers
Member of the Organizing Committee
- present
-
3rd European Conference of the International Association of Forensic Linguists
Member of the Organizing Committee