
[Tracking] Research, test, and deploy new language analyzers
Open, Low, Public

Description

Two ways to start:

  • Languages that we really want to make big improvements on because we don't support them well (e.g. spaceless languages)
  • Test analysers that we know to be very mature (e.g. there's a Polish analyser that @dcausse knows about and likes)
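To illustrate why spaceless languages are hard to support with a whitespace-based tokenizer, here is a minimal sketch. The function names `whitespace_tokens` and `char_bigrams` are hypothetical, and character bigrams are only one common fallback, not necessarily what any analyzer discussed on this task does.

```python
def whitespace_tokens(text):
    """Split on whitespace only; spaceless text comes back as one big token."""
    return text.split()


def char_bigrams(text):
    """Overlapping two-character tokens (hypothetical fallback tokenizer).

    Whitespace is dropped first, so bigrams may span a space in mixed text;
    that is a simplification made for this sketch.
    """
    chars = [c for c in text if not c.isspace()]
    if len(chars) < 2:
        return ["".join(chars)] if chars else []
    return [chars[i] + chars[i + 1] for i in range(len(chars) - 1)]


sentence = "東京都に住む"  # Japanese text written without spaces
query = "京都"

# The whitespace tokenizer sees a single token, so the query cannot match it.
print(query in whitespace_tokens(sentence))
# Bigrams make the query findable, at the cost of some false matches.
print(query in char_bigrams(sentence))
```

The bigram match above is also an example of the downside: the two characters of the query straddle a word boundary in the sentence, which is the kind of error a real word segmenter is meant to avoid.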

Things to consider:

  • How much better the analyser is than what we've got
  • Maintainability of the code of the analyser
  • [add more!]
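One rough way to approach "how much better" is to run the current and candidate analyzers over the same word list and look at which words the candidate newly groups together. The sketch below assumes each analyzer can be wrapped as a function from one word to one analyzed token; `lowercase_only` and `naive_plural_stripper` are toy stand-ins invented for this example, not real analyzers.

```python
from collections import defaultdict


def group_by_token(words, analyze):
    """Map each analyzed token to the set of surface words that produce it."""
    groups = defaultdict(set)
    for word in words:
        groups[analyze(word)].add(word)
    return dict(groups)


def compare_analyzers(words, baseline, candidate):
    """Summarize how the candidate's groupings differ from the baseline's."""
    base = group_by_token(words, baseline)
    cand = group_by_token(words, candidate)
    base_sets = {frozenset(g) for g in base.values()}
    # Multi-word groups the candidate forms that the baseline did not.
    new_merges = sorted(
        sorted(g) for g in cand.values()
        if len(g) > 1 and frozenset(g) not in base_sets
    )
    return {
        "baseline_groups": len(base),
        "candidate_groups": len(cand),
        "new_merges": new_merges,
    }


# Toy analyzers (hypothetical, for illustration only).
def lowercase_only(word):
    return word.lower()


def naive_plural_stripper(word):
    w = word.lower()
    return w[:-1] if w.endswith("s") and len(w) > 3 else w


report = compare_analyzers(
    ["Cat", "cats", "dog", "Dogs", "bus"], lowercase_only, naive_plural_stripper
)
print(report)
```

The new merges still need review by someone who knows the language, since a merge can be either a useful conflation or an error.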

Languages/analyzers to consider (from T155549):

Previously a 2016/17 Q3 Goal.
Previously a 2016/17 Q4 Goal.
Currently a 2017/18 Q1 Goal.

Related Objects

Status      Assigned
Invalid     None
Open        None
Resolved    TJones
Resolved    TJones
Resolved    EBernhardson
Resolved    TJones
Resolved    TJones
Resolved    TJones
Resolved    TJones
Resolved    None
Resolved    TJones
Resolved    TJones
Resolved    TJones
Resolved    EBernhardson
Resolved    TJones
Resolved    TJones
Resolved    Gehel
Resolved    dcausse
Resolved    debt
Resolved    TJones
Declined    TJones
Resolved    TJones
Resolved    TJones
Open        None
Open        None
Resolved    TJones
Resolved    TJones
Resolved    TJones
Open        None
Resolved    TJones
Resolved    TJones
Resolved    debt
Resolved    debt
Resolved    TJones
Resolved    TJones
Resolved    TJones
Resolved    TJones
Resolved    TJones
Resolved    TJones
Resolved    Gehel
Resolved    TJones
Resolved    TJones
Resolved    TJones
Resolved    TJones
Resolved    TJones
Resolved    TJones
Resolved    TJones
Resolved    TJones
Resolved    TJones
Resolved    TJones

Event Timeline

Deskana renamed this task from [EPIC] Research, test, and deploy new language analysers to [Epic, Q3 Goal] Research, test, and deploy new language analysers. Jan 3 2017, 7:44 PM
Deskana triaged this task as Medium priority.
Deskana moved this task from needs triage to Current work on the Discovery-Search board.
Deskana added a project: Epic.
This comment was removed by TJones.

HebMorph was recommended by @Matanya. It was investigated some time ago by Matanya and Nik (@Manybubbles). It's being actively developed and Matanya knows the developer.

While researching analyzers, I came across others. I didn't really investigate most of them, so this list is just a starting point for anyone who wants to look more closely at any of these.

General
https://www.elastic.co/guide/en/elasticsearch/plugins/5.1/analysis.html (ES 5.1)
List of Elastic analysis plugins (internal and third-party): Japanese, several for Chinese, Polish, Ukrainian, Hebrew, Russian, English, Vietnamese, and some technical ones.

Polish
See T154516.

Chinese
See T158202.

Ukrainian
See T160105.

Hebrew
See T162739.

Japanese
https://www.elastic.co/guide/en/elasticsearch/plugins/5.1/analysis-kuromoji.html (v5.1.2)
https://www.elastic.co/guide/en/elasticsearch/plugins/master/analysis-kuromoji.html (v6.0.0a)
test here (v?): http://www.atilika.org/

Vietnamese
https://github.com/duydo/elasticsearch-analysis-vietnamese (3 months)
linked by Elastic

Thai
https://www.elastic.co/guide/en/elasticsearch/plugins/current/analysis-icu.html
ICU Analysis plugin, "including better analysis of Asian languages"
Mentioned elsewhere that it covers Thai as well.

Phonetic analysis
https://www.elastic.co/guide/en/elasticsearch/plugins/current/analysis-phonetic.html (v5.1.2)
https://www.elastic.co/guide/en/elasticsearch/plugins/master/analysis-phonetic.html (v6.0.0.a)
“Soundex, Metaphone, and a variety of other algorithms”, presumably English
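For a sense of what this kind of phonetic matching does, here is a minimal sketch of classic American Soundex, which was designed around English surnames. This is an independent illustration, not the plugin's code, and the function name `soundex` and its handling of non-ASCII input are choices made for this example.

```python
# Letter-to-digit table for American Soundex; vowels, y, h, and w get no code.
CODES = {}
for letters, digit in (
    ("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
    ("l", "4"), ("mn", "5"), ("r", "6"),
):
    for ch in letters:
        CODES[ch] = digit


def soundex(name):
    """Return the 4-character Soundex code for a name (sketch only)."""
    # Assumption for this sketch: non-ASCII and non-letter characters are dropped.
    letters = [c for c in name.lower() if c.isascii() and c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters that share a code
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code is kept
        if len(result) == 4:
            break
    return result.ljust(4, "0")


# Names that sound alike in English map to the same code.
print(soundex("Robert"), soundex("Rupert"))
```

Because the letter groupings reflect English pronunciation, the same table is unlikely to work well for other languages without changes.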

Misc
https://github.com/yakaz/elasticsearch-analysis-combo (2 years)
combines multiple language analyzers

TJones renamed this task from [Epic, Q3 Goal] Research, test, and deploy new language analysers to [Epic, Q3 Goal, Q4 Goal] Research, test, and deploy new language analysers. Apr 11 2017, 9:01 PM
TJones renamed this task from [Epic, Q3 Goal, Q4 Goal] Research, test, and deploy new language analysers to [Epic, Q1 Goal] Research, test, and deploy new language analyzers. Jul 12 2017, 1:25 PM
TJones updated the task description. (Show Details)
EBjune renamed this task from [Epic, Q1 Goal] Research, test, and deploy new language analyzers to [Epic] Research, test, and deploy new language analyzers. Oct 26 2017, 5:04 AM
Gehel renamed this task from [Epic] Research, test, and deploy new language analyzers to [Tracking] Research, test, and deploy new language analyzers. Sep 9 2020, 2:52 PM
Gehel lowered the priority of this task from Medium to Low.
MPhamWMF subscribed.

Closing out low/lowest priority tasks over 6 months old with no activity within the last 6 months in order to clean out the backlog of tickets we will not be addressing in the near term. Please feel free to reopen if you think a ticket is important, but bear in mind that given current priorities and resourcing, it is unlikely that the Search team will pick up these tasks in the foreseeable future. We hope that the requested changes have either been addressed by or made irrelevant by work the team has done or is doing -- e.g. upgrading Elasticsearch to a newer version will solve various ES-related problems -- or will be subsumed by future work in a more generalized way.

RhinosF1 removed a project: Discovery-Search.
RhinosF1 subscribed.

Re-opening tasks and removing from team workboard per IRC feedback given yesterday and discussion with MPham.