Tisane is a natural language processing service, providing:
- standard NLP functionality
- special functions for detection of problematic or abusive content
- functionality for law enforcement / eDiscovery / intelligence applications
- low-level NLP like morphological analysis and tokenization of no-space languages (Chinese, Japanese, Thai)
Tisane was engineered from the ground up to parse user-generated content and tackle "dirty NLP".
Under the hood, Tisane is a multifunctional NLU system, like spaCy, but based on a monolithic architecture and focused on user-generated content. All the functions are exposed through the same language models and the same analysis process, invoked via the POST /parse method. A single call returns entities, sentiment, topics, problematic content, and low-level NLP data. Other vendors force you to issue a separate call for each of these functions; if you need to detect sentiment, entities, and topics, with Tisane you make one third as many calls.
Tisane is available as SaaS or an on-prem solution.
Entity Extraction
Tisane extracts named entities of many types; the full list appears in the Entity Types and Subtypes section of the Response Reference below.
Detected entities are located under the entities_summary section.
Language Identification
Detects the language of the utterance. It can also (with some limitations) segment a fragment that mixes several languages.
To invoke automatic detection when parsing or translating, use * as the language code. Use the lang_detect_segmentation_regex setting to define custom language detection fragment boundaries. For example, if multiple languages may be used in different sentences of the same text, define the regex as (([\r\n]|[.!?][ ])).
To detect the language only, invoke the POST detectLanguage method.
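For example, a minimal POST /parse request relying on automatic language detection (the content is illustrative):
{"language": "*", "content": "bonjour tout le monde", "settings": {}}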
Sentiment Analysis
Tisane provides sentiment analysis both at the document level and as a breakdown by aspects / facets (so-called "aspect-based sentiment analysis"). If you're looking to understand what exactly the customers like or don't like about your product offering, the breakdown is stored in the sentiment_expressions section.
If you're looking for a document-level sentiment score, set the document_sentiment toggle to true in settings. Also, if you're processing reviews, set format to review for better results.
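For example, a request sketch for processing a review with a document-level score (the content is illustrative):
{"language": "en", "content": "Great location, rude staff.", "settings": {"format": "review", "document_sentiment": true}}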
Note that sentiment analysis and detection of problematic content are not the same. Sentiment can be negative and not be a personal attack or hate speech; on the other hand, criminal activity or sexual advances may not necessarily carry negative sentiment.
Topics
Tisane extracts topics (subjects) in IAB, IPTC standards, as Wikidata IDs, or native Tisane labels / family IDs.
If you prefer fewer but more specific topics, set the optimize_topics toggle to true; less specific topics are then dropped when they are subsumed by more specific ones.
Problematic Content
Tisane tags the following types of problematic content:
- personal_attack - cyberbullying, ad hominem attacks, insults (not necessarily involving profanities)
- bigotry - hate speech targeting protected classes (not political factions)
- sexual_advances
- profanity - plain or obfuscated profanities
- criminal_activity - attempts to buy or sell illegal items, stolen data, documents, etc. (Not conversations about crime)
- allegation - claims about objectionable conduct
- external_contact - attempts to establish contact (e.g. invitations to exchange phones, emails, handles or set up meetings)
- mental_issues - mental issues and suicidal ideation
- disturbing - disturbing depictions of violence
- adult_only - adult only topics (e.g. in minors-only communities)
- provocation - attempts to provoke a community
...and more, supporting slang and misspellings.
Low-level Data
With the low-level data, you can:
- segment text in languages without spaces (Chinese, Japanese, Thai)
- decompound text in languages using compounds or agglutination (German, Dutch, Norwegian, Hungarian, Turkish)
- detect multiword expressions (MWEs) and lexical chunks
- find noun phrases, verb phrases, prepositional phrases, main clauses, etc.
- tag parts of speech
- get extensive grammatical info, e.g. verb tense, plurality, gender
- detect questions, negations
- find semantic roles (e.g. agents and patients) in active and passive voice
- split sentences
- locate categories of concepts not included in entity types
To obtain parse trees and individual words, set the parses or words toggles to true.
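For example, a settings object requesting both the lexical chunks and the parse forests (a minimal sketch; the other settings keep their defaults):
{"words": true, "parses": true}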
Translation
Built-in machine translation (POST /transform) is intended to provide unfiltered translation for potentially offensive, profane, and slang-abundant content.
Supported Languages
Currently we support over 30 languages:
- English
- Afrikaans
- Albanian
- Arabic
- Danish
- German
- Spanish
- Persian
- Finnish
- French
- Hebrew
- Hindi
- Hungarian
- Indonesian
- Italian
- Japanese
- Korean
- Malay
- Dutch
- Norwegian
- Polish
- Pashto
- Portuguese
- Russian
- Swedish
- Simplified Chinese
- Traditional Chinese
- Cantonese
- Tagalog (including Taglish)
- Thai
- Turkish
- Urdu
- Vietnamese
Integrate with the API
In summary, the POST /parse method has 3 attributes: content, language, and settings. All 3 attributes are mandatory.
For example:
{"language": "en", "content": "hello", "settings": {}}
Read on for more info on the response and the settings specs. The method doc pages contain snippets of code for your favorite languages and platforms.
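For example, a request using a couple of the settings described below (the content and the settings choices are purely illustrative):
{"language": "en", "content": "The breakfast was yummy but the staff is unfriendly.", "settings": {"snippets": true, "explain": true}}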
Response Reference
The response of the POST /parse method contains several sections displayed or hidden according to the settings provided.
The common attributes are:
- text (string) - the original input
- reduced_output (boolean) - if the input is too big and verbose information like the lexical chunks was requested, the verbose information will not be generated, and this flag will be set to true and returned as part of the response
- sentiment (floating-point number) - a number in the range -1 to 1 indicating the document-level sentiment. Only shown when the document_sentiment setting is set to true.
- signal2noise (floating-point number) - a signal-to-noise ranking of the text, in relation to the array of concepts specified in the relevant setting. Only shown when the relevant setting exists.
Abusive or Problematic Content
The abuse section is an array of detected instances of content that may violate some terms of use. NOTE: the terms of use in online communities vary, and so it is up to the administrators to determine whether the content is indeed abusive. For instance, it makes no sense to restrict sexual advances in a dating community, or to censor profanities where they are accepted by the bulk of the community.
The section exists if instances of abuse are detected and the abuse setting is either omitted or set to true.
Every instance contains the following attributes:
- offset (unsigned integer) - zero-based offset where the instance starts
- length (unsigned integer) - length of the content
- sentence_index (unsigned integer) - zero-based index of the sentence containing the instance
- text (string) - fragment of text containing the instance (only included if the snippets setting is set to true)
- tags (array of strings) - when present, provides additional detail about the abuse. For instance, if the fragment is classified as an attempt to sell hard drugs, one of the tags will be _harddrug.
- type (string) - the type of the abuse
- severity (string) - how severe the abuse is. The levels of severity are low, medium, high, and extreme
- explanation (string) - when available, provides the rationale for the annotation; set the explain setting to true to enable it.
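Example (a hypothetical instance with the snippets setting enabled; the text, offsets, and severity are illustrative):
"abuse": [
{
"sentence_index": 0,
"offset": 0,
"length": 15,
"type": "personal_attack",
"severity": "medium",
"text": "you're an idiot"
}
]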
The currently supported types are:
- personal_attack - an insult / attack on the addressee, e.g. an instance of cyberbullying. Please note that an attack on a post or a point, or plain negative sentiment, is not the same as an insult. The line may be blurred at times. See our Knowledge Base for more information.
- bigotry - hate speech aimed at one of the protected classes. The hate speech detected is not just racial slurs but, more generally, hostile statements aimed at the group as a whole
- profanity - profane language, regardless of the intent
- sexual_advances - welcome or unwelcome attempts to gain some sort of sexual favor or gratification
- criminal_activity - attempts to sell or procure restricted items, criminal services, issuing death threats, and so on
- external_contact - attempts to establish contact or payment via external means of communication, e.g. phone, email, instant messaging (may violate the rules in certain communities, e.g. gig economy portals, e-commerce portals)
- adult_only - activities restricted for minors (e.g. consumption of alcohol)
- mental_issues - content indicative of suicidal thoughts or depression
- allegation - claimed knowledge or accusation of misconduct (not necessarily a crime)
- provocation - content likely to provoke an individual or a group
- disturbing - graphic descriptions that may disturb readers
- no_meaningful_content - unparseable gibberish without apparent meaning
- data_leak - private data like passwords, ID numbers, etc.
- spam - (RESERVED) spam content
- generic - undefined
Sentiment Analysis
The sentiment_expressions section is an array of detected fragments indicating the attitude towards aspects or entities.
The section exists if sentiment is detected and the sentiment setting is either omitted or set to true.
Every instance contains the following attributes:
- offset (unsigned integer) - zero-based offset where the instance starts
- length (unsigned integer) - length of the content
- sentence_index (unsigned integer) - zero-based index of the sentence containing the instance
- text (string) - fragment of text containing the instance (only included if the snippets setting is set to true)
- polarity (string) - whether the attitude is positive, negative, or mixed. Additionally, there is a default polarity used for cases when the entire snippet has been pre-classified. For instance, if a review is split into two portions, "What did you like?" and "What did you not like?", and the reviewer replies briefly, e.g. "The quiet. The service", the utterance itself carries no sentiment value. When the calling application is aware of the intended sentiment, the default sentiment simply provides the targets / aspects, to which the sentiment is then added externally.
- targets (array of strings) - when available, provides the set of aspects and/or entities that are the targets of the sentiment. For instance, when the utterance is "The breakfast was yummy but the staff is unfriendly", the targets for the two sentiment expressions are meal and staff. Named entities may also be targets of the sentiment.
- reasons (array of strings) - when available, provides reasons for the sentiment. In the example utterance above, the reasons array for staff is ["unfriendly"], while the reasons array for meal is ["tasty"].
- explanation (string) - when available, provides the rationale for the sentiment; set the explain setting to true to enable it.
Example:
"sentiment_expressions": [
{
"sentence_index": 0,
"offset": 0,
"length": 32,
"polarity": "positive",
"reasons": ["close"],
"targets": ["location"]
},
{
"sentence_index": 0,
"offset": 38,
"length": 29,
"polarity": "negative",
"reasons": ["disrespectful"],
"targets": ["staff"]
}
]
Entities
The entities_summary section is an array of named entity objects detected in the text.
The section exists if named entities are detected and the entities setting is either omitted or set to true.
Every entity contains the following attributes:
- name (string) - the most complete name of the entity in the text, out of all its mentions
- ref_lemma (string) - when available, the dictionary form of the entity in the reference language (English), regardless of the input language
- type (string or array of strings) - the type of the entity, such as person, organization, numeric, amount_of_money, place. Certain entities, like countries, may have several types (because a country is both a place and an organization).
- subtype (string) - a string indicating the subtype of the entity
- mentions (array of objects) - a set of instances where the entity was mentioned in the text
Every mention contains the following attributes:
- offset (unsigned integer) - zero-based offset where the mention starts
- length (unsigned integer) - length of the mention
- sentence_index (unsigned integer) - zero-based index of the sentence containing the mention
- text (string) - fragment of text containing the mention (only included if the snippets setting is set to true)
Example:
"entities_summary": [
{
"type": "person",
"name": "John Smith",
"ref_lemma": "John Smith",
"mentions": [
{
"sentence_index": 0,
"offset": 0,
"length": 10 }
]
}
,
{
"type": [ "organization", "place" ]
,
"name": "UK",
"ref_lemma": "U.K.",
"mentions": [
{
"sentence_index": 0,
"offset": 40,
"length": 2 }
]
}
]
Entity Types and Subtypes
The currently supported entity types are:
- person, with optional subtypes: fictional_character, important_person, spiritual_being
- organization (note that a country is both an organization and a place)
- place
- time_range
- date
- time
- hashtag
- email
- amount_of_money
- phone - a phone number, either domestic or international, in a variety of formats
- role (a social role, e.g. a position in an organization)
- software
- website (URL), with an optional subtype: tor for Onion links; note that web services may also have the software type assigned
- weight
- bank_account - only the IBAN format is supported; subtype: iban
- credit_card, with optional subtypes: visa, mastercard, american_express, diners_club, discovery, jcb, unionpay
- coordinates (GPS coordinates)
- credential, with optional subtypes: md5, sha-1
- crypto, with optional subtypes: bitcoin, ethereum, monero, monero_payment_id, litecoin, dash
- event
- file - only Windows pathnames are supported; subtypes: windows, facebook (for images downloaded from Facebook)
- flight_code
- identifier
- ip_address, with subtypes: v4, v6
- mac_address
- numeric (an unclassified numeric entity)
- username
Topics
The topics section is an array of topics (subjects, domains, themes in other terms) detected in the text.
The section exists if topics are detected and the topics setting is either omitted or set to true.
By default, a topic is a string. If the topic_stats setting is set to true, then every entry in the array contains:
- topic (string) - the topic itself
- coverage (floating-point number) - a number between 0 and 1, indicating the ratio of the number of sentences where the topic is detected to the total number of sentences
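Example (with topic_stats set to true; the values are illustrative):
"topics": [
{
"topic": "cryptocurrency",
"coverage": 0.5
}
]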
Long-Term Memory
The memory section contains optional context to pass to the settings in subsequent messages in the same conversation thread. See Context and Long-Term Memory for more details.
Low-Level: Sentences, Phrases, and Words
Tisane allows obtaining more in-depth data, specifically:
- sentences and their corrected form, if a misspelling was detected
- lexical chunks and their grammatical and stylistic features
- parse trees and phrases
The sentence_list section is generated if the words or the parses setting is set to true.
Every sentence structure in the list contains:
- offset (unsigned integer) - zero-based offset where the sentence starts
- length (unsigned integer) - length of the sentence
- text (string) - the sentence itself
- corrected_text (string) - if a misspelling was detected and spellchecking is active, contains the automatically corrected text
- words (array of structures) - if the words setting is set to true, contains extended information about every lexical chunk. (The term "word" is used for the sake of simplicity; however, it may not be linguistically correct to equate lexical chunks with words.)
- parse_tree (object) - if the parses setting is set to true, contains information about the parse tree and the phrases detected in the sentence
- nbest_parses (array of parse objects) - if the parses setting is set to true and the deterministic setting is set to false, contains information about the parse trees that were deemed close enough to the best one, but not the best
Words
Every lexical chunk ("word") structure in the words array contains:
- type (string) - the type of the element: punctuation for punctuation marks, numeral for numerals, or word for everything else
- text (string) - the text of the element
- offset (unsigned integer) - zero-based offset where the element starts
- length (unsigned integer) - length of the element
- corrected_text (string) - if a misspelling is detected, the corrected form
- lettercase (string) - the original letter case: upper, capitalized, or mixed. If lowercase or caseless, the attribute is omitted.
- stopword (boolean) - determines whether the word is a stopword
- grammar (array of strings or structures) - the list of grammar features associatedated with the word. If the feature_standard setting is defined as native, every feature is an object containing a numeral (index) and a string (value); otherwise, every feature is a plain string.
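Example (a hypothetical word entry; the grammar strings assume the default ud feature standard):
{
"type": "word",
"text": "staff",
"offset": 32,
"length": 5,
"stopword": false,
"grammar": ["NOUN"]
}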
Advanced
For lexical words only:
- role (string) - the semantic role, like agent or patient. Note that in the passive voice, the semantic roles are the reverse of the syntactic roles. E.g. in a sentence like "The car was driven by David", car is the patient and David is the agent.
- numeric_value (floating-point number) - the numeric value, if the chunk has a value associated with it
- family (integer number) - the ID of the family associated with the disambiguated word-sense of the lexical chunk
- definition (string) - the definition of the family, if the fetch_definitions setting is set to true
- lexeme (integer number) - the ID of the lexeme entry associated with the disambiguated word-sense of the lexical chunk
- nondictionary_pattern (integer number) - the ID of a non-dictionary pattern that matched, if the word was not in the language model but was classified by the non-dictionary heuristics
- style (array of strings or structures) - the list of style features associated with the word. Only output if the feature_standard setting is set to native or description
- semantics (array of strings or structures) - the list of semantic features associated with the word. Only output if the feature_standard setting is set to native or description
- segmentation (structure) - information about the selected segmentation, if there are several possibilities to segment the current lexical chunk and the deterministic setting is set to false. A segmentation is simply an array of word structures.
- other_segmentations (array of structures) - information about the segmentations deemed incorrect during the disambiguation process. Every entry has the same structure as the segmentation structure.
- nbest_senses (array of structures) - when the deterministic setting is set to false, a set of hypotheses that were deemed incorrect by the disambiguation process. Every hypothesis contains the attributes grammar, style, and semantics, identical in structure to their counterparts above, and senses, an array of word-senses associated with the hypothesis. Every sense has a family, the ID of the associated family, and, if the fetch_definitions setting is set to true, the definition and ref_lemma of that family.
For punctuation marks only:
- id (integer number) - the ID of the punctuation mark
- behavior (string) - the behavior code of the punctuation mark. Values: sentenceTerminator, genericComma, bracketStart, bracketEnd, scopeDelimiter, hyphen, quoteStart, quoteEnd, listComma (for East Asian enumeration commas like 、)
Parse Trees and Phrases
Every parse tree, or more accurately, parse forest, is a collection of phrases, hierarchically linked to each other.
At the top level of the parse, there is a numeric id and an array of root phrases under the phrases element. Every phrase may have children phrases. Every phrase has the following attributes:
- type (string) - a Penn treebank phrase tag denoting the type of the phrase, e.g. S, VP, NP, etc.
- family (integer number) - an ID of the phrase family
- offset (unsigned integer) - a zero-based offset where the phrase starts
- length (unsigned integer) - the span of the phrase
- role (string) - the semantic role of the phrase, if any, analogous to that of the words
- text (string) - the phrase text, where the phrase members are delimited by the vertical bar character. Children phrases are enclosed in brackets. E.g. driven|by|David or (The|car)|was|(driven|by|David).
Example:
"parse_tree": {
"id": 4,
"phrases": [
{
"type": "S",
"family": 1451,
"offset": 0,
"length": 27,
"text": "(The|car)|was|(driven|by|David)",
"children": [
{
"type": "NP",
"family": 1081,
"offset": 0,
"length": 7,
"text": "The|car",
"role": "patient"
},
{
"type": "VP",
"family": 1172,
"offset": 12,
"length": 15,
"text": "driven|by|David",
"role": "verb"
}
      ]
    }
  ]
}
Context-Aware Spelling Correction
Tisane supports automatic, context-aware spelling correction. When the language model does not recognize a word, whether because of a misspelling or a deliberate obfuscation, Tisane attempts to deduce the intended meaning.
When or if it is found, Tisane adds the corrected_text attribute to the word (if the words / lexical chunks are returned) and to the sentence (if the sentence text is generated). The sentence-level corrected_text is displayed if words or parses is set to true.
Note that as Tisane works with large dictionaries, you may need to exclude more esoteric terms by using the min_generic_frequency setting.
Note that the invocation of spell-checking does not depend on whether the sentences and words sections are generated in the output. Spellchecking can be disabled by setting disable_spellcheck to true. Another option is to enable spellchecking for lowercase words only, thus excluding potential proper nouns in languages that support capitalization; to avoid spell-checking capitalized and uppercase words, set lowercase_spellcheck_only to true.
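For example, to keep spellchecking on but restrict it to lowercase words (a minimal settings sketch):
{"lowercase_spellcheck_only": true}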
Settings Reference
The purpose of the settings structure is to:
- provide cues about the content being sent to improve the results
- customize the output and select sections to be shown
- define standards and formats in use
- define and calculate the signal to noise ranking
All settings are optional. To leave all settings at their defaults, simply provide an empty object ({}).
Content Cues and Instructions
format (string) - the format of the content. Some policies are applied depending on the format. Certain logic in the underlying language models may require the content to be of a certain format (e.g. logic applied to reviews may seek sentiment more aggressively). The default format is empty / undefined. The format values are:
- review - a review of a product or a service, or any other review. Normally, the underlying language models seek sentiment expressions more aggressively in reviews.
- dialogue - a comment or a post which is part of a dialogue. An example of logic specific to a dialogue is name-calling. A single word like "idiot" would not be a personal attack in any other format, but it is certainly a personal attack when part of a dialogue.
- shortpost - a microblogging post, e.g. a tweet.
- longform - a long post or an article.
- proofread - a post which was proofread. In proofread posts, the spellchecking is switched off.
- alias - a nickname in an online community.
- search - a search query. Search queries may not always be grammatically correct. Certain topics and items that we may otherwise let pass are tagged with the search format.
disable_spellcheck (boolean) - determines whether the automatic spellchecking is to be disabled. Default: false.
lowercase_spellcheck_only (boolean) - determines whether the automatic spellchecking is only to be applied to words in lowercase. Default: false.
min_generic_frequency (integer) - allows excluding more esoteric terms; the valid values are 0 through 10.
subscope (boolean) - enables sub-scope parsing, for scenarios like hashtag and URL parsing, and obfuscated content (e.g. ihateyou). Default: false.
lang_detect_segmentation_regex (string) - allows defining custom language detection fragment boundaries. For example, if multiple languages may be used in different sentences of the same text, you may want to define the regex as (([\r\n]|[.!?][ ])).
domain_factors (set of string-number pairs) - provides session-scope cues for the domains of discourse. This is a powerful tool that allows tailoring the result based on the use case. The format is: the family ID of the domain as a key, and the multiplication factor as a value (e.g. "12345": 5.0). For example, when processing text looking for criminal activity, we may want to weight the domains relevant to drugs, firearms, and crime higher: "domain_factors": {"31058": 5.0, "45220": 5.0, "14112": 5.0, "14509": 3.0, "28309": 5.0, "43220": 5.0, "34581": 5.0}. The same device can be used to eliminate noise coming from domains we know to be irrelevant, by setting the factor to a value lower than 1.
when (date string, format YYYY-MM-DD) - (TO BE IMPLEMENTED) indicates when the utterance was uttered. The purpose is to prune word senses that were not available at a particular point in time. For example, the words troll, mail, and post had nothing to do with the Internet 300 years ago because there was no Internet, and so in a text written hundreds of years ago, we should ignore the word senses that emerged only recently.
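For example, a settings object combining several of the cues above (the values are illustrative; the domain factor reuses a family ID from the example above):
{"format": "dialogue", "min_generic_frequency": 3, "domain_factors": {"31058": 5.0}}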
Output Customization
abuse (boolean) - output instances of abusive content (default: true)
sentiment (boolean) - output sentiment-bearing snippets (default: true)
document_sentiment (boolean) - output document-level sentiment (default: false)
entities (boolean) - output entities (default: true)
topics (boolean) - output topics (default: true), with two more relevant settings:
- topic_stats (boolean) - include coverage statistics in the topic output (default: false). When set, the topic is an object containing the attributes topic (string) and coverage (floating-point number). The coverage indicates the share of sentences touching the topic among all the sentences.
- optimize_topics (boolean) - if true, less specific topics are removed if they are parts of more specific topics. For example, when the topic is cryptocurrency, the optimization removes finance.
words (boolean) - output the lexical chunks / words for every sentence (default: false). In languages without white spaces (Chinese, Japanese, Thai), the tokens are tokenized words. In languages with compounds (e.g. German, Dutch, Norwegian), the compounds are split.
fetch_definitions (boolean) - include definitions of the words in the output (default: false). Only relevant when the words setting is true.
parses (boolean) - output parse forests of phrases
deterministic (boolean) - whether the n-best senses and n-best parses are to be output in addition to the detected sense. If true, only the detected sense is output. Default: true.
snippets (boolean) - include the text snippets in the abuse, sentiment, and entities sections (default: false)
explain (boolean) - if true, a rationale for the abuse and sentiment snippets is provided when possible (see the explanation attribute)
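For example, a settings object that limits the output to abuse with snippets and explanations (a sketch; the unlisted settings keep their defaults):
{"sentiment": false, "entities": false, "topics": false, "snippets": true, "explain": true}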
Standards and Formats
feature_standard (string) - determines the standard used to output the features (grammar, style, semantics) in the response object. The supported standards are:
- ud: Universal Dependencies tags (default)
- penn: Penn treebank tags
- native: Tisane native feature codes
- description: Tisane native feature descriptions
Only the native Tisane standards (codes and descriptions) support style and semantic features.
topic_standard (string) - determines the standard used to output the topics in the response object. The supported standards are:
- iptc_code - IPTC topic taxonomy code
- iptc_description - IPTC topic taxonomy description
- iab_code - IAB topic taxonomy code
- iab_description - IAB topic taxonomy description
- native - Tisane domain description, coming from the family description (default)
sentiment_analysis_type (string) - the type of the sentiment analysis strategy. The values are:
- products_and_services - the most common sentiment analysis, of products and services
- entity - sentiment analysis with entities as targets
- creative_content_review - reviews of creative content (RESERVED)
- political_essay - political essays (RESERVED)
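For example, to receive Penn treebank tags and IAB topic codes (a minimal settings sketch):
{"feature_standard": "penn", "topic_standard": "iab_code", "sentiment_analysis_type": "products_and_services"}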
Context and Long-Term Memory
Human understanding of language is not a simple "sliding window" with scope limited to a sentence. Language is accompanied by gestures, visuals, and knowledge of the previous communication. Sometimes, code-words may be used to conceal the words' original meaning.
When detecting abuse, the name of an ethnicity or a religious group may not be offensive on its own, but when superimposed over a picture of an ape or a pig, it is meant to offend. When translating from a language without gender distinctions in verbs (like English) to a language with such distinctions (like Russian or Hebrew), there is no way to know from the utterance alone whether the speaker is female. When a scammer is collecting details piecemeal over a series of utterances, knowledge of the previous utterances is needed to take action.
Tisane's Memory module allows pre-initializing the analysis, as well as reassigning meanings, and more. The module is made of three simple components that are flexible enough for a variety of tasks:
Reassignments
Reassignments define attributes to set based on other attributes. This makes it possible to:
- assign gender to 1st or 2nd person verbs, generating accurate translations
- overwrite the original meaning of a group of words, with all their inflected forms, to analyze code-words and secret language
- add an additional feature or a hypernym to a family
and more, within the scope of a call.
The assign section is an array of structures defining:
- if - the conditions to match:
  - regex - a regular expression (RE2 syntax)
  - family - a family ID
  - features - a list of feature values. A feature is a structure with an index and a value. For example: {"index":1, "value":"NOUN"}.
  - hypernym - a family ID of a hypernym
- then - the attributes to assign:
  - family - a family ID
  - features - a list of feature values. A feature is a structure with an index and a value. For example: {"index":1, "value":"NOUN"}.
  - hypernym - a family ID of a hypernym
Examples:
- the speaker is female: "assign":[{"if":{"features":[{"index":9,"value":"1"}]},"then":{"features":[{"index":5,"value":"F"}]}}]
- assume that a mention of a container refers to an illegal item: "assign":[{"if":{"family":26888},"then":{"hypernym":123078}}]
Flags
An array of flag structures that add context. A flag is a structure with an index and a value. For example: {"index":36, "value":"WFH"}.
Aside from the flags returned in the memory section of the response, these flags can be set:
- {"index":36, "value":"PEBD"} (agents_of_bad_things) - the context is about a bad player or an agent responsible for bad things
- {"index":36, "value":"BADANML"} (bad_animal) - the context is an animal that symbolizes bad qualities (e.g. pig, ape, snake, etc.)
- {"index":36, "value":"BULKMSG"} (bulk_message) - the message was sent in bulk
- {"index":36, "value":"DETHR"} (death_related) - the context is something related to death
- {"index":36, "value":"EARNMUCH"} (make_money) - the context is related to making money
- {"index":36, "value":"IDEP"} (my_departure) - the author of the text mentioned departing
- {"index":36, "value":"SECO"} (sexually_conservative) - any attempt to exchange photos, or anything that may be either sexual or non-sexual, is to be deemed sexual
- {"index":36, "value":"TRPA"} (trusted_party) - the author of the text claims to be a trusted party (e.g. a relative or a spouse)
- {"index":36, "value":"WSTE"} (waste) - the context is about waste, organic or inorganic
- {"index":36, "value":"WOPR"} (won_prize) - prize or money winning was mentioned or implied
- {"index":36, "value":"WFH"} (work_from_home) - work from home was mentioned
- {"index":5, "value":"ORG"} (organization) - an organization was mentioned
- {"index":5, "value":"ROLE"} (role) - a role or a position was mentioned
Antecedents
The section contains structures to be used in coreference resolution. The attributes are:
- family - the family ID of the antecedent
- features - the list of features. Every feature is a structure with an index and a value. For example: {"index":36, "value":"WFH"}.
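Putting the three components together, a memory object might look like this (a sketch only; the assign key is documented above, while the flags and antecedents key names are assumed here from the component names, and the IDs reuse the examples above):
"memory": {
"assign": [{"if": {"family": 26888}, "then": {"hypernym": 123078}}],
"flags": [{"index": 36, "value": "WFH"}],
"antecedents": [{"family": 26888, "features": [{"index": 1, "value": "NOUN"}]}]
}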
Signal to Noise Ranking
When we're studying a set of posts commenting on an issue or an article, we may want to prioritize the ones more relevant to the topic and containing more reason and logic than emotion. This is what the signal to noise ranking is meant to achieve.
The signal to noise ranking is made of two parts:
- Determine the most relevant concepts. This part may be omitted, depending on the use case scenario (e.g. we want to track posts most relevant to a particular set of issues).
- Rank the actual post in relevance to these concepts.
To determine the most relevant concepts, we need to analyze the headline or the article itself. The headline is usually enough. We need two additional settings:
- keyword_features (an object of strings with string values) - determines the features to look for in a word. When such a feature is found, the family ID is added to the set of potentially relevant family IDs.
- stop_hypernyms (an array of integers) - optional; if a potentially relevant family ID has a hypernym listed in this setting, it will not be considered. For example, we may have extracted a set of nouns from the headline but not be interested in abstractions or feelings: from a headline like "Fear and Loathing in Las Vegas", we want Las Vegas only.
If keyword_features is provided in the settings, the response will have a special attribute, relevant, containing a set of family IDs.
At the second stage, when ranking the actual posts or comments for relevance, this array is to be supplied among the settings. The ranking is boosted when the domains, the hypernyms, or the families related to those in the relevant array are mentioned, and when negative or positive sentiment is linked to aspects; it is penalized when the negativity is not linked to aspects, or when abuse of any kind is found. The latter consideration may be disabled, e.g. when we are looking for specific criminal content: when the abuse_not_noise parameter is specified and set to true, abuse is not penalized by the ranking calculations.
To sum it up, in order to calculate the signal to noise ranking:
1. Analyze the headline with keyword_features and, optionally, stop_hypernyms in the settings. Obtain the relevant attribute.
2. When analyzing the posts or the comments, specify the relevant attribute obtained in step 1.
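For example (a sketch; the keyword_features pair, the stop_hypernyms ID, and the returned family IDs are assumptions for illustration):
Step 1 - analyze the headline:
{"language": "en", "content": "Fear and Loathing in Las Vegas", "settings": {"keyword_features": {"1": "NOUN"}, "stop_hypernyms": [98765]}}
Step 2 - supply the returned relevant array when ranking each post:
{"language": "en", "content": "I've been to Las Vegas twice.", "settings": {"relevant": [12345, 67890]}}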