Articles

OpenNLP

March 30 2019 by oxnz

The Apache OpenNLP library is a machine learning based toolkit for the processing of natural language text.

The goal of the OpenNLP project is to create a mature toolkit for common natural language processing tasks, such as those listed below. An additional goal is to provide a large number of pre-built models for a variety of languages, as well as the annotated text resources that those models are derived from.

The Apache OpenNLP library contains several components, enabling one to build a full natural language processing pipeline:
- sentence detector
- tokenizer
- name finder
- document categorizer
- part-of-speech tagger
- chunker
- parser
- coreference resolution
Interface
- API
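```
// Generic usage pattern: SomeModel, ToolName and executeTask are placeholders
// for a concrete model class, tool class and method.
try (InputStream modelIn = new FileInputStream("lang-model-name.bin")) {
  SomeModel model = new SomeModel(modelIn);  // load the trained model
  ToolName toolName = new ToolName(model);   // create the tool from the model
  String[] output = toolName.executeTask("example text");
}
```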
- CLI
```
opennlp ToolName
opennlp ToolName help
opennlp ToolName lang-model-name.bin < input.txt > output.txt
```
Craft a search engine

March 18 2019 by oxnz

Architecture

Document -> DocumentParser -> Indexer -> index

Query -> QueryParser -> Searcher -> IntentDetector -> QueryRewriter
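
The indexing and query halves of this pipeline can be illustrated with a toy inverted index. This is only a sketch under stated assumptions, not the design of any particular engine: the class name `TinyIndex`, the lowercase/non-alphanumeric tokenization, and the AND semantics for multi-term queries are all invented for illustration, and intent detection and query rewriting are left out.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical toy index: DocumentParser + Indexer are collapsed into add(),
// QueryParser + Searcher are collapsed into search().
final class TinyIndex {
    // term -> sorted ids of the documents that contain the term
    private final Map<String, TreeSet<Integer>> postings = new HashMap<>();

    // Assumed tokenization: lowercase, split on runs of non-alphanumerics.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    void add(int docId, String text) {
        for (String term : tokenize(text)) {
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
    }

    // Returns the ids of documents containing every query term (AND semantics).
    List<Integer> search(String query) {
        TreeSet<Integer> result = null;
        for (String term : tokenize(query)) {
            TreeSet<Integer> docs = postings.getOrDefault(term, new TreeSet<>());
            if (result == null) {
                result = new TreeSet<>(docs);
            } else {
                result.retainAll(docs);
            }
        }
        return result == null ? new ArrayList<>() : new ArrayList<>(result);
    }

    public static void main(String[] args) {
        TinyIndex index = new TinyIndex();
        index.add(1, "Apache ZooKeeper coordination service");
        index.add(2, "Apache Kafka uses ZooKeeper");
        System.out.println(index.search("apache zookeeper")); // [1, 2]
        System.out.println(index.search("kafka"));            // [2]
    }
}
```

A real engine would add ranking, and the IntentDetector and QueryRewriter stages would sit around `search`; the sketch only shows where the index fits between the two flows.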

Query Intent Detection
- benefits
  - display semantically enriched search results.
  - improve ranking results by triggering a vertical search engine in a certain domain
- challenging task
  - queries are usually short
  - requires more context beyond the keywords
  - number of intent categories could be very high
- approaches
  - rule-based (precise, but coverage is low and it scales poorly)
    - defining patterns for each intent class
  - statistical methods
    - defining discriminative features for queries to run statistical models
    - supervised/unsupervised
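
The rule-based approach can be sketched as one hand-written pattern per intent class. The intent names (`weather`, `shopping`) and the keyword patterns below are hypothetical and chosen only for illustration; they also show why coverage is low, since any query outside the patterns falls through to `unknown`.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical rule-based detector: one regular expression per intent class.
final class RuleIntentDetector {
    // Insertion order decides which rule wins when several match.
    private static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("weather", Pattern.compile("\\b(weather|forecast|temperature)\\b"));
        RULES.put("shopping", Pattern.compile("\\b(buy|price|cheap)\\b"));
    }

    // Returns the first intent whose pattern matches, or "unknown".
    static String detect(String query) {
        String q = query.toLowerCase(Locale.ROOT);
        for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
            if (rule.getValue().matcher(q).find()) {
                return rule.getKey();
            }
        }
        return "unknown";
    }
}
```

Each new intent class needs its own patterns, which is the scaling problem noted in the list.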
CNN

Use a CNN to extract query vector representations as the features for query classification.

In this model, queries are represented as vectors so that semantically similar queries can be captured by embedding them into a vector space.

Words are represented by word vectors (such as word2vec).
- supervised method
  - feature engineering (requires domain knowledge)
  - leads to state-of-the-art systems
  - uses various types of features
    - search sessions
    - click-through data
    - Wikipedia concepts
- CNN method
  - does not require engineering query features
  - uses a CNN to automatically extract query vectors as the features
  - architecture
    1. training the model parameters offline
      - utilize the labeled queries to learn the parameters of the CNN and the intent classifier
    2. running the model over new queries online
```
# train
[Queries with intents] -> (CNN) -> [Query vectors with intents] -> [Classifier]
# predict
[New query] -> (CNN) -> [Query vector] -> (Classifier) -> [Predicted intent]
```
Search Session

References
- http://people.cs.pitt.edu/~hashemi/papers/QRUMS2016_HBHashemi.pdf
Interesting

March 4 2019 by oxnz
- Homomorphic encryption
  - https://www.microsoft.com/en-us/research/project/homomorphic-encryption/
  - https://www.microsoft.com/en-us/research/project/microsoft-seal/
Zookeeper

November 10 2017 by oxnz

Apache ZooKeeper is a distributed coordination service for distributed systems. By providing a robust implementation of a few basic operations, ZooKeeper simplifies the implementation of many advanced patterns in distributed systems.

As a Distributed File System
- zNode
  - ephemeral zNodes
    - disappear when the session of their owner ends
    - a typical use case is host discovery in a distributed system: each server publishes its IP address in an ephemeral node. If a server loses connectivity with ZooKeeper and fails to reconnect within the session timeout, its information is deleted
  - sequential zNodes
    - their names are automatically assigned a sequence number suffix, which is strictly increasing and assigned by ZooKeeper when the zNode is created
    - An easy way of doing leader election with ZooKeeper is to let every server publish its information in a zNode that is both sequential and ephemeral. Then, whichever server has the lowest sequential zNode is the leader. If the leader, or any other server for that matter, goes offline, its session dies and its ephemeral node is removed, and all other servers can observe who the new leader is.
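
The selection step of this leader election can be sketched as a pure function over the names of the live zNodes. This is a minimal sketch, not a full recipe: the class `LeaderPicker` and the `host-x-` name prefixes are hypothetical, and in a real client the list would come from reading the children of the election zNode (for example with the ZooKeeper client's `getChildren`). It relies on ZooKeeper appending a zero-padded 10-digit counter to sequential zNode names.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Hypothetical helper: picks the leader among the currently live zNodes.
final class LeaderPicker {
    // Sequential zNode names end with a zero-padded 10-digit counter.
    static int sequenceOf(String znodeName) {
        return Integer.parseInt(znodeName.substring(znodeName.length() - 10));
    }

    // The live node with the lowest sequence number is the leader;
    // the result is empty when no node is alive.
    static Optional<String> leader(List<String> liveNodes) {
        return liveNodes.stream()
                .min(Comparator.comparingInt(LeaderPicker::sequenceOf));
    }
}
```

When the leader's session ends, its ephemeral zNode vanishes from the list, and calling `leader` again on the remaining names yields the next lowest sequence number.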
As a Message Queue

Clients can register watchers on zNodes. This allows clients to be notified of the next update to that zNode.

ZooKeeper gives guarantees about ordering. Every update is part of a total ordering. All clients might not be at the exact same point in time, but they will all see every update in the same order.
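
The "next update" behaviour of a watch can be modelled with a toy in-memory node. This is a sketch, not the ZooKeeper client API: `WatchedNode` and its methods are hypothetical stand-ins, and the model assumes the classic one-shot watch, where a client that wants to keep following a zNode re-registers each time it is notified.

```java
import java.util.ArrayList;
import java.util.List;

// Toy in-memory stand-in for a zNode; it does not talk to ZooKeeper.
final class WatchedNode {
    private String data = "";
    private final List<Runnable> watchers = new ArrayList<>();

    // Reads the data and optionally leaves a one-shot watch behind.
    String getData(Runnable watcher) {
        if (watcher != null) {
            watchers.add(watcher);
        }
        return data;
    }

    // Updates the data and fires each pending watch exactly once.
    void setData(String newData) {
        data = newData;
        List<Runnable> pending = new ArrayList<>(watchers);
        watchers.clear(); // a watch is consumed by the update that triggers it
        for (Runnable w : pending) {
            w.run();
        }
    }

    public static void main(String[] args) {
        WatchedNode node = new WatchedNode();
        List<String> seen = new ArrayList<>();
        // Re-register inside the callback to keep following updates in order.
        Runnable[] follow = new Runnable[1];
        follow[0] = () -> seen.add(node.getData(follow[0]));
        node.getData(follow[0]);
        node.setData("m1");
        node.setData("m2");
        System.out.println(seen); // [m1, m2]
    }
}
```

The notification itself carries no payload in this model; the callback reads the node again, which is also where it renews the watch.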

The CAP Theorem

Consistency, Availability and Partition tolerance are the three properties considered in the CAP theorem. The theorem states that a distributed system can only provide two of these three properties. ZooKeeper is a CP system with regard to the CAP theorem. This implies that it sacrifices availability in order to achieve consistency and partition tolerance. In other words, if it cannot guarantee correct behaviour, it will not respond to queries.

Consistency Algorithm

ZooKeeper uses Zab (ZooKeeper Atomic Broadcast), a protocol similar to Paxos.

References
- https://www.elastic.co/blog/found-zookeeper-king-of-coordination#operations-yet-another-system-to-manage
Data Structures

October 11 2017 by oxnz
Mach Message

July 8 2017 by oxnz
```
User Tasks  <-> Mach Message <-> Kernel Services

Task <-> Msg <-> Port <-> Msg <-> Kernel
```
- send: ordered
- send-once: unordered
Redis

July 3 2017 by oxnz

Redis (written in ANSI C) is an open source (BSD licensed), in-memory data structure store, used as a database, cache and message broker. It supports data structures such as strings, hashes, lists, sets, sorted sets with range queries, bitmaps, hyperloglogs, geospatial indexes with radius queries and streams. Redis has built-in replication, Lua scripting, LRU eviction, transactions and different levels of on-disk persistence, and provides high availability via Redis Sentinel and automatic partitioning with Redis Cluster.

Redis (REmote DIctionary Server) is also referred to as a data structure server.

Features
- NoSQL
- Transactions
  - serialized and atomic
  - if the append-only log is corrupted, the partial transaction can be removed and the server restarted
  - check-and-set (CAS) (v2.2)
  - does not support rollbacks
    - Redis commands can fail only if called with the wrong syntax,
    - or against keys holding the wrong data type
    - Redis is internally simpler and faster because it does not need the ability to roll back
  - MULTI;...;EXEC/DISCARD
- Pub/Sub
- Lua scripting
- Keys with a limited time-to-live (TTL)
  - EXPIRE key TTL
- LRU eviction of keys
- Automatic failover
- LFU (Least Frequently Used) eviction mode (v4.0)
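
The no-rollback behaviour of transactions listed above can be illustrated with a toy in-memory model. This is a sketch and not a Redis client: `ToyTransaction` and its `set`/`incr` methods are hypothetical, and the only thing modelled is that queued commands run one after another at EXEC time, and that a command failing against a key of the wrong type does not undo or stop the others.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Toy in-memory model of MULTI/EXEC; it does not talk to a Redis server.
final class ToyTransaction {
    private final Map<String, Object> store;
    private final List<Function<Map<String, Object>, Object>> queued = new ArrayList<>();

    ToyTransaction(Map<String, Object> store) {
        this.store = store;
    }

    // Queues a SET-like command; nothing runs until exec().
    ToyTransaction set(String key, String value) {
        queued.add(s -> {
            s.put(key, value);
            return "OK";
        });
        return this;
    }

    // Queues an INCR-like command that fails on non-string values.
    ToyTransaction incr(String key) {
        queued.add(s -> {
            Object cur = s.getOrDefault(key, "0");
            if (!(cur instanceof String)) {
                throw new IllegalStateException("WRONGTYPE: key holds a non-string value");
            }
            long next = Long.parseLong((String) cur) + 1;
            s.put(key, Long.toString(next));
            return next;
        });
        return this;
    }

    // Runs every queued command in order and returns one reply per command.
    List<Object> exec() {
        List<Object> replies = new ArrayList<>();
        for (Function<Map<String, Object>, Object> command : queued) {
            try {
                replies.add(command.apply(store));
            } catch (RuntimeException e) {
                replies.add(e); // record the error, keep going, undo nothing
            }
        }
        queued.clear();
        return replies;
    }
}
```

If `mylist` already holds a list, then `new ToyTransaction(store).set("a", "1").incr("mylist").set("b", "2").exec()` leaves both `a` and `b` set, and the second reply is the error.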
Limitations
- single threaded
- significant overhead for persistence
Elasticsearch

July 2 2017 by oxnz

Elasticsearch API.
Kafka Primer

June 29 2017 by oxnz

Introduction

Kafka is an open source distributed system built to use ZooKeeper. The basic responsibility of ZooKeeper is to coordinate the different nodes in a cluster. Offsets are also committed to ZooKeeper periodically, so that if a node fails, processing can recover from the previously committed offset.

ZooKeeper is also responsible for configuration management, leader detection, detecting if any node leaves or joins the cluster, synchronization, etc.
- Topic (a stream of messages belonging to the same type)
- Producer (can publish messages to a topic)
- Brokers (a set of servers where the published messages are stored)
- Consumer (subscribes to various topics and pulls data from the brokers)
- Consumer Group (every Kafka consumer group consists of one or more consumers that jointly consume a set of subscribed topics)
Every partition in Kafka has one server which plays the role of the Leader, and zero or more servers that act as Followers. The Leader handles all read and write requests for the partition, while the Followers passively replicate the Leader. If the Leader fails, one of the Followers takes over the role of the Leader. Because leadership is assigned per partition, this also spreads load across the servers.

Replicas are the list of nodes that replicate the log for a particular partition, irrespective of whether they play the role of the Leader. ISR stands for In-Sync Replicas: the set of replicas that are in sync with the Leader.

If a Replica stays out of the ISR for a long time, it means that the Follower is unable to fetch data as fast as the Leader accumulates it.
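
ISR membership can be sketched as a time-based lag check. This is a simplified model under stated assumptions, not broker code: the class `IsrSketch`, the broker names, and the timestamps are hypothetical, and the threshold plays the role of the broker setting `replica.lag.time.max.ms`.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical model: a replica is in sync if it caught up with the
// leader's log recently enough.
final class IsrSketch {
    // lastCaughtUpMs maps each replica to the last time it was fully caught up.
    static Set<String> inSyncReplicas(Map<String, Long> lastCaughtUpMs,
                                      long nowMs, long maxLagMs) {
        Set<String> isr = new TreeSet<>();
        for (Map.Entry<String, Long> e : lastCaughtUpMs.entrySet()) {
            if (nowMs - e.getValue() <= maxLagMs) {
                isr.add(e.getKey());
            }
        }
        return isr;
    }
}
```

A follower that fetches more slowly than the leader appends never reaches the end of the log, so its last caught-up time stops advancing and it eventually drops out of the set.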

Partitions are used for fail-over and parallel processing.

C++ Libraries

June 15 2017 by oxnz

