Articles
-
OpenNLP
by oxnz
The Apache OpenNLP library is a machine learning based toolkit for the processing of natural language text.
The goal of the OpenNLP project is to create a mature toolkit for common NLP tasks such as those listed below. An additional goal is to provide a large number of pre-built models for a variety of languages, as well as the annotated text resources that those models are derived from.
The Apache OpenNLP library contains several components, enabling one to build a full natural language processing pipeline.
- sentence detector
- tokenizer
- name finder
- document categorizer
- part-of-speech tagger
- chunker
- parser
- coreference resolution
interface
- API
  try (InputStream modelIn = new FileInputStream("lang-model-name.bin")) {
      // SomeModel and ToolName are placeholders for a concrete model/tool pair
      SomeModel model = new SomeModel(modelIn);
      ToolName toolName = new ToolName(model);
      String[] output = toolName.executeTask("example text");
  }
- CLI
  opennlp ToolName
  opennlp ToolName help
  opennlp ToolName lang-model-name.bin < input.txt > output.txt
-
Craft a search engine
by oxnz
Architecture
Document -> DocumentParser -> Indexer -> index
Query -> QueryParser -> Searcher -> IntentDetector -> QueryRewriter
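To make the two pipelines concrete, here is a minimal sketch of the indexing side and the search side using an in-memory inverted index. The class name `TinySearchEngine` and its methods are hypothetical, the "parser" is just lowercasing plus splitting, and the intent detection and query rewriting stages are left out.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical miniature of the Document -> index and Query -> Searcher paths. */
final class TinySearchEngine {
    // The index: term -> ids of the documents that contain it.
    private final Map<String, Set<Integer>> index = new HashMap<>();

    /** Stand-in for DocumentParser/QueryParser: lowercase, split on non-alphanumerics. */
    static List<String> parse(String text) {
        List<String> terms = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) terms.add(t);
        }
        return terms;
    }

    /** Indexer: record every term of the document in the index. */
    void add(int docId, String text) {
        for (String term : parse(text)) {
            index.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
    }

    /** Searcher: ids of the documents that contain all query terms (AND semantics). */
    Set<Integer> search(String query) {
        Set<Integer> result = null;
        for (String term : parse(query)) {
            Set<Integer> postings = index.getOrDefault(term, Collections.emptySet());
            if (result == null) result = new TreeSet<>(postings);
            else result.retainAll(postings);
        }
        return result == null ? new TreeSet<>() : result;
    }

    public static void main(String[] args) {
        TinySearchEngine engine = new TinySearchEngine();
        engine.add(1, "Apache ZooKeeper coordination service");
        engine.add(2, "Apache Kafka uses ZooKeeper");
        System.out.println(engine.search("apache zookeeper")); // [1, 2]
        System.out.println(engine.search("kafka"));            // [2]
    }
}
```

A real engine would add ranking, so this only shows where each stage sits.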
Query Intent Detection
- benefits
  - display semantically enriched search results
  - improve ranking results by triggering a vertical search engine in a certain domain
- challenging task
  - queries are usually short
  - requires more context beyond the keywords
  - number of intent categories could be very high
- approaches
- rule-based (precise while coverage is low, bad for scaling)
- defining patterns for each intent class
- defining discriminative features for queries to run statistical models
- statistical methods
- supervised/unsupervised
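The rule-based approach (one pattern per intent class) can be sketched roughly as follows. The class `RuleBasedIntentDetector`, the intent names, and the regular expressions are all made up for illustration. It also shows why coverage is low: any query that no pattern anticipates gets no intent.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.regex.Pattern;

/** Hypothetical rule-based detector: one regex per intent class, first match wins. */
final class RuleBasedIntentDetector {
    // Insertion order doubles as rule priority.
    private final Map<String, Pattern> rules = new LinkedHashMap<>();

    void addRule(String intent, String regex) {
        rules.put(intent, Pattern.compile(regex, Pattern.CASE_INSENSITIVE));
    }

    /** Returns the first intent whose pattern occurs in the query, if any. */
    Optional<String> detect(String query) {
        for (Map.Entry<String, Pattern> rule : rules.entrySet()) {
            if (rule.getValue().matcher(query).find()) {
                return Optional.of(rule.getKey());
            }
        }
        return Optional.empty();
    }

    /** Example rules; the intents and patterns are assumptions, not from a real system. */
    static RuleBasedIntentDetector withExampleRules() {
        RuleBasedIntentDetector d = new RuleBasedIntentDetector();
        d.addRule("weather", "\\b(weather|forecast|temperature)\\b");
        d.addRule("navigation", "\\b(directions|route) to\\b");
        d.addRule("shopping", "\\b(buy|price of|cheap)\\b");
        return d;
    }

    public static void main(String[] args) {
        RuleBasedIntentDetector d = withExampleRules();
        System.out.println(d.detect("Weather in Paris tomorrow"));  // Optional[weather]
        System.out.println(d.detect("zookeeper leader election"));  // Optional.empty
    }
}
```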
CNN
A CNN extracts query vector representations as the features for query classification.
In this model, queries are represented as vectors so that semantically similar queries can be captured by embedding them into a vector space.
Word vector representations (such as word2vec)
- supervised method
  - feature engineering (requires domain knowledge)
  - leads to state-of-the-art systems
  - uses various types of features
    - search sessions
    - click-through data
    - Wikipedia concepts
- CNN method
  - does NOT require engineering query features
  - uses a CNN to automatically extract query vectors as the features
  - architecture
    - training the model parameters offline
      - utilize the labeled queries to learn the parameters of the CNN and the intent classifier
    - running the model over new queries online
# train
[Queries with intents] -> (CNN) -> [Query vectors with intents] -> [Classifier]
# predict
[New query] -> (CNN) -> [Query vector] -> (Classifier) -> [Predicted intent]
Search Session
References
-
Interesting
by oxnz
- Homomorphic encryption
https://www.microsoft.com/en-us/research/project/homomorphic-encryption/
https://www.microsoft.com/en-us/research/project/microsoft-seal/
-
Zookeeper
by oxnz
Apache ZooKeeper is a coordination service for distributed systems. By providing a robust implementation of a few basic operations, ZooKeeper simplifies the implementation of many advanced patterns in distributed systems.
Table of Contents
- As a Distributed File System
- As a Message Queue
- The CAP Theorem
- Consistency Algorithm
- References
As a Distributed File System
- zNode
  - ephemeral zNodes
    - disappear when the session of their owner ends
    - a typical use case is host discovery in a distributed system: each server publishes its IP address in an ephemeral node. If a server loses connectivity with ZooKeeper and fails to reconnect within the session timeout, its information is deleted
  - sequential zNodes
    - names are automatically assigned a sequence number suffix. This suffix is strictly increasing and is assigned by ZooKeeper when the zNode is created
    - An easy way of doing leader election with ZooKeeper is to let every server publish its information in a zNode that is both sequential and ephemeral. Then, whichever server has the lowest sequential zNode is the leader. If the leader, or any other server for that matter, goes offline, its session dies and its ephemeral node is removed, and all other servers can observe who the new leader is.
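The leader election recipe above can be simulated without a ZooKeeper cluster. The sketch below is an in-memory stand-in, not the ZooKeeper client API: `LeaderElectionSketch`, `join` and `sessionExpired` are hypothetical names, and a plain counter plays the role of the sequence suffix that ZooKeeper would assign.

```java
import java.util.Optional;
import java.util.SortedMap;
import java.util.TreeMap;

/** In-memory stand-in for a parent zNode with ephemeral sequential children. */
final class LeaderElectionSketch {
    private int nextSequence = 0; // strictly increasing, like the zNode name suffix
    private final SortedMap<Integer, String> children = new TreeMap<>(); // sequence -> server

    /** A server registers itself and receives the next sequence number. */
    int join(String server) {
        int seq = nextSequence++;
        children.put(seq, server);
        return seq;
    }

    /** The owner's session ended, so its ephemeral node disappears. */
    void sessionExpired(int seq) {
        children.remove(seq);
    }

    /** The leader is the server holding the lowest remaining sequence number. */
    Optional<String> leader() {
        return children.isEmpty()
                ? Optional.empty()
                : Optional.of(children.get(children.firstKey()));
    }

    public static void main(String[] args) {
        LeaderElectionSketch election = new LeaderElectionSketch();
        int a = election.join("server-a");
        election.join("server-b");
        election.join("server-c");
        System.out.println(election.leader()); // Optional[server-a]
        election.sessionExpired(a);            // the leader goes offline
        System.out.println(election.leader()); // Optional[server-b]
    }
}
```

Sequence numbers are never reused, so a server that rejoins after losing its session goes to the back of the line.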
As a Message Queue
Clients can register watchers on zNodes. A watcher notifies the client of the next update to that zNode.
ZooKeeper gives guarantees about ordering. Every update is part of a total ordering. All clients might not be at the exact same point in time, but they will all see every update in the same order.
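The "next update" wording matters: a watcher fires once and then has to be registered again. Here is a small in-memory sketch of that behaviour; `WatchedNode` and its methods are hypothetical and only imitate the idea, they are not the ZooKeeper client API.

```java
import java.util.ArrayList;
import java.util.List;

/** In-memory imitation of a zNode with one-shot watchers. */
final class WatchedNode {
    private String data;
    private List<Runnable> watchers = new ArrayList<>();

    WatchedNode(String data) {
        this.data = data;
    }

    /** Reads the data and registers a watcher for the next update only. */
    String getData(Runnable watcher) {
        watchers.add(watcher);
        return data;
    }

    /** Updates the data and fires each registered watcher exactly once. */
    void setData(String newData) {
        data = newData;
        List<Runnable> toNotify = watchers;
        watchers = new ArrayList<>(); // clear first, so a callback can re-register itself
        for (Runnable w : toNotify) {
            w.run();
        }
    }

    public static void main(String[] args) {
        WatchedNode node = new WatchedNode("m0");
        List<String> log = new ArrayList<>();
        node.getData(() -> log.add("changed"));
        node.setData("m1"); // fires the watcher
        node.setData("m2"); // nothing registered any more
        System.out.println(log); // [changed]
    }
}
```

A consumer that wants every message re-reads the node and re-registers inside its callback.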
The CAP Theorem
Consistency, Availability and Partition tolerance are the three properties considered in the CAP theorem. The theorem states that a distributed system can only provide two of these three properties. ZooKeeper is a CP system with regard to the CAP theorem. This implies that it sacrifices availability in order to achieve consistency and partition tolerance. In other words, if it cannot guarantee correct behaviour it will not respond to queries.
Consistency Algorithm
ZooKeeper uses Zab (ZooKeeper Atomic Broadcast), a protocol similar to Paxos.
References
-
Data Structures
by oxnz
-
Mach Message
by oxnz
User Tasks <-> Mach Message <-> Kernel Services
Task <-> Msg <-> Port <-> Msg <-> Kernel
- send: ordered
- send-once: unordered
-
Redis
by oxnz
Redis (written in ANSI C) is an open source (BSD licensed), in-memory data structure store, used as a database, cache and message broker. It supports data structures such as strings, hashes, lists, sets, sorted sets with range queries, bitmaps, hyperloglogs, geospatial indexes with radius queries and streams. Redis has built-in replication, Lua scripting, LRU eviction, transactions and different levels of on-disk persistence, and provides high availability via Redis Sentinel and automatic partitioning with Redis Cluster.
It is also referred to as a data structure server; the name stands for REmote DIctionary Server.
Features
- NoSQL
- Transactions
  - serialized and atomic
  - if the log is corrupted by a partial transaction, the partial transaction can be removed and the server restarted
  - check-and-set (CAS) (v2.2)
  - does not support rollbacks
    - Redis commands can fail only if called with the wrong syntax, or against keys holding the wrong data type
    - Redis is internally simpler and faster because it does not need the ability to roll back
  MULTI;...;EXEC/DISCARD
- Pub/Sub
- Lua scripting
- Keys with a limited time-to-live (TTL)
  EXPIRE key TTL
- LRU eviction of keys
- Automatic failover
- LFU (Least Frequently Used) eviction mode (v4.0)
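As a rough illustration of what LRU eviction means, here is an exact LRU cache built on `java.util.LinkedHashMap`. This is not how Redis implements it (as far as I know Redis approximates LRU by sampling keys rather than tracking an exact order). The class name `LruCache` and the capacity of 2 are just for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Exact LRU cache: the least recently used entry is evicted once the cache is full. */
final class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    LruCache(int maxEntries) {
        // accessOrder = true keeps entries ordered from least to most recently used
        super(16, 0.75f, true);
        this.maxEntries = maxEntries;
    }

    /** Called by LinkedHashMap after each insertion; returning true drops the eldest entry. */
    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries;
    }

    public static void main(String[] args) {
        LruCache<String, String> cache = new LruCache<>(2);
        cache.put("a", "1");
        cache.put("b", "2");
        cache.get("a");      // "a" is now the most recently used key
        cache.put("c", "3"); // evicts "b", the least recently used key
        System.out.println(cache.keySet()); // [a, c]
    }
}
```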
Limitations
- single threaded
- significant overhead for persistence
-
Elasticsearch
by oxnz
Elasticsearch API.
-
Kafka Primer
by oxnz
Introduction
Kafka is an open source distributed system that is built to use ZooKeeper. The basic responsibility of ZooKeeper is to coordinate the different nodes in a cluster. Offsets are also committed to ZooKeeper periodically, so that if a node fails, processing can recover from the previously committed offset.
ZooKeeper is also responsible for configuration management, leader detection, detecting when a node leaves or joins the cluster, synchronization, etc.
- Topic (a stream of messages belonging to the same type)
- Producer (can publish messages to a topic)
- Brokers (a set of servers where the published messages are stored)
- Consumer (that subscribes to various topics and pulls data from the brokers)
- Consumer Group (Every Kafka consumer group consists of one or more consumers that jointly consume a set of subscribed topics)
Every partition in Kafka has one server which plays the role of the Leader, and zero or more servers that act as Followers. The Leader handles all read and write requests for the partition, while the Followers passively replicate the Leader. If the Leader fails, one of the Followers takes over as the new Leader. Spreading partition leadership across servers also balances load between them.
Replicas are the list of nodes that replicate the log for a particular partition, irrespective of whether they play the role of the Leader. ISR stands for In-Sync Replicas: the set of replicas that are in sync with the Leader.
If a Replica stays out of the ISR for a long time, it means that the Follower is unable to fetch data as fast as the Leader accumulates it.
Partitions are used for fail-over and parallel processing.
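The relationship between replicas, the ISR and leader fail-over can be sketched with a toy model. Everything here is an assumption made for illustration (the class `PartitionSketch`, the broker names, and the rule "promote the first replica that is still in sync"). It is a simplification and not Kafka's actual controller logic.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Optional;
import java.util.Set;

/** Toy model of one partition: a replica list, an in-sync subset, and a leader. */
final class PartitionSketch {
    private final List<String> replicas; // every broker holding a copy, in assignment order
    private final Set<String> isr;       // replicas currently caught up with the leader
    private String leader;

    /** The first replica in the list starts as leader; all replicas start in sync. */
    PartitionSketch(List<String> replicas) {
        this.replicas = new ArrayList<>(replicas);
        this.isr = new HashSet<>(replicas);
        this.leader = replicas.get(0);
    }

    /** A follower that cannot keep up with the leader drops out of the ISR. */
    void fallBehind(String follower) {
        if (!follower.equals(leader)) {
            isr.remove(follower);
        }
    }

    /** The leader fails: promote the first remaining replica that is still in sync. */
    Optional<String> leaderFailed() {
        isr.remove(leader);
        replicas.remove(leader);
        leader = null;
        for (String replica : replicas) {
            if (isr.contains(replica)) {
                leader = replica;
                break;
            }
        }
        return Optional.ofNullable(leader);
    }

    public static void main(String[] args) {
        PartitionSketch p = new PartitionSketch(List.of("broker-1", "broker-2", "broker-3"));
        p.fallBehind("broker-2");             // broker-2 lags and leaves the ISR
        System.out.println(p.leaderFailed()); // Optional[broker-3]
    }
}
```

Only in-sync replicas are eligible in this model, so a lagging follower never becomes leader.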
-
C++ Libraries
by oxnz