Articles

  • OpenNLP

    The Apache OpenNLP library is a machine learning based toolkit for the processing of natural language text.

    The goal of the OpenNLP project is to create a mature toolkit for the tasks listed below. An additional goal is to provide a large number of pre-built models for a variety of languages, as well as the annotated text resources that those models are derived from.

    The Apache OpenNLP library contains several components, enabling one to build a full natural language processing pipeline.

    • sentence detector
    • tokenizer
    • name finder
    • document categorizer
    • part-of-speech tagger
    • chunker
    • parser
    • coreference resolution

    Interfaces

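    • API (generic pattern; SomeModel, ToolName and executeTask are placeholders for a concrete tool)
    // Requires java.io.FileInputStream and java.io.InputStream.
    try (InputStream modelIn = new FileInputStream("lang-model-name.bin")) {
      SomeModel model = new SomeModel(modelIn);   // load the trained model
      ToolName toolName = new ToolName(model);    // create the tool from the model
      String[] output = toolName.executeTask("example text");
    }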
    
    • CLI
    opennlp ToolName
    opennlp ToolName help
    opennlp ToolName lang-model-name.bin < input.txt > output.txt
    
  • Craft a search engine

    Architecture

    Document -> DocumentParser -> Indexer -> index
    Query -> QueryParser -> Searcher -> IntentDetector -> QueryRewriter

    Query Intent Detection

    • benefits
      • display semantically enriched search results.
      • improve ranking results by triggering a vertical search engine in a certain domain
    • challenging task
      • queries are usually short
      • requires more context beyond the keywords
      • number of intent categories could be very high
    • approaches
      • rule-based (precise, but coverage is low and it scales poorly)
        • defining patterns for each intent class
      • statistical methods
        • supervised/unsupervised
        • defining discriminative features for queries to run statistical models
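
    The rule-based approach above can be sketched as one hand-written pattern per intent class. This is a minimal illustration only: the intent names and regular expressions are hypothetical and do not come from the article. It shows why such rules tend to be precise but cover few queries.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;
import java.util.regex.Pattern;

/** Rule-based intent detection: one hand-written pattern per intent class. */
final class RuleBasedIntentDetector {
    // Hypothetical intent classes and patterns, chosen only for illustration.
    private static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("weather", Pattern.compile("\\b(weather|forecast|temperature)\\b"));
        RULES.put("flight", Pattern.compile("\\b(flights?|airfare)\\b"));
        RULES.put("recipe", Pattern.compile("\\b(recipes?|how to cook)\\b"));
    }

    /** Returns the first intent whose pattern matches, or empty if no rule fires. */
    static Optional<String> detect(String query) {
        String normalized = query.toLowerCase(Locale.ROOT);
        for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
            if (rule.getValue().matcher(normalized).find()) {
                return Optional.of(rule.getKey());
            }
        }
        return Optional.empty(); // low coverage: many queries match no rule
    }

    public static void main(String[] args) {
        System.out.println(detect("cheap flights to Tokyo"));
        System.out.println(detect("zookeeper leader election"));
    }
}
```

    Every new intent class needs its own patterns, which is the scaling problem noted in the list.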

    CNN

    A convolutional neural network (CNN) extracts query vector representations, which serve as the features for query classification.

    In this model, queries are represented as vectors so that semantically similar queries can be captured by embedding them into a vector space.

    The model builds on word vector representations (such as word2vec).

    • supervised method
      • feature engineering (requires domain knowledge)
      • leads to state-of-the-art systems
      • uses various types of features
        • search sessions
        • click-through data
        • Wikipedia concepts
    • CNN method
      • does not require engineering query features
      • uses a CNN to automatically extract query vectors as the features
      • architecture
        1. training the model parameters offline
          • utilize the labeled queries to learn the parameters of the CNN and the intent classifier
        2. running the model over new queries online
    # train
    [Queries with intents] -> (CNN) -> [Query vectors with intents] -> [Classifier]
    # predict
    [New query] -> (CNN) -> [Query vector] -> (Classifier) -> [Predicted intent]
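
    The two phases above can be sketched in code under strong simplifying assumptions. The encoder below is a toy word-count function standing in for the trained CNN, and the classifier is a nearest-centroid rule using cosine similarity. Neither is taken from the article, which does not name its classifier. The class name, vocabulary and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

/** Offline training and online prediction over query vectors (nearest-centroid classifier). */
final class IntentPipeline {
    private final Function<String, double[]> encoder; // stand-in for the trained CNN
    private final Map<String, double[]> centroids = new HashMap<>();

    IntentPipeline(Function<String, double[]> encoder) {
        this.encoder = encoder;
    }

    /** Offline: average the vectors of all labeled queries for each intent. */
    void train(Map<String, List<String>> queriesByIntent) {
        for (Map.Entry<String, List<String>> e : queriesByIntent.entrySet()) {
            double[] sum = null;
            for (String query : e.getValue()) {
                double[] v = encoder.apply(query);
                if (sum == null) sum = new double[v.length];
                for (int i = 0; i < v.length; i++) sum[i] += v[i];
            }
            if (sum == null) continue; // intent without examples
            for (int i = 0; i < sum.length; i++) sum[i] /= e.getValue().size();
            centroids.put(e.getKey(), sum);
        }
    }

    /** Online: encode the new query and return the intent with the most similar centroid. */
    String predict(String query) {
        double[] v = encoder.apply(query);
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, double[]> e : centroids.entrySet()) {
            double score = cosine(v, e.getValue());
            if (score > bestScore) {
                bestScore = score;
                best = e.getKey();
            }
        }
        return best;
    }

    private static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return (na == 0 || nb == 0) ? 0 : dot / Math.sqrt(na * nb);
    }

    /** Toy encoder: counts of a few hypothetical vocabulary words. NOT a CNN. */
    static double[] toyEncode(String query) {
        String[] vocab = {"weather", "rain", "flight", "ticket"};
        double[] v = new double[vocab.length];
        for (String token : query.toLowerCase(Locale.ROOT).split("\\s+")) {
            for (int i = 0; i < vocab.length; i++) {
                if (vocab[i].equals(token)) v[i]++;
            }
        }
        return v;
    }

    public static void main(String[] args) {
        IntentPipeline pipeline = new IntentPipeline(IntentPipeline::toyEncode);
        pipeline.train(Map.of(
                "weather", List.of("weather today", "rain tomorrow"),
                "travel", List.of("flight ticket", "cheap flight")));
        System.out.println(pipeline.predict("will it rain"));
    }
}
```

    The point of the sketch is the split between the two phases: `train` runs once over the labeled queries, while `predict` only encodes the new query and compares it against what was learned.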
    

    Search Session

  • Interesting

    • Homomorphic encryption

    https://www.microsoft.com/en-us/research/project/homomorphic-encryption/
    https://www.microsoft.com/en-us/research/project/microsoft-seal/

  • ZooKeeper

    Apache ZooKeeper is a distributed coordination service for distributed systems. By providing a robust implementation of a few basic operations, ZooKeeper simplifies the implementation of many advanced patterns in distributed systems.

    As a Distributed File System

    • zNode
      • ephemeral zNodes
        • disappear when the session of their owner ends
        • A typical use case is host discovery in a distributed system: each server publishes its IP address in an ephemeral node. If a server loses connectivity with ZooKeeper and fails to reconnect within the session timeout, its information is deleted.
      • sequential zNodes
        • have names with an automatically assigned sequence-number suffix. This suffix is strictly increasing and is assigned by ZooKeeper when the zNode is created.
        • An easy way of doing leader election with ZooKeeper is to let every server publish its information in a zNode that is both sequential and ephemeral. Then, whichever server has the lowest sequential zNode is the leader. If the leader, or any other server for that matter, goes offline, its session dies and its ephemeral node is removed, and all other servers can observe who the new leader is.
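
    The leader-election rule described above reduces to picking the live zNode with the lowest sequence number. The sketch below shows only that selection step on plain strings and does not talk to ZooKeeper. The node names and the class are hypothetical, and it assumes each name ends in the 10-digit zero-padded counter that ZooKeeper appends to sequential zNodes.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

/** Selection step of leader election over sequential, ephemeral zNode names. */
final class LeaderElection {
    /** Parses the numeric suffix appended to a sequential zNode name. */
    static int sequenceOf(String znodeName) {
        // Assumes the name ends in a 10-digit zero-padded counter, e.g. "server-0000000007".
        return Integer.parseInt(znodeName.substring(znodeName.length() - 10));
    }

    /** The leader is the live zNode with the lowest sequence number. */
    static Optional<String> electLeader(List<String> liveZnodes) {
        return liveZnodes.stream().min(Comparator.comparingInt(LeaderElection::sequenceOf));
    }

    public static void main(String[] args) {
        List<String> live = List.of("server-0000000012", "server-0000000007", "server-0000000009");
        System.out.println(electLeader(live));
        // If the leader's session ends, its ephemeral node disappears from the list:
        System.out.println(electLeader(List.of("server-0000000012", "server-0000000009")));
    }
}
```

    In a real deployment the list would come from the children of a shared parent zNode, and each server would re-run the selection whenever that list changes.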

    As a Message Queue

    Clients can register watchers on zNodes. A watcher notifies the client of the next update to that zNode.

    ZooKeeper gives guarantees about ordering. Every update is part of a total ordering. All clients might not be at the exact same point in time, but they will all see every update in the same order.

    The CAP Theorem

    Consistency, Availability and Partition tolerance are the three properties considered in the CAP theorem. The theorem states that a distributed system can only provide two of these three properties. ZooKeeper is a CP system with regard to the CAP theorem. This implies that it sacrifices availability in order to achieve consistency and partition tolerance. In other words, if it cannot guarantee correct behaviour it will not respond to queries.

    Consistency Algorithm

    ZooKeeper uses Zab (ZooKeeper Atomic Broadcast), a protocol similar to Paxos.

  • Data Structures

  • Mach Message

    User Tasks  <-> Mach Message <-> Kernel Services
    
    Task <-> Msg <-> Port <-> Msg <-> Kernel
    
    • send: ordered
    • send-once: unordered

  • Redis

    Redis (written in ANSI C) is an open source (BSD licensed), in-memory data structure store, used as a database, cache and message broker. It supports data structures such as strings, hashes, lists, sets, sorted sets with range queries, bitmaps, hyperloglogs, geospatial indexes with radius queries and streams. Redis has built-in replication, Lua scripting, LRU eviction, transactions and different levels of on-disk persistence, and provides high availability via Redis Sentinel and automatic partitioning with Redis Cluster.

    Redis is also referred to as a data structure server. The name stands for REmote DIctionary Server.

    Features

    • NoSQL
    • Transactions
      • serialized and atomic
      • if a crash leaves a partial transaction in the append-only file, the redis-check-aof tool can remove it so that the server can restart
      • check-and-set (CAS) via WATCH (since v2.2)
      • does not support rollbacks
        • Redis commands can fail only if called with wrong syntax,
        • or against keys holding the wrong data type
        • Redis is internally simpler and faster because it does not need the ability to roll back
      • MULTI; ...; EXEC/DISCARD
    • Pub/Sub
    • Lua scripting
    • Keys with a limited time-to-live(ttl)
      • EXPIRE key TTL
    • LRU eviction of keys
    • Automatic failover
    • LFU (Least Frequently Used) eviction mode (v4.0)
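
    The idea behind LRU eviction of keys can be sketched with Java's `LinkedHashMap` in access order. This is an illustration of the policy, not of Redis itself: the class below implements exact LRU with a fixed entry count, whereas Redis approximates LRU by sampling keys and evicts based on a memory limit. The class name and capacity are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Exact LRU cache with a fixed capacity, built on an access-ordered LinkedHashMap. */
final class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    LruCache(int capacity) {
        super(16, 0.75f, true); // accessOrder = true: iteration runs from least to most recently used
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Called after every insertion; returning true evicts the least recently used entry.
        return size() > capacity;
    }

    public static void main(String[] args) {
        LruCache<String, String> cache = new LruCache<>(2);
        cache.put("a", "1");
        cache.put("b", "2");
        cache.get("a");      // "a" becomes the most recently used key
        cache.put("c", "3"); // evicts "b", the least recently used key
        System.out.println(cache.keySet()); // prints [a, c]
    }
}
```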

    Limitations

    • command execution is single threaded
    • significant overhead for persistence

  • Elasticsearch

    Elasticsearch API.

  • Kafka Primer

    Introduction

    Kafka is an open source distributed system built to use ZooKeeper. The basic responsibility of ZooKeeper is to coordinate the different nodes in a cluster. Offsets are also committed to ZooKeeper periodically, so that if a node fails, processing can recover from the previously committed offset.

    ZooKeeper is also responsible for configuration management, leader detection, detecting if any node leaves or joins the cluster, synchronization, etc.

    • Topic (a stream of messages belonging to the same type)
    • Producer (publishes messages to a topic)
    • Brokers (a set of servers where the published messages are stored)
    • Consumer (subscribes to various topics and pulls data from the brokers)
    • Consumer Group (one or more consumers that jointly consume a set of subscribed topics)

    Every partition in Kafka has one server that plays the role of the Leader, and zero or more servers that act as Followers. The Leader handles all read and write requests for the partition, while the Followers passively replicate the Leader. If the Leader fails, one of the Followers takes over the role of the Leader. This ensures load balancing across the servers.

    Replicas are the list of nodes that replicate the log for a particular partition, irrespective of whether they play the role of the Leader. ISR stands for In-Sync Replicas: the subset of replicas that are in sync with the Leader.

    If a Replica stays out of the ISR for a long time, it means that the Follower is unable to fetch data as fast as the Leader accumulates it.

    Partitions are used for fail-over and parallel processing.
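
    One way to see how partitions enable parallel processing is to look at how a message key could be mapped to a partition. The sketch below is an assumption-laden illustration: it uses `String.hashCode` as a stand-in hash, whereas Kafka's default partitioner hashes the serialized key with murmur2. The class and method names are hypothetical.

```java
/** Maps message keys to partitions so that equal keys always land in the same partition. */
final class KeyPartitioner {
    /** Returns a partition index in the range [0, numPartitions). */
    static int partitionFor(String key, int numPartitions) {
        if (numPartitions <= 0) {
            throw new IllegalArgumentException("numPartitions must be positive");
        }
        // floorMod keeps the result non-negative even when hashCode() is negative.
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    public static void main(String[] args) {
        for (String key : new String[] {"user-1", "user-2", "user-3"}) {
            System.out.println(key + " -> partition " + partitionFor(key, 3));
        }
    }
}
```

    Because messages with the same key stay in one partition, their relative order is preserved there, while different partitions can be consumed in parallel by the members of a consumer group.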

  • C++ Libraries
