Sample code of the FindMedian algorithm (chunkSize = 5):
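The original listing only fixes `chunkSize = 5`, so here is a minimal sketch of the chunk-of-5 (median-of-medians) selection it implies; the function and variable names are my own, not from the original code.

```python
def find_median(arr):
    """Median via median-of-medians selection (lower median for even length)."""
    return select(arr, (len(arr) - 1) // 2)

def select(arr, k):
    """Return the k-th smallest element (0-indexed) of arr."""
    chunkSize = 5
    if len(arr) <= chunkSize:
        return sorted(arr)[k]
    # Take the median of each chunk of 5, then the median of those medians as the pivot.
    chunks = [arr[i:i + chunkSize] for i in range(0, len(arr), chunkSize)]
    medians = [sorted(c)[len(c) // 2] for c in chunks]
    pivot = select(medians, len(medians) // 2)
    lows = [x for x in arr if x < pivot]
    pivots = [x for x in arr if x == pivot]
    highs = [x for x in arr if x > pivot]
    if k < len(lows):
        return select(lows, k)
    if k < len(lows) + len(pivots):
        return pivot
    return select(highs, k - len(lows) - len(pivots))

print(find_median([7, 1, 5, 9, 3, 8, 2]))  # 5
```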
Recently I read the MongoDB documentation. Here I list some important features of MongoDB that are different from other databases'.
It’s a NoSQL document database.
The storage engine is the primary component of MongoDB responsible for managing data, both in memory and on disk. You can choose the appropriate storage engine to suit your application.
MongoDB uses write-ahead logging to on-disk journal files to provide durability and prevent data loss after a failure. It uses GridFS to handle large files.
Most SQL-style operations work well, including aggregation. Newer versions of MongoDB support joins (the $lookup aggregation stage), but join performance is generally not as good as in relational databases.
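As a rough illustration of the join feature, here is a minimal PyMongo sketch; the `customers`/`orders` collections and their field names are made up.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumed local instance
db = client["shop"]  # hypothetical database

# Aggregation with a join-like $lookup stage: attach each customer's orders.
pipeline = [
    {"$lookup": {
        "from": "orders",                # collection to join with
        "localField": "_id",             # field in customers
        "foreignField": "customer_id",   # field in orders
        "as": "orders",                  # output array field
    }},
    {"$project": {"name": 1, "order_count": {"$size": "$orders"}}},
]
for doc in db.customers.aggregate(pipeline):
    print(doc)
```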
The basic structure is Database.Collection.Documents (usually in JSON format). Each document has a unique `_id` field used as a primary key.
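A minimal PyMongo sketch of that structure, assuming a local server and made-up database/collection names:

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumed local instance
users = client["shop"]["users"]  # database "shop", collection "users" (hypothetical)

# Each document is a JSON-like dict; MongoDB adds a unique _id if we don't supply one.
result = users.insert_one({"name": "Ada", "age": 36})
print(result.inserted_id)                            # the generated ObjectId
print(users.find_one({"_id": result.inserted_id}))   # look the document up by its primary key
```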
Indexes are special structures in MongoDB that store a small portion of the data set in an easy-to-traverse form. MongoDB supports various kinds of indexes.
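For example (a PyMongo sketch; the field names are hypothetical):

```python
from pymongo import MongoClient, ASCENDING

users = MongoClient("mongodb://localhost:27017")["shop"]["users"]  # hypothetical collection

users.create_index([("email", ASCENDING)], unique=True)          # single-field unique index
users.create_index([("age", ASCENDING), ("name", ASCENDING)])    # compound index
```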
MongoDB also supports geospatial queries, as well as references (links) between two different documents.
The profiler is used to track command performance. Change streams allow applications to access real-time data changes without the complexity and risk of tailing the oplog.
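A minimal change-stream sketch with PyMongo (change streams require a replica set; the collection name is made up):

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")   # assumed replica-set member
with client["shop"]["orders"].watch() as stream:    # hypothetical collection
    for change in stream:
        # Each change document describes one insert/update/delete as it happens.
        print(change["operationType"], change.get("fullDocument"))
```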
The replication design in MongoDB is really similar to Kafka's. There is a primary node that handles all write operations, and read operations can go to any node. Heartbeats are exchanged among nodes to report each node's current condition. When the primary node fails, an eligible secondary node holds an election to elect itself as the new primary. The other secondary nodes check various conditions, such as whether they have a stable, low-latency replication connection to the candidate. If the candidate gets more than half of the votes, it becomes the new primary node. There are also nodes called arbiters that only participate by voting.
Sharding is the procedure of storing data records across multiple machines. We carefully choose a shard key to distribute the documents in a collection, and the shard key can't be changed afterwards.
I haven’t used it much and learned it only by reading the documentation, so these are simply some random notes I have so far.
MongoDB supports multi-document transactions to ensure atomicity for updates to multiple documents, or consistency between reads of multiple documents.

Sometimes the interviewer will ask about your previous project background, and you don’t want them to hear something like “You know what? I used a lot of fancy technology for that project, but I know nothing about these technologies except how to find the API that I need.”
The notes below were written during my last internship after reading the Kafka documentation. The notes are presented as interview questions to help those who still struggle with finding a nice job (AKA me).
Kafka is a message broker project written in Scala. It uses ZooKeeper to store the offset values of messages. It uses TCP to manage communication between clients and servers. It relies heavily on the filesystem and disk for storing and caching messages. The most popular client for Kafka is written in Java.
Topic
The stream of messages that belong to the same category.
Producer
It publishes messages to a topic.
Broker
A set of servers where all published data is stored.
Consumer
It subscribes to different topics and fetches data from the brokers.
A record represents a single data entry pushed by a producer. A topic partition is an ordered, immutable sequence of records that is continually appended to — a structured commit log. A partition behaves much like a queue. The records in a partition are each assigned a sequential id number called the offset that uniquely identifies each record within the partition. A topic contains multiple partitions and can be seen as a channel that categorizes different producers’ input.
The topic partition can be seen as the unit of parallelism in Kafka. Therefore the throughput of most operations is bounded by the number of partitions.
The Kafka producer first declares a list of bootstrap servers and a topic; then it can start to upload data to that topic. We can customize which partition each record goes to, but we should be careful to balance the number of records among the partitions. Sending can be done asynchronously. We can also choose when a record counts as committed: either when the partition leader has written it, or only after all replicas have.
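A rough sketch of that flow using the kafka-python client (the broker address, topic and payload are placeholders):

```python
from kafka import KafkaProducer  # assumes the kafka-python package

producer = KafkaProducer(
    bootstrap_servers=["localhost:9092"],
    acks="all",  # wait until all in-sync replicas have the record
)
future = producer.send("events", key=b"user-42", value=b"clicked")  # asynchronous send
metadata = future.get(timeout=10)   # block only if we want the partition/offset back
print(metadata.partition, metadata.offset)
producer.flush()
```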
Kafka Streams can move data from an input topic to an output topic. The processing pipeline follows a MapReduce-style model and can perform the common database operations (filter, map, join, aggregate).
The Kafka consumer pulls data from the brokers. It can subscribe to specific topics or retrieve all data. Data pulling always starts from the last committed offset, and the commit operation updates the consumer’s offset in the partition. When and how we call commit determines what kind of message delivery guarantee we get. We can also set a consumer group name for each consumer: consumers within the same group share a single copy of the data, while two different consumer groups each read their own copy of the data.
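A minimal consumer sketch with kafka-python (topic, group id and broker address are placeholders):

```python
from kafka import KafkaConsumer  # assumes the kafka-python package

consumer = KafkaConsumer(
    "events",
    bootstrap_servers=["localhost:9092"],
    group_id="analytics",          # consumers sharing this id split the partitions
    auto_offset_reset="earliest",  # where to start when there is no committed offset yet
)
for record in consumer:
    print(record.topic, record.partition, record.offset, record.value)
```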
At Most Once
This means we either successfully process a message or we lose it. If we commit right after pulling the data, the server might go down before we finish processing it. When we restart the server, since we have already advanced the partition offset, the consumer continues reading from the new offset and the data is lost.
At Least Once
This means we either successfully process a message once or we process it more than once. If we commit after finishing processing the data, the server might go down between the time we finish processing and the time we commit the offset. When we restart the server, the consumer starts reading from the last committed offset and processes that data again, so the data gets processed twice.
Exactly Once
This means we always process each message exactly once. To achieve that, we need to put both committing the offset and processing the data into a single transaction. Although exactly once seems to be the perfect solution, holding a transaction can reduce processing speed. The most common strategy right now is at least once.
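The difference between the first two guarantees boils down to the order of processing and committing; here is a kafka-python sketch of the idea (the `process` function is just a stand-in):

```python
from kafka import KafkaConsumer  # assumes the kafka-python package

def process(record):
    print(record.value)  # stand-in for real processing work

consumer = KafkaConsumer("events", bootstrap_servers=["localhost:9092"],
                         group_id="analytics", enable_auto_commit=False)

for record in consumer:
    # At-most-once: commit first, then process; a crash after the commit
    # but before processing loses the record.
    #   consumer.commit(); process(record)
    #
    # At-least-once: process first, then commit; a crash between the two
    # means the record is processed again after a restart.
    process(record)
    consumer.commit()
```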
Usually more partitions mean higher throughput. However, more partitions in one topic also mean more leader partitions for that topic. When a broker is shut down uncleanly, the partitions whose leader is on that broker become temporarily unavailable, and they need to wait until Kafka moves their leaders to other replicas. The observed unavailability after an unclean shutdown is therefore roughly proportional to the number of partitions. Unclean failures are rare in general, but for those who care about availability in those cases, the number of partitions can’t be too high.
Also, if the replication factor is high, we might see some latency between pushing data and polling data, since every write needs to be synced across all the replicas.
We can set the replication factor for each topic partition. Each partition has a single leader and zero or more followers. All writes go to the leader of the partition, and reads can go to any replica. The leaders should usually be distributed evenly among brokers so that one broken server won’t affect much.
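For example, creating a topic with a chosen partition count and replication factor might look like this with kafka-python's admin client (broker address and topic settings are made up):

```python
from kafka.admin import KafkaAdminClient, NewTopic  # assumes the kafka-python package

admin = KafkaAdminClient(bootstrap_servers=["localhost:9092"])  # placeholder broker
# Hypothetical topic: 6 partitions (units of parallelism), each kept on 3 brokers.
admin.create_topics([NewTopic(name="events", num_partitions=6, replication_factor=3)])
```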
Messages sent by a producer to a particular topic partition will be appended in the order they are sent. That is, if a record M1 is sent by the same producer as a record M2, and M1 is sent first, then M1 will have a lower offset than M2 and appear earlier in the log.
A consumer instance sees records in the order they are stored in the log.
For a topic with replication factor N, we will tolerate up to N-1 server failures without losing any records committed to the log.
Sometimes the interviewer will ask about your previous project background, and you don’t want them to hear something like “You know what? I used a lot of fancy technology for that project, but I know nothing about these technologies except how to find the API that I need.”
The notes below were written during my last internship after reading the Redis documentation. The notes are presented as interview questions to help those who still struggle with finding a nice job (AKA me).
Redis = REmote Dictionary Server. It is a key-value database where values can contain complex data types with atomic operations defined on those data types.
Redis is an in-memory database, so it represents a trade-off: very high write and read speed, with the limitation that data sets can’t be larger than memory.
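A small redis-py sketch of those atomic per-type operations (assuming a local Redis server; the key names are made up):

```python
import redis  # assumes the redis-py package and a local Redis server

r = redis.Redis(host="localhost", port=6379)

r.set("page:views", 0)
r.incr("page:views")                    # atomic increment of a counter value
r.lpush("recent:users", "ada", "bob")   # atomic push onto a list value
r.sadd("tags", "db", "cache")           # atomic add to a set value
print(r.get("page:views"), r.lrange("recent:users", 0, -1), r.smembers("tags"))
```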
“The data determines the upper bound of the machine learning model. The machine learning model only approximates that upper bound.” I heard these words from my mentor during my first week of internship at 个推. She was talking about how important it is to do data cleaning and feature engineering before building a sophisticated machine learning model.
In this article, I will talk about common strategies to examine your dataset during machine learning model construction.
First, check the quality of the data source. Does the dataset come from an authoritative organization or from a personal GitHub account? We usually collect data in various ways, so it’s important to give different weights to different data sources when you want a comprehensive understanding of your data.
Second, examine the data quantity. How many data entries are there? What is the dimension of each entry? How much data is missing? We usually count missing data along three dimensions: by row, by column, or over the whole table. By answering these questions, you will get a general idea of the model complexity, the computing resources needed for each step, and the balance between approximation error and estimation error.
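A quick way to get those counts with pandas (the toy frame below stands in for a real dataset):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset.
df = pd.DataFrame({"age": [25, np.nan, 40], "income": [50000, 62000, np.nan]})

missing_by_column = df.isnull().sum()        # per feature
missing_by_row = df.isnull().sum(axis=1)     # per entry
missing_in_table = df.isnull().sum().sum()   # whole table
print(missing_by_column, missing_by_row, missing_in_table, sep="\n")
print(df.shape)  # number of entries and dimension of each entry
```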
Third, examine the data types. Is it a static dataset or a dynamic one (a time series)? How many discrete features and how many continuous features are there?
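A pandas sketch of that type check (again on a toy frame):

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["SF", "NY"],                                     # discrete
    "age": [25, 40],                                          # continuous
    "signup": pd.to_datetime(["2020-01-01", "2020-02-01"]),   # time dimension
})

print(df.dtypes)  # per-column types
discrete = df.select_dtypes(include=["object", "category"]).columns
continuous = df.select_dtypes(include=["number"]).columns
print(list(discrete), list(continuous))
```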
Usually these four kinds of statistical indicators will give a comprehensive understanding of the dataset.
Outliers can be seen as irregular data points in our dataset and can be caused by various reasons. For example, people who use fake identities shouldn’t be included in a user-portrait study. Here I summarize the main differences among three common ways to measure outliers.
It’s important to construct graphs when you evaluate your dataset. Histograms are usually used in regression and scatter plots are usually used in classification. Box plots are also commonly used in data visualization.
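A minimal matplotlib sketch of those three plots, on synthetic data:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
target = rng.normal(size=200)                    # stand-in for a regression target
feature = target + rng.normal(scale=0.5, size=200)
labels = (target > 0).astype(int)                # stand-in for class labels

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].hist(target, bins=20)                    # distribution of a continuous target
axes[1].scatter(feature, target, c=labels)       # feature vs. target, colored by class
axes[2].boxplot(feature)                         # spread and outliers of one feature
plt.show()
```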
For a divide-and-conquer recurrence of the form $T(n) = a\,T(n/b) + O(n^d)$, the master theorem gives:

Runtime | When |
---|---|
$O(n^d)$ | $d > \log_b a$ |
$O(n^d \log n)$ | $d = \log_b a$ |
$O(n^{\log_b a})$ | $d < \log_b a$ |
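As a quick worked check (my own examples, one per case; the second one is merge sort):

$$T(n) = T(n/2) + O(n):\quad d = 1 > \log_2 1 = 0 \;\Rightarrow\; T(n) = O(n)$$

$$T(n) = 2T(n/2) + O(n):\quad d = 1 = \log_2 2 \;\Rightarrow\; T(n) = O(n \log n)$$

$$T(n) = 2T(n/2) + O(1):\quad d = 0 < \log_2 2 = 1 \;\Rightarrow\; T(n) = O(n^{\log_2 2}) = O(n)$$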
There are two types of randomized algorithms: Las Vegas algorithms and Monte-Carlo algorithms. A Las Vegas algorithm always outputs the correct answer, but is allowed to have a large runtime with small probability. A Monte-Carlo algorithm always has a low runtime, but is allowed to return a wrong answer with small probability.
Typical examples include quicksort, universal hashing, Bloom filters and streaming algorithms.
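As a concrete Monte-Carlo example (my own, not from the original notes), Freivalds' algorithm checks whether $AB = C$ using only a few matrix-vector products per trial, with a one-sided error probability of at most $2^{-\text{trials}}$:

```python
import numpy as np

def freivalds(A, B, C, trials=10):
    """Monte-Carlo check of whether A @ B == C.

    Always accepts a correct C; accepts a wrong C with probability <= 2**-trials.
    """
    n = C.shape[1]
    rng = np.random.default_rng()
    for _ in range(trials):
        r = rng.integers(0, 2, size=(n, 1))  # random 0/1 vector
        # Compare A(Br) with Cr: three matrix-vector products instead of a full A @ B.
        if not np.array_equal(A @ (B @ r), C @ r):
            return False
    return True

A = np.array([[2, 3], [1, 4]])
B = np.array([[1, 0], [2, 1]])
print(freivalds(A, B, A @ B))                          # True: the product is correct
print(freivalds(A, B, A @ B + np.eye(2, dtype=int)))   # almost surely False
```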
Minimax duality; max-flow problems
Any problem solvable in polynomial time can be reduced to a Boolean combinational circuit (the circuit value problem), and the circuit value problem can be reduced to an LP. Thus everything solvable in polynomial time reduces to LP.
A confusion matrix is a specific table layout that allows visualization of the performance of an algorithm, typically a supervised learning one. Here is a basic example.
Here is a prediction result from a classification model.
prediction | 1 | 2 | 3 | 1 | 1 | 1 | 3 |
---|---|---|---|---|---|---|---|
true | 1 | 2 | 2 | 2 | 3 | 2 | 1 |
Interpreting it with a confusion matrix, we get:
| Prediction \ True | 1 | 2 | 3 |
|---|---|---|---|
| 1 | 1 | 2 | 1 |
| 2 | 0 | 1 | 0 |
| 3 | 1 | 1 | 0 |
To interpret this matrix, take a look at the last row (prediction = 3): there is one instance whose true class is 1 that we predicted as 3. Next we introduce a definition called the table of confusion. It’s a table with two rows and two columns that reports the number of true positives, false positives, false negatives and true negatives. Take a look at the next example, which treats class 1 as the positive class.
| | actual 1 | actual other than 1 |
|---|---|---|
| predict 1 | 1 (true positive) | 3 (false positive) |
| predict other than 1 | 1 (false negative) | 2 (true negative) |
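The same numbers can be reproduced with scikit-learn (note that its convention is rows = true labels and columns = predictions, i.e. the transpose of the layout above):

```python
from sklearn.metrics import confusion_matrix

y_pred = [1, 2, 3, 1, 1, 1, 3]
y_true = [1, 2, 2, 2, 3, 2, 1]

cm = confusion_matrix(y_true, y_pred, labels=[1, 2, 3])
print(cm)

# Table of confusion for class 1.
tp = cm[0, 0]
fn = cm[0, 1:].sum()   # actual 1, predicted something else
fp = cm[1:, 0].sum()   # predicted 1, actually something else
tn = cm[1:, 1:].sum()  # neither predicted nor actually 1
print(tp, fp, fn, tn)  # 1 3 1 2
```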
So far everything is still simple. We introduce four new concepts to evaluate the performance of our model.
Although all these concepts and formulas are still not so hard to understand, deciding when to use which one becomes complicated. I will discuss the main differences, advantages and disadvantages of each measurement, and how to combine them to evaluate model performance.
Precision (positive predictive value) represents “how many selected items are relevant”, and recall (true positive rate) represents “how many relevant items are selected”. The former counts the true positives among the items we selected, and the latter counts the true positives among the items that are actually positive. We usually use the harmonic mean of the two to measure a test’s accuracy (the F1 score).
$$F1 = \frac{2 \times recall \times precision}{recall + precision}$$
The reason we use the harmonic mean is that it is dominated by the smaller of precision and recall, so the score is high only when the two are both high and reasonably close. We can relate this formula to the Dice similarity coefficient, which is used for comparing the similarity of two samples.
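As a worked example on the class-1 table of confusion above: precision $= 1/(1+3) = 0.25$ and recall $= 1/(1+1) = 0.5$, so

$$F1 = \frac{2 \times 0.5 \times 0.25}{0.5 + 0.25} = \frac{1}{3} \approx 0.33$$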
The main drawback of using F-score is listed below:
A ROC space depicts the relative trade-off between true positives and false positives. In binary classification, we predict a score X for each instance. Given a threshold parameter T, the instance is classified as positive if X > T and negative otherwise. Usually, as the threshold gets looser, both TPR and FPR increase. Therefore, a ROC curve is drawn in the space defined by FPR and TPR as the x and y axes, with one point per threshold setting. We usually use the AUC (Area Under the Curve) to summarize the ROC curve.
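A short scikit-learn sketch of that construction, on toy labels and scores (not from the notes):

```python
from sklearn.metrics import roc_curve, roc_auc_score

y_true = [0, 0, 1, 1, 1, 0]                 # toy labels
y_score = [0.1, 0.4, 0.35, 0.8, 0.7, 0.2]   # predicted scores

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # one (FPR, TPR) point per threshold
print(list(zip(thresholds, fpr, tpr)))
print(roc_auc_score(y_true, y_score))              # area under that curve
```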
AUC can be interpreted as the probability that the classifier will assign a higher score to a randomly chosen positive example than to a randomly chosen negative example. The main drawback of using AUC is listed below:
Accuracy is one of the most commonly used statistics. The formula is pretty clear, so let’s go straight to its drawback: if TN is large, the influence of TP becomes minor.
The Matthews Correlation Coefficient uses the correlation between the prediction and the observation to evaluate model performance. It is so far seen as one of the best measurements for binary classification, since it takes into account the balance ratios of the four confusion matrix categories:

$$MCC = \frac{\sum_{n}(P_n - \overline{P})(S_n - \overline{S})}{\sqrt{\sum_{n}(P_n - \overline{P})^2}\,\sqrt{\sum_{n}(S_n - \overline{S})^2}}$$

where $\overline{P}$ represents the mean of the predictions, $P_n$ the nth prediction, $\overline{S}$ the mean of the observations and $S_n$ the nth observation.

The formula above can be transformed into the following one, which is the form we commonly see:

$$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$$
Every measurement derived from the confusion matrix trades off some aspects of the data. Under different circumstances, we weight the four confusion matrix categories differently, and each measurement pays attention to particular aspects of the dataset. Therefore, it’s important to know when and how to use these tools to get a comprehensive understanding of our model.