findMedian Sample Code

Posted on 2019-07-24 Edited on 2019-07-25 In CS , Algorithm

Sample code of FindMedian Algorithm.

chunkSize = 5

def findMedian(lst):
	if len(lst) == 0:
		return None
	if len(lst) % 2 == 0:
		num1 = findKth(lst, len(lst)//2)
		num2 = findKth(lst, len(lst)//2 - 1)
		num = (num1 + num2) * 1.0 / 2
	else:
		num = findKth(lst, len(lst)//2)

	print("result: " + str(num))
	return num

def findKth(lst, k):
	if len(lst) == 0:
		return None

	if len(lst) == 1:
		return lst[0]

	medians = []
	for i in range(0, len(lst), chunkSize):
		chunk = lst[i:min(i + chunkSize, len(lst))]
		chunk.sort()
		median = chunk[len(chunk)//2]
		medians.append(median)

	medianOfMedians = findKth(medians, len(medians)//2)
	# print("medianofmedians: " + str(medianOfMedians))

	high = []
	low = []
	same = []

	for num in lst:
		if num < medianOfMedians:
			low.append(num)
		elif num > medianOfMedians:
			high.append(num)
		else:
			same.append(num)

	if len(low) > k:
		return findKth(low, k)
	elif len(low) + len(same) > k:
		return medianOfMedians
	else:
		return findKth(high, k - len(low) - len(same))



assert findMedian([]) == None

assert findMedian([2]) == 2

assert findMedian([5,3]) == 4.0

assert findMedian([1,3]) == 2.0

assert findMedian([5,2,6]) == 5

assert findMedian([5,3,1]) == 3

assert findMedian([5,6,0,1]) == 3.0

assert findMedian([5,2,6,2,10,5]) == 5.0

assert findMedian([5,2,6,2,10,5,14]) == 5

assert findMedian([5,2,6,23,2,10,5,14]) == 5.5

assert findMedian([5,2,6,23,2,10,2, 5,14]) == 5

assert findMedian([5,2,6,23,2,10,2, 5,14, 5]) == 5.0

assert findMedian([5,2,6,23,2,10,2, 5,14, 12, 1349, 2, 349, 5, 10]) == 6

Mongodb Overview

Posted on 2019-07-24 Edited on 2019-07-25

Recently I read the mongoDB’s documentation. Here I listed some important features of MongoDB that are different from other databases’.

What is MongoDB?

It’s a NoSQL document database.

How is data stored in MongoDB? How does MongoDB provide durability? How does MongoDB handle large files?

The storage engine is the primary component of MongoDb responsible for managing data, both in memory and on disk. You can choose the appropriate storage engine to suit your application.
MongoDB use write-ahead logging to on-dis journal files to provide durability and prevent failture. It uses GridFS to handle large file.

What kind of operations can MongoDB perform?

Most of the SQL operations work well including aggreation. The new MongoDB has join feature but the speed probably won’t be that good compared with others.

Any interesting features in MongoDB structure? What are indexes in MongoDB?

The basic structure is Database.Collection.Documents(usually Json Format). Each document has a unique id used as a primary key.

Indexes are special structures in MongoDB, which stores a small portion of the data set in an easy to traverse form. There are various kinds of indexes that MongoDB supports.

MongoDB also support geospatical data querying as well as reference link between two different documents.

What are some common administration tools used in MongoDB?

Profiler is used to track commend’s performance.
Change streams allow applications to access real-time data changes without the complexity and risk of tailing the log.

How replication works in MongoDB?

The replication design in MongoDB is really similar to Kafka. It has a primary node for all the write operations and can use any node for read operations. Heartbeat is used among nodes to show the node’s current condition. When the primary node fails, an eligible secondary node would hold an election to elect itself the new primary. Other secondary nodes would check various conditions such as whether their have stable connections with the candidate without long delay in data transformation. If the candidate has more than half vote, it will be selected as the primary node. There are also nodes called Arbiter that are considered as vote only.

What is sharding in MongoDB?

The procedure of storing data records across multiple machines. We carefully choose a sharding key to distribute the documents in a collections. The sharding key can't be changed.

What other interesting features do you think MongoDB has?

I haven’t used it much and learned it only by reading documentation, so here are simply some random notes I have so for.

It’s really convenient to create and run function in MongoDB.
You can customize prompt to for example display your database and hostname.
It accepts retryable write that allows drivers to automatically retry certain write operation a single time if they encounter network errors.
MongoDB also support transactions to ensure atomicity for updates to multiple documents or consistency between reads to multiple documents.

What are the drawbacks of MongoDB?

It doesn’t support Join.
It is a memory hog. Because MongoDB stores key names for each value pairs. There is data redundancy since there is no join.
The scaling ability and pipeline operations are not so well compared with other NoSQL solutions.
The transaction can only happen at the document level.

Kafka Overview

Posted on 2019-07-24 Edited on 2019-07-25 In CS , Message Queue

Sometimes the interviewer will ask about your previous project background, you don’t want them to hear something like “You know what? I used a lot of fancy technology for that project, but I know nothing about these technologies except how to find out the API that I need.”

The notes below is written during my last internship after reading the Kafka Documentation. The notes are presented as interview questions to help those who still struggle with finding a nice job. (AKA me)

What is Kafka?

Kafka is a message broker project that is written in Scala. It uses Zookeeper to store offset values of messages. It uses TCP to manage communication between the clients and the servers. It relies heavily on the filesystem and dis for storing and caching messages. The most popular client for Kafka is in Java.

What are the main components of Kafka?

Topic
The message stream that belongs to the same pattern.
Producer
It helps in publishing messages to the topic.
Broker
This is a set of various servers where all published data is stored.
Consumer
It subscribes to the different topics and fetch data from the brokers.

What is the data structure of Kafka stream?

A record represents a single data entry pushed by producer. A topic partition is an ordered, immutable sequence of records that is continually appended to a structured commit log. Partition has a very similar data structure with Queue. The records in the partitions are each assigned a sequential id number called offset taht uniquely identifies each record within the partition. A topic contains multiple partitions. It can be seen as a channel that categories different producers’ input.

How does data streamming work in Kafka?

The topic partition can be seen as a unit of parallelism in Kafka stream. Therefore the throughput of most operations are bounded by the number of partitions.

The Kafka Producer first declares some specfic server lists and topic. Then it can start to upload data to some topic. We can customize the destination of our data stream, but be aware to balance the number of records among different partitions. The sending process can be done asynchronously. We can also choose when to send the next record. It can be determined by whether the leader partition has commited or whether all replications have commited.

The Kafka Stream can manage data from an input topic to an output topic. The pipeline process is based on MapReduce and can perform all the common database operations.

The Kafka Consumer pulls data from brokers. It can subscribe to some specific topic or retrive all data. Data pulling always starts from the last commited offset. The commit operation updates the consumer’s offset in the partition. When and how we call commit determines what kind of message delivery guarantees we want to have. We can also set up consumer group name for each consumer. The consumers within a same group read one data and two different consumers groups read two copies of the data.

What kind of message delivery guarantees do Kafka has?

At Most Once
It means that we can either successfully get our message or we lose it. When we commits right after pulling out data, server might be down before we finish processing it. When we restart the server, since we have already updated the partition offset, the consumer will continue to read records from the new offset and thus we lost our data.
At Least Once
It means that we can either successfully get our message or we get duplicated data. When we commits after finishing process the data, server might be down between the time we finish processing the data and we commit the offsets. When we restart the server, the consumer will start reading data from our last commited offset and process the data. This leads to process the data twice.
Exactly Once
It means that we can always get our message once. To hold that, we need to add both commiting and processing data into a transaction. Although exactly once seems to be the perfect solution, holding a transaction might reduce the processing speed. The most common strategy right now is at least once.

How to choose the number of partitions in one topic?

Usually more partitions mean higher throughput. However, more paritions in one topic lead to more leader partitions for one topic. When a broker is shut down uncleanly, partitions with a leader on that broker become temporaily unavailable, and they need to wait till Kafka move the leaders to some other replicas. If the broker is shut down uncleanly, the observed unavailability could be proportional to the number of partitions. Although in general, unclean failures are rare, but for those who care about availablity in those cases, the number of partitions can’t be too high.

Also, if the number of replication is high, we might experience some latency between pushing data and polling data since we need to sync among all replications for every write.

How does Kafka handle server break down? (TODO)

We can set the number of replication for each topic partition. Each partition has a single leader and zero or more followers. All writes go to the leader of the partition and read can go to any replication. Usually the leader should distributed evenly among brokers so that one broken down server won’t affect much.

What does Kafka guarantees at a high-level?

Messages sent by a producer to a particular topic partition will be appended in the order they are sent. That is, if a record M1 is sent by the same producer as a record M2, and M1 is sent first, then M1 will have a lower offset than M2 and appear earlier in the log.
A consumer instance sees records in the order they are stored in the log.
For a topic with replication factor N, we will tolerate up to N-1 server failures without losing any records committed to the log.

TBD 4.7 replication

Redis Overview

Posted on 2019-07-24 Edited on 2019-07-25 In CS , Database

The notes below is written during my last internship after reading the Redis Documentation. The notes are presented as interview questions to help those who still struggle with finding a nice job. (AKA me)

What is Redis?

Redis = REmote Dictionary Server. It is a key-value database where values can contain complex data types with atomic operations defined on those data types.

How is data stored in Redis?

Redis is an in-memory database, so it represents a trade off where very very high write and read speed is achieved with the limitation of data sets that can’t be larger than memory.

What are the limitation of Redis?

It is single threaded.
It has significant overhead for persistence.
It is not deployed widely.

Some interesting features I found

Redis is implemented by Linked List.
Redis can be used as a message queue or pub-sub server.
It supports Hyperloglogs. It is a probabilistic data structure used in order to count unique things that trade memory for precision.
Redis automatically create and remove keys.

Evaluate Data for Machine Learning

Posted on 2019-01-25 Edited on 2019-07-25 In CS , Machine Learning

“The data determines the upper bound of the machine learning model. The machine learning models approximate that upper bound.” I heard these words from my mentor during my first week internship in 个推. She was talking about how important it is to do data cleaning and feature engineering before building up sophisticated machine learning model.

In this article, I will talk about common strategies to examine your dataset during machine learning model construction.

Dataset Overview

First, check the data source quality. Does the dataset come from an authoritative organization or from a personal GitHub account? Usually we can collect data from various ways. It’s important to give different weights to different data source when you want to have a comprehensive understanding about your data.

Second, examine the data quantity. How many data entries are there? What is the dimension of each entry? How many missing data are there? Usually we count the missing data in three dimensions: by row, by column or by table. By answering these questions, you will have a general idea about the model complexity, the computing resources for each steps, focus between approximation error and estimation error.

Third, examine the data type. Is it a static dataset or a dynamic one(time series)? How many discrete features and continuous features are there?

Use Statistic Methods

Usually these four kinds of statistics indicators will give a comprehensive understanding about the dataset.

Central Tendency: Mean, Median, Mode
Relative Position: Max, Min, Quantile, IOR
Variance and Covariance
Outliers: Z-Score, Dbscan and Isolation Forest

Define Outliers

The outliers can be seen as irregular data points in our dataset and can be caused by various reasons. For example, people who use fake identity shouldn’t be included in an user portrait research. Here I summarize main differences among three common ways to measure outliers.

Z-Score
1. Description
  - (data - mean) / var
2. Advantage
  - Good on gaussian distribution
3. Disadvantage
  - Not good for high dimensional feature space
Dbscan
1. Description
  - Count the number of neighbors
2. Advantage
  - No assumption on the distribution of feature space
  - good on multi-dimensional features
3. Disadvantage
  - Need to scale featureshard to select the optimal parameters
Isolation forest
1. Description
  - randomly split sample by some feature
2. Advantage
  - Less parameter and therefore it’s robust and easy to optimize
  - no scaling needed
  - no assumption to distribution needed
3. Disadvantage
  - Training time can be very long

Data Visualization

It’s important to construct graph when you evaluate your dataset. Histogram is usually used in regression and Scatter is usually used in classification. Box is also commonly used in data visualization.

CS170 Review

Posted on 2018-12-15 Edited on 2019-07-25 In CS , Algorithm

Runtime Analysis

Notation
$\Omega() < \theta() < O()$
The master theorem
$T(n) = a T(\frac{n}{b}) + n^d log^k n$

Runtime	when
$n^d$	$d > log_b a$
$n^d log^k n$	$d = log_b a$
$n^{log_b^a}$	$d < log_b a$

Typical Examples
Mergesort, Find Median and the fast fourier transformation.

Randomized Algorithm

There are two types of randomized algorithms: Las Vegas algorithm and Monte-Carlo algorithm. The Las Vegas algorithm will always output the correct answer, but they are allowed to have large runtime for a small probability. The Monte-Carlo algorithms will always have a low runtime, but they are allowed to give wrong return values for a small probability.

Typical examples including quick sort, universal hashing, bloom filter and stream algorithm

Graph

LinkedList and Matrix Representation
Deep First Search
1. A directed graph has a circle if and only if its depth-first search reveals a back edge.
2. Topological Sort: In a dag, every edge leads to a vertex with a lower post number.
3. Find Strongly Connected Components: Run DFS on reversed graph and run DFS again on the highest post numbers.
Breath First Search
1. Dijkstra’s Algorithm: Runtime $|V| log |V| + |E|$
2. Bellman-Ford Algorithm: Runtime $|V||E|$. Work on negative edges. Can detect negative circle.

Greedy Algorithm

A special case in dynamic programming. There are two main ways to prove the correctness of the greedy approach: staying ahead or exchange arguments.
MST:
Kruskal’s algorithm: Union Find. Cut Property. Runtime $|E| log |E|$. Space: $|V|$
Prim algorithm: Runtime $|V| log |V| + |E|$. Space: $|V|$
Other examples:
Huffman Encoding, Horn Formula and Set Cover.

Dynamic Programming

For $x_1, x_2, …, x_n$, the subproblem is $x_1, x_2, …, x_i$.
For $x_1, x_2, …, x_n$, the subproblem is $x_i, x_{i+1}, …, x_j$.
For $x_1, x_2, …, x_n; y_1, y_2, …, y_n$, the subproblem is $x_1,…, x_i, y_1,…, y_j$.
For a tree problem, the subproblem is node $i$ and its children and grandchildren.

Linear Programming

Minmax duality; Max flow problems

Any problem can reduce to boolean combinational circuit, and the circuit value problem can reduce to a LP problem. Thus everything that is solvable can reduce to LP.

NP Problems

Some definations
1. Search Problem: Any proposed solution can be checked for correctness in polynomial time
2. P: Search problems that can be solved in polynomial time
3. NP: All search problems
4. NP hard: problems that All NP problems can reduce to
5. NP complete: problems that are both NP and NP hard
Reduction:
All NP -> CSAT -> SAT -> 3SAT -> Independent Set and 3D Matching

Local Improvement

Backtracking: Construct subproblem that can be checked quickly.
Simulated Annealing

Common Evaluation ways for Binary Classification Model

Posted on 2018-07-17 Edited on 2019-07-25 In CS , Data

Confusion Matrix and Table of Confusion

Confusion matrix is a specific table layout that allows visualization of the performance of an algorithm, typically a supervised learning one. Here is a basic example:

Here is a prediction result from a classification model.

prediction	1	2	3	1	1	1	3
true	1	2	2	2	3	2	1

Use confusion matrix to interpret it and we get:

	1	2	3
1	1	2	1
2	0	1	0
3	1	0	0
Prediction

To interpret this matrix, take look at the forth row. This mean there’s one time that we predict 1 into 3. Then we introduction a defination called table of confusion. It’s a table with two rows and two columns that reports the number of false positives, false neagatives, true positives and true negatives. Take look at our next example.

	actual 1	actual other than 1
predict 1	1(true positive)	3(false positive)
predict other than 1	1(true negative)	2(false negative)

So far everything is still simple. We introduce four new concepts to evaluate the performance of our model.

Although all these concepts and formula are still not so hard to understand, when and which concept we should use becomes complicated. I will discuss the main differences, advantages and disadvantages of each measurement, and how to combine them together to evaluate model performance.

F1 Score

(Positive Predictive Value) Precision represents “how many selected items are relevant”, and (True Positive Rate) Recall represents “how many relevant items are selected”. The former observes TP number in chosen dataset and the latter observes TP in original dataset. We usually use the harmonic average of both to measure test’s accuracy (F1 score).

$$F1 = \frac{2 \times recall \times precision}{recall + precision}$$

The reason we use harmonic average is that the closer precision and recall are, the more reliable our measurement is. We can relate this formula to dice similarity coefficient, which is used for comparing the similarity of two samples.

$$DSC = \frac{2 |X \cap Y|}{|X| + |Y|}$$

The main drawback of using F-score is listed below:

It focus on only one class. It reveals the discriminative power of each feature independently from others, but it does not indicate anything on the combination of both features.
It’s biased to the majority class. For example if there are 90% male candidates and 10% female candidates and our model can simply predict all candidates are male, which will give us 100% recall and 90% precision.
It doesn’t take into account the True Negative. TN can change arbitrarily without changing F-measure.

ROC, AUC and Informedness

A ROC space depicts relative trade-offs between true positive and false positive. In binary classification , we predict a score for each instance. Given a threshold parameter T, the instance is classified as positive if X > T and negative otherwise. Usually as threshold gets looser, TPR increases and FPR increases as well. Therefore, a ROC space is defined by FPR and TPR as x and y axes at various threshold settigs. We usually use AUC(Area Under the Curve) to summarize the ROC curve.

$$AUC = \int_{-\infty}^{\infty} TPR(T) FPR’(T) dT$$

AUC can be interpret as the probability that the classifier will assign a higher score to a randomly chosen positive example than to a negative example. The main drawback of using AUC is listed below:

It doesn’t work well with small dataset, since it generlizes a step function into a continuous curve.
It ignores the fact that the curve is about the tradeoffs between the different systems.
AUC doesn’t tell you the costs of different kinds of errors. It will treat a 10000 dollar fraud the same as a 10 dollar fraud.
It does not account for false negative cases.
It’s also biased to the total amount of positive and negative cases.

Accuracy

$$Accuracy = \frac{TP + TN}{TP + FP + TN + FN}$$

Accuracy is one of the most common used stastics data. The formula is pretty clear so let’s go straight to its drawback. The main disadvantage is that iff TN is large, the influence of TP becomes minor.

Matthew Correlation Coefficient

$$MCC =
\frac{\sum_{n} (P_n - \overline{R}) (S_n - \overline{S})}{\sqrt{\sum_n (P_n - \overline{P})^2 \sum_n (P_n - \overline{P})^2}}
$$

where $\overline{P}$ represents the mean of prediction, $P_n$ represents the nth prediction, $\overline{S}$ represents the mean of observation and $S_n$ represents the nth obeservation.

The fomula above can be transformed to following one, which is what we commonly see.

$$MCC = \frac{TP \times TN - FP \times FN}{TP + TN + FP + FN}$$

Matthew Correlation Coefficient use the correlation between the prediction and observation to evaluate the model performance. It’s so far seen as one of the best measurements for binary classification, since it takes into account the balance ratios of the four confusion matrix categories.

Conclusion

Generalize the confusion matrix trades off some features in data. Under different circumstances, we evaluate four confusion matrix categories differently. Each measurement pays attention to some particular aspects in dataset. Therefore, it’s important to know when and how to use these tools to get a comprehensive understanding about our model.

Reference:

乔为 William

There are some bugs in the navigate bar currently. Try to use browser "back" button when encounting blank pages.

GitHub Email Zhihu Linkedin Facebook