## Introduction

In Lab 1 “Rock Climber”, you learned how to climb rocks (well, sort of). You got overly excited and couldn’t wait to tell your friends about your newly acquired skill on Twitter. Obviously, you needed an excellent hashtag for your tweet. No problem! We will find you one.

In this assignment, you will be exposed to several array searching and sorting concepts and implementations, and asymptotic analyses in general. Work individually. Follow the guidelines in the syllabus for any discussions with others. You may work in pairs on extra credit contests, but then the extra credit will be divided among the two of you. Extra credit entries must be bug free to be eligible and turned in on time (no late days).

## Files

After downloading the assignment tarball from the course website, extract the files by running:

```
tar -xvf lab2-handout.tgz
```

from a terminal window. Some of the files worth looking at are listed below.

Files you won’t modify:

```
Makefile
lib/*.h
lib/*.c
lib/words
```

Files you may modify (and the files denoted by * will be submitted for grading):

```
* duplicates.c
* trending.c
* validity.c
duplicates_test.c
trending_test.c
validity_test.c
```

Additionally, you should create a file called:

```
written.pdf
```

which contains the answers to the written parts of the assignment. For your convenience, we’ve provided a LATEX template (written.tex).

## Submission

To submit your programming assignment, first generate a handin archive lab2-handin.tgz with the command

```
make package
```

then submit your lab2-handin.tgz file to your Subversion repository (svn) for the class. Once you’ve completed all problems, you should also submit your written.pdf to Gradescope.

Note that your code will not be graded until after the due date. It is your responsibility to test your code and proofread your writeup thoroughly before submitting.

## #Hashtag

What is a hashtag you ask? Well, it is “a word or phrase preceded by the symbol # that classifies or categorizes the accompanying text (such as a tweet)”. For the purpose of this lab, we will ignore the preceding “#” sign, because they appear in front of every hashtag. Furthermore, throughout the lab, we will assume that the hashtags are case-insensitive. For example, #OhWhatABeautifulMorning is represented as the string “ohwhatabeautifulmorning”, or more precisely, a char array in C, of course.

## #WhatHashtagsHaveIUsed

In this part of the assignment, given a sorted array of hashtags (i.e. an array of strings sorted in alphabetical order) that you have used since your first tweet, you will return a new sorted array of distinct hashtags. In other words, the new array should contain the same strings as the original array, but all strings are now distinct.

### Task 1

In duplicates.c, implement the function

1 | int number_distinct(char **hashtags, int n) |

that counts the number of distinct strings among the first n elements of the sorted array hashtags.

For example, for the following array

1 | animals = {"bobcat", "bobcat", "chipmunk", "eagle", "eagle", "eagle", "tortoise"} |

`number_distinct(animals, 7)`

should evaluate to 4, for there are four kinds of animal in the array, namely bobcat, chipmunk, eagle and tortoise. Your solution should have O(n), a.k.a. linear, asymptotic running time.

You may use the C string library. In particular, you may find the following functions useful in string.h:

1 | // Comparing string |

### Task 2

In duplicates.c, implement the function

1 | int remove_duplicates(char **hashtags, char **result, int n) |

which stores a new sorted array in result with all duplicates in the sorted subarray of hashtags removed, and returns the number of distinct strings in your result. Similar to `number_distinct`

, we only consider the first n elements in the array. So calling `remove_duplicates(animals, result, 5)`

should return 3 and store `result = {"bobcat", "chipmunk", "eagle"}`

. Again, your solution should have O(n) asymptotic running time.

To test your code for the above functions, you can simply add your own test cases to the file duplicates_test.c, and run the following command:

```
>> make duplicates
```

## #WhatsTrendingNow

Now you want to see what are the trending hashtags on the Internet. In this part, you will implement some functions that analyze the frequencies of hashtags of an input feed. The expected hashtags will be represented by a sorted array of strings hashtags of length n (i.e. just like before, we only consider the first n elements of the array), and we will maintain another array of integers frequencies, where each element frequencies[i] is the number of times we have seen hashtags[i] in the feed so far, where i ∈ [0, n).

### Task 1

In trending.c, implement the function

1 | int count_frequencies(char **hashtags, int *frequencies, int n, char **feed, int feedlength, bool fast) |

which should evaluate to the number of occurrences of strings in feed that do not appear in the array hashtags, and should update the frequency counts in frequencies with the number of times each string in feed appears. Note that feedlength is the size of the subarray we are considering. So for example, when it is given

1 | hashtags = {"bobcat", "chipmunk", "eagle", "tortoise"} |

and the corresponding frequencies

1 | frequencies = {2,1,3,1} |

then the input feed

1 | feed = {"tortoise", "tortoise", "Hare", "Hare", "Hare"} |

would cause the frequency count for “tortoise” to be incremented by 2, and would cause 3 to be returned, because “Hare” was not in hashtags. So frequencies now becomes {2,1,3,3}.

Note that a precondition of `count_frequencies`

is that the hashtags must be sorted, a fact you should exploit. Your function should use the linear search algorithm when fast is set to false and it should use the binary search algorithm when fast is true. You can implement this choice with a simple if statement that decides which function to call duplicating a lot of code is unnecessary and unhelpful. Feel free to use two library functions we implemented for you in lib/array_util.c:

1 | // Linear search |

both of which will return the index where string s appearred in subarray A of length n, and return -1 if s did not appear in the subarray.

### Task 2

In trending.c, implement the function

1 | void top_three(char **hashtags, int *frequencies, char **result, int n) |

which should find the three most common hashtags among the subarray hashtags of length n, according to their counts in frequencies, and store them in result by the order of most frequent to least frequent, i.e. result[0] is the most common hashtag, result[1] is the second common one, and result[2] is the third.

You implementation should have O(n) asymptotic running time, and furthermore, it should loop over the array frequencies only once.

Just like the previous tasks, you can test your code by adding your own test cases to the file trending_test.c, and then running:

```
make trending
```

## #IsItValidThough

As an experienced Twitter user, you can recognize a hashtag in a jiffy, but your friend, who is unfamiliar with Twitter, doesn’t know how to insert spaces between the words in your hashtag to interpret it appropriately. You write down “#imadventurous”, which obviously means “I’m adventurous”. He first tries to scan through until he first sees a whole word, insert a space, and then continue scanning. Depending on his vocabulary, he might see “I mad vent” and then get stuck because there is no such word as “urous”.

Your friend is desperate to find out what exactly your hashtag means, so you need to give him an algorithm for determining whether it is possible to insert spaces into a spaceless string to produce a sequence of valid English words.

### Task 1

In validity.c, implement the function

```
bool is_valid_hashtag(char *hashtag)
```

which should evaluate to true if hashtag can be split into a sequence of valid English words and false otherwise. Your implementation should be recursive. You may find the library functions in lib/string_util.c useful:

1 | // Look up dictionary |

where `is_word`

looks the word s up in the dictionary (in lib/words) and returns a Boolean indicating whether it is a valid word. is_word is case-insensitive; substring returns a pointer to the resulting string dest of the input string src starting at index i and of length len.

### Task 2

With a little twist to the algorithm, you can in fact achieve O(n^2) asymptotic running time for the above problem. First person/pair who submit a working solution.

Again, you can test your code by adding your own test cases to the file validity_test.c, and then running:

```
>> make validity
```