Python Assignment Help: CS4117 Music Artist Lyrics Model

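An NLP assignment on classifying and identifying music artists. The assignment provides a framework and accompanying documentation; just follow the requirements step by step.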

Core Description

For the core, you will implement a program that creates a model of a music artist’s lyrics. This model receives lyric data as input and ultimately generates new lyrics in the style of that artist. To do this, you will leverage an NLP concept called an n-gram and use an NLP technique called language modeling.
Your understanding of the linked concepts and definitions is crucial to your success, so make sure to understand n-grams, language modeling, Python dictionaries as taught in the warmup, and classes and inheritance in Python before attempting to implement the core.
The core does not require you to include any external libraries beyond what has already been included for you. Use of any other external libraries is prohibited on this part of the project.

Core Structure

In the language-models/ folder, you will find four files which contain class definitions: nGramModel.py, unigramModel.py, bigramModel.py, and trigramModel.py. You must complete the prepData, weightedChoice, and getNextToken functions in nGramModel.py. You must also complete the trainModel, trainingDataHasNGram, and getCandidateDictionary functions in each of the other three files.
In the root CreativeAI repository, there is a file called generate.py, which will be the driver for generating both lyrics and music. For the core, you will implement the trainLyricsModels, selectNGramModel, generateSentence, and runLyricsGenerator functions; these functions will be called, directly or indirectly, by main, which is written for you.
We recommend that you implement the functions in the order they are listed in the spec; start with prepData and work your way down to runLyricsGenerator.

Getting New Lyrics (Optional)

If your group chooses to use lyrics from an artist other than the Beatles, you can use the web scraper we have written to get the lyrics of the new artist and save them in the data/lyrics directory for you. A web scraper is a program that gets information from web pages; ours lives in the data/scrapers directory.
If you navigate to the data/scrapers folder and run the lyricsWikiaScraper.py file, you will be prompted to input the name of an artist. If that artist is found on lyrics.wikia.com, the program will make a folder in the data/lyrics directory for that artist, and save each of the artist’s songs as a .txt file in that folder.

Explanation of Functions to Implement

prepData

The purpose of this function is to take input data in the form of a list of lists, and return a copy of that list with symbols added to both ends of each inner list.
For the core, these inner lists will be sentences, which are represented as lists of strings. The symbols added to the beginning of each sentence will be ^::^ followed by ^:::^, and the symbol added to the end of each sentence will be $:::$. These are arbitrary symbols, but make sure to use them exactly and in the correct order.
For example, if the function is passed this list of lists:

[ ['hey', 'jude'], ['yellow', 'submarine'] ]

Then it would return a new list that looks like this:

[ ['^::^', '^:::^', 'hey', 'jude', '$:::$'], ['^::^', '^:::^', 'yellow', 'submarine', '$:::$'] ]

The purpose of adding two symbols at the beginning of each sentence is so that you can look at a trigram containing only the first English word of that sentence. This captures information about which words are most likely to begin a sentence; without these symbols, you would not be able to use the trigram model at the beginning of sentences because there would be no trigrams to look at until the third word.
The purpose of adding a symbol to the end of each sentence is to be able to generate sentence endings. If you ever see $:::$ while generating a sentence in the generateSentence function, you know the sentence is complete.
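
As a minimal sketch of the behavior described above, the following standalone function wraps each sentence in the three symbols. The name prep_data and the free-function form are illustrative assumptions; in the project, prepData is a method in nGramModel.py and its exact signature may differ.

```python
def prep_data(text):
    """Return a copy of text with start and end symbols around each sentence.

    Hypothetical standalone version of prepData; text is a list of
    sentences, and each sentence is a list of strings.
    """
    # Build new inner lists so the caller's data is left unchanged.
    return [['^::^', '^:::^'] + sentence + ['$:::$'] for sentence in text]


lyrics = [['hey', 'jude'], ['yellow', 'submarine']]
prepped = prep_data(lyrics)
print(prepped)
```

Because list concatenation creates new lists, the original lyrics list is not modified.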

trainModel

This function trains the NGramModel child classes on the input data by building their dictionary of n-grams and respective counts, self.nGramCounts. Note that the special starting and ending symbols also count as words for all NGramModels, which is why you should use the return value of prepData before you create the self.nGramCounts dictionary for each language model.

  • For the unigram model, self.nGramCounts will be a one-dimensional dictionary of {unigram: unigramCount} pairs, where each unique unigram is somewhere in the input data, and unigramCount is the number of times the model saw that particular unigram appear in the data. The unigram model should not consider the special symbols ‘^::^’ and ‘^:::^’ as words, but it should consider the ending symbol $:::$ as a word. The bigram and trigram models should consider all special symbols as words.
  • For the bigram model, the dictionary will be two-dimensional. It will be structured as {unigramOne: {unigramTwo: bigramCount}}, where bigramCount is the count of how many times this model has seen unigramOne + unigramTwo appear as a bigram in the input data. For example, if the only song you were looking at was Strawberry Fields Forever, part of the BigramModel’s self.nGramCounts dictionary would look like this:
self.nGramCounts = {
'strawberry' : {'fields' : 10},
'fields' : {'forever' : 6, '$:::$' : 4}
}
  • For the trigram model, the dictionary will be three-dimensional. It will be structured as {unigramOne: {unigramTwo: {unigramThree: trigramCount}}}, where trigramCount is the count of how many times this model has seen unigramOne + unigramTwo + unigramThree appear as a trigram in the input data.
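
The three dictionary shapes above can be sketched with plain functions. This is only one possible approach: the function names are hypothetical, the input is assumed to be the output of prepData, and in the project the result would be stored in self.nGramCounts rather than returned.

```python
def count_unigrams(prepped):
    """Count single words, skipping the two starting symbols."""
    counts = {}
    for sentence in prepped:
        for word in sentence:
            if word in ('^::^', '^:::^'):
                continue  # the unigram model ignores the starting symbols
            counts[word] = counts.get(word, 0) + 1
    return counts


def count_bigrams(prepped):
    """Build {unigramOne: {unigramTwo: bigramCount}}."""
    counts = {}
    for sentence in prepped:
        # zip pairs each word with the word that follows it
        for first, second in zip(sentence, sentence[1:]):
            inner = counts.setdefault(first, {})
            inner[second] = inner.get(second, 0) + 1
    return counts


def count_trigrams(prepped):
    """Build {unigramOne: {unigramTwo: {unigramThree: trigramCount}}}."""
    counts = {}
    for sentence in prepped:
        for first, second, third in zip(sentence, sentence[1:], sentence[2:]):
            inner = counts.setdefault(first, {}).setdefault(second, {})
            inner[third] = inner.get(third, 0) + 1
    return counts


# Two made-up sentences that have already been through prepData.
prepped = [
    ['^::^', '^:::^', 'strawberry', 'fields', 'forever', '$:::$'],
    ['^::^', '^:::^', 'strawberry', 'fields', '$:::$'],
]
unigrams = count_unigrams(prepped)
bigrams = count_bigrams(prepped)
trigrams = count_trigrams(prepped)
print(bigrams['fields'])
```

Each function uses two levels of nested for loops, one over sentences and one over positions within a sentence.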

trainingDataHasNGram

This function takes a sentence, which is a list of strings, and returns True if a particular language model can be used to determine the next token to add to that sentence, given the n-grams that the language model has stored in self.nGramCounts. If not, it returns False.

  • For the unigram model, this function returns True if the unigram model knows about any words at all: in other words, its self.nGramCounts dictionary is not empty. This is because a unigram model only looks at the frequency of single words, regardless of their context.
  • For the bigram model, this function returns True if the model has seen the last word in the current sentence at the start of a bigram. Hint: which “dimension” of the bigram model’s self.nGramCounts would contain this information?
  • For the trigram model, this function returns True if the model has seen the second-to-last and last words in the current sentence at the start of a trigram, in that order.
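
The three checks above could look roughly like this. The functions are hypothetical standalone versions that take the counts dictionary as a parameter instead of reading self.nGramCounts, and they assume the sentence already starts with the two starting symbols, so it always has at least two elements.

```python
def unigram_has_ngram(counts, sentence):
    """True if the unigram model has seen any word at all."""
    return len(counts) > 0


def bigram_has_ngram(counts, sentence):
    """True if the last word is a first-level key of the bigram counts."""
    return sentence[-1] in counts


def trigram_has_ngram(counts, sentence):
    """True if the last two words, in order, begin a known trigram."""
    second_to_last, last = sentence[-2], sentence[-1]
    return second_to_last in counts and last in counts[second_to_last]


bigram_counts = {'strawberry': {'fields': 10},
                 'fields': {'forever': 6, '$:::$': 4}}
trigram_counts = {'strawberry': {'fields': {'forever': 6, '$:::$': 4}}}
sentence = ['^::^', '^:::^', 'strawberry', 'fields']

print(bigram_has_ngram(bigram_counts, sentence))
print(trigram_has_ngram(trigram_counts, sentence))
```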

getCandidateDictionary

This function returns a dictionary of candidate next words to be added to the current sentence. More specifically, it returns the set of words that are legal to follow the sentence passed in, given the particular language model’s training data. So it looks at the sentence, figures out what word the model thinks can follow the last words in the sentence, and returns that set of words and counts. Note: when you write this function, you may assume that the trainingDataHasNGram function for this specific language model instance has returned True.
For each n-gram model, this function will look at the last n - 1 words in the current sentence, index into self.nGramCounts using those words, and return a dictionary of possible n-th words and their counts. For example, the unigram model is an n-gram model for which n = 1, so the unigram model looks at the previous 0 words in the sentence. Therefore, the unigram model sees every word in its training data as a candidate; in other words, the unigram model version of getCandidateDictionary should return its entire self.nGramCounts dictionary. Based on this knowledge, what dictionaries should the bigram and trigram models return?
Hint: the indexing method you use here will be syntactically very similar to what you did in trainingDataHasNGram.
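
One possible reading of the indexing described above, written as hypothetical standalone functions that receive the counts dictionary directly. They assume the matching trainingDataHasNGram check has already returned True, so the keys are known to exist.

```python
def unigram_candidates(counts, sentence):
    """n = 1: no context is used, so every known word is a candidate."""
    return counts


def bigram_candidates(counts, sentence):
    """n = 2: index with the last word of the sentence."""
    return counts[sentence[-1]]


def trigram_candidates(counts, sentence):
    """n = 3: index with the last two words, in order."""
    return counts[sentence[-2]][sentence[-1]]


bigram_counts = {'strawberry': {'fields': 10},
                 'fields': {'forever': 6, '$:::$': 4}}
trigram_counts = {'strawberry': {'fields': {'forever': 6, '$:::$': 4}}}
sentence = ['^::^', '^:::^', 'strawberry', 'fields']

print(bigram_candidates(bigram_counts, sentence))
print(trigram_candidates(trigram_counts, sentence))
```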

weightedChoice

This function takes a dictionary as input and chooses a key in that dictionary to return, using the dictionary’s values to compute probabilities of each key being chosen. It will be used in getNextToken as a helper function in choosing a next word for a randomly generated sentence.
Suppose your input dictionary was this:

{'north': 4, 'south': 1, 'east': 3, 'west': 2}

Here’s how to choose a key to return from the above dictionary. First, make two lists: one to represent all the keys in the dictionary, and one to represent all the values in the dictionary. Here’s a table where the “token” column is our list of keys, and the “count” column is our list of values. The “cumulative count” column is the running total of the counts up to and including that row.

index   token   count   cumulative count
0       north   4       4
1       south   1       5
2       east    3       8
3       west    2       10

For example, using the table above, say we generated 7 as our random number. Then we see that the word at index 2, “east”, is the first word with a cumulative count greater than 7, since the word’s cumulative count is 8. Therefore, we return the token at index 2.
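
The cumulative-count selection can be sketched as follows. The function names are hypothetical, and the range of the random number is an assumption: here it is drawn uniformly from 0 up to, but not including, the total count, so that each token is chosen with probability proportional to its count.

```python
import random


def pick_token(candidates, pick):
    """Return the first key whose cumulative count is greater than pick."""
    tokens = list(candidates.keys())
    counts = list(candidates.values())
    running_total = 0
    for token, count in zip(tokens, counts):
        running_total += count  # cumulative count for this token
        if running_total > pick:
            return token
    raise ValueError('pick must be less than the total count')


def weighted_choice(candidates):
    """Choose a key at random, weighted by its value (assumed range)."""
    total = sum(candidates.values())
    return pick_token(candidates, random.randrange(total))


directions = {'north': 4, 'south': 1, 'east': 3, 'west': 2}
print(pick_token(directions, 7))    # the worked example: 7 selects 'east'
print(weighted_choice(directions))  # random, so the output varies
```

Splitting the deterministic lookup from the random draw makes the lookup easy to test with fixed numbers. Dictionaries keep insertion order in Python 3.7 and later, which is what keeps the keys and values lists aligned with the order shown.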

getNextToken

This function should be very short. It does three things:

  1. Call getCandidateDictionary with the current sentence as an argument.
  2. Pass the return value of getCandidateDictionary to weightedChoice.
  3. Return whatever weightedChoice returns.
At a high level, the effect of doing this is getting a list of candidate next words for the current sentence, choosing a next word for the sentence based on the weights of each candidate word, and then finally returning that chosen word.

trainLyricsModels

This function takes a name of a directory where an artist’s lyrics are stored - for example, the_beatles - and loads those lyrics, using an instance of the prewritten DataLoader class, into a list called dataLoader.lyrics. It then makes a list of language models and trains those language models using the text in dataLoader.lyrics. Remember that you wrote functions to train instances of language models!
This function ultimately returns the list of trained language models.

selectNGramModel

This function takes a trained list of language models and a sentence, which is a list of strings. It first checks if the trigram model in the list can be used to pick the next word for the sentence; if so, it returns the trigram model. If not, it attempts to back off to a bigram model and returns the bigram model if possible. As a last resort, it returns the unigram model. Remember that you wrote a function that checks whether or not a particular NGramModel can be used to pick the next word for a sentence.
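
The backoff logic can be illustrated with stub objects. StubModel is a hypothetical stand-in for a trained NGramModel child class, and the sketch assumes the list is ordered from most specific to least specific (trigram, bigram, unigram); the actual order in the project may differ.

```python
class StubModel:
    """Hypothetical stand-in for a trained language model."""

    def __init__(self, name, usable):
        self.name = name
        self.usable = usable

    def trainingDataHasNGram(self, sentence):
        # A real model would inspect self.nGramCounts here.
        return self.usable


def select_ngram_model(models, sentence):
    """Return the first usable model, falling back to the last one."""
    # Assumed order: [trigram, bigram, unigram].
    for model in models[:-1]:
        if model.trainingDataHasNGram(sentence):
            return model
    return models[-1]  # the unigram model is the last resort


sentence = ['^::^', '^:::^', 'hey']
models = [StubModel('trigram', False),
          StubModel('bigram', True),
          StubModel('unigram', True)]
print(select_ngram_model(models, sentence).name)
```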

generateSentence

This function adds one word at a time to a sentence until it decides that the sentence is done. Remember that you wrote a function to select a language model to use, and another function that picks a next word for a sentence using a language model!
Determining whether a sentence is done can come about in one of two ways: either the sentenceTooLong function, which is written for you, returns True because the sentence is too long, or the next token chosen for the sentence is the special ending symbol $:::$. If either of these two events happens, the sentence is done.
This function returns a list of strings representing a sentence. The returned list should not contain any of the starting or ending symbols!
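
A rough sketch of the loop, under several assumptions: next_token is a hypothetical callable standing in for the combination of selectNGramModel and getNextToken, too_long stands in for sentenceTooLong, and the sentence is assumed to start with the two starting symbols, which are not counted toward its length.

```python
def generate_sentence(next_token, too_long, desired_length):
    """Grow a sentence until it is too long or the end symbol is chosen."""
    sentence = ['^::^', '^:::^']
    # Only real words count toward the length, hence the "- 2".
    while not too_long(desired_length, len(sentence) - 2):
        token = next_token(sentence)
        if token == '$:::$':
            break  # the model chose to end the sentence
        sentence.append(token)
    return sentence[2:]  # strip the starting symbols


def scripted_next_token(sentence):
    """Toy replacement for a language model: emits a fixed lyric."""
    words = ['let', 'it', 'be']
    position = len(sentence) - 2
    return words[position] if position < len(words) else '$:::$'


def simple_too_long(desired_length, current_length):
    """Deterministic stand-in for sentenceTooLong, used for this demo."""
    return current_length >= desired_length


print(generate_sentence(scripted_next_token, simple_too_long, 10))
print(generate_sentence(scripted_next_token, simple_too_long, 2))
```

The first call ends because the ending symbol is chosen; the second ends because the length limit is reached first.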

runLyricsGenerator

This function takes a list of trained language models and calls the functions you wrote above to generate a verse one, a verse two, and a chorus. Each verse/chorus is a list containing 4 sentences, where sentences are lists of strings. Note: when you call generateSentence in this function, you can choose what value you would like desiredLength (the goal length of the sentence) to be - play around with different values and see what gets the best output.
After you create the verseOne, verseTwo, and chorus lists, pass those lists into the printSongLyrics function, which is written for you. This will print out the song that you created in a nice song-like format.
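
The shape of the data passed to printSongLyrics can be sketched like this. build_song and make_sentence are hypothetical names; in the project, make_sentence would be a call to generateSentence with your trained models and chosen desiredLength.

```python
def build_song(make_sentence, lines_per_section=4):
    """Return (verseOne, verseTwo, chorus), each a list of sentences."""
    verse_one = [make_sentence() for _ in range(lines_per_section)]
    verse_two = [make_sentence() for _ in range(lines_per_section)]
    chorus = [make_sentence() for _ in range(lines_per_section)]
    return verse_one, verse_two, chorus


# Toy sentence generator so the sketch runs without trained models.
verse_one, verse_two, chorus = build_song(lambda: ['la', 'la', 'la'])
print(verse_one)
```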

Explanation of Given Functions

sentenceTooLong

This function takes two parameters, an integer desiredLength and an integer currentLength. desiredLength is the length that you would like your sentence to be. It would be unnatural if all sentences were the same length, so sentenceTooLong enables your sentence to end at the desired length, plus or minus a small amount. The random.gauss function generates a random number close to currentLength; if the random number is larger than desiredLength, the function returns True. Otherwise the function returns False.
Increasing the value of STDEV in this function will increase the randomness, leading to more varying sentence lengths.
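
To see how this behaves, here is a reconstruction based only on the description above; the provided function may differ in detail, and the STDEV value of 1 is an assumption. The helper fraction_too_long is hypothetical and simply estimates how often the check returns True.

```python
import random

STDEV = 1  # assumed value; the starter code defines its own


def sentence_too_long(desired_length, current_length):
    """True if a random number near current_length exceeds desired_length."""
    return random.gauss(current_length, STDEV) > desired_length


def fraction_too_long(desired_length, current_length, trials=2000):
    """Estimate how often the sentence is judged too long."""
    hits = sum(sentence_too_long(desired_length, current_length)
               for _ in range(trials))
    return hits / trials


random.seed(0)
for current in (5, 9, 10, 11, 15):
    print(current, fraction_too_long(10, current))
```

With a desired length of 10, sentences well short of 10 words are almost never cut off, sentences well past 10 almost always are, and around 10 the outcome is close to a coin flip.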

printSongLyrics

This function takes three parameters which are lists of lists of strings: verseOne, verseTwo, and chorus. It then prints out the song in this order: verse one, chorus, verse two, chorus.

getUserInput

This function takes three parameters: teamName, which should be the name of your group; lyricsSource, which should be the name of the artist that you’re generating lyrics for; and musicSource, which should be the name of the source from which you got your music data for the reach.
The function returns a user’s choice between 1 and 3, looping while the user does not input a valid choice. Choice 1 is for generating lyrics; choice 2 is for generating music; and choice 3 is to quit the program.

main

This function first trains instances of language models on the lyrics and music data by calling the trainLyricsModels and trainMusicModels functions. Then, it calls getUserInput and uses the return value of that function to either generate new lyrics by calling runLyricsGenerator, or generate a song by calling runMusicGenerator. Note that the trainMusicModels and runMusicGenerator functions don’t need to be touched for the core.
At the beginning of main there are several string variables to hold your group’s name, the name of the artist you’re using, etc. Make sure to update these values with your team’s name and your choices of data.

How to Test Your Program

At the bottom of every file that you will edit for this project, you will find this line:

if __name__ == '__main__':

Below this line, you should add any testing code that you need. In fact, we have already provided you with a few lines for testing in each file - add to these as you develop your project, so you can ensure that your individual functions work before putting it all together.
To run your tests, just run the file that contains the tests as a Python program. For example, if you’re writing tests for unigramModel.py, you can either run that file as a Python program in PyCharm, or navigate to the directory where it’s stored via the command line and type:
python unigramModel.py
Note that in generate.py, the main function is called for you - this is the driver program that will ask you to input an option for either generating lyrics, generating music, or quitting the program. You probably will not want main to run when you’re testing smaller functions in generate.py, so you can comment main out for testing purposes. Just make sure to uncomment it and test it as a whole before submitting your final product.

Tips for Speeding Up Your Program

If your program is taking a long time to load the data and train the models, it’s likely that inefficiencies in your code are slowing down your program. The most common cause of inefficiency is too many nested loops in your trainModel functions. For example, if you have 10 words, and you run through the words once for each word in the list (i.e. 10 times), that will be 100 steps total, which is not too bad. But if you have 10,000 words in the dataset, and you look at each one 10,000 times, then that will be 100,000,000 steps, which is far too slow.
Each version of the trainModel function can be written correctly with at most two levels of nested for loops, and a typical program should not take more than around 30 seconds to load. Try experimenting with different loop structures if your program is taking too long to load.
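
As an illustration of the difference, compare these two hypothetical ways of counting words. The first rescans the whole list for every word, which is the quadratic pattern described above; the second updates a dictionary in a single pass.

```python
def slow_counts(words):
    """Quadratic: list.count walks the entire list for every word."""
    counts = {}
    for word in words:
        counts[word] = words.count(word)
    return counts


def fast_counts(words):
    """Linear: one dictionary update per word."""
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    return counts


words = ['let', 'it', 'be', 'let', 'it', 'be', 'let', 'it', 'be']
print(fast_counts(words))
```

Both functions return the same dictionary, but only the second scales well to tens of thousands of words.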

How to Run Your Program to Generate Lyrics

If you are using PyCharm, open generate.py and click “Run…” in the top navigation bar. If you are working from the command line, navigate to the root directory where your CreativeAI project is stored and type:
python generate.py
Even if you have not implemented any of the functions in the project, the starter code should work out of the box. Therefore, you can play around with it and get a feel for how the driver in main works.