This is the second post in a 3-part series. It covers Example 48, which is directed to analyzing speech signals and separating desired speech from extraneous or background speech using AI.
Example 48. Speech Separation
Key Takeaway for Claim 1: Example 48, Claim 1 could be interpreted as merely receiving spoken audio and deriving/calculating data using a mathematical formula. The disclosure beneficially describes a technical problem and solution in the form of a particular speech-separation technique, but that improvement is not reflected in the claim, rendering the claim ineligible.
Claim 1. A speech separation method comprising:
(a) receiving a mixed speech signal x comprising speech from multiple different sources sn, where n ∈ {1, . . . N};
(b) converting the mixed speech signal x into a spectrogram in a time-frequency domain using a short time Fourier transform and obtaining feature representation X, wherein X corresponds to the spectrogram of the mixed speech signal x and temporal features extracted from the mixed speech signal x; and
(c) using a deep neural network (DNN) to determine embedding vectors V using the formula V = fθ(X), where fθ(X) is a global function of the mixed speech signal x.
Background for Claim 1:
Traditional computer-based speech separation techniques perform well in distinguishing and separating different classes of audio (e.g., human speech and background noise), but perform poorly in separating audio from sources belonging to the same class (e.g., speech from different speakers). The improved system receives as input a mixed speech signal x from an audio recording device (e.g., a microphone) recording an event. The system uses a deep neural network (DNN) to promote separation of the features during clustering by learning high-level feature representations of the signal x and mapping the feature representations to the embedding space. Each cluster represents a distinct speech source, thereby separating speech signals of different sources sn, where n ∈ {1, . . . N}, identified in the mixed speech signal. The DNN converts feature representations Xt, obtained from the spectrograms St and the corresponding feature matrices FMt, into multi-dimensional embedding vectors V. These embedding vectors V are assigned to the TF bins as a global function of the input signal (V = fθ(X), where fθ represents the function of the DNN).
Because the DNN assigns the embedding vectors V as a function of the entire input signal, the embedding vectors V take into account the global properties of the input signal. This allows the k distinct groups to correspond to the N sources identified in the mixed speech signal, thereby providing superior speech separation. This feature of the invention is an improvement over traditional speech separation methods because it allows for blind speech separation (i.e., the system is not required to have prior knowledge of the number of speakers and does not need to be trained on speech from the different constituent sources of the mixed audio signal). As a result, the DNN can be trained with mixed speech signals comprising fewer speakers and still be used to separate speech signals from a larger number of sources.
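To make the claim language concrete, the following is a minimal Python sketch of steps (a) through (c), assuming a synthetic two-second signal in place of a microphone capture. The STFT call is from SciPy; because neither the claim nor the disclosure specifies a DNN architecture, fθ is stood in for by an untrained random-weight network that emits one embedding vector per TF bin, so the output is illustrative only.

```python
# Hypothetical sketch of Example 48, Claim 1, steps (a)-(c).
import numpy as np
from scipy.signal import stft

rng = np.random.default_rng(0)

# (a) "Receive" a mixed speech signal x. A synthetic signal stands in for
# a microphone capture; the sample rate is an assumption.
fs = 16_000
x = rng.standard_normal(2 * fs)

# (b) Convert x into a time-frequency spectrogram via a short-time Fourier
# transform; the magnitude spectrogram plays the role of the feature
# representation X (time frames x frequency bins).
_, _, Zxx = stft(x, fs=fs, nperseg=512)
X = np.abs(Zxx).T

# (c) Determine embedding vectors V = f_theta(X). The example recites no
# particular architecture, so an untrained two-layer network with random
# weights is used purely as a placeholder for the DNN f_theta. It emits
# one embed_dim-dimensional vector per TF bin, mirroring the disclosure.
def f_theta(X: np.ndarray, embed_dim: int = 20) -> np.ndarray:
    n_frames, n_bins = X.shape
    w1 = rng.standard_normal((n_bins, 64)) * 0.1
    w2 = rng.standard_normal((64, n_bins * embed_dim)) * 0.1
    hidden = np.tanh(X @ w1)          # (frames, 64)
    out = hidden @ w2                 # (frames, bins * embed_dim)
    return out.reshape(-1, embed_dim)  # (frames * bins, embed_dim)

V = f_theta(X)
```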
SME Holding for Claim 1:
Claim 1 is ineligible.
Step 1: Under MPEP § 2106.03, the analysis determines whether the claim falls within any statutory category, including processes, machines, manufactures, and compositions of matter. Here, the claim recites receiving a mixed speech signal, converting the mixed speech signal, and using a DNN to determine embedding vectors, and thus claim 1 is a process. (Step 1: YES).
Step 2A, Prong One: Under MPEP § 2106.04(II), the claim “recites” a judicial exception when the judicial exception is “set forth” or “described” in the claim. Here, Claim Element (b) recites a mathematical operation that converts a signal from one domain to another using a specific transform function, and Claim Element (c) recites another mathematical calculation. Claim Element (a) is recited at a high level of generality, and because acquiring data is necessary to perform a mathematical calculation (i.e., the abstract idea), element (a) is considered insignificant extra-solution activity. Claim Elements (b) and (c) recite abstract ideas. (Step 2A Prong One: YES).
Step 2A, Prong Two: Under MPEP § 2106.04(d), the analysis determines whether the claim as a whole integrates the recited judicial exception into a practical application of the exception or whether the claim is “directed to” the judicial exception. Here, Claim Element (c) “applies” the abstract idea in a sense because the claim recites “using a deep neural network” to make a determination. However, this does not rise to the level of a practical application because the claim fails to recite a particular DNN or how the DNN operates to derive the embedding vectors, and because the claim omits any details as to how the DNN solves a technical problem, instead reciting only the idea of a solution or outcome. Moreover, the claim merely recites the use of a generic DNN as a tool for performing a mathematical calculation, and does not reflect the technical improvement discussed in the disclosure, thus failing to provide an “improvement” to the technology or technical field. (Step 2A Prong Two: NO).
Step 2B: Under MPEP § 2106.05, the analysis evaluates whether the claim as a whole amounts to “significantly more” than the recited exception, i.e., whether any additional element, or combination of additional elements, adds an inventive concept to the claim. As a practice tip, a claim that does not provide a “practical application” of the abstract idea generally cannot provide “significantly more” than the abstract idea. Indeed, most of the claim features were found to be “insignificant extra-solution activity” at Step 2A, Prong Two; thus, the analysis for eligibility also fails at Step 2B. (Step 2B: NO).
Applicant’s Sample Response to a Section 101 Rejection of Claim 1:
Respond to the rejection with claim amendments that more closely align the claim with Example 48, Claim 2 or Example 48, Claim 3. Also, when initially drafting the application, ensure that the technical improvement of the invention is described in the specification and explicitly recited in the claims.
Key Takeaway for Claim 2: Example 48, Claim 2 reflects the technical improvement by adding steps to claim 1 that create a new speech signal that no longer contains extraneous speech signals from unwanted sources. Additionally, Claim 2 meaningfully limits the judicial exception to synthesizing speech waveforms and providing a mixed speech signal without undesired audio. This is in contrast to Claim 1, which merely describes high-level use of mathematical functions. Claim 2 is eligible under Step 2A, Prong Two.
Claim 2. The speech separation method of claim 1 further comprising:
(d) partitioning the embedding vectors V into clusters corresponding to the different sources sn;
(e) applying binary masks to the clusters to create masked clusters;
(f) synthesizing speech waveforms from the masked clusters, wherein each speech waveform corresponds to a different source sn;
(g) combining the speech waveforms to generate a mixed speech signal x’ by stitching together the speech waveforms corresponding to the different sources sn, excluding the speech waveform from a target source ss such that the mixed speech signal x’ includes speech waveforms from the different sources sn and excludes the speech waveform from the target source ss; and
(h) transmitting the mixed speech signal x’ for storage to a remote location.
Background for Claim 2:
See claim 1. Additionally, the improved DNN learns the high-level feature representations of the input mixed speech signal x by converting these feature representations Xt, obtained from the spectrograms St and the corresponding feature matrices FMt, into multi-dimensional embedding vectors V. These embedding vectors V are assigned to the TF bins as a global function of the input signal (V = fθ(X), where fθ represents the function of the DNN). The DNN assigns embedding vectors V to each TF region such that the Euclidean distances between embedding vectors of TF bins dominated by the same source are minimized, and the Euclidean distances between embedding vectors of TF bins dominated by different sources are maximized. Thus, embedding vectors V for all TF bins, representing the different sources, are calculated. Next, clustering is performed using a k-means clustering algorithm to separate the different speech sources sn of the mixed signal. Embedding vectors V are clustered into k distinct groups, with each group representing a distinct speech source from sn. The clustering algorithm arbitrarily chooses k initial centers C. Then, until the algorithm converges, embedding vectors V are assigned to their closest cluster center, and each center is moved to the mean of its currently assigned subset. By the end of this process, the embedding vectors V are partitioned into clusters corresponding to the different constituent sources sn.
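The clustering loop just described maps directly onto a textbook k-means routine. Below is a minimal NumPy sketch of that loop, continuing from the earlier snippet's embedding matrix V; the choice of k, the iteration cap, and the convergence tolerance are illustrative assumptions, not values from the example.

```python
import numpy as np

def kmeans(V: np.ndarray, k: int, max_iter: int = 100, tol: float = 1e-6):
    """Partition embedding vectors V (one row per TF bin) into k clusters,
    following the loop described in the disclosure."""
    rng = np.random.default_rng(0)
    # Arbitrarily choose k initial centers C from among the vectors.
    centers = V[rng.choice(len(V), size=k, replace=False)]
    for _ in range(max_iter):
        # Assign each embedding vector to its closest cluster center
        # (Euclidean distance).
        dists = np.linalg.norm(V[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to the mean of its currently assigned subset
        # (keep the old center if a cluster happens to be empty).
        new_centers = np.array([
            V[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.linalg.norm(new_centers - centers) < tol:  # converged
            break
        centers = new_centers
    return labels, centers

# labels[i] indicates which source s_n dominates TF bin i; k = 2 speakers
# is an assumption for illustration.
labels, centers = kmeans(V, k=2)
```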
SME Holding for Claim 2:
Claim 2 is eligible.
Step 1: Under MPEP § 2106.03, the analysis determines whether the claim falls within any statutory category, including processes, machines, manufactures, and compositions of matter. Here, dependent claim 2 adds to the process recited in claim 1; thus, claim 2 is also a process. (Step 1: YES).
Step 2A, Prong One: Under MPEP § 2106.04(II), the claim “recites” a judicial exception when the judicial exception is “set forth” or “described” in the claim. Here, Claim Elements (b) and (c) recite mathematical concepts, and Claim Element (d) places no limits on the partitioning process that would prevent it from being practically performed in the mind. Additionally, the claim merely uses the DNN as a tool to perform the otherwise mental process, without additional steps or detail. See MPEP § 2106.04(a)(2)(III)(C). (Step 2A Prong One: YES).
Step 2A, Prong Two: Under MPEP § 2106.04(d), the analysis determines whether the claim as a whole integrates the recited judicial exception into a practical application of the exception or whether the claim is “directed to” the judicial exception. Here, Claim Elements (b)-(e) recite judicial exceptions. Claim Elements (f) and (g) integrate the abstract idea into a practical application because they are directed to creating a new speech signal that no longer contains extraneous speech signals from unwanted sources. Support for integrating the abstract idea into a practical application is found in the disclosure. For example, traditional systems cannot properly distinguish different speech sources belonging to the same class. The improved system provides a particular speech-separation technique that solves the problem of separating speech from different speech sources belonging to the same class while not requiring prior knowledge of the number of speakers or speaker-specific training. The claim reflects the improvement by reciting how the DNN aids in assigning clusters that correspond to the sources identified in the mixed speech signal. These clusters are then synthesized into separate speech waveforms in the time domain and combined into a mixed speech signal that excludes audio from the undesired source. Accordingly, the claim is directed to an improvement to existing computer technology or to the technology of speech separation, and the claim integrates the abstract idea into a practical application. (Step 2A, Prong Two: YES). The claim is eligible.
Applicant’s Sample Response to a Section 101 Rejection of Claim 2:
Step 1: Under MPEP § 2106.03, the analysis determines whether the claim falls within any statutory category, including processes, machines, manufactures, and compositions of matter. Here, the claim recites a series of steps and, therefore, is a process.
Step 2A, Prong One: Under MPEP § 2106.04(II), the analysis determines whether the claim recites a judicial exception. The claim “recites” a judicial exception when the judicial exception is “set forth” or “described” in the claim. Here, the Office Action alleges that two abstract ideas are recited: mathematical concepts and mental processes. In contrast to the Office Action’s assertion, claim 2 is directed to non-abstract steps performed on a specific computing device. However, even if the Office holds that the claims fall within an abstract idea, as discussed below regarding Step 2A, Prong Two, the claim integrates the recited judicial exception into a practical application.
Step 2A, Prong Two: Under MPEP § 2106.04(d), the analysis determines whether the claim as a whole integrates the recited judicial exception into a practical application of the exception or whether the claim is “directed to” the judicial exception. Here, Claim Elements (f) and (g) integrate the alleged abstract idea into a practical application by creating a new speech signal that no longer contains extraneous speech signals from unwanted sources. As recited in the background of Applicant’s specification, traditional devices that capture audio cannot properly distinguish different speech sources belonging to the same class. Traditional systems may also require recognizing speech of a target user during a DNN training process, which does not adequately address this problem. Claim Elements (f) and (g) recite an improvement over existing speech-separation methods by reciting how the DNN aids in assigning clusters that correspond to the sources identified in the mixed speech signal, which are then synthesized into separate speech waveforms in the time domain and converted into a mixed speech signal excluding audio from the undesired source. Additionally, claim 2 meaningfully limits the judicial exception to synthesizing speech waveforms and providing a mixed speech signal without undesired audio. Thus, the claim as a whole integrates the alleged abstract idea into a practical application. The claim is eligible under Step 2A, Prong Two, and the analysis does not proceed to Step 2B.
Key Takeaway for Claim 3: By reflecting the technological improvements disclosed in the specification, and by describing how the technological improvements are achieved vis-à-vis Claim Elements (e) and (f), Claim 3 integrates the abstract idea recited in steps (b), (c), and (d) into a practical application of speech-to-text conversion, resulting in another eligible claim.
Claim 3. A non-transitory computer-readable storage medium having computer-executable instructions stored thereon, which when executed by one or more processors, cause the one or more processors to perform operations comprising:
(a) receiving a mixed speech signal x comprising speech from multiple different sources sn, where n ∈ {1, . . . N}, at a deep neural network (DNN) trained on source separation;
(b) using the DNN to convert a time-frequency representation of the mixed speech signal x into embeddings in a feature space as a function of the mixed speech signal x;
(c) clustering the embeddings using a k-means clustering algorithm;
(d) applying binary masks to the clusters to obtain masked clusters;
(e) converting the masked clusters into a time domain to obtain N separated speech signals corresponding to the different sources sn; and
(f) extracting spectral features from a target source sd of the N separated speech signals and generating a sequence of words from the spectral features to produce a transcript of the speech signal corresponding to the target source sd.
Background for Claim 3:
In addition to the background provided with claims 1 and 2, binary time-frequency masks can be used to separate the signals: a binary matrix indicates which portions of a representation should be turned on or off. In audio processing, a binary mask is a matrix of binary values, corresponding to a source, that is multiplied element-wise with a spectrogram to include or exclude portions of the audio. The binary time-frequency mask for each speaker is obtained from the cluster assignments by assigning 1 to all the TF bins corresponding to the respective speaker and 0 to the rest of the TF bins. An inverse STFT then converts the separated signals into the time domain.
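Continuing the earlier sketches (the complex spectrogram Zxx, the cluster labels, and the sample rate fs carry over), the following shows, in simplified form, how binary masks built from the cluster assignments are applied and inverted back to the time domain with SciPy's inverse STFT.

```python
import numpy as np
from scipy.signal import istft

# Reshape the per-TF-bin labels back onto the spectrogram grid. Zxx is
# (freq bins, time frames), so the (frames, bins) grid is transposed.
n_bins, n_frames = Zxx.shape
label_grid = labels.reshape(n_frames, n_bins).T

separated = []
for speaker in np.unique(labels):
    # Binary TF mask: 1 for bins assigned to this speaker, 0 elsewhere.
    mask = (label_grid == speaker).astype(float)
    # Element-wise multiplication keeps only this speaker's TF bins; the
    # inverse STFT then converts the masked spectrogram to the time domain.
    _, x_sep = istft(Zxx * mask, fs=fs, nperseg=512)
    separated.append(x_sep)
```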
One application of this process is real-time speech transcription or transcription of recorded audio. For example, a user may request a transcript of a desired source signal sd within the mixed speech signal x during playback of recorded audio, using a graphical user interface (GUI). After an inverse STFT step, the speech signal from only the desired source sd is transmitted to a speech-to-text system. The ASR or speech-to-text system extracts spectral features from the desired source sd using conventional means and generates a sequence of words, which is then converted into a text transcript.
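For completeness, here is a sketch of the final transcription stage corresponding to Claim 3, elements (e) and (f). The spectral features are a simple log-magnitude choice (the disclosure says only "conventional means"), and speech_to_text is a hypothetical stub standing in for a conventional ASR system.

```python
import numpy as np
from scipy.signal import stft

def spectral_features(x_sep: np.ndarray, fs: int) -> np.ndarray:
    # Log-magnitude spectral features from the separated signal; this
    # specific feature choice is an assumption.
    _, _, Z = stft(x_sep, fs=fs, nperseg=512)
    return np.log1p(np.abs(Z)).T

def speech_to_text(features: np.ndarray) -> str:
    # Hypothetical placeholder for the conventional ASR system that would
    # map the feature sequence to a sequence of words.
    return "<transcript of the desired source>"

# Transcribe only the desired source s_d; which of the separated signals
# is the desired one is an assumption here.
x_d = separated[0]
transcript = speech_to_text(spectral_features(x_d, fs=fs))
```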
SME Holding for Claim 3:
Claim 3 is eligible.
Step 1: Under MPEP § 2106.03, the analysis determines whether the claim falls within any statutory category, including processes, machines, manufactures, and compositions of matter. Here, the claim recites a non-transitory computer-readable medium with instructions that cause one or more processors to perform a series of steps. The disclosure recites random-access memory, flash memory, and magnetic/optical storage as examples of a non-transitory computer-readable storage medium. Thus, the claim covers only statutory embodiments of a computer-readable medium in light of the disclosure, and not a transitory signal. Accordingly, claim 3 is directed to a “manufacture.” (Step 1: YES).
Step 2A, Prong One: Under MPEP § 2106.04(II), the analysis determines whether the claim recites a judicial exception. The claim “recites” a judicial exception when the judicial exception is “set forth” or “described” in the claim. Here, Claim Elements (a)-(d) recite mathematical calculations. Claim Elements (e) and (f) do not recite a judicial exception because they cannot practically be performed in the human mind and do not recite mathematical formulas, calculations, or relationships. However, because Claim Elements (a)-(d) recite mathematical calculations, which are abstract ideas, the analysis proceeds to Prong Two. (Step 2A, Prong One: YES).
Step 2A, Prong Two: Under MPEP § 2106.04(d), the analysis determines whether the claim as a whole integrates the recited judicial exception into a practical application of the exception or whether the claim is “directed to” the judicial exception. Here, Claim Element (a) is recited at a high level of generality; because acquiring data is necessary to perform a mathematical calculation (i.e., the abstract idea), element (a) is considered insignificant extra-solution activity. Claim Element (b) generally “applies” the abstract idea. Claim Elements (e) and (f) recite additional features that integrate the abstract idea recited in (b), (c), and (d) into a practical application of speech-to-text conversion. For example, as recited in the background, existing systems poorly distinguish conversations between individuals of interest from unwanted utterances due to their inability to distinguish different speech sources belonging to the same class, resulting in poor-quality transcriptions of the recorded speech. The disclosure states that this invention offers an improvement over existing speech-separation methods by providing a particular speech-separation technique that solves the problem of separating speech from different speech sources belonging to the same class, while also performing well in the presence of inter-speaker variability within the same audio class for transcription. The disclosure states that the invention derives embeddings by the DNN based on the global properties of the input signal, which is an improvement over prior art speech separation methods. In addition, the invention uses both temporal and spatial features of the speech signal; this feature helps a downstream conventional speech-to-text system reduce the gap in transcription performance for accented speakers over traditional speech-to-text methods. Claim Elements (e) and (f) reflect these technical improvements by reciting details of how the DNN aids in assigning clusters that correspond to the sources identified in the mixed speech signal, which are then converted into separate speech signals in the time domain. This process generates a sequence of words from the spectral features, thereby making individual transcription of each separated speech signal possible. Thus, the claim as a whole integrates the judicial exception into a practical application such that the claim is not directed to the judicial exception (Step 2A, Prong Two: YES). The claim is eligible.
Applicant’s Sample Response to a Section 101 Rejection of Claim 3:
Step 1: Under MPEP § 2106.03, the analysis determines whether the claim falls within any statutory category, including processes, machines, manufactures, and compositions of matter. Here, the claim recites a non-transitory computer-readable storage medium and, therefore, is a manufacture.
Step 2A, Prong One: Under MPEP § 2106.04(II), the analysis determines whether the claim recites a judicial exception. The claim “recites” a judicial exception when the judicial exception is “set forth” or “described” in the claim. Here, the Office Action alleges that the claim recites mathematical calculations.
In contrast to the Office Action’s assertion, claim 3 is directed to non-abstract steps performed on a specific computing device. However, even if the Office holds that the claims fall within an abstract idea, as discussed below regarding Step 2A, Prong Two, the claim integrates the recited judicial exception into a practical application.
Step 2A, Prong Two: Under MPEP § 2106.04(d), the analysis determines whether the claim as a whole integrates the recited judicial exception into a practical application of the exception or whether the claim is “directed to” the judicial exception. Here, Claim Elements (e) and (f) recite additional features that integrate the alleged abstract idea into a practical application of speech-to-text conversion. The analysis of these Claim Elements requires an evaluation of the specification and the claim to ensure that a technical explanation of the asserted improvement is present in the specification and that the claim, as a whole, reflects the asserted improvement. For example, as recited in the background, existing systems poorly distinguish conversations between individuals of interest from unwanted utterances due to their inability to distinguish different speech sources belonging to the same class. Claim Elements (e) and (f) recite an improvement over existing speech-separation methods by providing a particular speech-separation technique that solves the problem of separating speech from different speech sources belonging to the same class. The disclosure states that the invention derives embeddings by the DNN based on the global properties of the input signal, which is an improvement over prior art speech separation methods. Additionally, claim 3 recites both temporal and spatial features of the speech signal to help a downstream conventional speech-to-text system reduce the gap in transcription performance for accented speakers over traditional speech-to-text methods. These steps reflect the improvement described in the background. Thus, the claim as a whole integrates the judicial exception into a practical application such that the claim is not directed to the judicial exception (Step 2A, Prong Two: YES). The claim is eligible under Step 2A, Prong Two, and the analysis does not proceed to Step 2B.
About the Authors:
Melissa Patterson focuses on the preparation and prosecution of patent applications, licensing, and litigation, particularly in innovative technologies such as software, computer and mechanical devices, AR/VR headsets, mobile communications, artificial intelligence, robotics, blockchain, insurance, healthcare records management, geophysical management, automotive technologies, and financial networks. She has successfully prosecuted over 1,000 U.S. and international patent matters, collaborating extensively with foreign counsel.
Hector Agdeppa focuses his practice on intellectual property law, particularly preparing and prosecuting patent applications. With extensive experience in electrical, mechanical, and computer software arts, as well as medical device innovation, he leverages his industry knowledge to guide clients through patent portfolio management and, when desired, monetization strategies.
Mark Catanese is a patent prosecution attorney in Dickinson Wright's San Diego office. His practice focuses on the preparation and prosecution of patent applications for a diverse clientele, ranging from startups to Fortune 500 companies. His experience spans consumer electronics, wireless communications, satellite technologies, computer hardware and software, machine learning and artificial intelligence, robotic systems, LED and LCD display technologies, augmented and virtual reality, optical systems, semiconductor devices, medical devices, and automotive systems.